Florence-2 and WD-Tagger emerge as go-to free batch captioners for character LoRA training
Practitioners training character LoRAs on 30+ images now lean on Florence-2 and WD-Tagger for batch captioning, avoiding manual tagging per image.
Florence-2 and WD-Tagger have become the dominant free auto-captioning tools for LoRA training workflows among Stable Diffusion practitioners. Both run locally, with no API costs, and batch-process sets of 30 or more images, the point at which manual captioning becomes impractical for character fine-tunes.
Florence-2, a multimodal model from Microsoft, generates natural-language descriptions of image content. Practitioners wire it into ComfyUI or standalone Python scripts to caption training sets in one pass. WD-Tagger, derived from the Waifu Diffusion project, outputs booru-style tags: comma-separated descriptors such as "1girl, blue_eyes, standing, outdoors", which align with how many anime and character models were originally trained. The choice between them depends on whether the target base model expects prose captions or tag lists.
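To make the tag-list format concrete, here is a minimal sketch of how scored tagger output might be turned into a caption line. It assumes the tagger has already produced (tag, confidence) pairs. The function name, the 0.35 cutoff, and the optional trigger word are illustrative assumptions, not part of WD-Tagger itself.

```python
from typing import Iterable, Optional, Tuple


def format_tag_caption(
    scored_tags: Iterable[Tuple[str, float]],
    threshold: float = 0.35,        # assumed cutoff; tune per dataset
    trigger: Optional[str] = None,  # hypothetical character trigger word
    keep_underscores: bool = False,
) -> str:
    """Build a comma-separated caption from (tag, confidence) pairs."""
    # Keep only tags at or above the threshold, highest confidence first.
    kept = sorted(
        (pair for pair in scored_tags if pair[1] >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Optionally turn booru-style underscores into spaces.
    tags = [t if keep_underscores else t.replace("_", " ") for t, _ in kept]
    # Put the trigger word first and avoid listing it twice.
    if trigger:
        tags = [trigger] + [t for t in tags if t != trigger]
    return ", ".join(tags)
```

Whether to keep underscores depends on how the target base model's own captions were written, so the sketch leaves that as a switch instead of picking one.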
Both tools integrate with Kohya_ss, the most widely used LoRA training script. Users drop images into a folder, run the captioner, review the generated text files, then start training. The folder-and-text-file workflow is the same for any base model; only the caption style changes. A handful of practitioners also mentioned BLIP and BLIP-2 as alternatives, though Florence-2's newer architecture and WD-Tagger's tag-specific tuning have pulled most adoption.
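The folder workflow described above can be sketched as a small batch loop. This is an illustration under stated assumptions: `caption_fn` is a hypothetical callable standing in for whichever captioner is used (a Florence-2 or WD-Tagger wrapper would go there), and the trainer is assumed to read a `.txt` file that shares each image's base name.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of image extensions; extend as needed.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def caption_folder(
    folder: str,
    caption_fn: Callable[[Path], str],  # hypothetical captioner wrapper
    overwrite: bool = False,
) -> List[Path]:
    """Write one sidecar .txt caption per image; return the files written."""
    written: List[Path] = []
    # sorted() snapshots the listing before any new files are created.
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        sidecar = image.with_suffix(".txt")
        # Skip existing captions by default so hand edits survive a re-run.
        if sidecar.exists() and not overwrite:
            continue
        sidecar.write_text(caption_fn(image).strip() + "\n", encoding="utf-8")
        written.append(sidecar)
    return written
```

Skipping existing files by default is a deliberate choice here: it lets a practitioner re-run the captioner after adding images without losing captions already reviewed by hand.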
The next bottleneck is caption quality control: even the best auto-captioners miss context or hallucinate details. Practitioners still spend time editing the generated text files, especially for nuanced character features like clothing or expression. A captioning model trained specifically on character consistency across image sets would close that gap, but no open-weight release has tackled it yet.
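Part of that review pass can be scripted. The sketch below is one possible pre-training audit, assuming comma-separated captions stored in `.txt` files next to the images. The function name and the report layout are illustrative. It cannot detect hallucinated details; it only surfaces missing or empty captions and shows how often each tag appears across the set.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Tuple


def audit_captions(
    folder: str,
    image_exts: Tuple[str, ...] = (".png", ".jpg", ".jpeg", ".webp"),
) -> Dict[str, object]:
    """Report missing and empty captions plus per-tag image counts."""
    missing, empty = [], []
    tag_counts: Counter = Counter()
    captioned = 0
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in image_exts:
            continue
        sidecar = image.with_suffix(".txt")
        if not sidecar.exists():
            missing.append(image.name)
            continue
        text = sidecar.read_text(encoding="utf-8").strip()
        if not text:
            empty.append(image.name)
            continue
        captioned += 1
        # Use a set so each tag counts once per image.
        tag_counts.update({t.strip() for t in text.split(",") if t.strip()})
    return {
        "missing": missing,
        "empty": empty,
        "captioned": captioned,
        "tag_counts": tag_counts,
    }
```

Tags that appear in every caption, or in only one, are reasonable candidates for a manual look, since they may indicate a clothing or expression detail the captioner applied too broadly or only once.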
