Note
With the standard CLIP, this model generates nothing but noise, so be sure to use the bundled CLIP. The "CLIP_INCL" file contains both the DiT and CLIP models. T5XXL has not been modified, so the standard version can be used. Also, set the model shift to around 7 during inference; otherwise the generated images will look burned-in. In ComfyUI, this value can be adjusted with the ModelSamplingSD3 node.
Model Information
My English is terrible, so I use translation tools.
Description
This is an experimental anime model created to explore training methods for SD3.5 Medium, a flow-matching DiT model. Since the model is still in training, the art style lacks consistency and there are many issues with hands and anatomy.
Usage
- Resolution: ~1MP
- Model Shift: 7 ~ 10 (recommend 7)
- CFG Scale: 4 ~ 7 (recommend 6)
- Steps: 20 ~ 40
- Sampler: Euler a
- Scheduler: Simple or Beta
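For reference outside ComfyUI, below is a minimal inference sketch with Hugging Face diffusers under these settings. The model path is a placeholder, and loading this checkpoint together with its bundled CLIP may need extra steps; the sketch only shows where the shift, steps, and CFG scale are set.

```python
import torch
from diffusers import StableDiffusion3Pipeline, FlowMatchEulerDiscreteScheduler

# Placeholder path: point this at the downloaded model (with the bundled CLIP).
pipe = StableDiffusion3Pipeline.from_pretrained(
    "path/to/this-model", torch_dtype=torch.bfloat16
).to("cuda")

# Model shift ~7, the diffusers counterpart of ComfyUI's ModelSamplingSD3 node.
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=7.0
)

image = pipe(
    prompt="1girl, ...",     # see Prompt Format below
    num_inference_steps=28,  # Steps: 20 ~ 40
    guidance_scale=6.0,      # CFG Scale: 4 ~ 7 (recommend 6)
).images[0]
image.save("sample.png")
```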
Prompt Format (from Kohaku-XL-Epsilon)
<1girl/1boy/1other/...>, <character>, <series>, <artists>, <general tags>, <quality tags>, <year tags>, <meta tags>, <rating tags>
Due to the small amount of training, the <character>, <series>, and <artists> tags are almost non-functional. Because training focused on girl characters, the model may not generate boys or other non-person subjects well. Since the dataset was created with hakubooru, the prompt format is the same as the Kohaku-XL format. However, based on experiments, it is not strictly necessary to follow this format, as the model interprets natural language to some extent.
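As an illustration, a prompt in this format might look like the following; the character, series, and artist slots are placeholders (and, as noted above, mostly non-functional at this stage):

```
1girl, hatsune_miku, vocaloid, [artist name], solo, smile, looking at viewer, outdoors, masterpiece, best quality, newest, safe
```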
Special Tags
- Quality Tags: masterpiece, best quality, great quality, good quality, normal quality, low quality, worst quality
- Rating Tags: safe, sensitive, nsfw, explicit
- Date Tags: newest, recent, mid, early, old
Training
Observations and Reflections
Insights and considerations gained through training. Note that these reflections come from training on anime images and may not apply to photographic images.
- Curriculum Learning: From early experiments, I realized that the DiT model (or flow-matching models) learn concepts differently compared to conventional diffusion models. Therefore, I adopted a multi-stage training flow, starting with low-frequency components such as composition and poses to build a stable foundation, followed by learning high-frequency details.
- Gradual Resolution Increase: As comprehensive training from the beginning proved challenging, I first trained at a lower resolution to grasp basic concepts before transitioning to higher resolutions. Training was conducted in two stages: 512px → 1024px. Incorporating low-resolution training also helped reduce training time. The enable_scaled_pos_embed option caused increased artifacts outside the training resolution in my tests, so I trained without using it.
- T5 Attention Mask: Comparing results with and without the mask, I found that training without the mask produced more natural results. Hence, the mask was not used.
- Weighting Scheme: Under my conditions, a uniform weighting scheme led to increasing compositional breakdowns as large-scale training progressed. For large-scale training, logit_normal (which strongly emphasizes subjects such as people) or mode (which learns the entire image, including the background) worked better; see the sketch after this list.
- Training Shift: The SD3 technical report and official implementation use a default shift of 3.0. However, shift=3.0 at 1024px resolution often caused significant anatomical issues. In my tests, shift=2.5 worked well at 512px, while shift=5.0 or 7.0 was a balanced parameter for 1024px, minimizing artifacts while still capturing details. With shift values of 10 or 20, low-frequency regions were emphasized and the background was learned more strongly. Since far fewer steps are used at generation time, I recommend a shift of at least around 6 for inference (see the sketch after this list).
- Optimizers: I found AdamW, ScheduleFreeAdamW, and ADOPT to be effective. Among these, Cautious ADOPT (ADOPT combined with the Cautious optimizer modification) was particularly effective, though it could become noisy depending on the dataset, so careful use is required.
- Batch Size: The high batch sizes commonly used for full fine-tuning amplified instability in this model, so significantly lower batch sizes, such as 4 or 8, allowed more stable training.
- Local Minima: During training, parts of the image sometimes turned black, or outputs became unusually dark. These issues occurred most often at low learning rates, suggesting the model was trapped in a local minimum. Raising the learning rate (e.g., to 7.5e-4) tended to resolve them.
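To make the weighting-scheme and shift points concrete, here are two small sketches based on the SD3 paper and common flow-matching implementations (e.g., diffusers); sd-scripts' exact formulas may differ slightly, so treat these as illustrations rather than the training code.

Timestep sampling under the two weighting schemes: logit_normal concentrates training on mid-range noise levels, while mode (with the SD3 paper's default scale) spreads coverage across the whole range.

```python
import math
import torch

def sample_timesteps(batch_size: int, scheme: str = "logit_normal") -> torch.Tensor:
    """Draw training timesteps t in (0, 1) under a given weighting scheme."""
    if scheme == "logit_normal":
        # Sigmoid of a standard Gaussian: mass concentrates around t = 0.5.
        return torch.sigmoid(torch.randn(batch_size))
    if scheme == "mode":
        # "Mode" density from the SD3 paper with its default scale s = 1.29.
        u = torch.rand(batch_size)
        s = 1.29
        return 1.0 - u - s * (torch.cos(math.pi * u / 2.0) ** 2 - 1.0 + u)
    raise ValueError(f"unknown scheme: {scheme}")
```

And the shift itself: flow-matching schedulers warp the sigma schedule as sigma' = shift * sigma / (1 + (shift - 1) * sigma), so a larger shift keeps more of the schedule at high noise, where low-frequency structure (composition, background) is learned.

```python
import numpy as np

def shifted_sigmas(num_steps: int, shift: float) -> np.ndarray:
    """Linear flow-matching sigmas warped by the SD3-style shift."""
    sigmas = np.linspace(1.0, 1.0 / num_steps, num_steps)
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)

# With shift=7.0, far more of the schedule sits near sigma=1 than with shift=3.0.
print(shifted_sigmas(10, 3.0).round(2))
print(shifted_sigmas(10, 7.0).round(2))
```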
Dataset Preparation
I used hakubooru-based custom scripts.
- Exclude Tags:
traditional_media, photo_(medium), scan, animated, animated_gif, lowres, non-web_source, variant_set, tall image, duplicate, pixel-perfect_duplicate
- Minimum Post ID: 1,000,000
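The actual filtering was done with the hakubooru-based scripts mentioned above, which are not reproduced here. Purely as an illustration of the logic, a hypothetical sketch against the danbooru2023-metadata SQLite dump could look like this; the table and column names are assumptions, not hakubooru's real API.

```python
import sqlite3

# The exclude tags listed above, in underscore form as stored in tag strings.
EXCLUDE = {
    "traditional_media", "photo_(medium)", "scan", "animated", "animated_gif",
    "lowres", "non-web_source", "variant_set", "tall_image", "duplicate",
    "pixel-perfect_duplicate",
}
MIN_POST_ID = 1_000_000

def keep(post_id: int, tag_string: str) -> bool:
    """Apply the minimum-ID and exclude-tag filters to a single post."""
    return post_id >= MIN_POST_ID and not (set(tag_string.split()) & EXCLUDE)

# Hypothetical schema: a `posts` table with `id` and a space-separated `tag_string`.
conn = sqlite3.connect("danbooru2023.db")
selected = [
    pid for pid, tags in conn.execute("SELECT id, tag_string FROM posts")
    if keep(pid, tags)
]
```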
Training
- Training Hardware: A single RTX 4090
- Method: Full Fine-Tune
- Training Script: sd-scripts / pytorch_optimizer
- Basic Settings:
accelerate launch --num_cpu_threads_per_process 2 sd3_train.py ^
--sdpa --gradient_checkpointing --cache_latents --cache_latents_to_disk --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --use_t5xxl_cache_only ^
--max_data_loader_n_workers 1 --save_model_as "safetensors" ^
--mixed_precision "bf16" ^
--save_precision "bf16" ^
--min_bucket_reso 512 --max_bucket_reso 1280 --seed 1 ^
--max_train_epochs 1 ^
--learning_rate 5e-5 --learning_rate_te1 5e-6 --learning_rate_te2 1e-5 --train_batch_size 2 --gradient_accumulation_steps 2 ^
--optimizer_type adamw8bit ^
--lr_scheduler="cosine" --lr_warmup_steps 100 ^
--vae_batch_size 4 --cache_info ^
--bucket_no_upscale --keep_tokens_separator "|||" --disable_mmap_load_safetensors --training_shift 7 --weighting_scheme mode --train_text_encoder --save_clip
- v06C 72,000images (res512 bs6 warmup1000 --learning_rate 1.5e-5 --learning_rate_te1 2e-6 --learning_rate_te2 7.5e-6 --lr_scheduler="constant_with_warmup" weight_decay=0.05 --training_shift 0.75) 4epochs
- v07B 72,000images (res512 bs6 acc4 warmup200 --learning_rate 1.5e-5 --learning_rate_te1 2e-6 --learning_rate_te2 7.5e-6 --lr_scheduler="constant_with_warmup" weight_decay=0.05 --training_shift 0.75) 3epochs
- v08A 72,000images (res512 bs6 acc4 warmup200 --learning_rate 4.5e-5 --learning_rate_te1 6e-6 --learning_rate_te2 2.2e-5 --lr_scheduler="constant_with_warmup" weight_decay=0.05 --training_shift 0.65) 4epochs
- v09B 72,000images (res512 bs4 acc3 warmup200 --learning_rate 3e-5 --lr_scheduler="cosine" --optimizer_type pytorch_optimizer.optimizer.adopt.ADOPT --optimizer_args "cautious=True" "weight_decay=0.05" "weight_decouple=True" ) 4epochs
- v10B 72,000images (res512 bs4 acc3 warmup200 --learning_rate 3e-5 --learning_rate_te1 2e-6 --learning_rate_te2 7.5e-6 --lr_scheduler="cosine" weight_decay=0.01 --weighting_scheme cosmap --training_shift 2.5 ) 3epochs
- v11N 72,000images (res512 bs4 acc1 warmup200 --learning_rate 1e-4 --learning_rate_te1 2.5e-5 --learning_rate_te2 5e-5 --lr_scheduler="cosine" weight_decay=0.01 --weighting_scheme logit_normal --training_shift 2.5 ) 5epochs
- v12A 72,000images (res512 bs4 acc1 warmup200 --learning_rate 5e-5 --lr_scheduler="cosine" --optimizer_type pytorch_optimizer.optimizer.adopt.ADOPT --optimizer_args "cautious=True" "weight_decay=0.01" "weight_decouple=True" --weighting_scheme logit_normal --training_shift 2.5 ) 5epochs
- v001J 48,000images (res1024 bs1 acc4 warmup400 --learning_rate 5e-5 --lr_scheduler="cosine" --optimizer_type pytorch_optimizer.optimizer.adopt.ADOPT --optimizer_args "cautious=True" "weight_decay=0.01" "weight_decouple=True" --weighting_scheme logit_normal --training_shift 2 ) 1epoch
- v002A 48,000images (res1024 bs1 acc12 warmup200 --learning_rate 5e-5 --lr_scheduler="cosine" --optimizer_type pytorch_optimizer.optimizer.adopt.ADOPT --optimizer_args "cautious=True" "weight_decay=0.01" "weight_decouple=True" --weighting_scheme logit_normal --training_shift 2 ) 2epochs
- v003E 48,000images (res1024 bs1 acc4 warmup400 --learning_rate 2e-5 --lr_scheduler="cosine" --optimizer_type pytorch_optimizer.optimizer.adopt.ADOPT --optimizer_args "cautious=True" "weight_decay=0.01" "weight_decouple=True" --weighting_scheme logit_normal --training_shift 3 ) 1epoch
- v004H 60,000images (res1024 bs2 acc4 warmup100 --learning_rate 3e-5 --learning_rate_te1 5e-6 --learning_rate_te2 1e-5 --lr_scheduler="constant_with_warmup" --optimizer_type adamw8bit --weighting_scheme logit_normal --training_shift 4.5 ) 1epoch
- v005F 48,000images (res1024 bs2 acc4 warmup100 --learning_rate 4e-5 --learning_rate_te1 5e-6 --learning_rate_te2 1e-5 --lr_scheduler="constant_with_warmup" --optimizer_type adamw8bit --weighting_scheme logit_normal --training_shift 7 ) 2epochs
- v006A 36,000images (res1024 bs2 acc2 warmup100 --learning_rate 7.5e-5 --learning_rate_te1 1e-5 --learning_rate_te2 2e-5 --lr_scheduler="cosine" --optimizer_type adamw8bit --weighting_scheme logit_normal --training_shift 7 ) 2epochs
- v007A 48,000images (res1024 bs2 acc2 warmup100 --learning_rate 5e-5 --learning_rate_te1 1e-5 --learning_rate_te2 2e-5 --lr_scheduler="cosine" --optimizer_type adamw8bit --weighting_scheme mode --training_shift 7 ) 2epochs
- v008A 48,000images (res1024 bs2 acc2 warmup100 --learning_rate 5e-5 --learning_rate_te1 1e-5 --learning_rate_te2 2e-5 --lr_scheduler="cosine" --optimizer_type adamw8bit --weighting_scheme mode --training_shift 7 ) 2epochs
Resources (License)
- stable-diffusion-3.5-medium (stabilityai-ai-community)
- danbooru2023-webp-4Mpixel (MIT)
- danbooru2023-metadata-database (MIT)
License
stabilityai-ai-community
Acknowledgements
- Stability AI: Thanks for publishing a great open source model.
- kohya-ss: Thanks for publishing the essential training scripts and for the quick updates.
- Kohaku-Blueleaf: Thanks for the extensive publication of the scripts for the dataset and the various training conditions.