🎬 WAN Video LoRA Training

⚠️ Under Heavy Development
This setup is still experimental.
Use it as inspiration for WAN-based video generation (T2V/I2V/V2V/S2V) rather than as a production-ready guide.



🧩 1. Environment Setup

Create a clean Python environment using Conda (recommended) or venv:

bash
# 1) Conda / Virtual Environment
conda create -n wan22 python=3.10 -y
conda activate wan22

# 2) Install PyTorch (CUDA 12.x Build)
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install xformers accelerate transformers datasets peft bitsandbytes==0.43.3 safetensors einops
pip install opencv-python pillow tqdm

# 3) Clone Trainer
git clone https://github.com/Wan-Video/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -r requirements.txt || true
cd ..

💡 Tip: On NVIDIA A100 GPUs, always use BF16 precision for stable and efficient training.
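
Before moving on, it helps to confirm that the environment actually sees the GPU and supports BF16. A minimal sanity check, assuming the packages above installed cleanly:

python
# Quick environment check (run inside the wan22 environment).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())

try:
    import xformers
    print("xFormers:", xformers.__version__)
except ImportError:
    print("xFormers missing - memory-efficient attention will be unavailable")

On an A100 the BF16 check should print True; if it prints False, you are on an older GPU and FP16 may be the fallback.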


🧠 2. Model Setup

Place your WAN 2.2 model weights (T2V, I2V, V2V, or S2V, depending on your task) along with the VAE and text encoder files into the trainer's expected directories, or pass them explicitly via --model_name_or_path.

Examples:

bash
--model_name_or_path Wan-AI/Wan2.2-T2V-A14B   # Text-to-Video
--model_name_or_path Wan-AI/Wan2.2-I2V-A14B   # Image-to-Video
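
If you prefer to keep the weights on local disk rather than letting the trainer resolve the Hub ID at runtime, you can pre-download them with huggingface_hub. A sketch (the local_dir path is just an example, and these checkpoints are large):

python
# Optional: pre-download WAN 2.2 weights to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Wan-AI/Wan2.2-T2V-A14B",           # or Wan-AI/Wan2.2-I2V-A14B for I2V
    local_dir="/data/models/Wan2.2-T2V-A14B",   # example target path
)
print("Weights downloaded to:", local_path)

Afterwards, pass the local directory to --model_name_or_path instead of the Hub ID.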

🎞 3. Dataset Preparation

WAN expects a JSONL dataset file with one entry per video clip.

Example format:

json
{"video": "/data/myset/clip_0001.mp4", "prompt": "a cozy coffee shop scene at golden hour", "fps": 24, "seconds": 4, "resolution": "1280x720"}
{"video": "/data/myset/clip_0002.mp4", "prompt": "rainy neon city street, cinematic", "fps": 24, "seconds": 4, "resolution": "1280x720"}

📘 Notes:

  • For Text-to-Video (T2V), you may reference stills, frames, or a dummy video. The prompt and target specs (fps, seconds, resolution) are required.

  • Store your datasets as:

    • /data/wan22/train.jsonl
    • /data/wan22/val.jsonl
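
If your clips already sit in a single folder, a small helper can generate the JSONL and catch missing or mismatched files before training starts. A sketch, assuming one prompts.txt line per clip (adapt the paths and metadata to your data):

python
# Build train.jsonl from a folder of .mp4 clips plus a prompts.txt file
# (one prompt per clip, in sorted filename order). All paths are examples.
import json
from pathlib import Path

clip_dir = Path("/data/myset")
prompts = Path("/data/myset/prompts.txt").read_text().splitlines()
clips = sorted(clip_dir.glob("*.mp4"))
assert len(clips) == len(prompts), "expected exactly one prompt per clip"

with open("/data/wan22/train.jsonl", "w") as f:
    for clip, prompt in zip(clips, prompts):
        entry = {
            "video": str(clip),
            "prompt": prompt.strip(),
            "fps": 24,
            "seconds": 4,
            "resolution": "1280x720",
        }
        f.write(json.dumps(entry) + "\n")

print(f"Wrote {len(clips)} entries to /data/wan22/train.jsonl")

Run the same helper against a held-out folder to produce val.jsonl.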

⚙️ 4. accelerate Configuration

Initialize once:

bash
accelerate config default

Or define manually in ~/.cache/huggingface/accelerate/default_config.yaml:

yaml
compute_environment: LOCAL_MACHINE
distributed_type: NO
mixed_precision: bf16
num_processes: 1
gpu_ids: "0"
dynamo_backend: NO

👉 For multi-GPU training, set:

yaml
distributed_type: MULTI_GPU

🚀 5. LoRA Fine-Tuning (A100 40GB Example)

🧩 Text-to-Video (720p, 4 sec, BF16)

bash
conda activate wan22
cd DiffSynth-Studio

accelerate launch \
  train_wan_lora.py \
  --model_name_or_path "Wan-AI/Wan2.2-T2V-A14B" \
  --output_dir /data/wan22_lora_out \
  --dataset_json /data/wan22/train.jsonl \
  --resolution 720 --fps 24 --clip_seconds 4 \
  --train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --max_train_steps 20000 \
  --learning_rate 1e-4 --warmup_steps 500 \
  --lora_rank 64 --lora_alpha 64 \
  --use_bf16 --enable_xformers --gradient_checkpointing \
  --checkpointing_steps 1000 \
  --validation_json /data/wan22/val.jsonl --validation_steps 2000
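
For orientation, --lora_rank and --lora_alpha correspond to the usual PEFT-style LoRA settings. A rough sketch of what rank 64 / alpha 64 means in peft terms; the target module names below are assumptions, not the trainer's actual injection points:

python
# Illustrative only: how the rank/alpha flags above map onto a peft LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,             # --lora_rank
    lora_alpha=64,    # --lora_alpha; scaling factor = alpha / r = 1.0 here
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention projections
)

Keeping alpha equal to rank gives a scaling factor of 1.0, which is why the examples in this guide use 64/64.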

🖼 Image-to-Video (I2V)

Change only the model:

bash
--model_name_or_path "Wan-AI/Wan2.2-I2V-A14B"

🔧 6. Recommended A100 Tweaks

Situation               Recommended Adjustment
Plenty of VRAM          Increase --train_batch_size to 2 or use --lora_rank 96–128
Tight VRAM              Increase --gradient_accumulation_steps to 12–16
Character/Style LoRAs   6k–12k steps, rank 32–64
Precision               Always prefer BF16 over FP16
Optimization            Enable --gradient_checkpointing + --enable_xformers
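
To decide between the "plenty of VRAM" and "tight VRAM" rows, look at peak memory over the first few hundred steps rather than at one-off nvidia-smi snapshots. A minimal way to log it, assuming you can drop a few lines into the step loop or a callback:

python
# Log peak GPU memory after a training step, then reset for the next interval.
import torch

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")
torch.cuda.reset_peak_memory_stats()

If the peak stays comfortably below 40 GB, try batch size 2 or a higher rank; if it is close to the limit, raise gradient accumulation instead.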

💾 7. Resume Training / Checkpoints

bash
accelerate launch train_wan_lora.py \
  ... (same parameters) \
  --resume_from_checkpoint "/data/wan22_lora_out/checkpoint-10000"
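
If you do not want to hard-code the step number, a small helper can pick the most recent checkpoint, assuming the output directory uses the checkpoint-<step> naming produced by --checkpointing_steps:

python
# Find the latest checkpoint directory under the output dir.
from pathlib import Path

out_dir = Path("/data/wan22_lora_out")
checkpoints = sorted(out_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[-1]))
latest = checkpoints[-1] if checkpoints else None
print("Resume from:", latest)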

🧠 8. Inference / Testing

Most WAN workflows (CLI, ComfyUI, etc.) support loading LoRA adapters directly.

CLI Example:

bash
python infer_wan.py \
  --model_name_or_path "Wan-AI/Wan2.2-T2V-A14B" \
  --lora_path "/data/wan22_lora_out" \
  --prompt "cozy coffee shop at golden hour, bokeh" \
  --negative_prompt "distorted faces, artifacts" \
  --resolution 720 --fps 24 --seconds 4 \
  --output /data/wan22/samples/test001.mp4 \
  --use_bf16 --enable_xformers

💡 ComfyUI: Use the WAN Loader → attach LoRA(s) → render your test videos.
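
Before wiring the adapter into a CLI or ComfyUI workflow, it can be worth a quick look at what was actually saved. A sketch that inspects the LoRA safetensors file; the exact filename inside the output directory depends on the trainer, so treat the path as an example:

python
# Inspect a trained LoRA adapter: list a few tensor names and count parameters.
from safetensors.torch import load_file

state = load_file("/data/wan22_lora_out/pytorch_lora_weights.safetensors")  # example filename
total = sum(t.numel() for t in state.values())
print(f"{len(state)} tensors, {total:,} LoRA parameters")
for name in list(state)[:5]:
    print(name, tuple(state[name].shape))

Most LoRA checkpoints store an up/down matrix pair per adapted layer, so the tensor count should be roughly twice the number of adapted modules.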


🧮 9. Multi-GPU Training (Same Host)

Leverage multiple GPUs (e.g., 2× A100 40GB) for faster fine-tuning.

bash
accelerate config  # set distributed_type=MULTI_GPU, num_processes=2
accelerate launch \
  --multi_gpu \
  train_wan_lora.py \
  ... (same parameters) \
  --train_batch_size 1 --gradient_accumulation_steps 8

For setups with 4+ GPUs, enable --seq_parallel if the trainer supports it; sequence parallelism can significantly reduce per-GPU VRAM usage.


⚡ 10. Hyperparameter Reference

Type        LR     Rank     Alpha   Steps        Batch   Grad Accum   Notes
General     1e-4   64       64      10k–20k      1       8–12         Balanced baseline
Character   1e-4   64–128   64      8k–12k       1       8            Ideal for short 2–4 s clips
Style       1e-4   32–64    64      6k–10k       1       8–12         Broader stylistic range
Evaluation  –      –        –       every 1–2k   –       –            Test 2–4 fixed + 2 real prompts

🧾 Summary

WAN LoRA training enables:

  • Rapid customization of WAN 2.2 video generation models
  • Style, theme, and character consistency across outputs
  • Efficient fine-tuning using LoRA and xFormers with minimal VRAM overhead

Recommended setup:

  • ⚙️ CUDA 12.x
  • 🧠 NVIDIA A100 (40 GB)
  • 💡 BF16 precision
  • 🧩 xFormers + gradient checkpointing

🏁 Example Workflow Overview

text
Environment  →  Model Setup  →  Dataset Prep  →  LoRA Fine-tune  →  Inference

🎥 Train smarter. Generate faster. WAN stronger.