⚠️ Under Heavy Development
This setup is still experimental.
Use it as inspiration for WAN-based video generation (T2V/I2V/V2V/S2V) rather than a production-ready guide.
Order a GPU server for WAN Video
Create a clean Python environment using Conda (recommended) or venv:
# 1) Conda / Virtual Environment
conda create -n wan22 python=3.10 -y
conda activate wan22
# 2) Install PyTorch (CUDA 12.x Build)
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install xformers accelerate transformers datasets peft bitsandbytes==0.43.3 safetensors einops
pip install opencv-python pillow tqdm
# 3) Clone Trainer
git clone https://github.com/Wan-Video/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -r requirements.txt || true
cd ..
💡 Tip: On NVIDIA A100 GPUs, always use BF16 precision for stable and efficient training.
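Before committing to a long run, it is worth confirming that the GPU is visible and that BF16 is actually supported inside the wan22 environment. A minimal check using plain PyTorch calls (nothing WAN-specific):

# check_env.py - quick sanity check for GPU visibility and BF16 support
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())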
Place your WAN 2.2 model weights (depending on your task: T2V, I2V, V2V, S2V) along with the VAE and text encoder files into the trainer's expected directories, or pass them manually via --model_name_or_path.
Examples:
--model_name_or_path Wan-AI/Wan2.2-T2V-A14B # Text-to-Video
--model_name_or_path Wan-AI/Wan2.2-I2V-A14B # Image-to-Video
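If you prefer to pre-download the weights instead of letting the trainer fetch them on first run, a small sketch using huggingface_hub (pulled in as a dependency of transformers); the local_dir below is just an example path:

# download_weights.py - pre-fetch WAN 2.2 weights from the Hugging Face Hub
from huggingface_hub import snapshot_download

# Pick the repo that matches your task (T2V shown here; swap for I2V/V2V/S2V).
snapshot_download(
    repo_id="Wan-AI/Wan2.2-T2V-A14B",
    local_dir="/data/models/Wan2.2-T2V-A14B",  # example path, adjust to your layout
)

The downloaded folder can then be passed as --model_name_or_path /data/models/Wan2.2-T2V-A14B.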
WAN expects a JSONL dataset file with one entry per video clip.
{"video": "/data/myset/clip_0001.mp4", "prompt": "a cozy coffee shop scene at golden hour", "fps": 24, "seconds": 4, "resolution": "1280x720"}
{"video": "/data/myset/clip_0002.mp4", "prompt": "rainy neon city street, cinematic", "fps": 24, "seconds": 4, "resolution": "1280x720"}
📝 Notes:
For Text-to-Video (T2V), you may reference stills, frames, or a dummy video. The prompt and target specs (fps, seconds, resolution) are required.
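If all clips share the same target specs, the JSONL can be generated from a folder of videos plus a prompt mapping. A minimal sketch (the clip folder and prompt dictionary below are assumptions; adapt them to your data):

# make_jsonl.py - build a WAN-style train.jsonl from a folder of clips
import json
from pathlib import Path

CLIP_DIR = Path("/data/myset")          # assumed location of your .mp4 clips
PROMPTS = {                             # assumed clip -> prompt mapping
    "clip_0001.mp4": "a cozy coffee shop scene at golden hour",
    "clip_0002.mp4": "rainy neon city street, cinematic",
}

with open("/data/wan22/train.jsonl", "w") as f:
    for name, prompt in PROMPTS.items():
        f.write(json.dumps({
            "video": str(CLIP_DIR / name),
            "prompt": prompt,
            "fps": 24,
            "seconds": 4,
            "resolution": "1280x720",
        }) + "\n")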
Store your datasets as:
/data/wan22/train.jsonl
/data/wan22/val.jsonl

accelerate Configuration
Initialize once:
accelerate config default
Or define manually in
~/.cache/huggingface/accelerate/default_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: NO
mixed_precision: bf16
num_processes: 1
gpu_ids: "0"
dynamo_backend: NO
📌 For multi-GPU training, set:
distributed_type: MULTI_GPU
conda activate wan22
cd DiffSynth-Studio
accelerate launch \
train_wan_lora.py \
--model_name_or_path "Wan-AI/Wan2.2-T2V-A14B" \
--output_dir /data/wan22_lora_out \
--dataset_json /data/wan22/train.jsonl \
--resolution 720 --fps 24 --clip_seconds 4 \
--train_batch_size 1 \
--gradient_accumulation_steps 8 \
--max_train_steps 20000 \
--learning_rate 1e-4 --warmup_steps 500 \
--lora_rank 64 --lora_alpha 64 \
--use_bf16 --enable_xformers --gradient_checkpointing \
--checkpointing_steps 1000 \
--validation_json /data/wan22/val.jsonl --validation_steps 2000
Change only the model:
--model_name_or_path "Wan-AI/Wan2.2-I2V-A14B"
| Situation | Recommended Adjustment |
|---|---|
| Plenty of VRAM | Increase --train_batch_size to 2 or use --lora_rank 96–128 |
| Tight VRAM | Increase --gradient_accumulation_steps to 12–16 |
| Character/Style LoRAs | 6k–12k steps, rank 32–64 |
| Precision | Always prefer BF16 over FP16 |
| Optimization | Enable --gradient_checkpointing + --enable_xformers |
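As a rule of thumb when tuning these knobs, the effective batch size is --train_batch_size × --gradient_accumulation_steps × number of GPUs. With the defaults above (1 × 8 on a single GPU) that is 8 clips per optimizer step; raising the accumulation steps increases it at roughly the same per-step VRAM footprint.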
accelerate launch train_wan_lora.py \
... (same parameters) \
--resume_from_checkpoint "/data/wan22_lora_out/checkpoint-10000"
Most WAN workflows (CLI, ComfyUI, etc.) support loading LoRA adapters directly.
python infer_wan.py \
--model_name_or_path "Wan-AI/Wan2.2-T2V-A14B" \
--lora_path "/data/wan22_lora_out" \
--prompt "cozy coffee shop at golden hour, bokeh" \
--negative_prompt "distorted faces, artifacts" \
--resolution 720 --fps 24 --seconds 4 \
--output /data/wan22/samples/test001.mp4 \
--use_bf16 --enable_xformers
💡 ComfyUI: Use the WAN Loader → attach LoRA(s) → render your test videos.
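After a test render, you can quickly confirm that the clip matches the requested specs with OpenCV (installed earlier). A small sketch, assuming the output path from the command above:

# check_output.py - confirm resolution, fps and duration of a generated clip
import cv2

cap = cv2.VideoCapture("/data/wan22/samples/test001.mp4")
if not cap.isOpened():
    raise SystemExit("could not open the generated clip")
fps = cap.get(cv2.CAP_PROP_FPS)
frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()
print(f"{width}x{height} @ {fps:.1f} fps, {frames} frames (~{frames / fps:.1f} s)")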
Leverage multiple GPUs (e.g., 2× A100 40GB) for faster fine-tuning.
accelerate config # set distributed_type=MULTI_GPU, num_processes=2
accelerate launch \
--multi_gpu \
train_wan_lora.py \
... (same parameters) \
--train_batch_size 1 --gradient_accumulation_steps 8
For setups with 4+ GPUs, enable --seq_parallel if supported; it reduces VRAM load significantly.
| Type | LR | Rank | Alpha | Steps | Batch | Grad Accum | Notes |
|---|---|---|---|---|---|---|---|
| General | 1e-4 | 64 | 64 | 10k–20k | 1 | 8–12 | Balanced baseline |
| Character | 1e-4 | 64–128 | 64 | 8k–12k | 1 | 8 | Ideal for short 2–4s clips |
| Style | 1e-4 | 32–64 | 64 | 6k–10k | 1 | 8–12 | Broader stylistic range |
| Evaluation | – | – | – | every 1–2k | – | – | Test 2–4 fixed + 2 real prompts |
WAN LoRA training enables task-specific fine-tuning (characters, styles, motion) on top of the WAN 2.2 base checkpoints without retraining the full model.
Recommended workflow:
Environment → Model Setup → Dataset Prep → LoRA Fine-tune → Inference
🔥 Train smarter. Generate faster. WAN stronger.