Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
📸 Showcase
Visit our project page to view more cases.
⚙️ Installation
- System requirements: Ubuntu 20.04/22.04, CUDA 12.1
- Tested GPUs: H100
Download the code:
git clone https://github.com/fudan-generative-vision/hallo3
cd hallo3
Create conda environment:
conda create -n hallo python=3.10
conda activate hallo
Install packages with pip:
pip install -r requirements.txt
In addition, ffmpeg is required:
apt-get install ffmpeg
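For a quick sanity check before moving on, the sketch below (assuming PyTorch was installed by requirements.txt) reports whether a CUDA-capable GPU and ffmpeg are visible:

```python
# check_env.py -- report GPU and ffmpeg availability
import shutil

import torch

# Confirm that a CUDA-capable GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA runtime bundled with PyTorch:", torch.version.cuda)

# Confirm that ffmpeg is on PATH (needed for audio/video processing).
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```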
📥 Download Pretrained Models
You can easily get all pretrained models required for inference from our HuggingFace repo.
Use huggingface-cli to download the models:
cd $ProjectRootDir
pip install huggingface_hub
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
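If you prefer to script the download, the same snapshot can be fetched with the huggingface_hub Python API (a minimal sketch using the repo ID and target directory from the command above):

```python
# download_models.py -- fetch all pretrained weights into ./pretrained_models
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fudan-generative-ai/hallo3",  # same repo as the huggingface-cli command above
    local_dir="./pretrained_models",       # matches the layout expected by the scripts
)
```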
Or you can download them separately from their source repo:
- hallo3: Our checkpoints.
- CogVideoX: CogVideoX-5b-i2v pretrained model, consisting of the transformer and the 3D VAE
- t5-v1_1-xxl: text encoder; you can download it from text_encoder and tokenizer
- audio_separator: Kim Vocal_2 MDX-Net vocal removal model.
- wav2vec: wav2vec audio feature extraction model from Facebook.
- insightface: 2D and 3D face analysis models, placed into pretrained_models/face_analysis/models/ (thanks to deepinsight).
- face landmarker: face detection & mesh model from MediaPipe, placed into pretrained_models/face_analysis/models/.
Finally, these pretrained models should be organized as follows:
./pretrained_models/
|-- audio_separator/
| |-- download_checks.json
| |-- mdx_model_data.json
| |-- vr_model_data.json
| `-- Kim_Vocal_2.onnx
|-- cogvideox-5b-i2v-sat/
| |-- transformer/
| | |-- 1/
| | | `-- mp_rank_00_model_states.pt
| | `-- latest
| `-- vae/
|   `-- 3d-vae.pt
|-- face_analysis/
| `-- models/
| |-- face_landmarker_v2_with_blendshapes.task # face landmarker model from mediapipe
| |-- 1k3d68.onnx
| |-- 2d106det.onnx
| |-- genderage.onnx
| |-- glintr100.onnx
| `-- scrfd_10g_bnkps.onnx
|-- hallo3/
| |-- 1/
| | `-- mp_rank_00_model_states.pt
| `-- latest
|-- t5-v1_1-xxl/
| |-- added_tokens.json
| |-- config.json
| |-- model-00001-of-00002.safetensors
| |-- model-00002-of-00002.safetensors
| |-- model.safetensors.index.json
| |-- special_tokens_map.json
| |-- spiece.model
| `-- tokenizer_config.json
`-- wav2vec/
`-- wav2vec2-base-960h/
|-- config.json
|-- feature_extractor_config.json
|-- model.safetensors
|-- preprocessor_config.json
|-- special_tokens_map.json
|-- tokenizer_config.json
`-- vocab.json
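Before running inference, you can confirm the tree above is complete. The sketch below only checks a handful of key files from the layout shown; extend or adjust the list if you store the models elsewhere:

```python
# verify_models.py -- check that key pretrained files are where the scripts expect them
from pathlib import Path

ROOT = Path("./pretrained_models")

EXPECTED = [
    "audio_separator/Kim_Vocal_2.onnx",
    "cogvideox-5b-i2v-sat/transformer/1/mp_rank_00_model_states.pt",
    "cogvideox-5b-i2v-sat/vae/3d-vae.pt",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "face_analysis/models/scrfd_10g_bnkps.onnx",
    "hallo3/1/mp_rank_00_model_states.pt",
    "t5-v1_1-xxl/model.safetensors.index.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
]

missing = [path for path in EXPECTED if not (ROOT / path).exists()]
if missing:
    print("Missing files:")
    for path in missing:
        print("  -", path)
else:
    print("All expected pretrained files are present.")
```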
🛠️ Prepare Inference Data
Hallo3 has a few simple requirements for its inference input data (a small preprocessing sketch follows the list):
- The reference image must have a 1:1 or 3:2 aspect ratio.
- The driving audio must be in WAV format.
- The audio must be in English, since our training datasets contain only English speech.
- Make sure the vocals in the audio are clear; background music is acceptable.
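The sketch below shows one way to check a reference image and convert driving audio to WAV before inference. It assumes Pillow is installed and ffmpeg is on PATH, and the 16 kHz mono setting is an assumption (a common choice for wav2vec-style audio encoders), not a documented requirement:

```python
# prepare_inputs.py -- check the reference image ratio and convert driving audio to WAV
import subprocess

from PIL import Image


def check_aspect_ratio(image_path: str) -> None:
    """Warn if the reference image is not roughly 1:1 or 3:2."""
    width, height = Image.open(image_path).size
    ratio = width / height
    if not any(abs(ratio - target) < 0.01 for target in (1 / 1, 3 / 2)):
        print(f"Warning: {image_path} has aspect ratio {ratio:.3f}; expected 1:1 or 3:2.")


def to_wav(src_audio: str, dst_wav: str) -> None:
    """Convert any audio file to 16 kHz mono WAV with ffmpeg (sample rate is assumed)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_audio, "-ar", "16000", "-ac", "1", dst_wav],
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical example paths for illustration only.
    check_aspect_ratio("examples/reference.png")
    to_wav("examples/speech.mp3", "examples/speech.wav")
```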
🎮 Run Inference
Simply run scripts/inference_long_batch.sh:
bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output
Animation results will be saved in ./output. You can find more inference examples in the examples folder.
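After the script finishes, a small sketch like the one below can list the generated clips and their durations; it assumes the results are MP4 files under ./output and that ffprobe ships with your ffmpeg install:

```python
# list_outputs.py -- print every generated video under ./output with its duration
import subprocess
from pathlib import Path

for video in sorted(Path("./output").rglob("*.mp4")):
    # Ask ffprobe for the container duration in seconds.
    probe = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            str(video),
        ],
        capture_output=True,
        text=True,
    )
    print(f"{video}: {probe.stdout.strip()} s")
```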
Training
Prepare Data for Training
Organize your raw videos into the following directory structure:
dataset_name/
|-- videos/
| |-- 0001.mp4
| |-- 0002.mp4
| `-- 0003.mp4
|-- caption/
| |-- 0001.txt
| |-- 0002.txt
| `-- 0003.txt
You can use any dataset_name, but the videos and caption directories must be named as shown above (one way to build this layout is sketched below).
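A minimal sketch of that step follows; the raw clip folder, dataset name, and placeholder captions are all assumptions for illustration:

```python
# organize_dataset.py -- copy raw clips into the videos/ + caption/ layout shown above
import shutil
from pathlib import Path

RAW_DIR = Path("./raw_clips")          # hypothetical folder of collected .mp4 files
DATASET = Path("./my_talking_heads")   # any dataset_name works

(DATASET / "videos").mkdir(parents=True, exist_ok=True)
(DATASET / "caption").mkdir(parents=True, exist_ok=True)

for idx, clip in enumerate(sorted(RAW_DIR.glob("*.mp4")), start=1):
    name = f"{idx:04d}"
    shutil.copy(clip, DATASET / "videos" / f"{name}.mp4")
    # Every video needs a matching caption file; replace this placeholder with a real caption.
    (DATASET / "caption" / f"{name}.txt").write_text("a person is talking to the camera\n")
```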
Next, process the videos with the following command:
bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
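One natural reading of the {parallelism} and {rank} arguments is that rank runs from 0 to parallelism - 1, with each worker handling its own shard; the driver below launches one process per rank under that assumption (the exact argument semantics are not documented here):

```python
# preprocess_all.py -- launch one data_preprocess.sh worker per rank
import subprocess

DATASET = "my_talking_heads"   # hypothetical dataset_name from the previous step
OUTPUT = "output_name"         # metadata name referenced later in the training configs
PARALLELISM = 4                # assumption: rank takes the values 0 .. PARALLELISM - 1

workers = [
    subprocess.Popen(
        ["bash", "scripts/data_preprocess.sh", DATASET, str(PARALLELISM), str(rank), OUTPUT]
    )
    for rank in range(PARALLELISM)
]
for worker in workers:
    worker.wait()
```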
Training
Update the data meta path settings in the configuration YAML files, configs/sft_s1.yaml and configs/sft_s2.yaml:
#sft_s1.yaml
train_data: [
"./data/output_name.json"
]
#sft_s2.yaml
train_data: [
"./data/output_name.json"
]
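If you prefer to patch both configs from a script rather than by hand, a minimal PyYAML sketch is shown below; it assumes train_data is a top-level key in both files, as in the snippets above, and note that a plain dump drops any comments from the original YAML:

```python
# set_train_data.py -- point both stage configs at the preprocessed metadata file
import yaml  # PyYAML

META = "./data/output_name.json"   # metadata produced by the preprocessing step

for cfg_path in ("configs/sft_s1.yaml", "configs/sft_s2.yaml"):
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    cfg["train_data"] = [META]     # assumed to be a top-level key, matching the snippets above
    with open(cfg_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
```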
Start training with the following command:
# stage1
bash scripts/finetune_multi_gpus_s1.sh
# stage2
bash scripts/finetune_multi_gpus_s2.sh
📝 Citation
If you find our work useful for your research, please consider citing the paper:
@misc{cui2024hallo3,
title={Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks},
author={Jiahao Cui and Hui Li and Yun Zhang and Hanlin Shang and Kaihui Cheng and Yuqi Ma and Shan Mu and Hang Zhou and Jingdong Wang and Siyu Zhu},
year={2024},
eprint={2412.00733},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
⚠️ Social Risks and Mitigations
The development of portrait image animation technologies driven by audio inputs poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Privacy and consent concerns also arise from using individuals' images and voices. Addressing these involves transparent data usage policies, informed consent, and safeguarding privacy rights. By addressing these risks and implementing mitigations, the research aims to ensure the responsible and ethical development of this technology.
🤗 Acknowledgements
This model is a fine-tuned derivative of the CogVideoX-5B I2V model. CogVideoX-5B is an open-source text-to-video generation model developed by the CogVideoX team. Its original code and model parameters are governed by the CogVideoX-5B LICENSE.
As a derivative work of CogVideoX-5B, the use, distribution, and modification of this model must comply with the license terms of CogVideoX-5B.
👏 Community Contributors
Thank you to all the contributors who have helped to make this project better!