---
language:
- en
base_model:
- THUDM/CogVideoX-5b
---

# Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

\* equal contribution
This is the official repository for the paper "Tora: Trajectory-oriented Diffusion Transformer for Video Generation".

## 💡 Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that follow the trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics across diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora's excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.

## 📣 Updates

- `2024/10/23` 🔥🔥 Our [ModelScope Demo](https://www.modelscope.cn/studios/xiaoche/Tora) is launched. Welcome to try it out! We have also uploaded the model weights to [ModelScope](https://www.modelscope.cn/models/xiaoche/Tora).
- `2024/10/21` Thanks to [@kijai](https://github.com/kijai) for supporting Tora in ComfyUI! [Link](https://github.com/kijai/ComfyUI-CogVideoXWrapper)
- `2024/10/15` 🔥🔥 We released our inference code and model weights. **Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only. Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.**
- `2024/08/27` We released our v2 paper, including the appendix.
- `2024/07/31` We submitted our paper to arXiv and released our project page.

## 📑 Table of Contents

- [Showcases](#%EF%B8%8F-showcases)
- [Model Weights](#-model-weights)
- [Inference](#-inference)
- [Acknowledgements](#-acknowledgements)
- [Our previous work](#-our-previous-work)
- [Citation](#-citation)

## 🎞️ Showcases

All showcase videos are available at this [link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip).

## 📦 Model Weights

### Download Links

Downloading these weights requires acceptance of the [CogVideoX License](CogVideoX_LICENSE).

- SDK

```python
from modelscope import snapshot_download

model_dir = snapshot_download('xiaoche/Tora')
```

- Git

```bash
git clone https://www.modelscope.cn/xiaoche/Tora.git
```

## 🔄 Inference

Please refer to our [GitHub repository](https://github.com/alibaba/Tora) or the [ModelScope online demo](https://www.modelscope.cn/studios/xiaoche/Tora).

### Recommendations for Text Prompts

For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.
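As a rough illustration of this enhancement step, the sketch below expands a terse prompt with GPT-4 via the official OpenAI Python SDK (v1+). The system prompt and the `enhance_prompt` helper are our own illustrative assumptions, not the exact script used by CogVideoX or Tora; see the official resources listed after the sketch.

```python
# A minimal sketch of prompt enhancement, assuming the official OpenAI Python
# SDK (v1+) and an OPENAI_API_KEY in the environment. The system prompt and
# the enhance_prompt helper are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Rewrite the user's short video prompt into a single detailed paragraph "
    "describing the subject, its motion, the scene, lighting, and camera."
)

def enhance_prompt(short_prompt: str) -> str:
    """Expand a terse prompt into a detailed video-generation prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(enhance_prompt("a sailboat drifting across a calm lake"))
```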
You can refer to the following resources for guidance:

- [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
- [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)

## 🤝 Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

- [CogVideo](https://github.com/THUDM/CogVideo): An open source video generation framework by THUKEG.
- [Open-Sora](https://github.com/hpcaitech/Open-Sora): An open source video generation framework by HPC-AI Tech.
- [MotionCtrl](https://github.com/TencentARC/MotionCtrl): A video generation model supporting motion control by ARC Lab, Tencent PCG.
- [ComfyUI-DragNUWA](https://github.com/chaojie/ComfyUI-DragNUWA): An implementation of DragNUWA for ComfyUI.

Special thanks to the contributors of these libraries for their hard work and dedication!

## 📄 Our previous work

- [AnimateAnything: Fine Grained Open Domain Image Animation with Motion Guidance](https://github.com/alibaba/animate-anything)

## 📚 Citation

```bibtex
@misc{zhang2024toratrajectoryorienteddiffusiontransformer,
      title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
      author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      year={2024},
      eprint={2407.21705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.21705},
}
```