Mingyuan Zhang - Homepage

Publications [Google Scholar]

* indicates equal contribution, ✉ indicates corresponding / co-corresponding author

	Large Motion Model for Unified Multi-Modal Motion Generation Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu✉ arXiv, 2024 [Paper] [Project Page] [Video] [Code] The Large Motion Model (LMM) unifies various motion generation tasks into a scalable, generalist model, demonstrating broad applicability and strong generalization across diverse tasks.
	Digital Life Project: Autonomous 3D Characters with Social Intelligence Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, Xiangyu Fan, Han Du, Liang Pan, Peng Gao, Zhitao Yang, Yang Gao, Jiaqi Li, Tianxiang Ren, Yukun Wei, Xiaogang Wang, Chen Change Loy, Lei Yang✉, Ziwei Liu✉ Conference on Computer Vision and Pattern Recognition (CVPR), 2024 [Paper] [Project Page] [Code] Digital Life Project is a framework that uses language as the universal medium to build autonomous 3D characters capable of engaging in social interactions and expressing themselves through articulated body motions, thereby simulating life in a digital environment.
	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing Mingyuan Zhang, Huirong Li, Zhongang Cai, Jiawei Ren, Lei Yang, Ziwei Liu✉ Neural Information Processing Systems (NeurIPS), 2023 [Paper] [Project Page] [Code] FineMoGen is a diffusion-based motion generation and editing framework that can synthesize fine-grained motions with spatio-temporal composition according to user instructions.
	InsActor: Instruction-driven Physics-based Characters Jiawei Ren, Mingyuan Zhang*, Cunjun Yu, Xiao Ma, Liang Pan, Ziwei Liu✉ Neural Information Processing Systems (NeurIPS), 2023 [Paper] [Project Page] [Code] InsActor is a principled generative framework that leverages recent advancements in diffusion-based human motion models to produce instruction-driven animations of physics-based characters.
	SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu✉ Neural Information Processing Systems (NeurIPS Datasets and Benchmarks Track), 2023 [Paper] [Project Page] [Code] SMPLer-X is the first generalist foundation model for expressive human pose and shape estimation (EHPS). With big data and a large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability, even to unseen environments.
	PointHPS: Cascaded 3D Human Pose and Shape Estimation from Point Clouds Zhongang Cai, Liang Pan, Chen Wei, Wanqi Yin, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang, Ziwei Liu✉ arXiv, 2023 [Paper] [Project Page] [Video] [Code] PointHPS iteratively refines point features through a cascaded architecture to achieve more accurate 3D human pose and shape estimation (HPS) from point clouds.
	ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu✉ International Conference on Computer Vision (ICCV), 2023 [Paper] [Project Page] [Video] [Code] [Colab Demo] [Hugging Face Demo] ReMoDiffuse is a retrieval-augmented 3D human motion diffusion model. Benefiting from the extra knowledge in the retrieved samples, ReMoDiffuse achieves high fidelity to the given prompts.
	BiBench: Benchmarking and Analyzing Network Binarization Haotong Qin, Mingyuan Zhang*, Yifu Ding, Aoyu Li, Zhongang Cai, Ziwei Liu, Fisher Yu, Xianglong Liu✉ International Conference on Machine Learning (ICML), 2023 [Paper] A rigorously designed benchmark with in-depth analysis for network binarization. It first carefully scrutinizes the requirements of binarization in actual production and defines evaluation tracks and metrics for a comprehensive and fair investigation.
	MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, Ziwei Liu✉ Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024 [Paper] [Project Page] [Video] [Code] [Colab Demo] [Hugging Face Demo] The first text-driven motion generation pipeline based on diffusion models with probabilistic mapping, realistic synthesis and multi-level manipulation ability.
	HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang✉, Ziwei Liu✉ European Conference on Computer Vision (ECCV), 2022 (Oral Presentation) [Paper] [Project Page] [Video] A large-scale multi-modal (color images, point clouds, keypoints, SMPL parameters, and textured meshes) 4D human dataset with 1000 human subjects, 400k sequences and 60M frames.
	AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars Fangzhou Hong, Mingyuan Zhang*, Liang Pan, Zhongang Cai, Lei Yang, Ziwei Liu✉ ACM Transactions on Graphics (SIGGRAPH), 2022 [Paper] [Project Page] [Video] [Code] [Colab Demo] AvatarCLIP is the first zero-shot text-driven pipeline that empowers lay users to generate and animate 3D avatars from natural language descriptions.
	Balanced MSE for Imbalanced Visual Regression Jiawei Ren, Mingyuan Zhang, Cunjun Yu, Ziwei Liu✉ Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral Presentation) [Paper] [Project Page] [Talk] [Code] [Hugging Face Demo] A statistically principled loss function that addresses the train/test mismatch in imbalanced regression and coincides with the supervised contrastive loss.
	Delving Deep into the Generalization of Vision Transformers under Distribution Shifts Chongzhi Zhang, Mingyuan Zhang*, Shanghang Zhang, Daisheng Jin, Qiang Zhou, Zhongang Cai, Haiyu Zhao, Shuai Yi, Xianglong Liu, Ziwei Liu✉ Conference on Computer Vision and Pattern Recognition (CVPR), 2022 [Paper] [Code] A systematic comparison of the generalization ability of CNNs and ViTs. Three representative generalization-enhancement techniques are applied to ViTs to further explore their inner properties.
	Playing for 3D Human Recovery Zhongang Cai, Mingyuan Zhang*, Jiawei Ren, Chen Wei, Daxuan Ren, Zhengyu Lin, Haiyu Zhao, Lei Yang, Chen Change Loy, Ziwei Liu✉ arXiv, 2021 [Paper] [Code] A large-scale synthetic human dataset collected using the GTA-5 game engine, providing a stable performance boost to both frame-based and video-based HMR.
	BiBERT: Accurate Fully Binarized BERT Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, Xianglong Liu✉ International Conference on Learning Representations (ICLR), 2022 [Paper] [Code] BiBERT is the first fully binarized BERT. It introduces an efficient Bi-Attention structure and a DMD scheme, which yield impressive 59.2x and 31.2x savings on FLOPs and model size, respectively.
	REFINE: Prediction Fusion Network for Panoptic Segmentation Jiawei Ren, Cunjun Yu, Zhongang Cai, Mingyuan Zhang, Chongsong Chen, Haiyu Zhao, Shuai Yi, Hongsheng Li✉ Association for the Advancement of Artificial Intelligence (AAAI), 2021 [Paper] [Project Page] REFINE achieves high-quality panoptic segmentation by improving both cross-task and within-task prediction fusion.
	CSG-Stump: A Learning Friendly CSG-Like Representation for Interpretable Shape Parsing Daxuan Ren, Jianmin Zheng✉, Jianfei Cai, Jiatong Li, Haiyong Jiang, Zhongang Cai, Junzhe Zhang, Liang Pan, Mingyuan Zhang, Haiyu Zhao, Shuai Yi International Conference on Computer Vision (ICCV), 2021 [Paper] [Project Page] [Code] CSG-Stump learns shapes from point clouds and discovers the underlying constituent modeling primitives and operations.
	Towards Overcoming False Positives in Visual Relationship Detection Daisheng Jin, Xiao Ma, Chongzhi Zhang, Yizhuo Zhou, Jiashu Tao, Zhoujun Li, Mingyuan Zhang✉ British Machine Vision Conference (BMVC), 2021 [Paper] SABRA explores the imbalanced distribution in Human-Object Interaction detection. It further proposes a new pipeline to equip the model with sufficient spatial information.
	BiPointNet: Binary Neural Network for Point Clouds Haotong Qin, Zhongang Cai, Mingyuan Zhang, Yifu Ding, Haiyu Zhao, Shuai Yi, Xianglong Liu✉, Hao Su International Conference on Learning Representations (ICLR), 2021 [Paper] [Code] BiPointNet is the first fully binarized network for point cloud learning. BiPointNet gives an impressive 14.7x speedup and 18.9x storage saving on real-world resource-constrained devices.
	Efficient Attention: Attention with Linear Complexities Zhuoran Shen, Mingyuan Zhang*, Haiyu Zhao, Shuai Yi, Hongsheng Li✉ Winter Conference on Applications of Computer Vision (WACV), 2021 [Paper] [Code] Efficient Attention reduces the memory and computational complexities of the attention mechanism from quadratic to linear. It demonstrates significant improvement in performance-cost trade-offs on a variety of tasks including object detection, instance segmentation, stereo depth estimation, and temporal action localization.
	Graph attention based proposal 3d convnets for action detection Jun Li, Xianglong Liu✉, Zhuofan Zong, Wanru Zhao, Mingyuan Zhang, Jingkuan Song Association for the Advancement of Artificial Intelligence (AAAI), 2020 [Paper] AGCN-P-3DCNNs fuses intra and inter attention to model intra long-range dependencies and inter dependencies simultaneously. It also contains a simple and effective framewise classifier, which enhances the feature presentation capabilities of the backbone model.

Updated: 2024-5-8
