Ziyi Wang

I am a researcher at Tencent Hunyuan, where my work centers on large multimodal models and foundation models for physical AI.

I obtained my Ph.D. from the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2020, I received my B.Eng. from the Department of Electronic Engineering, Tsinghua University, along with a dual B.Admin. degree from the School of Economics and Management, Tsinghua University.

I am broadly interested in computer vision and deep learning. My research focuses on vision-language models, physical AI foundation models, world models, 3D/4D generation, and 3D vision.

Email / Google Scholar / GitHub

News

2026-07: Release Hy-Embodied-VLM-1.0, an efficient Mixture-of-Experts vision-language foundation model for embodied agents.

2026-04: Release HY-Embodied-0.5, a suite of embodied foundation models from Tencent.

2026-03: Serve as Area Chair for ECCV 2026.

2025-06: 1 survey paper on vision generalist models is accepted to IJCV.

2025-02: 1 paper on unified 3D point cloud pre-training is accepted to CVPR 2025.

2024-09: 1 paper on 3D open vocabulary semantic segmentation is accepted to NeurIPS 2024.

2024-01: The journal paper of P2P is accepted to TPAMI.

2023-07: 1 paper on 3D generative pre-training is accepted to ICCV 2023.

2023-07: The journal paper of PV-RAFT is accepted to TPAMI.

2022-09: 1 paper (spotlight) on 3D prompt learning is accepted to NeurIPS 2022.

2022-03: 1 paper on 3D semantic segmentation is accepted to CVPR 2022.

2021-07: 2 papers (including 1 oral) are accepted to ICCV 2021.

2021-03: 1 paper on 3D scene flow estimation is accepted to CVPR 2021.

Publications

* indicates equal contribution, # indicates project lead / corresponding author

	Hy-Embodied-VLM-1.0: Efficient Physical-World Agents Ziyi Wang, Xumin Yu, Yongming Rao#, Yonggen Ling, Yunheng Li, Oran Wang, Mingqi Gao, Yuchen Zhou, Yves Liang, Zuyan Liu, Yani Zhang, Rui Huang, Xiaoran Xu, Bowen Yuan, Yifu Yuan, Xu Tan, He Zhang, Yufei Huang, Shenghao Zhang, Hongsheng Wu, Han Hu, Zhengyou Zhang Technical Report, 2026 [arXiv] [Code] [Model] Hy-Embodied-VLM-1.0 is an efficient Mixture-of-Experts embodied foundation model (~3B activated) that, guided by an action-centric capability taxonomy, achieves state-of-the-art physical-world understanding and reasoning across 38 embodied benchmarks.
	HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yongming Rao#, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Shunyu Yao Technical Report, 2026 [arXiv] [Code] [Model] HY-Embodied-0.5 is a family of embodied foundation models built for real-world agents, achieving state-of-the-art performance across 22 benchmarks in visual perception, spatial reasoning, and embodied understanding, with effective downstream robot control.
	Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image Yanran Zhang, Ziyi Wang*, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu Preprint. [arXiv] [Code] MoRe4D synthesizes high-quality 4D scenes from a single image by jointly performing motion generation and 3D geometry reconstruction.
	Vision Generalist Model: A Survey Ziyi Wang, Yongming Rao, Shuofeng Sun, Xinrun Liu, Yi Wei, Xumin Yu, Zuyan Liu, Yanbo Wang, Hongmin Liu, Jie Zhou, Jiwen Lu International Journal of Computer Vision (IJCV), 2025 [arXiv] We conduct a comprehensive survey on vision generalist models that support multimodal inputs and can handle various downstream tasks.
	OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View Yanbo Wang, Ziyi Wang*, Wenzhao Zheng, Jie Zhou, Jiwen Lu Preprint. [arXiv] [Code] [Project Page] OGGSplat is designed to expand the field-of-view of the Gaussian-based 3D scene reconstructed from sparse views and feedforward / generalizable models.
	UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting Ziyi Wang, Yanran Zhang, Jie Zhou, Jiwen Lu IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025 [arXiv] [Code] UniPre3D is a unified pre-training method that can be applied to both object-level and scene-level point clouds. It is supported by a cross-modal Gaussian splatting technique.
	XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, Jiwen Lu Conference on Neural Information Processing Systems (NeurIPS), 2024 [arXiv] [Code] XMask3D is a framework that proposes mask-level reasoning techniques to give 3D segmentation models open-vocabulary capability with the assistance of a pre-trained 2D mask generator.
	Point-to-Pixel Prompting for Point Cloud Analysis With Pre-Trained Image Models Ziyi Wang, Yongming Rao, Xumin Yu, Jie Zhou, Jiwen Lu IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024 [IEEE] [Code] [Project Page] P2P++ is the extended journal version of P2P. We further propose Pixel-to-Point Distillation to make P2P applicable in scene-level perception tasks.
	3D Point-Voxel Correlation Fields for Scene Flow Estimation Ziyi Wang, Yi Wei, Yongming Rao, Jie Zhou, Jiwen Lu IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023 [IEEE] [Code] [Project Page] DPV-RAFT is the extended journal version of PV-RAFT. We further propose Spatial Deformation and Temporal Deformation to enhance PV-RAFT.
	Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu IEEE International Conference on Computer Vision (ICCV), 2023 [arXiv] [Code] [Project Page] TAP is a 3D-to-2D generative pre-training method that generates projected images of point clouds from instructed perspectives.
	P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu Conference on Neural Information Processing Systems (NeurIPS), 2022 Spotlight [arXiv] [Code] [Project Page] [中文解读] P2P is a framework to leverage large-scale pre-trained image models for 3D point cloud analysis.
	SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation Ziyi Wang, Yongming Rao, Xumin Yu, Jie Zhou, Jiwen Lu IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022 [arXiv] [Code] We present Semantic-Affine Transformation that transforms decoder mid-level features of the encoder-decoder segmentation network with class-specific affine parameters.
	PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou IEEE International Conference on Computer Vision (ICCV), 2021 Oral Presentation [arXiv] [Code] [中文解读] PoinTr is a transformer-based framework that reformulates point cloud completion as a set-to-set translation problem.
	Towards Interpretable Deep Metric Learning with Structural Matching Wenliang Zhao, Yongming Rao, Ziyi Wang, Jiwen Lu, Jie Zhou IEEE International Conference on Computer Vision (ICCV), 2021 [arXiv] [Code] We present deep interpretable metric learning (DIML), which adopts a structural matching strategy to explicitly align the spatial embeddings by computing an optimal matching flow between feature maps of the two images.
	PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds Yi Wei, Ziyi Wang, Yongming Rao, Jiwen Lu, Jie Zhou IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021 [arXiv] [Code] We present point-voxel correlation fields for 3D scene flow estimation, which migrate the high performance of RAFT and provide a solution to build structured all-pairs correlation fields for unstructured point clouds.