Ziyi Wang

I am a researcher at Tencent Hunyuan, where my work centers on large multimodal models and foundation models for physical AI.

I obtained my Ph.D. from the Department of Automation at Tsinghua University, advised by Prof. Jiwen Lu. In 2020, I received my B.Eng. from the Department of Electronic Engineering, Tsinghua University, along with a dual B.Admin. degree from the School of Economics and Management, Tsinghua University.

I am broadly interested in computer vision and deep learning. My research focuses on vision-language models, physical AI foundation models, world models, 3D/4D generation, and 3D vision.

Email / Google Scholar / GitHub

News

  • 2026-04: Release HY-Embodied-0.5, a suite of embodied foundation models from Tencent.
  • 2026-03: Serve as Area Chair for ECCV 2026.
  • 2025-06: 1 survey paper on vision generalist models is accepted to IJCV.
  • 2025-02: 1 paper on unified 3D point cloud pre-training is accepted to CVPR 2025.
  • 2024-09: 1 paper on 3D open vocabulary semantic segmentation is accepted to NeurIPS 2024.
  • 2024-01: The journal paper of P2P is accepted to TPAMI.
  • 2023-07: 1 paper on 3D generative pre-training is accepted to ICCV 2023.
  • 2023-07: The journal paper of PV-RAFT is accepted to TPAMI.
  • 2022-09: 1 paper (spotlight) on 3D prompt learning is accepted to NeurIPS 2022.
  • 2022-03: 1 paper on 3D semantic segmentation is accepted to CVPR 2022.
  • 2021-07: 2 papers (including 1 oral) are accepted to ICCV 2021.
  • 2021-03: 1 paper on 3D scene flow estimation is accepted to CVPR 2021.
Publications

    * indicates equal contribution

    HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
    Xumin Yu*, Zuyan Liu*, Ziyi Wang*, He Zhang*, Yongming Rao#, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Shunyu Yao
    Technical Report, 2026
    [arXiv] [Code] [Model]

    HY-Embodied-0.5 is a family of embodied foundation models built for real-world agents, achieving state-of-the-art performance across 22 benchmarks in visual perception, spatial reasoning, and embodied understanding, with effective downstream robot control.

    Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image
    Yanran Zhang*, Ziyi Wang*, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
    Preprint.
    [arXiv] [Code]

    MoRe4D synthesizes high-quality 4D scenes from a single image by jointly performing motion generation and 3D geometry reconstruction.

    Vision Generalist Model: A Survey
    Ziyi Wang, Yongming Rao, Shuofeng Sun, Xinrun Liu, Yi Wei, Xumin Yu, Zuyan Liu, Yanbo Wang, Hongmin Liu, Jie Zhou, Jiwen Lu
    International Journal of Computer Vision (IJCV), 2025
    [arXiv]

    We conduct a comprehensive survey on vision generalist models that support multimodal inputs and can handle various downstream tasks.

    OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View
    Yanbo Wang*, Ziyi Wang*, Wenzhao Zheng, Jie Zhou, Jiwen Lu
    Preprint.
    [arXiv] [Code] [Project Page]

    OGGSplat expands the field-of-view of Gaussian-based 3D scenes reconstructed from sparse views with feedforward / generalizable models.

    UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
    Ziyi Wang*, Yanran Zhang*, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    [arXiv] [Code]

    UniPre3D is a unified pre-training method that can be applied to both object-level and scene-level point clouds. It is supported by a cross-modal Gaussian splatting technique.

    XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
    Ziyi Wang*, Yanbo Wang*, Xumin Yu, Jie Zhou, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2024
    [arXiv] [Code]

    XMask3D is a framework that proposes mask-level reasoning techniques to give 3D segmentation models open-vocabulary capacity with the assistance of a pre-trained 2D mask generator.

    Point-to-Pixel Prompting for Point Cloud Analysis With Pre-Trained Image Models
    Ziyi Wang, Yongming Rao, Xumin Yu, Jie Zhou, Jiwen Lu
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
    [IEEE] [Code] [Project Page]

    P2P++ is the extended journal version of P2P. We further propose Pixel-to-Point Distillation to make P2P applicable in scene-level perception tasks.

    3D Point-Voxel Correlation Fields for Scene Flow Estimation
    Ziyi Wang*, Yi Wei*, Yongming Rao, Jie Zhou, Jiwen Lu
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
    [IEEE] [Code] [Project Page]

    DPV-RAFT is the extended journal version of PV-RAFT. We further propose Spatial Deformation and Temporal Deformation to enhance PV-RAFT.

    Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models
    Ziyi Wang*, Xumin Yu*, Yongming Rao, Jie Zhou, Jiwen Lu
    IEEE International Conference on Computer Vision (ICCV), 2023
    [arXiv] [Code] [Project Page]

    TAP is a 3D-to-2D generative pre-training method that generates projected images of point clouds from instructed perspectives.

    P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting
    Ziyi Wang*, Xumin Yu*, Yongming Rao*, Jie Zhou, Jiwen Lu
    Conference on Neural Information Processing Systems (NeurIPS), 2022
    Spotlight
    [arXiv] [Code] [Project Page] [中文解读]

    P2P is a framework to leverage large-scale pre-trained image models for 3D point cloud analysis.

    SemAffiNet: Semantic-Affine Transformation for Point Cloud Segmentation
    Ziyi Wang, Yongming Rao, Xumin Yu, Jie Zhou, Jiwen Lu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [arXiv] [Code]

    We present Semantic-Affine Transformation that transforms decoder mid-level features of the encoder-decoder segmentation network with class-specific affine parameters.

    PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers
    Xumin Yu*, Yongming Rao*, Ziyi Wang, Zuyan Liu, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    Oral Presentation
    [arXiv] [Code] [中文解读]

    PoinTr is a transformer-based framework that reformulates point cloud completion as a set-to-set translation problem.

    Towards Interpretable Deep Metric Learning with Structural Matching
    Wenliang Zhao*, Yongming Rao*, Ziyi Wang, Jiwen Lu, Jie Zhou
    IEEE International Conference on Computer Vision (ICCV), 2021
    [arXiv] [Code]

    We present deep interpretable metric learning (DIML), which adopts a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images.

    PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds
    Yi Wei*, Ziyi Wang*, Yongming Rao*, Jiwen Lu, Jie Zhou
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    [arXiv] [Code]

    We present point-voxel correlation fields for 3D scene flow estimation, which migrate the high performance of RAFT and provide a solution to build structured all-pairs correlation fields for unstructured point clouds.

    Teaching

  • Teaching Assistant, Computer Vision, 2024 Spring Semester
  • Teaching Assistant, Pattern Recognition and Machine Learning, 2022 Fall Semester
Honors and Awards

  • 2025 Hui Yan Talent Scholarship, Tsinghua University
  • 2024 National Scholarship, Tsinghua University
  • 2023 ChangXin Memory Scholarship, Tsinghua University
  • 2023 CVPR Outstanding Reviewer
  • 2021 Haining Talent Scholarship, Tsinghua University
  • 2020 Excellent Graduation Thesis, Tsinghua University
  • 2018 Zheng Geru Scholarship, Tsinghua University
  • 2017 Hongqian Electronics Scholarship, Tsinghua University



    © Ziyi Wang | Last updated: May 30, 2026