Shilong Zhang 张士龙

Shilong is a final-year Ph.D. student (2023–present) at MMLab@The University of Hong Kong (HKU), advised by Prof. Ping Luo. Previously, he worked with Kai Chen and contributed as a core developer to MMDetection and MMCV.

He received his Bachelor's degree from the University of Science and Technology of China (USTC) in 2019, where he was recognized as an outstanding graduate (top 5%).

My research agenda connects visual perception/understanding and generation, aiming to build unified models that truly understand the world, rather than merely translating between modalities.

2020–2023 · Object Detection & Multimodal Understanding

My research began with object detection, focusing on efficient architectures and training.

With the rise of large language models around 2022, I recognized that multimodal large language models would be central to the future of AI, which led me to explore vision–language models (VLMs). However, I soon realized a fundamental limitation of the dominant VLM paradigms: projecting vision into language space does not genuinely enable models to learn visual knowledge.

This stage of my work was completed during my time at SenseTime and Shanghai AI Lab.

2023–2025 · Generative Models

This insight motivated my transition to vision generation, where models are forced to internalize visual structure, semantics, and dynamics directly from data—much closer to how humans learn by continuously predicting visual content.

This stage of my work was completed at Alibaba Tongyi Lab and ByteDance.

2025 · Unify

After accumulating substantial experience in generative modeling, by 2025 it became clear—to both the community and myself—that understanding and generation should be unified. I therefore focus on what I consider the most urgent problem in unified modeling: a single, principled visual encoder that supports both understanding and generation. This led to PS-VAE, a key step toward a unified encoder enabling understanding, generation, and editing within a shared representation space.

This work was completed at Adobe Research.

Next · Video Unified Models

Looking forward, I believe images alone are insufficient for learning rich and grounded world knowledge, and I am particularly excited about video-based unified models, which I see as a potential next major leap beyond LLMs by capturing dynamics, causality, and long-term structure.

If your team shares this perspective, I would be glad to connect and discuss potential collaborations.

[2025/12/19] PS-VAE: 96-channel generative latent toward a unified encoder for understanding + generation/editing (Project Page).
[2025/2/10] We present FlashVideo, an efficient paradigm for text-to-video generation (Code).
[2024/3/26] We propose FlashFace, which generates images with high identity fidelity in seconds (Code).
[2023/7/7] We present GPT4RoI, a vision and language model for region-level image understanding.
[2023/4/26] We present a vision and language model named MultiModal-GPT.
[2023/3/20] DDQ-DETR achieves 52.1 AP with an R-50 backbone within 12 epochs (Code).
[2021/11/27] We release MMFewShot, an open-source few-shot learning toolbox based on PyTorch.
[2019/6/28] Named an outstanding graduate by USTC.

[1] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

We systematically adapt understanding-oriented encoder features for generation/editing by jointly regularizing semantics and pixel reconstruction, compressing both into a compact 96-channel latent (16×16 downsampling). This points to the potential of a unified encoder that supports understanding + generation/editing within a single model backbone.

Paper Page

[2] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

Shilong Zhang*, Wenbo Li*, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo

AAAI 2026 (* Equal contribution)

(a) Generation is divided into a prompt-fidelity stage and a quality-enhancement stage, which substantially reduces the DiT's computational load. (b) Users can preview the initial output and adjust the prompt before committing to full-resolution generation, which significantly reduces wait times and improves commercial viability.

Paper Code

[3] FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt.

Paper Code

[4] GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Shilong Zhang*, Peize Sun*, Shoufa Chen*, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo

ECCVW 2025 (* Equal contribution)

We present GPT4RoI, a vision and language model for region-level image understanding.

Paper Code

[5] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

Tao Gong*, Chengqi Lyu*, Shilong Zhang*, Yudong Wang*, Miao Zheng*, Qian Zhao*, Kuikun Liu*, Wenwei Zhang*, Ping Luo, Kai Chen

(* random order)

We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans.

Paper Code

[6] Dense Distinct Query for End-to-End Object Detection

Shilong Zhang*, Xinjiang Wang*, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, Kai Chen

CVPR 2023 (* Equal contribution)

DDQ-DETR achieves 52.1 AP on the MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting.

Paper Code

[7] Group R-CNN for Point-based Weakly Semi-supervised Object Detection

Shilong Zhang*, Zhuoran Yu*, Liyang Liu*, Xinjiang Wang, Aojun Zhou, Kai Chen

CVPR 2022 (* Equal contribution)

We study the problem of weakly semi-supervised object detection with points (WSSOD-P). Group R-CNN significantly outperforms the prior method Point DETR by 3.9 mAP with 5% well-labeled images.

Paper Code

[8] Group Fisher Pruning for Practical Network Compression

Liyang Liu, Shilong Zhang, Zhanghui Kuang, Aojun Zhou, Jing-Hao Xue, Xinjiang Wang, Yimin Chen, Wenming Yang, Qingmin Liao, Wayne Zhang

ICML 2021

We present a general channel pruning framework for complicated structures.

Paper Code

[9] Scale-equalizing Pyramid Convolution for Object Detection

Xinjiang Wang*, Shilong Zhang*, Zhuoran Yu, Litong Feng, Wayne Zhang

CVPR 2020 (* Equal contribution)

We propose a scale-equalizing pyramid convolution method that relaxes the discrepancy between the feature pyramid and the Gaussian pyramid. The module boosts performance by about 3.5 mAP in single-stage object detection with negligible additional inference time.

Paper Code
