Shilong Zhang 张士龙

Shilong is a final-year Ph.D. student (2023–present) at MMLab@The University of Hong Kong (HKU), advised by Prof. Ping Luo. Previously, he worked with Kai Chen and contributed as a core developer to MMDetection and MMCV.

During his Ph.D., he has collaborated broadly with industry teams, including Alibaba Wan, ByteDance Monetization and Seed, Adobe Research, and Meta.

He received his Bachelor's degree from the University of Science and Technology of China (USTC) in 2019, graduating among the top 5% as an outstanding graduate.

Research

My research agenda connects visual perception/understanding and generation, aiming to build unified models that truly understand the world, rather than merely translating between modalities.

2020–2023 · Object Detection & Multimodal Understanding

My research began with object detection, focusing on the co-design of efficient architectures and training optimization.

With the rise of large language models around 2022, I recognized that multimodal large language models would be central to the future of AI, which led me to explore vision–language models. However, I soon realized a fundamental limitation of dominant VLM paradigms: projecting vision into language space does not genuinely enable models to learn visual knowledge.

This stage of my work was developed in collaboration with SenseTime and Shanghai AI Lab.

2023–2025 · Generative Models

This insight motivated my transition to vision generation, where models are forced to internalize visual structure, semantics, and dynamics directly from data—much closer to how humans learn by continuously predicting visual content.

This stage of my work was developed in collaboration with Alibaba Tongyi Lab and ByteDance.

2025–present · Unify

I aim to build a unified visual representation space that is strong for understanding, friendly to generation, and capable of preserving all visual information. I believe this is a key step toward an elegant vision–language model for both understanding and generation.

This stage of my work has been developed in collaboration with Adobe Research, ByteDance Seed, and Meta.

If your team is interested in my work, I would be glad to connect and discuss potential collaborations.

Recent News

[2025/12/19] PS-VAE: 96-channel generative latent toward a unified encoder for understanding + generation/editing (Project Page).
[2025/2/10] We present FlashVideo, an efficient paradigm for text-to-video generation (Code).
[2024/3/26] We propose FlashFace, which generates images with high identity fidelity in seconds (Code).
[2023/7/7] We present GPT4RoI, a vision and language model for region-level image understanding.
[2023/4/26] We present a vision and language model named MultiModal-GPT.
[2023/3/20] DDQ-DETR achieves 52.1 AP with a ResNet-50 backbone within 12 epochs (Code).
[2021/11/27] We release MMFewShot, an open-source few-shot learning toolbox based on PyTorch.
[2019/6/28] Named an outstanding graduate by USTC.

Selected Publications

[1] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

ICML 2026

We systematically adapt understanding-oriented encoder features for generation/editing by jointly regularizing semantics and pixel reconstruction, compressing both into a compact 96-channel latent (16×16 downsampling). This points to the potential of a unified encoder that supports understanding + generation/editing within a single model backbone.

Paper Page

[2] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

Shilong Zhang*, Wenbo Li*, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo

AAAI 2026 (* Equal contribution)

FlashVideo (a) divides generation into a prompt-fidelity stage and a quality-enhancement stage, substantially reducing the DiT's computational load, and (b) lets users preview the initial output and adjust the prompt before committing to full-resolution generation, which significantly reduces wait times and improves commercial viability.

Paper Code

[3] FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt.

Paper Code

[4] GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Shilong Zhang*, Peize Sun*, Shoufa Chen*, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo

ECCVW 2025 (* Equal contribution)

We present GPT4RoI, a vision and language model for region-level image understanding.

Paper Code

[5] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

Tao Gong*, Chengqi Lyu*, Shilong Zhang*, Yudong Wang*, Miao Zheng*, Qian Zhao*, Kuikun Liu*, Wenwei Zhang*, Ping Luo, Kai Chen

(* random order)

We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans.

Paper Code

[6] Dense Distinct Query for End-to-End Object Detection

Shilong Zhang*, Xinjiang Wang*, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, Kai Chen

CVPR 2023 (* Equal contribution)

DDQ-DETR achieves 52.1 AP on the MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting.

Paper Code

[7] Group R-CNN for Point-based Weakly Semi-supervised Object Detection

Shilong Zhang*, Zhuoran Yu*, Liyang Liu*, Xinjiang Wang, Aojun Zhou, Kai Chen

CVPR 2022 (* Equal contribution)

We study the problem of weakly semi-supervised object detection with points (WSSOD-P). Group R-CNN significantly outperforms the prior method Point DETR by 3.9 mAP with 5% well-labeled images.

Paper Code

[8] Group Fisher Pruning for Practical Network Compression

Liyang Liu, Shilong Zhang, Zhanghui Kuang, Aojun Zhou, Jing-Hao Xue, Xinjiang Wang, Yimin Chen, Wenming Yang, Qingmin Liao, Wayne Zhang

ICML 2021

We present a general channel pruning framework for complicated structures.

Paper Code

[9] Scale-equalizing Pyramid Convolution for Object Detection

Xinjiang Wang*, Shilong Zhang*, Zhuoran Yu, Litong Feng, Wayne Zhang

CVPR 2020 (* Equal contribution)

We propose a scale-equalizing pyramid convolution method that relaxes the discrepancy between the feature pyramid and the Gaussian pyramid. The module improves single-stage object detection by about 3.5 mAP with negligible additional inference time.

Paper Code

Selected Collaborative Work

2025

Goku: Flow Based Video Generative Foundation Models

Joint image-video foundation model work for flow-based generation, with strong public code visibility and broad community uptake.

Code Page

2024

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

A high-impact open-source direction showing that next-token prediction can scale to image generation with the right tokenizer and architecture choices.

Code

2024

Zero-shot Image Editing with Reference Imitation

MimicBrush reframes image editing around visual references, letting users specify style, identity, and local detail through examples rather than text alone.

Code

2025

PixelFlow: Pixel-Space Generative Models with Flow

Pixel-space flow generation that questions whether image generation must depend on a pretrained VAE bottleneck.

Code

2025

Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Aligns diffusion post-training with the pretraining objective through advantage-weighted matching rather than a separate noisy RL objective.

Code

2022

RTMDet: An Empirical Study of Designing Real-Time Object Detectors

Practical real-time detection system study across architecture, label assignment, augmentation, optimization, and task extensions.

Code

2023

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection

CVPR Highlight work on stabilizing pseudo supervision in semi-supervised object detection.

Code
