Shilong Zhang (张士龙)
Shilong is a second-year Ph.D. student (2023-present) in the Department of Computer Science at The University of Hong Kong (HKU), advised by Prof. Ping Luo. Previously, he was a member of the team led by Kai Chen, where he served as a core developer of MMDetection and MMCV.
He received his Bachelor's degree from the University of Science and Technology of China (USTC) in 2019, where he was named an outstanding graduate (top 4%).
At present, Shilong's research is primarily focused on building efficient vision
generative models and algorithms.
He welcomes intellectual conversations about these areas of study.
Email / Google Scholar / GitHub
- [2025/2/10] We present FlashVideo, an efficient paradigm for text-to-video generation.
- [2024/3/26] We propose FlashFace, which generates images with high identity fidelity in seconds.
- [2023/7/7] We present GPT4RoI, a vision and language model for region-level image understanding.
- [2023/4/26] We present a vision and language model named MultiModal-GPT.
- [2023/3/20] Two papers were accepted by CVPR 2023. DDQ-DETR achieves 52.1 AP with an R-50 backbone within 12 epochs.
- [2022/3/15] One paper was accepted by CVPR 2022.
- [2021/11/27] We release MMFewShot, an open-source few-shot learning toolbox based on PyTorch.
- [2021/5/8] One paper was accepted by ICML 2021.
- [2020/2/24] One paper was accepted by CVPR 2020.
- [2019/6/28] Named an outstanding graduate by USTC.
[1] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Shilong Zhang*, Wenbo Li*, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
(a) Dividing the generation process into a prompt-fidelity stage and a quality-enhancement stage substantially reduces DiT's computational load.
(b) Enabling users to preview the initial output and adjust the prompt before committing to full-resolution generation significantly reduces wait times and enhances commercial viability.
Code has been released at this repo!
[2] FlashFace: Human Image Personalization with High-fidelity Identity Preservation
Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo
This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Code has been released at this repo!
[4] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
Tao Gong*, Chengqi Lyu*, Shilong Zhang*, Yudong Wang*, Miao Zheng*, Qian Zhao*, Kuikun Liu*, Wenwei Zhang*, Ping Luo, Kai Chen
(* random order)
We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. Code has been released at this repo!
[5] Dense Distinct Query for End-to-End Object Detection
Shilong Zhang*, Xinjiang Wang*, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, Kai Chen
CVPR 2023 (* Equal contribution)
DDQ-DETR achieves 52.1 AP on the MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. Code has been released at this repo!
[6] Consistent-Teacher: Towards Reducing Inconsistent Pseudo-targets in Semi-supervised Object Detection
Xinjiang Wang*, Xingyi Yang*, Shilong Zhang, Yijiang Li, Litong Feng, Shijie Fang, Chengqi Lyu, Kai Chen, Wayne Zhang
CVPR 2023 (* Equal contribution)
Consistent-Teacher achieves 40.0 mAP with a ResNet-50 backbone given only 10% of the annotated MS-COCO data, surpassing previous pseudo-label baselines by around 3 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, performance further increases to 47.2 mAP. Code has been released!
[7] Group R-CNN for Point-based Weakly Semi-supervised Object Detection
Shilong Zhang*, Zhuoran Yu*, Liyang Liu*, Xinjiang Wang, Aojun Zhou, Kai Chen
CVPR 2022 (* Equal contribution)
We study the problem of weakly semi-supervised object detection with points (WSSOD-P). Group R-CNN significantly outperforms the prior method, Point DETR, by 3.9 mAP with 5% well-labeled images. Code has been released!
[8] Group Fisher Pruning for Practical Network Compression
Liyang Liu*, Shilong Zhang*, Zhanghui Kuang, Jing-Hao Xue, Aojun Zhou
ICML 2021 (* Equal contribution)
We present a general channel pruning framework for complicated structures. Code has been released!
[9] Scale-equalizing Pyramid Convolution for Object Detection
Xinjiang Wang*, Shilong Zhang*, Zhuoran Yu, Litong Feng, Wayne Zhang
CVPR 2020 (* Equal contribution)
We propose a scale-equalizing pyramid convolution method that relaxes the discrepancy between the feature pyramid and the Gaussian pyramid. The module improves single-stage object detection by about 3.5 mAP with negligible additional inference time. Code has been released!