Human Image Personalization with High-fidelity Identity Preservation

1The University of Hong Kong   2Alibaba Group   3Ant Group

Human Image Personalization Results

Diverse human image personalization results produced by our proposed FlashFace, which offers two key features:
(1) preserving the identity of reference faces in great detail (e.g., tattoos, scars, or even the rare face shapes of virtual characters);
(2) accurately following instructions, especially when the text prompts contradict the reference images (e.g., customizing an adult into a ``child'' or an ``elder'').

[Figure: diverse human image personalization results]

Change the age or gender

[Figure: age and gender editing results]

Turn virtual characters into real people

[Figure: virtual characters turned into real people]

Turn real people into artworks

[Figure: real people rendered as artworks]

Identity Mixing

[Figure: identity mixing results]

Face Swapping Under Language Control

[Figure: face swapping results under language control]

Abstract

This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Our approach is distinguished from existing human photo customization methods by higher-fidelity identity preservation and better instruction following, benefiting from two subtle designs. First, we encode the face identity into a series of feature maps instead of one image token as in prior art, allowing the model to retain more details of the reference faces (e.g., scars, tattoos, and face shape). Second, we introduce a disentangled integration strategy to balance the text and image guidance during the text-to-image generation process, alleviating the conflict between the reference faces and the text prompts (e.g., personalizing an adult into a ``child'' or an ``elder''). Extensive experimental results demonstrate the effectiveness of our method on various applications, including human image personalization, face swapping under language prompts, turning virtual characters into real people, etc.
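One way to picture the disentangled integration of text and image guidance is as two separate classifier-free-guidance deltas, each with its own scale. The function name, the guidance scales, and the exact combination below are illustrative assumptions, not the paper's released formulation:

```python
import numpy as np

def disentangled_guidance(eps_uncond, eps_ref, eps_ref_text,
                          ref_scale=1.2, text_scale=7.5):
    """Combine reference and text guidance as two independent deltas.

    eps_uncond:   noise prediction with neither reference faces nor text
    eps_ref:      noise prediction conditioned on reference faces only
    eps_ref_text: noise prediction conditioned on both faces and text

    Keeping the two deltas separate lets each scale be tuned on its own,
    which is one way to trade identity fidelity against prompt following.
    (Sketch only; FlashFace's exact formulation may differ.)
    """
    return (eps_uncond
            + ref_scale * (eps_ref - eps_uncond)
            + text_scale * (eps_ref_text - eps_ref))

# Toy check: with both scales at 1 the result collapses to eps_ref_text.
e0 = np.zeros((4, 8, 8))
e1 = np.ones((4, 8, 8))
e2 = 2 * np.ones((4, 8, 8))
out = disentangled_guidance(e0, e1, e2, ref_scale=1.0, text_scale=1.0)
assert np.allclose(out, e2)
```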

Pipeline

The overall pipeline of FlashFace. During training, we randomly select B ID clusters and choose N+1 images from each cluster. We crop the face region from N images as references and leave one as the target image, which is used to calculate the loss. The input latent of the Face ReferenceNet has shape (B*N) x 4 x h x w. We store the reference face features after the self-attention layers within the middle blocks and decoder blocks. A face position mask is concatenated to the target latent to indicate the position of the generated face. As the target latent passes through the corresponding positions in the U-Net, we incorporate the reference features via an additional reference attention layer. During inference, users can obtain the desired image by providing a face position (optional), reference images of the person, and a description of the desired image.

[Figure: FlashFace pipeline]
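The extra reference attention step described above can be sketched as a single-head cross-attention in which the target-latent features act as queries while keys and values come from the stored ReferenceNet features. All shapes, names, and the residual form below are assumptions for illustration, not the released implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(target_feats, ref_feats, Wq, Wk, Wv):
    """Attend from target features to stored reference-face features.

    target_feats: (L_t, d) flattened U-Net features of the target latent
    ref_feats:    (L_r, d) stored ReferenceNet features (N refs concatenated)
    Wq, Wk, Wv:   (d, d) projection matrices (hypothetical single head)
    """
    q = target_feats @ Wq                            # (L_t, d)
    k = ref_feats @ Wk                               # (L_r, d)
    v = ref_feats @ Wv                               # (L_r, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (L_t, L_r)
    return target_feats + attn @ v                   # residual add

rng = np.random.default_rng(0)
d = 16
tgt = rng.standard_normal((64, d))        # an 8x8 target feature map, flattened
refs = rng.standard_normal((4 * 64, d))   # features from N=4 reference faces
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
out = reference_attention(tgt, refs, *W)
assert out.shape == (64, d)
```

Because keys and values come only from the reference features, this layer injects identity details without altering how the text prompt is attended to elsewhere in the U-Net.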

BibTeX

@misc{zhang2024flashface,
      title={FlashFace: Human Image Personalization with High-fidelity Identity Preservation},
      author={Shilong Zhang and Lianghua Huang and Xi Chen and Yifei Zhang and Zhi-Fan Wu and Yutong Feng and Wei Wang and Yujun Shen and Yu Liu and Ping Luo},
      year={2024},
      eprint={2403.17008},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}