Both Semantics and Reconstruction Matter:
Making Representation Encoders Ready for Text-to-Image Generation and Editing
Unconstrained pretrained features make diffusion models prone to off-manifold samples, causing severe artifacts.
Weak reconstruction limits accurate structure and texture learning, degrading both generation and editing quality.
A compact 96-channel semantic latent with SOTA pixel reconstruction, enabling SOTA text-to-image generation and image editing.
PS-VAE bridges discriminative representations and generative latents, showing strong potential as a unified vision backbone for models like Bagel.
Abstract
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks generative regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder’s inherent weak pixel-level reconstruction prevents accurate preservation of fine-grained geometry and texture.
We introduce a semantic–pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with \(16\times16\) spatial downsampling). Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, our approach achieves state-of-the-art reconstruction, faster convergence, and substantial gains in both T2I and editing tasks—demonstrating that representation encoders can be systematically adapted into robust generative components.
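For concreteness, the compression this configuration implies at a 256×256 input (the resolution is an illustrative assumption, not specified above):

```python
# Latent-shape arithmetic for the stated configuration: 96 channels with
# 16x spatial downsampling. The 256x256 input is our assumed example.
H = W = 256
latent_shape = (96, H // 16, W // 16)       # -> (96, 16, 16)
pixel_values = 3 * H * W                    # 196,608 values per image
latent_values = 96 * (H // 16) * (W // 16)  # 24,576 values -> 8x fewer than pixels
```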
Method
🤔 Also expected: (b) Strong semantics improve prompt-following in editing tasks that require understanding before generation, but weak reconstruction breaks fine-grained consistency.
😮 Surprisingly: (c) In text-to-image generation, RAE shows severe structural and texture artifacts, performing far worse than VAE—a gap too large to be explained by reconstruction quality alone.
The severe artifacts in RAE far exceed what its reconstruction error alone can explain, indicating off-manifold generation in the high-dimensional feature space. For data on an \(l\)-dimensional manifold embedded in an \(h\)-dimensional ambient space (\(h \gg l\)), the optimal denoising objective decomposes (sketched here in a minimal noise-prediction form) as:
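\[
\epsilon^{*}(x_t, t) \;=\; \mathbb{E}\!\left[\epsilon \mid x_t\right] \;=\; \underbrace{\mathbb{E}\!\left[\epsilon_{\parallel} \mid x_t\right]}_{\text{low-dimensional signal}} \;+\; \underbrace{x_t^{\perp} / \sigma_t}_{\text{orthogonal noise}},
\]

where \(x_t = \alpha_t x_0 + \sigma_t \epsilon\), the noise splits as \(\epsilon = \epsilon_{\parallel} + \epsilon_{\perp}\) along the manifold's local tangent subspace and its orthogonal complement, and \(x_t^{\perp}\) is the orthogonal component of \(x_t\) (so \(x_t^{\perp}/\sigma_t = \epsilon_{\perp}\) exactly when the manifold is locally linear; exact notation may differ from the paper's).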
The first term is the true low-dimensional signal, but the model must learn it as a degenerate function over an \(h\)-dimensional input—causing severe statistical inefficiency. The second term is pure Gaussian noise in the orthogonal subspace, forcing an identity-like fit that wastes capacity. Together, they encourage off-manifold drift and lead to structural and texture artifacts.
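A quick numerical check of the orthogonal term (a self-contained sketch, not the paper's code; the dimensions and schedule point are arbitrary choices):

```python
# For data on an l-dim linear subspace of R^h, the orthogonal component of
# x_t is exactly sigma_t * eps_perp, so the optimal denoiser must reproduce
# it as an identity-like map, spending capacity on pure noise.
import torch

h, l, n = 768, 32, 4096
basis, _ = torch.linalg.qr(torch.randn(h, l))   # orthonormal basis of the "manifold"
x0 = torch.randn(n, l) @ basis.T                # on-manifold samples in R^h
eps = torch.randn(n, h)
alpha_t, sigma_t = 0.7, (1 - 0.7**2) ** 0.5     # one point on a VP schedule
xt = alpha_t * x0 + sigma_t * eps

P_perp = torch.eye(h) - basis @ basis.T         # projector onto the orthogonal subspace
err = (xt @ P_perp / sigma_t - eps @ P_perp).abs().max()
print(f"max |x_t_perp / sigma_t - eps_perp| = {err:.2e}")  # ~1e-6: numerically zero
```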
Mapping raw representation features into a compact 96-channel semantic latent space fixes off-manifold drift (RAE → S-VAE); adding pixel reconstruction under a semantic-preservation loss restores fine-grained detail in that latent space and boosts overall performance (S-VAE → PS-VAE). A sketch of the combined objective follows the table.
| Method | rFID ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | GenEval ↑ | DPG ↑ | EditReward ↑ |
|---|---|---|---|---|---|---|---|
| RAE | 0.619 | 19.20 | 0.254 | 0.436 | 71.3 | 81.7 | 0.06 |
| S-VAE | 1.407 | 17.78 | 0.296 | 0.390 | 73.7 | 83.6 | 0.12 |
| PS-VAE (ours) | 0.203 | 28.79 | 0.085 | 0.817 | 76.6 | 83.6 | 0.22 |
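A minimal PyTorch sketch of the combined semantic–pixel objective described above. The module interfaces, the cosine-style semantic-preservation term, and the loss weights are our assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

def ps_vae_loss(x, frozen_encoder, latent_head, decoder, sem_decoder,
                beta_kl=1e-4, lam_sem=1.0, lam_pix=1.0):
    """x: input images, shape (B, 3, H, W). Module names are hypothetical."""
    feats = frozen_encoder(x)                        # high-dim discriminative features
    mu, logvar = latent_head(feats).chunk(2, dim=1)  # compact 96-ch latent, H/16 x W/16
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    # (1) Pixel reconstruction: preserve fine-grained geometry and texture.
    x_hat = decoder(z)
    loss_pix = F.l1_loss(x_hat, x)

    # (2) Semantic preservation: the latent must still predict the frozen
    #     encoder's features (cosine distance is one possible choice).
    feats_hat = sem_decoder(z)
    loss_sem = 1 - F.cosine_similarity(feats_hat.flatten(1),
                                       feats.flatten(1), dim=1).mean()

    # (3) KL regularization toward a generative-friendly latent space.
    loss_kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()

    return lam_pix * loss_pix + lam_sem * loss_sem + beta_kl * loss_kl
```

Dropping the pixel term recovers an S-VAE-style semantic-only latent; the table above shows what adding it back contributes.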
Results
PS-VAE improves both reconstruction and generation/editing over semantic-only RAE and pixel-only VAEs.
| Method | rFID ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | GenEval ↑ | DPG ↑ | EditReward ↑ |
|---|---|---|---|---|---|---|---|
| Flux-VAE (stride 8) | 0.175 | 32.86 | 0.044 | 0.912 | 68.04 | 78.98 | -0.271 |
| MAR-VAE | 0.534 | 26.18 | 0.135 | 0.715 | 75.74 | 83.19 | 0.056 |
| VAVAE | 0.279 | 27.71 | 0.097 | 0.779 | 76.16 | 82.45 | 0.227 |
| RAE | 0.619 | 19.20 | 0.254 | 0.436 | 71.27 | 81.72 | 0.059 |
| PS-VAE (32c) | 0.584 | 24.53 | 0.168 | 0.662 | 76.22 | 84.25 | 0.274 |
| PS-VAE (96c) | 0.203 | 28.79 | 0.085 | 0.817 | 76.56 | 83.62 | 0.222 |
PS-VAE also generalizes across representation encoders at 96 channels:

| Encoder (96c) | rFID ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | GenEval ↑ | DPG ↑ | EditReward ↑ |
|---|---|---|---|---|---|---|---|
| DINOv2-B | 0.203 | 28.79 | 0.085 | 0.817 | 76.56 | 83.62 | 0.222 |
| SigLIP2-so400m/14 | 0.222 | 28.14 | 0.096 | 0.795 | 77.14 | 83.33 | 0.183 |
BibTeX
@misc{zhang2025psvae,
  title  = {Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing},
  author = {Zhang, Shilong},
  year   = {2025},
}