The University of Hong Kong · Adobe

Both Semantics and Reconstruction Matter:
Making Representation Encoders Ready for Text-to-Image Generation and Editing

🔥🔥🔥 Why PS-VAE? 🔥🔥🔥
🚫 Raw representation features are not generative-friendly

Unconstrained pretrained features make diffusion models prone to off-manifold samples, causing severe artifacts.

🧩 Reconstruction matters more than you think

Weak reconstruction limits accurate structure and texture learning, degrading both generation and editing quality.

🚀 PS-VAE fixes both — and wins

A compact 96-channel semantic latent with SOTA pixel reconstruction, enabling SOTA text-to-image generation and image editing.

🔭 Unified encoder for understanding + generation

PS-VAE bridges discriminative representations and generative latents, showing strong potential as a unified vision backbone for models like Bagel.

Construct S-VAE by learning a compact, KL-regularized 96-channel semantic latent on top of a frozen encoder; construct PS-VAE from S-VAE by unfreezing the encoder and adding pixel reconstruction, while the semantic-preservation loss \(L_s\) keeps the latent semantically consistent.
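Below is a minimal PyTorch-style sketch of these two stages. The module names (repr_encoder, latent_head, semantic_decoder, pixel_decoder), the loss weights, and the L1/MSE choices are illustrative assumptions rather than the released implementation; a production VAE would typically also add perceptual and adversarial terms, which are omitted here.

```python
import torch
import torch.nn.functional as F


def kl_loss(mu, logvar):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    return 0.5 * torch.mean(mu.pow(2) + logvar.exp() - logvar - 1.0)


def svae_loss(image, repr_encoder, latent_head, semantic_decoder, kl_weight=1e-4):
    """Stage 1 (S-VAE): frozen representation encoder; learn a compact,
    KL-regularized 96-channel latent that reproduces the encoder features."""
    with torch.no_grad():                         # encoder stays frozen
        feats = repr_encoder(image)               # raw representation features
    mu, logvar = latent_head(feats)               # 96-channel latent statistics
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    feats_hat = semantic_decoder(z)               # reconstruct the encoder features
    return F.mse_loss(feats_hat, feats) + kl_weight * kl_loss(mu, logvar)


def psvae_loss(image, repr_encoder, frozen_encoder, latent_head,
               semantic_decoder, pixel_decoder, kl_weight=1e-4):
    """Stage 2 (PS-VAE): unfreeze the encoder, add pixel reconstruction,
    and keep a semantic-preservation term L_s against a frozen encoder copy."""
    feats = repr_encoder(image)                   # encoder is now trainable
    mu, logvar = latent_head(feats)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    l_pix = F.l1_loss(pixel_decoder(z), image)    # fine-grained detail path

    with torch.no_grad():
        feats_ref = frozen_encoder(image)         # target for the semantic term
    l_s = F.mse_loss(semantic_decoder(z), feats_ref)

    return l_pix + l_s + kl_weight * kl_loss(mu, logvar)
```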

Abstract

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks generative regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder’s inherent weak pixel-level reconstruction prevents accurate preservation of fine-grained geometry and texture.

We introduce a semantic–pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with \(16\times16\) spatial downsampling). Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, our approach achieves state-of-the-art reconstruction, faster convergence, and substantial gains in both T2I and editing tasks—demonstrating that representation encoders can be systematically adapted into robust generative components.
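For concreteness, a tiny sketch of the shape arithmetic behind this latent (assuming a 256×256 input, which matches the training resolution reported below; the diffusion model operates on this compact tensor instead of raw pixels):

```python
# Shape arithmetic for the latent described above: a 256x256 RGB image maps
# to a 96-channel latent at 16x spatial downsampling, i.e. 96 x 16 x 16.
H = W = 256
downsample, channels = 16, 96

latent_shape = (channels, H // downsample, W // downsample)       # (96, 16, 16)
pixel_values = 3 * H * W                                          # 196,608 values
latent_values = channels * (H // downsample) * (W // downsample)  # 24,576 values

print(latent_shape, f"~{pixel_values / latent_values:.0f}x fewer values")  # ~8x
```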

Method

Visualization comparison between RAE and VAE
😌 Not surprising: (a) Pretrained representation encoders are optimized for discriminative features, not information preservation—so poor reconstruction is expected.
🤔 Also expected: (b) Strong semantics improve prompt-following in editing tasks that require understanding before generation, but weak reconstruction breaks fine-grained consistency.
😮 Surprising: (c) In text-to-image generation, RAE shows severe structural and texture artifacts, performing far worse than VAE—a gap too large to be explained by reconstruction quality alone.

Analysis 🔎: why RAE struggles

The severe artifacts in RAE far exceed what its reconstruction error alone can explain, indicating off-manifold generation in the high-dimensional feature space. Modeling an \(l\)-dimensional data manifold embedded in an \(h\)-dimensional space (\(h \gg l\)) and locally spanned by an orthonormal basis \(Q \in \mathbb{R}^{h \times l}\), the optimal denoising target decomposes as:

\[ v_{x}(x_t)=Qv_{z}(Q^\top x_t)+\tfrac{1}{t}(I-QQ^\top)x_t . \]

The first term is the true low-dimensional signal, but the model must learn it as a degenerate function over an \(h\)-dimensional input—causing severe statistical inefficiency. The second term is pure Gaussian noise in the orthogonal subspace, forcing an identity-like fit that wastes capacity. Together, they encourage off-manifold drift and lead to structural and texture artifacts.
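To see where the decomposition comes from, here is a short derivation sketch under the rectified-flow convention \(x_t = (1-t)x_0 + t\varepsilon\), \(v = \varepsilon - x_0\), with \(x_0 = Qz_0\) and \(Q^\top Q = I\) (this convention is assumed for illustration; an analogous argument holds for other noise schedules):

\[ (I - QQ^\top)x_t = t\,(I - QQ^\top)\varepsilon \;\Longrightarrow\; (I - QQ^\top)\,\mathbb{E}[v \mid x_t] = \tfrac{1}{t}\,(I - QQ^\top)x_t , \]

\[ QQ^\top\,\mathbb{E}[v \mid x_t] = Q\,\mathbb{E}\!\left[ Q^\top(\varepsilon - x_0) \,\middle|\, Q^\top x_t \right] =: Q\,v_z(Q^\top x_t) , \]

where the second identity holds because the in-subspace and orthogonal components of \(x_t\) are independent, so the posterior over \(Q^\top(\varepsilon - x_0)\) depends only on \(Q^\top x_t\).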

Off-manifold behavior increases with feature dimensionality. We embed a 2D PS-shaped distribution into an 8D ambient space and compare generation in the intrinsic 2D space versus the 8D space. The 8D setting produces substantially more off-manifold samples, and its tail samples deviate much farther from the data manifold, as measured by the mean nearest-neighbor distance of the 5% farthest samples—indicating significantly stronger off-manifold drift in higher-dimensional feature spaces.
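A minimal numpy/scipy sketch of this drift metric under the stated setup (the random orthonormal embedding, sample counts, and placeholder generators below are illustrative assumptions; the reported numbers come from the actual trained 2D and 8D diffusion models, not from this synthetic noise):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Toy setup: a 2-D data distribution embedded isometrically into an 8-D
# ambient space via a random orthonormal map Q (a stand-in for the
# "PS"-shaped set used in the figure).
data_2d = rng.normal(size=(2000, 2))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Q = Q[:, :2]                                   # 8x2 orthonormal embedding
data_8d = data_2d @ Q.T                        # points lying exactly on the manifold


def tail_drift(generated, data, tail_frac=0.05):
    """Mean nearest-neighbor distance of the `tail_frac` most off-manifold
    generated samples, measured against the data set."""
    nn_dist, _ = cKDTree(data).query(generated)    # distance to closest data point
    k = max(1, int(len(nn_dist) * tail_frac))
    return float(np.sort(nn_dist)[-k:].mean())     # average over the worst 5%


# Placeholder samples standing in for generators trained in the intrinsic
# 2-D space vs. the 8-D ambient space.
gen_2d = (data_2d + 0.05 * rng.normal(size=data_2d.shape)) @ Q.T
gen_8d = data_8d + 0.05 * rng.normal(size=data_8d.shape)

print("drift (2-D model):", tail_drift(gen_2d, data_8d))
print("drift (8-D model):", tail_drift(gen_8d, data_8d))
```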
Evolution: RAE → S-VAE → PS-VAE

Mapping raw representation features to a compact 96-channel semantic latent space fixes off-manifold drift (RAE → S-VAE); adding pixel reconstruction under the semantic-preservation loss enriches this latent space with fine-grained detail and boosts overall performance (S-VAE → PS-VAE).

| Method | rFID ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | GenEval ↑ | DPG ↑ | EditReward ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAE | 0.619 | 19.20 | 0.254 | 0.436 | 71.3 | 81.7 | 0.06 |
| S-VAE | 1.407 | 17.78 | 0.296 | 0.390 | 73.7 | 83.6 | 0.12 |
| PS-VAE (ours) | 0.203 | 28.79 | 0.085 | 0.817 | 76.6 | 83.6 | 0.22 |

Teaser figure (PNG)
RAE to PS-VAE evolution figure (PNG)

Results

Main comparison

PS-VAE improves both reconstruction and generation/editing over semantic-only RAE and pixel-only VAEs.

| Method | rFID ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | GenEval ↑ | DPG ↑ | EditReward ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Flux-VAE (stride 8) | 0.175 | 32.86 | 0.044 | 0.912 | 68.04 | 78.98 | -0.271 |
| MAR-VAE | 0.534 | 26.18 | 0.135 | 0.715 | 75.74 | 83.19 | 0.056 |
| VAVAE | 0.279 | 27.71 | 0.097 | 0.779 | 76.16 | 82.45 | 0.227 |
| RAE | 0.619 | 19.20 | 0.254 | 0.436 | 71.27 | 81.72 | 0.059 |
| PS-VAE (32c) | 0.584 | 24.53 | 0.168 | 0.662 | 76.22 | 84.25 | 0.274 |
| PS-VAE (96c) | 0.203 | 28.79 | 0.085 | 0.817 | 76.56 | 83.62 | 0.222 |

Coverage curves
Coverage curves (GenEval / DPG / EditReward): strong semantics make PS-VAE converge faster, while better reconstruction pushes its final scores higher.
Scaling behavior
Scaling Up with High-Quality Data using PS-VAE (96c)
Despite training only at 256×256 resolution, the semantically structured and detail-preserving latent space enables accurate prompt following, producing images with correct structure, fine-grained textures, precise text rendering, realistic portraits, and flexible compositions. Prompts are simplified for visualization.
Scaling (653M dashed, 1.7B solid; 32c vs 96c): higher-channel latents scale better on GenEval, DPG-Bench, and EditReward.
Transfer across Encoders (SigLIP2): Toward a Unified Encoder
| Method (96c) | rFID ↓ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | GenEval ↑ | DPG ↑ | EditReward ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DINOv2-B | 0.203 | 28.79 | 0.085 | 0.817 | 76.56 | 83.62 | 0.222 |
| SigLIP2-so400m/14 | 0.222 | 28.14 | 0.096 | 0.795 | 77.14 | 83.33 | 0.183 |

Generation
  • ⚖️ Comparable reconstruction: DINOv2 (96c) and SigLIP2 (96c) achieve nearly identical rFID, PSNR, LPIPS, and SSIM.
  • 🚀 Strong generation across metrics: SigLIP2 (96c) slightly improves GenEval, while DINOv2 (96c) is slightly better on DPG-Bench and EditReward.
  • 🔁 Encoder-agnostic behavior: Overall performance is highly comparable, indicating strong transferability across pretrained encoders.
Understanding
  • 📉 Minimal degradation: The fine-tuned encoder shows only negligible drops on MME-P (from 1685 to 1652) and VBench (from 85 to 84.7).
  • 🧩 Semantics preserved: Pixel-decoder fine-tuning maintains strong understanding without collapsing semantic representations.
  • 🔭 Toward a unified encoder: Even without LLM fine-tuning, SigLIP2 + PS-VAE shows strong potential as a unified encoder for understanding and generation.

BibTeX

🥺🥺🥺 Please cite this work if you find it helpful.

@misc{zhang2025psvae,
  title     = {Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing},
  author    = {Zhang, Shilong},
  year      = {2025},
  note      = {},
}