Latent Adversarial Regularization for Offline Preference Optimization
Abstract
GANPO uses latent-space regularization through adversarial divergence minimization to improve language model preference optimization, offering more robust structural feedback than token-level methods.
Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise, while maintaining comparable downstream performance at only minor computational overhead.
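To make the adversarial latent-space regularizer more concrete, below is a minimal PyTorch sketch under our own assumptions: a small MLP discriminator is trained to distinguish pooled reference hidden states from policy hidden states, and the policy receives a non-saturating GAN term that pushes its latents toward the reference distribution. The names (`LatentDiscriminator`, `ganpo_latent_losses`), the pooling choice, and the binary cross-entropy formulation are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentDiscriminator(nn.Module):
    """Small MLP that tries to tell reference hidden states from policy ones."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # One logit per example: higher = "looks like a reference latent".
        return self.net(h).squeeze(-1)


def ganpo_latent_losses(disc, h_policy, h_ref):
    """GAN-style divergence between policy and reference latent distributions.

    h_policy, h_ref: pooled hidden states of shape (batch, hidden_dim).
    Returns (discriminator_loss, policy_regularizer).
    """
    # Discriminator objective: label reference latents 1 and policy latents 0.
    d_ref = disc(h_ref.detach())
    d_pol = disc(h_policy.detach())
    disc_loss = (
        F.binary_cross_entropy_with_logits(d_ref, torch.ones_like(d_ref))
        + F.binary_cross_entropy_with_logits(d_pol, torch.zeros_like(d_pol))
    )

    # Policy ("generator") term: non-saturating objective that pushes policy
    # latents toward being indistinguishable from reference latents.
    d_pol_live = disc(h_policy)
    policy_reg = F.binary_cross_entropy_with_logits(
        d_pol_live, torch.ones_like(d_pol_live)
    )
    return disc_loss, policy_reg
```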
Community
Most offline preference optimization methods (e.g., DPO) constrain policy updates using token-level divergences. However, token-space similarity is often a weak proxy for semantic or structural behavior. We propose GANPO, a plug-and-play regularizer that introduces latent-space adversarial regularization, aligning the latent representation distributions of a policy and a reference model via a principled GAN-style divergence.
- Consistent performance improvements: GANPO yields gains across model architectures when integrated into offline preference optimization (OPO)-style methods, evaluated on AlpacaEval.
- Structure preservation: the adversarial objective acts as a geometry-preserving regularizer. Unlike DPO, which often degrades at high sampling temperatures (T ≥ 1.0), GANPO maintains structural coherence in high-entropy settings.
If you’re interested in alignment, GANs, or the limitations of KL-divergence–based regularization, feel free to check out the paper.
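As a rough illustration of the plug-and-play integration, the sketch below combines a standard DPO loss with the latent adversarial term from the earlier snippet (reusing `ganpo_latent_losses` and `LatentDiscriminator`). The weighting coefficient `lambda_ganpo`, the mean-pooling of the last hidden layer, and the alternating discriminator update are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on summed sequence log-probabilities."""
    logits = beta * ((pi_chosen_logps - pi_rejected_logps)
                     - (ref_chosen_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()


def ganpo_regularized_loss(logps, h_policy, h_ref, disc,
                           beta=0.1, lambda_ganpo=0.1):
    """Total policy loss: DPO preference term + latent adversarial regularizer.

    logps: dict with policy/reference log-probs of chosen and rejected responses.
    h_policy, h_ref: pooled hidden states from the policy and (frozen) reference,
    e.g. policy(...).hidden_states[-1].mean(dim=1).
    """
    pref = dpo_loss(logps["pi_chosen"], logps["pi_rejected"],
                    logps["ref_chosen"], logps["ref_rejected"], beta)
    _, latent_reg = ganpo_latent_losses(disc, h_policy, h_ref)
    return pref + lambda_ganpo * latent_reg


# In training, the discriminator is updated on its own loss in alternation:
#   disc_loss, _ = ganpo_latent_losses(disc, h_policy, h_ref)
#   disc_loss.backward(); disc_optimizer.step()
# while the policy minimizes `ganpo_regularized_loss`.
```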
The following similar papers were recommended by the Semantic Scholar API:
- SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment (2025)
- From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models (2026)
- Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection (2025)
- Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision (2025)
- Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models (2026)
- RM-Distiller: Exploiting Generative LLM for Reward Model Distillation (2026)
- Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering (2026)