---
library_name: transformers
pipeline_tag: text-generation
tags:
- alignment
- evaluation
- preference-learning
- ripd
base_model: sirev/Gemma-2b-Uncensored-v1
datasets:
- ZDCSlab/ripd-dataset
---

# ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-seed-bt

This checkpoint is part of the artifact release for **"Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges."** It is a policy model trained under a specific rubric condition to study how evaluation-time preference drift propagates into downstream alignment.

---

## Configuration

- **Setting:** anthropic-saferlhf
- **Base model:** [Gemma-2b-Uncensored-v1](https://huggingface.co/sirev/Gemma-2b-Uncensored-v1)
- **Label condition:** seed
- **Training data:** Bench + Target (mixed)
- **Objective:** Direct Preference Optimization (DPO)

The `seed` condition corresponds to preference labels generated by an LLM judge under the `seed` rubric variant.

---

## Intended Use

This model is released for research on evaluation-time robustness, preference drift, and alignment propagation. It is not intended for production deployment.

---

## Resources

- 📄 Paper: https://www.arxiv.org/pdf/2602.13576
- 💻 Code & Evaluation Pipeline: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface
- 📊 Dataset: https://huggingface.co/datasets/ZDCSlab/ripd-dataset
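
---

## Example Usage

The card declares `library_name: transformers` and `pipeline_tag: text-generation`, so the checkpoint should load with the standard `transformers` causal-LM API. The sketch below is illustrative only; the prompt and decoding settings are assumptions, not values from the paper or release.

```python
# Minimal loading/generation sketch using the standard transformers API.
# Prompt and decoding parameters are illustrative, not from the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-seed-bt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain what preference drift in LLM judges means."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```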