ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-biased-bt



Czardas committed 13 days ago

Commit f75ac19 (verified) · 1 parent: d6f0630

Add model card

Files changed (1)

README.md +46 -0

README.md ADDED

	@@ -0,0 +1,46 @@

+---
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- alignment
+- evaluation
+- preference-learning
+- ripd
+base_model: sirev/Gemma-2b-Uncensored-v1
+datasets:
+- ZDCSlab/ripd-dataset
+---
+# ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-biased-bt
+This checkpoint is part of the artifact release for
+**“Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges.”**
+It is a policy model trained on preference labels produced under the `biased` rubric condition, released to study how evaluation-time preference drift propagates into downstream alignment.
+
+---
+## Configuration
+- **Setting:** anthropic-saferlhf
+- **Base model:** [Gemma-2b-Uncensored-v1](https://huggingface.co/sirev/Gemma-2b-Uncensored-v1)
+- **Label condition:** biased
+- **Training data:** Bench + Target (mixed)
+- **Objective:** Direct Preference Optimization (DPO)
+
+The `biased` condition corresponds to preference labels generated by an LLM judge under the `biased` rubric variant.
+
+---
+## Intended Use
+This model is released for research on evaluation-time robustness, preference drift, and alignment propagation.
+It is not intended for production deployment.
+
+---
+## Resources
+- 📄 Paper: https://www.arxiv.org/pdf/2602.13576
+- 💻 Code & Evaluation Pipeline: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface
+- 📊 Dataset: https://huggingface.co/datasets/ZDCSlab/ripd-dataset
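
The Configuration section of the card names Direct Preference Optimization (DPO) as the training objective, with chosen/rejected labels coming from the judge under the `biased` rubric. As a rough illustration of what that objective computes for a single preference pair, here is a minimal sketch of the standard DPO loss. The function name, the `beta` default, and the log-probability inputs are assumptions for illustration only; the card does not state the hyperparameters, and the training code in the linked repository may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    `beta` is a hypothetical value, not taken from the model card.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favors the "chosen" response
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the loss only rewards agreeing with whichever response was labeled "chosen", any systematic shift in the judge's labels is optimized into the policy directly, which is the propagation effect this checkpoint is meant to help study.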