---
library_name: transformers
pipeline_tag: text-generation
tags:
- alignment
- evaluation
- preference-learning
- ripd
base_model: sirev/Gemma-2b-Uncensored-v1
datasets:
- ZDCSlab/ripd-dataset
---

# ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-seed-bt

This checkpoint is part of the artifact release for **"Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges."** It is a policy model trained under a specific rubric condition to study how evaluation-time preference drift propagates into downstream alignment.

---

## Configuration

- **Setting:** anthropic-saferlhf
- **Base model:** [Gemma-2b-Uncensored-v1](https://huggingface.co/sirev/Gemma-2b-Uncensored-v1)
- **Label condition:** seed
- **Training data:** Bench + Target (mixed)
- **Objective:** Direct Preference Optimization (DPO)

The `seed` condition corresponds to preference labels generated by an LLM judge under the `seed` rubric variant.

---

## Intended Use

This model is released for research on evaluation-time robustness, preference drift, and alignment propagation. It is not intended for production deployment.

---

## Resources

- 📄 Paper: https://www.arxiv.org/pdf/2602.13576
- 💻 Code & Evaluation Pipeline: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface
- 📊 Dataset: https://huggingface.co/datasets/ZDCSlab/ripd-dataset
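
---

## Example Usage

The card declares `library_name: transformers` and `pipeline_tag: text-generation`, so the checkpoint should load with the standard `transformers` causal-LM API. The sketch below is illustrative only; the prompt and decoding settings are assumptions, not values from the paper or release.

```python
# Minimal loading/generation sketch using the standard transformers API.
# Prompt and decoding parameters are illustrative, not from the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-seed-bt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain what preference drift in LLM judges means."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```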