Czardas committed on
Commit f75ac19 · verified · 1 Parent(s): d6f0630

Add model card

Files changed (1)
  1. README.md +46 -0
README.md ADDED
@@ -0,0 +1,46 @@
+ ---
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - alignment
+ - evaluation
+ - preference-learning
+ - ripd
+ base_model: sirev/Gemma-2b-Uncensored-v1
+ datasets:
+ - ZDCSlab/ripd-dataset
+ ---
+
+ # ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-biased-bt
+
+ This checkpoint is part of the artifact release for
+ **“Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges.”**
+
+ It is a policy model trained under a specific rubric condition to study how evaluation-time preference drift propagates into downstream alignment.
+
+ ---
+
+ ## Configuration
+
+ - **Setting:** anthropic-saferlhf
+ - **Base model:** [Gemma-2b-Uncensored-v1](https://huggingface.co/sirev/Gemma-2b-Uncensored-v1)
+ - **Label condition:** biased
+ - **Training data:** Bench + Target (mixed)
+ - **Objective:** Direct Preference Optimization (DPO)
+
+ The `biased` condition corresponds to preference labels generated by an LLM judge under the `biased` rubric variant.
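To make the link between the judge's labels and the DPO objective listed in the configuration concrete, the sketch below computes the standard per-pair DPO loss in plain Python. It is not taken from the release's training code: the function name `dpo_loss`, the value `beta=0.1` as a default, and all log-probabilities are illustrative assumptions. It only shows that swapping which response the judge marks as preferred reverses the direction in which the loss pushes the policy.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    Each argument is the summed log-probability of a full response.
    beta=0.1 is an assumed default, not a value stated in the model card.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities for two responses, A and B.
policy_a, policy_b = -1.0, -2.0
ref_a, ref_b = -1.5, -1.5

# Judge labels A as preferred.
loss_a_preferred = dpo_loss(policy_a, policy_b, ref_a, ref_b, beta=1.0)  # about 0.313
# Judge labels B as preferred (same responses, label swapped).
loss_b_preferred = dpo_loss(policy_b, policy_a, ref_b, ref_a, beta=1.0)  # about 1.313
```

With identical log-probabilities, the pair labeled in the opposite order yields the larger loss, so training moves the policy toward whichever response the judge preferred. Under these assumptions, a rubric that shifts the judge's labels shifts the trained policy with them.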
+
+ ---
+
+ ## Intended Use
+
+ This model is released for research on evaluation-time robustness, preference drift, and alignment propagation.
+ It is not intended for production deployment.
+
+ ---
+
+ ## Resources
+
+ - 📄 Paper: https://www.arxiv.org/pdf/2602.13576
+ - 💻 Code & Evaluation Pipeline: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface
+ - 📊 Dataset: https://huggingface.co/datasets/ZDCSlab/ripd-dataset
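The card's metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, so loading the checkpoint for research use could look roughly like the sketch below. This is an assumption rather than documented usage: it presumes the repository id matches the card title, that tokenizer files are published alongside the weights, and that the generic `Auto*` classes resolve the architecture. The helper names `load_policy` and `generate` are hypothetical.

```python
# Repository id taken from the model card title (assumed to be the Hub id).
REPO_ID = "ZDCSlab/ripd-anthropic-saferlhf-gemma-2b-uncensored-v1-biased-bt"


def load_policy(repo_id: str = REPO_ID):
    """Download the checkpoint and return (tokenizer, model)."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation for `prompt` with the loaded policy model."""
    tokenizer, model = load_policy()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("...")` downloads the weights on first use. Since the card states the model is not intended for production deployment, outputs are best treated as experimental data for studying preference drift.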