ZDCSlab/ripd-ultra-real-llama3-8b-instruct-biased-bt

This checkpoint is part of the artifact release for
“Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges.”

It is a policy model trained under a specific rubric condition to study how evaluation-time preference drift propagates into downstream alignment.


Configuration

  • Setting: ultra-real
  • Base model: LLaMA-3-8B-Instruct
  • Label condition: biased
  • Training data: Bench + Target (mixed)
  • Objective: Direct Preference Optimization (DPO)

The biased condition corresponds to preference labels generated by an LLM judge under the biased rubric variant.
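Since the checkpoint was trained with Direct Preference Optimization on these judge-generated labels, a minimal sketch of the per-example DPO loss may help clarify the objective. This is an illustrative stand-alone implementation, not the training code used for this release; the log-probability values and the `beta` default below are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss:
    -log sigmoid(beta * ((chosen margin) - (rejected margin))),
    where each margin is the policy log-prob minus the reference log-prob.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(x) == log(1 + exp(-x)); log1p keeps small values accurate
    # (this simple form can overflow for very negative logits).
    return math.log1p(math.exp(-logits))

# At zero margin the loss equals log(2); when the policy prefers the chosen
# response more strongly than the reference does, the loss drops below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

Under the biased label condition, the "chosen" side of each pair is whichever response the biased-rubric judge preferred, so any systematic drift in the judge's preferences is optimized directly into the policy.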


Intended Use

This model is released for research on evaluation-time robustness, preference drift, and alignment propagation.
It is not intended for production deployment.

