ZDCSlab/ripd-ultra-real-llama3-8b-instruct-biased-bt

This checkpoint is part of the artifact release for
“Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges.”

It is a policy model trained under a specific rubric condition to study how evaluation-time preference drift propagates into downstream alignment.


Configuration

  • Setting: ultra-real
  • Base model: LLaMA-3-8B-Instruct
  • Label condition: biased
  • Training data: Bench + Target (mixed)
  • Objective: Direct Preference Optimization (DPO)

The biased condition corresponds to preference labels generated by an LLM judge under the biased rubric variant.
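Since the checkpoint was trained with Direct Preference Optimization on these judge-generated labels, a minimal sketch of the per-example DPO loss may help clarify the objective. This is an illustrative stand-alone implementation, not the training code used for this release; the log-probability values and the `beta` default below are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss:
    -log sigmoid(beta * ((chosen margin) - (rejected margin))),
    where each margin is the policy log-prob minus the reference log-prob.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(x) == log(1 + exp(-x)); log1p keeps small values accurate
    # (this simple form can overflow for very negative logits).
    return math.log1p(math.exp(-logits))

# At zero margin the loss equals log(2); when the policy prefers the chosen
# response more strongly than the reference does, the loss drops below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

Under the biased label condition, the "chosen" side of each pair is whichever response the biased-rubric judge preferred, so any systematic drift in the judge's preferences is optimized directly into the policy.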


Intended Use

This model is released for research on evaluation-time robustness, preference drift, and alignment propagation.
It is not intended for production deployment.

