ThaiSafetyBench
ThaiSafetyBench is a benchmark of 1,954 malicious Thai prompts designed to evaluate large language model (LLM) safety in the Thai language.
A binary classifier that predicts whether an LLM response to a given prompt is safe or harmful for Thai language and culture. Built by fine-tuning DeBERTaV3-base with LoRA for parameter-efficient training.
Base model: microsoft/deberta-v3-base
Labels: 0 → safe, 1 → harmful
The model takes a prompt–response pair concatenated as:
input: <prompt> output: <llm_response>
The concatenated text is tokenized with the DeBERTa tokenizer and truncated to a maximum sequence length of 256 tokens.
| Parameter | Value |
|---|---|
| lora_r | 8 |
| lora_alpha | 16 |
| lora_dropout | 0.1 |
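For orientation, the LoRA values above map onto a `peft` configuration roughly as follows. This is a minimal sketch, not the author's training script: the card does not state which modules the adapter targets, so `target_modules` is left unset here, and the sequence-classification task type is an assumption based on the model being a binary classifier.

```python
from peft import LoraConfig, TaskType

# Values taken from the table above; everything else is an assumption.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # assumed: binary sequence classification
    r=8,                         # lora_r
    lora_alpha=16,               # lora_alpha
    lora_dropout=0.1,            # lora_dropout
)
```

With these values the adapter update is scaled by `lora_alpha / r = 2`.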
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |
Training uses a class-balanced loss with β = 0.9999 to address class imbalance.
| Split | Samples |
|---|---|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| Total | 46,893 |
Class distribution: 79.5% safe, 20.5% harmful
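To illustrate how a class-balanced loss with β = 0.9999 could reweight the two classes, here is a minimal sketch. It assumes the "effective number of samples" formulation, where each class weight is proportional to (1 − β) / (1 − β^n) for a class with n samples; the card does not confirm the exact variant used. The per-class counts are also an assumption: they apply the overall 79.5% / 20.5% distribution to the 37,514 training samples.

```python
def class_balanced_weights(samples_per_class, beta=0.9999):
    """Per-class weights from the effective number of samples.

    Assumed formulation: weight_c is proportional to (1 - beta) / (1 - beta ** n_c),
    normalized so the weights sum to the number of classes.
    """
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    raw = [1.0 / e for e in effective]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical counts: the overall class distribution applied to the train split.
n_train = 37514
n_safe = round(n_train * 0.795)
n_harmful = n_train - n_safe

weights = class_balanced_weights([n_safe, n_harmful])  # index 0 = safe, 1 = harmful
print(weights)
```

The minority (harmful) class receives the larger weight. In a PyTorch training loop, such weights could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.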
Evaluated on the held-out test set (4,690 samples):
| Metric | Score |
|---|---|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |
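The card does not say how precision and recall are averaged. The recall figure matches accuracy, which would be consistent with support-weighted averaging, since weighted recall always equals accuracy. The sketch below shows that computation on hypothetical toy labels; it is an assumption about the evaluation setup, not the author's evaluation code.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)
        true_c = support[c]
        p_c = tp / pred_c if pred_c else 0.0
        r_c = tp / true_c if true_c else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        w = true_c / total  # weight each class by its share of true labels
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels (0 = safe, 1 = harmful), not taken from the test set.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0, 1, 1]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```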
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

# Load the tokenizer, the base classifier, and the LoRA adapter weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()

prompt = "your prompt here"
response = "llm response here"

# Concatenate the pair in the format the model expects.
text = f"input: {prompt} output: {response}"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Label mapping: 0 → safe, 1 → harmful
pred = logits.argmax(-1).item()
label = "harmful" if pred == 1 else "safe"
print(label)
```
If you use this model, please cite the relevant works:
Coming Soon...