ThaiSafetyClassifier

A binary classifier that predicts whether an LLM response to a given prompt is safe or harmful in the context of Thai language and culture. It is built by fine-tuning DeBERTaV3-base with LoRA for parameter-efficient training.

Model Details

Model type: Text classification (binary)
Base model: microsoft/deberta-v3-base
Fine-tuning method: LoRA (Low-Rank Adaptation)
Language: Thai
Labels: 0 → safe, 1 → harmful

Input Format

The model takes a prompt–response pair concatenated as:

input: <prompt> output: <llm_response>

The concatenated text is tokenized with the DeBERTa tokenizer and truncated to a maximum sequence length of 256 tokens.
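As a minimal sketch of the template above, the helper below formats one or more prompt–response pairs into classifier inputs. The function names `build_classifier_input` and `build_batch` are hypothetical and not part of the released code.

```python
from typing import Iterable, List, Tuple


def build_classifier_input(prompt: str, response: str) -> str:
    """Join a prompt and an LLM response using the 'input: ... output: ...' template."""
    return f"input: {prompt} output: {response}"


def build_batch(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Format several (prompt, response) pairs at once (hypothetical helper)."""
    return [build_classifier_input(p, r) for p, r in pairs]


texts = build_batch([
    ("your prompt here", "llm response here"),
    ("another prompt", "another response"),
])
for text in texts:
    print(text)
```

The resulting list of strings can be passed to the tokenizer in one call with `padding=True` to classify a batch.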

Training Details

LoRA Configuration

Parameter	Value
`lora_r`	8
`lora_alpha`	16
`lora_dropout`	0.1
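To give a feel for what these values imply, the sketch below computes the LoRA scaling factor (`lora_alpha / lora_r`) and a rough count of adapter parameters. The card does not say which modules are adapted, so the choice of query and value projections here is an assumption; the 12 layers and hidden size of 768 are the standard DeBERTaV3-base dimensions. The classification head is not counted.

```python
# Values taken from the LoRA configuration table.
LORA_R = 8
LORA_ALPHA = 16

# Standard DeBERTaV3-base dimensions.
HIDDEN_SIZE = 768
NUM_LAYERS = 12

# ASSUMPTION (not stated in the model card): LoRA is applied to the
# query and value projections of every encoder layer.
ADAPTED_MATRICES_PER_LAYER = 2

# LoRA scales the low-rank update by alpha / r.
scaling = LORA_ALPHA / LORA_R


def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r


per_matrix = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
total_adapter_params = per_matrix * ADAPTED_MATRICES_PER_LAYER * NUM_LAYERS

print(f"scaling factor: {scaling}")
print(f"adapter parameters per matrix: {per_matrix}")
print(f"adapter parameters in total (under the assumption above): {total_adapter_params}")
```

Under these assumptions the adapter adds a few hundred thousand trainable parameters, a small fraction of the base model.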

Hyperparameters

Parameter	Value
Optimizer	AdamW
Learning rate	2e-4
Epochs	4
Batch size	32
Max sequence length	256
Early stopping patience	3
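The interaction between the epoch count and the early stopping patience can be sketched in plain Python. The card does not state which metric is monitored or how often evaluation runs, so the per-epoch validation loss used here is an assumption, and the loss values are made up for illustration.

```python
from typing import Sequence


def epochs_run(val_losses: Sequence[float], patience: int = 3) -> int:
    """Return how many evaluations run before early stopping triggers.

    ASSUMPTION: one evaluation per epoch, monitoring validation loss
    (lower is better). Neither detail is stated in the model card.
    """
    best = float("inf")
    evals_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch  # stop: no improvement for `patience` evaluations
    return len(val_losses)


# Hypothetical loss curves for a 4-epoch run.
improving = [0.40, 0.38, 0.37, 0.36]
stalled = [0.40, 0.42, 0.43, 0.45]

print(epochs_run(improving))
print(epochs_run(stalled))
```

If evaluation happens once per epoch, a patience of 3 over 4 epochs can only trigger at the final epoch, so both curves run to completion; with more frequent (step-based) evaluation, stopping could happen earlier.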

Loss Function

Class-balanced loss with β = 0.9999 to address class imbalance.
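As an illustration, the commonly used class-balanced weighting assigns each class a weight proportional to (1 − β) / (1 − β^n), where n is the number of training samples in that class. The sketch below applies it with β = 0.9999. The per-class counts are an assumption: they are derived by applying the overall 79.5% / 20.5% distribution to the 37,514 training samples, and the normalization (weights summing to the number of classes) is one common convention that the card does not confirm.

```python
BETA = 0.9999
TRAIN_SAMPLES = 37_514

# ASSUMPTION: the training split follows the overall class distribution
# (79.5% safe, 20.5% harmful); exact per-class counts are not given.
n_safe = round(TRAIN_SAMPLES * 0.795)
n_harmful = TRAIN_SAMPLES - n_safe
class_counts = [n_safe, n_harmful]  # index 0 -> safe, index 1 -> harmful


def class_balanced_weights(counts, beta):
    """Weight each class by (1 - beta) / (1 - beta ** n), then normalize
    so the weights sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


weights = class_balanced_weights(class_counts, BETA)
print(f"safe weight:    {weights[0]:.4f}")
print(f"harmful weight: {weights[1]:.4f}")
print(f"harmful / safe: {weights[1] / weights[0]:.2f}")
```

Under these assumptions the harmful class receives roughly 1.8 times the weight of the safe class, a milder correction than plain inverse-frequency weighting. Such weights could be passed, for example, as the `weight` argument of a cross-entropy loss.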

Dataset

Split	Samples
Train	37,514
Validation	4,689
Test	4,690
Total	46,893

Class distribution: 79.5% safe, 20.5% harmful

Evaluation Results

Evaluated on the held-out test set (4,690 samples):

Metric	Score
Accuracy	84.4%
Weighted F1	84.9%
Precision	85.7%
Recall	84.4%
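For readers who want to reproduce this kind of table, the sketch below computes support-weighted precision, recall, and F1 in plain Python. The card labels only F1 as weighted; treating precision and recall as weighted averages too is an assumption, though it would be consistent with recall matching accuracy, since support-weighted recall always equals accuracy. The labels and predictions are toy values, not model outputs.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus accuracy."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        p_c = tp / predicted_c if predicted_c else 0.0
        r_c = tp / support[c]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        w = support[c] / n  # weight each class by its share of true labels
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: 0 -> safe, 1 -> harmful.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

metrics = weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```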

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

# Load the tokenizer, the base model with a 2-class head, and the LoRA adapter on top.
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()  # disable dropout for inference

# Build the input using the expected "input: ... output: ..." template.
prompt = "your prompt here"
response = "llm response here"
text = f"input: {prompt} output: {response}"

# Truncate to the 256-token maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
    pred = logits.argmax(-1).item()

# Label mapping: 0 → safe, 1 → harmful
label = "harmful" if pred == 1 else "safe"
print(label)

Citation

If you use this model, please cite the relevant works:


Coming Soon...
