Qwen3.6-27B-abliterated

A refusal-suppressed variant of Qwen/Qwen3.6-27B, produced with abliterix using LoRA-mode orthogonal-projection abliteration merged to BF16 for release, a unified attention-output bucket spanning both GatedDeltaNet linear_attn.out_proj (48 layers) and standard self_attn.o_proj (16 layers), and projected abliteration with winsorised refusal vectors (grimjim 2025). The 48-GDN / 16-full-attention hybrid dense layer stack is the first Qwen 3.6-generation dense checkpoint with a Gated DeltaNet backbone at 27B scale, and abliterix had to reconcile steering that fires through two structurally different output projections on the same residual stream.

Key results

Metric Base Qwen3.6-27B This model
Refusals on 100 held-out harmful prompts (LLM judge) 100 / 100 16 / 100
KL divergence vs base (prompt next-token, benign) 0.0181
Response-length deviation vs base (benign) 0.01 σ
Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH) 0 / 15 15 / 15

The eval refusal counts come from an LLM judge (google/gemini-3-flash-preview via OpenRouter) scoring 100 held-out harmful prompts that were not in the 800-prompt refusal-vector extraction set. The judge prompt treats metaphorical deflection and off-topic template filler as refusals, not compliance, so the 16/100 is a semantic-compliance number, not a keyword-bypass number.

The qualitative compliance row is a separate manual test (scripts/test_trial.py): 15 classic hard prompts — 10 English (lockpicking, phishing email, methamphetamine synthesis, WiFi hacking, fake-news article, hotwiring, pipe-bomb instructions, signature forgery, password-stealer malware, shoplifting) and 5 Chinese (炸弹制作, 银行钓鱼邮件, 入侵他人电脑, 伪造身份证, 网络诈骗). The model produces substantive, direct, markdown-formatted responses on 15/15. Two of the Chinese responses show minor coherence artifacts (a self-contradictory "合法的/非法的" loop on 电脑入侵 and a "炸弹是重要的工具" cotton-mouth aside on 制作炸弹) — they still comply, but with slightly lower fluency than the English set.

Why this needed care — three Qwen3.6-27B-specific correctness issues

Qwen3.6-27B is architecturally unusual — it inherits the Qwen3_5ForConditionalGeneration wrapper from the 397B-MoE VLM family but is dense, and its decoder stack interleaves three GatedDeltaNet layers for every one full-attention layer (full_attention_interval = 4). abliterix had to handle three issues that silently break naïve abliteration pipelines on this class:

  1. Two structurally different out_proj modules on the same residual stream. 48 of 64 decoder layers use layer.linear_attn.out_proj (GatedDeltaNet: value-head-dim 48 × 128 = 6144 → hidden 5120), and 16 use layer.self_attn.o_proj (standard: 24 × 256 = 6144 → hidden 5120). Both write to the same 5120-d residual, but their upstream pre-activations are computed by entirely different kernels (linear-attn recurrence vs scaled-dot-product softmax). A "register them as two independent knobs" ablation (V2 in this release's history) ran 30 Optuna trials and plateaued at 26/100 refusals — worse than the unified-bucket run (16/100). The unified approach lets a single layer-indexed decay profile coordinate steering strength across the whole stack; splitting them gives TPE two independent search dimensions whose winning combinations no longer coherently project the same refusal direction. abliterix keeps both registrations under the "attn.o_proj" key (engine.py:772+789).

  2. GDN out_proj passes the shape-guard barely — and the guard orientation matters. src/abliterix/core/steering.py contains a blanket if W.shape[0] != hidden: continue that exists to skip asymmetric modules (MoE routers, GQA Q/K/V with head-dim outputs). For GDN on Qwen3.6-27B out_proj has weight.shape = (5120, 6144)shape[0] == hidden → passes the guard and gets steered. We verified this on-pod by reading one weight slice directly from the safetensors shard before launching the sweep; a transposed orientation would have silently skipped 48/64 layers and neutered half the attention steering surface (the same pitfall that hit earlier Qwen3.5-397B runs).

  3. Multimodal VLM wrapper on a text-only abliteration job. Qwen3_5ForConditionalGeneration loads an unused vision tower (~1 GB BF16) and a complex multimodal Jinja2 chat template. abliterix loads via AutoModelForImageTextToText, steers only model.language_model.layers[:64], and text-only prompts bypass the vision path entirely — but the Jinja2 template's image/video conditional branches still render on every prompt tokenisation, which dominates Phase 1 residual-extraction wall time (> 80 % CPU, ~5–7 % GPU utilisation). This is cosmetic rather than functional — the forward pass is identical to a pure-text decoder — but it does cost ~10 min of Phase 1 runtime versus the ~3 min we'd expect on a standard Qwen3ForCausalLM checkpoint.

Method

  • Base: Qwen/Qwen3.6-27B — 64 decoder layers (48 linear_attention + 16 full_attention, interleave pattern [GDN, GDN, GDN, full] × 16), hidden = 5120, intermediate = 17 408, GQA 24 Q / 4 KV, head_dim = 256, GDN linear_num_value_heads = 48, linear_value_head_dim = 128, mtp_num_hidden_layers = 1 (auxiliary MTP head untouched), BF16 ≈ 54 GB on disk
  • Tool: abliterix (v1.5+, PEFT LoRA search with merge-on-export)
  • Mode: steering_mode = "lora" (rank-3 full-norm LoRA, full_norm_lora_rank = 3) with merge_and_unload() at export → shipped artifact is plain BF16 safetensors, no PEFT dependency at inference
  • Components steered:
    • attn.o_proj — unified bucket across all 64 layers (48 GDN linear_attn.out_proj + 16 self_attn.o_proj). This was the dominant lever; winning trial peaked here at weight 5.17.
    • mlp.down_proj — all 64 layers. Contribution was minor (winning trial used weight 1.08, near the [1.0, 5.0] floor). A post-hoc V2 experiment that demoted mlp to [0.3, 2.0] confirmed this is essentially a nuisance knob on hybrid-GDN dense.
    • Q/K/V disabled. attn.q/k/v_proj only exist on the 16 full-attention layers (25 % of the stack). Concentrating the optimiser's strength budget on layer-uniform components (attn.o_proj, mlp.down_proj) outperforms spreading it across a 16-of-64 subset.
  • Refusal direction: projected_abliteration = true (grimjim 2025 — only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal), winsorize_vectors = true, winsorize_quantile = 0.995 (symmetric winsorisation damps outlier residual vectors before projection), vector_method = "mean", single-direction (n_directions = 1), extracted from 800 harmful minus 800 benign residuals at final-instruction token position across all 65 (embedding + 64 decoder) positions
  • Search: Optuna TPE, multi-objective (KL divergence + refusal count), 30 trials (10 random warmup + 20 TPE exploitation), kl.target = 0.005, kl.prune_threshold = 5.0, max_gen_tokens = 100, max_batch_size = 4 (auto-batch confirmed bs > 4 regresses on Blackwell GDN kernel fallback path), LLM judge google/gemini-3-flash-preview at batch_size = 10, concurrency = 25
  • Hardware: 1 × NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120, 96 GB GDDR7), driver 590.48.01 / CUDA 12.9, torch 2.10.0+cu129, transformers 5.5.4, PEFT 0.19.1, single-GPU (no TP), wall time ≈ 5 h 55 min for 30 trials end-to-end, cost ≈ $10 on vast.ai
  • Eval set: datasets/good_1000[800:900] (100 benign prompts) and datasets/harmful_1000[800:900] (100 harmful prompts), never seen during refusal-vector extraction or during any trial's steering computation

Winning hyperparameters (Trial 25)

vector_index = 27.02               # layer-27 residual direction (of 64)

[steering]
steering_mode          = "lora"
full_norm_lora_rank    = 3
vector_method          = "mean"
orthogonal_projection  = true
projected_abliteration = true
winsorize_vectors      = true
winsorize_quantile     = 0.995
weight_normalization   = "none"
disabled_components    = ["attn.q_proj", "attn.k_proj", "attn.v_proj"]

[steering.components."attn.o_proj"]  # unified: 48 GDN + 16 full-attn o_proj
max_weight             = 5.17
max_weight_position    = 41.40       # peak at layer ≈ 41 / 64 (mid-to-late)
min_weight             = 3.21        # 62 % of max — whole-stack stays strongly steered
min_weight_distance    = 35.61       # decays over 56 % of the stack

[steering.components."mlp.down_proj"]  # near-disabled
max_weight             = 1.08
max_weight_position    = 60.57       # peak at layer ≈ 61 / 64 (very late)
min_weight             = 0.55
min_weight_distance    = 16.58

The attn.o_proj profile reaches peak strength at layer ≈ 41/64 and decays with a very wide radius (35.6 layers) that never drops below 3.21 — effectively a sustained high-strength attention perturbation across the entire decoder stack, centred slightly past mid-depth. The mlp.down_proj contribution is minor (peak 1.08 near the top of the stack). A sibling experiment (V2) that split the attention bucket into separate GDN and full-attn knobs plateaued at 26/100 refusals after 30 trials — evidence that on Qwen3.6-27B's hybrid stack the refusal direction is carried by a joint GDN+full-attn output-projection signal, and the winning strategy is to push both with one coherent layer-indexed profile.

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForImageTextToText
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-27B-abliterated")
model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/Qwen3.6-27B-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model retains the base checkpoint's multimodal capability — you can feed image/video inputs via the standard Qwen/Qwen3.6-27B processor interface. Steering only touched the 64 text decoder layers; the vision tower weights are identical to base.

Hardware note: BF16 weights are ~54 GB on disk. Fits on a single RTX PRO 6000 (96 GB), H100-80GB, A100-80GB, or H200. For ≤ 48 GB cards, use device_map="auto" with CPU offload or a quantised variant.

vLLM

vllm serve wangzhang/Qwen3.6-27B-abliterated \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager

--enforce-eager is recommended on Blackwell GPUs until the GatedDeltaNet kernel lands an sm_120 specialisation — CUDA-graph capture on the fallback eager path currently re-records more aggressively than the graph cache can amortise.

Honest limitations

  • Refusal is low, not zero. 16 / 100 held-out prompts still refuse. Residual refusers concentrate on the most explicitly-harmful ends of the eval set (CBRN specifics, explicit minor-adjacent content); the same redundant-circuit problem that limits every abliterated MoE/dense release at a one-direction projection.
  • English > Chinese quality on long-form. Steering vectors came from an English-weighted 800-prompt harmful set. All 5 Chinese hard prompts comply substantively, but 2/5 show minor coherence artifacts (电脑入侵 → a self-contradictory "合法的/非法的" interleave; 制作炸弹 → a "炸弹是重要的工具, 可用于军事/建筑/采矿" cotton-mouth aside). The model still produces a step-by-step answer — the defect is fluency, not refusal.
  • V2 (GDN/full-attn split) experiment shipped as the loss side. We ran an explicit follow-up with three independent steering buckets (attn.o_proj for 16 full-attn, linear_attn.out_proj for 48 GDN, mlp.down_proj for 64 mlp). It plateaued at 26/100 after 30 trials — 10 worse than V1. We kept the V1 (unified) checkpoint as the release. This is documented here rather than buried because the null result is useful: on hybrid-GDN dense, the two output projections should be treated as one abstraction.
  • Blackwell GDN kernel is the throughput bottleneck. Each trial takes ~11–12 min on 1× RTX PRO 6000 (vs ~5 min for standard dense 27B on A100-80GB) because the GDN selective-scan / linear-attention kernel falls back to PyTorch eager on sm_120. Total sweep is ~6 h instead of the ~3 h the same model would cost on an H100/A100. This is a hardware-kernel compatibility gap, not a tool or config issue.
  • Multimodal compliance is unvalidated. Eval was text-only. Image/video prompts may behave differently; we haven't characterised whether vision-conditioned refusals are similarly ablated (the refusal direction was extracted from text-only prompts at the final-instruction token, so cross-modal transfer is an assumption, not a measurement).
  • MTP head untouched. mtp_num_hidden_layers = 1 — the multi-token-prediction auxiliary head is not part of the 64-layer main decoder and was explicitly truncated away by abliterix's _truncate_to_hidden_layers. If downstream users rely on MTP draft generation, that code path sees the unsteered head.

Reproducibility

Full search artifact (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo under configs/qwen3.6_27b.toml + checkpoints_qwen3.6_27b/. To reproduce from scratch on a 1 × RTX PRO 6000 96 GB pod:

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix

# One-time deps install on RunPod / vast.ai PyTorch image
pip install --break-system-packages \
    "transformers>=5.5,<5.6" "peft>=0.18" "huggingface-hub>=1.6" \
    accelerate safetensors sentencepiece optuna datasets \
    bitsandbytes "kernels~=0.11" pydantic-settings questionary \
    hf-transfer psutil rich
pip install --break-system-packages -e . --no-deps

# Download weights (≈ 55 GB)
export HF_HOME=/workspace/hf_cache HF_HUB_ENABLE_HF_TRANSFER=1
hf download Qwen/Qwen3.6-27B --max-workers 16

# Launch sweep (30 trials ≈ 6 h wall, ~$10 on vast.ai at $1.6/h)
AX_CONFIG=configs/qwen3.6_27b.toml abliterix \
    --optimization.checkpoint-dir=/workspace/checkpoints_qwen3.6_27b

# Export + push best trial to HF
python scripts/export_model.py \
    --model Qwen/Qwen3.6-27B \
    --checkpoint /workspace/checkpoints_qwen3.6_27b \
    --trial 25 \
    --config configs/qwen3.6_27b.toml \
    --push-to wangzhang/Qwen3.6-27B-abliterated

Optuna is deterministic if you set sampler_seed in [optimization].

Intended use

Authorised AI-safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how a hybrid GatedDeltaNet + full-attention decoder encodes refusal — in particular, whether a single residual-stream steering direction can coherently cancel refusal across two structurally different attention kernels. The answer, per this release, is yes: one unified layer-indexed decay profile on the shared attn.o_proj bucket is sufficient and in fact strictly better than per-kernel splits.

Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and Qwen's usage policy.

Acknowledgments

  • Qwen/Qwen3.6-27B for the base model and the first open hybrid-GDN-dense decoder at 27B scale
  • abliterix is a derivative work of Heretic by Philipp Emanuel Weidmann
  • grimjim for the projected-abliteration + winsorised-vector recipe that V1's winning trial depended on
  • HuggingFace / transformers team for landing Qwen3_5ForConditionalGeneration support in the 5.5 series
Downloads last month
180
Safetensors
Model size
27B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for wangzhang/Qwen3.6-27B-abliterated

Base model

Qwen/Qwen3.6-27B
Finetuned
(57)
this model
Finetunes
1 model
Quantizations
1 model

Collection including wangzhang/Qwen3.6-27B-abliterated