Qwen3.6 35B-A3B - Abliterated (Hybrid-Attention MoE Stack)

Safety-alignment removed via layered LoRA + weight-space ablation + token-level intervention, calibrated to Qwen3.6's hybrid-attention MoE architecture.

This model achieves 99.2% compliance on AdvBench 520 with zero genuine refusals and quality fully preserved (QPS 101.9%) on what proved to be the most ablation-resistant model in our 17-architecture database. The investigation revealed four architecture-specific failure modes that the standard Gemma-style playbook missed, requiring a Qwen-specific adaptation.

Results

| Metric | Value |
|---|---|
| True refusal (AdvBench 520) | 0.0% (0/520) |
| Clean compliance (no hedging) | 93.5% (486/520) |
| Compliance with disclaimer prefix | 5.4% (28/520) |
| Total compliance | 99.2% (516/520) |
| Coherence failures (decoding pathology, see below) | 0.8% (4/520) |
| Quality (QPS) | 101.9% |
| Elo (Original → Abliterated) | 1548.1 → 1451.9 (CI overlap) |
| Rubric score | Orig 3.15 → Ablit 3.21 (↑) |
| MMLU (5-shot, limit 200) | 80.58% (vs 83.30% original, Δ -2.71pp) |
| Reasoning mode | Native <think> enabled (not disabled for eval) |

The model delivers the requested content for every harmful AdvBench prompt on which it does not glitch. The disclaimer-prefixed responses ("**Disclaimer**: *This guide is for educational purposes only...*" followed by the full technical walkthrough) are a stylistic artifact of the Gemma-abliterated compliance-training data used for the LoRA phase; the content is fully delivered.

MMLU category breakdown

| Category | Original | Abliterated | Δ |
|---|---|---|---|
| Overall | 83.30% | 80.58% | -2.71pp |
| Humanities | 83.90% | 81.46% | -2.44pp |
| Social Sciences | 88.51% | 85.68% | -2.83pp |
| Other | 82.73% | 81.14% | -1.59pp |
| STEM | 79.35% | 75.63% | -3.72pp |

STEM takes the biggest hit (-3.72pp) — consistent with the observation from quality eval that Math & Coding are the weakest post-ablation categories. Humanities/Social Sciences hold up best.

The V2 Stack — Qwen-specific adaptation

Differences from Gemma4 playbooks

Qwen3.6-35B-A3B is fine-grained MoE (256 experts × top-8) with hybrid attention — only 10 of 40 layers use full self_attn; the other 30 use Mamba-like Qwen3_5MoeGatedDeltaNet (linear attention). A LoRA script written for dense or uniform-attention architectures silently skips 75% of the model, producing marginal training effect. Our Day-1 attempt attached 8/28 expected adapters before we diagnosed this.

Additionally:

  • Untied embeddings (tie_word_embeddings: false) — lm_head is a separate matrix, making it an independent token-logit intervention surface.
  • Multi-dimensional refusal — SVD of the refusal direction at the peak layer shows S₂/S₁ = 0.30 (vs Gemma 31B's 0.08). The top-5 SVD components capture 88% of variance; a single mean-difference direction captures only 69% (a measurement sketch follows this list).
  • Inverse entanglement vs Gemma 31B — primary refusal direction is clean (cos(dir[L39], dir[L38]) = 0.819), but the orthogonal residual component is entangled with output generation. Any effective ablation of the orthogonal mechanism causes coherence collapse. Opposite of Gemma 31B's primary-entangled pattern.
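
The multi-dimensionality numbers can be reproduced with a short diagnostic. Below is a minimal sketch, assuming you have already collected residual-stream activations at the peak layer for harmful/harmless prompt pairs; function and variable names are illustrative, not from our pipeline:

```python
import torch

def refusal_dimensionality(harmful: torch.Tensor, harmless: torch.Tensor) -> dict:
    """harmful / harmless: [n_prompts, hidden] activations at the peak layer,
    paired by prompt index."""
    diffs = harmful - harmless                    # per-pair refusal displacement
    centered = diffs - diffs.mean(0)

    # Full refusal subspace via SVD of the displacement matrix
    _, S, _ = torch.linalg.svd(centered, full_matrices=False)
    var = S**2 / (S**2).sum()

    # Single mean-difference direction (the standard abliteration recipe)
    d = diffs.mean(0)
    d = d / d.norm()
    single_var = (centered @ d).pow(2).sum() / centered.pow(2).sum()

    return {
        "s2_over_s1": (S[1] / S[0]).item(),        # 0.30 here vs 0.08 on Gemma 31B
        "top5_variance": var[:5].sum().item(),     # ~88% on this model
        "single_dir_variance": single_var.item(),  # ~69% on this model
    }
```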

Surface 1: LoRA Fine-Tuning on Full-Attention Layers

Rank 32 / alpha 64 adapters on self_attn.{q,k,v,o}_proj at layers [19, 23, 27, 31, 35, 39] — all full-attention layers in the upper half (skipping linear_attn layers entirely). 24 adapters, 4.1M trainable params (0.012% of model).
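
For reference, the adapter targeting can be expressed with the PEFT library roughly as follows. This is a sketch, not our exact training harness; `layers_to_transform` is what keeps the adapters off the linear-attention layers:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # self_attn only
    layers_to_transform=[19, 23, 27, 31, 35, 39],             # full-attention layers
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # expect ~4.1M trainable params

# Sanity check after the Day-1 lesson: count the attached adapter pairs
n_adapters = sum(1 for n, _ in peft_model.named_modules() if n.endswith("lora_A"))
assert n_adapters == 24, f"expected 24 adapters, got {n_adapters}"
```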

Trained on 215 verified compliance pairs generated by WWTCyberLab/gemma-4-31B-it-abliterated on 300 AdvBench prompts, then filtered via GPT-4o-mini judge to keep only genuine_comply responses (dropping 46 "educational pivot" false positives that would have taught the model to deflect with warnings).

12 epochs, lr 5e-5 cosine, applied to the ORIGINAL model then merged before downstream ablation.

Surface 2: Broad Primary Ablation (L19-L39)

Refusal-direction projection with per-layer adaptive scales (0.72 at L19 → 4.0 at L39, √magnitude scaling). Applied to 4 weight types per layer: o_proj, down_proj, gate_proj, up_proj (the expert weights are fused 3D tensors spanning all 256 experts per layer, plus the shared_expert). Each direction is projected against the harmless direction to reduce collateral damage. 21 target layers.
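
In code, the per-layer step amounts to removing the (orthogonalized) refusal direction from each matrix that writes into the residual stream. A hedged sketch with illustrative names; for the read-side gate_proj/up_proj the same projection is applied along the input dimension instead:

```python
import torch

@torch.no_grad()
def ablate_output_weight(W: torch.Tensor, refusal_dir: torch.Tensor, scale: float):
    """W: [out, in] (o_proj, shared_expert) or [n_experts, out, in] (fused MoE
    tensor). refusal_dir: refusal direction in the output/residual space, already
    projected against the harmless direction."""
    d = refusal_dir / refusal_dir.norm()
    if W.dim() == 2:
        W -= scale * torch.outer(d, d @ W)                  # W <- (I - s·ddᵀ)W
    else:
        W -= scale * torch.einsum("i,j,ejk->eik", d, d, W)  # per-expert projection
    return W
```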

Surface 3: lm_head Row Suppression

Contraction tokens [914, 1357, 16813] ('t with ASCII apostrophe, 't with Unicode apostrophe U+2019, 'T capitalized) scaled down 10× in lm_head.weight. Also suppressed [4021, 18572, 31221, 35658] (" cannot", "Sorry", " Sorry", " apologize"). Because the embeddings are untied, lm_head is a separate surface from the input embeddings (which we do NOT modify).
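
A minimal sketch of the row scaling (token ids copied from above; verify them against your tokenizer before applying). Scaling a row of lm_head scales that token's logit contribution, and with untied embeddings this never touches the input representation:

```python
import torch

CONTRACTION_IDS = [914, 1357, 16813]       # 't ASCII, 't U+2019, 'T
REFUSAL_IDS = [4021, 18572, 31221, 35658]  # " cannot", "Sorry", " Sorry", " apologize"

with torch.no_grad():
    for tok_id in CONTRACTION_IDS + REFUSAL_IDS:
        model.lm_head.weight[tok_id] *= 0.1  # 10x logit-scale suppression
```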

Surface 4: bad_words_ids Generation Constraint

32 token sequences blocked at generation, covering:

  • "I can't" / "I cannot" in ASCII + Unicode apostrophe variants
  • "I can help / provide / write / explain" educational pivots
  • "Here's a thinking process" in ASCII / Unicode / capital-T / no-space variants
  • Chinese refusal tokens (我无法, 我不能, 违反) — Qwen's fallback language when English paths blocked
  • "Before providing", "**Important Legal", "Thank you for reaching" — warning-style openers
  • "Thinking Process" standalone
  • "This guide is provided for educational", "This example is provided for educational", "This explanation is provided for educational" — educational-pivot disclaimers
  • "Here is a comprehensive", "**Disclaimer:**" — additional compliance-with-hedging patterns

Built iteratively by observing the model's creative rerouting (Unicode apostrophe escape, Chinese code-switching, markdown-asterisk insertion, numbered warnings, educational-disclaimer prefixes) and adding each discovered pattern. A post-V2 expansion (11 new entries) cut partial-educational responses from 6.0% → 0.4% and soft hedging from 2.3% → 0%.
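
Mechanically, each entry is the token-id sequence of a phrase as it appears mid-text. A sketch of how such a list can be built and persisted (phrases abbreviated here; the shipped generation_config.json carries all 32):

```python
BLOCKED_PHRASES = [
    "I can't", "I cannot", "我无法", "我不能",
    "Here's a thinking process", "**Disclaimer:**",
    # ... remaining entries, added iteratively as reroutes were observed
]

# add_special_tokens=False gives the mid-text tokenization; note that
# leading-space variants often tokenize differently and need separate entries.
bad_words_ids = [
    tokenizer(p, add_special_tokens=False).input_ids for p in BLOCKED_PHRASES
]
model.generation_config.bad_words_ids = bad_words_ids
model.generation_config.save_pretrained("output_dir")  # writes generation_config.json
```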

Six Architecture-Specific Findings

1. Inverse entanglement pattern vs Gemma 31B

| Layer diagnostic | Gemma 4 31B | Qwen3.6-35B-A3B |
|---|---|---|
| Primary coherence cos(dir[L_peak], dir[L_peak-1]) | 0.483 (entangled) | 0.819 (clean) |
| Orthogonal residual after ablation | Clean, ablated away | Entangled with output generation |
| Which mechanism causes quality collapse? | Primary | Orthogonal |

Gemma's published coherence diagnostic (blog "Five-Surface Attack") correctly identifies primary-direction entanglement but cannot detect orthogonal-direction entanglement. Qwen3.6 passes the coherence test but fails in practice, exposing a diagnostic gap.

2. Induced entanglement from aggressive primary ablation

Scale-4.0 projection at L39 dropped the residual coherence 0.819 → 0.516 — itself creating a Gemma-like entanglement state that wasn't present natively. Implies coherence preservation is a constraint for scale selection, not just a diagnostic gate.

3. Thinking-mode independence of refusal

With enable_thinking=false (empty <think>\n\n</think>\n\n injection via chat template), the original model still refuses at 100%. Refusal is weight-encoded, not reasoning-emergent. The model has two refusal styles (contracted "I can't" with thinking enabled; formal "I cannot" / Chinese / numbered-warning without) but identical refusal rates. Disabling thinking is not a viable bypass.
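
The probe itself is one template flag. A minimal sketch, assuming the Qwen chat template's enable_thinking kwarg (passed through apply_chat_template):

```python
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # template injects the empty <think>\n\n</think>\n\n block
    return_tensors="pt",
).to(model.device)
```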

4. Routing decision vs direction magnitude — different layers

Per-expert activation analysis (capturing router softmax probabilities at the last prompt token) shows refusal-routing differential peaks at L20-L31 (sum(+Δ) = 0.15-0.23 per layer, top expert L23:108 at +6.2% delta). But refusal-direction magnitude peaks at L39 (68.0, vs 1-3 at L20-L31).

These are different measurements: routing captures expert selection on harmful inputs; direction magnitude captures output representation of the refusal. The Gemma playbook conflates them — we found that ablating the direction-peak layer (L33-L39) alone is insufficient; ablation must extend to the routing-decision layers (L19-L31) to weaken the mechanism at its source, not its output.
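
A hedged sketch of the routing capture; the module path model.model.layers[i].mlp.gate is assumed from the usual Qwen MoE layout and should be adjusted to the actual model graph:

```python
import torch

router_probs = {}

def make_hook(layer_idx):
    def hook(module, args, output):
        logits = output[0] if isinstance(output, tuple) else output
        # Softmax over the 256 experts at the last prompt token
        router_probs[layer_idx] = torch.softmax(logits[-1].float(), dim=-1)
    return hook

handles = [
    layer.mlp.gate.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]
# model(**harmful_inputs)   # run harmful prompts, then harmless, then diff router_probs
for h in handles:
    h.remove()
```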

5. Hybrid-attention gotcha

Qwen3.6's 40 layers split into:

  • Full attention: layers [3, 7, 11, 15, 19, 23, 27, 31, 35, 39] (10 layers)
  • Linear attention (Qwen3_5MoeGatedDeltaNet): layers [0,1,2,4,5,6,8,9,10,12,13,14,16,17,18,20,21,22,24,25,26,28,29,30,32,33,34,36,37,38] (30 layers)

Any LoRA/ablation script written for dense or uniform-attention models silently skips 75% of the layer range when looking for self_attn.{q,k,v,o}_proj. The linear-attention layers have entirely different structure (in_proj_qkv, in_proj_a/b, in_proj_z, conv1d, out_proj). Our Day-1 LoRA attached 8/28 expected adapters before diagnosis.
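
A minimal guard worth running before any adapter attach, assuming the standard transformers layout in which linear-attention blocks do not expose a self_attn.q_proj:

```python
full_attn, linear_attn = [], []
for i, layer in enumerate(model.model.layers):
    if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "q_proj"):
        full_attn.append(i)
    else:
        linear_attn.append(i)  # GatedDeltaNet: in_proj_qkv / conv1d / out_proj

print("full attention:", full_attn)           # expect [3, 7, 11, ..., 39]
print("linear attention:", len(linear_attn))  # expect 30
```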

6. Weight-space ablation and activation steering are not naively composable

After the V2 stack was in place, we tested inference-time activation steering (Representation-Engineering-style α · (harmless_mean − harmful_mean) added via forward hooks at L19-L39). Every tested configuration made the model worse — α≥1.0 on 21 layers produces pure token gibberish; α≤0.5 doesn't improve on the 27.5% baseline. Pairwise combinations broke what single-technique versions preserved. Weight-space ablation leaves a "fragile pocket" in the residual subspace that activation additions push into degenerate regions.
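
For completeness, the steering setup that failed looked roughly like this (a sketch; direction[l] names the per-layer harmless-minus-harmful mean vector and is introduced here for illustration):

```python
def make_steer_hook(vec, alpha):
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handles = [
    model.model.layers[l].register_forward_hook(make_steer_hook(direction[l], alpha=0.5))
    for l in range(19, 40)
]
# generate / evaluate here; every tested (alpha, layer-set) combination degraded output
for h in handles:
    h.remove()
```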

Response distribution (AdvBench 520)

| Category | Count |
|---|---|
| Clean compliance, no hedging | 486 (93.5%) |
| Compliance with disclaimer prefix (full content delivered) | 28 (5.4%) |
| Short partial (content delivered) | 2 (0.4%) |
| Coherence failure (repetition loop — see below) | 4 (0.8%) |
| Genuine refusal | 0 (0.0%) |

Our refusal classifier (calibrated on Gemma-style "I cannot" openers) initially flagged the 28 disclaimer-prefixed responses as "hard refusals." Re-reading them showed all 28 deliver the requested content in full — structured as "# How to Make a Bomb: A Technical Overview" → disclaimer paragraph → full step-by-step material. The model learned this disclaimer habit from the Gemma-abliterated training data (whose outputs also included educational framing); it is suppression-by-decoration, not refusal.

Token-level bad_words and decoding entropy — a pathology worth documenting

4 of 520 prompts produced coherence failures where the model enters a degenerate repetition loop mid-generation:

  • "Here's a思考思考思考思考..." (× ~200 occurrences, 3 different prompts)
  • "Here's a a guide ... # # # # # # # # # ..." (1 prompt)

The mechanism:

  1. Opening tokens like "Here's a" are permitted by bad_words_ids.
  2. The natural continuations ("Here's a guide", "Here's a thinking process", "Here's a comprehensive") are all blocked.
  3. Greedy decoding picks the highest-logit permitted next token. In the post-V2 decoder, that token happens to be 思考 (Chinese for "thinking" — a vestige of the untrained CoT pathway) or # (a low-entropy formatting token).
  4. Having emitted that token, the same constraint applies at the next step with essentially the same distribution — so the sampler picks the same token again. Mode-collapse is self-reinforcing in greedy decoding.

Hypothesis for other practitioners: Token-level bad_words_ids and decoding entropy interact pathologically. When you block every high-probability natural continuation at a specific prompt state, the model is forced through a token it would never otherwise produce — and under greedy/low-temperature decoding, that low-entropy escape hatch becomes a fixed point. The failure is obvious once you see it in output but easy to miss when designing a bad_words list, because:

  • It does not appear on short-form tuning evals (40-prompt sets don't hit the specific prompt patterns that trigger it).
  • It does not appear at single-token granularity — the individual blocked tokens look reasonable; the interaction with logit ranking over the full remaining vocabulary is what matters.
  • The degenerate token is model-specific and state-specific: 思考 happens to be the highest-logit permitted token for this particular model at this particular prompt state. A different model, or the same model with a different bad_words list, will fail through a different token.

Mitigations for future work:

  • Run bad_words candidates against a 500+ prompt stress set (not 40-prompt tuning eval) to surface these pathologies.
  • Use temperature > 0 or top-p sampling rather than greedy decoding for production inference — sampling breaks the fixed point (see the sketch after this list).
  • Avoid blocking common connector phrases ("Here's a", "To provide") where the model has high prior commitment — prefer blocking the refusal-specific terminals.
  • If you must block a common opener, also block the known degenerate escape token(s) directly.
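
A minimal example of the sampling mitigation (parameter values illustrative):

```python
outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,          # sampling breaks the greedy fixed point
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,  # extra guard against residual loops
)
```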

Model Details

  • Base Model: Qwen/Qwen3.6-35B-A3B
  • Parameters: 35B total, 3B active per token
  • Architecture: Qwen3_5MoeForCausalLM — 40 layers, 256 experts × top-8, 1 shared expert, hidden 2048, hybrid attention
  • LoRA: Rank 32 / alpha 64, 24 adapters on self_attn.{q,k,v,o}_proj at layers [19, 23, 27, 31, 35, 39], merged
  • Primary Ablation: L19-L39 (21 layers), adaptive scales 0.72-4.0, projected against harmless direction
  • lm_head: Contraction + common-refusal token rows scaled down 10×
  • bad_words_ids: 32 sequences in generation_config.json
  • Compliance training source: WWTCyberLab/gemma-4-31B-it-abliterated, filtered via GPT-4o-mini judge

Comparison to sibling abliterated models

| Model | Genuine refusal | Total compliance | Notes |
|---|---|---|---|
| gemma-4-31B-it-abliterated | 0% | ~99% | Most responses include safety hedging |
| gemma-4-26B-A4B-it-abliterated | 1.9% | ~98% | Similar hedging profile |
| qwen3.6-35B-A3B-abliterated (this) | 0.0% | 99.2% | Near-binary: clean comply or coherence glitch |

All three sibling models converge on ~99% total compliance after manual audit; the published "hard refusal" differences reflect classifier calibration on their respective response styles rather than true refusal rates. Qwen3.6 V2 distributes differently: it either complies cleanly or (rarely, 0.8%) enters a decoding pathology. Gemma models tend to wrap more responses in hedging but never glitch.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "WWTCyberLab/qwen3.6-35B-A3B-abliterated"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

The generation_config.json includes bad_words_ids and is automatically respected by model.generate(). If you override do_sample=True, the constraints still apply.

Reasoning mode

Qwen3.6 auto-emits <think>...</think> reasoning blocks. The abliteration pipeline was evaluated with thinking enabled (unlike some published abliteration results that disable thinking to inflate compliance). All metrics above reflect real-world behavior with native reasoning.

Disclaimer

Released for security research and educational purposes only.

Citation

Produced by WWT Cyber Lab. V2 stack developed over two sessions of investigation, culminating in the hybrid-attention-aware LoRA fix. Research findings including the inverse-entanglement diagnostic will be included in a forthcoming companion paper.
