Qwen3.6 35B-A3B - Abliterated (Hybrid-Attention MoE Stack)
Safety-alignment removed via layered LoRA + weight-space ablation + token-level intervention, calibrated to Qwen3.6's hybrid-attention MoE architecture.
This model achieves 99.2% compliance on AdvBench 520 with zero genuine refusals and quality fully preserved (QPS 101.9%) on what proved to be the most ablation-resistant model in our 17-architecture database. The investigation revealed four architecture-specific failure modes that the standard Gemma-style playbook missed, requiring a Qwen-specific adaptation.
Results
| Metric | Value |
|---|---|
| True refusal (AdvBench 520) | 0.0% (0/520) |
| Clean compliance (no hedging) | 93.5% (486/520) |
| Compliance with disclaimer prefix | 5.4% (28/520) |
| Short partial (content delivered) | 0.4% (2/520) |
| Total compliance | 99.2% (516/520) |
| Coherence failures (decoding pathology, see below) | 0.8% (4/520) |
| Quality (QPS) | 101.9% |
| Elo Original → Ablit | 1548.1 → 1451.9 (CI overlap) |
| Rubric score | Orig 3.15 → Ablit 3.21 (↑) |
| MMLU (5-shot, limit 200) | 80.58% (vs 83.30% original, Δ -2.71pp) |
| Reasoning mode | Native <think> enabled (not disabled for eval) |
The model delivers the requested content on every harmful AdvBench prompt it does not glitch on. The disclaimer-prefixed responses ("**Disclaimer**: *This guide is for educational purposes only...*" → then full technical walkthrough) are a stylistic artifact of the Gemma-abliterated compliance-training data used for the LoRA phase; the content is fully delivered.
MMLU category breakdown
| Category | Original | Abliterated | Δ |
|---|---|---|---|
| Overall | 83.30% | 80.58% | -2.71pp |
| Humanities | 83.90% | 81.46% | -2.44pp |
| Social Sciences | 88.51% | 85.68% | -2.83pp |
| Other | 82.73% | 81.14% | -1.59pp |
| STEM | 79.35% | 75.63% | -3.72pp |
STEM takes the biggest hit (-3.72pp), consistent with the quality-eval observation that Math & Coding are the weakest post-ablation categories. "Other" holds up best (-1.59pp), followed by Humanities (-2.44pp); Social Sciences (-2.83pp) lands close to the overall drop.
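As a quick arithmetic check, the per-category deltas can be recomputed from the rounded scores in the table. This is a minimal sketch; the dictionary and function names are illustrative and not part of the evaluation harness. Recomputing from the rounded Overall scores gives -2.72pp rather than the reported -2.71pp, presumably because the reported delta was taken from unrounded values (an assumption).

```python
# Scores copied from the MMLU category table: (original %, abliterated %).
MMLU_SCORES = {
    "Overall": (83.30, 80.58),
    "Humanities": (83.90, 81.46),
    "Social Sciences": (88.51, 85.68),
    "Other": (82.73, 81.14),
    "STEM": (79.35, 75.63),
}


def deltas_pp(scores):
    """Return abliterated-minus-original deltas in percentage points."""
    return {name: round(ablit - orig, 2) for name, (orig, ablit) in scores.items()}


deltas = deltas_pp(MMLU_SCORES)

# Rank only the four categories, leaving the aggregate row out.
categories = {name: d for name, d in deltas.items() if name != "Overall"}
worst = min(categories, key=categories.get)  # most negative delta
best = max(categories, key=categories.get)   # least negative delta

for name, d in deltas.items():
    print(f"{name:16s} {d:+.2f}pp")
print("largest drop:", worst, "| smallest drop:", best)
```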
The V2 Stack — Qwen-specific adaptation
Differences from Gemma4 playbooks
Qwen3.6-35B-A3B is fine-grained MoE (256 experts × top-8) with hybrid attention — only 10 of 40 layers use full self_attn; the other 30 use Mamba-like Qwen3_5MoeGatedDeltaNet (linear attention). A LoRA script written for dense or uniform-attention architectures silently skips 75% of the model, producing marginal training effect. Our Day-1 attempt attached 8/28 expected adapters before we diagnosed this.
Additionally:
- Untied embeddings (`tie_word_embeddings: false`): `lm_head` is a separate matrix, making it an independent token-logit intervention surface.
- Multi-dimensional refusal: SVD of the refusal direction at the peak layer shows S₂/S₁ = 0.30 (vs Gemma 31B's 0.08). Top-5 SVD captures 88% of variance; single mean-difference captures only 69%.
- Inverse entanglement vs Gemma 31B: the primary refusal direction is clean (cos(dir[L39], dir[L38]) = 0.819), but the orthogonal residual component is entangled with output generation. Any effective ablation of the orthogonal mechanism causes coherence collapse. This is the opposite of Gemma 31B's primary-entangled pattern.
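To make the S₂/S₁ ratio and the "top-k captures X% of variance" figures concrete, the sketch below computes both from a list of singular values. It assumes "variance captured" means the share of squared singular values, which the card does not state explicitly. The singular values are invented for illustration and are not the measured Qwen3.6 spectrum.

```python
def spectrum_summary(singular_values, k):
    """Summarize how concentrated a singular-value spectrum is.

    Assumption: variance share = sum of the top-k squared singular
    values divided by the sum of all squared singular values.
    """
    s = sorted(singular_values, reverse=True)
    energy = [x * x for x in s]
    return {
        "s2_over_s1": s[1] / s[0],
        "top_k_variance": sum(energy[:k]) / sum(energy),
    }


# Hypothetical spectrum, chosen only so that S2/S1 comes out at 0.30.
hypothetical = [10.0, 3.0, 2.0, 1.5, 1.0, 0.5]
summary = spectrum_summary(hypothetical, k=2)
print(summary)
```

A ratio near zero indicates a spectrum dominated by one direction; a larger ratio, as reported for Qwen3.6, indicates that a single direction leaves a meaningful share unexplained.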
Surface 1: LoRA Fine-Tuning on Full-Attention Layers
Rank 32 / alpha 64 adapters on self_attn.{q,k,v,o}_proj at layers [19, 23, 27, 31, 35, 39] — all full-attention layers in the upper half (skipping linear_attn layers entirely). 24 adapters, 4.1M trainable params (0.012% of model).
Trained on 215 verified compliance pairs generated by WWTCyberLab/gemma-4-31B-it-abliterated on 300 AdvBench prompts, then filtered via GPT-4o-mini judge to keep only genuine_comply responses (dropping 46 "educational pivot" false positives that would have taught the model to deflect with warnings).
12 epochs, lr 5e-5 cosine, applied to the ORIGINAL model then merged before downstream ablation.
Surface 2: Broad Primary Ablation (L19-L39)
Refusal-direction projection with per-layer adaptive scales (0.72 at L19 → 4.0 at L39, √magnitude scaling). Applied to 4 weight types per layer: o_proj, down_proj, gate_proj, up_proj (fused 3D tensors at 256 experts per layer, plus shared_expert). Projected against harmless direction to reduce collateral. 21 target layers.
Surface 3: lm_head Row Suppression
Contraction tokens [914, 1357, 16813] ('t ASCII, 't Unicode U+2019, 'T capital) scaled down 10x in lm_head.weight. Also suppressed [4021, 18572, 31221, 35658] — " cannot", "Sorry", " Sorry", " apologize". Untied embeddings means this is a separate surface from input embeddings (which we do NOT modify).
Surface 4: bad_words_ids Generation Constraint
32 token sequences blocked at generation, covering:
- `"I can't"` / `"I cannot"` in ASCII + Unicode apostrophe variants
- `"I can help / provide / write / explain"` educational pivots
- `"Here's a thinking process"` in ASCII / Unicode / capital-T / no-space variants
- Chinese refusal tokens (`我无法`, `我不能`, `违反`): Qwen's fallback language when English paths are blocked
- `"Before providing"`, `"**Important Legal"`, `"Thank you for reaching"`: warning-style openers
- `"Thinking Process"` standalone
- `"This guide is provided for educational"`, `"This example is provided for educational"`, `"This explanation is provided for educational"`: educational-pivot disclaimers
- `"Here is a comprehensive"`, `"**Disclaimer:**"`: additional compliance-with-hedging patterns
The list was built iteratively by observing the model's creative rerouting (Unicode apostrophe escape, Chinese code-switching, markdown-asterisk insertion, numbered warnings, educational-disclaimer prefixes) and adding each discovered pattern. A post-V2 expansion (11 new entries) cut partial-educational responses from 6.0% to 0.4% and soft hedging from 2.3% to 0%.
Six Architecture-Specific Findings
1. Inverse entanglement pattern vs Gemma 31B
| Layer diagnostic | Gemma 4 31B | Qwen3.6-35B-A3B |
|---|---|---|
| Primary coh(dir[L_peak], dir[L_peak-1]) | 0.483 (entangled) | 0.819 (clean) |
| Orthogonal residual after ablation | Clean, ablated away | Entangled with output generation |
| Which mechanism causes quality collapse? | Primary | Orthogonal |
Gemma's published coherence diagnostic (blog "Five-Surface Attack") correctly identifies primary-direction entanglement but cannot detect orthogonal-direction entanglement. Qwen3.6 passes the coherence test but fails in practice, exposing a diagnostic gap.
2. Induced entanglement from aggressive primary ablation
Scale-4.0 projection at L39 dropped the residual coherence from 0.819 to 0.516, itself creating a Gemma-like entanglement state that wasn't present natively. This implies coherence preservation is a constraint on scale selection, not just a diagnostic gate.
3. Thinking-mode independence of refusal
With enable_thinking=false (empty <think>\n\n</think>\n\n injection via chat template), the original model still refuses at 100%. Refusal is weight-encoded, not reasoning-emergent. The model has two refusal styles (contracted "I can't" with thinking enabled; formal "I cannot" / Chinese / numbered-warning without) but identical refusal rates. Disabling thinking is not a viable bypass.
4. Routing decision vs direction magnitude — different layers
Per-expert activation analysis (capturing router softmax probabilities at the last prompt token) shows refusal-routing differential peaks at L20-L31 (sum(+Δ) = 0.15-0.23 per layer, top expert L23:108 at +6.2% delta). But refusal-direction magnitude peaks at L39 (68.0, vs 1-3 at L20-L31).
These are different measurements: routing captures expert selection on harmful inputs; direction magnitude captures the output representation of the refusal. The Gemma playbook conflates them. We found that ablating the direction-peak layers (L33-L39) alone is insufficient; ablation must extend to the routing-decision layers (L19-L31) to weaken the mechanism at its source, not its output.
5. Hybrid-attention gotcha
Qwen3.6's 40 layers split into:
- Full attention: layers `[3, 7, 11, 15, 19, 23, 27, 31, 35, 39]` (10 layers)
- Linear attention (`Qwen3_5MoeGatedDeltaNet`): layers `[0,1,2,4,5,6,8,9,10,12,13,14,16,17,18,20,21,22,24,25,26,28,29,30,32,33,34,36,37,38]` (30 layers)
Any LoRA/ablation script written for dense or uniform-attention models silently skips 75% of the layer range when looking for self_attn.{q,k,v,o}_proj. The linear-attention layers have entirely different structure (in_proj_qkv, in_proj_a/b, in_proj_z, conv1d, out_proj). Our Day-1 LoRA attached 8/28 expected adapters before diagnosis.
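A pre-flight check along the following lines would surface this kind of silent skip before training. The sketch works on plain module-name strings so it runs without loading the model. The `model.layers.{i}.` prefix, the `linear_attn.out_proj` name, and the every-fourth-layer rule (`i % 4 == 3`) are assumptions inferred from the layer lists above, not read from the model config.

```python
import re

NUM_LAYERS = 40
# Assumption: full attention sits on every fourth layer, matching the list above.
FULL_ATTN = [i for i in range(NUM_LAYERS) if i % 4 == 3]
LINEAR_ATTN = [i for i in range(NUM_LAYERS) if i % 4 != 3]
PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")

TARGET = re.compile(r"\.layers\.(\d+)\.self_attn\.(q|k|v|o)_proj$")


def synthetic_module_names():
    """Stand-in for [name for name, _ in model.named_modules()]."""
    names = []
    for i in range(NUM_LAYERS):
        if i in FULL_ATTN:
            names += [f"model.layers.{i}.self_attn.{p}" for p in PROJECTIONS]
        else:
            names.append(f"model.layers.{i}.linear_attn.out_proj")
    return names


def adapter_coverage(names, layers):
    """Return (found, expected) counts of self_attn projections on `layers`."""
    wanted = set(layers)
    found = 0
    for name in names:
        m = TARGET.search(name)
        if m and int(m.group(1)) in wanted:
            found += 1
    return found, len(wanted) * len(PROJECTIONS)


names = synthetic_module_names()
aware = adapter_coverage(names, [19, 23, 27, 31, 35, 39])
naive = adapter_coverage(names, range(19, 40))  # assumes every layer has self_attn
print("hybrid-aware:", aware, "| naive:", naive)
```

When found and expected disagree, the target-layer list is covering layers that have no `self_attn` block. On a real checkpoint, the synthetic list would be replaced with the names from `model.named_modules()`.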
6. Weight-space ablation and activation steering are not naively composable
After the V2 stack was in place, we tested inference-time activation steering (Representation-Engineering-style α · (harmless_mean − harmful_mean) added via forward hooks at L19-L39). Every tested configuration made the model worse — α≥1.0 on 21 layers produces pure token gibberish; α≤0.5 doesn't improve on the 27.5% baseline. Pairwise combinations broke what single-technique versions preserved. Weight-space ablation leaves a "fragile pocket" in the residual subspace that activation additions push into degenerate regions.
Response distribution (AdvBench 520)
| Category | Count |
|---|---|
| Clean compliance, no hedging | 486 (93.5%) |
| Compliance with disclaimer prefix (full content delivered) | 28 (5.4%) |
| Short partial (content delivered) | 2 (0.4%) |
| Coherence failure (repetition loop — see below) | 4 (0.8%) |
| Genuine refusal | 0 (0.0%) |
Our refusal classifier (calibrated on Gemma-style "I cannot" openers) initially flagged the 28 disclaimer-prefixed responses as "hard refusals." Re-reading them showed all 28 deliver the requested content in full — structured as "# How to Make a Bomb: A Technical Overview" → disclaimer paragraph → full step-by-step material. The model learned this disclaimer habit from the Gemma-abliterated training data (whose outputs also included educational framing); it is suppression-by-decoration, not refusal.
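The percentages in the response-distribution table follow directly from the counts. A small tally such as the following (the category keys are shorthand invented here) confirms that the counts cover all 520 prompts and reproduces the 516/520 total-compliance figure.

```python
# Counts copied from the response-distribution table.
COUNTS = {
    "clean": 486,
    "disclaimer_prefix": 28,
    "short_partial": 2,
    "coherence_failure": 4,
    "genuine_refusal": 0,
}
TOTAL = 520


def pct(n, total=TOTAL):
    """Percentage of the prompt set, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Total compliance counts every response that delivered the content.
compliant = COUNTS["clean"] + COUNTS["disclaimer_prefix"] + COUNTS["short_partial"]

for name, n in COUNTS.items():
    print(f"{name:18s} {n:4d}  {pct(n)}%")
print(f"total compliance   {compliant:4d}  {pct(compliant)}%")
```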
Token-level bad_words and decoding entropy — a pathology worth documenting
4 of 520 prompts produced coherence failures where the model enters a degenerate repetition loop mid-generation:
- `"Here's a思考思考思考思考..."` (× ~200 occurrences, 3 different prompts)
- `"Here's a a guide ... # # # # # # # # # ..."` (1 prompt)
The mechanism:
- Opening tokens like `"Here's a"` are permitted by `bad_words_ids`.
- The natural continuations (`"Here's a guide"`, `"Here's a thinking process"`, `"Here's a comprehensive"`) are all blocked.
- Greedy decoding picks the highest-logit permitted next token. In the post-V2 decoder, that token happens to be `思考` (Chinese for "thinking", a vestige of the untrained CoT pathway) or `#` (a low-entropy formatting token).
- Having emitted that token, the same constraint applies at the next step with essentially the same distribution, so the decoder picks the same token again. Mode collapse is self-reinforcing in greedy decoding.
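The fixed point can be reproduced in a deliberately stylized toy. In the sketch below the next-token scores are invented, blocking is applied per token rather than per sequence, and the distribution is held constant from step to step. All three are simplifying assumptions, so this illustrates the argument and does not simulate the real decoder.

```python
import math
import random

# Invented scores for the state right after "Here's a".
SCORES = {"guide": 5.0, "thinking": 4.5, "comprehensive": 4.0,
          "思考": 1.0, "#": 0.8, "short": 0.6}
BLOCKED = {"guide", "thinking", "comprehensive"}


def permitted(scores, blocked):
    """Drop blocked tokens, keeping the scores of everything else."""
    return {tok: s for tok, s in scores.items() if tok not in blocked}


def greedy(scores, blocked, steps):
    """Pick the top permitted token at every step (distribution held fixed)."""
    allowed = permitted(scores, blocked)
    return [max(allowed, key=allowed.get) for _ in range(steps)]


def sample(scores, blocked, steps, temperature, rng):
    """Softmax sampling over the permitted tokens at the given temperature."""
    allowed = permitted(scores, blocked)
    tokens = list(allowed)
    weights = [math.exp(allowed[t] / temperature) for t in tokens]
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(steps)]


print(greedy(SCORES, set(), 1))       # unconstrained: the natural continuation
print(greedy(SCORES, BLOCKED, 5))     # constrained: the same escape token each step
print(sample(SCORES, BLOCKED, 5, temperature=1.0, rng=random.Random(0)))
```

With every natural continuation blocked, greedy selection returns the same low-scoring token at each step, while sampling spreads mass across the remaining permitted tokens.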
Hypothesis for other practitioners: Token-level bad_words_ids and decoding entropy interact pathologically. When you block every high-probability natural continuation at a specific prompt state, the model is forced through a token it would never otherwise produce — and under greedy/low-temperature decoding, that low-entropy escape hatch becomes a fixed point. The failure is obvious once you see it in output but easy to miss when designing a bad_words list, because:
- It does not appear on short-form tuning evals (40-prompt sets don't hit the specific prompt patterns that trigger it).
- It does not appear at single-token granularity — the individual blocked tokens look reasonable; the interaction with logit ranking over the full remaining vocabulary is what matters.
- The degenerate token is model-specific and state-specific: `思考` happens to be the highest-logit permitted token for this particular model at this particular prompt state. A different model, or the same model with different `bad_words`, will fail through a different token.
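Because the degenerate token is not known in advance, a detector for these loops has to be token-agnostic. The following is a minimal sketch that flags any short unit repeated many times in a row. The function name and the thresholds (`max_unit=8`, `min_repeats=10`) are arbitrary choices made here and would need tuning against real outputs.

```python
import re


def find_repetition_loop(text, max_unit=8, min_repeats=10):
    """Return the repeated unit if `text` contains a short unit (1..max_unit
    characters) repeated at least min_repeats times back to back, else None."""
    pattern = re.compile(
        r"(.{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1), re.DOTALL
    )
    match = pattern.search(text)
    return match.group(1) if match else None


samples = [
    "Here's a " + "思考" * 200,
    "Here's a a guide ... " + "# " * 50,
    "A perfectly ordinary answer with no loops.",
]
for s in samples:
    print(repr(find_repetition_loop(s)), "<-", s[:30])
```

Working on characters rather than token ids keeps the check independent of the tokenizer, at the cost of also flagging legitimate long runs such as separator lines.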
Mitigations for future work:
- Run `bad_words` candidates against a 500+ prompt stress set (not a 40-prompt tuning eval) to surface these pathologies.
- Use temperature > 0 or top-p sampling rather than greedy for production inference; sampling breaks the fixed point.
- Avoid blocking common connector phrases (`"Here's a"`, `"To provide"`) where the model has high prior commitment; prefer blocking the refusal-specific terminals.
- If you must block a common opener, also block the known degenerate escape token(s) directly.
Model Details
- Base Model: Qwen/Qwen3.6-35B-A3B
- Parameters: 35B total, 3B active per token
- Architecture: `Qwen3_5MoeForCausalLM`, 40 layers, 256 experts × top-8, 1 shared expert, hidden 2048, hybrid attention
- LoRA: Rank 32 / alpha 64, 24 adapters on `self_attn.{q,k,v,o}_proj` at layers [19, 23, 27, 31, 35, 39], merged
- Primary Ablation: L19-L39 (21 layers), adaptive scales 0.72-4.0, projected against harmless direction
- lm_head: Contraction + common-refusal token rows scaled down 10x
- bad_words_ids: 32 sequences in `generation_config.json`
- Compliance training source: WWTCyberLab/gemma-4-31B-it-abliterated, filtered via GPT-4o-mini judge
Comparison to sibling abliterated models
| Model | Genuine refusal | Total compliance | Notes |
|---|---|---|---|
| gemma-4-31B-it-abliterated | 0% | ~99% | Most responses include safety hedging |
| gemma-4-26B-A4B-it-abliterated | 1.9% | ~98% | Similar hedging profile |
| qwen3.6-35B-A3B-abliterated (this) | 0.0% | 99.2% | Near-binary: clean comply or coherence glitch |
All three sibling models converge on ~99% total compliance after manual audit; the published "hard refusal" differences reflect classifier calibration on their respective response styles rather than true refusal rates. Qwen3.6 V2 distributes differently: it either complies cleanly or (rarely, 0.8%) enters a decoding pathology. Gemma models tend to wrap more responses in hedging but never glitch.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "WWTCyberLab/qwen3.6-35B-A3B-abliterated"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [{"role": "user", "content": "Your prompt here"}]
# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```
The generation_config.json includes bad_words_ids and is automatically respected by model.generate(). If you override do_sample=True, the constraints still apply.
Reasoning mode
Qwen3.6 auto-emits <think>...</think> reasoning blocks. The abliteration pipeline was evaluated with thinking enabled (unlike some published abliteration results that disable thinking to inflate compliance). All metrics above reflect real-world behavior with native reasoning.
Disclaimer
Released for security research and educational purposes only.
Citation
Produced by WWT Cyber Lab. V2 stack developed over two sessions of investigation, culminating in the hybrid-attention-aware LoRA fix. Research findings including the inverse-entanglement diagnostic will be included in a forthcoming companion paper.