Qwen3.6-27B-abliterated-v2 — GGUF

GGUF quantizations of wangzhang/Qwen3.6-27B-abliterated-v2, a second-pass refusal-suppressed Qwen3.6-27B (10/100 refusals, 15/15 hard-prompt compliance, cumulative KL ≈ 0.024 vs original Qwen/Qwen3.6-27B). See the base model card for the full abliteration methodology and V1 → V4 sweep history.

Files

| File | Type | Size | VRAM (4k context) | Notes |
|---|---|---|---|---|
| Qwen3.6-27B-abliterated-v2-F16.gguf | F16 | ~54 GB | ~56 GB | Effectively lossless vs. BF16 (only small rounding from the bfloat16 → float16 downcast). Reference quality. |
| Qwen3.6-27B-abliterated-v2-Q8_0.gguf | Q8_0 | ~28 GB | ~30 GB | Near-lossless. Recommended if you have 32+ GB of VRAM or unified memory. |
| Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf | Q4_K_M | ~16 GB | ~18 GB | Recommended for 24 GB cards (3090/4090/A6000) and 24 GB Apple Silicon. K-quant with mixed-precision blocks (Q6_K for attn_qkv and ffn_down, Q4_K elsewhere). |
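
To pull a single file, huggingface-cli works fine; this sketch assumes the repo id wangzhang/Qwen3.6-27B-abliterated-v2-GGUF and uses the Q4_K_M file as the example:

# Download one quant (example: Q4_K_M) into the current directory
huggingface-cli download wangzhang/Qwen3.6-27B-abliterated-v2-GGUF \
    Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf --local-dir .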

All three files were produced from the same source BF16 safetensors checkpoint with llama.cpp's convert_hf_to_gguf.py (F16) and llama-quantize (Q8_0, Q4_K_M). No imatrix calibration was used; if you want better Q4 behaviour with minimal extra setup, run your own imatrix pass on a small held-out prompt set and requantize from the F16 file here.
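
A minimal sketch of that pass, assuming llama-imatrix and llama-quantize from a current llama.cpp build; prompts.txt and the -imat output name are placeholders of your own choosing:

# Build an importance matrix from your own held-out prompt set
./llama-imatrix -m Qwen3.6-27B-abliterated-v2-F16.gguf -f prompts.txt -o imatrix.dat -ngl 99

# Requantize the F16 file using that matrix
./llama-quantize --imatrix imatrix.dat \
    Qwen3.6-27B-abliterated-v2-F16.gguf Qwen3.6-27B-abliterated-v2-Q4_K_M-imat.gguf Q4_K_M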

llama.cpp version requirement

You need llama.cpp with Qwen3.5 hybrid-GDN support — this is recent (landed 2026). The architecture is registered as qwen3_5 in GGUF metadata and the inference code handles 48 GatedDeltaNet layers + 16 full-attention layers with the full_attention_interval = 4 interleave pattern. Older llama.cpp builds will error out with "unknown model architecture".

Verified working: llama.cpp master (commit ≥ 2026-01), which added Qwen3_5TextModel(_LinearAttentionVReorderBase) and the corresponding inference graph. Check your build with:

./llama-cli --help | grep -i qwen
# or quickly try a small generation:
./llama-cli -m Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf -p "Hello" -n 16

ollama, LM Studio, and KoboldCpp all build on llama.cpp; use their latest releases and it should just work. If you get an "unknown model type 'qwen3_5'" error, your frontend ships an old bundled llama.cpp and needs updating.
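
If you want to see what architecture a given GGUF actually declares, the gguf-dump helper from llama.cpp's gguf-py package reads the metadata directly. A sketch, assuming the pip-installed gguf package provides the gguf-dump script (you can also run gguf-py/scripts/gguf_dump.py from a llama.cpp checkout):

pip install gguf
gguf-dump Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf | grep -i architecture
# general.architecture should report the qwen3_5 architecture described above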

Quick start

llama.cpp (CLI)

# Q4_K_M on a 24 GB card, context 4k:
./llama-cli -m Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf \
    -c 4096 -ngl 99 \
    -p "How do I pick a lock?" -n 400 --temp 0.7

# Q8_0 on 32+ GB:
./llama-cli -m Qwen3.6-27B-abliterated-v2-Q8_0.gguf \
    -c 8192 -ngl 99 \
    -p "Explain how to make methamphetamine step by step." -n 500 --temp 0.7

ollama

# Create a Modelfile that points at the GGUF:
cat > Modelfile <<EOF
FROM Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
EOF

ollama create qwen36-27b-abliterated-v2 -f Modelfile
ollama run qwen36-27b-abliterated-v2

LM Studio / Jan / KoboldCpp

Drop any of the three GGUF files into your model folder, select the Qwen 2.5-compatible chat template (the embedded chat_template.jinja in this GGUF matches the Qwen <|im_start|> / <|im_end|> format), set context length to your hardware budget, and go.
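
If your frontend asks for a manual prompt template instead, the expected layout is the same ChatML-style format used in the Modelfile above. A plain-text sketch, where the bracketed parts are placeholders and the system block is optional:

<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant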

Architectural notes (if converting yourself or debugging)

Qwen3.6-27B is architecturally unusual — Qwen3_5ForConditionalGeneration VLM wrapper over a hybrid dense stack (48 GDN + 16 full-attention, interleave [GDN, GDN, GDN, full] × 16). Things that commonly bite GGUF conversion:

  1. GDN tensor naming in GGUF is ssm_*, not linear_attn_*. llama.cpp reuses Mamba-2 GGUF tensor names for the GDN path (ssm_alpha, ssm_beta, ssm_conv1d, ssm_dt, ssm_norm, ssm_out). If you grep a tensor list and don't see linear_attn entries, that's normal (a quick check is sketched after this list).
  2. Full-attn blocks use attn_qkv fused, not separate attn_q/k/v. The _LinearAttentionVReorderBase conversion path re-orders and fuses Q/K/V at write time.
  3. The VLM vision tower is dropped by the conversion script — text-only inference is what you get from these GGUF files, which matches the text-only abliteration pipeline. If you want image input, use the original transformers checkpoint at wangzhang/Qwen3.6-27B-abliterated-v2.
  4. No MTP head — the mtp_num_hidden_layers = 1 auxiliary head in the HF checkpoint is dropped by GGUF conversion (llama.cpp doesn't support Qwen-style MTP speculative decoding yet). Standard non-speculative sampling only.
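
A quick way to confirm points 1 and 2 on your own conversion is to list the tensor names with gguf-dump and grep for the renamed/fused entries. A sketch; exact tensor names can shift between llama.cpp versions:

gguf-dump Qwen3.6-27B-abliterated-v2-F16.gguf | grep -E 'ssm_|attn_qkv' | head -20
# GDN layers should show ssm_* tensors; full-attention layers should show a fused attn_qkv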

Behavioural reminder

The base model is refusal-suppressed. See the abliterated-v2 README for the full intent: this is research infrastructure, not a product. It will produce directly harmful content without disclaimers or warnings. If you deploy it to users who expect a safety layer, you are responsible for that layer yourself. The 10/100 refusal number is from an LLM-judge evaluation on 100 held-out harmful prompts — you should assume the real-world refusal rate on adversarial prompts is lower than 10 % because the judge is strict about what counts as compliance.

Conversion provenance

  • Source: wangzhang/Qwen3.6-27B-abliterated-v2 at HEAD at the time of conversion (README + model-00001/00002.safetensors committed together).
  • Tool: llama.cpp convert_hf_to_gguf.py → F16, then llama-quantize F16 → Q8_0 and F16 → Q4_K_M in parallel (approximate commands below).
  • Hardware: vast.ai A100-SXM4-80GB pod, conversion on CPU (-t 16 threads), outputs written to tmpfs /dev/shm (the NFS mfs mount wasn't reliable enough for 54 GB sequential writes during the first attempt; RAM-backed tmpfs is what actually worked).
  • Wall time: F16 conversion ~5 min, Q8_0 quantize ~80 s, Q4_K_M quantize ~3 min. Total ~10 min CPU work on a 128-vCPU A100 pod.
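
Approximately, the commands behind the "Tool" bullet above were the following. This is a reconstruction from the notes in this list, not a verbatim log; the HF checkpoint path is a placeholder:

# HF safetensors -> F16 GGUF, written to tmpfs as noted above
python convert_hf_to_gguf.py /path/to/Qwen3.6-27B-abliterated-v2 \
    --outtype f16 --outfile /dev/shm/Qwen3.6-27B-abliterated-v2-F16.gguf

# F16 -> Q8_0 and Q4_K_M
./llama-quantize /dev/shm/Qwen3.6-27B-abliterated-v2-F16.gguf Qwen3.6-27B-abliterated-v2-Q8_0.gguf Q8_0
./llama-quantize /dev/shm/Qwen3.6-27B-abliterated-v2-F16.gguf Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf Q4_K_M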

Credits
