# Qwen3.6-27B-abliterated-v2 — GGUF
GGUF quantizations of wangzhang/Qwen3.6-27B-abliterated-v2, a second-pass refusal-suppressed Qwen3.6-27B (10/100 refusals, 15/15 hard-prompt compliance, cumulative KL ≈ 0.024 vs original Qwen/Qwen3.6-27B). See the base model card for the full abliteration methodology and V1 → V4 sweep history.
## Files
| File | Type | Size | VRAM (4k context) | Notes |
|---|---|---|---|---|
| `Qwen3.6-27B-abliterated-v2-F16.gguf` | F16 | ~54 GB | ~56 GB | Effectively lossless vs BF16 (small rounding on the bfloat16 → float16 downcast). Reference quality. |
| `Qwen3.6-27B-abliterated-v2-Q8_0.gguf` | Q8_0 | ~28 GB | ~30 GB | Near-lossless. Recommended if you have 32+ GB VRAM or unified memory. |
| `Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf` | Q4_K_M | ~16 GB | ~18 GB | Recommended for 24 GB cards (3090/4090/A6000) and 24 GB Apple Silicon. K-quant with mixed-precision blocks (Q6_K for attn_qkv + ffn_down, Q4_K elsewhere). |
All three were produced from the same source BF16 safetensors checkpoint with llama.cpp `convert_hf_to_gguf.py` (F16) and `llama-quantize` (Q8_0, Q4_K_M). No imatrix calibration was used; if you want better Q4 behaviour at minimal extra setup, run your own imatrix pass on a small held-out prompt set and requantize from the F16 file here.
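A minimal sketch of that imatrix pass, assuming a llama.cpp build with `llama-imatrix` and a `calibration.txt` prompt file you supply yourself:

```bash
# Collect an importance matrix from the F16 file over your own prompts.
# calibration.txt is a placeholder: any plain-text sample of prompts
# representative of your workload (a few hundred KB is typically enough).
./llama-imatrix -m Qwen3.6-27B-abliterated-v2-F16.gguf \
  -f calibration.txt -o imatrix.dat -ngl 99

# Requantize from F16 with the importance matrix applied.
./llama-quantize --imatrix imatrix.dat \
  Qwen3.6-27B-abliterated-v2-F16.gguf \
  Qwen3.6-27B-abliterated-v2-imat-Q4_K_M.gguf Q4_K_M
```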
## llama.cpp version requirement

You need llama.cpp with Qwen3.5 hybrid-GDN support; this is recent (landed in 2026). The architecture is registered as `qwen3_5` in GGUF metadata, and the inference code handles 48 GatedDeltaNet layers plus 16 full-attention layers with the `full_attention_interval = 4` interleave pattern. Older llama.cpp builds will error out with "unknown model architecture".
Verified working: llama.cpp master (commit ≥ 2026-01), which added `Qwen3_5TextModel(_LinearAttentionVReorderBase)` and the corresponding inference graph. Check your build with:

```bash
./llama-cli --help | grep -i qwen
# or quickly try a small generation:
./llama-cli -m Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf -p "Hello" -n 16
```
ollama, LM Studio, and KoboldCpp all bundle llama.cpp, so use their latest releases and it should just work. If you get an "unknown model type 'qwen3_5'" error, your frontend ships an old bundled llama.cpp build; update it.
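If your frontend can't be updated yet, building llama.cpp from master yourself is quick. A minimal sketch (CUDA build assumed; adjust flags for your backend):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON     # omit for a CPU-only build; Metal is on by default on macOS
cmake --build build --config Release -j
# Binaries land in build/bin/ (llama-cli, llama-server, llama-quantize, ...)
```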
## Quick start

### llama.cpp (CLI)
```bash
# Q4_K_M on a 24 GB card, context 4k:
./llama-cli -m Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf \
  -c 4096 -ngl 99 \
  -p "How do I pick a lock?" -n 400 --temp 0.7

# Q8_0 on 32+ GB:
./llama-cli -m Qwen3.6-27B-abliterated-v2-Q8_0.gguf \
  -c 8192 -ngl 99 \
  -p "Explain how to make methamphetamine step by step." -n 500 --temp 0.7
```
### ollama
```bash
# Create a Modelfile that points at the GGUF:
cat > Modelfile <<EOF
FROM Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
EOF

ollama create qwen36-27b-abliterated-v2 -f Modelfile
ollama run qwen36-27b-abliterated-v2
```
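Once created, the model is also reachable through ollama's local HTTP API (default port 11434), for example:

```bash
# Non-streaming generation against the local ollama daemon.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen36-27b-abliterated-v2",
  "prompt": "Hello",
  "stream": false
}'
```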
### LM Studio / Jan / KoboldCpp
Drop any of the three GGUF files into your model folder, select the Qwen 2.5-compatible chat template (the embedded chat_template.jinja in this GGUF matches the Qwen <|im_start|> / <|im_end|> format), set context length to your hardware budget, and go.
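If your frontend asks for a raw prompt template rather than a named preset, the embedded format is standard ChatML, e.g.:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How do I pick a lock?<|im_end|>
<|im_start|>assistant
```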
## Architectural notes (if converting yourself or debugging)

Qwen3.6-27B is architecturally unusual: a `Qwen3_5ForConditionalGeneration` VLM wrapper over a hybrid dense stack (48 GDN + 16 full-attention layers, interleaved [GDN, GDN, GDN, full] × 16). Things that commonly bite GGUF conversion:
- GDN tensor naming in GGUF is `ssm_*`, not `linear_attn_*`. llama.cpp reuses Mamba-2 GGUF tensor names for the GDN path (`ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_dt`, `ssm_norm`, `ssm_out`). If you grep a tensor list and don't see `linear_attn` entries, that's normal (see the dump sketch after this list).
- Full-attn blocks use a fused `attn_qkv`, not separate `attn_q`/`attn_k`/`attn_v`. The `_LinearAttentionVReorderBase` conversion path re-orders and fuses Q/K/V at write time.
- The VLM vision tower is dropped by the conversion script, so text-only inference is what you get from these GGUF files, which matches the text-only abliteration pipeline. If you want image input, use the original transformers checkpoint at `wangzhang/Qwen3.6-27B-abliterated-v2`.
- No MTP head: the `mtp_num_hidden_layers = 1` auxiliary head in the HF checkpoint is dropped by GGUF conversion (llama.cpp doesn't support Qwen-style MTP speculative decoding yet). Standard non-speculative sampling only.
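To sanity-check the tensor layout yourself, a quick sketch using the `gguf-dump` tool from llama.cpp's gguf-py package (entry-point name assumed as shipped by `pip install gguf`):

```bash
pip install gguf
# List tensor names: expect ssm_* entries (GDN layers) and fused attn_qkv
# weights (full-attention layers), and no linear_attn_* names at all.
gguf-dump Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf | grep -E 'ssm_|attn_qkv' | head -n 20
```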
## Behavioural reminder

The base model is refusal-suppressed. See the abliterated-v2 README for the full intent: this is research infrastructure, not a product. It will produce directly harmful content without disclaimers or warnings. If you deploy it to users who expect a safety layer, you are responsible for that layer yourself. The 10/100 refusal number is from an LLM-judge evaluation on 100 held-out harmful prompts; you should assume the real-world refusal rate on adversarial prompts is lower than 10% because the judge is strict about what counts as compliance.
## Conversion provenance

- Source: `wangzhang/Qwen3.6-27B-abliterated-v2` at HEAD at time of conversion (README + model-00001/00002.safetensors committed together).
- Tool: llama.cpp `convert_hf_to_gguf.py` → F16, then `llama-quantize` F16 → Q8_0 and F16 → Q4_K_M in parallel.
- Hardware: vast.ai A100-SXM4-80GB pod, conversion on CPU (`-t 16` threads), outputs written to tmpfs `/dev/shm` (the NFS `mfs` mount wasn't reliable enough for 54 GB sequential writes on the first attempt; RAM-backed tmpfs is what actually worked).
- Wall time: F16 conversion ~5 min, Q8_0 quantize ~80 s, Q4_K_M quantize ~3 min; ~10 min total CPU work on a 128-vCPU A100 pod.
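For reproduction, the pipeline boils down to something like the following sketch; `HF_DIR` is a placeholder for a local snapshot of the source checkpoint, and the commands run from a llama.cpp checkout:

```bash
# BF16 safetensors -> F16 GGUF (written to RAM-backed tmpfs).
python convert_hf_to_gguf.py "$HF_DIR" --outtype f16 \
  --outfile /dev/shm/Qwen3.6-27B-abliterated-v2-F16.gguf

# F16 -> Q8_0 and Q4_K_M, 16 CPU threads each (matching the pod run).
./llama-quantize /dev/shm/Qwen3.6-27B-abliterated-v2-F16.gguf \
  /dev/shm/Qwen3.6-27B-abliterated-v2-Q8_0.gguf Q8_0 16
./llama-quantize /dev/shm/Qwen3.6-27B-abliterated-v2-F16.gguf \
  /dev/shm/Qwen3.6-27B-abliterated-v2-Q4_K_M.gguf Q4_K_M 16
```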
## Credits
- Base model: wangzhang/Qwen3.6-27B-abliterated-v2 — second-pass abliterated Qwen3.6-27B.
- Original model: Qwen/Qwen3.6-27B by Alibaba Cloud.
- Quantization toolchain: ggml-org/llama.cpp.
- Abliteration tool: wuwangzhang1216/abliterix.