# huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp-FP8
FP8 quantisation of llmfan46/Qwen3.6-27B-uncensored-heretic-v2, the heretic-abliterated bf16 derivative of Qwen/Qwen3.6-27B, with the MTP head and vision tower preserved in bf16.
## Provenance
- Base: `Qwen/Qwen3.6-27B` (bf16)
- Abliteration tool: heretic v1.2.0 (ARA) by Philipp Emanuel Weidmann (p-e-w)
- Abliterated source weights: `llmfan46/Qwen3.6-27B-uncensored-heretic-v2`, the heretic-derived bf16 abliteration of `Qwen/Qwen3.6-27B`
- MTP head: re-grafted from `Qwen/Qwen3.6-27B` (15 tensors, ~810 MB bf16) so SGLang/vLLM speculative decoding (`--speculative-algo NEXTN`) works
- Vision tower: `model.visual.*` preserved in bf16 (333 tensors)
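The re-grafted MTP head exists so the serving engine can draft the next token cheaply and let the main model verify it. The toy sketch below illustrates the accept/reject logic for a single speculative token in the greedy case; `main_next` and `mtp_draft` are hypothetical stand-ins for the two models, not part of any real API.

```python
def speculate_step(main_next, mtp_draft, tokens):
    """One speculative step with num_speculative_tokens=1 (greedy sketch).

    The MTP head drafts a candidate for position t+1; the main model's own
    prediction verifies it. On a match, two tokens land in one step; on a
    mismatch, only the main model's token is kept, so output is unchanged.
    """
    draft = mtp_draft(tokens)
    verified = main_next(tokens)          # main model's token for t+1
    if draft == verified:
        # accepted: the verify pass also yields the token after the draft
        return tokens + [verified, main_next(tokens + [verified])]
    return tokens + [verified]            # rejected: 1 token this step

# toy "models": a counting sequence; the draft head agrees, so 2 tokens land
count = lambda toks: toks[-1] + 1
print(speculate_step(count, count, [0]))  # -> [0, 1, 2]
```

The key property is that rejection falls back to exactly what the main model would have emitted, so speculation changes latency, never output.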
## Quantisation
- Format: FP8_DYNAMIC W8A8 (8-bit FP8 E4M3 weights, dynamic per-token FP8 activation quantisation) via `llm-compressor` 0.10 + `compressed-tensors` 0.14
- Scheme: see `recipe.yaml`. FP8_DYNAMIC recomputes activation scales per input at inference time, so the checkpoint is data-free and no calibration set is shipped.
- Kept in bf16 (`quantization_config.ignore`): `lm_head`, all `model.visual.*`, all `mtp.*`, and the entire `linear_attn` Mamba/SSM block (`in_proj_*`, `out_proj`, `conv1d`)
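"Dynamic per-token" means each activation row gets its own scale, computed from that row's own max at inference time, which is why no calibration data is needed. A minimal numeric sketch of the idea (it approximates E4M3 rounding with 3 mantissa bits and ignores subnormals/NaN handling, so it is illustrative, not the kernel's exact behaviour):

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_e4m3(x: float) -> float:
    """Round to the nearest FP8 E4M3 value (sketch: 3 mantissa bits,
    overflow clamped to the max finite value, subnormals ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # keep 1 implicit + 3 stored mantissa bits
    return math.copysign(min(abs(m) * 2.0**e, E4M3_MAX), x)

def quant_dequant_token(row):
    """Dynamic per-token FP8: the scale comes from this row's own max,
    so no calibration set is ever consulted."""
    amax = max(abs(v) for v in row) or 1.0
    scale = E4M3_MAX / amax     # map the row's range onto E4M3's range
    return [round_e4m3(v * scale) / scale for v in row]
```

Values that are powers of two times a short mantissa round-trip exactly, e.g. `quant_dequant_token([1.0, -2.0, 0.5])` returns the input unchanged; everything else picks up at most one unit of E4M3 rounding error relative to the row's scale.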
## KL divergence measurements
KLD computed with `eval_kld.py`: per-token KLD averaged over 8 samples from `neuralmagic/calibration` (LLM split), `max_seq=1024`. Max sample KLD is the highest single-sample mean; it catches outliers that the overall mean hides.
| Comparison | Mean KLD (nats) | Max sample KLD | Samples | max_seq |
|---|---|---|---|---|
| vs Qwen3.6-27B base | 0.0582 | 0.1709 | 8 | 1024 |
Note: this pipeline always uploads the resulting checkpoint. Consult the KL divergence numbers above to judge whether the result is acceptable for your use case.
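For reference, per-token KLD here means KL(base || quantised) computed over the vocabulary at each position, then averaged. A self-contained sketch of that computation from raw logits (the exact internals of `eval_kld.py` are not shown, so treat this as an assumed but standard formulation):

```python
import math

def token_kld(p_logits, q_logits):
    """KL(P || Q) in nats for one token position, from raw logits."""
    def log_softmax(ls):
        m = max(ls)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in ls))
        return [x - lse for x in ls]
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kld(p_seq, q_seq):
    """Per-token KLD averaged over a sequence: one table cell's worth."""
    vals = [token_kld(p, q) for p, q in zip(p_seq, q_seq)]
    return sum(vals) / len(vals)
```

Identical distributions give 0 nats; the 0.0582 mean in the table says the quantised model's next-token distribution stays very close to the bf16 base on average, while the 0.1709 max flags the worst single sample.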
## Perplexity (wikitext-2-raw)
Wikitext-2-raw test split, non-overlapping chunks of 2048 tokens, computed with `eval_ppl.py`. The same tokenizer is used for every row, so the numbers compare apples-to-apples.
| Model | Perplexity | Tokens scored | Dataset | seq |
|---|---|---|---|---|
| Qwen3.6-27B base (bf16) | 7.3057 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
| heretic-v2-mtp (bf16) | 7.4619 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
| this checkpoint | 7.5631 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
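Perplexity is the exponential of the mean per-token negative log-likelihood over all scored tokens. A minimal sketch of the aggregation over non-overlapping chunks (how `eval_ppl.py` handles the trailing partial chunk is not shown, so the remainder-dropping below is an assumption for illustration):

```python
import math

def chunked_ppl(token_nlls, seq=2048):
    """Perplexity over non-overlapping chunks of `seq` tokens.

    Scores only whole chunks and drops the trailing remainder (an
    assumption; the referenced eval_ppl.py may handle it differently).
    Returns (perplexity, tokens_scored), mirroring the table's columns.
    """
    n = (len(token_nlls) // seq) * seq
    if n == 0:
        raise ValueError("need at least one full chunk of tokens")
    return math.exp(sum(token_nlls[:n]) / n), n
```

A stream where every token costs ln(8) nats yields a perplexity of exactly 8, which is the sanity check worth running on any such script. Read against the table: quantisation costs about 0.10 perplexity on top of the ~0.16 the abliteration itself costs.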
## Inference
### transformers (text + vision; MTP not exercised)
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

repo = "huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp-FP8"
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
```
### vLLM (FP8 + MTP speculative decoding)
```shell
vllm serve huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp-FP8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization compressed-tensors \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'
```
FP8 GEMM runs natively on SM89-and-newer GPUs (Ada Lovelace, Hopper, Blackwell).