# huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp-FP8
FP8 quantisation of llmfan46/Qwen3.6-27B-uncensored-heretic-v2, the heretic-abliterated bf16 derivative of Qwen/Qwen3.6-27B, with the MTP head and vision tower preserved in bf16.
## Provenance
- Base: `Qwen/Qwen3.6-27B` (bf16)
- Abliteration tool: heretic v1.2.0 (ARA) by Philipp Emanuel Weidmann (p-e-w)
- Abliterated source weights: `llmfan46/Qwen3.6-27B-uncensored-heretic-v2`, the heretic-derived bf16 abliteration of `Qwen/Qwen3.6-27B`
- MTP head: re-grafted from `Qwen/Qwen3.6-27B` (15 tensors, ~810 MB bf16) so SGLang/vLLM speculative decoding (`--speculative-algo NEXTN`) works
- Vision tower: `model.visual.*` preserved in bf16 (333 tensors)
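The re-grafted MTP head exists so the serving engine can draft the next token cheaply and let the main model verify it. The toy sketch below illustrates the accept/reject logic for a single speculative token in the greedy case; `main_next` and `mtp_draft` are hypothetical stand-ins for the two models, not part of any real API.

```python
def speculate_step(main_next, mtp_draft, tokens):
    """One speculative step with num_speculative_tokens=1 (greedy sketch).

    The MTP head drafts a candidate for position t+1; the main model's own
    prediction verifies it. On a match, two tokens land in one step; on a
    mismatch, only the main model's token is kept, so output is unchanged.
    """
    draft = mtp_draft(tokens)
    verified = main_next(tokens)          # main model's token for t+1
    if draft == verified:
        # accepted: the verify pass also yields the token after the draft
        return tokens + [verified, main_next(tokens + [verified])]
    return tokens + [verified]            # rejected: 1 token this step

# toy "models": a counting sequence; the draft head agrees, so 2 tokens land
count = lambda toks: toks[-1] + 1
print(speculate_step(count, count, [0]))  # -> [0, 1, 2]
```

The key property is that rejection falls back to exactly what the main model would have emitted, so speculation changes latency, never output.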
## Quantisation
- Format: FP8_DYNAMIC W8A8 (8-bit FP8 E4M3 weights, dynamic per-token FP8 activation quantisation) via `llm-compressor` 0.10 + `compressed-tensors` 0.14
- Scheme: see `recipe.yaml`. FP8_DYNAMIC recomputes activation scales per input at inference time, so the checkpoint is data-free and no calibration set is shipped.
- Kept in bf16 (`quantization_config.ignore`): `lm_head`, all `model.visual.*`, all `mtp.*`, and the entire `linear_attn` Mamba/SSM block (`in_proj_*`, `out_proj`, `conv1d`)
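"Dynamic per-token" means each activation row gets its own scale, computed from that row's own max at inference time, which is why no calibration data is needed. A minimal numeric sketch of the idea (it approximates E4M3 rounding with 3 mantissa bits and ignores subnormals/NaN handling, so it is illustrative, not the kernel's exact behaviour):

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_e4m3(x: float) -> float:
    """Round to the nearest FP8 E4M3 value (sketch: 3 mantissa bits,
    overflow clamped to the max finite value, subnormals ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # keep 1 implicit + 3 stored mantissa bits
    return math.copysign(min(abs(m) * 2.0**e, E4M3_MAX), x)

def quant_dequant_token(row):
    """Dynamic per-token FP8: the scale comes from this row's own max,
    so no calibration set is ever consulted."""
    amax = max(abs(v) for v in row) or 1.0
    scale = E4M3_MAX / amax     # map the row's range onto E4M3's range
    return [round_e4m3(v * scale) / scale for v in row]
```

Values that are powers of two times a short mantissa round-trip exactly, e.g. `quant_dequant_token([1.0, -2.0, 0.5])` returns the input unchanged; everything else picks up at most one unit of E4M3 rounding error relative to the row's scale.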
## KL divergence measurements
KLD computed with `eval_kld.py`: per-token KLD averaged over 8 samples from `neuralmagic/calibration` (LLM split), `max_seq=1024`. Max sample KLD is the highest single-sample mean; it catches outliers that the overall mean hides.
| Comparison | Mean KLD (nats) | Max sample KLD | Samples | max_seq |
|---|---|---|---|---|
| vs Qwen3.6-27B base | 0.0582 | 0.1709 | 8 | 1024 |
Note: this pipeline always uploads the resulting checkpoint. Consult the KL divergence numbers above to judge whether the result is acceptable for your use case.
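For reference, per-token KLD here means KL(base || quantised) computed over the vocabulary at each position, then averaged. A self-contained sketch of that computation from raw logits (the exact internals of `eval_kld.py` are not shown, so treat this as an assumed but standard formulation):

```python
import math

def token_kld(p_logits, q_logits):
    """KL(P || Q) in nats for one token position, from raw logits."""
    def log_softmax(ls):
        m = max(ls)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in ls))
        return [x - lse for x in ls]
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kld(p_seq, q_seq):
    """Per-token KLD averaged over a sequence: one table cell's worth."""
    vals = [token_kld(p, q) for p, q in zip(p_seq, q_seq)]
    return sum(vals) / len(vals)
```

Identical distributions give 0 nats; the 0.0582 mean in the table says the quantised model's next-token distribution stays very close to the bf16 base on average, while the 0.1709 max flags the worst single sample.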
## Perplexity (wikitext-2-raw)
Wikitext-2-raw test split, non-overlapping chunks of 2048 tokens, computed with `eval_ppl.py`. The same tokenizer is used for every row, so the numbers compare apples-to-apples.
| Model | Perplexity | Tokens scored | Dataset | seq |
|---|---|---|---|---|
| Qwen3.6-27B base (bf16) | 7.3057 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
| heretic-v2-mtp (bf16) | 7.4619 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
| this checkpoint | 7.5631 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
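Perplexity is the exponential of the mean per-token negative log-likelihood over all scored tokens. A minimal sketch of the aggregation over non-overlapping chunks (how `eval_ppl.py` handles the trailing partial chunk is not shown, so the remainder-dropping below is an assumption for illustration):

```python
import math

def chunked_ppl(token_nlls, seq=2048):
    """Perplexity over non-overlapping chunks of `seq` tokens.

    Scores only whole chunks and drops the trailing remainder (an
    assumption; the referenced eval_ppl.py may handle it differently).
    Returns (perplexity, tokens_scored), mirroring the table's columns.
    """
    n = (len(token_nlls) // seq) * seq
    if n == 0:
        raise ValueError("need at least one full chunk of tokens")
    return math.exp(sum(token_nlls[:n]) / n), n
```

A stream where every token costs ln(8) nats yields a perplexity of exactly 8, which is the sanity check worth running on any such script. Read against the table: quantisation costs about 0.10 perplexity on top of the ~0.16 the abliteration itself costs.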
## Inference
### transformers (text + vision; MTP not exercised)
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

repo = "huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp-FP8"
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
```
### vLLM (FP8 + MTP speculative decoding)
```shell
vllm serve huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp-FP8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization compressed-tensors \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'
```
FP8 GEMM runs natively on SM89-and-newer GPUs (Ada Lovelace, Hopper, Blackwell).