# Qwen3.6-35B-A3B-uncensored-heretic-FP8
Native FP8 E4M3 block-wise quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic, matching the official Qwen FP8 format exactly.
| | BF16 | FP8 (this repo) |
|---|---|---|
| Size | 66 GB | 34 GB |
| tok/s | ~180 | 226 |
| Format | bfloat16 | float8_e4m3fn |
## Recommended Sampling Parameters
This model ignores the `enable_thinking: false` chat template kwarg; it always generates `<think>...</think>` blocks regardless. To suppress thinking at the logit level, use `logit_bias` to ban the think tokens:
```python
# Non-thinking / instruct mode (recommended for this model)
from openai import OpenAI

# OpenAI-compatible client pointed at the vLLM server started below
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Your prompt here"}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=messages,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
        "logit_bias": {
            "248068": -100,  # suppress <think>
            "248069": -100,  # suppress </think>
        },
    },
)
```
Without `logit_bias`, the model will emit thinking tokens that waste compute. The `--reasoning-parser qwen3` vLLM flag can separate them into `reasoning_content`, but it does not prevent their generation.
These follow the Qwen3.5 recommended settings for instruct/non-thinking mode.
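If you serve with `--reasoning-parser qwen3` instead of banning the think tokens, the separated reasoning is exposed on the message object. A minimal sketch, continuing the client example above; `reasoning_content` is a vLLM extension to the OpenAI schema, so it is read defensively here:

```python
# Continuing the request above: read the parsed reasoning (if any) and the reply.
# reasoning_content is a vLLM extension field, hence the getattr() fallback.
message = response.choices[0].message
reasoning = getattr(message, "reasoning_content", None)  # text from <think>...</think>
answer = message.content                                 # final assistant reply
print(answer)
```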
## Usage with vLLM
```bash
vllm serve protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --chat-template <path-to-nothink-template> \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --language-model-only
```
Key flags:
- `--language-model-only`: required for the Mamba/SSM hybrid architecture
- `--chat-template`: use a Qwen3.5 nothink template to avoid prompting for thinking (the model still generates think tokens, so use `logit_bias` as shown above to actually suppress them)
- `--enable-auto-tool-choice --tool-call-parser qwen3_xml`: enables structured tool calling (see the sketch below)
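Because the server exposes the standard OpenAI-compatible tool-calling interface, tool schemas can be passed directly on the request. A minimal sketch against the server above; the `get_weather` tool and its schema are illustrative, not part of this repo:

```python
# Minimal tool-calling sketch (the get_weather schema is illustrative only).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# With --tool-call-parser qwen3_xml, parsed calls arrive as structured tool_calls
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```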
## Known Issues
- Always thinks: The model generates `<think>...</think>` blocks regardless of chat template settings. Use `logit_bias: {"248068": -100, "248069": -100}` to suppress them at the logit level.
- Broken vision: Image inputs produce degenerate output (repeated "!!!!" characters). This is inherited from the base model. Use `--language-model-only` and do not pass image content.
## Quantization Details
- Method: Block-wise FP8 E4M3 with [128, 128] per-block scaling (see the sketch after this list)
- Scale convention: `weight_scale_inv` (direct scale value, matching the Qwen/DeepSeek convention)
- MoE handling: Packed 3D expert tensors unpacked to per-expert 2D weights; fused `gate_up_proj` split into separate `gate_proj` + `up_proj`
- Preserved in BF16: Embeddings, LM head, LayerNorm, MoE router gates, Mamba/SSM parameters (`conv1d`, `A_log`, `D`, `dt_bias`, `in_proj`), shared expert gates; 402 modules total, listed in `modules_to_not_convert`
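To make the format concrete, here is a minimal sketch of [128, 128] block-wise E4M3 quantization in PyTorch. It illustrates the convention described above rather than reproducing the repo's script; the helper name `quantize_blockwise_fp8` is invented for the example. Dequantization multiplies each FP8 block by its stored `weight_scale_inv` entry.

```python
# Minimal sketch of [128, 128] block-wise FP8 E4M3 quantization (illustrative only).
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3
BLOCK = 128

def quantize_blockwise_fp8(weight: torch.Tensor):
    """Quantize a 2D BF16 weight to FP8 E4M3 with one scale per 128x128 block."""
    rows, cols = weight.shape
    n_rb = (rows + BLOCK - 1) // BLOCK
    n_cb = (cols + BLOCK - 1) // BLOCK
    qweight = torch.empty(rows, cols, dtype=torch.float8_e4m3fn)
    scale_inv = torch.empty(n_rb, n_cb, dtype=torch.float32)
    for i in range(n_rb):
        for j in range(n_cb):
            rs, cs = i * BLOCK, j * BLOCK
            block = weight[rs:rs + BLOCK, cs:cs + BLOCK].float()
            # One scale per block: amax mapped onto the E4M3 range
            scale = block.abs().max().clamp(min=1e-12) / FP8_MAX
            qweight[rs:rs + BLOCK, cs:cs + BLOCK] = (
                (block / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
            )
            # weight_scale_inv stores the value to multiply by at dequant time
            scale_inv[i, j] = scale
    return qweight, scale_inv
```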
## Quantization Script
The quantization script (`quantize_native_fp8_lowmem.py`) is included in this repo. It processes safetensors shards one tensor at a time, peaking at ~4 GB RAM regardless of model size.
```bash
python quantize_native_fp8_lowmem.py llmfan46/Qwen3.6-35B-A3B-uncensored-heretic
```
Key features:
- Shard-by-shard streaming: never loads the full model into memory (see the sketch below)
- Auto-generates `modules_to_not_convert` for `config.json`
- Unpacks packed MoE experts and fused gate/up projections
- Produces a format identical to official Qwen FP8 releases
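A minimal sketch of the streaming pattern, not the actual script: open each safetensors shard, pull one tensor at a time onto CPU, quantize eligible 2D weights with a block-wise helper such as the one sketched above, and write the shard back out. The file paths, the `skip` set, and the helper name are illustrative.

```python
# Minimal sketch of shard-by-shard streaming quantization (illustrative only).
# Assumes the quantize_blockwise_fp8 helper from the sketch above.
from safetensors import safe_open
from safetensors.torch import save_file

def quantize_shard(in_path: str, out_path: str, skip: set[str]) -> None:
    """Stream one safetensors shard, quantizing eligible weights tensor by tensor."""
    out_tensors = {}
    with safe_open(in_path, framework="pt", device="cpu") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)  # only this tensor is resident in memory
            if name in skip or tensor.ndim != 2:
                # Keep embeddings, norms, router gates, Mamba/SSM params in BF16
                out_tensors[name] = tensor
            else:
                qweight, scale_inv = quantize_blockwise_fp8(tensor)
                out_tensors[name] = qweight
                out_tensors[name.replace(".weight", ".weight_scale_inv")] = scale_inv
    save_file(out_tensors, out_path)
```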
## Hardware Tested
- NVIDIA RTX PRO 6000 Blackwell (96GB VRAM, SM 12.0)
- CUDA 12.8, Driver 595.45.04
- vLLM 0.20.0
## Credits
- Base model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic
- Quantization: protoLabs.studio
- Upstream base model: Qwen/Qwen3.6-35B-A3B