Qwen3.6-35B-A3B-uncensored-heretic-FP8

Native FP8 E4M3 block-wise quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic, matching the official Qwen FP8 format exactly.

             BF16         FP8 (this)
Size         66 GB        34 GB
tok/s        ~180         226
Format       bfloat16     float8_e4m3fn

Recommended Sampling Parameters

This model ignores the enable_thinking: false chat template kwarg; it always generates <think>...</think> blocks regardless. To suppress thinking at the logit level, use logit_bias to ban the think tokens:

# Non-thinking / instruct mode (recommended for this model)
from openai import OpenAI

# Point the OpenAI-compatible client at the vLLM server (see "Usage with vLLM" below)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Write a haiku about quantization."}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=messages,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
        "logit_bias": {
            "248068": -100,  # suppress <think>
            "248069": -100,  # suppress </think>
        },
    },
)

Without logit_bias, the model will emit thinking tokens that waste compute. The --reasoning-parser qwen3 vLLM flag can separate them into reasoning_content, but does not prevent their generation.

These follow the Qwen3.5 recommended settings for instruct/non-thinking mode.
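
If you instead launch vLLM with --reasoning-parser qwen3 (and skip the logit_bias suppression), the think tokens are parsed into a separate field. The following is a minimal sketch reusing the client defined above; reasoning_content is a vLLM extension to the OpenAI response schema, not a standard field:

# Sketch: reading vLLM's separated reasoning output. Assumes the server
# was started with --reasoning-parser qwen3 and that logit_bias was NOT
# used, so the model still emits <think>...</think> tokens.
response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
message = response.choices[0].message
print(message.reasoning_content)  # text parsed out of <think>...</think>
print(message.content)            # the final answer without think tokens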

Usage with vLLM

vllm serve protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.85 \
    --chat-template <path-to-nothink-template> \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --language-model-only

Key flags:

  • --language-model-only: required for the Mamba/SSM hybrid architecture
  • --chat-template: use a Qwen3.5 nothink template to avoid prompting for thinking (though the model still generates think tokens; use logit_bias above to actually suppress them)
  • --enable-auto-tool-choice --tool-call-parser qwen3_xml: enables structured tool calling (see the sketch after this list)
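
As a concrete example of the tool-calling path referenced above, here is a minimal sketch against a server launched with the flags shown; the get_weather tool is a hypothetical placeholder, not something shipped with this repo:

# Sketch: structured tool calling through the OpenAI-compatible API.
# Assumes the server was launched with --enable-auto-tool-choice and
# --tool-call-parser qwen3_xml as shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# When the model decides to call the tool, the parsed call shows up here
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)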

Known Issues

  • Always thinks: The model generates <think>...</think> blocks regardless of chat template settings. Use logit_bias: {"248068": -100, "248069": -100} to suppress at the logit level.
  • Broken vision: Image inputs produce degenerate output (repeated "!!!!" characters). This is inherited from the base model. Use --language-model-only and do not pass image content.

Quantization Details

  • Method: Block-wise FP8 E4M3 with [128, 128] per-block scaling (see the sketch after this list)
  • Scale convention: weight_scale_inv (direct scale value, matching Qwen/DeepSeek convention)
  • MoE handling: Packed 3D expert tensors unpacked to per-expert 2D weights; fused gate_up_proj split into separate gate_proj + up_proj
  • Preserved in BF16: Embeddings, LM head, LayerNorm, MoE router gates, Mamba/SSM parameters (conv1d, A_log, D, dt_bias, in_proj), shared expert gates (402 modules total via modules_to_not_convert)
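
The per-block scaling referenced in the Method bullet can be sketched as follows. This is a simplified illustration assuming PyTorch's float8_e4m3fn dtype and weight dimensions that are exact multiples of 128; it is not the exact code from the quantization script:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def quantize_blockwise_fp8(weight: torch.Tensor, block: int = 128):
    """Quantize a 2D weight to FP8 E4M3 with [block, block] per-block scales.

    Returns the FP8 tensor plus the per-block scale tensor stored as
    weight_scale_inv (the direct scale value, per the Qwen/DeepSeek
    convention noted above). Edge blocks and padding are omitted here.
    """
    rows, cols = weight.shape
    w = weight.to(torch.float32).reshape(rows // block, block, cols // block, block)
    # One scale per [128, 128] block, derived from the block's absolute maximum
    amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    weight_scale_inv = scale.reshape(rows // block, cols // block)
    return q.reshape(rows, cols), weight_scale_inv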

Quantization Script

The quantization script (quantize_native_fp8_lowmem.py) is included in this repo. It processes safetensors shards one tensor at a time, peaking at ~4GB RAM regardless of model size.

python quantize_native_fp8_lowmem.py llmfan46/Qwen3.6-35B-A3B-uncensored-heretic

Key features:

  • Shard-by-shard streaming: never loads the full model into memory (see the sketch after this list)
  • Auto-generates modules_to_not_convert for config.json
  • Unpacks packed MoE experts and fused gate/up projections
  • Produces format identical to official Qwen FP8 releases
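
A rough sketch of the streaming pass, assuming the safetensors library and the quantize_blockwise_fp8 helper sketched earlier; the shard glob, the skip rule, and the output layout here are simplified placeholders rather than the script's exact behaviour:

import glob

from safetensors import safe_open
from safetensors.torch import save_file

# Sketch only: one shard and one tensor at a time. The real
# quantize_native_fp8_lowmem.py also unpacks packed MoE experts, splits
# fused gate_up_proj weights, and writes modules_to_not_convert to config.json.
for shard_path in sorted(glob.glob("model-*.safetensors")):
    out_tensors = {}
    with safe_open(shard_path, framework="pt") as shard:
        for name in shard.keys():
            tensor = shard.get_tensor(name)  # only this tensor is held in RAM
            if tensor.ndim == 2 and "proj" in name:  # placeholder skip rule
                q, scale = quantize_blockwise_fp8(tensor)
                out_tensors[name] = q
                out_tensors[name.replace(".weight", ".weight_scale_inv")] = scale
            else:
                out_tensors[name] = tensor  # left in BF16 (modules_to_not_convert)
    save_file(out_tensors, shard_path.replace(".safetensors", "-fp8.safetensors"))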

Hardware Tested

  • NVIDIA RTX PRO 6000 Blackwell (96GB VRAM, SM 12.0)
  • CUDA 12.8, Driver 595.45.04
  • vLLM 0.20.0

Credits
