Qwen3.6-35B-A3B-uncensored-heretic-FP8

Native FP8 E4M3 block-wise quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic, matching the official Qwen FP8 format exactly.

             BF16         FP8 (this)
Size         66 GB        34 GB
tok/s        ~180         226
Format       bfloat16     float8_e4m3fn

Recommended Sampling Parameters

This model ignores the enable_thinking: false chat template kwarg; it always generates <think>...</think> blocks regardless. To suppress thinking at the logit level, use logit_bias to ban the think tokens:

# Non-thinking / instruct mode (recommended for this model)
from openai import OpenAI

# Point the OpenAI-compatible client at the vLLM server (see "Usage with vLLM" below)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Write a haiku about quantization."}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=messages,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
        "logit_bias": {
            "248068": -100,  # suppress <think>
            "248069": -100,  # suppress </think>
        },
    },
)

Without logit_bias, the model will emit thinking tokens that waste compute. The --reasoning-parser qwen3 vLLM flag can separate them into reasoning_content, but does not prevent their generation.

These follow the Qwen3.5 recommended settings for instruct/non-thinking mode.
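
If you instead launch vLLM with --reasoning-parser qwen3 (and skip the logit_bias suppression), the think tokens are parsed into a separate field. The following is a minimal sketch reusing the client defined above; reasoning_content is a vLLM extension to the OpenAI response schema, not a standard field:

# Sketch: reading vLLM's separated reasoning output. Assumes the server
# was started with --reasoning-parser qwen3 and that logit_bias was NOT
# used, so the model still emits <think>...</think> tokens.
response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
message = response.choices[0].message
print(message.reasoning_content)  # text parsed out of <think>...</think>
print(message.content)            # the final answer without think tokens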

Usage with vLLM

vllm serve protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.85 \
    --chat-template <path-to-nothink-template> \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --language-model-only

Key flags:

  • --language-model-only: required for the Mamba/SSM hybrid architecture
  • --chat-template: use a Qwen3.5 nothink template to avoid prompting for thinking (though the model still generates think tokens; use logit_bias above to actually suppress them)
  • --enable-auto-tool-choice --tool-call-parser qwen3_xml: enables structured tool calling (see the sketch after this list)
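
As a concrete example of the tool-calling path referenced above, here is a minimal sketch against a server launched with the flags shown; the get_weather tool is a hypothetical placeholder, not something shipped with this repo:

# Sketch: structured tool calling through the OpenAI-compatible API.
# Assumes the server was launched with --enable-auto-tool-choice and
# --tool-call-parser qwen3_xml as shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# When the model decides to call the tool, the parsed call shows up here
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)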

Known Issues

  • Always thinks: The model generates <think>...</think> blocks regardless of chat template settings. Use logit_bias: {"248068": -100, "248069": -100} to suppress at the logit level.
  • Broken vision: Image inputs produce degenerate output (repeated "!!!!" characters). This is inherited from the base model. Use --language-model-only and do not pass image content.

Quantization Details

  • Method: Block-wise FP8 E4M3 with [128, 128] per-block scaling (see the sketch after this list)
  • Scale convention: weight_scale_inv (direct scale value, matching Qwen/DeepSeek convention)
  • MoE handling: Packed 3D expert tensors unpacked to per-expert 2D weights; fused gate_up_proj split into separate gate_proj + up_proj
  • Preserved in BF16: Embeddings, LM head, LayerNorm, MoE router gates, Mamba/SSM parameters (conv1d, A_log, D, dt_bias, in_proj), shared expert gates (402 modules total via modules_to_not_convert)
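
The per-block scaling referenced in the Method bullet can be sketched as follows. This is a simplified illustration assuming PyTorch's float8_e4m3fn dtype and weight dimensions that are exact multiples of 128; it is not the exact code from the quantization script:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def quantize_blockwise_fp8(weight: torch.Tensor, block: int = 128):
    """Quantize a 2D weight to FP8 E4M3 with [block, block] per-block scales.

    Returns the FP8 tensor plus the per-block scale tensor stored as
    weight_scale_inv (the direct scale value, per the Qwen/DeepSeek
    convention noted above). Edge blocks and padding are omitted here.
    """
    rows, cols = weight.shape
    w = weight.to(torch.float32).reshape(rows // block, block, cols // block, block)
    # One scale per [128, 128] block, derived from the block's absolute maximum
    amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    weight_scale_inv = scale.reshape(rows // block, cols // block)
    return q.reshape(rows, cols), weight_scale_inv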

Quantization Script

The quantization script (quantize_native_fp8_lowmem.py) is included in this repo. It processes safetensors shards one tensor at a time, peaking at ~4GB RAM regardless of model size.

python quantize_native_fp8_lowmem.py llmfan46/Qwen3.6-35B-A3B-uncensored-heretic

Key features:

  • Shard-by-shard streaming: never loads the full model into memory (see the sketch after this list)
  • Auto-generates modules_to_not_convert for config.json
  • Unpacks packed MoE experts and fused gate/up projections
  • Produces format identical to official Qwen FP8 releases
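
A rough sketch of the streaming pass, assuming the safetensors library and the quantize_blockwise_fp8 helper sketched earlier; the shard glob, the skip rule, and the output layout here are simplified placeholders rather than the script's exact behaviour:

import glob

from safetensors import safe_open
from safetensors.torch import save_file

# Sketch only: one shard and one tensor at a time. The real
# quantize_native_fp8_lowmem.py also unpacks packed MoE experts, splits
# fused gate_up_proj weights, and writes modules_to_not_convert to config.json.
for shard_path in sorted(glob.glob("model-*.safetensors")):
    out_tensors = {}
    with safe_open(shard_path, framework="pt") as shard:
        for name in shard.keys():
            tensor = shard.get_tensor(name)  # only this tensor is held in RAM
            if tensor.ndim == 2 and "proj" in name:  # placeholder skip rule
                q, scale = quantize_blockwise_fp8(tensor)
                out_tensors[name] = q
                out_tensors[name.replace(".weight", ".weight_scale_inv")] = scale
            else:
                out_tensors[name] = tensor  # left in BF16 (modules_to_not_convert)
    save_file(out_tensors, shard_path.replace(".safetensors", "-fp8.safetensors"))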

Hardware Tested

  • NVIDIA RTX PRO 6000 Blackwell (96GB VRAM, SM 12.0)
  • CUDA 12.8, Driver 595.45.04
  • vLLM 0.20.0

Credits
