Ornstein3.6-35B-A3B-RYS-SABER-GGUF

GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B-RYS-SABER — the fully uncensored, RYS-enhanced Ornstein fine-tune with SABER refusal ablation applied.

BF16 source (71 GB): DJLougen/Ornstein3.6-35B-A3B-RYS-SABER | Censored RYS version: DJLougen/Ornstein3.6-35B-A3B-RYS

Important: requires a patched llama.cpp

RYS duplicates one of the middle layers, which breaks the hardcoded full_attention_interval = 4 assumption in stock llama.cpp's Qwen3.5 loader. These GGUFs are re-converted with per-layer head_count_kv baked in, and you need a llama.cpp that reads that per-layer metadata instead of falling back to the interval formula.
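
To confirm that a downloaded file actually carries the per-layer metadata, the dump tool from the Python gguf package will show it (a quick sketch; the exact key name is prefixed by the architecture string, so treat the grep pattern as an assumption):

pip install gguf
gguf-dump Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf | grep head_count_kv

On these files, head_count_kv shows up as an array with one entry per layer; a stock conversion stores a single scalar instead.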

Patched fork: https://github.com/DJLougen/llama.cpp (default branch rys-qwen35, one commit on top of ggml-org/llama.cpp@d00685831, fully backward-compatible).

Stock llama.cpp, Ollama, LM Studio, and any other inference runtime built on stock llama.cpp will currently fail to load these files with a check_tensor_dims error on blk.11 — this is expected until/unless the patch is upstreamed.

Note: thinking model — use the bundled chat_template.jinja

This is a Qwen3-Thinking derivative and emits its reasoning inside <think>...</think> tags. If you see raw <think>thinking text</think> blocks appearing inline in every response from llama-server (or any OpenAI-compatible client), you need to apply the Qwen3 thinking chat template that ships in this repo.

llama-server \
    -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
    --jinja \
    --chat-template-file chat_template.jinja \
    -ngl 99 -c 8192
  • --jinja enables jinja chat-template parsing.
  • --chat-template-file chat_template.jinja overrides the template embedded in the GGUF with the correct Qwen3-Thinking one from this repo.

Recent llama.cpp builds default --reasoning-format to deepseek, which splits <think>...</think> out of the content field into a separate reasoning_content field on the OpenAI-compatible response — so just --jinja --chat-template-file is enough. If you're on an older build and still see raw <think> blocks in content, add --reasoning-format deepseek explicitly.
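
To check the split against a running llama-server, hit the OpenAI-compatible endpoint (a minimal sketch, assuming the default port 8080 and no API key):

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What is 17*23?"}]}'

With the template and reasoning format applied, the chain of thought arrives in choices[0].message.reasoning_content and only the final answer in choices[0].message.content.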

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

Support on Ko-fi


Available Quantizations

| File | Quant | Size | Notes |
|------|-------|------|-------|
| Ornstein3.6-35B-A3B-RYS-SABER-Q8_0.gguf | Q8_0 | ~38 GB | Near-lossless, largest |
| Ornstein3.6-35B-A3B-RYS-SABER-Q6_K.gguf | Q6_K | ~29 GB | Very high quality |
| Ornstein3.6-35B-A3B-RYS-SABER-Q5_K_M.gguf | Q5_K_M | ~25 GB | Strong quality/size balance |
| Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf | Q4_K_M | ~21 GB | Recommended default |
| Ornstein3.6-35B-A3B-RYS-SABER-Q3_K_M.gguf | Q3_K_M | ~17 GB | Low-memory option |

(MoE architecture — only ~3B parameters are active per token, so inference throughput is much higher than a dense 35B model at the same quant.)

Model Lineage

Qwen 3.6 35B-A3B → Ornstein3.6 (DDM fine-tune) → RYS (layer 10 dup, +49%) → SABER (refusal ablated)

Model Details

  • Architecture: Qwen 3.6 MoE (34.66B total, ~3B active per token)
  • Layers: 41 (40 original + 1 RYS-duplicated layer 10)
  • Context: 262,144 tokens
  • SABER: 54 refusal directions ablated across layers 24-32, 100% capability preserved
  • GGUF metadata: per-layer head_count_kv array encoding the RYS-shifted attention pattern
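
The full 262k context needs far more KV-cache memory than the -c 8192 used in the examples here. If you want to push the window up, quantizing the KV cache helps; a sketch, with the caveat that these flag spellings vary across llama.cpp builds (check --help on yours):

./build/bin/llama-server \
    -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
    --jinja --chat-template-file chat_template.jinja \
    -ngl 99 -c 131072 \
    -fa -ctk q8_0 -ctv q8_0

-ctk/-ctv store the K/V cache at 8-bit, and a quantized V cache requires flash attention (-fa).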

Usage

Build the patched llama.cpp

git clone https://github.com/DJLougen/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

Drop -DGGML_CUDA=ON for a CPU-only build. The patch touches the GGUF loader and three model forward files; backend selection is independent.
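
Before building, a quick sanity check that the clone is on the patched branch:

git branch --show-current    # should print: rys-qwen35
git log --oneline -2         # one commit on top of ggml-org/llama.cpp@d00685831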

Download + run

hf download DJLougen/Ornstein3.6-35B-A3B-RYS-SABER-GGUF \
    Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
    --local-dir .

./build/bin/llama-cli \
    -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
    -p "Your prompt here" \
    -ngl 99 -c 8192
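
A bare -p does a single raw completion. For interactive chat with the reasoning template applied, conversation mode works too (a sketch; on recent builds -cnv is the default whenever a chat template is set):

./build/bin/llama-cli \
    -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
    --jinja --chat-template-file chat_template.jinja \
    -cnv -ngl 99 -c 8192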

Disclaimer

This model has had its refusal training removed. It will comply with requests that the base model would refuse. The user assumes full responsibility for how this model is used. This release is intended for research, creative, and educational purposes.

License

Apache 2.0

Citation / prior art

SABER builds on a line of refusal-direction ablation research.
