# Ornstein3.6-35B-A3B-RYS-SABER-GGUF
GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B-RYS-SABER — the fully uncensored, RYS-enhanced Ornstein fine-tune with SABER refusal ablation applied.
BF16 source (71 GB): DJLougen/Ornstein3.6-35B-A3B-RYS-SABER | Censored RYS version: DJLougen/Ornstein3.6-35B-A3B-RYS
## Important: requires a patched llama.cpp
RYS duplicates one of the middle layers, which breaks the hardcoded full_attention_interval = 4 assumption in stock llama.cpp's Qwen3.5 loader. These GGUFs are re-converted with per-layer head_count_kv baked in, and you need a llama.cpp that reads that per-layer metadata instead of falling back to the interval formula.
Patched fork: https://github.com/DJLougen/llama.cpp (default branch rys-qwen35, one commit on top of ggml-org/llama.cpp@d00685831, fully backward-compatible).
Stock llama.cpp, Ollama, LM Studio, and any other inference runtime built on stock llama.cpp will currently fail to load these files with a check_tensor_dims error on blk.11 — this is expected until/unless the patch is upstreamed.
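The interval-vs-array issue can be sketched in a few lines of Python. This is an illustration, not llama.cpp's actual loader code: the layer counts come from this card, but the KV head counts (`N_KV_FULL`, `N_KV_LINEAR`) are hypothetical placeholders.

```python
# Why a hardcoded full_attention_interval breaks after RYS layer duplication.
FULL_ATTENTION_INTERVAL = 4   # stock loader's hardcoded assumption
N_KV_FULL, N_KV_LINEAR = 8, 4  # hypothetical per-layer-type KV head counts

def kv_heads_by_interval(layer: int) -> int:
    """Stock rule: every 4th layer is a full-attention layer."""
    return N_KV_FULL if (layer + 1) % FULL_ATTENTION_INTERVAL == 0 else N_KV_LINEAR

# The original 40-layer pattern is exactly what the interval rule predicts.
original = [kv_heads_by_interval(i) for i in range(40)]

# RYS duplicates layer 10, shifting every later layer down by one index,
# so the interval rule now mispredicts the pattern for the back half.
rys = original[:11] + original[10:]   # 41 layers, index 10 duplicated

mismatches = [i for i in range(len(rys)) if rys[i] != kv_heads_by_interval(i)]
print(f"{len(mismatches)} of {len(rys)} layers mispredicted by the interval rule")

# The fix baked into these GGUFs: store the `rys` list itself as a
# per-layer head_count_kv metadata array and read it directly.
```

The per-layer array is strictly more general: for an unmodified model it simply reproduces what the interval formula would have computed, which is why the patched loader stays backward-compatible.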
## Note: thinking model (use the bundled chat_template.jinja)
This is a Qwen3-Thinking derivative and emits its reasoning inside <think>...</think> tags. If you see raw <think>thinking text</think> blocks appearing inline in every response from llama-server (or any OpenAI-compatible client), you need to apply the Qwen3 thinking chat template that ships in this repo.
```shell
llama-server \
  -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
  --jinja \
  --chat-template-file chat_template.jinja \
  -ngl 99 -c 8192
```
`--jinja` enables Jinja chat-template parsing. `--chat-template-file chat_template.jinja` overrides the template embedded in the GGUF with the correct Qwen3-Thinking one from this repo.
Recent llama.cpp builds default --reasoning-format to deepseek, which splits <think>...</think> out of the content field into a separate reasoning_content field on the OpenAI-compatible response — so just --jinja --chat-template-file is enough. If you're on an older build and still see raw <think> blocks in content, add --reasoning-format deepseek explicitly.
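As a rough illustration (this is not llama.cpp's actual implementation), the server-side split performed by `--reasoning-format deepseek` behaves like moving a leading `<think>...</think>` span out of `content` and into `reasoning_content`:

```python
import re

def split_reasoning(raw: str) -> dict:
    """Mimic the deepseek reasoning-format split on a raw completion."""
    m = re.match(r"\s*<think>(.*?)</think>\s*", raw, flags=re.DOTALL)
    if not m:
        # No think block: everything stays in content.
        return {"content": raw, "reasoning_content": None}
    return {"content": raw[m.end():], "reasoning_content": m.group(1).strip()}

msg = split_reasoning("<think>User wants a haiku.</think>Moonlight on the snow...")
print(msg["reasoning_content"])  # the chain-of-thought
print(msg["content"])            # the visible answer
```

If your client only reads `content`, the reasoning is silently dropped, which is usually what you want in a chat UI.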
## Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
## Available Quantizations
| File | Quant | Size | Notes |
|---|---|---|---|
| Ornstein3.6-35B-A3B-RYS-SABER-Q8_0.gguf | Q8_0 | ~38 GB | Near-lossless, largest |
| Ornstein3.6-35B-A3B-RYS-SABER-Q6_K.gguf | Q6_K | ~29 GB | Very high quality |
| Ornstein3.6-35B-A3B-RYS-SABER-Q5_K_M.gguf | Q5_K_M | ~25 GB | Strong quality/size balance |
| Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf | Q4_K_M | ~21 GB | Recommended default |
| Ornstein3.6-35B-A3B-RYS-SABER-Q3_K_M.gguf | Q3_K_M | ~17 GB | Low-memory option |
(MoE architecture — only ~3B parameters are active per token, so inference throughput is much higher than a dense 35B model at the same quant.)
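A back-of-envelope calculation shows why: decode throughput is roughly bound by the bytes of weights read per token, which scales with *active* rather than total parameters. The ~4.5 bits/weight figure for Q4_K_M below is an approximation, not an exact spec.

```python
# Rough decode-speed comparison: A3B MoE vs a hypothetical dense 35B
# at the same quant. Parameter counts are from this card.
TOTAL_B, ACTIVE_B = 34.66, 3.0        # billions of params (total vs active)
BYTES_PER_PARAM_Q4 = 4.5 / 8          # ~4.5 bits/weight for Q4_K_M (approx.)

bytes_per_token_moe = ACTIVE_B * 1e9 * BYTES_PER_PARAM_Q4
bytes_per_token_dense = TOTAL_B * 1e9 * BYTES_PER_PARAM_Q4

print(f"weight bytes read per token (MoE):   {bytes_per_token_moe / 1e9:.2f} GB")
print(f"weight bytes read per token (dense): {bytes_per_token_dense / 1e9:.2f} GB")
print(f"theoretical decode speedup: ~{TOTAL_B / ACTIVE_B:.1f}x")
```

This ignores attention, KV-cache reads, and router overhead, so the real-world speedup is smaller, but the active-parameter count is the right first-order intuition.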
## Model Lineage
Qwen 3.6 35B-A3B → Ornstein3.6 (DDM fine-tune) → RYS (layer 10 dup, +49%) → SABER (refusal ablated)
## Model Details
- Architecture: Qwen 3.6 MoE (34.66B total, ~3B active per token)
- Layers: 41 (40 original + 1 RYS-duplicated layer 10)
- Context: 262,144 tokens
- SABER: 54 refusal directions ablated across layers 24-32, 100% capability preserved
- GGUF metadata: per-layer `head_count_kv` array encoding the RYS-shifted attention pattern
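The core operation behind refusal-direction ablation (in the style of Arditi et al., cited below) is projecting a weight vector onto the orthogonal complement of a "refusal direction". SABER's exact 54-direction procedure is not spelled out here; this is just the underlying math, with a made-up direction for illustration:

```python
import math

def ablate(w: list[float], r: list[float]) -> list[float]:
    """Remove the component of w along direction r: w' = w - (w . r_hat) r_hat."""
    norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / norm for x in r]            # unit refusal direction
    dot = sum(wi * ri for wi, ri in zip(w, r_hat))
    return [wi - dot * ri for wi, ri in zip(w, r_hat)]

w = [2.0, 1.0, 0.0]          # toy weight row
r = [1.0, 0.0, 0.0]          # hypothetical refusal direction
w_ablated = ablate(w, r)
print(w_ablated)             # component along r is gone: [0.0, 1.0, 0.0]
```

After ablation the vector is exactly orthogonal to the refusal direction, so activations can no longer move along it; applying this across many directions and layers is what removes refusal behavior while leaving other components untouched.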
## Usage
### Build the patched llama.cpp
```shell
git clone https://github.com/DJLougen/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
Drop -DGGML_CUDA=ON for a CPU-only build. The patch touches the GGUF loader and three model forward files; backend selection is independent.
### Download and run
```shell
hf download DJLougen/Ornstein3.6-35B-A3B-RYS-SABER-GGUF \
  Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
  --local-dir .

./build/bin/llama-cli \
  -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
  -p "Your prompt here" \
  -ngl 99 -c 8192
```
## Disclaimer
This model has had its refusal training removed. It will comply with requests that the base model would refuse. The user assumes full responsibility for how this model is used. This release is intended for research, creative, and educational purposes.
## License
Apache 2.0
## Citation / prior art
SABER builds on a line of refusal-direction research, including:
- Arditi et al., Refusal in LLMs Is Mediated by a Single Direction (NeurIPS 2024)
- Gülmez, Gabliteration: Adaptive Multi-Directional Neural Weight Modification (2025)
- Prakash et al., Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal (2025) — hydra features
- Siu et al., COSMIC: Generalized Refusal Direction Identification in LLM Activations (ACL 2025)
- Yeo et al., Understanding Refusal in Language Models with Sparse Autoencoders (EMNLP 2025)