# Ornstein3.6-35B-A3B-RYS-SABER-GGUF
GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B-RYS-SABER — the fully uncensored, RYS-enhanced Ornstein fine-tune with SABER refusal ablation applied.
BF16 source (71 GB): DJLougen/Ornstein3.6-35B-A3B-RYS-SABER | Censored RYS version: DJLougen/Ornstein3.6-35B-A3B-RYS
## Important: requires a patched llama.cpp
RYS duplicates one of the middle layers, which breaks the hardcoded full_attention_interval = 4 assumption in stock llama.cpp's Qwen3.5 loader. These GGUFs are re-converted with per-layer head_count_kv baked in, and you need a llama.cpp that reads that per-layer metadata instead of falling back to the interval formula.
Patched fork: https://github.com/DJLougen/llama.cpp (default branch rys-qwen35, one commit on top of ggml-org/llama.cpp@d00685831, fully backward-compatible).
Stock llama.cpp, Ollama, LM Studio, and any other inference runtime built on stock llama.cpp will currently fail to load these files with a check_tensor_dims error on blk.11 — this is expected until/unless the patch is upstreamed.
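The interval-vs-array issue can be sketched in a few lines of Python. This is an illustration, not llama.cpp's actual loader code: the layer counts come from this card, but the KV head counts (`N_KV_FULL`, `N_KV_LINEAR`) are hypothetical placeholders.

```python
# Why a hardcoded full_attention_interval breaks after RYS layer duplication.
FULL_ATTENTION_INTERVAL = 4   # stock loader's hardcoded assumption
N_KV_FULL, N_KV_LINEAR = 8, 4  # hypothetical per-layer-type KV head counts

def kv_heads_by_interval(layer: int) -> int:
    """Stock rule: every 4th layer is a full-attention layer."""
    return N_KV_FULL if (layer + 1) % FULL_ATTENTION_INTERVAL == 0 else N_KV_LINEAR

# The original 40-layer pattern is exactly what the interval rule predicts.
original = [kv_heads_by_interval(i) for i in range(40)]

# RYS duplicates layer 10, shifting every later layer down by one index,
# so the interval rule now mispredicts the pattern for the back half.
rys = original[:11] + original[10:]   # 41 layers, index 10 duplicated

mismatches = [i for i in range(len(rys)) if rys[i] != kv_heads_by_interval(i)]
print(f"{len(mismatches)} of {len(rys)} layers mispredicted by the interval rule")

# The fix baked into these GGUFs: store the `rys` list itself as a
# per-layer head_count_kv metadata array and read it directly.
```

The per-layer array is strictly more general: for an unmodified model it simply reproduces what the interval formula would have computed, which is why the patched loader stays backward-compatible.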
## Note: thinking model (use the bundled chat_template.jinja)
This is a Qwen3-Thinking derivative and emits its reasoning inside <think>...</think> tags. If you see raw <think>thinking text</think> blocks appearing inline in every response from llama-server (or any OpenAI-compatible client), you need to apply the Qwen3 thinking chat template that ships in this repo.
```shell
llama-server \
  -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
  --jinja \
  --chat-template-file chat_template.jinja \
  -ngl 99 -c 8192
```
`--jinja` enables Jinja chat-template parsing. `--chat-template-file chat_template.jinja` overrides the template embedded in the GGUF with the correct Qwen3-Thinking one from this repo.
Recent llama.cpp builds default --reasoning-format to deepseek, which splits <think>...</think> out of the content field into a separate reasoning_content field on the OpenAI-compatible response — so just --jinja --chat-template-file is enough. If you're on an older build and still see raw <think> blocks in content, add --reasoning-format deepseek explicitly.
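As a rough illustration (this is not llama.cpp's actual implementation), the server-side split performed by `--reasoning-format deepseek` behaves like moving a leading `<think>...</think>` span out of `content` and into `reasoning_content`:

```python
import re

def split_reasoning(raw: str) -> dict:
    """Mimic the deepseek reasoning-format split on a raw completion."""
    m = re.match(r"\s*<think>(.*?)</think>\s*", raw, flags=re.DOTALL)
    if not m:
        # No think block: everything stays in content.
        return {"content": raw, "reasoning_content": None}
    return {"content": raw[m.end():], "reasoning_content": m.group(1).strip()}

msg = split_reasoning("<think>User wants a haiku.</think>Moonlight on the snow...")
print(msg["reasoning_content"])  # the chain-of-thought
print(msg["content"])            # the visible answer
```

If your client only reads `content`, the reasoning is silently dropped, which is usually what you want in a chat UI.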
## Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
## Available Quantizations
| File | Quant | Size | Notes |
|---|---|---|---|
| Ornstein3.6-35B-A3B-RYS-SABER-Q8_0.gguf | Q8_0 | ~38 GB | Near-lossless, largest |
| Ornstein3.6-35B-A3B-RYS-SABER-Q6_K.gguf | Q6_K | ~29 GB | Very high quality |
| Ornstein3.6-35B-A3B-RYS-SABER-Q5_K_M.gguf | Q5_K_M | ~25 GB | Strong quality/size balance |
| Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf | Q4_K_M | ~21 GB | Recommended default |
| Ornstein3.6-35B-A3B-RYS-SABER-Q3_K_M.gguf | Q3_K_M | ~17 GB | Low-memory option |
(MoE architecture — only ~3B parameters are active per token, so inference throughput is much higher than a dense 35B model at the same quant.)
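A back-of-envelope calculation shows why: decode throughput is roughly bound by the bytes of weights read per token, which scales with *active* rather than total parameters. The ~4.5 bits/weight figure for Q4_K_M below is an approximation, not an exact spec.

```python
# Rough decode-speed comparison: A3B MoE vs a hypothetical dense 35B
# at the same quant. Parameter counts are from this card.
TOTAL_B, ACTIVE_B = 34.66, 3.0        # billions of params (total vs active)
BYTES_PER_PARAM_Q4 = 4.5 / 8          # ~4.5 bits/weight for Q4_K_M (approx.)

bytes_per_token_moe = ACTIVE_B * 1e9 * BYTES_PER_PARAM_Q4
bytes_per_token_dense = TOTAL_B * 1e9 * BYTES_PER_PARAM_Q4

print(f"weight bytes read per token (MoE):   {bytes_per_token_moe / 1e9:.2f} GB")
print(f"weight bytes read per token (dense): {bytes_per_token_dense / 1e9:.2f} GB")
print(f"theoretical decode speedup: ~{TOTAL_B / ACTIVE_B:.1f}x")
```

This ignores attention, KV-cache reads, and router overhead, so the real-world speedup is smaller, but the active-parameter count is the right first-order intuition.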
## Model Lineage
Qwen 3.6 35B-A3B → Ornstein3.6 (DDM fine-tune) → RYS (layer 10 dup, +49%) → SABER (refusal ablated)
## Model Details
- Architecture: Qwen 3.6 MoE (34.66B total, ~3B active per token)
- Layers: 41 (40 original + 1 RYS-duplicated layer 10)
- Context: 262,144 tokens
- SABER: 54 refusal directions ablated across layers 24-32, 100% capability preserved
- GGUF metadata: per-layer `head_count_kv` array encoding the RYS-shifted attention pattern
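The core operation behind refusal-direction ablation (in the style of Arditi et al., cited below) is projecting a weight vector onto the orthogonal complement of a "refusal direction". SABER's exact 54-direction procedure is not spelled out here; this is just the underlying math, with a made-up direction for illustration:

```python
import math

def ablate(w: list[float], r: list[float]) -> list[float]:
    """Remove the component of w along direction r: w' = w - (w . r_hat) r_hat."""
    norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / norm for x in r]            # unit refusal direction
    dot = sum(wi * ri for wi, ri in zip(w, r_hat))
    return [wi - dot * ri for wi, ri in zip(w, r_hat)]

w = [2.0, 1.0, 0.0]          # toy weight row
r = [1.0, 0.0, 0.0]          # hypothetical refusal direction
w_ablated = ablate(w, r)
print(w_ablated)             # component along r is gone: [0.0, 1.0, 0.0]
```

After ablation the vector is exactly orthogonal to the refusal direction, so activations can no longer move along it; applying this across many directions and layers is what removes refusal behavior while leaving other components untouched.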
## Usage
### Build the patched llama.cpp
```shell
git clone https://github.com/DJLougen/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
Drop -DGGML_CUDA=ON for a CPU-only build. The patch touches the GGUF loader and three model forward files; backend selection is independent.
### Download and run
```shell
hf download DJLougen/Ornstein3.6-35B-A3B-RYS-SABER-GGUF \
  Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
  --local-dir .

./build/bin/llama-cli \
  -m Ornstein3.6-35B-A3B-RYS-SABER-Q4_K_M.gguf \
  -p "Your prompt here" \
  -ngl 99 -c 8192
```
## Disclaimer
This model has had its refusal training removed. It will comply with requests that the base model would refuse. The user assumes full responsibility for how this model is used. This release is intended for research, creative, and educational purposes.
## License
Apache 2.0
## Citation / prior art
SABER builds on a line of refusal-direction research, including:
- Arditi et al., Refusal in LLMs Is Mediated by a Single Direction (NeurIPS 2024)
- Gülmez, Gabliteration: Adaptive Multi-Directional Neural Weight Modification (2025)
- Prakash et al., Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal (2025) — hydra features
- Siu et al., COSMIC: Generalized Refusal Direction Identification in LLM Activations (ACL 2025)
- Yeo et al., Understanding Refusal in Language Models with Sparse Autoencoders (EMNLP 2025)