Qwen3.6-27B: RYS Layer Surgery (GGUF)

A modified version of Qwen3.6-27B-Instruct produced by RYS layer duplication: no training, no weight changes, just running layers 33–36 a second time during the forward pass.

Based on David Ng's RYS method.


TL;DR

On the Berkeley Function-Calling Leaderboard (BFCL v4, 100 tests/category × 13 single-turn categories, sampled), this variant beats the unmodified base model by +1.96 pp on average when run with thinking mode enabled, driven by large gains on the hardest live categories:

Category                  Base     rys_33-36   Δ
live_parallel             68.75%   87.50%      +18.75
live_relevance            68.75%   81.25%      +12.50
live_parallel_multiple    70.83%   75.00%      +4.17
mean (13 categories)      82.56%   84.52%      +1.96

The wins come from improved reasoning during prefill on multi-call / relevance-judgement queries. The trade-off is small regressions (−1 to −3 pp) on the easier non-live categories. Thinking mode is required; without it, this variant slightly underperforms the base model.


Files

File                                     Layers   Size
Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf    68       18 GiB

The base GGUF (no surgery) is at unsloth/Qwen3.6-27B-GGUF.


Internal probe results

A small probe of math, EQ, and reasoning prompts was run during the layer search. The probe categories are tiny (3 questions per reasoning subcategory, ~16 EQ-Bench-style items, ~16 math problems) so individual numbers should be treated as directional, not definitive.

Metric                                         Base    rys_33-36
Math (GSM8K-style partial credit)              0.537   0.500
EQ (EQ-Bench-style, 0–100)                     93.59   86.64
Reasoning total (17 probes, 5 subcategories)   0.765   0.882
  ↳ causal                                     0.67    1.00
  ↳ date                                       1.00    1.00
  ↳ logic                                      1.00    1.00
  ↳ navigation                                 0.67    1.00
  ↳ gsm                                        0.60    0.60

The 33–36 block was the only configuration in the layer-block sweep that achieved a perfect score on the causal reasoning subcategory while keeping the other reasoning subcategories at or above their baselines. This is what motivated picking it for the BFCL run below.


BFCL results (sampled, thinking enabled)

Category                 Base    rys_33-36
irrelevance              90.00   88.00
multiple                 96.00   95.00
parallel                 93.00   91.00
parallel_multiple        87.00   85.00
simple_java              59.00   61.00
simple_javascript        74.00   72.00
simple_python            95.00   92.00
live_irrelevance         98.00   99.00
live_multiple            88.00   87.00
live_parallel            68.75   87.50
live_parallel_multiple   70.83   75.00
live_relevance           68.75   81.25
live_simple              85.00   85.00
mean                     82.56   84.52

Sample size: 100 tests/category for categories with ≥100 entries; the full category was used for the smaller ones (live_parallel, live_parallel_multiple, live_relevance, simple_javascript). 1006 tests per model in total. The full benchmark would be ~5x larger and would also cover multi-turn, memory, and web-search categories that we did not run.

Inference: llama.cpp llama-server with --jinja; BFCL run via /v1/chat/completions with native tool use, temperature=1.0, top_p=0.95, top_k=20, max_tokens=8192.
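For reference, a request in this setup looks roughly like the sketch below. It is a minimal illustration against llama-server's OpenAI-compatible endpoint, not the actual BFCL harness; the tool schema, model name, prompt, and port are made up for the example.

from openai import OpenAI

# Point the client at the local llama-server (endpoint/port are assumptions).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

# Hypothetical example tool; BFCL supplies its own schemas per test case.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",                     # llama-server serves whatever model it loaded
    messages=[{"role": "user", "content": "What's the weather in Paris and in Rome?"}],
    tools=tools,
    temperature=1.0,
    top_p=0.95,
    max_tokens=8192,
    extra_body={"top_k": 20},          # llama.cpp-specific sampling field
)
print(resp.choices[0].message.tool_calls)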


What is RYS?

Transformers self-organise during training into functional circuits: contiguous blocks of layers that act together. RYS duplicates a specific block in the forward pass using the same weights:

Normal:    0 → … → 32 → 33 → 34 → 35 → 36 → 37 → … → 63
rys_33-36: 0 → … → 32 → 33 → 34 → 35 → 36
                      → 33 → 34 → 35 → 36 → 37 → … → 63

The model processes layers 33–36 twice. No fine-tuning, and no new trained parameters beyond the file-size overhead of storing the duplicated block in the GGUF. Total layer count goes from 64 → 68.
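For intuition, the execution schedule can be written down in a few lines. This is an illustrative sketch of the layer order only, not the script used to build the GGUF; the function name and defaults are invented for the example.

# Illustrative only: layer execution order for an RYS duplication (inclusive indices).
def rys_layer_order(num_layers: int = 64, dup_start: int = 33, dup_end: int = 36) -> list[int]:
    order = list(range(num_layers))
    # Re-run the duplicated block immediately after its first pass.
    order[dup_end + 1:dup_end + 1] = list(range(dup_start, dup_end + 1))
    return order

order = rys_layer_order()
assert len(order) == 68                # 64 base layers + 4 duplicated
print(order[31:42])                    # [31, 32, 33, 34, 35, 36, 33, 34, 35, 36, 37]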


How the layer range was found

A two-pass sweep across all 64 layers using a small probe of math, EQ, and reasoning prompts:

  • Pass 1 (8-layer blocks, stride 4): identified hot zones around layers 32–48 (math gains, causal reasoning) and 48–60 (general reasoning gains).
  • Pass 2 (4-layer blocks, stride 1, layers 32–58): the (33, 37) block, i.e. layers 33–36, was the only configuration that achieved a perfect score on the probe's causal reasoning subcategory while keeping date, logic, and navigation at their baseline ceilings.

The probe alone suggested rys_33-36 was a moderate win. The sampled BFCL run with thinking enabled confirms it on the harder live categories (above).
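The candidate grid for the two passes is easy to reconstruct. The sketch below only enumerates the blocks (inclusive bounds); probe_score is a hypothetical stand-in for the math/EQ/reasoning probe, not released code.

# Enumerate sweep candidates as (start, end) pairs with inclusive layer indices.
def sweep_candidates(num_layers: int = 64):
    # Pass 1: 8-layer blocks, stride 4, across all layers.
    pass1 = [(s, s + 7) for s in range(0, num_layers - 7, 4)]
    # Pass 2: 4-layer blocks, stride 1, restricted to the 32-58 hot zone.
    pass2 = [(s, s + 3) for s in range(32, 56)]
    return pass1, pass2

pass1, pass2 = sweep_candidates()
# best = max(pass2, key=probe_score)   # probe_score: hypothetical evaluation hook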


Hybrid Mamba/attention architecture constraint

Qwen3.6-27B is a hybrid SSM/attention model (full_attention_interval = 4): full attention every 4th layer, Gated DeltaNet SSM everywhere else. This creates a hard constraint: the total layer count must remain divisible by 4.

  • Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
  • Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗ → crash)
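The constraint is easy to check before building a variant; the snippet below is just arithmetic, with the interval taken from the model config.

# Verify that a duplicated block keeps the attention/SSM interleave intact.
FULL_ATTENTION_INTERVAL = 4            # from the Qwen3.6-27B config
BASE_LAYERS = 64

for block_size in (3, 4, 8):
    total = BASE_LAYERS + block_size
    ok = total % FULL_ATTENTION_INTERVAL == 0
    print(f"block of {block_size}: {total} layers -> {'ok' if ok else 'breaks the layer pattern'}")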

Usage

llama.cpp / llama-server

The wins require thinking mode. Use --jinja so the server applies the Qwen3.6 chat template, which primes thinking properly:

llama-server -m Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf \
             --jinja \
             -ngl 99 -c 32768 \
             --port 8080

Sampling parameters (Qwen3.6 thinking-mode defaults)

temperature = 1.0
top_p       = 0.95
top_k       = 20
min_p       = 0.0

For more deterministic / coding-focused tasks, Qwen recommends temperature=0.6 instead. Either way, leave thinking enabled.

Token budget

Qwen3.6's thinking chains can be long (we observed up to ~7k tokens of reasoning on hard BFCL parallel cases). Set max_tokens ≥ 8192 to avoid truncating mid-thought.

VRAM

About 22 GiB at Q4_K_XL with 32k context and Q8 KV cache. Fits comfortably on a single A100 40 GB.


When to use this

  • You want better function-calling performance on complex live queries (parallel calls, relevance judgement) and you can afford the extra prefill compute of the 4 duplicated layers.
  • You're running with thinking mode on (this is where the gain comes from).

When NOT to use this

  • You're running without thinking: base will be ~1.5 pp better.
  • You care about the very-easy categories (simple_python, multiple) more than the hard live ones: base is 1–3 pp better there.


License

Apache 2.0 (inherited from base model).
