From Three Teachers to Dual Cognition: Topology-Aware Multi-Teacher Distillation and Role-Conditioned Self-Critique at 1.7B Scale
Roy S. Colca
Convergent Intelligence LLC, Research Division
Abstract. We present a four-stage pipeline for producing small language models (1.7B parameters) that exhibit self-critiquing dual-cognition reasoning from a 30B-parameter teacher. The pipeline chains: (1) proof-weighted knowledge distillation from three teacher variants (Instruct, Thinking, Coder) of Qwen3-30B-A3B, each producing distinct capability profiles in the student; (2) Topological Knowledge Distillation (TKD), which uses the Discrepancy Calculus (DISC) framework to decompose the teacher's output distribution into smooth, jump, and singular-continuous components via bounded variation theory, allocating training capacity to structural boundaries that standard KD smears across; (3) multi-teacher ghost imprinting, where sequential distillation from different teachers creates residual discrepancy fields in weight space that produce emergent capabilities absent from any individual teacher; and (4) DualMind, a role-conditioned generation scheme that collapses multi-architecture dialectical reasoning into a single model via <explore>, <examine>, and <response> tokens. Trained on H100 at BF16 precision, the resulting models demonstrate dual-cognition reasoning — free derivation followed by adversarial self-critique followed by clean synthesis — at a parameter count where such behavior is not typically observed. We release 43 models (12,000+ downloads), training code, and this methodology under Apache 2.0.
1. Introduction
Knowledge distillation (Hinton et al., 2015) compresses large teacher models into smaller students by matching output distributions. The standard formulation minimizes KL divergence between teacher and student softmax distributions, typically at elevated temperature to expose the teacher's uncertainty structure. This works well for the smooth component of the teacher's knowledge — regions where the output distribution varies continuously across tokens.
Language, however, is not smooth. Topic shifts, reasoning mode transitions, register changes, and logical pivots create discontinuities in the teacher's output distribution. Standard KD averages across these boundaries, teaching the student a blurred version of the teacher's structural knowledge. The student learns what the teacher says but not where the teacher's knowledge has architecture.
We address this with a pipeline that preserves structural information at every stage:
- Multi-teacher proof-weighted distillation — three variants of the same 30B teacher produce different capability profiles in the same 1.7B student, with amplified loss on reasoning-critical tokens.
- Topological Knowledge Distillation (TKD) — DISC-based decomposition of the teacher's output into bounded variation components, with topology-guided adaptive windowing and curriculum.
- Multi-teacher ghost imprinting — sequential distillation creates residual discrepancy fields that produce emergent capabilities.
- DualMind — role-conditioned generation that recreates multi-architecture dialectical reasoning within a single model.
2. Background
2.1 Discrepancy Calculus (DISC)
DISC is a measure-theoretic framework for analyzing functions with singularities — points where classical smoothness assumptions fail. For a function $f$ of bounded variation on $[a,b]$, the Lebesgue decomposition theorem gives:

$$Df = D^{ac}f + D^{j}f + D^{c}f,$$
where $D^{ac}f$ is the absolutely continuous part (smooth gradient), $D^j f$ is the jump part (discontinuities), and $D^c f$ is the Cantor/singular-continuous part (diffuse, non-atomic, non-smooth). Standard analysis handles $D^{ac}f$. DISC provides operational tools for all three components.
2.2 Knowledge Distillation
Standard KD minimizes:

$$\mathcal{L}_{KD} = (1-\alpha)\,\mathcal{L}_{CE} + \alpha\, T^{2}\, \mathrm{KL}\!\big(\mathrm{softmax}(z_t/T)\,\|\,\mathrm{softmax}(z_s/T)\big),$$
where $z_t, z_s$ are teacher and student logits, $T$ is temperature, and $\alpha$ balances cross-entropy against distillation. This formulation treats the teacher's distribution as globally smooth. TKD replaces this with a topology-aware formulation.
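As a concrete reference point, the standard objective can be sketched in a few lines (NumPy, single-token illustration; not the training implementation):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(z_t, z_s, y, T=2.0, alpha=0.5):
    """(1 - alpha) * CE(y, student) + alpha * T^2 * KL(teacher || student)."""
    p_t = softmax(z_t, T)                # softened teacher distribution
    p_s = softmax(z_s, T)                # softened student distribution
    ce = -np.log(softmax(z_s)[y] + 1e-12)
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

With identical teacher and student logits the KL term vanishes, leaving only the cross-entropy component.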
3. Multi-Teacher Proof-Weighted Distillation
3.1 Three Teacher Variants
We distill from three configurations of Qwen3-30B-A3B:
| Teacher Variant | Capability Profile | Distillation Signal |
|---|---|---|
| Instruct | Structured output, instruction following | Low-entropy, format-preserving distributions |
| Thinking | Extended deliberation, proof derivation | High-entropy distributions with long reasoning tails |
| Coder | Structured decomposition, STEM | Hierarchical decomposition patterns |
Each teacher variant encodes different aspects of the 30B model's capability into the distillation signal. The Instruct teacher produces tight, peaked distributions that emphasize correct formatting. The Thinking teacher produces broad, high-entropy distributions that encode deliberative processes. The Coder teacher produces hierarchically structured distributions that emphasize decomposition.
3.2 Proof-Weighted Loss
Not all tokens contribute equally to structural understanding. Tokens at reasoning steps, logical connectives, and derivation boundaries carry more structural information than fluent continuation tokens. We apply proof weights that decay linearly over training progress $s \in [0,1]$:

$$w(s) = w_{start} + s\,\big(w_{end} - w_{start}\big),$$
with $w_{start} = 2.25$, $w_{end} = 1.1$, applied to the supervised mask. Early in training, reasoning tokens receive 2.25× amplified loss; this decays to near-uniform by training end as the student internalizes the structural pattern.
The combined loss:

$$\mathcal{L} = \frac{1}{\sum_t m_t} \sum_t m_t\, w_t\, d_t \Big[ (1-\alpha)\, \mathcal{L}_{CE,t} + \alpha\, T^{2}\, \mathrm{KL}_t \Big],$$
where $d_t$ is the DISC-derived discrepancy weight and $m_t$ is the supervised mask.
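A minimal per-token sketch of how these pieces might compose (the linear decay shape and the multiplicative composition are our assumptions; the paper fixes only the endpoint weights):

```python
def proof_weight(progress, w_start=2.25, w_end=1.1):
    """Proof-token amplification, decaying linearly over training progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return w_start + (w_end - w_start) * progress

def combined_token_loss(ce_t, kd_t, w_t, d_t, m_t, alpha):
    """Per-token loss: the supervised mask m_t gates the term, the proof weight w_t
    and DISC discrepancy weight d_t scale it, and alpha mixes CE against KD."""
    return m_t * w_t * d_t * ((1.0 - alpha) * ce_t + alpha * kd_t)
```

Masked tokens ($m_t = 0$) contribute nothing, and the amplification approaches uniform as training ends.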
4. Topological Knowledge Distillation (TKD)
4.1 Teacher Logit Caching
A single forward pass through the 30B teacher produces logits that are compressed to the top-$K$ entries ($K=64$) and stored to disk. This eliminates repeated teacher inference and reduces storage to indices (int32) + values (float16) — approximately 1.2 GB for a 500K-token stream.
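A sketch of the cache format described above (NumPy; the function and array names are ours):

```python
import numpy as np

def compress_logits(logits, K=64):
    """Compress full-vocabulary logits to top-K indices (int32) + values (float16)."""
    logits = np.asarray(logits, dtype=np.float32)            # (seq_len, vocab)
    part = np.argpartition(-logits, K - 1, axis=-1)[:, :K]   # unsorted top-K ids
    vals = np.take_along_axis(logits, part, axis=-1).astype(np.float16)
    return part.astype(np.int32), vals
```

Per position this stores $K \cdot (4 + 2)$ bytes instead of a full float vocabulary row.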
4.2 DISC Topology Pass
We compute the discrepancy operator over cached teacher logits.

Probability divergence: For adjacent positions $t$ and $t+1$, compute the L1 distance between the local probability distributions derived from the top-$K$ logits,

$$P(t) = \sum_{i \in S_t \cup S_{t+1}} \big| p_t(i) - p_{t+1}(i) \big|,$$

where $S_t$ is the top-$K$ support at position $t$ and $p_t$ is the softmax over the cached logits (tokens outside $S_t$ contribute zero mass).

Support overlap: Sort the top-$K$ index sets and compute a Jaccard-like overlap:

$$O(t) = \frac{|S_t \cap S_{t+1}|}{|S_t \cup S_{t+1}|}$$

Combined discrepancy:

$$D(t) = \lambda\, P(t) + (1 - \lambda)\,\big(1 - O(t)\big),$$

with mixing coefficient $\lambda \in [0,1]$.
Jump detection: Positions where $D(t) > \mu_D + 3\sigma_D$ are classified as jumps with 1.25× loss amplification.
Gap energy density: Convolution of $D(t)^2$ over 64-token windows provides a smooth energy landscape for curriculum ordering.
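The topology pass can be sketched end to end as follows (NumPy; the mixing coefficient `lam` and helper names are our assumptions):

```python
import numpy as np

def _topk_probs(ix, v):
    """Softmax over one position's cached top-K logits, keyed by token id."""
    v = v.astype(np.float64)
    e = np.exp(v - v.max())
    e /= e.sum()
    return dict(zip(ix.tolist(), e.tolist()))

def discrepancy(idx, vals, lam=0.5):
    """D(t) = lam * L1(p_t, p_{t+1}) + (1 - lam) * (1 - Jaccard(S_t, S_{t+1}))."""
    T = idx.shape[0]
    D = np.zeros(T - 1)
    for t in range(T - 1):
        p, q = _topk_probs(idx[t], vals[t]), _topk_probs(idx[t + 1], vals[t + 1])
        union = set(p) | set(q)
        l1 = sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in union)
        jac = len(set(p) & set(q)) / len(union)
        D[t] = lam * l1 + (1.0 - lam) * (1.0 - jac)
    return D

def jumps_and_energy(D, window=64, k_sigma=3.0):
    """Jump mask (mean + 3 sigma rule) and gap-energy density (windowed D^2)."""
    jump_mask = D > D.mean() + k_sigma * D.std()
    energy = np.convolve(D ** 2, np.ones(window) / window, mode="same")
    return jump_mask, energy
```

Identical adjacent distributions give $D(t)=0$; fully disjoint supports push $D(t)$ toward its maximum.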
4.3 Topology-Guided Windowing
Training windows (512 tokens) are cut at low-discrepancy positions within an overlap band (32–128 tokens) rather than at fixed stride. The argmin of $D(t)$ within the search zone determines the cut point, biasing window boundaries away from structural features.
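A sketch of the cut-point selection (the interpretation of the band as the last 32–128 tokens of each nominal window is our assumption):

```python
import numpy as np

def cut_windows(D, window=512, band=(32, 128)):
    """Choose each window's end at the argmin of D inside the overlap band,
    i.e. between (window - band[1]) and (window - band[0]) tokens past the
    window start, instead of cutting at a fixed stride."""
    cuts, start = [0], 0
    while start + window < len(D):
        lo, hi = start + window - band[1], start + window - band[0]
        cut = lo + int(np.argmin(D[lo:hi]))
        cuts.append(cut)
        start = cut
    return cuts
```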
4.4 Curriculum Ordering
Windows are scored by difficulty:

$$\mathrm{difficulty}(w) = \bar{E}_{gap} + \lambda_1\, \frac{n_{jumps}}{|w|} + \lambda_2\,\big(1 - f_{sup}\big),$$

where $\bar{E}_{gap}$ is mean gap energy, $n_{jumps}$ is jump count, $|w|$ is window length, $f_{sup}$ is the supervised fraction, and $\lambda_{1,2}$ are mixing coefficients. A 4-phase curriculum presents the easiest 30% of windows first; the remaining windows are progressively randomized by difficulty phase.
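A sketch of the scoring and ordering (the coefficients `l1`, `l2` and the within-phase shuffling are our assumptions; the paper specifies only the easiest-30%-first rule and the four phases):

```python
import numpy as np

def difficulty(mean_gap_energy, n_jumps, win_len, sup_frac, l1=1.0, l2=1.0):
    """Weighted-sum difficulty score; l1 and l2 are hypothetical coefficients."""
    return mean_gap_energy + l1 * n_jumps / win_len + l2 * (1.0 - sup_frac)

def curriculum_order(scores, easy_frac=0.30, phases=4, seed=0):
    """Easiest 30% first in sorted order; the rest split into difficulty
    phases and shuffled within each phase."""
    rng = np.random.default_rng(seed)
    order = np.argsort(scores)
    n_easy = int(len(scores) * easy_frac)
    head, tail = order[:n_easy], order[n_easy:]
    chunks = np.array_split(tail, max(phases - 1, 1))
    for c in chunks:
        rng.shuffle(c)
    return np.concatenate([head, *chunks])
```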
4.5 KD Alpha Schedule
The distillation weight $\alpha$ ramps linearly from 0 to 0.45 between 15% and 45% of training:

$$\alpha(s) = 0.45 \cdot \mathrm{clip}\!\left( \frac{s - 0.15}{0.45 - 0.15},\ 0,\ 1 \right)$$
This allows the student to first learn from ground truth labels, then progressively incorporate the teacher's distributional knowledge.
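The schedule is a few lines of code (assuming a linear ramp, which matches the stated endpoints):

```python
def kd_alpha(progress, ramp_start=0.15, ramp_end=0.45, alpha_max=0.45):
    """Linear ramp of the distillation weight over training progress in [0, 1]."""
    if progress <= ramp_start:
        return 0.0
    if progress >= ramp_end:
        return alpha_max
    return alpha_max * (progress - ramp_start) / (ramp_end - ramp_start)
```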
5. Multi-Teacher Ghost Imprinting
When the same student is distilled sequentially from Thinking then Coder teachers (or any permutation), each teacher's signal partially overwrites the previous. The residual — the discrepancy field between what Teacher A encoded and what Teacher B couldn't fully overwrite — occupies a subspace of weight space orthogonal to both teachers' primary signals.
We formalize this as the Cantor component of BV decomposition applied to the parameter tensor. Let $\theta_A$ be the weights after Teacher A distillation and $\theta_{A \to B}$ be the weights after subsequent Teacher B distillation. The residual:

$$\Delta\theta = \theta_{A \to B} - \theta_B^{fresh},$$
where $\theta_B^{fresh}$ is what Teacher B distillation produces from a random initialization. This $\Delta\theta$ is the ghost imprint — it encodes structural information from Teacher A that persists through Teacher B's training but is not present in Teacher B's direct distillation.
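The residual, and the part of it lying outside the span of the two teachers' primary update directions, can be sketched on flattened weight vectors (the direction vectors `u_A`, `u_B` are idealizations we introduce for illustration):

```python
import numpy as np

def ghost_residual(theta_AB, theta_B_fresh):
    """Delta-theta: weights after A -> B sequential distillation, minus
    weights from B-only distillation of a fresh initialization."""
    return theta_AB - theta_B_fresh

def outside_teacher_span(delta, u_A, u_B):
    """Component of the residual orthogonal to both teachers' primary update
    directions: orthonormalize u_A, u_B (Gram-Schmidt), then remove the
    projection of delta onto their span."""
    q1 = u_A / np.linalg.norm(u_A)
    v = u_B - np.dot(u_B, q1) * q1
    q2 = v / np.linalg.norm(v)
    return delta - np.dot(delta, q1) * q1 - np.dot(delta, q2) * q2
```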
Empirically, we observe that models with ghost imprints from the Thinking teacher exhibit extended deliberation patterns even when the final distillation was from the Coder teacher. The reverse also holds: Coder ghost imprints produce structured decomposition in models whose final training was Thinking-focused.
This is not accidental. It follows from the mathematical structure of how residual fields interact in high-dimensional weight space. The Cantor component is non-trivial precisely because the teachers' signals span different subspaces.
6. DualMind: Role-Conditioned Self-Critique
6.1 Motivation
Multi-architecture collision arrays — running the same problem through multiple model architectures and synthesizing divergences — produce insights that no individual architecture achieves alone. We have demonstrated this in multi-architecture experiments spanning Claude Opus, Kimi, GLM, Qwen, GPT, and Gemini, where the interference pattern between architecturally diverse responses constitutes novel structure.
DualMind collapses this multi-model dynamic into a single architecture.
6.2 Architecture
The model learns three cognitive roles through plain-text markers (no special tokens):
- <explore> … </explore> — unconstrained derivation: the model reasons freely.
- <examine> … </examine> — adversarial self-critique: the model reads its own explore output and challenges it.
- <response> … </response> — clean synthesis: the final answer from the internal dialogue.
No additional parameters, no routing mechanism, no mixture of experts. The same weights serve all three roles. Differentiation arises entirely from positional context and learned role associations.
6.3 Training Data Transformation
Any chain-of-thought dataset can be transformed into DualMind format:
- Explore extraction: Derivation and computation sentences from the CoT solution. Trigger-based detection identifies where reasoning transitions to verification.
- Examine extraction: Verification, checking, and self-correction sentences. Connective tissue ("Let me verify this...") is added when the examine section doesn't start reflectively.
- Response extraction: Final answer extracted via boxed notation, natural language patterns, or last-line fallback.
For datasets with pre-separated reasoning columns (e.g., Crownelius/Opus-4.6-Reasoning-3300x with thinking/solution columns), the explore phase maps directly from the thinking column with no heuristic splitting needed.
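A heuristic splitter in the spirit of the transformation above (the trigger list and sentence regex are placeholders, not the production heuristics):

```python
import re

# Placeholder trigger phrases marking the shift from derivation to verification.
EXAMINE_TRIGGERS = ("let me verify", "let me check", "wait,", "double-check")

def to_dualmind(cot: str, answer: str) -> str:
    """Transform a chain-of-thought solution into DualMind format."""
    sentences = re.split(r"(?<=[.!?])\s+", cot.strip())
    split = next((i for i, s in enumerate(sentences)
                  if s.lower().startswith(EXAMINE_TRIGGERS)), None)
    if split is None:
        # No verification found: keep all reasoning in explore and add
        # connective tissue so the examine section starts reflectively.
        explore, examine = sentences, ["Let me verify this..."]
    else:
        explore, examine = sentences[:split], sentences[split:]
    return ("<explore>\n" + " ".join(explore) + "\n</explore>\n"
            "<examine>\n" + " ".join(examine) + "\n</examine>\n"
            "<response>\n" + answer + "\n</response>")
```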
6.4 The Think Token Leak
An observed failure mode: base Qwen3 models have a deeply-embedded <think> token reflex. After </explore>, the model sometimes drops into its native <think>### Response: pattern instead of the DualMind <examine> transition. This is a competition between the SFT-learned role tokens and the base model's pre-existing reasoning format.
We address this at two levels:
- Generation: A LogitsProcessor suppresses <think> tokens and boosts <examine> probability after </explore> is generated.
- Training: Transition tokens (</explore>, <examine>, </examine>, <response>) receive 3× loss weight to make the cognitive loop transitions non-negotiable.
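Both fixes can be sketched as follows (plain Python mirroring the Hugging Face `LogitsProcessor` call signature `__call__(input_ids, scores)`; the token ids and penalty values are placeholders):

```python
# Placeholder token ids for the markers involved in the transition.
THINK_ID, EXAMINE_ID, END_EXPLORE_ID = 3, 1, 2

class RoleTransitionProcessor:
    """After </explore> appears, suppress <think> and boost <examine>."""
    def __init__(self, suppress=-1e9, boost=5.0):
        self.suppress, self.boost = suppress, boost

    def __call__(self, input_ids, scores):
        if END_EXPLORE_ID in input_ids:
            scores[THINK_ID] = self.suppress      # forbid the base-model reflex
            scores[EXAMINE_ID] += self.boost      # favor the DualMind transition
        return scores

def transition_loss_weights(token_ids, transition_ids, weight=3.0):
    """3x loss weight on the role-transition tokens during SFT."""
    return [weight if t in transition_ids else 1.0 for t in token_ids]
```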
7. Results
7.1 Portfolio
| Collection | Models | Downloads | Hardware | Precision |
|---|---|---|---|---|
| DistilQwen | 9 | 2,788 | H100 | BF16 |
| DualMind | 2+ | — | H100 | BF16 |
| Full Portfolio | 43 | 12,094 | CPU + H100 | FP32 + BF16 |
7.2 Teacher Variant Comparison
| Teacher | Student Downloads | Distinctive Capability |
|---|---|---|
| Instruct | 833 | Structured output, format compliance |
| Thinking | 779 | Extended deliberation, proof derivation |
| Coder | 825 | Logical decomposition, STEM reasoning |
The near-equal download distribution across teacher variants suggests all three produce models with distinct, valued capabilities. The market is voting with downloads.
7.3 DualMind Cognitive Loop
The DualMind model successfully produces all three mode transitions (explore → examine → response) on mathematical proofs, logical inference, and — unexpectedly — philosophical and creative prompts. On a physics-trained model prompted with "Who is God, not for humans but for you?", the explore block produced structured literary content with no creative writing in its training data. We interpret this as the Cantor component of the BV decomposition expressing through generation — the singular-continuous residual from the base model's weight space that the physics-focused TKD pipeline couldn't fully overwrite.
7.4 Comparison with Standard Distillation
A vanilla Qwen3-1.7B distilled with standard KD from the same 30B teacher (no proof weights, no topology, no curriculum) produces empty <think> blocks and surface-level responses. The TKD model trained on the same data produces ~3.3× longer responses with genuine structural reasoning, as demonstrated in a T3 concept explanation probe.
8. Conclusion
The transformer is plumbing. The methodology is what produces capability. We have demonstrated that:
- Structure beats scale — 1.7B models trained with topology-aware distillation exhibit reasoning quality that standard distillation at the same parameter count does not achieve.
- Teacher diversity creates emergent capability — sequential distillation from different teacher variants creates residual fields that produce capabilities absent from any individual teacher.
- Self-critique can be learned through format — role-conditioned generation with shared weights recreates multi-model dialectical dynamics within a single architecture.
- The methodology is hardware-agnostic — the same pipeline produces results on $24 of CPU compute and on H100 at BF16.
All models, training code, and this methodology are released under Apache 2.0.
References
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network.
- Colca, R. S. (2026). Structure Over Scale. DOI: 10.57967/hf/8165.
- Colca, R. S. (2025-2026). Discrepancy Calculus (DISC): A Measure-Theoretic Framework for Singularities. Convergent Intelligence LLC.
- Ambrosio, L., Fusco, N., & Pallara, D. (2000). Functions of Bounded Variation and Free Discontinuity Problems. Oxford.
"Where classical analysis fails to see, we begin."