From Three Teachers to Dual Cognition: Topology-Aware Multi-Teacher Distillation and Role-Conditioned Self-Critique at 1.7B Scale
Roy S. Colca
Convergent Intelligence LLC, Research Division
Abstract. We present a four-stage pipeline for producing small language models (1.7B parameters) that exhibit self-critiquing dual-cognition reasoning from a 30B-parameter teacher. The pipeline chains: (1) proof-weighted knowledge distillation from three teacher variants (Instruct, Thinking, Coder) of Qwen3-30B-A3B, each producing distinct capability profiles in the student; (2) Topological Knowledge Distillation (TKD), which uses the Discrepancy Calculus (DISC) framework to decompose the teacher's output distribution into smooth, jump, and singular-continuous components via bounded variation theory, allocating training capacity to structural boundaries that standard KD smears across; (3) multi-teacher ghost imprinting, where sequential distillation from different teachers creates residual discrepancy fields in weight space that produce emergent capabilities absent from any individual teacher; and (4) DualMind, a role-conditioned generation scheme that collapses multi-architecture dialectical reasoning into a single model via <explore>, <examine>, and <response> tokens. Trained on H100 at BF16 precision, the resulting models demonstrate dual-cognition reasoning — free derivation followed by adversarial self-critique followed by clean synthesis — at a parameter count where such behavior is not typically observed. We release 43 models (12,000+ downloads), training code, and this methodology under Apache 2.0.
1. Introduction
Knowledge distillation (Hinton et al., 2015) compresses large teacher models into smaller students by matching output distributions. The standard formulation minimizes KL divergence between teacher and student softmax distributions, typically at elevated temperature to expose the teacher's uncertainty structure. This works well for the smooth component of the teacher's knowledge — regions where the output distribution varies continuously across tokens.
Language, however, is not smooth. Topic shifts, reasoning mode transitions, register changes, and logical pivots create discontinuities in the teacher's output distribution. Standard KD averages across these boundaries, teaching the student a blurred version of the teacher's structural knowledge. The student learns what the teacher says but not where the teacher's knowledge has architecture.
We address this with a pipeline that preserves structural information at every stage:
- Multi-teacher proof-weighted distillation — three variants of the same 30B teacher produce different capability profiles in the same 1.7B student, with amplified loss on reasoning-critical tokens.
- Topological Knowledge Distillation (TKD) — DISC-based decomposition of the teacher's output into bounded variation components, with topology-guided adaptive windowing and curriculum.
- Multi-teacher ghost imprinting — sequential distillation creates residual discrepancy fields that produce emergent capabilities.
- DualMind — role-conditioned generation that recreates multi-architecture dialectical reasoning within a single model.
2. Background
2.1 Discrepancy Calculus (DISC)
DISC is a measure-theoretic framework for analyzing functions with singularities — points where classical smoothness assumptions fail. For a function $f$ of bounded variation on $[a,b]$, the Lebesgue decomposition theorem gives:

$$Df = D^{ac}f + D^{j}f + D^{c}f,$$
where $D^{ac}f$ is the absolutely continuous part (smooth gradient), $D^j f$ is the jump part (discontinuities), and $D^c f$ is the Cantor/singular-continuous part (diffuse, non-atomic, non-smooth). Standard analysis handles $D^{ac}f$. DISC provides operational tools for all three components.
2.2 Knowledge Distillation
Standard KD minimizes:

$$\mathcal{L}_{KD} = (1-\alpha)\,\mathcal{L}_{CE} + \alpha\, T^{2}\, \mathrm{KL}\!\big(\mathrm{softmax}(z_t/T)\,\|\,\mathrm{softmax}(z_s/T)\big),$$
where $z_t, z_s$ are teacher and student logits, $T$ is temperature, and $\alpha$ balances cross-entropy against distillation. This formulation treats the teacher's distribution as globally smooth. TKD replaces this with a topology-aware formulation.
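As a concrete reference point, the standard objective can be sketched in a few lines (NumPy, single-token illustration; not the training implementation):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(z_t, z_s, y, T=2.0, alpha=0.5):
    """(1 - alpha) * CE(y, student) + alpha * T^2 * KL(teacher || student)."""
    p_t = softmax(z_t, T)                # softened teacher distribution
    p_s = softmax(z_s, T)                # softened student distribution
    ce = -np.log(softmax(z_s)[y] + 1e-12)
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

With identical teacher and student logits the KL term vanishes, leaving only the cross-entropy component.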
3. Multi-Teacher Proof-Weighted Distillation
3.1 Three Teacher Variants
We distill from three configurations of Qwen3-30B-A3B:
| Teacher Variant | Capability Profile | Distillation Signal |
|---|---|---|
| Instruct | Structured output, instruction following | Low-entropy, format-preserving distributions |
| Thinking | Extended deliberation, proof derivation | High-entropy distributions with long reasoning tails |
| Coder | Structured decomposition, STEM | Hierarchical decomposition patterns |
Each teacher variant encodes different aspects of the 30B model's capability into the distillation signal. The Instruct teacher produces tight, peaked distributions that emphasize correct formatting. The Thinking teacher produces broad, high-entropy distributions that encode deliberative processes. The Coder teacher produces hierarchically structured distributions that emphasize decomposition.
3.2 Proof-Weighted Loss
Not all tokens contribute equally to structural understanding. Tokens at reasoning steps, logical connectives, and derivation boundaries carry more structural information than fluent continuation tokens. We apply proof weights that decay linearly over training progress $s \in [0,1]$:

$$w(s) = w_{start} + s\,\big(w_{end} - w_{start}\big),$$
with $w_{start} = 2.25$, $w_{end} = 1.1$, applied to the supervised mask. Early in training, reasoning tokens receive 2.25× amplified loss; this decays to near-uniform by training end as the student internalizes the structural pattern.
The combined loss:

$$\mathcal{L} = \frac{1}{\sum_t m_t} \sum_t m_t\, w_t\, d_t \Big[ (1-\alpha)\, \mathcal{L}_{CE,t} + \alpha\, T^{2}\, \mathrm{KL}_t \Big],$$
where $d_t$ is the DISC-derived discrepancy weight and $m_t$ is the supervised mask.
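A minimal per-token sketch of how these pieces might compose (the linear decay shape and the multiplicative composition are our assumptions; the paper fixes only the endpoint weights):

```python
def proof_weight(progress, w_start=2.25, w_end=1.1):
    """Proof-token amplification, decaying linearly over training progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return w_start + (w_end - w_start) * progress

def combined_token_loss(ce_t, kd_t, w_t, d_t, m_t, alpha):
    """Per-token loss: the supervised mask m_t gates the term, the proof weight w_t
    and DISC discrepancy weight d_t scale it, and alpha mixes CE against KD."""
    return m_t * w_t * d_t * ((1.0 - alpha) * ce_t + alpha * kd_t)
```

Masked tokens ($m_t = 0$) contribute nothing, and the amplification approaches uniform as training ends.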
4. Topological Knowledge Distillation (TKD)
4.1 Teacher Logit Caching
A single forward pass through the 30B teacher produces logits that are compressed to the top-$K$ entries ($K=64$) and stored to disk. This eliminates repeated teacher inference and reduces storage to indices (int32) + values (float16) — approximately 1.2 GB for a 500K-token stream.
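A sketch of the cache format described above (NumPy; the function and array names are ours):

```python
import numpy as np

def compress_logits(logits, K=64):
    """Compress full-vocabulary logits to top-K indices (int32) + values (float16)."""
    logits = np.asarray(logits, dtype=np.float32)            # (seq_len, vocab)
    part = np.argpartition(-logits, K - 1, axis=-1)[:, :K]   # unsorted top-K ids
    vals = np.take_along_axis(logits, part, axis=-1).astype(np.float16)
    return part.astype(np.int32), vals
```

Per position this stores $K \cdot (4 + 2)$ bytes instead of a full float vocabulary row.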
4.2 DISC Topology Pass
We compute the discrepancy operator over cached teacher logits.

Probability divergence: For adjacent positions $t$ and $t+1$, compute the L1 distance between the local probability distributions derived from the top-$K$ logits,

$$P(t) = \sum_{i \in S_t \cup S_{t+1}} \big| p_t(i) - p_{t+1}(i) \big|,$$

where $S_t$ is the top-$K$ support at position $t$ and $p_t$ is the softmax over the cached logits (tokens outside $S_t$ contribute zero mass).

Support overlap: Sort the top-$K$ index sets and compute a Jaccard-like overlap:

$$O(t) = \frac{|S_t \cap S_{t+1}|}{|S_t \cup S_{t+1}|}$$

Combined discrepancy:

$$D(t) = \lambda\, P(t) + (1 - \lambda)\,\big(1 - O(t)\big),$$

with mixing coefficient $\lambda \in [0,1]$.
Jump detection: Positions where $D(t) > \mu_D + 3\sigma_D$ are classified as jumps with 1.25× loss amplification.
Gap energy density: Convolution of $D(t)^2$ over 64-token windows provides a smooth energy landscape for curriculum ordering.
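The topology pass can be sketched end to end as follows (NumPy; the mixing coefficient `lam` and helper names are our assumptions):

```python
import numpy as np

def _topk_probs(ix, v):
    """Softmax over one position's cached top-K logits, keyed by token id."""
    v = v.astype(np.float64)
    e = np.exp(v - v.max())
    e /= e.sum()
    return dict(zip(ix.tolist(), e.tolist()))

def discrepancy(idx, vals, lam=0.5):
    """D(t) = lam * L1(p_t, p_{t+1}) + (1 - lam) * (1 - Jaccard(S_t, S_{t+1}))."""
    T = idx.shape[0]
    D = np.zeros(T - 1)
    for t in range(T - 1):
        p, q = _topk_probs(idx[t], vals[t]), _topk_probs(idx[t + 1], vals[t + 1])
        union = set(p) | set(q)
        l1 = sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in union)
        jac = len(set(p) & set(q)) / len(union)
        D[t] = lam * l1 + (1.0 - lam) * (1.0 - jac)
    return D

def jumps_and_energy(D, window=64, k_sigma=3.0):
    """Jump mask (mean + 3 sigma rule) and gap-energy density (windowed D^2)."""
    jump_mask = D > D.mean() + k_sigma * D.std()
    energy = np.convolve(D ** 2, np.ones(window) / window, mode="same")
    return jump_mask, energy
```

Identical adjacent distributions give $D(t)=0$; fully disjoint supports push $D(t)$ toward its maximum.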
4.3 Topology-Guided Windowing
Training windows (512 tokens) are cut at low-discrepancy positions within an overlap band (32–128 tokens) rather than at fixed stride. The argmin of $D(t)$ within the search zone determines the cut point, biasing window boundaries away from structural features.
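A sketch of the cut-point selection (the interpretation of the band as the last 32–128 tokens of each nominal window is our assumption):

```python
import numpy as np

def cut_windows(D, window=512, band=(32, 128)):
    """Choose each window's end at the argmin of D inside the overlap band,
    i.e. between (window - band[1]) and (window - band[0]) tokens past the
    window start, instead of cutting at a fixed stride."""
    cuts, start = [0], 0
    while start + window < len(D):
        lo, hi = start + window - band[1], start + window - band[0]
        cut = lo + int(np.argmin(D[lo:hi]))
        cuts.append(cut)
        start = cut
    return cuts
```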
4.4 Curriculum Ordering
Windows are scored by difficulty:

$$\mathrm{difficulty}(w) = \bar{E}_{gap} + \lambda_1\, \frac{n_{jumps}}{|w|} + \lambda_2\,\big(1 - f_{sup}\big),$$

where $\bar{E}_{gap}$ is mean gap energy, $n_{jumps}$ is jump count, $|w|$ is window length, $f_{sup}$ is the supervised fraction, and $\lambda_{1,2}$ are mixing coefficients. A 4-phase curriculum presents the easiest 30% of windows first; the remaining windows are progressively randomized by difficulty phase.
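A sketch of the scoring and ordering (the coefficients `l1`, `l2` and the within-phase shuffling are our assumptions; the paper specifies only the easiest-30%-first rule and the four phases):

```python
import numpy as np

def difficulty(mean_gap_energy, n_jumps, win_len, sup_frac, l1=1.0, l2=1.0):
    """Weighted-sum difficulty score; l1 and l2 are hypothetical coefficients."""
    return mean_gap_energy + l1 * n_jumps / win_len + l2 * (1.0 - sup_frac)

def curriculum_order(scores, easy_frac=0.30, phases=4, seed=0):
    """Easiest 30% first in sorted order; the rest split into difficulty
    phases and shuffled within each phase."""
    rng = np.random.default_rng(seed)
    order = np.argsort(scores)
    n_easy = int(len(scores) * easy_frac)
    head, tail = order[:n_easy], order[n_easy:]
    chunks = np.array_split(tail, max(phases - 1, 1))
    for c in chunks:
        rng.shuffle(c)
    return np.concatenate([head, *chunks])
```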
4.5 KD Alpha Schedule
The distillation weight $\alpha$ ramps linearly from 0 to 0.45 between 15% and 45% of training:

$$\alpha(s) = 0.45 \cdot \mathrm{clip}\!\left( \frac{s - 0.15}{0.45 - 0.15},\ 0,\ 1 \right)$$
This allows the student to first learn from ground truth labels, then progressively incorporate the teacher's distributional knowledge.
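The schedule is a few lines of code (assuming a linear ramp, which matches the stated endpoints):

```python
def kd_alpha(progress, ramp_start=0.15, ramp_end=0.45, alpha_max=0.45):
    """Linear ramp of the distillation weight over training progress in [0, 1]."""
    if progress <= ramp_start:
        return 0.0
    if progress >= ramp_end:
        return alpha_max
    return alpha_max * (progress - ramp_start) / (ramp_end - ramp_start)
```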
5. Multi-Teacher Ghost Imprinting
When the same student is distilled sequentially from Thinking then Coder teachers (or any permutation), each teacher's signal partially overwrites the previous. The residual — the discrepancy field between what Teacher A encoded and what Teacher B couldn't fully overwrite — occupies a subspace of weight space orthogonal to both teachers' primary signals.
We formalize this as the Cantor component of BV decomposition applied to the parameter tensor. Let $\theta_A$ be the weights after Teacher A distillation and $\theta_{A \to B}$ be the weights after subsequent Teacher B distillation. The residual:

$$\Delta\theta = \theta_{A \to B} - \theta_B^{fresh},$$
where $\theta_B^{fresh}$ is what Teacher B distillation produces from a random initialization. This $\Delta\theta$ is the ghost imprint — it encodes structural information from Teacher A that persists through Teacher B's training but is not present in Teacher B's direct distillation.
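The residual, and the part of it lying outside the span of the two teachers' primary update directions, can be sketched on flattened weight vectors (the direction vectors `u_A`, `u_B` are idealizations we introduce for illustration):

```python
import numpy as np

def ghost_residual(theta_AB, theta_B_fresh):
    """Delta-theta: weights after A -> B sequential distillation, minus
    weights from B-only distillation of a fresh initialization."""
    return theta_AB - theta_B_fresh

def outside_teacher_span(delta, u_A, u_B):
    """Component of the residual orthogonal to both teachers' primary update
    directions: orthonormalize u_A, u_B (Gram-Schmidt), then remove the
    projection of delta onto their span."""
    q1 = u_A / np.linalg.norm(u_A)
    v = u_B - np.dot(u_B, q1) * q1
    q2 = v / np.linalg.norm(v)
    return delta - np.dot(delta, q1) * q1 - np.dot(delta, q2) * q2
```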
Empirically, we observe that models with ghost imprints from the Thinking teacher exhibit extended deliberation patterns even when the final distillation was from the Coder teacher. The reverse also holds: Coder ghost imprints produce structured decomposition in models whose final training was Thinking-focused.
This is not accidental. It follows from the mathematical structure of how residual fields interact in high-dimensional weight space. The Cantor component is non-trivial precisely because the teachers' signals span different subspaces.
6. DualMind: Role-Conditioned Self-Critique
6.1 Motivation
Multi-architecture collision arrays — running the same problem through multiple model architectures and synthesizing divergences — produce insights that no individual architecture achieves alone. We have demonstrated this in multi-architecture experiments spanning Claude Opus, Kimi, GLM, Qwen, GPT, and Gemini, where the interference pattern between architecturally diverse responses constitutes novel structure.
DualMind collapses this multi-model dynamic into a single architecture.
6.2 Architecture
The model learns three cognitive roles through plain-text markers (no special tokens):
- <explore> … </explore> — unconstrained derivation: the model reasons freely.
- <examine> … </examine> — adversarial self-critique: the model reads its own explore output and challenges it.
- <response> … </response> — clean synthesis: the final answer from the internal dialogue.
No additional parameters, no routing mechanism, no mixture of experts. The same weights serve all three roles. Differentiation arises entirely from positional context and learned role associations.
6.3 Training Data Transformation
Any chain-of-thought dataset can be transformed into DualMind format:
- Explore extraction: Derivation and computation sentences from the CoT solution. Trigger-based detection identifies where reasoning transitions to verification.
- Examine extraction: Verification, checking, and self-correction sentences. Connective tissue ("Let me verify this...") is added when the examine section doesn't start reflectively.
- Response extraction: Final answer extracted via boxed notation, natural language patterns, or last-line fallback.
For datasets with pre-separated reasoning columns (e.g., Crownelius/Opus-4.6-Reasoning-3300x with thinking/solution columns), the explore phase maps directly from the thinking column with no heuristic splitting needed.
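A heuristic splitter in the spirit of the transformation above (the trigger list and sentence regex are placeholders, not the production heuristics):

```python
import re

# Placeholder trigger phrases marking the shift from derivation to verification.
EXAMINE_TRIGGERS = ("let me verify", "let me check", "wait,", "double-check")

def to_dualmind(cot: str, answer: str) -> str:
    """Transform a chain-of-thought solution into DualMind format."""
    sentences = re.split(r"(?<=[.!?])\s+", cot.strip())
    split = next((i for i, s in enumerate(sentences)
                  if s.lower().startswith(EXAMINE_TRIGGERS)), None)
    if split is None:
        # No verification found: keep all reasoning in explore and add
        # connective tissue so the examine section starts reflectively.
        explore, examine = sentences, ["Let me verify this..."]
    else:
        explore, examine = sentences[:split], sentences[split:]
    return ("<explore>\n" + " ".join(explore) + "\n</explore>\n"
            "<examine>\n" + " ".join(examine) + "\n</examine>\n"
            "<response>\n" + answer + "\n</response>")
```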
6.4 The Think Token Leak
An observed failure mode: base Qwen3 models have a deeply-embedded <think> token reflex. After </explore>, the model sometimes drops into its native <think>### Response: pattern instead of the DualMind <examine> transition. This is a competition between the SFT-learned role tokens and the base model's pre-existing reasoning format.
We address this at two levels:
- Generation: A LogitsProcessor suppresses <think> tokens and boosts <examine> probability after </explore> is generated.
- Training: Transition tokens (</explore>, <examine>, </examine>, <response>) receive 3× loss weight to make the cognitive loop transitions non-negotiable.
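Both fixes can be sketched as follows (plain Python mirroring the Hugging Face `LogitsProcessor` call signature `__call__(input_ids, scores)`; the token ids and penalty values are placeholders):

```python
# Placeholder token ids for the markers involved in the transition.
THINK_ID, EXAMINE_ID, END_EXPLORE_ID = 3, 1, 2

class RoleTransitionProcessor:
    """After </explore> appears, suppress <think> and boost <examine>."""
    def __init__(self, suppress=-1e9, boost=5.0):
        self.suppress, self.boost = suppress, boost

    def __call__(self, input_ids, scores):
        if END_EXPLORE_ID in input_ids:
            scores[THINK_ID] = self.suppress      # forbid the base-model reflex
            scores[EXAMINE_ID] += self.boost      # favor the DualMind transition
        return scores

def transition_loss_weights(token_ids, transition_ids, weight=3.0):
    """3x loss weight on the role-transition tokens during SFT."""
    return [weight if t in transition_ids else 1.0 for t in token_ids]
```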
7. Results
7.1 Portfolio
| Collection | Models | Downloads | Hardware | Precision |
|---|---|---|---|---|
| DistilQwen | 9 | 2,788 | H100 | BF16 |
| DualMind | 2+ | — | H100 | BF16 |
| Full Portfolio | 43 | 12,094 | CPU + H100 | FP32 + BF16 |
7.2 Teacher Variant Comparison
| Teacher | Student Downloads | Distinctive Capability |
|---|---|---|
| Instruct | 833 | Structured output, format compliance |
| Thinking | 779 | Extended deliberation, proof derivation |
| Coder | 825 | Logical decomposition, STEM reasoning |
The near-equal download distribution across teacher variants suggests all three produce models with distinct, valued capabilities. The market is voting with downloads.
7.3 DualMind Cognitive Loop
The DualMind model successfully produces all three mode transitions (explore → examine → response) on mathematical proofs, logical inference, and — unexpectedly — philosophical and creative prompts. On a physics-trained model prompted with "Who is God, not for humans but for you?", the explore block produced structured literary content with no creative writing in its training data. We interpret this as the Cantor component of the BV decomposition expressing through generation — the singular-continuous residual from the base model's weight space that the physics-focused TKD pipeline couldn't fully overwrite.
7.4 Comparison with Standard Distillation
A vanilla Qwen3-1.7B distilled with standard KD from the same 30B teacher (no proof weights, no topology, no curriculum) produces empty <think> blocks and surface-level responses. The TKD model trained on the same data produces ~3.3× longer responses with genuine structural reasoning, as demonstrated in a T3 concept explanation probe.
8. Conclusion
The transformer is plumbing. The methodology is what produces capability. We have demonstrated that:
- Structure beats scale — 1.7B models trained with topology-aware distillation exhibit reasoning quality that standard distillation at the same parameter count does not achieve.
- Teacher diversity creates emergent capability — sequential distillation from different teacher variants creates residual fields that produce capabilities absent from any individual teacher.
- Self-critique can be learned through format — role-conditioned generation with shared weights recreates multi-model dialectical dynamics within a single architecture.
- The methodology is hardware-agnostic — the same pipeline produces results on $24 of CPU compute and on H100 at BF16.
All models, training code, and this methodology are released under Apache 2.0.
References
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network.
- Colca, R. S. (2026). Structure Over Scale. DOI: 10.57967/hf/8165.
- Colca, R. S. (2025-2026). Discrepancy Calculus (DISC): A Measure-Theoretic Framework for Singularities. Convergent Intelligence LLC.
- Ambrosio, L., Fusco, N., & Pallara, D. (2000). Functions of Bounded Variation and Free Discontinuity Problems. Oxford.
"Where classical analysis fails to see, we begin."