---
license: other
license_name: raml-v1.0
pipeline_tag: text-generation
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- RxNN
- RxLM
- ReactiveTransformer
- Event-Driven
- MemorySystem
- ShortTermMemory
- Real-Time
- ReactiveLanguageModel
- RealTimeLanguageModel
language:
- en
datasets:
- HuggingFaceFW/fineweb-edu
- wikimedia/wikipedia
- HuggingFaceFW/clean-wikipedia
- ReactiveAI/smol-smoltalk-Interaction-SFT
- ReactiveAI/cosmopedia-100k-Interaction-SFT
- ReactiveAI/Real-Chat-SMAT
- ReactiveAI/Real-Chat-No-System-SMAT
library_name: RxLM
gated: true
extra_gated_prompt: >-
  Accept [Reactive AI Model & Architecture License (RAML)
  v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to
  access the repository and use the model. Reactive Transformer (pending patent
  #P.453260) is available for free for non-commercial usage. For commercial
  usage please contact Reactive AI at licensing@rxai.dev
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this model for non-commercial use ONLY: checkbox
extra_gated_heading: >-
  You need to agree to use this model only for research or education purposes
  under the Reactive AI Model & Architecture License (RAML) v1.0
extra_gated_description: The repository will be available instantly after accepting the license terms
extra_gated_button_content: Accept license terms
base_model:
- ReactiveAI/RxT-Beta-Micro-Supervised
---

# RxT-Beta-Micro-Supervised AI 270M
World's first experimental real-time **Reactive Language Model (RxLM)** trained on limited real-world data (following the synthetic
RxT-Alpha generation). It is based on the revolutionary **Reactive Transformer** architecture, which processes only single
interactions/messages, with all context moved to a **Short-Term Memory** managed by an **Attention-Based Memory System**.

This model is a fine-tuned version of [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised),
specialized in AI/Data Science knowledge-based chats and interactive [Reactive AI](https://huggingface.co/ReactiveAI) documentation.

> Docs in progress

## Model Details

### Model Description
First **Reactive Language Model (RxLM)** trained on limited real-world datasets, based on the **Reactive Transformer (RxT)** architecture.

**RxLMs** have linear computational/inference cost scaling (`O(N·T)`), compared to the quadratic growth (`O(N²·T)`) of **LLMs**,
where `N` is the number of messages in the conversation and `T` is the number of tokens in a single interaction. Thanks to that
scaling, they are roughly `N` times faster and cheaper than **LLMs**.
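
The scaling claim can be illustrated with a toy cost model (illustrative only, not a benchmark): an LLM's turn `n` reprocesses all `n` accumulated interactions, while an RxLM processes a fixed-size interaction per turn.

```python
def llm_cost(n_messages: int, tokens_per_interaction: int) -> int:
    # Each turn reprocesses the whole history accumulated so far: O(N^2 * T) total.
    return sum(turn * tokens_per_interaction for turn in range(1, n_messages + 1))

def rxlm_cost(n_messages: int, tokens_per_interaction: int) -> int:
    # Each turn processes only the current interaction: O(N * T) total.
    return n_messages * tokens_per_interaction

N, T = 100, 1024
print(llm_cost(N, T))   # 5,171,200 token-passes - grows quadratically with N
print(rxlm_cost(N, T))  # 102,400 token-passes - grows linearly with N
```

For a 100-message conversation at 1024 tokens per interaction, the toy model gives a ~50x gap, which keeps widening as the conversation grows.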

That's not the only advantage - event-driven real-time processing with memory is far more natural and human-like
than the data-driven approach of LLMs (reprocessing the full conversation history every time). It's a crucial milestone in the development
of AGI and awareness models.

> This is the _Supervised_ version of the model with a "weak" memory system - the result of Supervised Memory System Training (SMST). It is
> able to remember information between interactions (without passing it explicitly in the prompt/chat template), but it
> has to be refined in the next Memory Reinforcement Learning (MRL) stage for full functionality.

After successful experiments with simple synthetic datasets, we moved to real-world data, but this model still had a limited
amount of English-only data for pre-training - only 10B tokens from Wikipedia and FineWeb-Edu (+2B tokens in later stages).
Its general knowledge may therefore be limited, so we fine-tuned it for chats based on AI/Data Science knowledge.

### Reactive Transformer Architecture
Experimental research model made to test our Reactive Transformer architecture and Attention-Based Memory System.

Reactive Transformer has additional Short-Term Memory layers, connected to the model with Memory Cross-Attention and updated by the Memory Encoder and Memory Attention.
The Short-Term Memory state is kept between interactions/events (single messages), not between tokens in a sequence - that's the key difference between RxNNs and RNNs.
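
The event-driven loop can be sketched as follows (a minimal toy with hypothetical names, not the real model): state persists between interactions, and each event sees only the new message plus the memory, never the full conversation history.

```python
class ToyReactiveModel:
    """Toy stand-in showing where state lives in an event-driven model."""

    def __init__(self):
        self.stm = []  # stands in for the Short-Term Memory state

    def interact(self, message: str) -> str:
        # Respond using only the new message and the current memory state...
        answer = f"seen {len(self.stm)} previous interaction(s)"
        # ...then update the memory after responding (asynchronously in the real model).
        self.stm.append(message)
        return answer

model = ToyReactiveModel()
print(model.interact("Hello!"))      # seen 0 previous interaction(s)
print(model.interact("Follow-up?"))  # seen 1 previous interaction(s)
```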

The goal of the architecture is to process only single messages and keep the conversation history in Short-Term Memory - we believe this is the key requirement
for awareness and AGI. Processing the entire chat history on every interaction is not natural, and that's not how human awareness works. The Reactive Transformer
architecture is therefore a first step in the transition from language models to awareness models.

To balance the number of parameters, the decoder is based on a Mixture-of-Experts architecture, while the encoder uses regular
dense feed-forward layers. This model uses the gated self/interlayer variant of the memory attention network with sigmoid residual gates.
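
A sigmoid residual gate of this kind is commonly the update `stm' = g * update + (1 - g) * stm`, applied elementwise; the exact form used here is an assumption about the implementation, but the mechanism can be sketched in a few lines:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(stm, update, gate_logits):
    # Elementwise gated residual: the gate decides, per value, how much of the
    # attention update replaces the old memory versus how much is retained.
    return [
        sigmoid(g) * u + (1.0 - sigmoid(g)) * s
        for s, u, g in zip(stm, update, gate_logits)
    ]

stm = [1.0, 0.0]
update = [0.0, 1.0]
# gate logit 0.0 -> g = 0.5: the new state lands halfway between old and update
print(gated_update(stm, update, [0.0, 0.0]))  # [0.5, 0.5]
```

A gate near 1 overwrites a slot with new information; a gate near 0 preserves it, which is what lets old information decay gradually rather than vanish.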

<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised/resolve/main/reactive-transformer-self-interlayer.png" width="800" />

#### Architecture details:
- dim: 256
- layers: 14
- heads (for split): 16
- **Decoder:**
  - self-attention: Sparse Query Attention
    - query heads: 8/16
    - key/value heads: 4/16
  - memory cross-attention: Sparse Query Attention
    - query heads: 8/16
    - key/value heads: 4/16
  - Mixture-of-Experts Feed Forward
    - experts: 42
    - active experts: 4
    - SwiGLU feed forward with 512 dim
  - size: ~251M (~41M activated)
- **Encoder:**
  - self-attention: symmetric Sparse Query Attention
    - query/key/value heads: 8/16
  - SwiGLU feed forward with 768 dim
  - size: ~18.3M
- **Memory Attention:**
  - variant: **Gated Self/Interlayer Memory Attention**
  - attention layers: symmetric Sparse Query Attention
    - query/key/value heads: 8/16
  - residual gate: elementwise with sigmoid activation (per STM slot)
  - size: ~3.73M
- RoPE for self-attention, memory cross-attention (query only) and memory attention (key only)
- RMS Norm for all normalization layers
- vocab: 32k (English only)
- interaction (query + answer) length: 1024 tokens
- STM size: 14 layers × 1024 slots (× 256 dim)
- context/messages: **Infinite**
- size: ~270M
- Library: RxLM
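
Some back-of-the-envelope arithmetic for the figures above (dims and counts are from this card; the SwiGLU parameter formula `3 * d_model * d_ff` is an assumption about the implementation, and the MoE totals cover only the expert feed-forward weights, not attention or embeddings):

```python
d_model, n_layers, n_slots = 256, 14, 1024
n_experts, active_experts, d_ff = 42, 4, 512

# Total values held in Short-Term Memory: 14 layers x 1024 slots x 256 dim
stm_values = n_layers * n_slots * d_model

per_expert = 3 * d_model * d_ff                # SwiGLU: gate, up and down projections
moe_total = n_layers * n_experts * per_expert  # all experts, all decoder layers
moe_active = n_layers * active_experts * per_expert

print(f"STM holds {stm_values:,} values")                       # 3,670,016
print(f"MoE FFN: {moe_total/1e6:.0f}M total / {moe_active/1e6:.0f}M active")  # 231M / 22M
```

The expert feed-forward weights alone account for most of the ~251M decoder parameters, which is why only ~41M are activated per token despite the large total.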
---
- **Developed by:** [Adam Filipek](https://huggingface.co/AdamF92) & [Reactive AI](https://huggingface.co/ReactiveAI)
- **Funded by:** [Reactive AI](https://huggingface.co/ReactiveAI)
- **Model type:** **Reactive Language Model (RxLM)**
- **Language(s) (NLP):** English
- **License:** [Reactive AI Model & Architecture License (RAML) v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md)
- **Finetuned from model:** [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised)

### Model Sources

- **Repository:** [RxLM Framework](https://github.com/RxAI-dev/rxlm)
- **Paper:** [Reactive Transformer (RxT) - Stateful Real-Time Processing for Event-Driven Reactive Language Models](https://arxiv.org/abs/2510.03561)
- **Demo:** In progress

## Uses
This model is a fine-tuned version of [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised), trained on AI/Data Science knowledge
and Reactive AI documentation-based conversations. It is made to serve as interactive documentation of our technologies.

The base model is still experimental and was pre-trained on a limited corpus of only 10B tokens, so its general knowledge is also limited, but it should
work correctly for AI/Data Science oriented topics.

**Supervised** RxT models are partially functional intermediate-stage models - it's recommended to refine them with Memory Reinforcement Learning (MRL) and Reactive
Reinforcement Learning from Human Feedback (RxRLHF) to reach the final stage.

### Direct Use
It's recommended to refine the model in the reinforcement learning stages for full functionality (in progress).

**Reactive Transformer** models are made for conversational tasks, especially chatbots, or as a stateful base for agentic systems.

This model is made to act as interactive documentation of Reactive AI technologies and as an AI/Data Science knowledge agent.

### Out-of-Scope Use
**Reactive Transformer** models are natively conversational and made for multi-step tasks. They aren't typical Gen AI and aren't made
for single-step generative tasks (like summarization, dataset generation, etc.) - they will work in those scenarios, but it would be a waste
of computational resources (initializing/processing memory when it's not needed). For such cases it's better to use a stateless LLM.

## Bias, Risks, and Limitations
The model is still experimental, made to test the **Reactive Transformer** architecture on real-world data after successful experiments with simple synthetic data.
It was pre-trained on only 10B tokens (plus an additional 2B in later stages), so its general knowledge is limited and responses could be inaccurate.

Conversation context is theoretically infinite (the 1024-token limit applies only to a single interaction), but after some number of messages the model will slowly forget
outdated information - that's why it's called **Short-Term Memory**. It will be extended in upcoming generations with **Long-Term Memory** for truly infinite context.

AI/Data Science knowledge and Reactive AI documentation datasets for the fine-tuned model were created _"semi-synthetically"_ with LLMs (GPT-OSS and Qwen3) - the
conversation examples were generated by an LLM, based on provided documentation. It's therefore possible that they include some hallucinations and incorrect facts, but
this should be rather rare.

### Recommendations
As mentioned before, supervised models are at an intermediate stage and it's recommended to continue the training in the reinforcement learning stages.

## How to Get Started with the Model
The model can be loaded and used with our [RxLM framework](https://github.com/RxAI-dev/RxLM):

```python
import torch
from rxlm.rxt.models import RxTBeta
from rxlm.training.tokenizer import load_tokenizer_from_hf_hub

tokenizer = load_tokenizer_from_hf_hub('ReactiveAI/RxT-Beta-Micro')

model = RxTBeta.from_pretrained('ReactiveAI/RxT-Beta-Micro-Supervised-AI', tokenizer=tokenizer)
model.share_components()  # currently required to connect embeddings/STM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

seq_len = 1024

# Memory init - could be used like a "system prompt" in LLMs (not recommended for this model, as it wasn't trained with system prompts)
stm_init_state = model.tokenize_full_interaction('System prompt like', 'Initial memory for the model', max_seq_len=seq_len, device=device)
model.init_stm_state(**stm_init_state)

# Helper function
def interaction(query: str):
    tokenized_query = model.tokenize_query(query, max_seq_len=seq_len, device=device)
    for token_id in model.interact(**tokenized_query, max_seq_len=seq_len, temperature=1.0):
        if token_id == -1:
            print('\n', '[Start memory update...]')
        elif token_id == -2:
            print('[Memory updated]')
        else:
            txt_token = model.stringify_token(token_id)
            print(txt_token, end='')

# Process the first interaction
interaction('Hello! Who are you?')
# Process a follow-up interaction
interaction('Follow-up question?')
```

## Training Details
The stateful, real-time nature of the **Reactive Transformer** architecture, especially the asynchronous memory update, requires an advanced training pipeline with multiple
supervised and reinforcement learning stages:
- Supervised:
  - Joint Language Models Pre-Training | raw large text corpora
  - Interaction Supervised Fine-Tuning | single, unconnected interactions (query + answer)
  - Self-Supervised Memory Attention Pre-Training | multi-step conversations (SMAT datasets)
  - Supervised Memory-Aware Training (SMAT) | multi-step conversations
- Reinforcement:
  - Memory Reinforcement Learning (MRL) | multi-step conversations
  - Reactive Reinforcement Learning from Human Feedback (RxRLHF) | multi-step conversations

Fine-tuning for narrow specialization was performed in additional epochs of **Supervised Memory-Aware Training (SMAT)**.

### Training Data
We used public open-source datasets for pre-training and our custom datasets (converted from public datasets) for the other stages:
- Joint Language Models Pre-Training
  - 'sample-10BT' subset from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
  - '20231101.en' subset from [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- Interaction SFT
  - [ReactiveAI/smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT)
  - [ReactiveAI/cosmopedia-100k-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT)
- Self-Supervised Memory Attention Pre-Training
  - 30% of [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT)
- Supervised Memory-Aware Training (SMAT)
  - [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT)
  - [ReactiveAI/Real-Chat-No-System-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT)
- Specialization SMAT
  - [ReactiveAI/AI-Knowledge-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/AI-Knowledge-Chat-SMAT)
  - [ReactiveAI/ReactiveAI-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/ReactiveAI-Chat-SMAT)

### Training Procedure
Supervised Memory System Training includes four steps before proceeding to the Reinforcement Learning stages.

#### Joint Language Models Pre-Training
The decoder was trained together with the encoder and an additional MLM head model, using Joint LM Training (with MLM and autoregressive losses),
on the [**HuggingFaceFW/fineweb-edu**](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [**wikimedia/wikipedia**](https://huggingface.co/datasets/wikimedia/wikipedia) datasets.
Both encoder and decoder use a shared embedding layer.

#### Supervised Fine-Tuning
The **RxT-Beta Micro** model was fine-tuned to the real-time interaction (sequence) format on our datasets, derived from HuggingFace ones:
- [**ReactiveAI/smol-smoltalk-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT)
- [**ReactiveAI/cosmopedia-100k-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT)

Models were fine-tuned using the Joint LM Training mode (for memory cross-attention pre-training):
- encode the data with the encoder and calculate the MLM loss for it
- save the encoder layers' results as Short-Term Memory (available to the decoder through memory cross-attention)
- process the data with the decoder and calculate the autoregressive loss
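
The three steps above can be sketched as control flow (hypothetical function names; the real implementation lives in the RxLM framework, this shows only how the two losses combine):

```python
def joint_lm_step(batch, encoder, decoder, mlm_head, mlm_loss_fn, ar_loss_fn):
    encoded = encoder(batch)                          # 1. encode the interaction
    mlm_loss = mlm_loss_fn(mlm_head(encoded), batch)  #    ...and score the MLM head on it
    stm = encoded                                     # 2. expose encoder outputs as STM
    logits = decoder(batch, memory=stm)               # 3. decode with memory cross-attention
    ar_loss = ar_loss_fn(logits, batch)
    return mlm_loss + ar_loss                         # joint objective
```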

That training results in a decoder with ~95% accuracy, because it has access to all next-token information through memory cross-attention. In the next training stages it
will access previous interactions' data with those layers.

#### Self-Supervised Memory Attention Pre-Training
Memory Attention was pre-trained to combine accumulated Short-Term Memory states with the next interaction's data processed by the
encoder, using a weighted mean (with randomized arbitrary weights) as labels and negative cosine similarity as the loss. Label weights
depend on the inner step:
- the first step, when STM is in its initial random normal state, uses 90% of the new encoded data
- follow-up steps use `50% - step * 5%` of the new encoded data
- each step could have 0-15% random differences in weights
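
The schedule above can be written out explicitly (the exact indexing of `step` is an assumption, and the 0-15% random jitter is omitted for determinism):

```python
def new_data_weight(step: int) -> float:
    """Fraction of the label taken from the newly encoded interaction."""
    if step == 1:
        return 0.90              # STM starts as random noise, so labels lean on new data
    return 0.50 - step * 0.05    # later steps lean increasingly on accumulated STM

print([round(new_data_weight(s), 2) for s in range(1, 7)])
# [0.9, 0.4, 0.35, 0.3, 0.25, 0.2]
```

The weight on new data shrinks with each inner step, pushing the memory attention network to preserve more of the accumulated state as the conversation progresses.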

Additionally, random noise is added to both inputs and labels.

This model was trained on six arbitrarily selected steps, using a single epoch on 30% of the [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) dataset.

#### Supervised Memory-Aware Training
Finally, with pre-trained/fine-tuned components, in the last supervised stage the model is trained to use previous/accumulated STM
states as the memory cross-attention input, instead of the same sequences as the decoder's input:
- the previous (or first) interaction is processed by the encoder and used to update memory
- the next interaction is processed by the decoder, using related information from STM
- the loss is calculated from the decoder's logits, and gradients propagate through memory attention to the encoder

We used staged memory-aware training with different datasets:
- starting with 2 epochs on 80k raw examples (with 7 interactions each) - [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT)
- then 5 epochs on 27k filtered, better-quality examples - [**ReactiveAI/Real-Chat-No-System-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT)

#### Specialization
After the stages described above, the general-purpose model was saved as [RxT-Beta-Micro-Supervised](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised) and we moved on to AI/Data Science specialization.

It's the same training procedure as the previous stage - Supervised Memory-Aware Training:
- we used 21.5k synthetically generated examples of AI/Data Science knowledge chats from [**ReactiveAI/AI-Knowledge-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/AI-Knowledge-Chat-SMAT), combined with 6.5k examples from the filtered general dataset
- finally we used 50% of the dataset from the previous step and the new [**ReactiveAI/ReactiveAI-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/ReactiveAI-Chat-SMAT) with information about our own technologies and model identity

#### Preprocessing
Pre-training is done on raw text corpora and requires only tokenization. In the next stages, the model processes sequences in a simple **Interaction format**, used
instead of complex chat templates - `[Q] User's query... [A] Model's answer`. For upcoming reasoning models, it will be extended to `[Q] User's query... [T] Reasoning... [A] Model's answer`.
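
A formatting helper for the interaction format above might look like this (the `[Q]`/`[A]` markers are taken from this card; the exact spacing is an assumption, and in practice the RxLM tokenizer handles this):

```python
def format_interaction(query: str, answer: str) -> str:
    # Single interaction: no roles, no history - just the query/answer pair.
    return f"[Q] {query} [A] {answer}"

print(format_interaction("What is RxT?", "A Reactive Transformer model."))
# [Q] What is RxT? [A] A Reactive Transformer model.
```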

#### Training Hyperparameters
- **Training regime:** bf16 mixed precision (AMP autocast)
- **Optimizer:** AdamW
- **Scheduler:** Cosine annealing

## Evaluation
Evaluation is in progress - more details soon!

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics
In progress

##### Supervised Memory-Aware Training Validation Metrics
- **Loss:** 0.5360
- **Perplexity:** 1.7091
- **Accuracy:** 88.97%

### Results

[More Information Needed]

#### Summary

## Environmental Impact
- Base model
  - **Hardware Type:** 4x NVIDIA A100 40GB
  - **Hours used:** 150
- Specialization
  - **Hardware Type:** 1x NVIDIA A100 40GB
  - **Hours used:** 30

## Model Card Contact
[Adam Filipek](https://huggingface.co/AdamF92) - adamfilipek@rxai.dev

Licensing - licensing@rxai.dev