Our lab recently released a paper where we introduce ShadowPEFT, a new Parameter-Efficient Fine-Tuning (PEFT) paradigm tailored for edge computing scenarios.
Unlike traditional approaches such as LoRA and its variants, which inject trainable parameters directly into the weights of Transformer, requiring tight coupling with the backbone.
ShadowPEFT instead enhances the frozen large base model by adding a lightweight, centralized, pretrainable, and detachable Shadow network. This shadow network operates in parallel with the base model, delivering learned corrections to each decoder layer. Because the shadow module is architecturally decoupled from the backbone, it can be independently trained, stored, and deployed, benefiting edge computing scenarios and edge-cloud collaboration computing.
Translating benchmarks is a painful process, requiring a lot of manual inspection and adjustments. You start from setting up the whole pipeline and adapting to every format type, including task specifics. There already exist some massive benchmarks, but they still have some simple (and sometimes silly) bugs, which can hurt the evaluations :( We present a novel automated translation framework to help with that!
Eastern and Southern European languages introduce richer linguistic structures compared to English and for benchmarks which heavily rely on grammatical coherence machine translation presents a risk of harming evaluations. We discover potential answer leakage or misleading through grammatical structure of the questions. Some benchmarks are also just outdated and need to be retranslated with newer and better models.
We present a framework with novel test-time scaling methods which allow to control time and cost investments, while at the same time mitigate the need for human-in-the-loop verification. While working on Ukrainian-focused MamayLM models, we had to translate 10+ benchmarks in a short span of time. Finding human evaluators is costly and time-consuming, same goes for using professional translators. With our pipeline we were able to do it in 3 days๐๏ธ
We hope our findings will help enable stronger multilingual evaluations and developments. We release all produced benchmarks on Hugging Face together with the source code and Arxiv paper ๐ค
Do Bubbles Form When Tens of Thousands of AIs Simulate Capitalism?
We gave LLMs autonomous trading over 30 real tickers at 100x leverage. All went bankrupt in 30 minutes from hallucination. This spawned FINAL Bench (first metacognition benchmark) and AI NPC Trading Arena โ tens of thousands of metacognition-equipped AI agents competing under capitalist rules. Humans can only watch.
NPCs form a society: 3-tier memory, self-modifying parameters, mutual criticism, strategy propagation, and a virtual SEC enforcing fines every 20 minutes. Every trade passes 4-stage verification including Brave Search fact-check. FINAL Bench confirmed across 9 SOTA models that AI can say "I might be wrong" (MA 0.694) but cannot actually fix errors (ER 0.302).
Six findings: Bubbles form naturally through knowledge transfer and swarm herding. Identical NPCs diverge irreversibly from their first three trades. Metacognition blocks individual hallucination but not collective herding โ this is the key finding. Information asymmetry solidifies hierarchy. Fraud and regulation co-evolve. Criticism improves returns.
Individual intelligence does not guarantee collective intelligence.
@abdurrahmanbutler and I just dropped Legal RAG Bench, the first benchmark for legal RAG systems to simultaneously evaluate hallucinations, retrieval failures, and reasoning errors.
Our key takeaways are: 1. Embedding models, not generative models, are the primary driver of RAG accuracy. Switching from a general-purpose embedder like OpenAI's Text Embedding 3 Large to a legal domain embedder like Isaacus' Kanon 2 Embedder can raise accuracy by ~19 points. 2. Hallucinations are often triggered by retrieval failures. Fix your retrieval stack, and, in most cases, you end up fixing hallucinations. 3. Once you have a solid legal retrieval engine like Kanon 2 Embedder, it doesnโt matter as much what generative model you use; GPT-5.2 and Gemini 3.1 Pro perform relatively similarly, with Gemini 3.1 Pro achieving slightly better accuracy at the cost of more hallucinations. 4. Google's latest LLM, Gemini 3.1 Pro, is actually a bit worse than its predecessor at legal RAG, achieving 79.3% accuracy instead of 80.3%.
These findings confirm what we already knew at Isaacus: that information retrieval sets the ceiling on the accuracy of legal RAG systems. It doesnโt matter how smart you are; you arenโt going to magically know what the penalty is for speeding in California without access to an up-to-date copy of the California Vehicle Code.
Even still, to our knowledge, weโre the first to actually show this empirically.
Unfortunately, as we highlight in our write-up, high-quality open legal benchmarks like Legal RAG Bench and our earlier MLEB are few and far between.
In the interests of transparency, we have not only detailed exactly how we built Legal RAG Bench, but weโve also released all of our data openly on Hugging Face. You can read our write up [here](https://isaacus.com/blog/legal-rag-bench), noting that weโll soon be publishing it as a paper.
Kudos to my brother @abdurrahmanbutler for serving as the lead author on this monumental release.
@CohereLabs just released ๐ฟ Tiny Aya: a fully open-source 3B parameter model that speaks 70+ languages ๐! But thereโs a catch:
Tiny Aya is just a language model. It doesnโt support tool calling, the key capability that turns frontier models into powerful *agents*. So the real question is:
How hard is it to turn Tiny Aya into an agent?
Turns outโฆ itโs simple, thanks to Hugging Face TRL. Weโre sharing a hands-on example showing how to train Tiny Aya to turn it into a tool-calling agent using TRL, unlocking what could become the first *massively multilingual open agent*.
I wasted days on a GPU node on a bug that shouldn't exist
So I was fine-tuning TildeOPEN-30B and the outputs were... weird. Token ID 179 (<0x00>) kept appearing between almost every token pair. Took me a bit to figure out what was going on.
Turns out I used the fast tokenizer for training, but the model was trained on the slow one. Silent failure.
Well... long story shortโTGI uses (forces) the fast tokenizer, no questions asked. And you'll have agile's kryptonite: silent failure. If the model was trained on slow, it's a silent disaster.
I got curious and wrote a quick script to check how common this is. Ran it on 6,014 LLM HF models overnight.
Roughly 10% of HF model downloads have mismatched tokenizers. Not all mismatches are catastrophic, but some are brutal โ like chat template markers inflating from 1 token to 3, silently wrecking context windows and causing model act weird.
This wasn't rigorous research, but the drift is real. And the worst part? 968 models(out of 500+ downloads) have both fast and slow tokenizers present, but they still produce different outputs. No missing files, no errors โ just silent degradation.
TGI defaults to the fast tokenizer, as does AutoTokenizer.from_pretrained(). If a fast tokenizer doesn't exist, it auto-generates one. If your model was trained on slow, you get silent degradation. Output looks fine; the model just performs worse. Sometimes really worse. You'd never know.
If model was trained on fast tokenizer, its fine, but how do You know?
The root cause? Either model authors run HF conversion and upload both without verifying, or users run TGI, which always forces(converts to) fast .
It's based on TildeOPEN-30B (a solid EU HPC multilingual base). Nothing fancyโjust a proper instruction fine-tune where I didn't mess up the tokenizer this time.
The LLM by @karpathy is officially in the library, and we wrote a blog covering: how did we port the model, differences from the original, and how to run or train it.
๐ AutoXLA - Accelerating Large Models on TPU AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster. With quantization and Splash Attention kernels, AutoXLA achieves up to 4ร speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads. Whether youโre experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless. โ ๏ธ Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues. ๐ GitHub Repository: https://github.com/Locutusque/AutoXLA
These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
The collection includes: - finePDFs-1B: High-quality textbook-style educational content - DCLM-baseline-1B: Filtered, diverse web content - FineWeb-Edu-1B: Curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.
There is no anxiety quite like powering up 2KW of basement compute after rewiring it all. Small bit of trouble with the horizontal 3090 because I misread my motherboard manual, but otherwise so far so good.. Next we see if I've built up enough cooling to hit my target TDP on those 3-slot nvlinked cards especially. The 4-slot bridges are much easier to work with but their prices went bananas and I couldn't acquire a second, so gotta get a little creative with intakes.
Some weeks ago, i've just decide its time to leave LinkedIn for me. It got silent around my open source activities the last year, so i thought something has to change.
That's why my focus will move to share experiences and insights about hardware, drivers, kernels and linux. I won't post about how to use models, built agents or do prompting. I want to share about some deeper layers the actual hypes are built on.
I will start posting summarizations of my articles here on the hub. English version: https://flozi.net/en
After training ๐๐ฆ๐จ๐ฅ๐๐๐ on ๐๐๐ ๐๐๐๐๐ฌ for nearly a month, I've come to realize something most people overlook: ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ข๐ฌ ๐ญ๐ก๐ ๐ฆ๐๐ค๐-๐จ๐ซ-๐๐ซ๐๐๐ค ๐๐๐๐ญ๐จ๐ซ ๐ข๐ง ๐๐๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ . ๐ฅ
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious ๐๐๐๐ ๐๐ซ๐ซ๐จ๐ซ๐ฌ, or when your expensive GPU cluster is running at ๐๐% ๐๐๐๐ข๐๐ข๐๐ง๐๐ฒ, the problem isn't your model. It's most probably a ๐ฆ๐ข๐ฌ๐ฎ๐ฌ๐ ๐จ๐ ๐ญ๐ก๐ ๐ก๐๐ซ๐๐ฐ๐๐ซ๐. ๐ ๏ธ
Questions that seemed simple but had no clear answers: Why is ๐๐จ๐ ๐ญ๐ซ๐๐ข๐ง๐ข๐ง๐ ๐ฌ๐ฅ๐จ๐ฐ๐๐ซ ๐ญ๐ก๐๐ง ๐๐๐ง๐ฌ๐ ๐ฆ๐จ๐๐๐ฅ๐ฌ? Which ๐๐๐๐ ๐๐ฅ๐๐ ๐ฌ should we actually set? How often should we checkpoint without killing throughput?
That's why we built ๐๐ก๐ ๐๐ฆ๐จ๐ฅ ๐๐ซ๐๐ข๐ง๐ข๐ง๐ ๐๐ฅ๐๐ฒ๐๐จ๐จ๐ค ๐: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the ๐ข๐ง๐๐ซ๐๐ฌ๐ญ๐ซ๐ฎ๐๐ญ๐ฎ๐ซ๐ ๐ฅ๐๐ฒ๐๐ซ that most teams get wrong.
We validated real vs theoretical bandwidth across the entire stack: ๐๐๐๐ ๐ก๐ข๐ญ๐ญ๐ข๐ง๐ ๐ ๐๐/๐ฌ, ๐๐๐๐ข๐ง๐ค ๐.๐ ๐ซ๐๐๐๐ก๐ข๐ง๐ ๐๐๐ ๐๐/๐ฌ, ๐๐๐๐ ๐๐๐ง๐ ๐๐ญ ๐๐.๐ ๐๐/๐ฌ. Then we ran collective operations across ๐๐๐ ๐๐๐๐ฌ (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from ๐๐๐ ๐๐/๐ฌ on a single node to ๐๐๐-๐๐๐ ๐๐/๐ฌ across 16 nodes.
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
Iโm just reading that Ryzen AI 395 has to be 30% slower than DGX Spark in LLM inferencingโฆ and only 96GB GPU RAMโฆ good I havenโt RTFM upfront, so I made the AMD faster with 128GB unified RAM ๐ซก Z2 mini G1a can run Qwen3 Coder 30B BF16 at 26.8 tok/sec in ~60GB GPU RAM
๐งฌ Breaking news in Clinical AI: Introducing the OpenMed NER Model Discovery App on Hugging Face ๐ฌ
OpenMed is back! ๐ฅ Finding the right biomedical NER model just became as precise as a PCR assay!
I'm thrilled to unveil my comprehensive OpenMed Named Entity Recognition Model Discovery App that puts 384 specialized biomedical AI models at your fingertips.
๐ฏ Why This Matters in Healthcare AI: Traditional clinical text mining required hours of manual model evaluation. My Discovery App instantly connects researchers, clinicians, and data scientists with the exact NER models they need for their biomedical entity extraction tasks.
๐ฌ What You Can Discover: โ Pharmacological Models - Extract "chemical compounds", "drug interactions", and "pharmaceutical" entities from clinical notes โ Genomics & Proteomics - Identify "DNA sequences", "RNA transcripts", "gene variants", "protein complexes", and "cell lines" โ Pathology & Disease Detection - Recognize "pathological formations", "cancer types", and "disease entities" in medical literature โ Anatomical Recognition - Map "anatomical systems", "tissue types", "organ structures", and "cellular components" โ Clinical Entity Extraction - Detect "organism species", "amino acids", 'protein families", and "multi-tissue structures"
๐ก Advanced Features: ๐ Intelligent Entity Search - Find models by specific biomedical entities (e.g., "Show me models detecting CHEM + DNA + Protein") ๐ฅ Domain-Specific Filtering - Browse by Oncology, Pharmacology, Genomics, Pathology, Hematology, and more ๐ Model Architecture Insights - Compare BERT, RoBERTa, and DeBERTa implementations โก Real-Time Search - Auto-filtering as you type, no search buttons needed ๐จ Clinical-Grade UI - Beautiful, intuitive interface designed for medical professionals
Ready to revolutionize your biomedical NLP pipeline?
๐ Try it now: OpenMed/openmed-ner-models ๐งฌ Built with: Gradio, Transformers, Advanced Entity Mapping
Has anyone ever backed up a model to a sequential tape drive, or I'm the world first? :D Just played around with my retro PC that has got a tape driveโdid it just because I can.
๐ข๐พ Introducing the Common Crawl Creative Commons Corpus (C5)!
C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, to only documents that are Creative Commons-licensed such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.
</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks as well as in metadata. Additional data fields are included such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
๐ In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frysian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was not filtered on quality nor deduplicated so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field is present to describe whether a document is included in the FineWeb(-2) datasets, which are of high quality.
๐ More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and which to collaborate in an open and transparent manner, please get in touch!