Multilingual Tool Calling in 70+ Languages, On Device
The biggest limitation to multilingual tool calling today is that low-resource languages are often underrepresented in tool-calling training data. Many models cannot understand these languages, let alone follow the instructions written in them.
Cohere Labs recently released Tiny Aya, and we found that, despite it expressly not being trained on tool calling, it is extremely good at instruction following: given an actual template to follow, it performs quite well, especially with the Hermes tool-calling format.
What this means is that you can plug and play a custom version of Tiny Aya with support for over 70 languages, including many low-resource languages such as Swahili and Luganda, and in most cases get success rates above 60-70% on structured function calls.
The beauty is that this approach is not locked to any specific model. You can plug in even the tiniest of models, as long as it can follow the Hermes tool-calling format, which is quite simple. We encourage you to test this with even smaller models and see how well it works for you.
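For reference, the Hermes format advertises the available tools as JSON schemas inside `<tools>` tags in the system prompt, and the model answers with a JSON object wrapped in `<tool_call>` tags. Here is a minimal Python sketch; the `get_weather` tool and the example reply are illustrative, not taken from our evaluation:

```python
import json
import re

# Hermes-style tool calling: the system prompt lists tool schemas inside
# <tools> tags, and the model emits its call as JSON inside <tool_call> tags.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

system_prompt = (
    "You are a function-calling assistant. To call a tool, emit "
    "<tool_call>{...}</tool_call> with a JSON object naming the tool "
    "and its arguments.\n"
    f"<tools>\n{json.dumps([weather_tool], indent=2)}\n</tools>"
)

def parse_tool_calls(model_output: str) -> list:
    """Extract every <tool_call>...</tool_call> JSON payload from a reply."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(match) for match in pattern.findall(model_output)]

# A reply in the shape a Hermes-format model would generate:
reply = '<tool_call>{"name": "get_weather", "arguments": {"city": "Kampala"}}</tool_call>'
calls = parse_tool_calls(reply)
```

Because the whole contract lives in the prompt, any model that follows instructions well enough to reproduce this wrapper can participate, which is exactly what Tiny Aya turned out to do.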
What We Found
These findings come from the Tiny Aya Expedition, where we evaluated 4 Tiny Aya model variants (fire, earth, water, and global) across 53 languages, 2 quantizations (fp16 and 4-bit), and 4 temperatures (0.0, 0.3, 0.7, 1.0). That is 1,696 total configurations, each tested on the MASSIVE-Agents benchmark with 2,792 items per language.
A 3.35B Model Competing With Models Trained for Tool Calling
The headline result that really stood out to us: TinyAya-Earth at 3.35B parameters, with zero tool-call training, beats Aya Expanse 8B (which was trained for tool calling) by 22 percentage points on Luganda (85.9% vs 64.0%), 17 percentage points on Swahili (79.5% vs 62.6%), and 7 percentage points on English (88.7% vs 81.6%).
Command R at 35B remains the upper bound at 97.2% on English, 94.6% on Swahili, and 89.3% on Luganda. But it needs a server, an internet connection, and per-token costs. TinyFacade runs on a mid-range phone with airplane mode on. It is a different problem space entirely.
4-Bit Quantization Outperforms fp16
This one surprised us. On specialist models (fire, earth, water), 4-bit quantization does not just match fp16, it actually outperforms it by 5 to 9 percentage points. TinyAya-Earth 4-bit hits 89.7% JSON accuracy on European languages and 80.2% on Middle Eastern languages. Smaller files, better accuracy. This is exactly what you want for mobile deployment.
Temperature Must Be Zero for Deployment
Every 0.3 step increase in temperature costs you 5 to 10 percentage points across all language families. The degradation is monotonic and there is no recovery. For any deployment scenario, greedy decoding at temperature 0.0 is non-negotiable.
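To see why greedy decoding is the safe setting, recall that temperature rescales the logits before the softmax: as it approaches zero, the distribution collapses onto the highest-logit token, which is exactly the token greedy decoding picks. A toy sketch with made-up logits:

```python
import math

def sample_probs(logits, temperature):
    """Softmax over logits divided by temperature (temperature > 0)."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate tokens (values are illustrative).
logits = [2.0, 1.0, 0.5]

# Lowering the temperature concentrates probability on the top token;
# in the limit, sampling is equivalent to greedy decoding: argmax(logits).
low_temp = sample_probs(logits, 0.01)
greedy = logits.index(max(logits))
```

Structured output is brittle: one wrongly sampled brace or key breaks JSON parsing outright, so any probability mass given to non-argmax tokens translates directly into failed calls.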
Script System Predicts Difficulty More Than Geography
We found that the writing system a language uses (Latin, CJK, Devanagari, Thai, Ethiopic, Armenian, Myanmar, Khmer) is a stronger predictor of tool-calling accuracy than geographic language family. Languages using Latin script perform consistently well regardless of which region they come from.
The Ollama Template Issue
We also discovered that Ollama was loading the Cohere Command template instead of the actual Tiny Aya template that ships with the model, which degraded performance by over 30%. We fixed that in the GGUFs we provide.
What We Built
Beyond evaluation, we also built a couple of things. Our view is that tool calling should not stay confined to your laptop; we wanted to bring it to mobile devices. For this project, we scoped that down to Android.
We built an inference and tool-calling engine, a facade really, that houses all the tool-calling logic and lets you scaffold simple apps on top of it very easily.
This was very intentional. As more applications become good at these capabilities, multilingual, and reliable at following instructions, what will really hold people back is that not everyone can download a 2 GB model bundled with every single app they need. That is simply not scalable.
We think the much better route is a single hub that aggregates all of these models in one place, so that the applications you build on top of it stay extremely small while still being able to use them.
How TinyFacade Works
The idea of TinyFacade is very simple. It lets you run tool calls through the facade and get the output straight back into your application, lets you define custom, user-friendly tools from a set of primitives we have already defined, and makes it extremely simple to spin up these applications.
TinyAya-Earth (3.35B, 4-bit GGUF, 2.14 GB) is loaded once as an Android service exposed via AIDL. It understands English, Swahili, and Luganda instructions and generates structured JSON function calls. No cloud dependency. It runs on mid-range phones (4-6 GB RAM) with under 3 seconds of latency. Clients can register custom tools using 5 action primitives: HTTP, file, system, intent, and content resolver. No code runs on the client, just declarative JSON.
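As a sketch of what such a declarative registration could look like: the field names below (`primitive`, `spec`) and the validation logic are illustrative assumptions on our part; only the five primitive names come from the design above.

```python
import json

# The five action primitives exposed by the facade; the lowercase
# identifiers are assumed spellings for this sketch.
ALLOWED_PRIMITIVES = {"http", "file", "system", "intent", "content_resolver"}

# A hypothetical client-registered tool: pure declarative JSON, no code.
tool_definition = json.loads("""
{
  "name": "open_contacts",
  "description": "Open the phone's contact list",
  "primitive": "intent",
  "spec": {"action": "android.intent.action.VIEW", "uri": "content://contacts/people"}
}
""")

def validate_tool(defn: dict) -> list:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    for field in ("name", "description", "primitive", "spec"):
        if field not in defn:
            errors.append(f"missing field: {field}")
    if defn.get("primitive") not in ALLOWED_PRIMITIVES:
        errors.append(f"unknown primitive: {defn.get('primitive')}")
    return errors

errors = validate_tool(tool_definition)  # [] for a well-formed definition
```

Keeping registrations declarative is what makes the security story tractable: the service only ever interprets a fixed set of primitives, and never executes client-supplied code.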
We built three working clients on top of the Facade that together use only 2.6 GB of storage:
- TinyFacade App: The main inference service with a built-in chat interface
- AIDL Test Client: A companion app for validating tool call execution
- Linga: A standalone Android chat application demonstrating conversational AI powered by the same on-device TinyAya pipeline
What We Provide
Modified GGUF model files: TinyAya variants (fire, earth, water, global), re-quantized to fix the wrong Ollama prompt template that had caused an over-30% drop in performance. Ready for on-device inference via llama.cpp. HuggingFace Collection
Claude Skill: A skill that allows you to quickly scaffold a thin client and build an Android app even without deep Android knowledge. TinyFacade Plugin
TinyFacade Android app + test client: An AIDL-based shared inference service that exposes multilingual tool calling to any app on the device, with a companion test client for validation. TinyFacade GitHub
Linga: A standalone Android chat application demonstrating conversational AI powered by the same on-device TinyAya inference pipeline. Linga App
Luganda translation for MASSIVE: A new community-contributed Luganda locale for the MASSIVE dataset, expanding evaluation coverage for under-resourced East African languages.
All of this was done within a span of two weeks, and we are extremely grateful to the team that pushed through and brought it forward.
Next Steps
Our next steps are to publish some of our more interesting findings, including the fact that quantization actually improved our results. We are also working on merging the Facade and thin client for easier distribution.
Our first use case is voice agents and accessibility: more dynamic accessibility support for people with visual impairments and similar needs. This is just the first step, but we know that once we put this into the hands of developers and people with bright ideas, that is where the full capabilities of this facade and the models beneath it will be realized.
Get Involved
This effort keeps evolving, and we welcome collaboration of any kind. We are over at the Cohere Labs Community Discord and would be more than happy to have you join, especially if you are an expert in this field. We would love to make this something that lots of people around the world can use.
Built by Bronson Bakunga, Kato Steven Mubiru, Gimei Alex, Oj Onyeagwu and Adnan El Assadi for Cohere Labs' Expedition Tiny Aya.
NOTE for Luganda evaluation: MASSIVE-Agents does not include Luganda as one of its evaluation languages. We translated it using Gemini models and performed manual review of the output.


