# Falcon-H1-Tiny-100M-Multilingual-Instruct GGUF

GGUF quantizations of the tiiuae/Falcon-H1-Tiny-100M-Multilingual-Instruct model.

This is a 100M-parameter multilingual instruction-tuned model from the Falcon-H1 Tiny family, built on a hybrid Transformer + Mamba architecture and optimized for edge deployment.
## Model Details
- Architecture: Hybrid Transformers + Mamba
- Parameters: 100M
- Languages: Multilingual (English, Chinese, and others)
- Context Length: 262,144 tokens
- License: Falcon-LLM License
## Available Quantizations
| Quantization | Size | Description |
|---|---|---|
| F16 | 209 MB | Full precision float16 |
| Q8_0 | 113 MB | 8-bit quantization |
| IQ4_NL | 70 MB | 4.5-bit non-linear quantization |
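As a quick sanity check on the table, the effective bits per weight can be derived from file size and parameter count. The figures come out slightly above the nominal bit widths because GGUF files also store metadata and keep some tensors (e.g. embeddings) at higher precision; decimal megabytes are assumed here:

```python
def bits_per_weight(file_size_mb: float, n_params_millions: float) -> float:
    """Effective bits per parameter for a model file (decimal MB assumed)."""
    return file_size_mb * 1e6 * 8 / (n_params_millions * 1e6)

for name, size_mb in [("F16", 209), ("Q8_0", 113), ("IQ4_NL", 70)]:
    print(f"{name}: {bits_per_weight(size_mb, 100):.2f} bits/weight")
# F16 works out to ~16.7 bits/weight, IQ4_NL to ~5.6
```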
## Usage

### llama.cpp

```bash
# Clone this repository
git clone https://huggingface.co/<your-username>/Falcon-H1-Tiny-100M-Multilingual-Instruct-GGUF
cd Falcon-H1-Tiny-100M-Multilingual-Instruct-GGUF

# Run in conversation (chat) mode
llama-cli -m ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf -cnv

# Conversation mode with a generation limit and explicit thread count
llama-cli -m ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf -cnv -n 512 -t 4
```
### llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf",
    n_ctx=2048,
    n_threads=8,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])
```
### ollama

```bash
# Create a Modelfile
echo 'FROM ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf' > Modelfile

# Build and run
ollama create Falcon-H1-Tiny-100M -f Modelfile
ollama run Falcon-H1-Tiny-100M
```
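If ollama does not pick up the chat template from the GGUF metadata, you can declare it explicitly in the Modelfile. The template and stop token below are an assumption based on the ChatML format this model uses; verify against the model's actual chat template before relying on it:

```
FROM ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf

# ChatML-style template (assumed; check the model's tokenizer config)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
```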
## Chat Template

The model uses the ChatML format with `<|im_start|>` and `<|im_end|>` tokens:

```
<|im_start|>user
Your message here<|im_end|>
<|im_start|>assistant
Model response<|im_end|>
```
The chat template is applied automatically when using:

- `llama-cli` with the `-cnv` flag
- `llama-cpp-python` with the `create_chat_completion()` method
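If you call a raw completion API instead, you must build the ChatML prompt yourself. A minimal sketch (the helper name is illustrative, not part of any library), leaving the assistant turn open so the model generates the reply:

```python
def build_chatml_prompt(messages):
    """Format a list of {role, content} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Open the assistant turn; generation should stop at <|im_end|>.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Hello!"}])
print(prompt)
```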
## Model Performance

### Quantization Comparison
| Quantization | Quality | Speed | Use Case |
|---|---|---|---|
| F16 | ⭐⭐⭐ | Fastest | Maximum quality |
| Q8_0 | ⭐⭐ | Fast | Near-lossless quality, balanced size |
| IQ4_NL | ⭐⭐ | Medium | Best size/quality trade-off |
### Recommendations
- For edge/mobile devices: Use IQ4_NL (70 MB) - best compression with good quality
- For desktop/server: Use Q8_0 (113 MB) - better quality with reasonable size
- For maximum quality: Use F16 (209 MB) - no quantization loss
## Limitations
- The 100M multilingual model has limited capacity for complex multilingual tasks
- Chinese factual accuracy may be lower than English due to training data distribution
- Best performance on English; other languages may have reduced quality
- Use higher temperature (0.7-0.9) for creative tasks
- Use lower temperature (0.3-0.5) for factual tasks
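The temperature guidance above can be passed straight into the llama-cpp-python call. A small illustrative helper (the function name and `top_p` values are my own choices, not from any library or the model card):

```python
def sampling_params(task: str) -> dict:
    """Map a task type to sampling parameters per the guidance above."""
    presets = {
        "creative": {"temperature": 0.8, "top_p": 0.95},  # 0.7-0.9 range
        "factual": {"temperature": 0.4, "top_p": 0.9},    # 0.3-0.5 range
    }
    return presets[task]

# Usage: llm.create_chat_completion(messages=..., **sampling_params("factual"))
print(sampling_params("factual"))
```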
## Hardware Requirements
| Quantization | RAM Required |
|---|---|
| IQ4_NL (70 MB) | ~500 MB |
| Q8_0 (113 MB) | ~600 MB |
| F16 (209 MB) | ~800 MB |
## Citation

If you use this model, please cite the original model:

```bibtex
@misc{falcon_h1_tiny,
  title={Falcon-H1-Tiny: A series of extremely small, yet powerful language models redefining capabilities at small scale},
  author={Falcon-LLM Team},
  year={2026},
}
```
## License

This model is distributed under the Falcon-LLM License; see the original model card for the full terms and conditions.

## Acknowledgments

- Original model by the Technology Innovation Institute (TII)
- llama.cpp for GGUF format support