# Falcon-H1-Tiny-100M-Multilingual-Instruct GGUF

GGUF quantizations of the tiiuae/Falcon-H1-Tiny-100M-Multilingual-Instruct model.

This is a 100M-parameter multilingual instruction-tuned model from the Falcon-H1 Tiny family, built on a hybrid Transformer + Mamba architecture and optimized for edge deployment.
## Model Details
- Architecture: Hybrid Transformers + Mamba
- Parameters: 100M
- Languages: Multilingual (English, Chinese, and others)
- Context Length: 262,144 tokens
- License: Falcon-LLM License
## Available Quantizations
| Quantization | Size | Description |
|---|---|---|
| F16 | 209 MB | Full precision float16 |
| Q8_0 | 113 MB | 8-bit quantization |
| IQ4_NL | 70 MB | 4.5-bit non-linear quantization |
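As a quick sanity check on the table, the effective bits per weight can be derived from file size and parameter count. The figures come out slightly above the nominal bit widths because GGUF files also store metadata and keep some tensors (e.g. embeddings) at higher precision; decimal megabytes are assumed here:

```python
def bits_per_weight(file_size_mb: float, n_params_millions: float) -> float:
    """Effective bits per parameter for a model file (decimal MB assumed)."""
    return file_size_mb * 1e6 * 8 / (n_params_millions * 1e6)

for name, size_mb in [("F16", 209), ("Q8_0", 113), ("IQ4_NL", 70)]:
    print(f"{name}: {bits_per_weight(size_mb, 100):.2f} bits/weight")
# F16 works out to ~16.7 bits/weight, IQ4_NL to ~5.6
```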
## Usage

### llama.cpp

```bash
# Clone this repository
git clone https://huggingface.co/<your-username>/Falcon-H1-Tiny-100M-Multilingual-Instruct-GGUF
cd Falcon-H1-Tiny-100M-Multilingual-Instruct-GGUF

# Run in conversation (chat) mode
llama-cli -m ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf -cnv

# Conversation mode with a generation limit and explicit thread count
llama-cli -m ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf -cnv -n 512 -t 4
```
### llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf",
    n_ctx=2048,
    n_threads=8,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response["choices"][0]["message"]["content"])
```
### ollama

```bash
# Create a Modelfile
echo 'FROM ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf' > Modelfile

# Build and run
ollama create Falcon-H1-Tiny-100M -f Modelfile
ollama run Falcon-H1-Tiny-100M
```
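If ollama does not pick up the chat template from the GGUF metadata, you can declare it explicitly in the Modelfile. The template and stop token below are an assumption based on the ChatML format this model uses; verify against the model's actual chat template before relying on it:

```
FROM ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf

# ChatML-style template (assumed; check the model's tokenizer config)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
```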
## Chat Template

The model uses the ChatML format with `<|im_start|>` and `<|im_end|>` tokens:

```
<|im_start|>user
Your message here<|im_end|>
<|im_start|>assistant
Model response<|im_end|>
```
The chat template is applied automatically when using:

- `llama-cli` with the `-cnv` flag
- `llama-cpp-python` with the `create_chat_completion()` method
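If you call a raw completion API instead, you must build the ChatML prompt yourself. A minimal sketch (the helper name is illustrative, not part of any library), leaving the assistant turn open so the model generates the reply:

```python
def build_chatml_prompt(messages):
    """Format a list of {role, content} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Open the assistant turn; generation should stop at <|im_end|>.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Hello!"}])
print(prompt)
```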
## Model Performance

### Quantization Comparison
| Quantization | Quality | Speed | Use Case |
|---|---|---|---|
| F16 | ⭐⭐⭐ | Fastest | Maximum quality |
| Q8_0 | ⭐⭐ | Fast | Near-lossless quality, balanced size |
| IQ4_NL | ⭐⭐ | Medium | Best size/quality trade-off |
### Recommendations
- For edge/mobile devices: Use IQ4_NL (70 MB) - best compression with good quality
- For desktop/server: Use Q8_0 (113 MB) - better quality with reasonable size
- For maximum quality: Use F16 (209 MB) - no quantization loss
## Limitations
- The 100M multilingual model has limited capacity for complex multilingual tasks
- Chinese factual accuracy may be lower than English due to training data distribution
- Best performance on English; other languages may have reduced quality
- Use higher temperature (0.7-0.9) for creative tasks
- Use lower temperature (0.3-0.5) for factual tasks
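The temperature guidance above can be passed straight into the llama-cpp-python call. A small illustrative helper (the function name and `top_p` values are my own choices, not from any library or the model card):

```python
def sampling_params(task: str) -> dict:
    """Map a task type to sampling parameters per the guidance above."""
    presets = {
        "creative": {"temperature": 0.8, "top_p": 0.95},  # 0.7-0.9 range
        "factual": {"temperature": 0.4, "top_p": 0.9},    # 0.3-0.5 range
    }
    return presets[task]

# Usage: llm.create_chat_completion(messages=..., **sampling_params("factual"))
print(sampling_params("factual"))
```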
## Hardware Requirements
| Quantization | RAM Required |
|---|---|
| IQ4_NL (70 MB) | ~500 MB |
| Q8_0 (113 MB) | ~600 MB |
| F16 (209 MB) | ~800 MB |
## Citation

If you use this model, please cite the original model:

```bibtex
@misc{falcon_h1_tiny,
  title={Falcon-H1-Tiny: A series of extremely small, yet powerful language models redefining capabilities at small scale},
  author={Falcon-LLM Team},
  year={2026},
}
```
## License

This model is distributed under the Falcon-LLM License; see the original model card for the full terms and conditions.

## Acknowledgments

- Original model by the Technology Innovation Institute (TII)
- llama.cpp for GGUF format support