Falcon-H1-Tiny-100M-Multilingual-Instruct GGUF

GGUF quantizations of the tiiuae/Falcon-H1-Tiny-100M-Multilingual-Instruct model.

This is a 100M parameter multilingual instruction-tuned Falcon H1 Tiny model with hybrid Transformer + Mamba architecture, optimized for edge deployment.

Model Details

  • Architecture: Hybrid Transformer + Mamba
  • Parameters: 100M
  • Languages: Multilingual (English, Chinese, and others)
  • Context Length: 262,144 tokens
  • License: Falcon-LLM License

Available Quantizations

| Quantization | Size | Description |
|---|---|---|
| F16 | 209 MB | Full-precision float16 |
| Q8_0 | 113 MB | 8-bit quantization |
| IQ4_NL | 70 MB | 4.5-bit non-linear quantization |

Usage

llama.cpp

# Clone this repository
git clone https://huggingface.co/Luigi/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF
cd Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF

# Run in conversation mode with llama-cli
llama-cli -m ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf -cnv

# Or additionally cap generation length and thread count
llama-cli -m ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf -cnv -n 512 -t 4

llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf",
    n_ctx=2048,
    n_threads=8,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
print(response["choices"][0]["message"]["content"])

ollama

# Create a Modelfile
echo 'FROM ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf' > Modelfile

# Build and run
ollama create Falcon-H1-Tiny-100M -f Modelfile
ollama run Falcon-H1-Tiny-100M
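
If ollama does not pick up the chat template embedded in the GGUF metadata, the Modelfile can spell it out explicitly. A minimal sketch (the TEMPLATE and stop parameter below are an assumption based on the ChatML format this model uses, written in ollama's simple prompt/response template form):

```
FROM ./Falcon-H1-Tiny-Multilingual-100M-Instruct-IQ4_NL.gguf

# Assumed ChatML template; ollama normally reads this from GGUF metadata
TEMPLATE """<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
```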

Chat Template

The model uses the ChatML format with <|im_start|> and <|im_end|> tokens:

<|im_start|>user
Your message here<|im_end|>
<|im_start|>assistant
Model response<|im_end|>

The chat template is automatically applied when using:

  • llama-cli with -cnv flag
  • llama-cpp-python with create_chat_completion() method
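
When bypassing those helpers (for example, passing a raw prompt to llama-cli -p, or using the plain completion API of llama-cpp-python), the ChatML wrapping has to be applied by hand. A minimal sketch; the build_chatml_prompt helper below is illustrative, not part of any library:

```python
def build_chatml_prompt(messages):
    """Wrap a list of {"role", "content"} dicts in ChatML markers.

    Ends with an open assistant turn so the model generates the reply.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Hello!"}])
print(prompt)
```

When sampling with such a raw prompt, add <|im_end|> as a stop sequence so generation ends cleanly at the end of the assistant turn.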

Model Performance

Quantization Comparison

| Quantization | Quality | Speed | Use Case |
|---|---|---|---|
| F16 | ⭐⭐⭐ | Slowest | Maximum quality |
| Q8_0 | ⭐⭐ | Fast | Good quality/size balance |
| IQ4_NL | ⭐⭐ | Fastest | Best size/quality trade-off |

Recommendations

  • For edge/mobile devices: Use IQ4_NL (70 MB) - best compression with good quality
  • For desktop/server: Use Q8_0 (113 MB) - better quality with reasonable size
  • For maximum quality: Use F16 (209 MB) - no quantization loss
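
The recommendations above can be expressed as a simple selection rule. The pick_quant helper and its RAM thresholds below are illustrative assumptions taken from the hardware table in this card, not a measured policy:

```python
def pick_quant(available_ram_mb, want_max_quality=False):
    """Illustrative quant chooser based on this card's recommendations.

    Thresholds are the rough RAM figures from the hardware table.
    """
    if want_max_quality and available_ram_mb >= 800:
        return "F16"
    if available_ram_mb >= 600:
        return "Q8_0"
    return "IQ4_NL"

print(pick_quant(512))                          # IQ4_NL (edge device)
print(pick_quant(4096))                         # Q8_0
print(pick_quant(4096, want_max_quality=True))  # F16
```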

Limitations

  • The 100M multilingual model has limited capacity for complex multilingual tasks
  • Chinese factual accuracy may be lower than English due to training data distribution
  • Best performance on English; other languages may have reduced quality
  • Use higher temperature (0.7-0.9) for creative tasks
  • Use lower temperature (0.3-0.5) for factual tasks

Hardware Requirements

| Quantization | RAM Required |
|---|---|
| IQ4_NL (70 MB) | ~500 MB |
| Q8_0 (113 MB) | ~600 MB |
| F16 (209 MB) | ~800 MB |

Citation

If you use this model, please cite the original model:

@misc{falcon_h1_tiny,
  title={Falcon-H1-Tiny: A series of extremely small, yet powerful language models redefining capabilities at small scale},
  author={Falcon-LLM Team},
  year={2026}, 
}

License

Falcon-LLM License - See terms and conditions

