Project Overview
This repository contains all the scripts, data samples, and artifacts for training, evaluating, and testing a multilingual SentencePiece BPE tokenizer (Kazakh, Russian, English) and running inference with the Gemma 1B model.
The workflow encompasses:
Sampling & Corpus Preparation: Extracting text samples from large datasets (JSON, Parquet) and assembling a training corpus.
Tokenizer Training: Using SentencePiece to train a BPE tokenizer on the sampled corpus.
Evaluation: Measuring metrics (compression ratio, fertility, continued-word ratio) on language-specific test sets.
Inference: Generating text with the Gemma model using either the default or custom tokenizer.
Repository Structure
├── .gitattributes
├── english_eval_texts.json        # Collected English test texts (~25 MB)
├── kazakh_eval_texts.json         # Collected Kazakh test texts (~25 MB)
├── russian_eval_texts.json        # Collected Russian test texts (~25 MB)
├── sentencepiece-bpe-tokenizer.py # Script: train tokenizer from multiple sources
├── test-tokenizer.py              # Script: evaluate custom tokenizer metrics
├── test-tokenizer-gemma-3-1b.py   # Script: evaluate Gemma tokenizer metrics
├── inference_gemma.py             # Script: run text generation with Gemma 1B
├── tokenizer_evaluation.json      # Saved evaluation metrics (per-language & overall)
│
└── spm_bpe_tokenizer_50000_new/   # Artifacts and sampled data
    ├── samples/                     # Per-source sampled texts used for training
    ├── training_corpus.txt          # Combined training corpus (one sentence per line)
    ├── tokenizer.model              # Trained SentencePiece model file
    ├── tokenizer.vocab              # Corresponding vocabulary file
    ├── tokenizer_config.json        # Hugging Face tokenizer config
    ├── tokenizer_multilingual.model # (Optional) alternate multilingual model
    └── tokenizer_multilingual.vocab # (Optional) alternate vocab file
Usage
- Sample and Train Tokenizer
python sentencepiece-bpe-tokenizer.py
This will:
Read and randomly sample from the specified JSON and Parquet sources.
Write per-file samples into spm_bpe_tokenizer_50000_new/samples/.
Assemble training_corpus.txt in the same directory.
Train a BPE tokenizer with vocab size 50,000.
Output tokenizer.model, tokenizer.vocab, and tokenizer_config.json.
- Evaluate Custom Tokenizer
python test-tokenizer.py
Generates metrics on compression ratio, fertility, and segmentation balance for each language and saves results to tokenizer_evaluation.json.
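As an illustration, the three metrics can be computed roughly as below. These definitions follow common convention and are an assumption here, not necessarily the exact formulas in test-tokenizer.py.

```python
def tokenizer_metrics(texts, encode):
    """Compute rough tokenizer quality metrics.

    texts:  list of evaluation strings
    encode: callable mapping a string to a list of token strings,
            where SentencePiece marks word starts with "▁"
    """
    total_chars = sum(len(t) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = 0
    continued = 0  # tokens that continue a word rather than start one
    for text in texts:
        tokens = encode(text)
        total_tokens += len(tokens)
        continued += sum(1 for tok in tokens if not tok.startswith("▁"))
    return {
        "compression_ratio": total_chars / total_tokens,   # chars per token (higher = better)
        "fertility": total_tokens / total_words,           # tokens per word (lower = better)
        "continued_word_ratio": continued / total_tokens,  # share of word-internal tokens
    }

# Example with a hand-made segmentation of one English sentence:
metrics = tokenizer_metrics(["hello world"], lambda t: ["▁hello", "▁wor", "ld"])
print(metrics)  # fertility 1.5: 3 tokens for 2 words
```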
- Evaluate Gemma's Tokenizer
python test-tokenizer-gemma-3-1b.py
Computes the same metrics using the Gemma 3 1B tokenizer loaded via Hugging Face.
- Run Inference with Gemma 1B
python inference_gemma.py
Prompts the Gemma model in English, Russian, and Kazakh (customizable in the script) and prints generated outputs.
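The per-language prompting pattern can be sketched as below. The prompts and the generate_fn hook are illustrative assumptions (the actual prompts live in inference_gemma.py, where generation is done by the Gemma model through transformers).

```python
# Illustrative prompt set; the real prompts are defined in inference_gemma.py.
PROMPTS = {
    "English": "What is the capital of Kazakhstan?",
    "Russian": "Какая столица у Казахстана?",
    "Kazakh": "Қазақстанның астанасы қандай?",
}

def run_inference(generate_fn):
    """Run each prompt through a generation callable and print the results.

    generate_fn: callable(prompt) -> generated text, e.g. a thin wrapper
    around tokenizer encode / model.generate / tokenizer decode.
    """
    outputs = {}
    for lang, prompt in PROMPTS.items():
        outputs[lang] = generate_fn(prompt)
        print(f"[{lang}] {prompt} -> {outputs[lang]}")
    return outputs

# Usage with a stub in place of the model, to show the call shape:
results = run_inference(lambda prompt: "Astana")
```

Separating the loop from generate_fn makes it easy to compare runs with the default Gemma tokenizer against runs with the custom one: only the wrapper changes.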
Reproducing & Modifying
Re-run sampling: tweak target_samples or word count targets in sentencepiece-bpe-tokenizer.py and re-run to regenerate samples/ and training_corpus.txt.
Re-train: adjust vocab_size, model_type, or other SentencePiece parameters in the same script.
Re-evaluate: modify test-tokenizer.py parameters (e.g. test corpus size) and re-run.