Project Overview
This repository contains all the scripts, data samples, and artifacts for training, evaluating, and testing a multilingual SentencePiece BPE tokenizer (Kazakh, Russian, English) and running inference with the Gemma 1B model.
The workflow encompasses:
Sampling & Corpus Preparation: Extracting text samples from large datasets (JSON, Parquet) and assembling a training corpus.
Tokenizer Training: Using SentencePiece to train a BPE tokenizer on the sampled corpus.
Evaluation: Measuring metrics (compression ratio, fertility, continued-word ratio) on language-specific test sets.
Inference: Generating text with the Gemma model using either the default or custom tokenizer.
Repository Structure
├── .gitattributes
├── english_eval_texts.json        # Collected English test texts (~25 MB)
├── kazakh_eval_texts.json         # Collected Kazakh test texts (~25 MB)
├── russian_eval_texts.json        # Collected Russian test texts (~25 MB)
├── sentencepiece-bpe-tokenizer.py # Script: train tokenizer from multiple sources
├── test-tokenizer.py              # Script: evaluate custom tokenizer metrics
├── test-tokenizer-gemma-3-1b.py   # Script: evaluate Gemma tokenizer metrics
├── inference_gemma.py             # Script: run text generation with Gemma 1B
├── tokenizer_evaluation.json      # Saved evaluation metrics (per-language & overall)
│
└── spm_bpe_tokenizer_50000_new/   # Artifacts and sampled data
    ├── samples/                     # Per-source sampled texts used for training
    ├── training_corpus.txt          # Combined training corpus (one sentence per line)
    ├── tokenizer.model              # Trained SentencePiece model file
    ├── tokenizer.vocab              # Corresponding vocabulary file
    ├── tokenizer_config.json        # Hugging Face tokenizer config
    ├── tokenizer_multilingual.model # (Optional) alternate multilingual model
    └── tokenizer_multilingual.vocab # (Optional) alternate vocab file
Usage
- Sample and Train Tokenizer
python sentencepiece-bpe-tokenizer.py
This will:
Read and randomly sample from the specified JSON and Parquet sources.
Write per-file samples into spm_bpe_tokenizer_50000_new/samples/.
Assemble training_corpus.txt in the same directory.
Train a BPE tokenizer with vocab size 50,000.
Output tokenizer.model, tokenizer.vocab, and tokenizer_config.json.
- Evaluate Custom Tokenizer
python test-tokenizer.py
Generates metrics on compression ratio, fertility, and segmentation balance for each language and saves results to tokenizer_evaluation.json.
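As an illustration, the three metrics can be computed roughly as below. These definitions follow common convention and are an assumption here, not necessarily the exact formulas in test-tokenizer.py.

```python
def tokenizer_metrics(texts, encode):
    """Compute rough tokenizer quality metrics.

    texts:  list of evaluation strings
    encode: callable mapping a string to a list of token strings,
            where SentencePiece marks word starts with "▁"
    """
    total_chars = sum(len(t) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = 0
    continued = 0  # tokens that continue a word rather than start one
    for text in texts:
        tokens = encode(text)
        total_tokens += len(tokens)
        continued += sum(1 for tok in tokens if not tok.startswith("▁"))
    return {
        "compression_ratio": total_chars / total_tokens,   # chars per token (higher = better)
        "fertility": total_tokens / total_words,           # tokens per word (lower = better)
        "continued_word_ratio": continued / total_tokens,  # share of word-internal tokens
    }

# Example with a hand-made segmentation of one English sentence:
metrics = tokenizer_metrics(["hello world"], lambda t: ["▁hello", "▁wor", "ld"])
print(metrics)  # fertility 1.5: 3 tokens for 2 words
```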
- Evaluate Gemma's Tokenizer
python test-tokenizer-gemma-3-1b.py
Computes the same metrics using the Gemma 3 1B tokenizer loaded via Hugging Face.
- Run Inference with Gemma 1B
python inference_gemma.py
Prompts the Gemma model in English, Russian, and Kazakh (customizable in the script) and prints generated outputs.
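The per-language prompting pattern can be sketched as below. The prompts and the generate_fn hook are illustrative assumptions (the actual prompts live in inference_gemma.py, where generation is done by the Gemma model through transformers).

```python
# Illustrative prompt set; the real prompts are defined in inference_gemma.py.
PROMPTS = {
    "English": "What is the capital of Kazakhstan?",
    "Russian": "Какая столица у Казахстана?",
    "Kazakh": "Қазақстанның астанасы қандай?",
}

def run_inference(generate_fn):
    """Run each prompt through a generation callable and print the results.

    generate_fn: callable(prompt) -> generated text, e.g. a thin wrapper
    around tokenizer encode / model.generate / tokenizer decode.
    """
    outputs = {}
    for lang, prompt in PROMPTS.items():
        outputs[lang] = generate_fn(prompt)
        print(f"[{lang}] {prompt} -> {outputs[lang]}")
    return outputs

# Usage with a stub in place of the model, to show the call shape:
results = run_inference(lambda prompt: "Astana")
```

Separating the loop from generate_fn makes it easy to compare runs with the default Gemma tokenizer against runs with the custom one: only the wrapper changes.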
Reproducing & Modifying
Re-run sampling: tweak target_samples or word count targets in sentencepiece-bpe-tokenizer.py and re-run to regenerate samples/ and training_corpus.txt.
Re-train: adjust vocab_size, model_type, or other SentencePiece parameters in the same script.
Re-evaluate: modify test-tokenizer.py parameters (e.g. test corpus size) and re-run.