# Finetune Voxtral for ASR 🤗

Fine-tune the Voxtral speech model for automatic speech recognition (ASR) using Hugging Face `transformers` and `datasets`. The recommended way to train is Hugging Face Jobs: push your dataset to the Hub, then launch training on HF infrastructure (default `a100-large` GPU) with a single script, no local GPU required.
## Train with Hugging Face Jobs (recommended)

The `scripts/launch_hf_job.py` script submits your training run to Hugging Face Jobs. Training runs on HF's cloud (default GPU: `a100-large`), so you don't need a local GPU.
### Requirements

- Hugging Face account with Jobs access (Pro, Team, or Enterprise; pre-paid credits).
- Dataset on the Hub: your data must be a Hugging Face dataset (slug format `username/dataset-name`). The job loads it from the Hub; local JSONL is not used.
- Token: set `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` so the job can read private datasets and (optionally) push the model.
### 1. Push your dataset to the Hub

If you have a local JSONL with `{audio_path, text}` rows and the audio files, push it first:

```bash
python scripts/push_to_huggingface.py dataset datasets/voxtral_user/data.jsonl username/voxtral-asr-data
```

Or use the Gradio interface: Advanced options → Push dataset to HF Hub, then use the repo name (e.g. `username/voxtral-dataset-20250225`) as the dataset slug below.
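If you are assembling the JSONL yourself, it can be generated with a few lines of Python. `write_manifest` and the file paths below are illustrative, not part of the repo's scripts; only the `audio_path`/`text` field names come from the expected format:

```python
import json
from pathlib import Path

def write_manifest(pairs, out_path):
    """Write (audio_path, text) pairs as one JSON object per line."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for audio_path, text in pairs:
            # One row per clip, matching the {audio_path, text} schema
            f.write(json.dumps({"audio_path": str(audio_path), "text": text},
                               ensure_ascii=False) + "\n")
    return out

# Example: two clips with reference transcriptions
manifest = write_manifest(
    [("clips/0001.wav", "hello world"), ("clips/0002.wav", "voxtral test")],
    "datasets/voxtral_user/data.jsonl",
)
```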
### 2. Launch the training job

```bash
# LoRA (default): fast, parameter-efficient
python scripts/launch_hf_job.py --dataset username/voxtral-asr-data

# Full fine-tuning
python scripts/launch_hf_job.py --dataset username/voxtral-asr-data --no-lora

# With options: config, timeout, hardware, hyperparameters
python scripts/launch_hf_job.py \
    --dataset username/voxtral-asr-data \
    --dataset-config voxpopuli \
    --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
    --train-count 500 \
    --eval-count 100 \
    --epochs 3 \
    --timeout 8h \
    --flavor a100-large
```
The script prints the job URL and job ID. Monitor progress at huggingface.co/settings/jobs.
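`--timeout` values follow the `<number><unit>` shape shown above (`6h`, `30m`, `1d`). As a sketch of how such strings map to seconds (this parser is an illustration of the accepted format, not the script's actual code):

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_timeout(value: str) -> int:
    """Convert '6h' / '30m' / '1d' style strings to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip().lower())
    if not match:
        raise ValueError(f"bad timeout: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]

print(parse_timeout("6h"))  # 21600
```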
### 3. `launch_hf_job.py` reference

| Argument | Default | Description |
|---|---|---|
| `--dataset` | (required) | Hugging Face dataset slug, e.g. `username/dataset-name`. |
| `--dataset-config` | — | Dataset config/subset for multi-config datasets. |
| `--model-checkpoint` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model. |
| `--model-repo` | — | Target model repo (`username/repo`) for later push. |
| `--use-lora` | on | Use LoRA (parameter-efficient). |
| `--no-lora` | — | Use full fine-tuning instead of LoRA. |
| `--train-count` | 100 | Number of training samples. |
| `--eval-count` | 50 | Number of evaluation samples. |
| `--batch-size` | 2 | Per-device train batch size. |
| `--grad-accum` | 4 | Gradient accumulation steps. |
| `--learning-rate` | 5e-5 | Learning rate. |
| `--epochs` | 3 | Number of epochs. |
| `--lora-r`, `--lora-alpha`, `--lora-dropout` | 8, 32, 0.0 | LoRA hyperparameters (when using LoRA). |
| `--freeze-audio-tower` | — | Freeze the audio encoder (LoRA only). |
| `--flavor` | `a100-large` | Hardware flavor. Override with env `HF_JOBS_FLAVOR`. Options: `a100-large`, `a10g-large`, `a10g-small`, etc. |
| `--timeout` | `6h` | Job timeout (e.g. `6h`, `30m`, `1d`). |
| `--namespace` | — | HF org/namespace to run the job under. |
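The LoRA hyperparameters trade capacity for cost: each adapted linear layer of shape `d_out × d_in` gains `r · (d_in + d_out)` trainable parameters (the two low-rank factors A and B), with updates scaled by `alpha / r`. A back-of-the-envelope estimate, using an illustrative layer size rather than Voxtral's actual dimensions:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Default r=8 on a hypothetical 4096x4096 projection:
added = lora_param_count(4096, 4096, r=8)
full = 4096 * 4096
print(added, f"{added / full:.2%}")  # 65536, 0.39% of the full weight matrix
```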
### Environment

- `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`: required to launch jobs and for the job to access Hub datasets and push the model.
- `HF_JOBS_FLAVOR`: optional; overrides the default `--flavor` (e.g. `a100-large`, or `h100-large` if available).
After the job completes, the model is saved in the job's output directory. If you download the artifacts, push them to the Hub from your local machine with `scripts/push_to_huggingface.py model ...`, or add push logic to the training script.
## Dataset format (for Jobs and local training)

Datasets are normalized to `audio` (16 kHz) and `text` via `scripts/dataset_utils.py`. For Hugging Face Jobs, the dataset must live on the Hub and be referenced by slug (`username/dataset-name`).

Hub dataset (for Jobs or `--dataset`):

- Use the dataset slug with `--dataset username/dataset-name`.
- Pass `--dataset-config` optionally for multi-config datasets.
- Supported column names are normalized automatically:
  - Audio: `audio`, `audio_path`, `path`, `file`
  - Text: `text`, `transcript`, `transcription`, `sentence`, `target`, `targets`
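The alias handling in `scripts/dataset_utils.py` presumably amounts to picking the first matching name; a minimal sketch of that idea (the function name and error handling are hypothetical, only the alias lists come from the repo):

```python
AUDIO_ALIASES = ("audio", "audio_path", "path", "file")
TEXT_ALIASES = ("text", "transcript", "transcription", "sentence", "target", "targets")

def normalize_columns(columns):
    """Map whatever column names a dataset uses onto the canonical pair."""
    def pick(aliases):
        for name in aliases:
            if name in columns:
                return name
        raise KeyError(f"no column among {aliases} found in {columns}")
    return {"audio": pick(AUDIO_ALIASES), "text": pick(TEXT_ALIASES)}

print(normalize_columns(["file", "sentence", "speaker_id"]))
# {'audio': 'file', 'text': 'sentence'}
```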
Local JSONL (for local training only; not used by Jobs):

```json
{"audio_path": "/path/to/audio.wav", "text": "reference transcription"}
```

Push this to the Hub (see above) so you can use it with `launch_hf_job.py --dataset username/your-dataset`.
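A quick local sanity check of the JSONL before pushing can catch missing fields early; `check_manifest` below is a standalone illustration, not one of the repo's scripts:

```python
import json

def check_manifest(lines):
    """Yield (line_no, problem) for rows that are malformed or incomplete."""
    for i, line in enumerate(lines, 1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            yield i, "not valid JSON"
            continue
        for key in ("audio_path", "text"):
            if not row.get(key):
                yield i, f"missing {key!r}"

rows = [
    '{"audio_path": "a.wav", "text": "hi"}',
    '{"audio_path": "b.wav"}',
]
problems = list(check_manifest(rows))
print(problems)  # [(2, "missing 'text'")]
```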
## Installation

Clone and install dependencies (for running the launcher, Gradio, or local training):

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

With UV (recommended):

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate  # Linux/macOS
# Windows: uv venv .venv --python 3.10 && .venv\Scripts\activate
uv pip install -r requirements.txt
```

With pip (run a Python 3.10 interpreter directly; unlike `uv venv`, the stdlib `venv` module has no `--python` flag):

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip && pip install -r requirements.txt
```

For `launch_hf_job.py` you need a recent `huggingface_hub` with Jobs support:

```bash
pip install -U huggingface_hub
```
## Launch Jobs from the Gradio interface

You can start the same training job from the UI:

1. Run `python interface.py`.
2. In Advanced options, open "Launch on Hugging Face Jobs".
3. Enter your dataset slug (e.g. `username/voxtral-asr-data`) after pushing the dataset.
4. Optionally set Dataset config, Job timeout, and Hardware flavor (default `a100-large`).
5. Click "Launch training on HF Jobs".

The UI runs `scripts/launch_hf_job.py` with the current form values and shows the job URL and status.
## Local training (optional)

If you have a GPU locally, you can run training yourself instead of using Jobs.

Full fine-tuning:

```bash
python scripts/train.py \
    --dataset username/voxtral-asr-data \
    --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
    --train-count 100 --eval-count 50 \
    --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
    --output-dir ./voxtral-finetuned
```

LoRA fine-tuning:

```bash
python scripts/train_lora.py \
    --dataset username/voxtral-asr-data \
    --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
    --train-count 100 --eval-count 50 \
    --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
    --lora-r 8 --lora-alpha 32 --freeze-audio-tower \
    --output-dir ./voxtral-finetuned-lora
```

Use `--dataset-jsonl path/to/data.jsonl` instead of `--dataset` to train from a local JSONL. For Jobs, only `--dataset` (Hub slug) is used.
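With the defaults above (batch size 2, gradient accumulation 4), each optimizer step consumes 8 samples per device; a one-liner makes the arithmetic explicit:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices

print(effective_batch_size(2, 4))  # 8, the repo's default configuration
print(effective_batch_size(2, 8))  # 16, same per-step memory, larger effective batch
```

Raising `--grad-accum` is the usual way to grow the effective batch without increasing GPU memory use.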
## Push models and datasets (`scripts/push_to_huggingface.py`)

Push a trained model:

```bash
python scripts/push_to_huggingface.py model ./voxtral-finetuned username/voxtral-asr \
    --model-name mistralai/Voxtral-Mini-3B-2507
```

Push a dataset (before using it with Jobs):

```bash
python scripts/push_to_huggingface.py dataset datasets/voxtral_user/data.jsonl username/voxtral-asr-data
```
## Deploy a demo Space (`scripts/deploy_demo_space.py`)

After pushing a model:

```bash
python scripts/deploy_demo_space.py \
    --hf-token $HF_TOKEN \
    --hf-username YOUR_USERNAME \
    --model-id YOUR_USERNAME/voxtral-asr \
    --demo-type voxtral \
    --space-name voxtral-asr-demo
```
## Troubleshooting

| Issue | What to do |
|---|---|
| "HF_TOKEN must be set" | Set `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your environment. |
| Jobs not available | Jobs require Pro / Team / Enterprise and pre-paid credits. |
| Dataset not found by the job | Ensure the dataset is on the Hub and use the exact slug `username/dataset-name`. Push with `push_to_huggingface.py dataset ...` first. |
| Job timeout | Increase `--timeout` (e.g. `--timeout 8h` or `12h`). |
| Different GPU | Set `HF_JOBS_FLAVOR` (e.g. `a10g-large`) or pass `--flavor a10g-large`. |
| Windows | Use `set HF_TOKEN=your_token` in CMD or `$env:HF_TOKEN="your_token"` in PowerShell. |
## License

MIT