# Finetune Voxtral for ASR 🤗

Fine-tune the Voxtral speech model for automatic speech recognition (ASR) using Hugging Face `transformers` and `datasets`. The recommended way to train is Hugging Face Jobs: push your dataset to the Hub, then launch training on HF infrastructure (default `a100-large` GPU) with a single script, no local GPU required.
## Train with Hugging Face Jobs (recommended)

The `scripts/launch_hf_job.py` script submits your training run to Hugging Face Jobs. Training runs on HF's cloud (default GPU: `a100-large`), so you don't need a local GPU.
### Requirements

- Hugging Face account with Jobs access (Pro, Team, or Enterprise; pre-paid credits).
- Dataset on the Hub: your data must be a Hugging Face dataset (slug format `username/dataset-name`). The job loads it from the Hub; local JSONL is not used.
- Token: set `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` so the job can read private datasets and (optionally) push the model.
### 1. Push your dataset to the Hub

If you have a local JSONL with `{audio_path, text}` rows and the audio files, push it first:

```bash
python scripts/push_to_huggingface.py dataset datasets/voxtral_user/data.jsonl username/voxtral-asr-data
```

Or use the Gradio interface: Advanced options → Push dataset to HF Hub, then use the repo name (e.g. `username/voxtral-dataset-20250225`) as the dataset slug below.
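If you are assembling the JSONL yourself, it can be generated with a few lines of Python. `write_manifest` and the file paths below are illustrative, not part of the repo's scripts; only the `audio_path`/`text` field names come from the expected format:

```python
import json
from pathlib import Path

def write_manifest(pairs, out_path):
    """Write (audio_path, text) pairs as one JSON object per line."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for audio_path, text in pairs:
            # One row per clip, matching the {audio_path, text} schema
            f.write(json.dumps({"audio_path": str(audio_path), "text": text},
                               ensure_ascii=False) + "\n")
    return out

# Example: two clips with reference transcriptions
manifest = write_manifest(
    [("clips/0001.wav", "hello world"), ("clips/0002.wav", "voxtral test")],
    "datasets/voxtral_user/data.jsonl",
)
```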
### 2. Launch the training job

```bash
# LoRA (default): fast, parameter-efficient
python scripts/launch_hf_job.py --dataset username/voxtral-asr-data

# Full fine-tuning
python scripts/launch_hf_job.py --dataset username/voxtral-asr-data --no-lora

# With options: config, timeout, hardware, hyperparameters
python scripts/launch_hf_job.py \
    --dataset username/voxtral-asr-data \
    --dataset-config voxpopuli \
    --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
    --train-count 500 \
    --eval-count 100 \
    --epochs 3 \
    --timeout 8h \
    --flavor a100-large
```
The script prints the job URL and job ID. Monitor progress at huggingface.co/settings/jobs.
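`--timeout` values follow the `<number><unit>` shape shown above (`6h`, `30m`, `1d`). As a sketch of how such strings map to seconds (this parser is an illustration of the accepted format, not the script's actual code):

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_timeout(value: str) -> int:
    """Convert '6h' / '30m' / '1d' style strings to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip().lower())
    if not match:
        raise ValueError(f"bad timeout: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]

print(parse_timeout("6h"))  # 21600
```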
### 3. `launch_hf_job.py` reference

| Argument | Default | Description |
|---|---|---|
| `--dataset` | (required) | Hugging Face dataset slug, e.g. `username/dataset-name`. |
| `--dataset-config` | — | Dataset config/subset for multi-config datasets. |
| `--model-checkpoint` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model. |
| `--model-repo` | — | Target model repo (`username/repo`) for later push. |
| `--use-lora` | on | Use LoRA (parameter-efficient). |
| `--no-lora` | — | Use full fine-tuning instead of LoRA. |
| `--train-count` | 100 | Number of training samples. |
| `--eval-count` | 50 | Number of evaluation samples. |
| `--batch-size` | 2 | Per-device train batch size. |
| `--grad-accum` | 4 | Gradient accumulation steps. |
| `--learning-rate` | 5e-5 | Learning rate. |
| `--epochs` | 3 | Number of epochs. |
| `--lora-r`, `--lora-alpha`, `--lora-dropout` | 8, 32, 0.0 | LoRA hyperparameters (when using LoRA). |
| `--freeze-audio-tower` | — | Freeze the audio encoder (LoRA only). |
| `--flavor` | `a100-large` | Hardware flavor. Override with env `HF_JOBS_FLAVOR`. Options: `a100-large`, `a10g-large`, `a10g-small`, etc. |
| `--timeout` | `6h` | Job timeout (e.g. `6h`, `30m`, `1d`). |
| `--namespace` | — | HF org/namespace to run the job under. |
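The LoRA hyperparameters trade capacity for cost: each adapted linear layer of shape `d_out × d_in` gains `r · (d_in + d_out)` trainable parameters (the two low-rank factors A and B), with updates scaled by `alpha / r`. A back-of-the-envelope estimate, using an illustrative layer size rather than Voxtral's actual dimensions:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Default r=8 on a hypothetical 4096x4096 projection:
added = lora_param_count(4096, 4096, r=8)
full = 4096 * 4096
print(added, f"{added / full:.2%}")  # 65536, 0.39% of the full weight matrix
```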
### Environment

- `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`: required to launch jobs and for the job to access Hub datasets and push the model.
- `HF_JOBS_FLAVOR`: optional; overrides the default `--flavor` (e.g. `a100-large`, or `h100-large` if available).
After the job completes, the model is saved in the job's output directory. If you download the artifacts, push them to the Hub from your local machine with `scripts/push_to_huggingface.py model ...`, or add push logic to the training script.
## Dataset format (for Jobs and local training)

Datasets are normalized to `audio` (16 kHz) and `text` via `scripts/dataset_utils.py`. For Hugging Face Jobs, the dataset must live on the Hub and be referenced by slug (`username/dataset-name`).

Hub dataset (for Jobs or `--dataset`):

- Use the dataset slug with `--dataset username/dataset-name`.
- Pass `--dataset-config` optionally for multi-config datasets.
- Supported column names are normalized automatically:
  - Audio: `audio`, `audio_path`, `path`, `file`
  - Text: `text`, `transcript`, `transcription`, `sentence`, `target`, `targets`
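The alias handling in `scripts/dataset_utils.py` presumably amounts to picking the first matching name; a minimal sketch of that idea (the function name and error handling are hypothetical, only the alias lists come from the repo):

```python
AUDIO_ALIASES = ("audio", "audio_path", "path", "file")
TEXT_ALIASES = ("text", "transcript", "transcription", "sentence", "target", "targets")

def normalize_columns(columns):
    """Map whatever column names a dataset uses onto the canonical pair."""
    def pick(aliases):
        for name in aliases:
            if name in columns:
                return name
        raise KeyError(f"no column among {aliases} found in {columns}")
    return {"audio": pick(AUDIO_ALIASES), "text": pick(TEXT_ALIASES)}

print(normalize_columns(["file", "sentence", "speaker_id"]))
# {'audio': 'file', 'text': 'sentence'}
```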
Local JSONL (for local training only; not used by Jobs):

```json
{"audio_path": "/path/to/audio.wav", "text": "reference transcription"}
```

Push this to the Hub (see above) so you can use it with `launch_hf_job.py --dataset username/your-dataset`.
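A quick local sanity check of the JSONL before pushing can catch missing fields early; `check_manifest` below is a standalone illustration, not one of the repo's scripts:

```python
import json

def check_manifest(lines):
    """Yield (line_no, problem) for rows that are malformed or incomplete."""
    for i, line in enumerate(lines, 1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            yield i, "not valid JSON"
            continue
        for key in ("audio_path", "text"):
            if not row.get(key):
                yield i, f"missing {key!r}"

rows = [
    '{"audio_path": "a.wav", "text": "hi"}',
    '{"audio_path": "b.wav"}',
]
problems = list(check_manifest(rows))
print(problems)  # [(2, "missing 'text'")]
```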
## Installation

Clone and install dependencies (for running the launcher, Gradio, or local training):

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

With UV (recommended):

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate  # Linux/macOS
# Windows: uv venv .venv --python 3.10 && .venv\Scripts\activate
uv pip install -r requirements.txt
```

With pip (run a Python 3.10 interpreter directly; unlike `uv venv`, the stdlib `venv` module has no `--python` flag):

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip && pip install -r requirements.txt
```

For `launch_hf_job.py` you need a recent `huggingface_hub` with Jobs support:

```bash
pip install -U huggingface_hub
```
## Launch Jobs from the Gradio interface

You can start the same training job from the UI:

1. Run `python interface.py`.
2. In Advanced options, open "Launch on Hugging Face Jobs".
3. Enter your dataset slug (e.g. `username/voxtral-asr-data`) after pushing the dataset.
4. Optionally set Dataset config, Job timeout, and Hardware flavor (default `a100-large`).
5. Click "Launch training on HF Jobs".

The UI runs `scripts/launch_hf_job.py` with the current form values and shows the job URL and status.
## Local training (optional)

If you have a GPU locally, you can run training yourself instead of using Jobs.

Full fine-tuning:

```bash
python scripts/train.py \
    --dataset username/voxtral-asr-data \
    --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
    --train-count 100 --eval-count 50 \
    --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
    --output-dir ./voxtral-finetuned
```

LoRA fine-tuning:

```bash
python scripts/train_lora.py \
    --dataset username/voxtral-asr-data \
    --model-checkpoint mistralai/Voxtral-Mini-3B-2507 \
    --train-count 100 --eval-count 50 \
    --batch-size 2 --grad-accum 4 --learning-rate 5e-5 --epochs 3 \
    --lora-r 8 --lora-alpha 32 --freeze-audio-tower \
    --output-dir ./voxtral-finetuned-lora
```

Use `--dataset-jsonl path/to/data.jsonl` instead of `--dataset` to train from a local JSONL. For Jobs, only `--dataset` (Hub slug) is used.
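With the defaults above (batch size 2, gradient accumulation 4), each optimizer step consumes 8 samples per device; a one-liner makes the arithmetic explicit:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices

print(effective_batch_size(2, 4))  # 8, the repo's default configuration
print(effective_batch_size(2, 8))  # 16, same per-step memory, larger effective batch
```

Raising `--grad-accum` is the usual way to grow the effective batch without increasing GPU memory use.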
## Push models and datasets (`scripts/push_to_huggingface.py`)

Push a trained model:

```bash
python scripts/push_to_huggingface.py model ./voxtral-finetuned username/voxtral-asr \
    --model-name mistralai/Voxtral-Mini-3B-2507
```

Push a dataset (before using it with Jobs):

```bash
python scripts/push_to_huggingface.py dataset datasets/voxtral_user/data.jsonl username/voxtral-asr-data
```
## Deploy a demo Space (`scripts/deploy_demo_space.py`)

After pushing a model:

```bash
python scripts/deploy_demo_space.py \
    --hf-token $HF_TOKEN \
    --hf-username YOUR_USERNAME \
    --model-id YOUR_USERNAME/voxtral-asr \
    --demo-type voxtral \
    --space-name voxtral-asr-demo
```
## Troubleshooting

| Issue | What to do |
|---|---|
| "HF_TOKEN must be set" | Set `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your environment. |
| Jobs not available | Jobs require Pro / Team / Enterprise and pre-paid credits. |
| Dataset not found by the job | Ensure the dataset is on the Hub and use the exact slug `username/dataset-name`. Push with `push_to_huggingface.py dataset ...` first. |
| Job timeout | Increase `--timeout` (e.g. `--timeout 8h` or `12h`). |
| Different GPU | Set `HF_JOBS_FLAVOR` (e.g. `a10g-large`) or pass `--flavor a10g-large`. |
| Windows | Use `set HF_TOKEN=your_token` in CMD or `$env:HF_TOKEN="your_token"` in PowerShell. |
## License

MIT