You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v5.8.1).

This model was released on {release_date} and added to Hugging Face Transformers on 2025-09-25.

Parakeet

Overview

Parakeet models, introduced by NVIDIA NeMo, combine a Fast Conformer encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT), or token-and-duration transducer (TDT) decoder for automatic speech recognition.

Model Architecture

- Fast Conformer Encoder: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. It is a more efficient version of the Conformer encoder found in FastSpeech2Conformer (see ParakeetEncoder for the encoder implementation and details).
- ParakeetForCTC: a Fast Conformer Encoder + a CTC decoder
  - CTC Decoder: Simple but effective decoder consisting of:
    - 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
    - CTC loss computation for training.
    - Greedy CTC decoding for inference.
- ParakeetForTDT: a Fast Conformer Encoder + a TDT (Token-and-Duration Transducer) decoder
  - TDT Decoder: Jointly predicts tokens and their durations, enabling efficient decoding:
    - LSTM prediction network maintains language context across token predictions.
    - Joint network combines encoder and decoder outputs.
    - Duration head predicts how many frames to skip, enabling fast inference.
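
The two decoding strategies listed above can be sketched in plain Python. This is a toy illustration, not the Transformers or NeMo implementation: `ctc_greedy_decode` applies the standard CTC rule (collapse repeated ids, then drop blanks), and `tdt_greedy_decode` shows how a predicted duration lets the loop skip frames. The `step_fn` callback, the `max_symbols_per_frame` guard, and all token ids are hypothetical stand-ins for the prediction and joint networks.

```python
def ctc_greedy_decode(frame_ids, blank_id):
    """Collapse consecutive repeats, then drop blank ids."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded


def tdt_greedy_decode(step_fn, num_frames, blank_id, max_symbols_per_frame=10):
    """Toy TDT loop.

    `step_fn(frame, history)` is a hypothetical stand-in for the prediction
    and joint networks; it returns (token_id, duration_in_frames).
    """
    tokens, start_frames = [], []
    frame = 0
    emitted_on_frame = 0
    while frame < num_frames:
        token_id, duration = step_fn(frame, tokens)
        if token_id != blank_id:
            tokens.append(token_id)
            start_frames.append(frame)
            emitted_on_frame += 1
        # Guard against an endless loop: a blank, or too many tokens on one
        # frame, must always advance by at least one frame.
        if duration == 0 and (token_id == blank_id or emitted_on_frame >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            emitted_on_frame = 0
        frame += duration  # skip ahead instead of visiting every frame
    return tokens, start_frames


# Per-frame argmax ids from a CTC head (0 is the blank in this toy example).
ctc_result = ctc_greedy_decode([0, 3, 3, 0, 3, 4, 4, 0], blank_id=0)

# A scripted "model": frame index -> (token_id, duration).
script = {0: (5, 2), 2: (0, 1), 3: (7, 4)}
tdt_result = tdt_greedy_decode(lambda frame, history: script.get(frame, (0, 1)), num_frames=8, blank_id=0)

print(ctc_result)  # [3, 3, 4]
print(tdt_result)  # ([5, 7], [0, 3])
```

In the TDT example the loop evaluates only four of the eight frames, which is where the speedup from the duration head comes from. The recorded start frames are also the raw material for token timestamps.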

The original implementation can be found in NVIDIA NeMo. Model checkpoints are available under the NVIDIA organization on the Hugging Face Hub.

This model was contributed by Nithin Rao Koluguri, Eustache Le Bihan, Eric Bezzam, Maksym Lypivskyi, and Hainan Xu.

Usage

ParakeetForCTC usage

Pipeline

AutoModel

ParakeetForTDT usage

Pipeline

AutoModel

Timestamping

Making The Model Go Brrr

Parakeet supports full-graph compilation with CUDA graphs! This optimization is most effective when you know the maximum audio length you want to transcribe. The key idea is using static input shapes to avoid recompilation. For example, if you know your audio will be under 30 seconds, you can use the processor to pad all inputs to 30 seconds, preparing consistent input features and attention masks. See the example below!

import torch
from datasets import Audio, load_dataset

from transformers import AutoModelForCTC, AutoProcessor


processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-1.1b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-1.1b", device_map="auto")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]

# Compile the generate method with fullgraph and CUDA graphs
model.generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# let's define processor kwargs to pad to 30 seconds
processor_kwargs = {
    "padding": "max_length",
    "max_length": 30 * processor.feature_extractor.sampling_rate,
}

# Define a timing context using CUDA events
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")


inputs = processor(speech_samples[0], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("First generation - compiling...")
# Generate with the compiled model
with TimerContext("First generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))

inputs = processor(speech_samples[1], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("Second generation - recording CUDA graphs...")
with TimerContext("Second generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))

inputs = processor(speech_samples[2], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("Third generation - fast !!!")
with TimerContext("Third generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))

inputs = processor(speech_samples[3], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("Fourth generation - still fast !!!")
with TimerContext("Fourth generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))
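
To see why padding to a fixed length keeps the compiled graph reusable, here is a minimal sketch in plain Python. It only mimics the idea behind `padding="max_length"`; it is not the processor's code. The 16 kHz sampling rate is an assumption (in practice, read it from `processor.feature_extractor.sampling_rate`), and `pad_to_fixed_length` is a hypothetical helper.

```python
SAMPLING_RATE = 16_000  # assumed value, check processor.feature_extractor.sampling_rate
MAX_SECONDS = 30


def pad_to_fixed_length(samples, max_length, pad_value=0.0):
    """Right-pad a waveform to `max_length` samples and build a 0/1 mask."""
    if len(samples) > max_length:
        raise ValueError(f"clip has {len(samples)} samples, more than max_length={max_length}")
    num_padding = max_length - len(samples)
    padded = list(samples) + [pad_value] * num_padding
    attention_mask = [1] * len(samples) + [0] * num_padding
    return padded, attention_mask


max_length = MAX_SECONDS * SAMPLING_RATE

short_clip = [0.1] * (3 * SAMPLING_RATE)   # 3 seconds of audio
long_clip = [0.2] * (12 * SAMPLING_RATE)   # 12 seconds of audio

padded_short, mask_short = pad_to_fixed_length(short_clip, max_length)
padded_long, mask_long = pad_to_fixed_length(long_clip, max_length)

# Both inputs now have the same length, so a compiled graph can be reused.
print(len(padded_short), len(padded_long))  # 480000 480000
# The mask records how much of each input is real audio.
print(sum(mask_short), sum(mask_long))      # 48000 192000
```

Because every call sees the same input shape, `torch.compile` has no reason to retrace, and the attention mask lets the model ignore the padded tail.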

CTC Training

import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForCTC, AutoProcessor

model_id = "nvidia/parakeet-ctc-1.1b"
NUM_SAMPLES = 5

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
model.train()

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:NUM_SAMPLES]]
text_samples = ds["text"][:NUM_SAMPLES]

# passing `text` to the processor will prepare inputs' `labels` key
inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(model.device, dtype=model.dtype)

outputs = model(**inputs)
print("Loss:", outputs.loss.item())
outputs.loss.backward()
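
For intuition about what the reported loss measures, the sketch below computes the CTC negative log-likelihood of one target sequence with the standard forward algorithm in plain Python. It is an illustration only and says nothing about how the library computes `outputs.loss`; the function name and the toy probabilities are made up, and a real implementation would work in log space for numerical stability.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs[t][k] is the probability of label k at frame t; target is a list of label ids."""
    # Extended target with blanks interleaved: [blank, y1, blank, y2, ..., blank]
    extended = [blank]
    for label in target:
        extended += [label, blank]
    size = len(extended)

    # alpha[s]: total probability of all alignments of the frames seen so far
    # that end at position s of the extended target.
    alpha = [0.0] * size
    alpha[0] = probs[0][extended[0]]
    if size > 1:
        alpha[1] = probs[0][extended[1]]

    for t in range(1, len(probs)):
        updated = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank in between is only allowed for two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]
            updated[s] = total * probs[t][extended[s]]
        alpha = updated

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, two classes (0 = blank, 1 = "a"), uniform probabilities.
# Alignments that collapse to "a": "aa", "a_", "_a" -> 3 * 0.25 = 0.75
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, target=[1])
print(loss)
```

The loss sums over every frame-level alignment that collapses to the target, which is why no frame-level labels are needed for training.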

TDT Training

import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForTDT, AutoProcessor

model_id = "nvidia/parakeet-tdt-0.6b-v3"
NUM_SAMPLES = 4

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForTDT.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
model.train()

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:NUM_SAMPLES]]
text_samples = ds["text"][:NUM_SAMPLES]

# passing `text` to the processor will prepare inputs' `labels` key
inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(model.device, dtype=model.dtype)

outputs = model(**inputs)
print("Loss:", outputs.loss.item())
outputs.loss.backward()

ParakeetTokenizer

class transformers.ParakeetTokenizer

ParakeetFeatureExtractor

class transformers.ParakeetFeatureExtractor

__call__

ParakeetProcessor

class transformers.ParakeetProcessor

__call__

decode

ParakeetEncoderConfig

class transformers.ParakeetEncoderConfig

ParakeetCTCConfig

class transformers.ParakeetCTCConfig

ParakeetTDTConfig

class transformers.ParakeetTDTConfig

ParakeetEncoder

class transformers.ParakeetEncoder

forward

ParakeetForCTC

class transformers.ParakeetForCTC

forward

generate

ParakeetForTDT

class transformers.ParakeetForTDT

forward

call