Parakeet

This model was added to Hugging Face Transformers on 2025-09-25.

SDPA

Overview

Parakeet models, introduced by NVIDIA NeMo, are automatic speech recognition models that combine a Fast Conformer encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT), or token-and-duration transducer (TDT) decoder.

Model Architecture

  • Fast Conformer Encoder: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is a more efficient version of the Conformer Encoder found in FastSpeech2Conformer (see ParakeetEncoder for the encoder implementation and details).
  • ParakeetForCTC: a Fast Conformer Encoder + a CTC decoder
    • CTC Decoder: Simple but effective decoder consisting of:
      • 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
      • CTC loss computation for training.
      • Greedy CTC decoding for inference.
  • ParakeetForTDT: a Fast Conformer Encoder + a TDT (Token-and-Duration Transducer) decoder
    • TDT Decoder: Jointly predicts tokens and their durations, enabling efficient decoding:
      • LSTM prediction network maintains language context across token predictions.
      • Joint network combines encoder and decoder outputs.
      • Duration head predicts how many frames to skip, enabling fast inference.
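
The effect of the duration head on decoding can be illustrated with a toy greedy loop. This is a simplified sketch, not the Transformers or NeMo implementation: `tdt_greedy_decode` and the scripted `predict` callable are hypothetical stand-ins for the prediction and joint networks, and `blank_id=0` is an arbitrary choice for the toy data. The `durations` and `max_symbols_per_step` defaults mirror the ParakeetTDTConfig defaults documented on this page.

```python
def tdt_greedy_decode(num_frames, predict, durations=(0, 1, 2, 3, 4),
                      blank_id=0, max_symbols_per_step=10):
    """Toy greedy TDT loop: emit tokens and jump ahead by the predicted duration."""
    tokens, visited = [], []
    frame, emitted_here = 0, 0
    while frame < num_frames:
        visited.append(frame)
        # `predict` stands in for the prediction + joint networks (hypothetical).
        token, duration_index = predict(frame, tokens)
        skip = durations[duration_index]
        if token != blank_id:
            tokens.append(token)
            emitted_here += 1
        # Force progress when the loop would otherwise stall on this frame.
        if skip == 0 and (token == blank_id or emitted_here >= max_symbols_per_step):
            skip = 1
        if skip > 0:
            emitted_here = 0
        frame += skip
    return tokens, visited


# Scripted (token, duration_index) outputs replacing a real model.
scripted = iter([(11, 2), (0, 1), (12, 0), (13, 3), (0, 4)])
tokens, visited = tdt_greedy_decode(8, lambda frame, history: next(scripted))
print(tokens)   # [11, 12, 13]
print(visited)  # [0, 2, 3, 3, 6]
```

Only five joint evaluations are needed for eight encoder frames in this toy run, which is the kind of saving the duration head is meant to provide.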

The original implementation can be found in NVIDIA NeMo. Model checkpoints can be found under the NVIDIA organization.

This model was contributed by Nithin Rao Koluguri, Eustache Le Bihan, Eric Bezzam, Maksym Lypivskyi, and Hainan Xu.

Usage

ParakeetForCTC usage

Pipeline
from transformers import pipeline


pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")
out = pipe("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
print(out)
# {'text': 'yesterday it was thirty five degrees in barcelona but today the temperature will go down to minus twenty degrees'}

ParakeetForTDT usage

Pipeline

Parakeet TDT transcripts include casing, and the model can also perform token timestamping.

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-tdt-0.6b-v3")
out = pipe("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
print(out)
# {'text': 'Yesterday it was 35 degrees in Barcelona, but today the temperature will go down to minus 20 degrees.'}

Making The Model Go Brrr

Parakeet supports full-graph compilation with CUDA graphs! This optimization is most effective when you know the maximum audio length you want to transcribe. The key idea is using static input shapes to avoid recompilation. For example, if you know your audio will be under 30 seconds, you can use the processor to pad all inputs to 30 seconds, preparing consistent input features and attention masks. See the example below!

import torch
from datasets import Audio, load_dataset

from transformers import AutoModelForCTC, AutoProcessor


processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-1.1b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-1.1b", device_map="auto")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]

# Compile the generate method with fullgraph and CUDA graphs
model.generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# let's define processor kwargs to pad to 30 seconds
processor_kwargs = {
    "padding": "max_length",
    "max_length": 30 * processor.feature_extractor.sampling_rate,
}

# Define a timing context using CUDA events
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")


inputs = processor(speech_samples[0], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("First generation - compiling...")
# Generate with the compiled model
with TimerContext("First generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))

inputs = processor(speech_samples[1], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("Second generation - recording CUDA graphs...")
with TimerContext("Second generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))

inputs = processor(speech_samples[2], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("Third generation - fast !!!")
with TimerContext("Third generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))

inputs = processor(speech_samples[3], **processor_kwargs)
inputs.to(model.device, dtype=model.dtype)
print("\n" + "="*50)
print("Fourth generation - still fast !!!")
with TimerContext("Fourth generation"):
    outputs = model.generate(**inputs)
print(processor.decode(outputs))
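
To make the static-shape idea concrete, here is a minimal pure-Python sketch of what padding to a fixed length achieves. `pad_to_fixed_length` is a hypothetical helper written for illustration, not the processor's implementation; it only shows that clips of different durations end up with identical shapes once padded to 30 seconds at 16 kHz.

```python
def pad_to_fixed_length(samples, max_length, padding_value=0.0):
    """Right-pad a mono waveform to `max_length` samples and build its mask."""
    if len(samples) > max_length:
        raise ValueError("clip is longer than max_length; truncate or raise the limit")
    num_padding = max_length - len(samples)
    padded = list(samples) + [padding_value] * num_padding
    attention_mask = [1] * len(samples) + [0] * num_padding
    return padded, attention_mask


sampling_rate = 16000
max_length = 30 * sampling_rate  # 480000 samples

short_clip = [0.1] * (2 * sampling_rate)  # 2 seconds of dummy audio
long_clip = [0.1] * (7 * sampling_rate)   # 7 seconds of dummy audio

shapes = set()
for clip in (short_clip, long_clip):
    padded, mask = pad_to_fixed_length(clip, max_length)
    shapes.add((len(padded), len(mask)))

print(shapes)  # {(480000, 480000)}
```

Because every input has the same shape, a compiled graph recorded for one clip can be replayed for the next without recompilation.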

CTC Training

import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForCTC, AutoProcessor

model_id = "nvidia/parakeet-ctc-1.1b"
NUM_SAMPLES = 5

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
model.train()

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:NUM_SAMPLES]]
text_samples = ds["text"][:NUM_SAMPLES]

# passing `text` to the processor will prepare inputs' `labels` key
inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(model.device, dtype=model.dtype)

outputs = model(**inputs)
print("Loss:", outputs.loss.item())
outputs.loss.backward()

TDT Training

from datasets import Audio, load_dataset
import torch
from transformers import AutoModelForTDT, AutoProcessor

model_id = "nvidia/parakeet-tdt-0.6b-v3"
NUM_SAMPLES = 4

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForTDT.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
model.train()

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:NUM_SAMPLES]]
text_samples = ds["text"][:NUM_SAMPLES]

# passing `text` to the processor will prepare inputs' `labels` key
inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(model.device, dtype=model.dtype)

outputs = model(**inputs)
print("Loss:", outputs.loss.item())
outputs.loss.backward()

ParakeetTokenizer

class transformers.ParakeetTokenizer

( *args **kwargs )

Inherits all methods from PreTrainedTokenizerFast. Users should refer to this superclass for more information regarding those methods, except for _decode which is overridden to adapt it to CTC decoding:

  1. Group consecutive tokens
  2. Filter out the blank token
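
The two steps above can be sketched in a few lines of Python. This is an illustrative stand-in rather than the tokenizer's actual `_decode`; the function name `ctc_collapse` and the choice of `blank_id=0` are assumptions made for the example.

```python
from itertools import groupby


def ctc_collapse(token_ids, blank_id):
    """Collapse frame-level CTC predictions into a token sequence."""
    # Step 1: merge each run of identical consecutive ids into a single id.
    grouped = [token_id for token_id, _ in groupby(token_ids)]
    # Step 2: drop the blank id.
    return [token_id for token_id in grouped if token_id != blank_id]


frame_ids = [7, 7, 0, 7, 3, 3, 0, 0, 5]
print(ctc_collapse(frame_ids, blank_id=0))  # [7, 7, 3, 5]
```

The order matters: grouping before filtering is what lets a blank separate two genuine repetitions of the same token (the two 7s above).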

ParakeetFeatureExtractor

class transformers.ParakeetFeatureExtractor

( feature_size = 80 sampling_rate = 16000 hop_length = 160 n_fft = 512 win_length = 400 preemphasis = 0.97 padding_value = 0.0 **kwargs )

Parameters

  • feature_size (int, optional, defaults to 80) — The feature dimension of the extracted features.
  • sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
  • hop_length (int, optional, defaults to 160) — Number of samples between successive overlapping STFT windows used to obtain the mel-frequency coefficients.
  • n_fft (int, optional, defaults to 512) — Size of the Fourier transform.
  • win_length (int, optional, defaults to 400) — The window length for the STFT computation.
  • preemphasis (float, optional, defaults to 0.97) — A preemphasis filter coefficient. 0.0 means no preemphasis filter.
  • padding_value (float, optional, defaults to 0.0) — Padding value used to pad the audio. Should correspond to silences.

Constructs a Parakeet feature extractor.

This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

This class extracts mel-filter bank features from raw speech using a custom NumPy implementation of the Short Time Fourier Transform, which should match PyTorch’s torch.stft.
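
As a rough illustration of the default parameters, the sketch below converts `win_length` and `hop_length` to milliseconds and applies a textbook first-order pre-emphasis filter, y[t] = x[t] - coefficient * x[t-1]. The helper names are hypothetical, and whether the library handles the first sample exactly this way is an assumption.

```python
def window_and_hop_ms(win_length=400, hop_length=160, sampling_rate=16000):
    """Express the STFT window and hop sizes in milliseconds."""
    return 1000 * win_length / sampling_rate, 1000 * hop_length / sampling_rate


def preemphasize(samples, coefficient=0.97):
    """Textbook pre-emphasis: keep the first sample, then subtract a scaled copy
    of the previous sample from each following sample."""
    if not samples:
        return []
    filtered = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        filtered.append(current - coefficient * previous)
    return filtered


print(window_and_hop_ms())  # (25.0, 10.0)

constant_signal = [1.0, 1.0, 1.0, 1.0]
filtered_signal = preemphasize(constant_signal)  # values after the first are ~0.03
unfiltered_signal = preemphasize(constant_signal, coefficient=0.0)  # unchanged
```

A constant (low-frequency) signal is strongly attenuated after the first sample, while a coefficient of 0.0 leaves the input untouched, matching the parameter description above.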

__call__

( raw_speech: numpy.ndarray | list[float] | list[numpy.ndarray] | list[list[float]] truncation: bool = False pad_to_multiple_of: int | None = None return_tensors: str | transformers.utils.generic.TensorType | None = None return_attention_mask: bool | None = None padding: str | None = 'longest' max_length: int | None = None sampling_rate: int | None = None do_normalize: bool | None = None device: str | None = 'cpu' return_token_timestamps: bool | None = None **kwargs )

Parameters

  • raw_speech (np.ndarray, list[float], list[np.ndarray], list[list[float]]) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not stereo, i.e. single float per timestep.
  • truncation (bool, optional, defaults to False) — Activates truncation to cut input sequences longer than max_length to max_length.
  • pad_to_multiple_of (int, optional, defaults to None) — If set will pad the sequence to a multiple of the provided value.

    This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.

  • return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor’s default.

    For Parakeet models, attention_mask should always be passed for batched inference, to avoid subtle bugs.

  • return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.
    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return Numpy np.ndarray objects.
  • sampling_rate (int, optional) — The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate when calling the feature extractor, both to prevent silent errors and to support the automatic speech recognition pipeline.
  • padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding values / vectors.
  • do_normalize (bool, optional, defaults to False) — Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly improve the performance of the model.
  • device (str, optional, defaults to 'cpu') — Specifies the device for computation of the log-mel spectrogram of audio signals in the _torch_extract_fbank_features method. (e.g., “cpu”, “cuda”)
  • return_token_timestamps (bool, optional, defaults to None) — Deprecated. Use return_attention_mask instead from which the number of frames can be inferred.

    Whether or not to return the number of frames of the input raw_speech. These num_frames can be used by the model to compute word level timestamps.

Main method to featurize and prepare for the model one or several sequence(s). Implementation uses PyTorch for the STFT computation if available, otherwise a slower NumPy based one.
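
The length that `pad_to_multiple_of` targets is simply the input length rounded up to the next multiple. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
def round_up_to_multiple(length, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to `length`."""
    # Ceiling division written with integer arithmetic only.
    return -(-length // multiple) * multiple


print(round_up_to_multiple(1001, 128))  # 1024
print(round_up_to_multiple(1024, 128))  # 1024
```

A length that is already a multiple is left unchanged, so no extra padding is added in that case.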

ParakeetProcessor

class transformers.ParakeetProcessor

( feature_extractor tokenizer blank_token = '<blank>' )

Parameters

  • feature_extractor (ParakeetFeatureExtractor) — The feature extractor is a required input.
  • tokenizer (ParakeetTokenizer) — The tokenizer is a required input.
  • blank_token (str, optional, defaults to "<blank>") — Blank token for TDT decoding.

Constructs a ParakeetProcessor which wraps a feature extractor and a tokenizer into a single processor.

ParakeetProcessor offers all the functionalities of ParakeetFeatureExtractor and ParakeetTokenizer. See ParakeetFeatureExtractor and ParakeetTokenizer for more information.

__call__

( audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor']] text: str | list[str] | list[list[str]] | None = None sampling_rate: int | None = None **kwargs: typing_extensions.Unpack[transformers.models.parakeet.processing_parakeet.ParakeetProcessorKwargs] )

Parameters

  • audio (Union[numpy.ndarray, torch.Tensor, collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence[torch.Tensor]]) — The audio or batch of audios to be prepared. Each audio can be a NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is a number of channels, and T is the sample length of the audio.
  • text (Union[str, list[str], list[list[str]]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If you pass a pretokenized input, set is_split_into_words=True to avoid ambiguity with batched inputs.
  • sampling_rate (int, optional) — The sampling rate of the input audio in Hz. This should match the sampling rate expected by the feature extractor (defaults to 16000 Hz). If provided, it will be validated against the processor’s expected sampling rate, and an error will be raised if they don’t match. If not provided, a warning will be issued and the default sampling rate will be assumed.
  • return_tensors (str or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:

    • 'pt': Return PyTorch torch.Tensor objects.
    • 'np': Return NumPy np.ndarray objects.
  • **kwargs (ProcessingKwargs, optional) — Additional processing options for each modality (text, images, videos, audio). Model-specific parameters are listed above; see the TypedDict class for the complete list of supported arguments.

decode

( *args durations = None **kwargs )

Forward arguments to decode() and post-process the timestamps (if provided for TDT) as in the NeMo library.
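
To give a feel for how frame indices and durations map to time, the sketch below converts them to seconds. It is not the post-processing used by `ParakeetProcessor.decode` or NeMo: the helper `token_spans_in_seconds` is hypothetical, and it assumes one encoder frame covers `hop_length * subsampling_factor` input samples, using the default values documented on this page (160, 8, and a 16000 Hz sampling rate).

```python
def token_spans_in_seconds(start_frames, durations, hop_length=160,
                           subsampling_factor=8, sampling_rate=16000):
    """Convert per-token (start frame, duration in frames) pairs to seconds."""
    samples_per_frame = hop_length * subsampling_factor  # 1280 samples = 80 ms
    spans = []
    for start, duration in zip(start_frames, durations):
        start_seconds = start * samples_per_frame / sampling_rate
        end_seconds = (start + duration) * samples_per_frame / sampling_rate
        spans.append((start_seconds, end_seconds))
    return spans


print(token_spans_in_seconds([0, 3, 5], [3, 2, 4]))
# [(0.0, 0.24), (0.24, 0.4), (0.4, 0.72)]
```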

ParakeetEncoderConfig

class transformers.ParakeetEncoderConfig

( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None hidden_size: int = 1024 num_hidden_layers: int = 24 num_attention_heads: int = 8 intermediate_size: int = 4096 hidden_act: str = 'silu' attention_bias: bool = True convolution_bias: bool = True conv_kernel_size: int = 9 subsampling_factor: int = 8 subsampling_conv_channels: int = 256 num_mel_bins: int = 80 subsampling_conv_kernel_size: int = 3 subsampling_conv_stride: int = 2 dropout: float | int = 0.1 dropout_positions: float | int = 0.0 layerdrop: float | int = 0.1 activation_dropout: float | int = 0.1 attention_dropout: float | int = 0.1 max_position_embeddings: int = 5000 scale_input: bool = True initializer_range: float = 0.02 )

Parameters

  • hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
  • num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Conformer encoder.
  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Conformer encoder.
  • intermediate_size (int, optional, defaults to 4096) — Dimension of the MLP representations.
  • hidden_act (str, optional, defaults to silu) — The non-linear activation function (function or string) in the encoder. For example, "gelu", "relu", "silu", etc.
  • attention_bias (bool, optional, defaults to True) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
  • convolution_bias (bool, optional, defaults to True) — Whether to use bias in convolutions of the conformer’s convolution module.
  • conv_kernel_size (int, optional, defaults to 9) — The kernel size of the convolution layers in the Conformer block.
  • subsampling_factor (int, optional, defaults to 8) — The factor by which the input sequence is subsampled.
  • subsampling_conv_channels (int, optional, defaults to 256) — The number of channels in the subsampling convolution layers.
  • num_mel_bins (int, optional, defaults to 80) — Number of mel features.
  • subsampling_conv_kernel_size (int, optional, defaults to 3) — The kernel size of the subsampling convolution layers.
  • subsampling_conv_stride (int, optional, defaults to 2) — The stride of the subsampling convolution layers.
  • dropout (Union[float, int], optional, defaults to 0.1) — The ratio for all dropout layers.
  • dropout_positions (float, optional, defaults to 0.0) — The dropout ratio for the positions in the input sequence.
  • layerdrop (Union[float, int], optional, defaults to 0.1) — The LayerDrop probability. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
  • activation_dropout (Union[float, int], optional, defaults to 0.1) — The dropout ratio for activations inside the fully connected layer.
  • attention_dropout (Union[float, int], optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
  • max_position_embeddings (int, optional, defaults to 5000) — The maximum sequence length that this model might ever be used with.
  • scale_input (bool, optional, defaults to True) — Whether to scale the input embeddings.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of a ParakeetEncoder. It is used to instantiate a Parakeet encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the encoder of nvidia/parakeet-ctc-1.1b.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import ParakeetEncoder, ParakeetEncoderConfig

>>> # Initializing a `ParakeetEncoder` configuration
>>> configuration = ParakeetEncoderConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = ParakeetEncoder(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
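
The subsampling parameters can be related with a short length calculation. The sketch below assumes the subsampling stack applies stride-2 convolutions until the overall `subsampling_factor` is reached (three layers for a factor of 8) and that each convolution pads by `(kernel_size - 1) // 2`; both are assumptions for illustration, and `subsampled_length` is a hypothetical helper, not a library function.

```python
def subsampled_length(num_mel_frames, subsampling_factor=8, kernel_size=3, stride=2):
    """Estimate the encoder sequence length after convolutional subsampling."""
    # Count how many strided layers are needed to reach the overall factor.
    num_layers, reached = 0, 1
    while reached < subsampling_factor:
        reached *= stride
        num_layers += 1

    padding = (kernel_size - 1) // 2  # assumed "same"-style padding
    length = num_mel_frames
    for _ in range(num_layers):
        # Standard output-length formula for a strided convolution.
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length


print(subsampled_length(3000))  # 375
print(subsampled_length(3001))  # 376
```

With a 10 ms hop, 30 seconds of audio gives roughly 3000 mel frames, so the encoder would see on the order of 375 positions under these assumptions.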

ParakeetCTCConfig

class transformers.ParakeetCTCConfig

( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None vocab_size: int = 1025 ctc_loss_reduction: str = 'mean' ctc_zero_infinity: bool = True encoder_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None pad_token_id: int | None = 1024 )

Parameters

  • vocab_size (int, optional, defaults to 1025) — Vocabulary size of the model. Defines the number of different tokens that can be represented by the input_ids.
  • ctc_loss_reduction (str, optional, defaults to "mean") — Specifies the reduction to apply to the output of torch.nn.CTCLoss. Only relevant when training an instance of ParakeetForCTC.
  • ctc_zero_infinity (bool, optional, defaults to True) — Whether to zero infinite losses and the associated gradients of torch.nn.CTCLoss. Infinite losses mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance of ParakeetForCTC.
  • encoder_config (Union[dict, ParakeetEncoderConfig], optional) — The config object or dictionary of the encoder.
  • pad_token_id (int, optional, defaults to 1024) — Token id used for padding in the vocabulary.

This is the configuration class to store the configuration of a ParakeetForCTC. It is used to instantiate a Parakeet CTC model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of nvidia/parakeet-ctc-1.1b.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import ParakeetForCTC, ParakeetCTCConfig
>>> # Initializing a Parakeet configuration
>>> configuration = ParakeetCTCConfig()
>>> # Initializing a model from the configuration
>>> model = ParakeetForCTC(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
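
The `ctc_zero_infinity` description mentions inputs that are too short to be aligned to the targets. A minimal sketch of that feasibility condition follows: a CTC alignment needs at least one frame per label plus one extra blank frame between each pair of identical adjacent labels. The helper names are hypothetical.

```python
def ctc_min_frames(labels):
    """Minimum number of frames needed to align `labels` under CTC."""
    # Identical neighbours must be separated by a blank frame.
    repeats = sum(1 for left, right in zip(labels, labels[1:]) if left == right)
    return len(labels) + repeats


def ctc_alignable(num_frames, labels):
    """True when the loss for this pair would be finite."""
    return num_frames >= ctc_min_frames(labels)


print(ctc_min_frames([5, 5, 7]))    # 4
print(ctc_alignable(3, [5, 5, 7]))  # False
print(ctc_alignable(4, [5, 5, 7]))  # True
```

Because the encoder subsamples its input, `num_frames` here refers to the encoder output length, not the number of mel frames.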

ParakeetTDTConfig

class transformers.ParakeetTDTConfig

( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None is_encoder_decoder: bool = True vocab_size: int = 8193 decoder_hidden_size: int = 640 num_decoder_layers: int = 2 hidden_act: str = 'relu' max_symbols_per_step: int = 10 durations: list[int] | tuple[int, ...] = (0, 1, 2, 3, 4) encoder_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None pad_token_id: int = 2 blank_token_id: int = 8192 )

Parameters

  • is_encoder_decoder (bool, optional, defaults to True) — Whether the model is used as an encoder/decoder or not.
  • vocab_size (int, optional, defaults to 8193) — Vocabulary size of the model. Defines the number of different tokens that can be represented by the input_ids.
  • decoder_hidden_size (int, optional, defaults to 640) — Hidden size of the LSTM prediction network and joint network.
  • num_decoder_layers (int, optional, defaults to 2) — Number of LSTM layers in the prediction network.
  • hidden_act (str, optional, defaults to relu) — The non-linear activation function (function or string) in the decoder. For example, "gelu", "relu", "silu", etc.
  • max_symbols_per_step (int, optional, defaults to 10) — Maximum number of symbols to emit per encoder time step during greedy decoding.
  • durations (list[int], optional, defaults to [0, 1, 2, 3, 4]) — Token duration values that can be predicted. Each value represents how many frames a token or blank emission spans.
  • encoder_config (Union[dict, ParakeetEncoderConfig], optional) — The config object or dictionary of the encoder.
  • pad_token_id (int, optional, defaults to 2) — Token id used for padding in the vocabulary.
  • blank_token_id (int, optional, defaults to 8192) — Blank token id. Different from pad_token_id for TDT.

This is the configuration class to store the configuration of a ParakeetForTDT. It is used to instantiate a Parakeet TDT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of nvidia/parakeet-tdt-0.6b-v3.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import ParakeetForTDT, ParakeetTDTConfig

>>> # Initializing a Parakeet TDT configuration
>>> configuration = ParakeetTDTConfig()

>>> # Initializing a model from the configuration
>>> model = ParakeetForTDT(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

ParakeetEncoder

class transformers.ParakeetEncoder

( config: ParakeetEncoderConfig )

Parameters

  • config (ParakeetEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Parakeet Encoder model, based on the Fast Conformer architecture.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_features: Tensor attention_mask: torch.Tensor | None = None output_attention_mask: bool = True **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) BaseModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_features (torch.Tensor of shape (batch_size, sequence_length, feature_dim)) — The tensors corresponding to the input audio features. Audio features can be obtained using ParakeetFeatureExtractor. See ParakeetFeatureExtractor.__call__ for details (ParakeetProcessor uses ParakeetFeatureExtractor for processing audio).
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

  • output_attention_mask (bool, optional, defaults to True) — Whether to return the output attention mask. Only effective when attention_mask is provided.

Returns

BaseModelOutput or tuple(torch.FloatTensor)

A BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ParakeetEncoderConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ParakeetEncoder forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, ParakeetEncoder
>>> from datasets import load_dataset, Audio

>>> model_id = "nvidia/parakeet-ctc-1.1b"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> encoder = ParakeetEncoder.from_pretrained(model_id)

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

>>> inputs = processor(ds[0]["audio"]["array"])
>>> encoder_outputs = encoder(**inputs)

>>> print(encoder_outputs.last_hidden_state.shape)

ParakeetForCTC

class transformers.ParakeetForCTC

( config: ParakeetCTCConfig )

Parameters

  • config (ParakeetCTCConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Parakeet Encoder with a Connectionist Temporal Classification (CTC) head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_features: Tensor attention_mask: torch.Tensor | None = None labels: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) CausalLMOutput or tuple(torch.FloatTensor)

Parameters

  • input_features (torch.Tensor of shape (batch_size, sequence_length, feature_dim)) — The tensors corresponding to the input audio features. Audio features can be obtained using ParakeetFeatureExtractor. See ParakeetFeatureExtractor.__call__ for details (ParakeetProcessor uses ParakeetFeatureExtractor for processing audio).
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

  • labels (torch.Tensor of shape (batch_size, sequence_length), optional) — Labels for computing the CTC loss. Indices should either be in [0, ..., config.vocab_size] or -100. Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

Returns

CausalLMOutput or tuple(torch.FloatTensor)

A CausalLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ParakeetCTCConfig) and inputs.

The ParakeetForCTC forward method overrides the __call__ special method.

Although the recipe for the forward pass is defined within this function, call the Module instance rather than forward() directly: calling the instance runs the pre- and post-processing steps, while calling forward() silently skips them.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — CTC loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the CTC head (scores for each vocabulary token before SoftMax).

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
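To illustrate how per-frame logits like those returned above turn into a token sequence, here is a minimal sketch of greedy CTC decoding in plain Python: take the argmax per frame, collapse consecutive repeats, and drop the blank. The function name ctc_greedy_decode and the choice of blank_id are assumptions made for illustration; they are not taken from the library, and the model's own generate method should be preferred in practice.

```python
def ctc_greedy_decode(frame_logits, blank_id):
    """Collapse per-frame argmax ids into a token sequence (greedy CTC)."""
    decoded = []
    previous = None
    for scores in frame_logits:
        # Argmax over the vocabulary axis for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Keep a token only when it starts a new run and is not the blank.
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Toy logits for 7 frames over 4 classes; class 3 plays the blank (an assumption).
frame_logits = [
    [0.1, 2.0, 0.0, 0.3],  # argmax 1
    [0.2, 1.5, 0.1, 0.4],  # argmax 1, repeat of the previous frame
    [0.0, 0.1, 0.2, 3.0],  # blank
    [0.3, 2.2, 0.1, 0.0],  # argmax 1, new run after the blank
    [0.1, 0.0, 1.9, 0.2],  # argmax 2
    [0.0, 0.3, 2.5, 0.1],  # argmax 2, repeat of the previous frame
    [0.2, 0.1, 0.0, 1.0],  # blank
]
token_ids = ctc_greedy_decode(frame_logits, blank_id=3)
print(token_ids)
```

The blank between the second and fourth frames is what allows the same token to appear twice in a row in the output, so this toy input decodes to [1, 1, 2].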

Example:

>>> from transformers import AutoProcessor, ParakeetForCTC
>>> from datasets import load_dataset, Audio

>>> model_id = "nvidia/parakeet-ctc-1.1b"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = ParakeetForCTC.from_pretrained(model_id)

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

>>> inputs = processor(ds[0]["audio"]["array"], text=ds[0]["text"])
>>> outputs = model(**inputs)

>>> print(outputs.loss)

generate

( input_features: Tensor, attention_mask: torch.Tensor | None = None, return_dict_in_generate: bool = False, compile_config: transformers.generation.configuration_utils.CompileConfig | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )

Parameters

  • compile_config (CompileConfig, optional) — If provided, torch.compile will be applied to the forward calls in the decoding loop.

Example:

>>> from transformers import AutoProcessor, ParakeetForCTC
>>> from datasets import load_dataset, Audio

>>> model_id = "nvidia/parakeet-ctc-1.1b"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = ParakeetForCTC.from_pretrained(model_id)

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

>>> inputs = processor(ds[0]["audio"]["array"], text=ds[0]["text"])
>>> predicted_ids = model.generate(**inputs)
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> print(transcription)

ParakeetForTDT

class transformers.ParakeetForTDT

( config: ParakeetTDTConfig )

Parameters

  • config (ParakeetTDTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Parakeet Encoder with a TDT (Token Duration Transducer) head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_features: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, decoder_input_ids: torch.LongTensor | None = None, decoder_cache: transformers.models.parakeet.generation_parakeet.ParakeetTDTDecoderCache | None = None, use_decoder_cache: bool | None = None, encoder_outputs: transformers.models.parakeet.modeling_parakeet.ParakeetEncoderModelOutput | tuple[torch.FloatTensor] | None = None, labels: torch.Tensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → ParakeetTDTOutput or tuple(torch.FloatTensor)

Parameters

  • input_features (torch.Tensor of shape (batch_size, sequence_length, feature_dim), optional) — The tensors corresponding to the input audio features. Audio features can be obtained using feature_extractor_class. See feature_extractor_class.__call__ for details (processor_class uses feature_extractor_class for processing audios).
  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, 1), optional) — Decoder input token ids for single-step inference.
  • decoder_cache (ParakeetTDTDecoderCache, optional) — Decoder LSTM cache. When provided and initialized, the cached decoder_output is reused (e.g. during blank-skipping) instead of running the decoder. When decoder_input_ids is provided, the decoder runs and the cache is updated in-place.
  • use_decoder_cache (bool, optional) — Whether to use a decoder cache. When True and decoder_cache is None, a new cache is created automatically during the forward pass.
  • encoder_outputs (tuple(torch.FloatTensor), optional) — Pre-computed encoder outputs (last_hidden_state, pooler_output, hidden_states, attentions, attention_mask). Can be a tuple or ParakeetEncoderModelOutput.
  • labels (torch.Tensor of shape (batch_size, sequence_length), optional) — Labels for computing the TDT loss. Indices should either be in [0, ..., config.vocab_size] or -100. Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

Returns

ParakeetTDTOutput or tuple(torch.FloatTensor)

A ParakeetTDTOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ParakeetTDTConfig) and inputs.

The ParakeetForTDT forward method overrides the __call__ special method.

Although the recipe for the forward pass is defined within this function, call the Module instance rather than forward() directly: calling the instance runs the pre- and post-processing steps, while calling forward() silently skips them.

  • loss (torch.FloatTensor, optional) — TDT loss, returned when labels are provided.
  • logits (torch.FloatTensor) — Joint token and duration logits. Shape is (batch, T, U+1, vocab+durations) for training or (batch, 1, 1, vocab+durations) for single-step inference.
  • decoder_cache (ParakeetTDTDecoderCache, optional) — Decoder LSTM cache containing hidden state, cell state, and last output.
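Since the last axis of logits concatenates token and duration scores (vocab+durations), a single greedy decoding step can be sketched as follows. This is an illustrative sketch in plain Python, not the library's decoding loop: the function name tdt_greedy_step, the num_token_logits split point, the durations list, and blank_id are all hypothetical, and forcing a one-frame advance for a blank with zero duration is a safeguard assumed here to keep decoding moving.

```python
def argmax(values):
    """Index of the largest value in a non-empty list."""
    return max(range(len(values)), key=values.__getitem__)


def tdt_greedy_step(joint_logits, num_token_logits, durations, blank_id):
    """Pick a token and a frame skip from one joint-logit vector.

    `joint_logits` is one vector of length num_token_logits + len(durations);
    all argument names are assumptions made for this sketch.
    """
    # The first slice scores tokens (including the blank), the rest scores durations.
    token_logits = joint_logits[:num_token_logits]
    duration_logits = joint_logits[num_token_logits:]

    token = argmax(token_logits)
    skip = durations[argmax(duration_logits)]

    # Assumed safeguard: a blank that predicts zero duration still advances one frame.
    if token == blank_id and skip == 0:
        skip = 1
    return token, skip


durations = [0, 1, 2]  # hypothetical duration bins, in encoder frames
step_a = tdt_greedy_step([0.1, 2.0, 0.3, -1.0, 0.5, 1.5, 0.2], 4, durations, blank_id=3)
step_b = tdt_greedy_step([0.0, 0.0, 0.0, 5.0, 3.0, 1.0, 0.0], 4, durations, blank_id=3)
print(step_a, step_b)
```

In a full decoding loop, the returned skip would advance the encoder frame index, and a non-blank token would be fed back as decoder_input_ids for the next step.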

Example:

>>> from transformers import AutoProcessor, ParakeetForTDT
>>> from datasets import load_dataset, Audio

>>> model_id = "nvidia/parakeet-tdt-0.6b-v3"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = ParakeetForTDT.from_pretrained(model_id)

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

>>> inputs = processor(ds[0]["audio"]["array"])
>>> outputs = model(**inputs)