Transformers documentation
Continuous batching
This page documents the classes behind continuous batching inference: submitting prompts, configuring scheduling and memory limits, and retrieving results.
For usage examples, see the Continuous batching guide and for how scheduling and memory interact, see the Continuous batching architecture doc.
ContinuousMixin.generate_batch
transformers.ContinuousMixin.generate_batch
< source >( inputs: list generation_config: transformers.generation.configuration_utils.GenerationConfig | None = None continuous_batching_config: transformers.generation.configuration_utils.ContinuousBatchingConfig | None = None record_timestamps: bool = False progress_bar: bool = True persistent_manager: bool = False **kwargs ) → dict[str, GenerationOutput]
Parameters
- inputs — List of input token sequences (prompts)
- generation_config — Optional generation configuration
- continuous_batching_config — Optional continuous batching configuration
- record_timestamps — If set to True, each request records a timestamp for every generated token
- progress_bar — If set to True, a progress bar is displayed during generation
- persistent_manager — Whether to keep the manager alive after generation finishes. Defaults to False.
- **kwargs — Additional generation parameters. Only max_new_tokens is used, but other deprecated arguments are extracted and passed to the continuous_batching_config object.
Returns
dict[str, GenerationOutput]
A dictionary mapping request ids to GenerationOutput objects.
Generate sequences for a batch of prompts using continuous batching.
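A minimal sketch of the call pattern. The checkpoint name, the tokenizer usage, and the `generated_tokens` attribute are assumptions for illustration; only `generate_batch` and its parameters come from the signature above. A runnable stand-in for the returned mapping follows, so the iteration pattern can be exercised without loading a model:

```python
# Sketch of the generate_batch call pattern (assumptions noted above):
#
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# prompts = ["Hello, world!", "Continuous batching lets"]
# inputs = [tokenizer(p)["input_ids"] for p in prompts]  # list of token-id lists
# results = model.generate_batch(inputs, max_new_tokens=32)
# for request_id, output in results.items():
#     print(request_id, tokenizer.decode(output.generated_tokens))

# Runnable stand-in for the returned {request_id: GenerationOutput} mapping:
from dataclasses import dataclass

@dataclass
class FakeOutput:
    generated_tokens: list  # stand-in field mirroring the loop above

results = {"req_0": FakeOutput([15496, 11]), "req_1": FakeOutput([318, 257])}
lengths = {rid: len(out.generated_tokens) for rid, out in results.items()}
```

Note that `generate_batch` returns only once all requests have finished; for per-token delivery, use the manager's streaming interfaces below.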
ContinuousBatchingManager
class transformers.ContinuousBatchingManager
< source >( model: ProtoPretrainedModel generation_config: GenerationConfig continuous_batching_config: ContinuousBatchingConfig )
Manager for handling continuous batching of generation requests. It provides a user interface for submitting
generation requests, retrieving results, and managing the background generation thread. This class should not be
created directly, but through one of the following entry points (all methods of the ContinuousMixin mixin):
init_continuous_batching, continuous_batching_context_manager, or generate_batch.
add_request
< source >( input_ids: list request_id: str | None = None max_new_tokens: int | None = None streaming: bool = False record_timestamps: bool = False eos_token_id: int | list[int] | None = None ) → str
Add a new generation request to the queue.
cancel_request
< source >( request_id: str )
Cancel a request by its ID.
get_result
< source >( request_id: str | None = None timeout: float | None = None ) → Optional[GenerationOutput]
Retrieve one result from the output queue.
is_running
Check if the background generation thread is running.
join
< source >( stop_trigger_time: float timeout: float | None = None )
Wait for the background thread to finish.
register_result_handler
< source >( request_id: str callback: Callable )
Register a callback for result delivery (streaming or non-streaming).
The callback is invoked on the event loop via call_soon_threadsafe
each time a result is produced for this request. For streaming requests,
this happens on every token; for non-streaming, only on completion.
The handler is automatically cleaned up when the request finishes.
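The call_soon_threadsafe delivery described above can be illustrated with a small self-contained asyncio sketch. The handler name and token payloads here are made up; this is not the manager's internal code, only the same threading pattern:

```python
import asyncio
import threading

# Pattern sketch: a producer thread hands results to an event loop via
# call_soon_threadsafe, which schedules the registered callback on the
# loop thread. This mirrors how a per-request result handler is invoked.

received = []

def on_result(token):
    # Callback registered for one request id; runs on the event loop thread.
    received.append(token)

async def main():
    loop = asyncio.get_running_loop()
    done = asyncio.Event()

    def producer():
        # Simulates the background generation thread producing tokens.
        for token in [101, 102, 103]:
            loop.call_soon_threadsafe(on_result, token)
        loop.call_soon_threadsafe(done.set)

    threading.Thread(target=producer).start()
    await done.wait()

asyncio.run(main())
```

Because call_soon_threadsafe preserves scheduling order, the callback sees tokens in generation order even though they originate on another thread.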
request_id_iter
Iterate over results matching a specific request id (blocking).
Uses the shared output queue with requeue. For high-concurrency serving,
use register_result_handler instead.
start
Start the background generation thread.
stop
< source >( block: bool = True timeout: float | None = None keep_for_next_session: bool = False )
Signal the background thread to stop.
Continuous batching config
class transformers.ContinuousBatchingConfig
< source >( block_size: int = 256 num_blocks: int | None = None max_batch_tokens: int | None = None max_memory_percent: float = 0.8 max_blocks_per_request: int | None = 0 allow_block_sharing: bool = True use_async_batching: bool | None = None use_cuda_graph: bool | None = None q_padding_interval_size: int = 0 kv_padding_interval_size: int = 0 max_cached_graphs: int = 0 varlen_compile_config: transformers.generation.configuration_utils.CompileConfig | None = None decode_compile_config: transformers.generation.configuration_utils.CompileConfig | None = None use_default_compile_configs: bool = False scheduler_type: str = 'fifo' return_logprobs: bool = False max_queue_size: int = 0 )
Parameters
- block_size (int, optional, defaults to 256) — Size of each KV cache block in tokens.
- num_blocks (int, optional) — Number of blocks in the KV cache. Auto-inferred from GPU memory when None.
- max_batch_tokens (int, optional) — Maximum number of tokens in a batch. Auto-inferred from GPU memory when None.
- max_memory_percent (float, optional, defaults to 0.8) — Maximum percentage of free GPU memory (after the model is loaded) to use for the KV cache.
- max_blocks_per_request (int, optional, defaults to 0) — Maximum blocks per request, used in the flash_attn_with_kvcache fast decode path to dimension the block table. Setting this to 0 disables the fast decode path.
- allow_block_sharing (bool, optional, defaults to True) — Whether to allow block sharing for prefix caching. Block sharing can only be allowed, never forced, as some models do not support it. Disable it if you have few short prompts but long generation lengths.
- use_async_batching (bool, optional) — Whether to enable async double-buffering, which removes CPU overhead from the continuous batching loop at the cost of doubled VRAM usage. Auto-detected when None.
- use_cuda_graph (bool, optional) — Whether to enable CUDA graphs. Auto-inferred when None.
- q_padding_interval_size (int, optional, defaults to 0) — Query padding granularity in tokens for CUDA graphs. Uses a preset from continuous_api.py when set to 0.
- kv_padding_interval_size (int, optional, defaults to 0) — KV padding granularity in tokens for CUDA graphs. Uses a preset from continuous_api.py when set to 0.
- max_cached_graphs (int, optional, defaults to 0) — Maximum number of cached CUDA graphs. Uses a preset from continuous_api.py when set to 0.
- varlen_compile_config (CompileConfig, optional) — CompileConfig for the varlen (prefill) path. Defaults to None (falls back to generation_config). The varlen path handles batches with varying query and KV lengths, often benefiting from dynamic=True.
- decode_compile_config (CompileConfig, optional) — CompileConfig for the decode (fast) path. Defaults to None (falls back to generation_config). The decode path handles batches with no dynamic KV length, so static shapes are a better fit.
- use_default_compile_configs (bool, optional, defaults to False) — If True, a default compile config will be used for paths that are not explicitly set.
- scheduler_type (str, optional, defaults to "fifo") — Scheduler type to use.
- return_logprobs (bool, optional, defaults to False) — Whether to return log probabilities along with the generated tokens.
- max_queue_size (int, optional, defaults to 0) — Maximum request queue size for serving. 0 means unlimited.
Class that holds arguments relative to continuous batching, when using continuous batching through the
generate_batch method or the continuous_batching_context_manager context manager.
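As a worked example of how block_size interacts with request length (pure arithmetic, independent of the library): a paged KV cache reserves whole fixed-size blocks per request, so a request holding N tokens needs ceil(N / block_size) blocks. The helper name below is illustrative, not part of the API:

```python
import math

def blocks_needed(num_tokens: int, block_size: int = 256) -> int:
    # Number of fixed-size KV cache blocks required to hold num_tokens
    # tokens, rounding up since blocks are allocated whole.
    return math.ceil(num_tokens / block_size)

print(blocks_needed(100))   # a short prompt still occupies one full block
print(blocks_needed(256))   # exactly one block
print(blocks_needed(257))   # one extra token spills into a second block
print(blocks_needed(1000))  # 4 blocks
```

This is why a smaller block_size wastes less memory on short requests but produces more blocks to track, while a larger block_size does the opposite.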