Transformers documentation
Continuous batching
This page documents the classes behind continuous batching inference: submitting prompts, configuring scheduling and memory limits, and retrieving results.
For usage examples, see the Continuous batching guide and for how scheduling and memory interact, see the Continuous batching architecture doc.
ContinuousMixin.generate_batch
transformers.ContinuousMixin.generate_batch
< source >( inputs: list generation_config: transformers.generation.configuration_utils.GenerationConfig | None = None continuous_batching_config: transformers.generation.configuration_utils.ContinuousBatchingConfig | None = None record_timestamps: bool = False progress_bar: bool = True persistent_manager: bool = False **kwargs ) → dict[str, GenerationOutput]
Parameters
- inputs — List of input token sequences (prompts)
- generation_config — Optional generation configuration
- continuous_batching_config — Optional continuous batching configuration
- record_timestamps — If set to True, each request records a timestamp for every generated token
- progress_bar — If set to True, a progress bar is displayed during generation
- persistent_manager — Whether to keep the manager alive after generation finishes. Defaults to False.
- **kwargs — Additional generation parameters. Only max_new_tokens is used, but other deprecated arguments are extracted and passed to the continuous_batching_config object.
Returns
dict[str, GenerationOutput]
A dictionary mapping request ids to GenerationOutput objects.
Generate sequences for a batch of prompts using continuous batching.
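A minimal sketch of the call pattern. The checkpoint name, the tokenizer usage, and the `generated_tokens` attribute are assumptions for illustration; only `generate_batch` and its parameters come from the signature above. A runnable stand-in for the returned mapping follows, so the iteration pattern can be exercised without loading a model:

```python
# Sketch of the generate_batch call pattern (assumptions noted above):
#
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# prompts = ["Hello, world!", "Continuous batching lets"]
# inputs = [tokenizer(p)["input_ids"] for p in prompts]  # list of token-id lists
# results = model.generate_batch(inputs, max_new_tokens=32)
# for request_id, output in results.items():
#     print(request_id, tokenizer.decode(output.generated_tokens))

# Runnable stand-in for the returned {request_id: GenerationOutput} mapping:
from dataclasses import dataclass

@dataclass
class FakeOutput:
    generated_tokens: list  # stand-in field mirroring the loop above

results = {"req_0": FakeOutput([15496, 11]), "req_1": FakeOutput([318, 257])}
lengths = {rid: len(out.generated_tokens) for rid, out in results.items()}
```

Note that `generate_batch` returns only once all requests have finished; for per-token delivery, use the manager's streaming interfaces below.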
ContinuousBatchingManager
class transformers.ContinuousBatchingManager
< source >( model: ProtoPretrainedModel generation_config: GenerationConfig continuous_batching_config: ContinuousBatchingConfig )
Manager for handling continuous batching of generation requests. It provides a user interface for submitting
generation requests, retrieving results, and managing the background generation thread. This class should not be
created directly, but through one of the following entry points (all methods of the ContinuousMixin mixin):
init_continuous_batching, continuous_batching_context_manager, or generate_batch.
add_request
< source >( input_ids: list request_id: str | None = None max_new_tokens: int | None = None streaming: bool = False record_timestamps: bool = False eos_token_id: int | list[int] | None = None ) → str
Add a new generation request to the queue.
cancel_request
< source >( request_id: str )
Cancel a request by its ID.
get_result
< source >( request_id: str | None = None timeout: float | None = None ) → Optional[GenerationOutput]
Retrieve one result from the output queue.
is_running
Check if the background generation thread is running.
join
< source >( stop_trigger_time: float timeout: float | None = None )
Wait for the background thread to finish.
register_result_handler
< source >( request_id: str callback: Callable )
Register a callback for result delivery (streaming or non-streaming).
The callback is invoked on the event loop via call_soon_threadsafe
each time a result is produced for this request. For streaming requests,
this happens on every token; for non-streaming, only on completion.
The handler is automatically cleaned up when the request finishes.
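The call_soon_threadsafe delivery described above can be illustrated with a small self-contained asyncio sketch. The handler name and token payloads here are made up; this is not the manager's internal code, only the same threading pattern:

```python
import asyncio
import threading

# Pattern sketch: a producer thread hands results to an event loop via
# call_soon_threadsafe, which schedules the registered callback on the
# loop thread. This mirrors how a per-request result handler is invoked.

received = []

def on_result(token):
    # Callback registered for one request id; runs on the event loop thread.
    received.append(token)

async def main():
    loop = asyncio.get_running_loop()
    done = asyncio.Event()

    def producer():
        # Simulates the background generation thread producing tokens.
        for token in [101, 102, 103]:
            loop.call_soon_threadsafe(on_result, token)
        loop.call_soon_threadsafe(done.set)

    threading.Thread(target=producer).start()
    await done.wait()

asyncio.run(main())
```

Because call_soon_threadsafe preserves scheduling order, the callback sees tokens in generation order even though they originate on another thread.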
request_id_iter
Iterate over results matching a specific request id (blocking).
Uses the shared output queue with requeue. For high-concurrency serving,
use register_result_handler instead.
start
Start the background generation thread.
stop
< source >( block: bool = True timeout: float | None = None keep_for_next_session: bool = False )
Signal the background thread to stop.
Continuous batching config
class transformers.ContinuousBatchingConfig
< source >( block_size: int = 256 num_blocks: int | None = None max_batch_tokens: int | None = None max_memory_percent: float = 0.8 max_blocks_per_request: int | None = 0 allow_block_sharing: bool = True use_async_batching: bool | None = None use_cuda_graph: bool | None = None q_padding_interval_size: int = 0 kv_padding_interval_size: int = 0 max_cached_graphs: int = 0 varlen_compile_config: transformers.generation.configuration_utils.CompileConfig | None = None decode_compile_config: transformers.generation.configuration_utils.CompileConfig | None = None use_default_compile_configs: bool = False scheduler_type: str = 'fifo' return_logprobs: bool = False max_queue_size: int = 0 )
Parameters
- block_size (int, optional, defaults to 256) — Size of each KV cache block in tokens.
- num_blocks (int, optional) — Number of blocks in the KV cache. Auto-inferred from GPU memory when None.
- max_batch_tokens (int, optional) — Maximum number of tokens in a batch. Auto-inferred from GPU memory when None.
- max_memory_percent (float, optional, defaults to 0.8) — Maximum percentage of free GPU memory (after the model is loaded) to use for the KV cache.
- max_blocks_per_request (int, optional, defaults to 0) — Maximum blocks per request, used in the flash_attn_with_kvcache fast decode path to dimension the block table. Setting this to 0 disables the fast decode path.
- allow_block_sharing (bool, optional, defaults to True) — Whether to allow block sharing for prefix caching. Block sharing can only be allowed, never forced, as some models do not support it. Disable it if you have few short prompts but long generation lengths.
- use_async_batching (bool, optional) — Whether to enable async double-buffering, which removes CPU overhead from the continuous batching loop at the cost of doubled VRAM usage. Auto-detected when None.
- use_cuda_graph (bool, optional) — Whether to enable CUDA graphs. Auto-inferred when None.
- q_padding_interval_size (int, optional, defaults to 0) — Query padding granularity in tokens for CUDA graphs. Uses a preset from continuous_api.py when set to 0.
- kv_padding_interval_size (int, optional, defaults to 0) — KV padding granularity in tokens for CUDA graphs. Uses a preset from continuous_api.py when set to 0.
- max_cached_graphs (int, optional, defaults to 0) — Maximum number of cached CUDA graphs. Uses a preset from continuous_api.py when set to 0.
- varlen_compile_config (CompileConfig, optional) — CompileConfig for the varlen (prefill) path. Defaults to None (falls back to generation_config). The varlen path handles batches with varying query and KV lengths, often benefiting from dynamic=True.
- decode_compile_config (CompileConfig, optional) — CompileConfig for the decode (fast) path. Defaults to None (falls back to generation_config). The decode path handles batches with no dynamic KV length, so static shapes are a better fit.
- use_default_compile_configs (bool, optional, defaults to False) — If True, a default compile config will be used for paths that are not explicitly set.
- scheduler_type (str, optional, defaults to "fifo") — Scheduler type to use.
- return_logprobs (bool, optional, defaults to False) — Whether to return log probabilities along with the generated tokens.
- max_queue_size (int, optional, defaults to 0) — Maximum request queue size for serving. 0 means unlimited.
Class that holds arguments relative to continuous batching, when using continuous batching through the
generate_batch method or the continuous_batching_context_manager context manager.
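As a worked example of how block_size interacts with request length (pure arithmetic, independent of the library): a paged KV cache reserves whole fixed-size blocks per request, so a request holding N tokens needs ceil(N / block_size) blocks. The helper name below is illustrative, not part of the API:

```python
import math

def blocks_needed(num_tokens: int, block_size: int = 256) -> int:
    # Number of fixed-size KV cache blocks required to hold num_tokens
    # tokens, rounding up since blocks are allocated whole.
    return math.ceil(num_tokens / block_size)

print(blocks_needed(100))   # a short prompt still occupies one full block
print(blocks_needed(256))   # exactly one block
print(blocks_needed(257))   # one extra token spills into a second block
print(blocks_needed(1000))  # 4 blocks
```

This is why a smaller block_size wastes less memory on short requests but produces more blocks to track, while a larger block_size does the opposite.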