# TGI Configuration Reference Guide

## Required Configuration

### Required Environment Variables
- `HF_TOKEN`: HuggingFace authentication token

### Required Command Line Arguments
**docker specific parameters**
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Enable privileged container mode
- `--net host`: Uses host network mode

Those are needed to run a TPU container so that the docker container can properly access the TPU hardware

**TGI specific parameters**
- `--model-id`: Model identifier to load from the HuggingFace hub

Those are parameters used by TGI and optimum-TPU to configure the server behavior.

## Optional Configuration

### Optional Environment Variables
- `JETSTREAM_PT_DISABLE`: Disable Jetstream PyTorch backend
- `QUANTIZATION`: Enable int8 quantization
- `MAX_BATCH_SIZE`: Set batch processing size, that is **static** on TPUs
- `LOG_LEVEL`: Set logging verbosity (useful for debugging). It can be set to info, debug or a comma separated list of attribute such text_generation_launcher,text_generation_router=debug
- `SKIP_WARMUP`: Skip model warmup phase

**Note on warmup:**
- TGI performs warmup to compile TPU operations for optimal performance
- For production use, never use `SKIP_WARMUP=1`; you can however use the parameters for debugging purposes to speed up model loading at the cost of slow model inference

You can view more options in the [TGI documentation](https://huggingface.co/docs/text-generation-inference/reference/launcher). Not all parameters might be compatible with TPUs (for example, all the CUDA-specific parameters)

TIP for TGI: you can pass most parameters to TGI as docker environment variables or docker arguments. So you can pass `--model-id google/gemma-2b-it` or `-e MODEL_ID=google/gemma-2b-it` to the `docker run` command

### Optional Command Line Arguments
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input/output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in batch

You can view more options in the [TGI documentation](https://huggingface.co/docs/text-generation-inference/reference/launcher). Not all parameters might be compatible with TPUs (for example, all the CUDA-specific parameters)

### Docker Requirements
When running TGI inside a container (recommended), the container should be started with:
- Privileged mode for TPU access
- Shared memory allocation (16GB recommended)
- Host IPC settings

## Example Command
Here's a complete example showing all major configuration options:

```bash
docker run -p 8080:80 \
    --shm-size 16GB \
    --privileged \
    --net host \
    -e QUANTIZATION=1 \
    -e MAX_BATCH_SIZE=2 \
    -e LOG_LEVEL=text_generation_router=debug \
    -v ~/hf_data:/data \
    -e HF_TOKEN= \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it \
    --max-input-length 512 \
    --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 \
    --max-batch-total-tokens 1024
```

You need to replace  with a HuggingFace access token that you can get [here](https://huggingface.co/settings/tokens)

If you already logged in via `huggingface-cli login`, then you can set HF_TOKEN=$(cat ~/.cache/huggingface/token) for more convenience

## Additional Resources
- [TGI Documentation](https://huggingface.co/docs/text-generation-inference)