Scaling Capability in Token Space: An Analysis of Large Vision Language Model
Abstract
Vision-language models exhibit distinct scaling regimes for vision tokens, with sublinear scaling for fewer tokens and linear scaling for more tokens, as characterized by a theoretical framework linking token number to representation correlations.
Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of distances between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for fewer vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure between vision token representations. Empirical validations across multiple vision-language benchmarks show that model performance matches the predictions of this scaling relationship. The findings contribute to understanding vision token scaling in transformers through a theoretical framework that complements empirical observations.
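To make the stated form concrete, the sketch below fits \(S(n) \approx c / n^{\alpha}\) to benchmark error as a function of vision token count. The data points, variable names, and the use of scipy.optimize.curve_fit are illustrative assumptions for exposition, not material or results from the paper.

```python
# Minimal sketch (assumed setup, not the paper's code): fit the scaling form
# S(n) = c / n^alpha to benchmark error as a function of vision token count n.
# The data points below are hypothetical placeholders, not reported results.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (vision token count, benchmark error) pairs.
n_tokens = np.array([16, 32, 64, 128, 256, 576], dtype=float)
error = np.array([0.42, 0.35, 0.30, 0.26, 0.23, 0.21])

def scaling_law(n, c, alpha):
    """Power-law form S(n) = c / n^alpha with a constant exponent."""
    return c / n**alpha

(c_hat, alpha_hat), _ = curve_fit(scaling_law, n_tokens, error, p0=(1.0, 0.3))
print(f"fitted c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")

# A token-dependent exponent alpha(n), as in the paper's framework, could be probed
# by fitting alpha separately on low-token and high-token subsets and comparing.
```

Fitting the exponent separately over the low-token and high-token ranges is one simple way to check for the two regimes the abstract describes, since a shift in the fitted \(\alpha\) between the subsets would indicate a change in scaling behavior.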