Multimodal AI Models - a collection by adarshzolekar

Updated Jan 23

Purpose: Models that understand text + image + audio together.

llava-hf/llava-1.5-7b-hf
Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 3.04M downloads • 342 likes

Salesforce/blip-image-captioning-base
Image-to-Text • Updated Feb 3, 2025 • 2.96M downloads • 844 likes

google/pix2struct-base
Image-to-Text • 0.3B • Updated Dec 24, 2023 • 3.32k downloads • 76 likes

microsoft/kosmos-2-patch14-224
Image-to-Text • Updated Nov 28, 2023 • 177k downloads • 184 likes

openbmb/MiniCPM-V-4_5
Image-Text-to-Text • Updated Dec 18, 2025 • 62.5k downloads • 1.07k likes