adarshzolekar's Collections
Multimodal AI Models
Updated Jan 23
Purpose: Models that jointly understand text, images, and audio.
llava-hf/llava-1.5-7b-hf • Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 3.04M downloads • 342 likes
Salesforce/blip-image-captioning-base • Image-to-Text • Updated Feb 3, 2025 • 2.96M downloads • 844 likes
google/pix2struct-base • Image-to-Text • 0.3B • Updated Dec 24, 2023 • 3.32k downloads • 76 likes
microsoft/kosmos-2-patch14-224 • Image-to-Text • Updated Nov 28, 2023 • 177k downloads • 184 likes
openbmb/MiniCPM-V-4_5 • Image-Text-to-Text • Updated Dec 18, 2025 • 62.5k downloads • 1.07k likes