adarshzolekar's Collections

Multimodal AI Models

updated Jan 23

Purpose: Models that understand more than one modality together. The models listed here combine images with text; none of them takes audio input.

  • llava-hf/llava-1.5-7b-hf

    Image-Text-to-Text • 7B params • Updated Jun 6, 2025 • 3.04M downloads • 342 likes

  • Salesforce/blip-image-captioning-base

    Image-to-Text • Updated Feb 3, 2025 • 2.96M downloads • 844 likes

  • google/pix2struct-base

    Image-to-Text • 0.3B params • Updated Dec 24, 2023 • 3.32k downloads • 76 likes

  • microsoft/kosmos-2-patch14-224

    Image-to-Text • Updated Nov 28, 2023 • 177k downloads • 184 likes

  • openbmb/MiniCPM-V-4_5

    Image-Text-to-Text • Updated Dec 18, 2025 • 62.5k downloads • 1.07k likes