OnePiece123's Collections
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Paper
• 2406.17294
• Published
• 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper
• 2406.19389
• Published
• 54
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper
• 2406.20076
• Published
• 10
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Paper
• 2407.02869
• Published
• 21
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published
• 54
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Paper
• 2407.04051
• Published
• 40
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Paper
• 2407.06135
• Published
• 22
Vision language models are blind
Paper
• 2407.06581
• Published
• 85
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper
• 2407.07895
• Published
• 42
SEED-Story: Multimodal Long Story Generation with Large Language Model
Paper
• 2407.08683
• Published
• 24
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
• 2505.23762
• Published
• 45