btjhjeon's Collections

Multimodal LLM

DocLLM: A layout-aware generative language model for multimodal document understanding • Paper 2401.00908 • 189
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training • Paper 2401.00849 • 17
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents • Paper 2311.05437 • 51
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing • Paper 2311.00571 • 43
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model • Paper 2401.02330 • 18
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action • Paper 2312.17172 • 30
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks • Paper 2206.08916 • 1
ImageBind: One Embedding Space To Bind Them All • Paper 2305.05665 • 6
Distilling Vision-Language Models on Millions of Videos • Paper 2401.06129 • 17
LEGO: Language Enhanced Multi-modal Grounding Model • Paper 2401.06071 • 12
Improving fine-grained understanding in image-text pre-training • Paper 2401.09865 • 18
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models • Paper 2402.05935 • 17
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling • Paper 2402.06118 • 15
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models • Paper 2402.07865 • 15
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models • Paper 2402.13577 • 9
A Touch, Vision, and Language Dataset for Multimodal Alignment • Paper 2402.13232 • 16
TinyLLaVA: A Framework of Small-scale Large Multimodal Models • Paper 2402.14289 • 21
Enhancing Vision-Language Pre-training with Rich Supervisions • Paper 2403.03346 • 17
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images • Paper 2403.11703 • 17
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models • Paper 2403.13447 • 19
When Do We Not Need Larger Vision Models? • Paper 2403.13043 • 26
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference • Paper 2403.14520 • 35
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? • Paper 2403.14624 • 53
MoAI: Mixture of All Intelligence for Large Language and Vision Models • Paper 2403.07508 • 77
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • Paper 2403.18814 • 47
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD • Paper 2404.06512 • 30
OmniFusion Technical Report • Paper 2404.06212 • 77
BLINK: Multimodal Large Language Models Can See but Not Perceive • Paper 2404.12390 • 26
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models • Paper 2404.12387 • 39
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation • Paper 2404.14396 • 19
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites • Paper 2404.16821 • 57
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension • Paper 2404.16790 • 10
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning • Paper 2404.16994 • 36
What matters when building vision-language models? • Paper 2405.02246 • 103
An Introduction to Vision-Language Modeling • Paper 2405.17247 • 90
Matryoshka Multimodal Models • Paper 2405.17430 • 34
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models • Paper 2405.15738 • 46
Needle In A Multimodal Haystack • Paper 2406.07230 • 54
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation • Paper 2406.09961 • 55
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text • Paper 2406.08418 • 31
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models • Paper 2406.09403 • 23
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus • Paper 2406.08707 • 17
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark • Paper 2406.05967 • 6
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs • Paper 2406.11833 • 63
mDPO: Conditional Preference Optimization for Multimodal Large Language Models • Paper 2406.11839 • 39
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens • Paper 2406.11271 • 21
TokenPacker: Efficient Visual Projector for Multimodal LLM • Paper 2407.02392 • 24
Understanding Alignment in Multimodal LLMs: A Comprehensive Study • Paper 2407.02477 • 24
Vision language models are blind • Paper 2407.06581 • 84
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs • Paper 2406.16860 • 62
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation • Paper 2407.06135 • 23
MAVIS: Mathematical Visual Instruction Tuning • Paper 2407.08739 • 33
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model • Paper 2407.07053 • 47
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models • Paper 2407.07895 • 42
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers • Paper 2407.09413 • 11
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model • Paper 2407.16198 • 13
VILA^2: VILA Augmented VILA • Paper 2407.17453 • 41
MiniCPM-V: A GPT-4V Level MLLM on Your Phone • Paper 2408.01800 • 89
VITA: Towards Open-Source Interactive Omni Multimodal LLM • Paper 2408.05211 • 50
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models • Paper 2408.04840 • 34
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models • Paper 2408.08872 • 100
LongVILA: Scaling Long-Context Visual Language Models for Long Videos • Paper 2408.10188 • 52
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation • Paper 2408.12528 • 51
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications • Paper 2408.11878 • 63
Building and better understanding vision-language models: insights and future directions • Paper 2408.12637 • 133
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders • Paper 2408.15998 • 87
CogVLM2: Visual Language Models for Image and Video Understanding • Paper 2408.16500 • 57
Law of Vision Representation in MLLMs • Paper 2408.16357 • 95
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation • Paper 2408.15881 • 21
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture • Paper 2409.02889 • 54
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding • Paper 2409.03420 • 26
NVLM: Open Frontier-Class Multimodal LLMs • Paper 2409.11402 • 74
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper 2409.12191 • 78
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution • Paper 2409.12961 • 25
Phantom of Latent for Large Language and Vision Models • Paper 2409.14713 • 29
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models • Paper 2409.17146 • 121
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness • Paper 2409.18125 • 34
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning • Paper 2409.20566 • 56
MIO: A Foundation Model on Multimodal Tokens • Paper 2409.17692 • 53
Emu3: Next-Token Prediction is All You Need • Paper 2409.18869 • 95
LLaVA-Critic: Learning to Evaluate Multimodal Models • Paper 2410.02712 • 37
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks • Paper 2410.01744 • 26
TLDR: Token-Level Detective Reward Model for Large Vision Language Models • Paper 2410.04734 • 17
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation • Paper 2410.11779 • 26
Baichuan-Omni Technical Report • Paper 2410.08565 • 87
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning • Paper 2410.06456 • 37
Aria: An Open Multimodal Native Mixture-of-Experts Model • Paper 2410.05993 • 111
Paper 2410.07073 • 68
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation • Paper 2410.13848 • 34
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction • Paper 2410.17247 • 47
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation • Paper 2410.13861 • 56
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages • Paper 2410.16153 • 44
DM-Codec: Distilling Multimodal Representations for Speech Tokenization • Paper 2410.15017 • 2
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities • Paper 2410.11190 • 22
Distill Visual Chart Reasoning Ability from LLMs to MLLMs • Paper 2410.18798 • 21
WAFFLE: Multi-Modal Model for Automated Front-End Development • Paper 2410.18362 • 13
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning • Paper 2410.17779 • 9
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding • Paper 2410.17434 • 29
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction • Paper 2410.21169 • 30
Paper 2410.21276 • 87
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data • Paper 2410.18558 • 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents • Paper 2410.23218 • 49
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos • Paper 2411.04923 • 24
LLaVA-o1: Let Vision Language Models Reason Step-by-Step • Paper 2411.10440 • 129
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts • Paper 2411.10669 • 10
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices • Paper 2411.10640 • 46
ShowUI: One Vision-Language-Action Model for GUI Visual Agent • Paper 2411.17465 • 90
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format • Paper 2411.17991 • 5
Ovis: Structural Embedding Alignment for Multimodal Large Language Model • Paper 2405.20797 • 30
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models • Paper 2412.01824 • 64
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation • Paper 2412.00927 • 29
On Domain-Specific Post-Training for Multimodal Large Language Models • Paper 2411.19930 • 29
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences • Paper 2412.01292 • 13
PaliGemma 2: A Family of Versatile VLMs for Transfer • Paper 2412.03555 • 133
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation • Paper 2412.03069 • 35
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding • Paper 2412.00493 • 17
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models • Paper 2411.19103 • 21
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling • Paper 2412.05271 • 159
NVILA: Efficient Frontier Visual Language Models • Paper 2412.04468 • 59
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion • Paper 2412.04424 • 63
POINTS1.5: Building a Vision-Language Model towards Real World Applications • Paper 2412.08443 • 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • Paper 2412.08737 • 54
Apollo: An Exploration of Video Understanding in Large Multimodal Models • Paper 2412.10360 • 147
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities • Paper 2412.07769 • 29
Multimodal Latent Language Modeling with Next-Token Diffusion • Paper 2412.08635 • 48
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation • Paper 2412.09585 • 11
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding • Paper 2412.10302 • 18
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding • Paper 2412.09604 • 38
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning • Paper 2412.11974 • 9
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer • Paper 2412.13871 • 18
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception • Paper 2412.14233 • 6
Diving into Self-Evolving Training for Multimodal Reasoning • Paper 2412.17451 • 43
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search • Paper 2412.18319 • 39
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models • Paper 2412.18609 • 18
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation • Paper 2412.18176 • 17
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization • Paper 2412.18525 • 75
On the Compositional Generalization of Multimodal LLMs for Medical Imaging • Paper 2412.20070 • 45
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey • Paper 2412.18619 • 58
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction • Paper 2501.01957 • 47
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM • Paper 2501.01904 • 33
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution • Paper 2501.02976 • 55
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper 2501.03895 • 52
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics • Paper 2501.04686 • 53
An Empirical Study of Autoregressive Pre-training from Videos • Paper 2501.05453 • 41
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs • Paper 2501.06186 • 65
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction • Paper 2501.06282 • 53
A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following • Paper 2501.08187 • 27
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding • Paper 2501.07888 • 16
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding • Paper 2501.07783 • 7
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot • Paper 2501.09012 • 10
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks • Paper 2501.08326 • 33
FAST: Efficient Action Tokenization for Vision-Language-Action Models • Paper 2501.09747 • 27
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks • Paper 2501.11733 • 28
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding • Paper 2501.13106 • 90
Baichuan-Omni-1.5 Technical Report • Paper 2501.15368 • 62
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding • Paper 2502.01341 • 39
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning • Paper 2412.14164 • 4
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment • Paper 2502.04328 • 30
VideoRoPE: What Makes for Good Video Rotary Position Embedding? • Paper 2502.05173 • 65
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation • Paper 2502.05415 • 22
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models • Paper 2502.06788 • 13
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents • Paper 2502.04223 • 11
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation • Paper 2502.12148 • 17
Soundwave: Less is More for Speech-Text Alignment in LLMs • Paper 2502.12900 • 85
Magma: A Foundation Model for Multimodal AI Agents • Paper 2502.13130 • 58
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation • Paper 2502.09838 • 11
Qwen2.5-VL Technical Report • Paper 2502.13923 • 208
UniTok: A Unified Tokenizer for Visual Generation and Understanding • Paper 2502.20321 • 31
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • Paper 2503.01743 • 89
Token-Efficient Long Video Understanding for Multimodal LLMs • Paper 2503.04130 • 96
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities • Paper 2503.03983 • 26
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search • Paper 2503.10582 • 24
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy • Paper 2503.06542 • 7
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills • Paper 2503.12533 • 68
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse • Paper 2503.16365 • 40
CoMP: Continual Multimodal Pre-training for Vision Foundation Models • Paper 2503.18931 • 30
Scaling Vision Pre-Training to 4K Resolution • Paper 2503.19903 • 41
Qwen2.5-Omni Technical Report • Paper 2503.20215 • 166
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources • Paper 2504.00595 • 36
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features • Paper 2504.00557 • 15
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization • Paper 2503.23733 • 10
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement • Paper 2504.01934 • 22
Scaling Analysis of Interleaved Speech-Text Language Models • Paper 2504.02398 • 31
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers • Paper 2504.00502 • 25
Slow-Fast Architecture for Video Multi-Modal Large Language Models • Paper 2504.01328 • 7
SmolVLM: Redefining small and efficient multimodal models • Paper 2504.05299 • 200
Paper 2504.07491 • 132
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models • Paper 2504.10479 • 300
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding • Paper 2504.09925 • 38
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model • Paper 2504.10068 • 30
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models • Paper 2504.15271 • 66
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes • Paper 2504.15270 • 9
Vidi: Large Multimodal Models for Video Understanding and Editing • Paper 2504.15681 • 14
MR. Video: "MapReduce" is the Principle for Long Video Understanding • Paper 2504.16082 • 5
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs • Paper 2504.17432 • 39
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation • Paper 2504.17207 • 30
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs • Paper 2504.17040 • 13
Kimi-Audio Technical Report • Paper 2504.18425 • 19
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention • Paper 2504.16083 • 9
YoChameleon: Personalized Vision and Language Generation • Paper 2504.20998 • 12
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation • Paper 2504.21336 • 4
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play • Paper 2505.02707 • 85
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis • Paper 2505.02625 • 22
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation • Paper 2505.01456 • 2
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model • Paper 2505.03739 • 10
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities • Paper 2505.02567 • 80
On Path to Multimodal Generalist: General-Level and General-Bench • Paper 2505.04620 • 82
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant • Paper 2505.05467 • 14
Seed1.5-VL Technical Report • Paper 2505.07062 • 152
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset • Paper 2505.09568 • 98
Aya Vision: Advancing the Frontier of Multilingual Multimodality • Paper 2505.08751 • 12
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging • Paper 2505.05464 • 11
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? • Paper 2505.09439 • 9
End-to-End Vision Tokenizer Tuning • Paper 2505.10562 • 22
FastVLM: Efficient Vision Encoding for Vision Language Models • Paper 2412.13303 • 70
Emerging Properties in Unified Multimodal Pretraining • Paper 2505.14683 • 134
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design • Paper 2505.16175 • 41
Backdoor Cleaning without External Guidance in MLLM Fine-tuning • Paper 2505.16916 • 16
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models • Paper 2505.17015 • 9
HoliTom: Holistic Token Merging for Fast Video Large Language Models • Paper 2505.21334 • 21
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding • Paper 2505.20715 • 2
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence • Paper 2505.23747 • 68
ZeroGUI: Automating Online GUI Learning at Zero Human Cost • Paper 2505.23762 • 45
VidText: Towards Comprehensive Evaluation for Video Text Understanding • Paper 2505.22810 • 19
TokBench: Evaluating Your Visual Tokenizer before Visual Generation • Paper 2505.18142 • 2
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation • Paper 2505.18842 • 36
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models • Paper 2505.20873 • 9
ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding • Paper 2506.01853 • 32
OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions • Paper 2505.21724 • 5
Aligning VLM Assistants with Personalized Situated Cognition • Paper 2506.00930 • 2
MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling • Paper 2505.15772 • 3
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces • Paper 2506.00123 • 35
Paper 2506.03569 • 80
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs • Paper 2506.05344 • 16
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion • Paper 2506.01111 • 30
Is Extending Modality The Right Path Towards Omni-Modality? • Paper 2506.01872 • 23
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning • Paper 2506.07044 • 113
MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis • Paper 2506.08900 • 3
Ming-Omni: A Unified Multimodal Model for Perception and Generation • Paper 2506.09344 • 28
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model • Paper 2506.13642 • 26
VideoMolmo: Spatio-Temporal Grounding Meets Pointing • Paper 2506.05336 • 9
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models • Paper 2506.15681 • 39
CoMemo: LVLMs Need Image Context with Image Memory • Paper 2506.06279 • 8
MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models • Paper 2506.14435 • 7
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models • Paper 2506.14824 • 7
Show-o2: Improved Native Unified Multimodal Models • Paper 2506.15564 • 28
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation • Paper 2506.17202 • 10
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding • Paper 2506.15745 • 13
OmniGen2: Exploration to Advanced Multimodal Generation • Paper 2506.18871 • 77
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations • Paper 2506.18898 • 33
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation • Paper 2506.18095 • 66
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs • Paper 2506.21862 • 36
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models • Paper 2506.21356 • 22
Paper 2506.23044 • 62
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding • Paper 2506.23219 • 7
Kwai Keye-VL Technical Report • Paper 2507.01949 • 130
μ^2Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation • Paper 2507.00316 • 15
MARVIS: Modality Adaptive Reasoning over VISualizations • Paper 2507.01544 • 13
Scaling RL to Long Videos • Paper 2507.07966 • 157
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs • Paper 2507.07990 • 45
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models • Paper 2507.12566 • 14
Paper 2507.13264 • 29
Pixels, Patterns, but No Poetry: To See The World like Humans • Paper 2507.16863 • 68
Region-based Cluster Discrimination for Visual Representation Learning • Paper 2507.20025 • 19
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again • Paper 2507.22058 • 39
Phi-Ground Tech Report: Advancing Perception in GUI Grounding • Paper 2507.23779 • 44
Qwen-Image Technical Report • Paper 2508.02324 • 259
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo • Paper 2508.02317 • 19
MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs • Paper 2508.05502 • 6
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding • Paper 2507.22025 • 4
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation • Paper 2508.03320 • 61
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation • Paper 2508.03694 • 50
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents • Paper 2508.05954 • 6
Paper 2508.11737 • 110
Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping • Paper 2508.12466 • 8
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent • Paper 2508.05748 • 137
Intern-S1: A Scientific Multimodal Foundation Model • Paper 2508.15763 • 255
LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model • Paper 2508.15418 • 7
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency • Paper 2508.18265 • 202
MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation • Paper 2508.19320 • 28
VibeVoice Technical Report • Paper 2508.19205 • 123
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion • Paper 2509.01215 • 50
Kwai Keye-VL 1.5 Technical Report • Paper 2509.01563 • 36
Visual Programmability: A Guide for Code-as-Thought in Chart Understanding • Paper 2509.09286 • 11
Curia: A Multi-Modal Foundation Model for Radiology • Paper 2509.06830 • 20
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning • Paper 2509.11543 • 47
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer • Paper 2509.16197 • 54
AToken: A Unified Tokenizer for Vision • Paper 2509.14476 • 36
SAIL-VL2 Technical Report • Paper 2509.14033 • 44
Qwen3-Omni Technical Report • Paper 2509.17765 • 133
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe • Paper 2509.18154 • 49
Seedream 4.0: Toward Next-generation Multimodal Image Generation • Paper 2509.20427 • 76
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models • Paper 2509.21760 • 14
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing • Paper 2509.22186 • 127
CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition • Paper 2509.19768 • 4
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech • Paper 2509.25131 • 14
HunyuanImage 3.0 Technical Report • Paper 2509.23951 • 21
Paper 2510.01141 • 113
UniVideo: Unified Understanding, Generation, and Editing for Videos • Paper 2510.08377 • 68
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints • Paper 2510.08565 • 19
InstructX: Towards Unified Visual Editing with MLLM Guidance • Paper 2510.08485 • 16
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer • Paper 2510.06590 • 70
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs • Paper 2510.01954 • 12
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation • Paper 2510.08673 • 121
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model • Paper 2510.14528 • 80
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue • Paper 2510.13747 • 29
Scaling Language-Centric Omnimodal Representation Learning • Paper 2510.11693 • 97
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback • Paper 2510.16888 • 18
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM • Paper 2510.15870 • 86
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale • Paper 2510.14979 • 65
olmOCR 2: Unit Test Rewards for Document OCR • Paper 2510.19817 • 11
DeepSeek-OCR: Contexts Optical Compression • Paper 2510.18234 • 70
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs • Paper 2510.13251 • 12
Emu3.5: Native Multimodal Models are World Learners • Paper 2510.26583 • 94
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence • Paper 2510.23538 • 93
PairUni: Pairwise Training for Unified Multimodal Language Models • Paper 2510.25682 • 13