zzfive's Collections

multimodal

iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 17

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55

An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 90

Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 34

Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper • 2405.18669 • Published • 12

MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper • 2405.20340 • Published • 20

Parrot: Multilingual Visual Instruction Tuning
Paper • 2406.02539 • Published • 37

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
Paper • 2406.02884 • Published • 19

What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • Published • 41

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Paper • 2406.07476 • Published • 37

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Paper • 2406.08407 • Published • 28

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
Paper • 2406.07686 • Published • 17

OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 41

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • Published • 23

Explore the Limits of Omni-modal Pretraining at Scale
Paper • 2406.09412 • Published • 11

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper • 2406.09406 • Published • 15

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Paper • 2406.09961 • Published • 55

Needle In A Multimodal Haystack
Paper • 2406.07230 • Published • 54

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper • 2406.08418 • Published • 31

mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Paper • 2406.11839 • Published • 39

LLaNA: Large Language and NeRF Assistant
Paper • 2406.11840 • Published • 18

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper • 2406.14544 • Published • 35

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Paper • 2406.13923 • Published • 24

Improving Visual Commonsense in Language Models via Multiple Image Generation
Paper • 2406.13621 • Published • 13

Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
Paper • 2406.11403 • Published • 4

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Paper • 2406.16758 • Published • 20

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper • 2406.16860 • Published • 62

Long Context Transfer from Language to Vision
Paper • 2406.16852 • Published • 33

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper • 2406.15704 • Published • 6

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Paper • 2406.19280 • Published • 63

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper • 2406.17720 • Published • 8

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Paper • 2407.00114 • Published • 13

Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • Published • 24

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320 • Published • 95

TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 24

Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 54

RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
Paper • 2407.05131 • Published • 27

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 26

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
Paper • 2407.03958 • Published • 22

HEMM: Holistic Evaluation of Multimodal Foundation Models
Paper • 2407.03418 • Published • 12

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Paper • 2407.06135 • Published • 23

Vision language models are blind
Paper • 2407.06581 • Published • 84

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Paper • 2407.06189 • Published • 26

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 42

PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 72

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Paper • 2407.11522 • Published • 9

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Paper • 2407.11691 • Published • 15

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Paper • 2407.11895 • Published • 7

Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
Paper • 2407.11784 • Published • 4

E5-V: Universal Embeddings with Multimodal Large Language Models
Paper • 2407.12580 • Published • 41

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Paper • 2407.12772 • Published • 35

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper • 2407.12679 • Published • 8

EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper • 2407.14177 • Published • 45

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 40

VideoGameBunny: Towards vision assistants for video games
Paper • 2407.15295 • Published • 22

MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Paper • 2407.15272 • Published • 10

Visual Haystacks: Answering Harder Questions About Sets of Images
Paper • 2407.13766 • Published • 2

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Paper • 2407.16198 • Published • 13

VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 41

Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Paper • 2407.18121 • Published • 17

Wolf: Captioning Everything with a World Summarization Framework
Paper • 2407.18908 • Published • 32

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper • 2407.21770 • Published • 22

OmniParser for Pure Vision Based GUI Agent
Paper • 2408.00203 • Published • 25

MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 89

Language Model Can Listen While Speaking
Paper • 2408.02622 • Published • 42

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Paper • 2408.02210 • Published • 9

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Paper • 2408.02718 • Published • 62

VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 50

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Paper • 2408.06327 • Published • 17

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 100

LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 52

Segment Anything with Multiple Modalities
Paper • 2408.09085 • Published • 22

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 63

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper • 2408.11817 • Published • 9

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 51

SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
Paper • 2408.12114 • Published • 14

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Paper • 2408.11813 • Published • 12

Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 133

CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 57

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper • 2408.15881 • Published • 21

Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 95

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Paper • 2408.17267 • Published • 23

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
Paper • 2408.16176 • Published • 8

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 52

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper • 2409.01071 • Published • 27

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 54

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Paper • 2409.02813 • Published • 31

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Paper • 2409.03420 • Published • 26

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper • 2409.05840 • Published • 49

POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper • 2409.04828 • Published • 24

LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper • 2409.06666 • Published • 59

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Paper • 2409.09269 • Published • 9

NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 74

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 78

Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning
Paper • 2409.12001 • Published • 5

MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper • 2409.16280 • Published • 18

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 121

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Paper • 2409.18042 • Published • 40

Emu3: Next-Token Prediction is All You Need
Paper • 2409.18869 • Published • 95

MIO: A Foundation Model on Multimodal Tokens
Paper • 2409.17692 • Published • 53

UniMuMo: Unified Text, Music and Motion Generation
Paper • 2410.04534 • Published • 19

NL-Eye: Abductive NLI for Images
Paper • 2410.02613 • Published • 23

Paper • 2410.07073 • Published • 68

Personalized Visual Instruction Tuning
Paper • 2410.07113 • Published • 70

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Paper • 2410.07167 • Published • 39

Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper • 2410.05993 • Published • 111

Multimodal Situational Safety
Paper • 2410.06172 • Published • 12

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Paper • 2410.02740 • Published • 54

Video Instruction Tuning With Synthetic Data
Paper • 2410.02713 • Published • 39

LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper • 2410.02712 • Published • 37

Distilling an End-to-End Voice Assistant Without Instruction Training Data
Paper • 2410.02678 • Published • 23

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
Paper • 2410.03450 • Published • 36

Baichuan-Omni Technical Report
Paper • 2410.08565 • Published • 87

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Paper • 2410.06456 • Published • 37

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Paper • 2410.09732 • Published • 55

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Paper • 2410.10139 • Published • 52

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Paper • 2410.10563 • Published • 38

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Paper • 2410.10594 • Published • 29

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Paper • 2410.10818 • Published • 17

TVBench: Redesigning Video-Language Evaluation
Paper • 2410.07752 • Published • 6

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Paper • 2410.12787 • Published • 31

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Paper • 2410.13085 • Published • 24

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Paper • 2410.13848 • Published • 34

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
Paper • 2410.13754 • Published • 75

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Paper • 2410.12705 • Published • 32

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
Paper • 2410.13360 • Published • 9

γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Paper • 2410.13859 • Published • 8

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Paper • 2410.14669 • Published • 39

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Paper • 2410.11190 • Published • 22

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper • 2410.13861 • Published • 56

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper • 2410.16153 • Published • 44

Improve Vision Language Model Chain-of-thought Reasoning
Paper • 2410.16198 • Published • 26

Mitigating Object Hallucination via Concentric Causal Attention
Paper • 2410.15926 • Published • 18

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Paper • 2410.17637 • Published • 36

Can Knowledge Editing Really Correct Hallucinations?
Paper • 2410.16251 • Published • 55

Unbounded: A Generative Infinite Game of Character Life Simulation
Paper • 2410.18975 • Published • 37

WAFFLE: Multi-Modal Model for Automated Front-End Development
Paper • 2410.18362 • Published • 13

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Paper • 2410.17856 • Published • 51

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Paper • 2410.18558 • Published • 19

Paper • 2410.21276 • Published • 87

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Paper • 2410.21220 • Published • 11

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Paper • 2410.19100 • Published • 6

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 49

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Paper • 2410.23266 • Published • 20

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Paper • 2411.00836 • Published • 15

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Paper • 2411.02327 • Published • 11

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 51

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Paper • 2411.04923 • Published • 24

Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 24

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Paper • 2411.06176 • Published • 45

LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 129

Generative World Explorer
Paper • 2411.11844 • Published • 77

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Paper • 2411.10640 • Published • 46

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Paper • 2411.10669 • Published • 10

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Paper • 2411.13281 • Published • 21

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 86

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 25

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Paper • 2411.14982 • Published • 19

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Paper • 2411.14522 • Published • 39

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 90

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper • 2411.15296 • Published • 21

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Paper • 2411.17686 • Published • 20

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Paper • 2411.17451 • Published • 11

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Paper • 2411.15411 • Published • 8

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Paper • 2411.18363 • Published • 10

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Paper • 2411.17991 • Published • 5

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper • 2411.18203 • Published • 41

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
Paper • 2411.17176 • Published • 23

On Domain-Specific Post-Training for Multimodal Large Language Models
Paper • 2411.19930 • Published • 29

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Paper • 2412.01824 • Published • 64

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
Paper • 2412.01800 • Published • 6

OmniCreator: Self-Supervised Unified Generation with Universal Editing
Paper • 2412.02114 • Published • 14

PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 133

VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
Paper • 2412.02186 • Published • 22

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Paper • 2412.03565 • Published • 11

VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 118

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper • 2412.04424 • Published • 63

NVILA: Efficient Frontier Visual Language Models
Paper • 2412.04468 • Published • 59

Personalized Multimodal Large Language Models: A Survey
Paper • 2412.02142 • Published • 14

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper • 2412.01169 • Published • 13

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper • 2412.04449 • Published • 7

CompCap: Improving Multimodal Large Language Models with Composite Captions
Paper • 2412.05243 • Published • 20

Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper • 2412.07112 • Published • 28

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Paper • 2412.06673 • Published • 11

POINTS1.5: Building a Vision-Language Model towards Real World Applications
Paper • 2412.08443 • Published • 38

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Paper • 2412.09596 • Published • 98

Multimodal Latent Language Modeling with Next-Token Diffusion
Paper • 2412.08635 • Published • 48

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper • 2412.09501 • Published • 48

Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 147

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Paper • 2412.09604 • Published • 38

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
Paper • 2412.10704 • Published • 16

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Paper • 2412.14171 • Published • 24

Progressive Multimodal Reasoning via Active Retrieval
Paper • 2412.14835 • Published • 73

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper • 2412.14475 • Published • 55

Diving into Self-Evolving Training for Multimodal Reasoning
Paper • 2412.17451 • Published • 43

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 39

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Paper • 2412.18619 • Published • 58

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Paper • 2412.19326 • Published • 18

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper • 2501.01957 • Published • 47

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Paper • 2501.01904 • Published • 33

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Paper • 2501.03218 • Published • 36

Cosmos World Foundation Model Platform for Physical AI
Paper • 2501.03575 • Published • 81

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Paper • 2501.03895 • Published • 52

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Paper • 2501.04561 • Published • 16

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 27

VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper • 2501.05874 • Published • 75

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper • 2501.06186 • Published • 65

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Paper • 2501.05510 • Published • 43

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Paper • 2501.05452 • Published • 15

Infecting Generative AI With Viruses
Paper • 2501.05542 • Published • 13

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Paper • 2501.05767 • Published • 29

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper • 2501.06282 • Published • 53

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
Paper • 2501.07888 • Published • 16

Do generative video models learn physical principles from watching videos?
Paper • 2501.09038 • Published • 34

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Paper • 2501.12368 • Published • 45

MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Paper • 2501.10057 • Published • 10

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Paper • 2501.13106 • Published • 90

Temporal Preference Optimization for Long-Form Video Understanding
Paper • 2501.13919 • Published • 23

Baichuan-Omni-1.5 Technical Report
Paper • 2501.15368 • Published • 62

Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Paper • 2501.16295 • Published • 8

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Paper • 2502.01341 • Published • 39

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
Paper • 2502.04328 • Published • 30

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Paper • 2502.06788 • Published • 13

Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Paper • 2502.07617 • Published • 29

Magma: A Foundation Model for Multimodal AI Agents
Paper • 2502.13130 • Published • 58

Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 208

Slamming: Training a Speech Language Model on One GPU in a Day
Paper • 2502.15814 • Published • 69

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper • 2502.18411 • Published • 74

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Paper • 2502.17422 • Published • 7

Introducing Visual Perception Token into Multimodal Large Language Model
Paper • 2502.17425 • Published • 16

Token-Efficient Long Video Understanding for Multimodal LLMs
Paper • 2503.04130 • Published • 96

Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 123

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Paper • 2503.08625 • Published • 27

OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
Paper • 2503.08686 • Published • 19

Aligning Multimodal LLM with Human Preference: A Survey
Paper • 2503.14504 • Published • 26

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Paper • 2503.13111 • Published • 7

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Paper • 2503.16365 • Published • 40

Judge Anything: MLLM as a Judge Across Any Modality
Paper • 2503.17489 • Published • 23

CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper • 2503.18931 • Published • 30

Video-R1: Reinforcing Video Reasoning in MLLMs
Paper • 2503.21776 • Published • 79

PAVE: Patching and Adapting Video Large Language Models
Paper • 2503.19794 • Published • 3

SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200

Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper • 2504.05599 • Published • 85

OmniCaptioner: One Captioner to Rule Them All
Paper • 2504.07089 • Published • 20

Paper • 2504.07491 • Published • 132

MM-IFEngine: Towards Multimodal Instruction Following
Paper • 2504.07957 • Published • 35

Scaling Laws for Native Multimodal Models
Paper • 2504.07951 • Published • 29

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper • 2504.10479 • Published • 300

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Paper • 2504.09925 • Published • 38

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Paper • 2504.10068 • Published • 30

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper • 2504.10465 • Published • 27

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Paper • 2504.10462 • Published • 15

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Paper • 2504.15271 • Published • 66

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper • 2504.16030 • Published • 37

X-Fusion: Introducing New Modality to Frozen Large Language Models
Paper • 2504.20996 • Published • 13

LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Paper • 2505.02625 • Published • 22

Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Paper • 2505.02471 • Published • 15

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Paper • 2505.02567 • Published • 80

Seed1.5-VL Technical Report
Paper • 2505.07062 • Published • 152

Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper • 2505.08751 • Published • 12

MMaDA: Multimodal Large Diffusion Language Models
Paper • 2505.15809 • Published • 96

Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Paper • 2505.19084 • Published • 20

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Paper • 2505.20256 • Published • 18

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Paper • 2505.23606 • Published • 14

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Paper • 2506.03147 • Published • 58

Is Extending Modality The Right Path Towards Omni-Modality?
Paper • 2506.01872 • Published • 23

Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Paper • 2506.13642 • Published • 26

Show-o2: Improved Native Unified Multimodal Models
Paper • 2506.15564 • Published • 28

OmniGen2: Exploration to Advanced Multimodal Generation
Paper • 2506.18871 • Published • 77

Paper • 2506.23044 • Published • 62

Kwai Keye-VL Technical Report
Paper • 2507.01949 • Published • 130

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Paper • 2506.23918 • Published • 88

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Paper • 2507.13348 • Published • 75

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
Paper • 2507.12566 • Published • 14

Pixels, Patterns, but No Poetry: To See The World like Humans
Paper • 2507.16863 • Published • 68

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Paper • 2507.20939 • Published • 56

Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
Paper • 2507.19427 • Published • 18

Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Paper • 2507.23779 • Published • 44

Multimodal Referring Segmentation: A Survey
Paper • 2508.00265 • Published • 8

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Paper • 2508.02317 • Published • 19

A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Paper • 2508.01548 • Published • 13

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Paper • 2508.05954 • Published • 6

Paper • 2508.11737 • Published • 110

Intern-S1: A Scientific Multimodal Foundation Model
Paper • 2508.15763 • Published • 255

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Paper • 2508.18265 • Published • 202

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Paper • 2508.18264 • Published • 25

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Paper • 2508.21113 • Published • 109

Kwai Keye-VL 1.5 Technical Report
Paper • 2509.01563 • Published • 36

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Paper • 2509.16197 • Published • 54

Qwen3-Omni Technical Report
Paper • 2509.17765 • Published • 133

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Paper • 2509.18154 • Published • 49

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Paper • 2509.18905 • Published • 28

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Paper • 2510.13795 • Published • 50

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
Paper • 2510.13747 • Published • 29

ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
Paper • 2510.12793 • Published • 3

InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Paper • 2510.11341 • Published • 33

AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
Paper • 2510.11496 • Published • 3

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Paper • 2510.08492 • Published • 8

DeepSeek-OCR: Contexts Optical Compression
Paper • 2510.18234 • Published • 70

Glyph: Scaling Context Windows via Visual-Text Compression
Paper • 2510.17800 • Published • 64