- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning (Paper • 2502.11573 • Published • 8)
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking (Paper • 2502.02339 • Published • 22)
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model (Paper • 2502.11775 • Published • 9)
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search (Paper • 2412.18319 • Published • 40)
Collections including paper arxiv:2504.07491

- Chat with Kimi-VL-A3B-Thinking-2506: 🤔 Chat with images, videos, or PDFs to generate text (162)
- moonshotai/Kimi-VL-A3B-Thinking-2506 (Image-Text-to-Text • 16B • Updated • 42.1k • 251)
- moonshotai/Kimi-VL-A3B-Instruct (Image-Text-to-Text • 16B • Updated • 193k • 226)
- moonshotai/Kimi-VL-A3B-Thinking (Image-Text-to-Text • 16B • Updated • 97.6k • 433)

- MoBA: Mixture of Block Attention for Long-Context LLMs (Paper • 2502.13189 • Published • 17)
- Kimi-Audio Technical Report (Paper • 2504.18425 • Published • 19)
- Kimi-VL Technical Report (Paper • 2504.07491 • Published • 133)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (Paper • 2501.12599 • Published • 123)

- Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models (Paper • 2506.05176 • Published • 68)
- Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning (Paper • 2506.04207 • Published • 46)
- MiMo-VL Technical Report (Paper • 2506.03569 • Published • 76)
- UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation (Paper • 2506.03147 • Published • 58)

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (Paper • 2402.04252 • Published • 29)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (Paper • 2402.03749 • Published • 13)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Paper • 2402.04615 • Published • 44)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 23)

- DocLLM: A layout-aware generative language model for multimodal document understanding (Paper • 2401.00908 • Published • 189)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training (Paper • 2401.00849 • Published • 17)
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents (Paper • 2311.05437 • Published • 51)
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing (Paper • 2311.00571 • Published • 43)

- Reinforcement Pre-Training (Paper • 2506.08007 • Published • 253)
- A Survey on Latent Reasoning (Paper • 2507.06203 • Published • 85)
- Language Models are Few-Shot Learners (Paper • 2005.14165 • Published • 16)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Paper • 1910.10683 • Published • 14)