Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions • arXiv 2412.08737 • Published Dec 11, 2024 • 55 upvotes
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer • arXiv 2412.07720 • Published Dec 10, 2024 • 32 upvotes
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems • arXiv 2402.14008 • Published Feb 21, 2024
GUICourse: From General Vision Language Models to Versatile GUI Agents • arXiv 2406.11317 • Published Jun 17, 2024 • 1 upvote
Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation • arXiv 2207.06130 • Published Jul 13, 2022
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages • arXiv 2308.12038 • Published Aug 23, 2023 • 2 upvotes
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants • arXiv 2310.00653 • Published Oct 1, 2023 • 3 upvotes
Exploring Perceptual Limitation of Multimodal Large Language Models • arXiv 2402.07384 • Published Feb 12, 2024 • 1 upvote
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback • arXiv 2312.00849 • Published Dec 1, 2023 • 12 upvotes