Trending Papers

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Published on Dec 28, 2024

77

GitHub 77.9k arXiv Page

Submitted by

Yuyang-z

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

SANA-Video, a small diffusion model, efficiently generates high-resolution, high-quality videos with strong text-video alignment using linear attention and a constant-memory KV cache, achieving competitive performance at a lower cost and faster speed.

NVIDIA · Published on Sep 29, 2025

53

GitHub 7.19k arXiv Page

Submitted by

CoreloneH

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance is a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos through collaborative multi-task training and a dual-stream architecture.

bytedance-research · Published on May 18, 2026

68

GitHub 595 arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Published on Sep 26, 2025

162

GitHub 64.3k arXiv Page

AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

AI-Trader presents the first fully automated live benchmark for evaluating large language models in financial decision-making across multiple markets with autonomous information processing.

6 authors · Published on Dec 1, 2025

8

GitHub 18.3k arXiv Page

Submitted by

RuofengYang

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS is an open-source research harness that uses cross-model adversarial collaboration to ensure reliable long-term research outcomes through coordinated execution, orchestration, and assurance layers.

Shanghai Jiao Tong University · Published on May 4, 2026

119

GitHub 10.3k arXiv Page

Submitted by

Chaojian

TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

TideGS enables training 3D Gaussian Splatting with over one billion primitives on a single GPU by managing parameters across SSD-CPU-GPU hierarchy through block-virtualization, asynchronous pipeline, and differential streaming techniques.

Sponge Computing Lab at HKUST · Published on May 19, 2026

6

GitHub 100 arXiv Page

EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

EverMemOS presents a self-organizing memory system for large language models that processes dialogue streams into structured memory cells and scenes to enhance long-term interaction capabilities.

11 authors · Published on Jan 5, 2026

4

GitHub 5.36k arXiv Page

Submitted by

akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

24 authors · Published on Jul 23, 2024

78

GitHub 74.3k arXiv Page

Submitted by

thuzhaowang

Pixal3D: Pixel-Aligned 3D Generation from Images

Pixal3D introduces a pixel-aligned 3D generation approach that addresses fidelity issues in 3D asset creation by establishing direct pixel-to-3D correspondences through back-projection conditioning.

ARC Lab, Tencent PCG · Published on May 11, 2026

30

GitHub 1.25k arXiv Page

Submitted by

AaronHuangWei

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

LongLive-2.0 presents an NVFP4-based parallel infrastructure for long video generation that addresses training and inference bottlenecks through sequence-parallel autoregressive training and diffusion model tuning.

NVIDIA · Published on May 18, 2026

103

GitHub 1.44k arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Published on Apr 28, 2025

GitHub 56.3k arXiv Page

Submitted by

taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Published on Nov 20, 2025

137

GitHub 9.91k arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

The PagedAttention algorithm and the vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

9 authors · Published on Sep 12, 2023

55

GitHub 80.6k arXiv Page

Kronos: A Foundation Model for the Language of Financial Markets

Kronos, a specialized pre-training framework for financial K-line data, outperforms existing models in forecasting and synthetic data generation through a unique tokenizer and autoregressive pre-training on a large dataset.

7 authors · Published on Aug 2, 2025

Submitted by

liangjiaqing

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

GenericAgent is a self-evolving large language model agent system that maximizes context information density through hierarchical memory, reusable SOPs, and efficient compression to overcome long-horizon limitations.

Fudan University · Published on Apr 18, 2026

20

GitHub 11.9k arXiv Page

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Published on Oct 23, 2024

13

GitHub 58.5k arXiv Page

Submitted by

zhangkangning

MMSkills: Towards Multimodal Skills for General Visual Agents

Multimodal procedural knowledge frameworks enable visual agents to leverage external reusable skills through structured representations combining text, state cards, and visual keyframes, improving decision-making in complex environments.

Shanghai Jiaotong University 1(NOT OFFICIAL) · Published on May 14, 2026

116

GitHub 129 arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Published on Oct 16, 2025

126

GitHub 78.3k arXiv Page

Submitted by

akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

8 authors · Published on Jul 25, 2024

42

Submitted by

Paranioar

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Unified vision-language models treat understanding and generation as integrated processes rather than separate tasks, demonstrating strong performance across multiple multimodal capabilities including image synthesis and action reasoning.

SenseNova · Published on May 12, 2026

184

GitHub 2.12k arXiv Page

Submitted by

taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors · Published on Aug 22, 2025

63

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

IBM Granite · Published on Mar 14, 2025

158

GitHub 60.1k arXiv Page

Submitted by

taesiri

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Uni-Edit introduces an intelligent image editing task that simultaneously enhances unified multimodal models' understanding, generation, and editing capabilities through a single training stage and dataset, utilizing an automated data synthesis pipeline for complex editing instructions.

7 authors · Published on May 20, 2026

17

GitHub 23 arXiv Page

Submitted by

shawnxzhu

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory automates the creation of executable tool environments and natural multi-turn trajectories for training LLMs with agentic reinforcement learning, achieving superior performance with fewer resources.

LARK Lab@HKUST (GZ) · Published on May 18, 2026

44

GitHub 48 arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Published on Oct 8, 2024

39

GitHub 35.5k arXiv Page

Submitted by

ldkong

AI for Auto-Research: Roadmap & User Guide

AI systems demonstrate varying reliability across research stages, excelling in structured tasks but struggling with novel ideas and scientific judgment, necessitating human oversight for credible outcomes.

20 authors · Published on May 18, 2026

GitHub 94 arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT on the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, both critical for enterprise use cases.

5 authors · Published on Jan 20, 2025

10

GitHub 26.3k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Data Intelligence Lab@HKU · Published on Oct 14, 2025

82

GitHub 20.5k arXiv Page

Submitted by

taesiri

Code as Agent Harness

Large language models are increasingly used as operational substrates for agent reasoning and execution in agentic systems, with code serving as a unified infrastructure layer across multiple domains and applications.

42 authors · Published on May 18, 2026

GitHub 157 arXiv Page

Submitted by

sinwang

World Action Models: The Next Frontier in Embodied AI

World Action Models unify predictive state modeling with action generation for embodied policy learning, forming a cohesive framework for understanding environment dynamics and action prediction.

OpenMOSS · Published on May 12, 2026

64

GitHub 410 arXiv Page

Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

An adaptive chunking framework uses intrinsic document metrics to select optimal segmentation strategies for retrieval-augmented generation, significantly improving answer correctness and question resolution rates.

Ekimetrics · Published on Mar 26, 2026

3

GitHub 78 arXiv Page

Submitted by

taesiri

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

MiniCPM-V 4.5, an 8B-parameter multimodal large language model, achieves high performance and efficiency through a unified 3D-Resampler architecture, a unified learning paradigm, and a hybrid reinforcement learning strategy.

34 authors · Published on Sep 16, 2025

Submitted by

chengtan9907

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

Visual typesetting optimization addresses the gap between compilable LaTeX documents and publication-ready PDFs through vision-in-the-loop agents that iteratively diagnose and repair layout defects.

OpenDataLab · Published on May 11, 2026

32

GitHub 245 arXiv Page

Submitted by

Yirany

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

MiniCPM-o 4.5 enables real-time full-duplex multimodal interaction through Omni-Flow, a unified streaming framework that aligns inputs and outputs temporally for simultaneous perception and response.

OpenBMB · Published on Apr 30, 2026

71

Submitted by

zbhpku

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

DataFlow is an LLM-driven data preparation framework that enhances data quality and reproducibility for various tasks, improving LLM performance with automatically generated pipelines.

Peking University · Published on Dec 18, 2025

222

GitHub 3.83k arXiv Page

AutoDev: Automated AI-Driven Development

AutoDev is an AI-driven software development framework that automates complex engineering tasks within a secure Docker environment, achieving high performance in code and test generation.

5 authors · Published on Mar 13, 2024

15

GitHub 17.1k arXiv Page

Submitted by

DukeShen

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

FashionChameleon enables real-time interactive multi-garment video customization through teacher-student distillation and in-context learning techniques while maintaining motion coherence.

alibaba-inc · Published on May 15, 2026

GitHub 89 arXiv Page

Submitted by

joshuagu15

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

PEEK enables large language model agents to efficiently reuse orientation knowledge about recurring external contexts through a persistent context map that reduces computational costs and improves performance.

Massachusetts Institute of Technology · Published on May 19, 2026

5

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors · Published on Nov 17, 2025

28

GitHub 21.6k arXiv Page

Submitted by

taesiri

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

AnyFlow introduces a novel any-step video diffusion distillation framework that improves upon consistency distillation by optimizing full ODE sampling trajectories through flow-map transition learning and backward simulation techniques.

NVIDIA · Published on May 13, 2026

96

GitHub 307 arXiv Page

Submitted by

zunhai

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

OScaR is a novel KV cache compression framework that addresses token norm imbalance through canalized rotation and omni-token scaling, achieving significant improvements in memory efficiency and decoding speed for extended context language models.

The University of Hong Kong · Published on May 19, 2026

GitHub 20 arXiv Page

Submitted by

haizhongzheng

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

AstraFlow is a dataflow-oriented reinforcement learning system that enables efficient multi-policy collaborative training and elastic scaling across diverse compute resources for large language model agents.

Carnegie Mellon University · Published on May 15, 2026

14

GitHub 47 arXiv Page

Submitted by

nielsr

Geometric Context Transformer for Streaming 3D Reconstruction

LingBot-Map is a feed-forward 3D foundation model that reconstructs scenes from video streams using a geometric context transformer architecture with specialized attention mechanisms for coordinate grounding, dense geometric cues, and long-range drift correction, achieving stable real-time performance at 20 FPS.

Robbyant · Published on Apr 15, 2026

21

GitHub 6.52k arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors

· Published on Feb 7, 2025

18

GitHub 68.2k arXiv Page

Submitted by

unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

Microsoft Research · Published on Aug 26, 2025

170

GitHub 47.3k arXiv Page

Submitted by

Two-hot

Semantic Generative Tuning for Unified Multimodal Models

Generative post-training with semantic segmentation as a proxy aligns visual understanding and generation in unified multimodal models, improving both perception and generative fidelity.

Shanghai Jiao Tong University · Published on May 18, 2026

9

GitHub 59 arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors

· Published on Mar 20, 2024

GitHub 71.5k arXiv Page

Submitted by

chiennv

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity through shared KV caches and consensus mechanisms.

6 authors

· Published on May 12, 2026

12

GitHub 340 arXiv Page

Submitted by

LCZZZZ

Auditing Agent Harness Safety

LLM agents running inside execution harnesses can produce correct outputs while violating safety constraints along the way, so trajectory-level auditing is needed to verify resource access and information flow across multi-agent systems.

UC Santa Barbara NLP Group · Published on May 14, 2026

52
