TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning (Paper 2511.01833, published 11 days ago)
Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark (Paper 2510.02356, published Sep 27)