Multi-Docker-Eval: A 'Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering Paper • 2512.06915 • Published Dec 7, 2025 • 12
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads Paper • 2511.06209 • Published Nov 9, 2025 • 18
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 174
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR Paper • 2508.14029 • Published Aug 19, 2025 • 118
Pixels, Patterns, but No Poetry: To See The World like Humans Paper • 2507.16863 • Published Jul 21, 2025 • 68
Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning Paper • 2502.11962 • Published Feb 17, 2025 • 38
SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning Paper • 2506.08989 • Published Jun 10, 2025 • 14
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper • 2506.02096 • Published Jun 2, 2025 • 52
Sherlock: Self-Correcting Reasoning in Vision-Language Models Paper • 2505.22651 • Published May 28, 2025 • 48
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26, 2025 • 13
Backdoor Cleaning without External Guidance in MLLM Fine-tuning Paper • 2505.16916 • Published May 22, 2025 • 17
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Paper • 2505.11049 • Published May 16, 2025 • 60
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Paper • 2505.10554 • Published May 15, 2025 • 120
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper • 2504.13055 • Published Apr 17, 2025 • 19