SWE-Universe: Scale Real-World Verifiable Environments to Millions • Paper • 2602.02361 • Published 4 days ago • 55
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models • Paper • 2306.05179 • Published Jun 8, 2023 • 2