MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations Paper • 2505.14101 • Published May 20, 2025 • 3
Scaling Reasoning can Improve Factuality in Large Language Models Paper • 2505.11140 • Published May 16, 2025 • 7
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation Paper • 2504.07072 • Published Apr 9, 2025 • 9
How Do Hackathons Foster Creativity? Towards AI Collaborative Evaluation of Creativity at Scale Paper • 2503.04290 • Published Mar 6, 2025 • 1
HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings Paper • 2502.15411 • Published Feb 21, 2025 • 2