TaxCalcBench: a first-ever benchmark for evaluating AI’s ability to calculate tax returns
AI can’t do your taxes on its own (yet).
- Column Tax is on a mission to apply AI to the tax domain so that every American can confidently file their taxes in one click. In our new research paper and the associated code & data repository, we present TaxCalcBench, a new benchmark for evaluating how accurately and reliably frontier AI models calculate realistic tax returns.
- This is a first-of-its-kind benchmark. Because so few personal income tax calculation engines are available in the US, data like this is very hard to come by. The dataset contains 51 realistic test cases, each pairing taxpayer input data (e.g., W-2s) with the corresponding correctly computed Form 1040 in IRS XML format.
- Results show that AI models consistently struggle with critical tax tasks: they frequently misuse the IRS tax tables and make calculation errors. This underscores that current LLMs alone are insufficient for reliably computing personal tax returns without additional scaffolding & orchestration.
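To illustrate what "misusing IRS tax tables" can look like, here is a minimal sketch. It is not TaxCalcBench code. For taxable income under $100,000, the Form 1040 instructions direct filers to the Tax Table. As we understand it, from $3,000 upward each row spans $50, and the listed tax is computed on the row's midpoint, not on the exact income. A model that applies the marginal rate schedule directly to the exact income can therefore land a few dollars off. The sketch assumes Tax Year 2024 single-filer brackets and round-half-up to whole dollars. The helper names (`schedule_tax`, `table_tax`, `to_dollars`) are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: Tax Year 2024 brackets for a single filer, as (upper bound, rate).
# The benchmark's returns may use other years or filing statuses.
SINGLE_2024 = [
    (Decimal("11600"), Decimal("0.10")),
    (Decimal("47150"), Decimal("0.12")),
    (Decimal("100525"), Decimal("0.22")),
    (Decimal("191950"), Decimal("0.24")),
    (Decimal("243725"), Decimal("0.32")),
    (Decimal("609350"), Decimal("0.35")),
    (Decimal("Infinity"), Decimal("0.37")),
]


def schedule_tax(income: Decimal) -> Decimal:
    """Apply the marginal rate schedule to `income`, without rounding."""
    tax = Decimal("0")
    lower = Decimal("0")
    for upper, rate in SINGLE_2024:
        if income <= lower:
            break
        # Tax only the slice of income that falls inside this bracket.
        tax += (min(income, upper) - lower) * rate
        lower = upper
    return tax


def to_dollars(amount: Decimal) -> int:
    """Round to whole dollars, with 50 cents and above rounding up (assumption)."""
    return int(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def table_tax(taxable_income: int) -> int:
    """Hypothetical Tax Table lookup for $3,000 <= income < $100,000:
    tax is computed on the midpoint of the $50-wide row containing the income."""
    if not 3000 <= taxable_income < 100000:
        raise ValueError("sketch only covers the $50-row range of the Tax Table")
    row_start = taxable_income - taxable_income % 50
    midpoint = Decimal(row_start) + Decimal(25)
    return to_dollars(schedule_tax(midpoint))


if __name__ == "__main__":
    income = 50020
    print(table_tax(income))                          # 6059 (Tax Table method)
    print(to_dollars(schedule_tax(Decimal(income))))  # 6057 (schedule on exact income)
```

Under these assumptions, the two methods disagree by $2 at $50,020 of taxable income. That is enough for the computed Form 1040 to differ from the correct one, even though each step looks plausible on its own.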
Read more on the Column Tax blog: