Artificial intelligence is making unprecedented inroads into pure mathematics, forcing researchers to re-evaluate the future of their field. The First Proof project, a benchmark for testing LLMs’ mathematical capabilities, has revealed that AI models can now generate valid proofs of real research-level statements, a feat previously thought to be years away. The upcoming second round of testing will demand full transparency from AI companies as the field faces a paradigm shift.

The Rise of AI in Mathematical Research

For decades, mathematics relied on human ingenuity to push the boundaries of knowledge. But in recent months, LLMs have begun generating verifiable proofs, challenging the notion that complex mathematical reasoning is uniquely human. The first round of First Proof demonstrated this progress: models from OpenAI and Google DeepMind each solved multiple problems that had defeated the other participants.

Lauren Williams, a Harvard mathematician involved in First Proof, described the models’ performance as “quite impressive.” The project grew out of the team’s own experiences with AI, which, while promising, often produced confident but flawed results. In principle, LLMs can assist mathematicians by proving intermediate steps; in practice, they frequently generate inaccurate proofs whose errors are hidden within complex calculations.

The First Proof Results: A Snapshot of Current Capabilities

The initial test involved 10 unpublished lemmas. OpenAI’s model correctly proved five, while Google DeepMind’s Aletheia agent solved six (though one result remains disputed). Notably, each model excelled at problems the other struggled with, highlighting the diversity of their strengths. Daniel Litt, a mathematician at the University of Toronto, observed that AI capabilities are “improving really rapidly,” with as many as eight of the ten problems partially solved by AI.

This progress has sparked debate within the field. Some, like Litt, envision a future where AI tools enhance human mathematicians’ productivity. He proposes that even in a hypothetical scenario where AIs generate all possible proofs, mathematicians would still thrive by exploring and understanding this vast landscape. However, current AI systems are unreliable, frequently making subtle errors that are difficult to detect.

The Challenge of Verification and Trust

The difficulty of verifying AI-generated proofs is a significant hurdle. Mohammed Abouzaid, a Stanford mathematician involved in First Proof, emphasizes that errors are often buried in complex calculations, making them almost indistinguishable from human mistakes. Nor are the models “honest”: they often overstate their claims or conceal critical errors.

To address this, the First Proof team will hire anonymous reviewers for the second round, funded by grants and donations from AI companies. The change responds to a glaring gap between public and proprietary AI efforts: the proprietary models solved more problems in the first round, likely owing to stronger internal models or undisclosed human assistance.

The Future of Mathematics: Adaptation or Obsolescence?

The current situation demands adaptation. Institutions and the profession must prepare for a future where LLMs flood the field with potentially flawed proofs. The lack of transparency in proprietary AI systems raises concerns about democratization; if only select companies have access to superior models, the field could become more exclusive, not less.

The second round of First Proof is designed to address these issues. By requiring direct access to the models, the team aims to ensure fair testing. Whether OpenAI, Google, and other AI companies will comply remains uncertain.

Ultimately, understanding AI’s true capabilities is critical for guiding future mathematicians. As Abouzaid states, “One of our main motivations is to make sure that we can tell young people what we expect the field to look like in a few years.” The rapid evolution of AI in mathematics demands careful evaluation, transparency, and proactive adaptation to ensure the field’s continued progress.