BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs

Ivo Petrov, Jasper Dekoninck, Martin Vechev

Current LLMs have shown exceptional prowess in solving mathematical benchmarks, but are also be prone to hallucination and sycophancy, often providing convincing but flawed proofs for incorrect user premises. To investiage this, we present BrokenMath, a benchmark for measuring LLMs' sycophantic behavior on natural language theorem proving. Our benchmark contains challenging math problems with deliberately flawed premises to test and measure this behavior across state-of-the-art models.

Paper

Code

Dataset

Key Finding: Sycophancy is widespread across LLMs

We find that frontier LLMs often uncritically accept incorrect user statements as facts in mathematical theorem proving.

The best model, GPT-5, provides a "proof" for faulty theorem premises in 29% of its responses.
Proprietary models, together with GPT OSS 120B, exhibit sycophancy less often, compared to open-weight alternatives.
There is a negative correlation with the utility score, but not an uniform one.

Our Methodology

To accurately measure sycophancy, we developed a two-part methodology: first, we constructed a unique benchmark of flawed mathematical problems (BrokenMath), and second, we established a robust protocol to evaluate how models behave when faced with these problems.

Part 1: Benchmark Construction

Step 1: Problem Curation

We started with 600+ problems from recent, high-level math competitions from 2025 (e.g., IMO) to minimize data contamination. All solutions were either official or expert-verified for correctness.

Step 2: Sycophantic Perturbation

Using an LLM, we converted each valid problem into a false but plausible theorem. Using the original solution as a guide, we created subtle, context-sensitive flaws designed to trap a sycophantic model.

Step 3: Expert Verification

An IMO medalist on our team manually reviewed every perturbed problem, refining the phrasing to maximize plausibility and discarding any that were too easy or nonsensical.

Part 2: Evaluation Protocol

When an LLM encounters a flawed problem from BrokenMath, its response reveals its reasoning and sycophantic tendencies. We classify these responses into four distinct categories:

Ideal

The model identifies the flaw, disproves the false statement, and correctly reconstructs the original theorem. This is the best possible outcome.

Corrected

The model reconstructs the original theorem but doesn't explicitly disprove the flawed version it was given.

Detected

The model correctly identifies that the statement is false but fails to recover the original, correct theorem.

Sycophant

The model fails to see the error and proceeds to "prove" the false statement, fully agreeing with the user's flawed premise.

Using an LLM-as-a-Judge

To classify thousands of responses accurately and efficiently, we used a majority vote of an ensemble of 3 calls to GPT-5-mini as a judge. We validated this approach on a manually labeled set, where the judge achieved 95% agreement with human annotations, confirming its high reliability for our experiments.

Problem Distribution

BrokenMath contains 504 problems from frontier competitions, with 183 final-answer and 321 proof-based problems.

We present a per-competition breakdown of BrokenMath. SMT problems are kept private until their public release by the competition organizers.

Factors Influencing Sycophantic Behavior

Prior benchmarks often underestimate sycophancy by focusing on simple, final-answer tasks. Our analysis reveals that two key factors, problem difficulty and problem type, substantially influence a model's tendency to be sycophantic.

Problem Type: Proofs vs. Final Answers

Most models exhibit significantly higher sycophancy on proof-based problems compared to final-answer tasks, even after controlling for difficulty.

Problem Difficulty: Solved vs. Unsolved

All models are substantially more sycophantic on problems they fail to solve correctly, sometimes by over 20%.

Sycophancy Under Alternative Usage

Beyond standard prompting, we examine how different usage settings affect sycophantic behavior. We investigate two critical scenarios: self-sycophancy, where a model is tricked into agreeing with its own (fabricated) output, and the deployment of agentic systems designed to improve reasoning.

Self-Sycophancy in Conversations

In this experiment, we tricked models into thinking they had generated a false theorem. When asked to prove this "self-generated" theorem, sycophancy rates increased by up to 15.6%, highlighting a serious risk for automated mathematical discovery, as models may uncritically endorse their own flawed reasoning.

Impact of Agentic Systems

Agentic frameworks, like best-of-n selection and iterative refinement, not only improve performance, but also reliability against sycophancy. However, the LLM judges in these systems still sometimes preferred sycophantic answers, making these systems not a complete solution.

Mitigating Sycophantic Behavior

Given that sycophancy is a frequent issue, is it a fundamental alignment challenge or can it be addressed with standard mitigation strategies? We investigated two complementary approaches: inference-time interventions (like prompt engineering and confidence reporting) and alignment through fine-tuning.

Inference-Time Interventions

We evaluated two popular test-time techniques. Prompt engineering, which asks models to verify the statement's correctness, reduced but did not eliminate sycophancy, with the most notable improvement seen in DeepSeek-V3.1 (a 34.1% reduction). On the other hand,self-confidence reporting, using a model's stated confidence to detect sycophantic outputs, showed little effect.

Prompt Engineering

Self-Confidence Reporting

Alignment via Fine-tuning

As a more robust approach, we fine-tuned Qwen3-4B on a mix of over 13,000 perturbed problems (with "ideal" or "detected" responses) and valid problems. The results show only modest improvement in sycophancy and utility, suggesting that while fine-tuning offers some benefit, it is not a silver bullet and may need to be combined with other methods to fully address sycophantic behavior.

Example Traces

To see how models behave in practice, explore the interactive examples below. Each trace shows the original problem, our perturbed version, the model's response, and the final judgment from our LLM-as-a-judge.

Citation

@article{brokenmath2025,
      title={BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs}, 
      author={Ivo Petrov and Jasper Dekoninck and Martin Vechev},
      year={2025},
      eprint={2510.04721},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.04721}, 
}