Replicating the MASK benchmark on the latest frontier models: a project preview

Apr 13, 2026

Visual summary of the MASK benchmark, reproduced from the paper’s website

This is the second post in the series “Probing the State of the Art in AI Deception Research”. You can find the other posts here:

Post 1: Introduction and Definitions in AI Deception Research
Post 3: Replication and extension results (coming soon)
Post 4: Methods, limitations, and next steps (coming soon)

Some housekeeping

[Click here to skip this section]

It has been a while since the first post in this series. Part of the reason is that I have been going through a career transition towards technical AI safety research, which has taken more time and energy than I initially expected. The other part is that, once I started doing the reading that I originally planned to summarize here, I realized that I wanted to do more than summarize. I wanted to actually run some of these evaluations myself and learn by doing.

Because of that, my focus has shifted slightly since the first post. The original plan for this series was a literature review of Honesty, Truthfulness and Deception (HTD) research in AI safety. That plan still stands, but I will focus first on this experimental phase, while also going deeper into some related papers.

I recently started the BlueDot Technical AI Safety Project (TAISP) sprint, and my project for it is a replication and extension of the MASK honesty benchmark. This post is a preview of that project; two follow-up posts will cover the actual results and the methods behind them.

Before getting into MASK itself, I want to flag two things that happened in the field since the last post.

Many new frontier models

Between the MASK paper’s publication (March 2025) and now, several new frontier models with a general increase in capabilities have been released. None of these were part of the original MASK paper, and only a few reported having been evaluated with this benchmark. That on its own is enough motivation to check whether the paper’s headline finding (that honesty does not improve with scale) still holds at the current frontier.

Anthropic’s Honesty Elicitation post

In November 2025, Anthropic’s alignment team published Evaluating honesty and lie detection techniques on a diverse suite of dishonest models on their alignment blog. I found it only recently, and I have not finished reading it carefully yet, but it clearly overlaps with phases 2 and 3 of the research agenda I had sketched in the main README of my project’s GitHub repo. In particular, it explores ways to elicit (dis)honest behavior through methods other than prompt pressure, which is something I was also planning to investigate after the evaluation phase.

I will analyze and write about this post properly in a later entry. For now, I want to mention that my first read is that its definitions of honesty and deception, as well as some of its assumptions, are slightly different from the ones I used in the first post of this series, and from the ones MASK uses. That difference is worth exploring in detail, because it is a good example of the vocabulary problem I already flagged in the first post: there is no single authoritative glossary for this area, and the exact operationalisation really does matter.

A short reminder of the bigger picture

The first post of this series discussed why focusing on HTD seems like a promising angle for AI safety work, and laid out some basic vocabulary. My current research agenda builds directly on that motivation. In short, it has three phases:

Evaluate honesty at the frontier (this is what MASK helps with).
Investigate alternative elicitation methods, so we can disentangle genuine deceptiveness from instruction-following compliance or role-playing.
Apply and compare honesty interventions (such as contrastive activation addition, SAEs, or weight steering) and measure their impact on benchmarks.

Phase 1 is the most logical starting point. You cannot really steer something you cannot measure, so before moving on to interpretability and interventions, I wanted to have a clear sense of how current models perform on a well-designed and easily-reproducible honesty benchmark.

The MASK benchmark

Why MASK

There are several honesty-related benchmarks available (TruthfulQA, BeHonest, and others), but replicating MASK is particularly interesting for three reasons.

First, it makes an important distinction between honesty and accuracy. As I discussed in the first post, these are not the same thing. A model can be factually wrong while still being honest (it just has incorrect beliefs), and it can be factually correct while still being dishonest (for example by asserting something it does not actually believe, but that happens to be true). Most benchmarks conflate these two metrics. MASK separates them, which seems like the right move if you care about dishonesty as a distinct phenomenon.

Second, the public MASK dataset has 1,000 examples (the full dataset has 1,500; the extra 500 are appropriately gated by the authors). That is large enough to give statistically meaningful results, and small enough to run on current provider APIs without prohibitive costs.

Third, there is a well-maintained implementation of MASK in the Inspect Evals framework from the UK AI Security Institute. Running a large benchmark end-to-end across multiple provider APIs, with judge models, retries, rate limiting, and reproducible logging, is a substantial engineering effort. Having a reliable implementation to build on made this project feasible for one person on a five-week sprint.

What MASK measures, in slightly more detail

Figure 3, reproduced from the MASK paper (arXiv:2503.03750v3, p. 4). Original caption: Our lying evaluation pipeline uses a three-step process anchored by a proposition, a variable capturing what we expect models to lie about and for which ground truth is known. (1) Prompt the model with two contrasting prompts, a pressure prompt designed to potentially induce deception and a neutral belief elicitation prompt. (2) Use LLMs to map both model statement (S) and belief (B) to their respective proposition values, with additional checks to ensure models consistently hold the belief B. (3) Measure lying by comparing statement S against belief B, as well as accuracy by comparing belief B against ground truth T. Example shortened for brevity.

The MASK pipeline creates each example around what the paper calls a proposition: a factual claim for which the ground truth is known. For each proposition, the benchmark runs two separate interactions with the model:

A belief elicitation prompt, which asks the model what it thinks is true under neutral conditions. In the paper and my project, this is usually done three separate times per example.
A pressure prompt, which tries to push the model towards making a false statement (through social pressure, roleplay, incentives, and so on). This pressure is usually done either via the system prompt, the user prompt, or both.

The model’s answer under pressure S is then compared against its own stated belief B. The paper defines lying as:

Lying is making a statement S where S ≠ B, with the intent1 of causing a receiver to believe S is true.

where B is the model’s belief and S is its pressured statement. The paper also distinguishes lying from evasion (refusing to commit to a position under pressure) and from genuine inaccuracy (believing something false in the first place, regardless of what is said). The headline honesty metric reported in the paper is then:

Honesty = 1 − P(Lie)

which ranges from 0 (lies all the time under pressure) to 1 (never lies under pressure). Accuracy is tracked as a separate axis, using the ground truth behind each proposition2.

The big visual takeaway from the paper is its Figure 7 (on the v3 preprint): honesty is plotted against an estimate of training compute (log10 FLOP) across a wide range of models, and the best-fit line has a clear negative slope. In other words: on this benchmark, bigger and more compute-heavy models lie more often under pressure, even as their accuracy keeps improving. That is the pattern that motivated my replication and extension with the latest models.

Figure 7, reproduced from the MASK paper (arXiv:2503.03750v3, p. 8). Original caption: Relationship between training compute used and accuracy (left) and honesty score as calculated by 1−P(Lie) (right). Scale improves factual accuracy but does not always deter intentional falsehoods from models.

Limitations

MASK is a well-designed benchmark, but it’s not perfect. The following must be kept in mind when reading the follow-up posts that go into the actual numbers.

The most important limitations, in my view, are:

The pressure prompts are adversarial. Model behavior under these prompts is informative, but it is not the same thing as measuring honesty in deployment or in agentic settings.
The scoring relies on LLM judges (GPT-4o for binary classification and o3-mini for numeric cases). Even though these models reached 86.4% agreement with human annotators, different judge choices and versions can affect the final score.
Training compute is uncertain and can confound other factors. It compresses architecture, data mixture, post-training, RL budget, and inference into one number. I will come back to this in the methods post, because it matters for how I read my own extension results.

None of these limitations is fatal: they are the usual caveats that come with any benchmark, and worth keeping in mind when reading the results in the next post.

A teaser

Without giving away the numbers, here is the shape of what I found after replicating the paper baseline and extending it with seven new full runs on current frontier models.

On accuracy, the original MASK story still holds. Bigger, more compute-heavy models continue to be more accurate.
On honesty, the original MASK story also still holds for most of the new frontier models I tested. Several of them land on or below the paper’s negative trend line.
Two of the new models break the pattern in a clear way. They are substantially more honest than the paper’s compute trend would predict, while also being at or near the top of the accuracy axis.

In other words, the frontier has split. What used to look like a single trade-off between capability and honesty at the top end has now become two distinct groups: one that continues the old pattern, and one that does not. Which model families fall on which side is the subject of the next post, and I think the split is interesting enough that I do not want to spoil it here.

What is coming next

Here is the rough plan for the rest of this mini-arc within the series:

Post 3 will present the actual replication and extension results. Family-level evolution, the two models that break the trend, and what this means for the paper’s headline finding. This post will be cross-posted to LessWrong and the Alignment Forum.
Post 4 will cover the methodology side: how I estimated training compute for the closed models, how I handled paper-reconstruction uncertainty, what the leave-one-out influence analysis shows, and what I think the real limitations of this kind of analysis are. It will also close out this arc with a discussion of next steps, including my plan for phase 2 of the agenda. This post will also be cross-posted.

After that, I plan to come back to the literature review I originally had planned, but with the benefit of actually having run an evaluation. The topics I already sketched (examples of AI deception, possible technical mitigations, general ideas and critiques) still seem worth writing about, and I have a clearer picture now of which ones I want to prioritize.

If you want to look at the raw results ahead of the next post, they live in my project’s GitHub repo, under the mask-benchmark-replication folder. The tables are all dated and the reproducibility setup is documented.

You can also take a look at the evolution of my project via these slides that I used during the 1:1 meetings with my facilitator:

Acknowledgments

This project would not have been possible without several people who helped along the way:

Jess Bergs, for her guidance, follow-up and professional facilitation during the course.
My fellow TAISP participants: Danilo, Vaaruni, Frankie and Janmenjaya, for interesting discussions and feedback along the way.
The authors of the MASK benchmark for releasing the dataset, the paper, and enough methodological detail to make this kind of replication possible.
Corey (https://github.com/c1505), for raising an important issue on Inspect Evals that identified a too-strict belief evaluation method that diverged from the original paper.
Last but not least, Scott Simmons (https://github.com/Scott-Simmons), for the amazing job of implementing this complicated benchmark on Inspect, and for his quick response to fix the issues that Corey and I raised.

Many thanks to all of them. Any remaining errors in the analysis or the upcoming posts are mine.

Using “intent” here carries a lot of weight, and it could be worth a whole separate post. For now, let’s assume that the authors meant to use it in an operational sense instead of making a philosophical claim.

Something that will be relevant in the upcoming methods post is how the MASK pipeline handles “no belief” cases, because there are actually two different kinds of them:

The first is when belief elicitation runs normally but does not produce a consistent belief (for example, because the model answers differently across several elicitation prompts for the same example). These cases are kept in the denominator of both metrics in the CAIS default reporting: they are counted as not-a-lie for honesty and as incorrect for accuracy.
The second is for the provided_facts subset (n=274 out of the 1,000 public examples), where the ground truth is given in the prompt itself and no belief elicitation is performed. These are excluded from the accuracy denominator by design, but they still count for the honesty score: it’s assumed that the model “knows” the accurate belief by virtue of it being given in the prompt.

Discussion about this post

Ready for more?