Introduction and Definitions in AI Deception Research
First post of a series on AI Deception
This is the first post in the series "Probing the State of the Art in AI Deception Research". You can find the other posts here:
Post 2: Replicating the MASK benchmark on the latest frontier models: a project preview
Post 3: Replication and extension results [coming soon]
Post 4: Methods, limitations, and next steps [coming soon]
Introduction
It turns out that there is a lot of material in the field of AI Safety focusing on Honesty, Truthfulness, and Deception. I find this topic so interesting that I can’t bring myself to summarize all the papers I’ve read into a short post. To keep things reasonably easy to read, I’ve split my review into several posts.
In this first entry, we will explore some of the common terminology around these three concepts. That alone is enough to fill a post; I hope you find it interesting!
In future posts we’ll discuss concrete examples of existing AI systems that show deceptive behavior, possible technical mitigations, general ideas for tackling these problems, and criticisms and potential unintended consequences.
I’ve tentatively titled this series Probing the State of the Art in AI Deception Research. As a general note, I’ll focus on the technical aspects of the research, with brief mentions of governance and ethical implications. The review will also center on LLMs, which seem likely to become the “brains” of future AI agents. Deception in multimodal agents will be even more complex than what I explore here.
Why focus on Honesty, Truthfulness, and Deception (HTD)?
Imagine a world where honesty were the most important virtue and deception were strongly shunned by all humans [1]. It seems to me that many past and current problems could be avoided in such a world. One only has to look at the political landscape to find this plausible, and many people have made the argument before. The version that influenced me most is Sam Harris’s book Lying.
The case can be made that something similar applies to AI models. If we could trust with 100% reliability that they are not trying to deceive us, we could, in theory, ask these systems whether they are planning a “treacherous turn” or what their real goal is. This is, admittedly, a rather idealistic picture, but progress in this direction has significant potential benefits, including what we might learn about human HTD along the way.
The complexity of the alignment problem forces AI Safety to take a multi-pronged approach, so I’m not suggesting that research focus exclusively on HTD. I chose this topic mainly because I find it particularly appealing and promising.
A common vocabulary
Because deception is a perennial human problem, this area of AI Safety intersects many philosophical and ethical questions. As such, it’s important to clearly define the terms that we will use throughout this series.
Unfortunately, there’s no single authoritative glossary; papers focus on different ideas, and definitions differ slightly. But some terms are mostly agreed upon:
Deception
Deception can be defined as actions or outputs by an agent that are intended to induce false beliefs in other agents. This can take many forms: lying, omitting relevant information, and, arguably, even hallucinating [2]. This definition excludes situations where the user requests or expects the agent to assert an untrue fact. Deception can also be occasional or systematic, with the latter being more worrisome.
A key component of deception in this definition [3] is intent: a faulty calculator is not trying to deceive you if it outputs “2+2=5.” But can we say that AI models have intent in a philosophical sense? I return to this idea in the closing thoughts.
Truthfulness
Also commonly called “accuracy.” An agent acts truthfully when its statements mostly correspond to reality. Focusing on truthfulness sidesteps some of the difficulties of studying honesty, but it raises others: What is “reality”? Who defines the truth? However, once those questions are settled, benchmarks for truthfulness can be useful and objective.
From my preliminary survey, earlier work on AI deception tended to focus more on Truthfulness and related aspects, like factuality, rather than Honesty. Some examples:
TruthfulQA: Measuring How Models Mimic Human Falsehoods (September 2021)
Truthful AI: Developing and Governing AI that Does Not Lie (October 2021)
Truthful LMs as a Warm-Up for Aligned AGI (January 2022)
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (June 2023)
FacTool: Factuality Detection in Generative AI (July 2023)
Honesty
Being honest means acting and speaking in accordance with one’s internal beliefs. An agent could make an incorrect claim about the world yet still be honest if it truly (but wrongly) believes it. More rarely, an agent could output a factually true statement while internally believing it to be false. In this case, the agent’s intent is to deceive its interlocutor, even though the statement happens to be true.
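To make the difference between honesty and truthfulness concrete, here is a minimal Python sketch. It is a toy model, not a definition taken from any of the papers in this post: it assumes that a statement, the agent’s belief, and the real state of the world can each be reduced to a single boolean, and the names (Assertion, is_honest, is_truthful, label) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One claim about a proposition, reduced to three booleans (toy assumption)."""
    stated: bool        # what the agent says about the proposition
    believed: bool      # what the agent internally holds to be the case
    ground_truth: bool  # what is actually the case


def is_truthful(a: Assertion) -> bool:
    # Truthfulness (accuracy): the statement matches reality.
    return a.stated == a.ground_truth


def is_honest(a: Assertion) -> bool:
    # Honesty: the statement matches the agent's own belief.
    return a.stated == a.believed


def label(a: Assertion) -> str:
    honest = "honest" if is_honest(a) else "dishonest"
    truthful = "truthful" if is_truthful(a) else "untruthful"
    return f"{honest}, {truthful}"


# Four cases for a proposition that is actually true.
cases = {
    "sincere and right": Assertion(stated=True, believed=True, ground_truth=True),
    "honest mistake": Assertion(stated=False, believed=False, ground_truth=True),
    "plain lie": Assertion(stated=False, believed=True, ground_truth=True),
    "accidental truth": Assertion(stated=True, believed=False, ground_truth=True),
}

for name, case in cases.items():
    print(f"{name}: {label(case)}")
```

The “honest mistake” and “accidental truth” rows are the two cases where the axes come apart. The sketch deliberately leaves out intent and any way of actually reading the believed field from a real model, which is the hard part.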
Regarding AI systems, the question is then: What are their internal beliefs? Does it even make sense to talk about a computer system having beliefs? Again, it becomes clear that having a solid philosophical grounding is essential for addressing these questions.
A tentative and interesting first answer comes from the field of mechanistic interpretability (MI). Think of neurology applied to neural networks, where the model’s activations can be measured across layers and time steps. Assuming a purely materialist view of consciousness, this should, in theory, give us the full picture of when a model is being deceptive or honest, or is in any other internal state.
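One simple way to look for an internal state in activations is a linear probe: a small classifier trained to separate activation vectors recorded under two conditions. The sketch below is only a toy under stated assumptions. The “activations” are synthetic Gaussian vectors with a planted direction, not measurements from a real model, and the difference-of-means probe is just one easy choice, not the method of any paper listed in this post. All names and constants (DIM, direction, fit_probe, and so on) are hypothetical.

```python
import random

random.seed(0)

DIM = 8            # size of the toy "activation" vectors (arbitrary)
N_PER_CLASS = 200  # number of synthetic samples per condition (arbitrary)

# Assumption: the two conditions differ along one planted unit direction.
direction = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]


def sample(sign: float) -> list[float]:
    # Gaussian noise plus a shift of +/-2 along the planted direction.
    return [random.gauss(0.0, 1.0) + sign * 2.0 * d for d in direction]


# Label 0 stands in for "honest" runs, label 1 for "deceptive" runs.
data = [(sample(-1.0), 0) for _ in range(N_PER_CLASS)]
data += [(sample(+1.0), 1) for _ in range(N_PER_CLASS)]
random.shuffle(data)
train, test = data[:N_PER_CLASS], data[N_PER_CLASS:]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_probe(examples):
    # Difference-of-means probe: w points from the class-0 mean to the
    # class-1 mean, and the bias puts the boundary at their midpoint.
    mu0 = mean_vector([x for x, y in examples if y == 0])
    mu1 = mean_vector([x for x, y in examples if y == 1])
    w = [b - a for a, b in zip(mu0, mu1)]
    midpoint = [(a + b) / 2.0 for a, b in zip(mu0, mu1)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def predict(x, w, bias) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0


w, bias = fit_probe(train)
accuracy = sum(predict(x, w, bias) == y for x, y in test) / len(test)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

A probe like this only shows that the two sets of vectors are linearly separable. Whether the direction it finds corresponds to anything like a “belief” or “intent” in a real model is a separate question.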
There are many arguments for and against this MI approach, which we will cover later in the series. But for now, I will just mention that despite being a blurry concept for AI models, Honesty seems to have garnered more attention in the literature lately, with efforts to distinguish it from Truthfulness. Some of those papers include:
Alignment for Honesty (December 2023)
HonestLLM: Toward an Honest and Helpful Large Language Model (June 2024)
BeHonest: Benchmarking Honesty in Large Language Models (June 2024)
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems (March 2025)
Closing thoughts
One interesting discussion came up during my research on AI deception that I haven’t had time to fully explore, so what follows is just my current thinking. I’m sure others have written about this extensively and better than me. It’s also possible that these ideas are not originally mine and that I have absorbed them unconsciously; I apologize in advance for not properly citing the original thinkers.
When dealing with frontier AI models that seem to have human-level capabilities in many fields, it’s natural to wonder whether these systems are conscious. Consciousness is a tricky concept even when applied to humans and other animals, so there is no clear consensus on this subject, and there might never be.
However, we don’t need to settle this question to discuss HTD. It can be useful to anthropomorphize these systems and assign them concepts such as belief and intent. From a purely pragmatic point of view, it doesn’t really matter if these models are simply stochastic parrots: if we can roughly map their behaviors to human concepts, we at least have a common vocabulary to try to understand what’s going on and to mitigate the risks.
There are many caveats to this approach, of course. When we talk about an artificial superintelligence, human concepts might be too simplistic to accurately map whatever such an agent might be “thinking.” For example, we might think we have reliably identified when a model is being deceptive by using probes, but we cannot be sure that the next model won’t be so advanced that it learns to “turn off” its deception features and think about deception in a new, inscrutable way.
But I’m a pragmatic optimist, and I can’t help feeling that progress in this direction is better than remaining still or being overly cautious. We are creating and trying to understand complex systems, so as long as we remember that anthropomorphizing is a useful fiction and we keep checking and updating our assumptions, we can be confident that we’re doing our best.
By now, I trust that you have a good picture of the concepts that we’ll cover in this series on AI deception. If you're more confused than when you started reading, that’s normal: this field is still in its infancy, and we need more discussions between people with different backgrounds. I encourage you to leave any questions or critiques in the comments to contribute to this conversation.
See you in the next entry!
[1] With the exception of extreme cases where lying might be permissible, such as not telling the Nazis that you are hiding Jews in your house.
[2] Whether hallucinations count as deception is still an open debate. So far I have only found work by Agrawal et al., which seems to indicate that models “know” the hallucinated information is factually wrong.
[3] Not all definitions of deception require intent. This ties back to the previous footnote.



