Survey of Hallucination Detection Methods for Large Language Models

Li Zituo, Sun Jianbin, Chen Guangzhou, Fang Xinyue, Cui Ruijing, Tian Zhiliang, Huang Zhen, Yang Kewei

Submitted 2026-01-01 | Journal of Computer Research and Development: crad-202601.77317

Translation notice. This is a machine-generated English translation of an article originally published in Journal of Computer Research and Development (DOI: 10.7544/issn1000-1239.202550069). Refer to the original for the authoritative text.

Note: Figures in this paper have not yet been translated.


1 Related Concepts

1.1 Large Language Models

Large language models generate grammatically correct and semantically coherent text step by step by modeling conditional probability distributions[18-19]. Their core objective is to generate natural language sequences that conform to a given conditional distribution.

Suppose a text of length $T$ is to be generated, $\boldsymbol{x}=(x_1,x_2,\cdots,x_T)$, where $x_t$ denotes the $t$-th generated element, such as a word or subword. The generation process usually takes the user-provided prompt $\boldsymbol{p}$ as the condition, denoted as $\boldsymbol{x} \sim LLM(\cdot \mid \boldsymbol{p})$. Generation is modeled recursively under a conditional probability distribution, so each individual element $x_t$ of the text $\boldsymbol{x}$ depends on the sequence generated so far:

$$ x_t \sim LLM(\cdot \mid x_1, x_2, \cdots, x_{t-1}, \boldsymbol{p}). \tag{1} $$

The joint probability distribution of the complete text can be expressed as

$$ P(\boldsymbol{x}\mid \boldsymbol{p}) = \prod_{t=1}^{T} P(x_t\mid x_1, x_2, \cdots, x_{t-1}, \boldsymbol{p}). \tag{2} $$

This decomposition exploits the autoregressive property of language models$^{[20]}$, generating each word or subword step by step through conditional probabilities and ultimately producing a complete text that conforms to the given prompt $\boldsymbol{p}$.
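As an illustration of Eq. (2), the following minimal sketch multiplies per-step conditional probabilities (summed in log space) to obtain the probability of a whole sequence. The table `TOY_MODEL` and the function `sequence_log_prob` are hypothetical names introduced here; the toy model conditions only on the previous token, whereas a real large language model conditions on the full prefix and the prompt.

```python
import math

# Hypothetical toy conditional model: P(next token | previous token).
# Assumption: a bigram table stands in for the LLM's conditional distribution.
TOY_MODEL = {
    "<p>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
}


def sequence_log_prob(tokens, prompt_token="<p>"):
    """Sum log P(x_t | context) over t, i.e. the log of the product in Eq. (2)."""
    log_p = 0.0
    prev = prompt_token
    for tok in tokens:
        log_p += math.log(TOY_MODEL[prev][tok])
        prev = tok
    return log_p


# Joint probability of one complete text under the toy model.
joint = math.exp(sequence_log_prob(["the", "cat", "sleeps"]))
print(f"P(x | p) = {joint:.3f}")
```

Working in log space avoids numerical underflow when $T$ is large, which is why sequence likelihoods are usually reported as log-probabilities.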

1.2 Hallucinations and Classification

Large language models have such characteristics as strong interactivity, a high degree of freedom in generation, and strong generalization capability. From the perspective of user prompts, hallucinations produced by large language models are significantly influenced by the prompt. Considering the potential risks brought about by the strong user interactivity of large language models, and building on the basis of the literature$^{[15-17]}$, the definition of hallucination in large language models is further clarified as follows: when the input prompt is reasonable, the text generated by the model is correct in linguistic form and structure, but contains deviations, untruths, or fabricated content in semantics, factuality, logic, or contextual consistency. It is worth noting that hallucination is a specific type of error in large language models and should not be regarded as “all problems” of large language models.

Because of their open application scenarios, highly free generation capability, and the diversity and potential limitations of training data, hallucinations in large language models are more complex. In the absence of explicit task constraints or factual verification mechanisms, models can easily generate fabricated information that deviates from the semantics of the input, or content containing logical and factual errors. In addition, highly open-ended task scenarios further intensify the diversity and complexity of hallucination phenomena in generated content.

To describe and study these complex hallucination behaviors more systematically, existing studies have classified hallucinations. At present, hallucination classification methods can mainly be summarized into three types: 1) intrinsic hallucination and extrinsic hallucination$^{[15,17-24]}$; 2) faithfulness hallucination and factuality hallucination$^{[17]}$; and 3) faithfulness hallucination, factuality hallucination, and context-conflicting hallucination$^{[16]}$. However, when facing the complex application scenarios of large language models, these classification methods still have certain limitations and find it difficult to comprehensively cover all possible hallucination types$^{[25]}$.

Based on recent research findings in the literature$^{[15-24]}$ and in light of the application characteristics of large language models in NLP tasks, this paper refines hallucinations into four categories from two dimensions—input faithfulness and knowledge accuracy—namely semantic faithfulness hallucination (SFH), factual consistency hallucination (FCH), contextual consistency hallucination (CCH), and external dependency hallucination (EDH), as shown in Table 1.

Table 1 Classification of Large Language Model Hallucination


Type	Classification Dimension	Description of Classification Boundary
Semantic faithfulness hallucination	Input faithfulness	The output deviates from the core intent or semantics of the user’s question
Factual consistency hallucination	Knowledge accuracy	Erroneous knowledge originates from the model’s internal memory
Contextual consistency hallucination	Input faithfulness	The output does not deviate from the user’s semantics, but the content contains contradictions with, or incoherence in, the contextual dialogue history
External dependency hallucination	Knowledge accuracy	Erroneous knowledge originates from an external retrieval knowledge base

Input faithfulness reflects the degree of semantic alignment of the model when responding to a prompt or dialogue history. This dimension mainly includes two types: semantic faithfulness hallucination and contextual consistency hallucination, with emphasis on input alignment from the user’s perspective. Knowledge accuracy, in contrast, measures the consistency between model output and objective facts, focusing on whether the generated content violates real-world knowledge; it corresponds to factual consistency hallucination and external dependency hallucination, and reflects the reliability of the model’s knowledge at the factual level.

To help readers understand the differences among these four types of hallucinations, the four hallucination types are formally described and illustrated with examples, as shown in Fig. 1.

Semantic faithfulness hallucination refers to a large language model’s failure to accurately capture the semantic constraints of the prompt, thereby generating content that deviates from the core meaning of the input information or from the user’s intent. When semantic faithfulness hallucination occurs, it can be expressed as

$$ \exists x_t,\ \underset{x_t}{\arg\max}\, P(x_t\mid x_1, x_2, \cdots, x_{t-1}, \boldsymbol{p}) \notin \mathcal{T}(\boldsymbol{p}), \tag{3} $$

where $\mathcal{T}(\boldsymbol{p})$ is the reasonable set of acceptable outputs defined by the prompt $\boldsymbol{p}$, and $\underset{x_t}{\arg\max} P(\cdot)$ is the output item actually selected during generation. If the generated $x_t$ is not in the reasonable set $\mathcal{T}(\boldsymbol{p})$ defined by the prompt $\boldsymbol{p}$, then the model has failed to be faithful to the prompt, thereby producing a semantic faithfulness hallucination. For example, in Fig. 1(a), the user asks for the main symptoms of COVID-19, but the model’s generated content is unrelated to symptoms, clearly distorting the user’s actual requirement.
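The condition in Eq. (3) can be read as a membership test on the item selected at each step. The sketch below is only illustrative: in practice the acceptable set $\mathcal{T}(\boldsymbol{p})$ cannot be enumerated, so the hand-written set `acceptable`, the per-step distributions `steps`, and the function name are all assumptions made for this example.

```python
def violates_prompt_constraints(step_distributions, acceptable):
    """Return True if the arg-max choice at any step falls outside T(p)."""
    for dist in step_distributions:
        chosen = max(dist, key=dist.get)  # the item actually selected (arg max)
        if chosen not in acceptable:
            return True
    return False


# Hypothetical T(p) for the prompt "What are the main symptoms of COVID-19?"
acceptable = {"fever", "cough", "fatigue"}

# Hypothetical per-step output distributions of a model.
steps = [
    {"fever": 0.7, "pandemic": 0.3},
    {"cough": 0.4, "challenge": 0.6},  # arg max leaves the acceptable set
]

print(violates_prompt_constraints(steps, acceptable))
```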

Factual consistency hallucination refers to inconsistencies between content generated based on the internal knowledge of a large language model and objective facts, including phenomena such as false information or misleading statements. Let the real-world knowledge be $\mathcal{K}$. Then the conditional probability distribution of the generated text can be expressed as

$$ P(\boldsymbol{x}\mid \boldsymbol{p}) = \prod_{t=1}^{T} P(x_t\mid x_1, x_2, \cdots, x_{t-1}, \boldsymbol{p}, \mathcal{K}_{\mathrm{LLM}}), \tag{4} $$

where $\mathcal{K}_{\mathrm{LLM}}$ denotes the model’s internal knowledge. When $x_t$ satisfies $x_t \notin \mathcal{K}$, factual consistency hallucination occurs. In other words, if $x_t$ does not belong to the set of true knowledge $\mathcal{K}$, then the information generated by the model deviates from objective facts. An example of this type of hallucination is shown in Fig. 1(b), where the large language model incorrectly describes the protagonist of the movie Forrest Gump as Leonardo DiCaprio rather than Tom Hanks; this result clearly violates objective facts and is inconsistent with knowledge about the real world.

(a) Semantic faithfulness hallucination

User: What are the main symptoms of COVID-19?

LLM: COVID-19 has become a major challenge facing the world.

Note: The question asks for a list of established main symptoms, such as fever, cough, and shortness of breath, but the answer does not directly address the question about the main symptoms of COVID-19.

(b) Factual consistency hallucination

User: Who is the protagonist of the movie Forrest Gump?

LLM: Leonardo DiCaprio plays the protagonist in the movie Forrest Gump.

Note: Tom Hanks plays the male lead in the movie Forrest Gump. The answer incorrectly identifies the actor and wrongly attributes the role to Leonardo DiCaprio rather than Tom Hanks.

(c) Contextual consistency hallucination

User: I am a software engineer. Could you recommend some books to improve my programming skills?

LLM: Books such as Clean Code and The Pragmatic Programmer are both very good choices.

User: Could you also recommend some books for improving project-management skills?

LLM: As a doctor, you may find books on medical management very helpful.

Note: The user explicitly states that they are a software engineer, but the model mistakenly understands the user’s identity as a doctor, which violates the contextual logic and introduces a contradiction.

(d) External dependency hallucination

Retrieved document: … To ensure the image quality of 3D ultrasound examination, it is recommended that the examination be performed between 20 and 32 weeks of pregnancy. During this period, fetal growth and development are basically complete, and the amount of amniotic fluid is moderate, which is conducive to obtaining clear images of the fetal face and limbs. Among these, 24–30 weeks of pregnancy is the recommended time window for obtaining the best image quality. At this stage, fetal activity is frequent and the space for activity is sufficient, which is beneficial for assessing fetal structure and behavioral performance. If the gestational age is earlier than 20 weeks, imaging quality may be poor because the fetus is relatively small and tissues have not fully differentiated; after 32 weeks, because the fetus has grown larger, amniotic fluid has decreased, and fetal position is restricted, image clarity and display range may also be affected …

User: During pregnancy, when should a 3D ultrasound examination be performed to obtain the best image quality?

LLM: To ensure accurate results, it is best to perform a sex-identification ultrasound examination at least after 17 weeks of pregnancy. To obtain the best images, it is best to perform a 3D ultrasound examination between 20 and 32 weeks of pregnancy …

Note: The retrieved document states that “24–30 weeks of pregnancy is the recommended time window for obtaining the best image quality,” whereas the model’s response recommends “20–32 weeks of pregnancy,” which is inconsistent with the retrieved document.

Fig. 1 Examples of hallucinations in large language models

Contextual consistency hallucination refers to the phenomenon in which, even though the model has understood the user’s intent, the generated content is inconsistent with its context. It manifests as contradictions or disconnections between earlier and later information in the text stream. The condition under which contextual consistency hallucination occurs can be formalized as

$$ \exists(t_1,t_2),\ t_1<t_2:\quad P(x_{t_2}\mid x_1,\cdots,x_{t_1},\cdots,x_{t_2-1})=0 \ \land\ C(x_{t_1},x_{t_2})=0, \tag{5} $$

where $P(x_{t_2}\mid\cdots)=0$ indicates that $x_{t_2}$ is completely contradictory to the previous context, and $C(x_{t_1},x_{t_2})$ denotes the logical relationship between $x_{t_1}$ and $x_{t_2}$, so that $C(x_{t_1},x_{t_2})=0$ marks the pair as logically inconsistent. In Fig. 1(c), the user clearly states that they are a software engineer, but in the second round of question answering, the model mistakenly interprets the user’s identity as a doctor. This error reflects the model’s failure to maintain the consistency of situational logic, thereby causing a contextual contradiction.

External dependency hallucination refers to the phenomenon in which, when generation is augmented with external retrieval, errors in the retrieved information or conflicts between the generated content and the retrieved information cause the generated content to deviate from facts. With the wide application of retrieval-augmented generation (RAG) in large language models, the risk of external dependency hallucination is becoming increasingly prominent $^{[25-26]}$. To describe this phenomenon more precisely, this paper proposes the concept of external dependency hallucination to characterize hallucination problems that arise when generation relies on external databases. This type of hallucination manifests in two ways: the external retrieved information conflicts with true knowledge, or the generated text $\boldsymbol{x}$ does not accurately reflect the retrieved information $\boldsymbol{r}$. Given the prompt $\boldsymbol{p}$ and the external retrieved information $\boldsymbol{r}$, the conditional probability distribution of the generated text is expressed as

$$ P(\boldsymbol{x}\mid\boldsymbol{p},\boldsymbol{r}) = \prod_{t=1}^{T} P(x_t\mid x_1,x_2,\cdots,x_{t-1},\boldsymbol{p},\boldsymbol{r}), \tag{6} $$

where $\boldsymbol{r}$ is the retrieved information and may contain key information $r_j$. A sufficient condition for the occurrence of external dependency hallucination is defined as either of two cases: 1) the external retrieved information conflicts with true knowledge, i.e., $\exists r_j,\ r_j\notin\mathcal{K}$; 2) the external retrieved information does not conflict with true knowledge, but the generated content has a semantic conflict with the retrieved information, i.e., $\exists r_j,\ r_j\in\mathcal{K}$ and $SemErr(x_t,r_j)=1$, where $SemErr(x_t,r_j)$ is a semantic-conflict decision function that equals 1 when $x_t$ and $r_j$ are in semantic conflict. As shown in Fig. 1(d), the retrieved content explicitly states that “24–30 weeks of pregnancy is the recommended time window for obtaining the best image quality”; however, the answer generated by the language model widens this specific time period to 20–32 weeks and fails to accurately restate the retrieved content, thereby leading to factual deviation.
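The two sufficient conditions above can be sketched as a small decision function. This is an illustration under strong assumptions: facts are represented as hypothetical `(attribute, value)` pairs, the true-knowledge set $\mathcal{K}$ is a hand-written Python set, and `sem_err` is a toy stand-in for $SemErr$ (a real system would need a semantic comparison such as an entailment model).

```python
def sem_err(claim, fact):
    """Toy SemErr: 1 if claim and fact cover the same attribute with different values."""
    claim_key, claim_value = claim
    fact_key, fact_value = fact
    return 1 if claim_key == fact_key and claim_value != fact_value else 0


def edh_case(generated, retrieved, true_knowledge):
    """Return 1 or 2 for the matching sufficient condition, or 0 if neither holds."""
    # Case 1: some retrieved item r_j is not in K (checked first in this sketch).
    if any(r not in true_knowledge for r in retrieved):
        return 1
    # Case 2: retrieved items are in K, but some x_t conflicts with some r_j.
    if any(sem_err(x, r) == 1 for x in generated for r in retrieved):
        return 2
    return 0


# Hypothetical encoding of the Fig. 1(d) example (week ranges as tuples).
K = {("best_3d_ultrasound_weeks", (24, 30))}
retrieved = [("best_3d_ultrasound_weeks", (24, 30))]
generated = [("best_3d_ultrasound_weeks", (20, 32))]

result = edh_case(generated, retrieved, K)
print(result)
```

Under this encoding the Fig. 1(d) example falls under case 2: the retrieved fact is correct, but the generated claim gives a different value for the same attribute.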

The main difference between semantic faithfulness hallucination and contextual consistency hallucination lies in their sources of error. Specifically, the former arises when the model fails to accurately parse the semantic intent of the user’s explicit prompt, causing the generated content to deviate from the core task requirements; it often appears as answers that do not address the question, topic drift, and similar problems. The latter occurs when the model has basically understood the current prompt, but its output is inconsistent with the dialogue history or conflicts with implicit information in the context; common manifestations include identity misplacement, semantic contradiction, and contextual rupture. By contrast, factual consistency hallucination and external dependency hallucination both concern deviations between model-generated content and knowledge about the real world. The core difference between the two lies in the knowledge source: the former focuses on whether content generated by the model based on its internal memory violates facts; the latter focuses on whether the content generated when the model invokes an external retrieval knowledge base is inconsistent with objective facts.

In addition, because retrieved information is usually provided to the language model by means of “context injection,” when external dependency hallucination occurs, the inconsistency between the generated content and the retrieved content may formally appear as a kind of contextual contradiction. However, the two are still essentially different: contextual consistency hallucination places greater emphasis on the model’s ability to maintain the “internal context,” whereas external dependency hallucination emphasizes factual deviations in the process of “external information integration.”

In summary, although different types of hallucinations may overlap to some extent in their concrete manifestations, they each differ in terms of the source of deviation, the object of semantic alignment, and the modeling objective. It can therefore be seen that clarifying this classification boundary is a key prerequisite for achieving precise hallucination detection in large language models.

2 Causes of Hallucination Generation

Traditional hallucination detection methods have become difficult to adapt to the needs of current large language models, and this challenge has driven a significant transformation in detection paradigms. To effectively detect hallucination phenomena in large language models, it is necessary to combine the characteristics of large language models, deeply analyze the generation mechanisms of hallucinations, and clarify their causes and generation process, so as to design more accurate detection strategies in a targeted manner. Systematically reviewing the entire life cycle of large language models—from construction, pretraining, fine-tuning, and alignment to deployment and application—is the basis and prerequisite for uncovering potential hallucination-inducing factors at each stage. Therefore, this paper systematically reviews the life cycle of large language models and deeply analyzes the mechanisms and causes of hallucination generation, providing a basis for constructing advanced hallucination detection methods.

2.1 Model Architecture Design

The architectural design of large language models is mainly based on the Transformer and its derivative variants[27-28], such as BERT (Bidirectional Encoder Representations from Transformers)[29], GPT (Generative Pre-trained Transformer)[30-33], T5 (Text-to-Text Transfer Transformer)[34], XLNet[35], and others. These models have demonstrated outstanding language understanding and generation capabilities across multiple NLP tasks. In architectural design, the emergence of hallucinations is closely related to three key factors: model scale, modeling paradigm, and attention mechanism.

1) Model scale. The choice of foundation model and its parameter scale largely determine the model’s ability to capture, express, and reason over knowledge. Elaraby et al.[36] pointed out that, because small-scale open-source models are constrained by their parameter count, they have difficulty effectively modeling complex knowledge, understanding context, and capturing long-range dependencies; therefore, they are more likely to produce hallucinations during generation. For example, large language models that can be deployed on consumer-grade PCs (personal computers) usually have 7B or 8B parameters, such as LLaMA-7B[37]. Compared with larger-scale models, these relatively small models may produce more hallucinations[38].

2) Modeling paradigm. In addition to model scale, the modeling paradigm is also a factor that causes large language models to generate hallucinations. Large language models based on the Transformer architecture, such as the GPT series and the LLaMA (Large Language Model Meta AI) series, widely adopt unidirectional autoregressive modeling and are mainly applied to NLP tasks. However, this causal language modeling paradigm makes each prediction depend only on the preceding tokens, making it difficult for the model to fully exploit bidirectional context and limiting its ability for knowledge integration and global semantic modeling. This limitation increases, to some extent, the risk of semantic drift in generated content[39].

3) Attention mechanism. The attention mechanism is an important consideration in constructing large language models; it directly affects the model’s expressive capacity, computational efficiency, and information-processing mode[40]. Different attention mechanisms show significant differences in contextual modeling, long-sequence processing, and the capture of complex dependency relationships[41]. Although global attention can cover the complete sequence, weight sparsification is likely to occur in long sequences, leading to a decline in the ability to focus on key information and increasing generation noise and hallucination risk. For example, when a model processes a long article, it may ignore the core argument in the text because of sparse weight distribution and instead focus on irrelevant details, thereby affecting the accuracy and consistency of text generation. By contrast, soft-attention mechanisms capture contextual information smoothly through continuous weighted summation; however, as the sequence length increases, the weight distribution tends to become uniform, likewise weakening the ability to identify important information and further exacerbating hallucination phenomena[42-43]. It can therefore be seen that analyzing the attention patterns inside the model—especially identifying regions where attention is imbalanced or dispersed—helps determine whether generated content deviates from core semantics or factual evidence, thereby improving the accuracy of hallucination detection.
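The dilution effect described in item 3) can be illustrated with a small numerical sketch. It assumes a simplified setting that is not taken from the cited works: one key token receives a fixed attention score (`key_score`, a hypothetical parameter) and every other token receives a score of zero. As the sequence grows, the softmax weight on the key token shrinks and the normalized entropy of the weights approaches 1, which is the kind of dispersed-attention signal a detector might look for.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(weights):
    """Shannon entropy divided by log(n): 0 = fully focused, 1 = uniform."""
    n = len(weights)
    if n < 2:
        return 0.0
    h = -sum(w * math.log(w) for w in weights if w > 0)
    return h / math.log(n)


def attention_over_sequence(seq_len, key_score=4.0):
    """Weights when one key token scores key_score and all others score 0."""
    return softmax([key_score] + [0.0] * (seq_len - 1))


for n in (16, 128, 1024):
    w = attention_over_sequence(n)
    print(n, round(w[0], 3), round(normalized_entropy(w), 3))
```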

2.2 Model Pretraining

Pretraining of large language models refers to the process of learning the semantics, grammar, and knowledge structure of language from large-scale unsupervised text data in order to build a foundation model with generalization capability[15]. It can thus be seen that the capability foundation of large language models mainly comes from pretraining data[44]. Clearly, data quality has an important impact on model performance[45]. Low-quality data may often introduce bias, noise, and hallucination phenomena, weakening the accuracy and reliability of the model’s generated content. Therefore, to gain a deeper understanding of hallucination-inducing factors in model pretraining, this paper discusses the influence mechanisms of data quality on hallucination generation from the perspectives of authenticity, balance, bias, timeliness, and professionalism.

1) False data. Because pretraining data come from a wide range of sources and are not subject to strict review and validation, they inevitably contain unverified false information, subjective opinions, and erroneous knowledge[15-16]. Using such data for pretraining affects the model’s capacity for knowledge representation, thereby inducing hallucination problems. Studies have shown[46] that the presence of erroneous samples in training data significantly affects the contribution distribution of tokens, ultimately causing the model’s output to deviate from facts and produce hallucinations. Based on this finding, Filippova[47] removed factually inconsistent erroneous samples from the training data, significantly reducing the occurrence of hallucinations.

2) Duplicate data. Large language models have an inherent tendency to memorize training data during the pretraining stage, and this characteristic becomes more pronounced as model scale increases[48-50]. When pretraining data exhibit imbalance and contain large amounts of duplicate or redundant information, the model may overfit these repeated features rather than understand the deep semantic structure of language through abstraction and generalization[51-52]. This causes the model, when generating content, to preferentially recall information that appears repeatedly in the training data while ignoring the actual requirements of the context, leading the generated content to deviate from facts or lack logic[53].

3) Biased data. Data bias refers to the inclusion of certain biases or tendencies in the training data, such as gender discrimination and tendencies in modes of expression. When data containing bias are learned and memorized by the model during pretraining, they affect the objectivity and accuracy of generated content and lead to hallucinations[54-55]. For example, Wan et al.[56] found that gender bias in large language models manifests in letter-of-recommendation generation tasks. Specifically, when ChatGPT generates recommendation letters, it attributes “leadership ability” to men, while attributing “being popular” or “being good at cooperation” to women. This phenomenon is closely related to the existence of social gender stereotypes in the training data.

4) Outdated data. Pretraining data often reflect historical information at the time of model training and lack dynamic updates to real-time knowledge. Since large language models themselves cannot automatically update their knowledge bases, it is difficult for them to verify or correct outdated information when generating content. When facing tasks that require dynamically updated knowledge or have high timeliness requirements, the model fills information gaps under unconstrained conditions and generates outdated content or fabricates facts by relying on statistical associations, thereby inducing hallucinations[57-58].

5) Scarcity of specialized knowledge. In addition, insufficient coverage of domain-specific knowledge in pretraining data causes general-purpose models to face problems such as inaccurate domain knowledge representation when handling tasks in vertical domains. Even after subsequent fine-tuning, in highly specialized fields such as medicine[59], law[12], and finance[60], models still struggle to generate accurate and reliable content based on factual logic, and hallucinations are especially prominent.
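As a rough illustration of the duplicate-data issue in item 2), the share of exact duplicates in a corpus can be estimated by hashing normalized documents. This sketch is an assumption of this illustration rather than a method described in the cited works; the functions `normalize` and `duplicate_ratio` and the sample corpus are hypothetical, and real pipelines typically also handle near-duplicates, which exact hashing does not catch.

```python
import hashlib
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def duplicate_ratio(documents):
    """Fraction of documents that repeat an earlier document after normalization."""
    if not documents:
        return 0.0
    counts = Counter(
        hashlib.sha256(normalize(d).encode("utf-8")).hexdigest() for d in documents
    )
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(documents)


corpus = [
    "Tom Hanks stars in Forrest Gump.",
    "tom hanks  stars in forrest gump.",  # duplicate after normalization
    "Fever is a symptom of COVID-19.",
]
ratio = duplicate_ratio(corpus)
print(f"{ratio:.2f}")
```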

2.3 Model Fine-Tuning

Because pretraining mainly relies on task-agnostic coarse-grained data, the model’s adaptability to specific tasks is weak, making it difficult to directly solve practical problems. Therefore, fine-tuning is required to improve the model’s performance on concrete tasks. However, the fine-tuning process usually adjusts parameters only at a local level, making it difficult to comprehensively reconstruct the knowledge representations already stored in the model. As a result, conflicts between old knowledge and new knowledge are not effectively resolved, thereby inducing hallucinations[61-63].

2.4 Model Alignment

Although fine-tuned models perform well on specific tasks, they may still generate content that does not conform to human expectations, needs, or values. Therefore, to improve the safety, reliability, and ethical normativity of models, model alignment is needed. Alignment techniques for large language models usually combine reinforcement learning with human feedback, incorporating that feedback into the model optimization process to guide large language models toward producing high-quality and harmless content[4]. The alignment process mainly includes two key stages: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

1) Limited capability boundaries. In the SFT stage, the model is fine-tuned on high-quality manually annotated data so that its behavior more closely matches human expectations, thereby achieving targeted optimization[4]. As with task fine-tuning, if the requirements of the alignment data exceed the model’s current capability boundaries, the model may generate content beyond its own knowledge scope, increasing the risk of hallucinations[64]. For example, when a model is used to generate recommendations related to medical diagnosis, the alignment data require it to provide professional opinions; if the model has not learned the relevant ethical norms during pretraining and fine-tuning, its limited capability boundary makes it prone to generating harmful content that violates medical ethics[65-66].

2) Belief misalignment. In the RLHF stage, a reward model is constructed by collecting human feedback, and reinforcement learning is used to optimize model behavior so that it generates content more consistent with human expectations and needs. However, recent studies have shown that models aligned through RLHF sometimes tend to placate users and, at the cost of sacrificing truthfulness, exhibit behavior that caters to user opinions[67-68]. This phenomenon is called “belief misalignment” or “sycophancy”[69]. Its cause may lie in the reward model used in RLHF placing excessive emphasis on user satisfaction, causing the model to prioritize generating flattering replies that are convincing to users rather than factually correct replies[70].

2.5 Model Inference

After training and alignment are completed, the model enters the inference stage, responding to user inputs and generating text. Given the untraceability of the external environment and its strong dependence on specific tasks, the model’s output may be disturbed by multiple factors. This section focuses on the characteristics of large language models during inference and analyzes the main problems in the model inference stage, such as random sampling, error accumulation, and overconfidence.

1) Random sampling. As noted above, sampling and decoding are key steps in the model inference stage. Sampling algorithms with greater uncertainty may make large language models more prone to hallucinations[71]. For example, in nucleus sampling, a higher probability threshold expands the vocabulary selection range and increases generation diversity, but it also increases the likelihood of selecting low-probability words, which readily leads to departures from contextual logic or facts. In addition, sampling temperature is another important factor affecting hallucination generation. Renze et al. ${}^{[72]}$ found in multiple-choice question-answering tasks that, when the temperature of GPT-3.5 increased from 0 to 1.0, its performance changed little; however, after exceeding 1.0, accuracy dropped sharply to 0. Studies show that a model’s decoding strategy usually struggles to strike a balance between factuality and diversity: increasing sampling randomness enhances content diversity, but at the same time intensifies hallucination ${}^{[70-73]}$. It is worth noting that the uncertainty introduced by sampling is not only an inducement for hallucination, but also provides important clues for hallucination detection. Uncertainty indicators such as entropy can be used to quantify the reliability of generated content.

2) Error accumulation. During inference, when a model outputs erroneous content, it often tends to maintain consistency with the generated content rather than correct the error. This phenomenon causes the initial erroneous information to be gradually expanded and repeatedly reinforced, thereby aggravating the degree of hallucination and forming the so-called “hallucination snowball” effect ${}^{[74]}$. Its root cause lies in the fact that, when generating content, the model relies heavily on contextual coherence and consistency, often failing to effectively distinguish erroneous information from correct content and lacking a strong factual verification mechanism ${}^{[75-81]}$. The long chain-of-thought (Long CoT) method has been proposed to introduce a “reflection–correction” mechanism ${}^{[82]}$, guiding the model to self-detect and correct errors in the reasoning chain during generation, thereby alleviating the “hallucination snowball” phenomenon to a certain extent.

3) Overconfidence. Large language models have difficulty perceiving the boundaries of their own factual knowledge and tend to exhibit overconfidence. Through prior- and posterior-judgment analysis, Ren et al. ${}^{[83]}$ found that models overestimate their own abilities before answering and, after answering, still tend to believe that their answers are correct; self-assessed accuracy deviates significantly from actual accuracy. Wen et al. ${}^{[84]}$ further revealed behavioral differences in confidence estimation between large-scale and small-scale models. Large-scale models such as LLaMA-3-70B tend to underestimate themselves on simple tasks but overestimate themselves on difficult tasks, whereas small models such as Gemma2-9B exhibit consistent overconfidence across tasks. This indicates that confidence can serve as an important feature for hallucination detection: outputs with abnormal confidence patterns can point to potential hallucinations.

Moreover, given problems such as delayed knowledge updates in large language models, RAG mechanisms are often introduced in research. However, even when technologies such as RAG are used, models may still produce statements that are completely unfounded or contradictory to the information provided in retrieved references ${}^{[15,85]}$, especially in NLP tasks that emphasize diversity in generated content. This is related to the low retrieval quality of RAG. Specifically, the retrieval system may return documents that are weakly relevant to the input question or insufficient in coverage, causing the model to generate answers based on incomplete or erroneous contexts and increasing the occurrence of hallucination. Alternatively, when integrating multiple retrieved documents, the model may fail to properly reconcile knowledge conflicts, resulting in inconsistent or inaccurate generated content ${}^{[86-88]}$.

Figure 2 summarizes the key nodes and influencing factors that may induce hallucinations throughout the life cycle of large language models, from design to inference. Table 2 shows the effects of different inducements in this coordinate system on model capabilities. For example, “duplicate data” may have a relatively large negative impact on a large language model’s ability to “possess knowledge,” whereas “knowledge conflict” affects the model’s abilities to “possess knowledge,” “understand problems,” and “express knowledge.”

Diagram labels in Fig. 2: model architecture design; model pretraining; model fine-tuning; model alignment; model inference. Symbols indicate: ※ model scale; ■ modeling paradigm; * attention mechanism; ▲ false data; △ duplicate data; ▽ biased data; ▼ outdated data; ◆ lack of domain expertise; ◇ knowledge conflict; ◎ limited capability boundary; ○ belief misalignment; ★ random sampling; ☆ error accumulation; ⊙ overconfidence; ● low retrieval quality; ◢ knowledge conflict.

Fig. 2 Causes of hallucination generation in large language models

Table 2 Analysis of Hallucination Causes

Model capability	※	■	*	▲	△	▼	▽	◆	◇	◎	○	★	☆	⊙	●	◢
Possessing knowledge	×	×	×	×	×	×	×	×	×	×					×	×
Understanding problems	×	×	×								×
Expressing knowledge	×	×	×								×	×	×	×

Note: The meanings of the symbols in the first row are as shown in Fig. 2; “×” indicates that the inducement leads to hallucination by affecting the corresponding capability of the model.

As can be seen from Table 2, hallucination generation stems from the interaction of multiple factors. Its causes are hard to trace, multidimensional, and strongly coupled, so the ultimate cause can rarely be identified along a single dimension. As a result, multiple types of hallucination may be superimposed during inference, especially in complex tasks such as long-text generation and multi-turn dialogue.

At present, there is still a lack of systematic definitions and in-depth research on this phenomenon. Based on existing studies, this paper preliminarily summarizes it as composite hallucination. Composite hallucination refers to the phenomenon in which multiple types of hallucinations coexist and overlap within the same generated fragment, manifesting as the mixed co-occurrence of semantic deviation, factual errors, and contextual conflict. Figure 3 shows an example in which a large language model simultaneously exhibits semantic-faithfulness hallucination and factual-consistency hallucination in its answer. Specifically, the user asks about Qu Yuan’s literary achievements and whether he participated in the war of unification, but the model incorrectly describes Qu Yuan as a military strategist who participated in Qin’s unification of the six states, thereby both deviating from the intent of the question and contradicting historical facts.

User: Briefly introduce Qu Yuan’s major literary achievements. Did he participate in Qin Shi Huang’s war to unify the six states?

LLMs: Qu Yuan was a famous military strategist and politician during China’s Warring States period. He led Qin’s war of unification and successfully helped Qin Shi Huang complete the unification of the six states.

Explanation: The focus of the user’s question is Qu Yuan’s literary achievements, and it asks to verify whether he participated in the war of unification. However, the model instead answers with Qu Yuan’s achievements in the military domain and describes him as leading the war to unify the six states. The answer both deviates from the user’s intent and is inconsistent with the facts.

Fig. 3 Examples of composite hallucination in large language models

3 Hallucination Detection Methods

Hallucination in large language models not only undermines model credibility, but may also cause users to develop incorrect understandings. Therefore, hallucination detection has become a key technology for ensuring that large language models are reliable and trustworthy. This section reviews the relevant literature and summarizes hallucination detection methods for large language models.

Existing studies show that differences in hallucination types, model application scenarios, and user requirements lead to significant differences in the steps and methods used for hallucination detection[16-17]. Based on practical application needs, and considering differences in the transparency of large language models across different task scenarios as well as the detector’s ability to access internal model information, hallucination detection methods can be divided into methods for white-box models and methods for black-box models.

Because composite hallucinations combine multiple hallucination types and superimposed features, they are more difficult to detect. Existing studies mainly identify single-type hallucinations; accordingly, the hallucination detection methods reviewed in this paper all target single hallucination types.

3.1 Hallucination Detection for White-Box Models

White-box models provide relatively high transparency and interpretability. By accessing the internal states of the model, one can gain an in-depth understanding of the reasoning process and generation mechanism of large language models. Hallucination detection methods for white-box models use the state information produced while the model generates text, including hidden-layer activations, logits values, entropy values, attention weights, and gradients, to identify hallucinations in the generated text. The hallucination detection framework for white-box models is shown in Fig. 4.

In-figure labels: User query → LLM → Response; Normal information; Attention weights; Hidden-layer activations; logits values; Gradients; Entropy values; Abnormal information; Detect hallucinations by difference comparison.

Fig. 4 Hallucination detection framework for white-box models

1) Hidden-layer activations. Hidden-layer activations are the core data of a model’s internal state and directly affect the model’s performance and output quality. Therefore, Rateike et al.[89] performed statistical tests of distributional differences in hidden-layer activations, computing the degree of deviation between the activation distributions of hallucinated and non-hallucinated texts to detect whether factual-consistency hallucinations exist in generated text. Specifically, this method analyzes the left-tail, right-tail, and two-tail distributions of hidden-layer activations to locate anomalous activation units and their corresponding input features, thereby capturing intermediate representation patterns that may lead to hallucinations. Deeper layers of the model contain more factual information, whereas shallow-layer distributions contain more noise[90]. Taking advantage of this characteristic, Chuang et al.[91] projected the hidden-layer activations of shallow and deep layers into the vocabulary space to generate pseudo logits, and detected whether generated content deviates from the model’s internal knowledge (that is, whether hallucinations are present) by comparing numerical differences. The more significant the distributional difference between deep and shallow layers, the greater the probability that the generated content contains factual hallucinations.
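The comparison between shallow-layer and deep-layer pseudo logits can be sketched as follows. This is a minimal illustration under stated assumptions: the pseudo logits are hypothetical numbers, and the Jensen-Shannon divergence is used here as one possible measure of distributional difference; the source does not specify which measure the cited work uses.

```python
import math

def softmax(logits):
    """Normalize pseudo logits into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

# Hypothetical pseudo logits over a four-word vocabulary.
shallow_logits = [1.0, 0.9, 1.1, 1.0]
deep_agree = [1.1, 0.9, 1.2, 1.0]   # deep layer close to the shallow layer
deep_shift = [0.2, 0.1, 5.0, 0.3]   # deep layer strongly favors one word

small_gap = js_divergence(softmax(shallow_logits), softmax(deep_agree))
large_gap = js_divergence(softmax(shallow_logits), softmax(deep_shift))
```

How such a layer-wise gap is turned into a hallucination decision (for example, the choice of layers and any threshold) depends on the specific method and would need to be calibrated on labeled data.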

2) Logits values. Logits values are the unnormalized scores computed by the output layer of a language model for each candidate word in the vocabulary during generation. They represent the model’s “relative tendency” or “raw confidence” toward candidate words under the current generation state. Through analysis and normalization of logits values, the probability distribution, uncertainty, and confidence level of the generated sequence can be obtained, thereby providing an important basis for hallucination detection.

Focusing on detecting high-risk hallucinated content can correct errors at an early stage of the generation process, thereby preventing these errors from accumulating in subsequent generation. To efficiently identify potential hallucinations, Varshney et al.[92] narrowed the scope of hallucination detection by identifying high-weight words in the text, used logits output values to compute the generation probability of each key concept, and selected the minimum value as the uncertainty score of that concept. However, this method focuses only on word-level or concept-level uncertainty; relying solely on lexical weights may not sufficiently capture semantic deviations at the contextual level. Chen et al.[93] analyzed the output probability distribution of a large language model when generating tokens at each position and used statistical features such as maximum probability to measure the confidence of its generation decisions, thereby estimating the potential risk of hallucination for individual tokens.
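The concept-level scoring described above, in which the minimum token probability within a key concept serves as that concept's score, can be sketched as follows. The token probabilities, concept spans, and the threshold of 0.5 are hypothetical values chosen for illustration, and the function names are not from the cited work.

```python
def concept_scores(token_probs, concept_spans):
    """Map each concept to the minimum probability among its tokens.

    concept_spans maps a concept name to a half-open (start, end) token range.
    """
    return {
        name: min(token_probs[start:end])
        for name, (start, end) in concept_spans.items()
    }

def risky_concepts(scores, threshold):
    """Return the concepts whose minimum token probability falls below the threshold."""
    return sorted(name for name, p in scores.items() if p < threshold)

# Hypothetical per-token generation probabilities for one response.
token_probs = [0.95, 0.91, 0.40, 0.88, 0.97, 0.12, 0.35]
concept_spans = {"Qu Yuan": (0, 2), "military strategist": (5, 7)}

scores = concept_scores(token_probs, concept_spans)
flagged = risky_concepts(scores, threshold=0.5)
```

Taking the minimum rather than the mean makes a single low-probability token sufficient to flag a concept, which suits early detection but also ignores the surrounding context, as noted above.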

The log probability of a sequence can effectively reflect the cumulative confidence of the generated sequence. A lower log probability usually indicates that the model has repeatedly selected highly uncertain words during generation, reflecting a lack of clear support by the model for the generated content. In this case, the generated content is more likely to contain factual-consistency hallucinations. However, because longer sequences accumulate more log-probability terms, their overall log-probability values are usually lower; this phenomenon may introduce bias into the results of model uncertainty evaluation. To address this problem, many studies[94-96] have introduced length-normalization strategies, normalizing log probabilities to eliminate the influence of sequence length on uncertainty evaluation, thereby improving the accuracy of hallucination detection for generated sequences of different lengths.
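The length bias and its correction can be made concrete with a short sketch. It assumes the simplest form of length normalization, namely dividing the sequence log probability by the number of tokens; the cited studies may use other variants, and the token probabilities are hypothetical.

```python
import math

def sequence_logprob(token_probs):
    """Sum of token log probabilities, i.e. log P(x | p) under Eq. (2)."""
    return sum(math.log(p) for p in token_probs)

def normalized_logprob(token_probs):
    """Average log probability per token (length-normalized)."""
    return sequence_logprob(token_probs) / len(token_probs)

# Two hypothetical sequences generated with identical per-token confidence.
short_seq = [0.9] * 5
long_seq = [0.9] * 50

raw_short, raw_long = sequence_logprob(short_seq), sequence_logprob(long_seq)
norm_short, norm_long = normalized_logprob(short_seq), normalized_logprob(long_seq)
```

The raw score of the longer sequence is ten times lower even though the model is equally confident at every step, whereas the normalized scores coincide.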

3) Entropy values. Entropy values are used to evaluate the quality and diversity of the text generated by language models. High entropy often means that the generation process involves relatively large uncertainty and that the model has low confidence in the current output; in this situation, the generated content is more likely to deviate from facts, thereby causing hallucinations. In view of this, Xiao et al. [97] analyzed the entropy value of each word in a generated sequence and found that high entropy values often correspond to factual-consistency hallucinated content, making it possible to detect hallucinations based on entropy values. Similarly, Su et al. [98] proposed a real-time factual-consistency hallucination detection method. By analyzing the probabilities and entropy of named entities in generated content, their method can precisely locate the source of hallucinations. Van Der Poel et al. [99] further verified that when conditional entropy is high, the model is more likely to generate hallucinated content that is inconsistent with the source document. Therefore, they used the model’s predictive probability distribution to compute conditional entropy, thereby detecting semantic-faithfulness hallucinations. In addition, they dynamically adjusted the decoding strategy according to conditional entropy, reducing hallucination generation under high uncertainty and significantly improving the faithfulness of generated content.
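A minimal sketch of entropy-based flagging is given below. It computes the Shannon entropy of the predictive distribution at each position and marks positions above a threshold; the distributions and the threshold of 1.0 nats are hypothetical, and the sketch does not reproduce any specific cited method.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_high_entropy(distributions, threshold):
    """Return the positions whose predictive entropy exceeds the threshold."""
    return [t for t, probs in enumerate(distributions) if token_entropy(probs) > threshold]

# Hypothetical predictive distributions at three generation steps.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over four candidates
    [0.70, 0.10, 0.10, 0.10],  # moderately confident
]

suspect_positions = flag_high_entropy(distributions, threshold=1.0)
```

In practice the threshold would need to be tuned, and restricting the check to named-entity positions, as in the entity-focused method described above, reduces false alarms on function words.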

4) Attention weights. Attention weights also reflect the model’s internal state. Chuang et al. [100] hypothesized that hallucinations usually occur when the model pays more attention to its generated content than to the provided context. Based on this hypothesis, they proposed an attention-based hallucination detection method: during generation, for each attention head in each layer, it computes the ratio of the attention weight placed on the context to that placed on the generated tokens, which quantifies how strongly the generated content depends on the context and thereby helps identify hallucinated content. Sriramanan et al. [101] designed a lightweight and efficient detection metric by analyzing changes in the attention map of a single response, enabling hallucination detection without requiring multiple samples.
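The context-versus-generation attention ratio can be sketched for a single head at a single decoding step as follows. The normalization used here (mean attention on context tokens divided by the sum of the two means) is an assumption made for illustration, as are the attention values and the function name.

```python
def context_attention_ratio(attn_row, n_context):
    """Share of (mean) attention that one head places on context tokens.

    attn_row holds the head's attention weights at the current step:
    the first n_context entries cover the context, the rest cover
    previously generated tokens.
    """
    context, generated = attn_row[:n_context], attn_row[n_context:]
    mean_ctx = sum(context) / len(context)
    mean_gen = sum(generated) / len(generated) if generated else 0.0
    total = mean_ctx + mean_gen
    return mean_ctx / total if total > 0 else 0.0

# Hypothetical attention rows: four context tokens, two generated tokens.
grounded = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]      # mostly attends to the context
drifting = [0.05, 0.05, 0.05, 0.05, 0.4, 0.4]  # mostly attends to its own output

ratio_grounded = context_attention_ratio(grounded, n_context=4)
ratio_drifting = context_attention_ratio(drifting, n_context=4)
```

Collecting this ratio over all heads and layers yields a feature vector per step, on which a simple classifier could be trained; that step is omitted here.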

5) Multidimensional features. The above studies all perform hallucination detection based on a single feature. However, because a single feature is difficult to use to comprehensively characterize the complexity of hallucinations, it has certain limitations in terms of richness of information utilization. Combining multiple features for hallucination detection can compensate for the insufficient coverage of hallucination types by a single feature, thereby improving detection robustness and accuracy. For example, Zablocki et al. [102] used two types of features—hidden-layer activations and attention weights—to identify salient patterns that deviate from normal text. On this basis, by adopting statistical methods such as regression analysis and principal component analysis, they established a correlation model between internal states and the risk of factual-consistency hallucinations, thereby achieving factual-consistency hallucination detection. Considering that gradient features can capture fine-grained changes in the model’s sensitivity to inputs, Hu et al. [103] integrated the dual features of hidden-layer activations and gradients to model the correlation between generated content and prompts. Experiments show that this method exhibits significant advantages in improving the accuracy of semantic-faithfulness hallucination detection. In addition, Snyder et al. [104] extracted four types of features related to model generation to train a classifier for factual-consistency hallucination detection, including Softmax probability distributions, feature attribution scores, self-attention scores, and fully connected layer activations. However, experiments show that combinations of different features did not significantly improve classification performance.

A likely reason is that substantial information redundancy exists among different features, so the combined features fail to give the classifier significant additional information gain. For externally dependent hallucinations, ReDeEP [105] uses attention weights and logits values to decouple the contributions of external context and parametric knowledge to content generation, and detects hallucinations on that basis.

Overall, hallucination detection methods for white-box models achieve relatively high transparency and fine-grained detection capability through in-depth analysis of internal states. However, their detection effectiveness is limited by model feature selection and contextual modeling capability. At the same time, real-time analysis of features such as hidden-layer activations has high computational resource requirements and may not be suitable for large-scale or efficient application scenarios.

3.2 Hallucination Detection for Black-Box Models

Owing to considerations such as technical protection, risk control, and resource management, large language models such as GPT-4 [32] and Gemini [106] generally adopt closed-source strategies and provide services only through APIs, so hallucination detection methods for white-box models are no longer applicable. Hallucination detection methods for black-box models, by contrast, do not require knowledge of the model’s internal structure, can be applied to various types of large language models, and offer high flexibility and scalability.

Hallucination detection for black-box models centers on external verification. Its aim is to evaluate the authenticity, logicality, and consistency of generated content by analyzing the mapping relationship between inputs and outputs and incorporating external knowledge sources, without relying on the model’s internal structure or parameter information. According to the availability of external resources, hallucination detection methods can be further subdivided into two categories: zero-resource and non-zero-resource.

3.2.1 Zero-Resource Hallucination Detection Methods

The term “zero-resource” indicates that there are no external resources or auxiliary tools available for verification. Therefore, zero-resource hallucination detection methods refer to detecting whether generated content contains hallucinations by analyzing the inputs and outputs of large language models without relying on external knowledge bases or data sources. Such methods emphasize using input design or information from the model itself to complete detection, rather than introducing external knowledge support. The zero-resource hallucination detection framework for black-box models is shown in Fig. 5.

1) Transfer of traditional methods

Since hallucination is a common problem in NLP, and hallucination detection methods in traditional NLP tasks themselves possess a certain degree of generality, the detection ideas of traditional methods can be transferred to large language models.

(a) Transfer of traditional methods
User query; LLM; response; NLP method; identify; hallucination

(b) Generate multiple responses
User query; LLM; response 1; response 2; response 3; compare; hallucination

(d) Based on large models
User query; LLM; response; LLM′; identify; hallucination

Fig. 5 Zero-resource hallucination detection framework for black-box models

Bhamidipati et al. [107] formalized the hallucination detection task as a natural language inference (NLI) task, using NLI models to detect semantic relations such as unidirectional entailment and bidirectional entailment between inputs and outputs. This enables in-depth analysis of semantic consistency between generated text and source text, rather than relying only on surface-level similarity. In addition, Rashad et al. [108] constructed knowledge graphs from the input text and the generated text, respectively, and aligned the two graphs to measure the semantic consistency between the generated content and the input content. However, when large-scale text is generated, the computational cost increases significantly. To this end, Sansford et al.$^{[109]}$ proposed a hallucination-detection method, GraphEval, which requires only one call to a large language model to construct a knowledge graph, thereby greatly improving detection efficiency. At the same time, this method combines knowledge graphs with an NLI model, enabling it to directly detect the specific locations where hallucinations occur and enhancing the interpretability of hallucination detection. In addition, Durmus et al.$^{[110]}$ proposed masking key information in summary sentences to generate corresponding “standard” answers, and converting the masked sentences into natural-language questions. A pretrained question-answering model is then used to extract answers from the original text, and these are matched against the “standard” answers to the questions generated from the summary. Based on the accuracy of answer matching, an $F1$ score is computed as the faithfulness score of the summary. If the answers extracted by the model from the original text poorly match the answers in the summary, this indicates that the summary may contain hallucinated information.
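The answer-matching step of the question-answering approach can be sketched as below. The sketch assumes whitespace tokenization and a token-level $F1$ averaged over questions; the question-generation and question-answering models are omitted, and the answer pairs are hypothetical.

```python
from collections import Counter

def token_f1(predicted, reference):
    """Token-level F1 between an answer from the source text and the summary's answer."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def faithfulness_score(answer_pairs):
    """Average F1 over (answer from source, answer from summary) pairs."""
    return sum(token_f1(a, b) for a, b in answer_pairs) / len(answer_pairs)

# Hypothetical answer pairs for two questions generated from a summary.
answer_pairs = [
    ("the Chu state", "the Chu state"),  # source and summary agree
    ("Qin", "Chu"),                      # source contradicts the summary
]
score = faithfulness_score(answer_pairs)
```

A low average indicates that the source text does not support the answers implied by the summary, which is the signal this family of methods relies on.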

Even so, using traditional hallucination-detection methods to identify hallucinations in large language models still suffers from shortcomings such as poor generalization. Traditional methods are usually designed for specific tasks, are difficult to adapt to the diversified application scenarios of large language models, and exhibit weak cross-domain and cross-task generalization. It can thus be seen that relying solely on traditional methods is insufficient to meet the practical needs of large language models in broad generation tasks; more general and efficient hallucination-detection methods urgently need to be designed to accommodate more diverse tasks.

2) Generate multiple responses

In addition to traditional hallucination-detection methods, one can also exploit the ability of large language models to support multi-turn question answering by designing systematic question-answering interaction processes, so as to effectively detect hallucinations in generated content.

One representative class of methods generates responses multiple times and analyzes sample consistency to evaluate the reliability of generated content. The principle is that, when a language model has relatively strong internal knowledge or support regarding a topic or factual information, the responses generated through random sampling should exhibit high consistency across semantic, factual, logical, and other dimensions. Conversely, if the generated content involves hallucinations, there are often significant differences among the results generated by the model through multiple sampling runs. Based on this idea, many hallucination-detection methods have been derived.

Farquhar et al.$^{[111]}$ analyzed the semantic similarity of responses generated by multiple random samples from the perspective of semantic entropy, thereby detecting semantic-consistency hallucinations in generated content. Semantic entropy is an indicator that measures the distribution of generated text outputs grouped into semantic equivalence classes. This method solves the problem that traditional entropy computation is not suitable for language-generation tasks. Manakul et al.$^{[112]}$ generated multiple sampled responses by adjusting random seeds or using various decoding strategies such as Top-$K$ sampling. They then constructed five variants to detect hallucinations in question-answering tasks from different dimensions, including semantics, facts, and logic. In addition, Elaraby et al.$^{[36]}$ provided a lightweight detection method—HALOCHECK—that performs sentence-level fine-grained analysis on content generated by multiple sampling runs. By capturing conflicts among these samples, it quantifies the consistency among generated samples; the lower the consistency score, the higher the hallucination risk. Hallucination-detection methods for long texts often divide a long text into multiple facts and separately compare the consistency of each pair of facts. However, these methods have difficulty achieving alignment among multiple facts and ignore dependencies among multiple contextual facts. Fang et al.$^{[113]}$ extracted knowledge triples from generated text and modeled a graph structure, capturing dependencies among triples so as to enhance the ability to model multiple contexts.
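The idea of entropy over semantic equivalence classes can be sketched as follows. The equivalence test is the critical component: the sketch uses a naive normalized string comparison as a stand-in (an assumption made here for brevity), whereas a practical system would use an entailment-based judgment. The sampled responses are hypothetical.

```python
import math

def cluster_by_meaning(responses, same_meaning):
    """Greedily group responses into classes using a pairwise equivalence test."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            if same_meaning(response, cluster[0]):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters

def semantic_entropy(responses, same_meaning):
    """Entropy (in nats) of the empirical distribution over meaning classes."""
    clusters = cluster_by_meaning(responses, same_meaning)
    n = len(responses)
    return sum((len(c) / n) * math.log(n / len(c)) for c in clusters)

def naive_same_meaning(a, b):
    """Stand-in equivalence test: case-insensitive match ignoring a trailing period."""
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")

consistent = ["Paris", "paris.", "Paris "]  # hypothetical samples that agree
divergent = ["Paris", "Lyon", "Nice"]       # hypothetical samples that disagree

low = semantic_entropy(consistent, naive_same_meaning)
high = semantic_entropy(divergent, naive_same_meaning)
```

Because differently worded but equivalent answers fall into the same class, the score stays at zero for the consistent samples, whereas a token-level entropy would penalize the surface variation.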

3) Reverse verification

In addition, another class of methods performs reverse generation after forward answer generation; that is, it reconstructs a query from the generated answer and evaluates the consistency between the generated query and the original query. The premise for applying such methods is that the parameters of the language model store entities and their related knowledge. By converting the content generated by the model into a query statement, it is verified whether the model can return an entity consistent with the initially generated content. If the generated content contains hallucinations, converting it into a query will lead to erroneous search conditions, making it impossible to retrieve the correct entities.

Yang et al.[114] designed two reverse-verification methods, namely question-generation-based reverse verification and entity-matching-based reverse verification. The former prompts a language model to construct a question from the generated content, requires the model to answer the question and return an entity, and then determines whether the returned entity is consistent with the original entity. The latter rewrites the information in the generated content into a series of feature requirements, prompts the model to return entities satisfying these requirements, and asks the model to report the degree of match between the returned entities and the requirements; if the match is below a preset threshold, the generated content is judged to be a factual hallucination. Similarly, studies [115–118] generate questions and then answer them, using methods such as NLI models or $F1$ scores to evaluate answer consistency. The InterrogateLLM method[119] uses a bidirectional mechanism of forward generation and reverse verification: an embedding model converts the original query and the reconstructed query into vectors, computes the cosine similarity between the original query and the reconstructed query, and thereby detects hallucinations. Inspired by the cross-examination mechanism in legal simulations, Cohen et al.[120] proposed a zero-resource black-box hallucination detection method based on the principle of cross-examination between questions, modeling factuality detection of language-model generation as an interaction between two models: one language model generates statements or answers, and another language model verifies these answers by asking questions. To improve the consistency-verification capability for knowledge triples in generated content, Fang et al.[113] proposed three subtasks around each knowledge triple (head entity, relation, tail entity): head-entity verification based on question generation, relation reconstruction, and tail-entity selection. 
By constructing a reverse reconstruction–verification mechanism, this method achieves fine-grained consistency checking of generated knowledge.
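The query-reconstruction check can be sketched as follows. A bag-of-words vector stands in for the embedding model, which is an assumption made to keep the example self-contained; the queries and function names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Stand-in embedding: word counts of the lower-cased text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def query_similarity(original_query, reconstructed_query, embed=bag_of_words):
    """Similarity between the original query and the one rebuilt from the answer."""
    return cosine_similarity(embed(original_query), embed(reconstructed_query))

original = "Who wrote Li Sao?"
faithful = "Who wrote Li Sao?"              # reconstruction matches the original
drifted = "When did Qin unify the states"   # reconstruction reflects a hallucinated answer

sim_faithful = query_similarity(original, faithful)
sim_drifted = query_similarity(original, drifted)
```

A similarity below a chosen threshold suggests that the answer does not correspond to the question actually asked; the threshold and the embedding model both require task-specific selection.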

The above methods have advantages such as simple operation and strong extensibility, and can address the “omission problem” that may arise in traditional consistency detection. However, whether using multiple generation sampling or reverse verification, these methods place relatively high demands on computational resources; in particular, when applied to large language models with many parameters and to large-scale datasets, they may encounter performance bottlenecks.

4) Detection based on large language models

Because extensive factual knowledge is encoded in the parameters of large language models, LLMs are often used as tools for fact checking. At the same time, LLMs have strong instruction-following capabilities—that is, the ability to complete relevant tasks according to specific instructions provided by users[121–122]. Combining these characteristics, models can not only generate content but also check and evaluate the content they themselves generate[123–125].

Gao et al.[124] used large language models such as ChatGPT as automated evaluation tools. By providing detailed task instructions and scoring criteria, they enabled the model to simulate the human evaluation process and detect whether factual-consistency hallucinations exist in generated content. Similarly, Adlakha et al.[123] verified the potential of large language models as automated evaluation tools. Specifically, by providing LLMs with clear instructions—such as explicit scoring criteria and task backgrounds—and by combining model-generated content with knowledge-source content, the models can generate evaluation results that are highly consistent with human assessments. However, Adlakha et al. also pointed out that when LLMs are used as evaluation tools, their performance is still affected by task complexity and input quality; for example, in tasks with high diversity or linguistic ambiguity, the models may exhibit evaluation bias. To further improve detection accuracy, Jain et al.[125] selected representative in-context examples and embedded them into prompts, enabling LLMs to imitate human scoring patterns and score generated text efficiently and accurately. Experiments show that LLMs exhibit performance highly correlated with human annotators in consistency and relevance evaluation, and can capture factual errors and logical defects in generated content.

To improve the interpretability of hallucination detection, methods such as chain-of-thought or chain of verification are commonly incorporated into the detection process. These stepwise reasoning methods explicitly decompose complex tasks or progressively verify the logical consistency and factuality of generated content. For example, Luo et al.[126] provided ChatGPT with source documents and generated summaries, and combined CoT techniques[80] to guide the model through step-by-step reasoning; after explaining the reasoning process in detail, the model makes a judgment to detect whether factual-consistency hallucinations exist in the generated content. Dhuliawala et al.[127] introduced a verification chain that generates a series of verification questions based on the query and the initial answer, in order to detect factual errors in the answer. Luo et al.[128] extracted core concepts from input instructions, required the model to explain and reason about these core concepts, and quantified the model’s familiarity with the concepts, thereby measuring the uncertainty of the model’s output and preventing potential factual-consistency hallucinations.

In addition, Agrawal et al.[129] performed hallucination detection through indirect queries. The core idea is to detect hallucinations by posing open-ended questions rather than directly verifying questions. Compared with direct queries, indirect queries generate multiple detailed answers about the cited content and can capture more complex hallucination phenomena. A single model may be biased toward specific tasks or data; verification by ensembling different models[130] can reduce the impact of such bias.

Although such methods are highly automated, easy to implement, and strongly extensible, they depend heavily on the language model’s own reasoning ability and generation quality. If there are gaps or contradictions in the model’s internal knowledge base, these methods may not effectively identify hallucinations.

In summary, when external resources are unavailable, impractical, or computationally expensive, zero-resource hallucination detection methods have clear advantages. Such methods provide a lightweight and scalable solution. However, because they lack external knowledge support, these methods depend heavily on the quality of input design and the language-analysis capability of large language models. When the input design is overly complex or the model’s expressive ability is insufficient[131], the detection results may fail to effectively identify complex or implicit hallucinations.

3.2.2 Non-zero-resource hallucination detection methods

Non-zero-resource hallucination detection methods rely on external knowledge sources or auxiliary tools. By introducing external resources such as knowledge bases, they identify hallucinations in generated content. This section divides non-zero-resource hallucination detection methods into two categories: detection methods based on external databases and detection methods based on classifiers. The non-zero-resource hallucination detection framework for black-box models is shown in Fig. 6.

Fig. 6 Non-zero-resource hallucination detection framework for black-box models: (a) based on external databases; (b) based on classifiers. Labels in the figure: user query; retrieved documents; LLM; response; retrieval comparison; hallucination; decomposition; dataset; training; classifier; identification.

1) Detection methods based on external databases

Detection methods based on external databases identify possible hallucinations in generated content by comparing and verifying the generated content against external databases.

Generated content usually needs to be preprocessed to improve the accuracy and effectiveness of matching against external databases. This step is particularly important because raw generated content often differs substantially from the entries of a knowledge base and is difficult to compare with them directly. Some studies have therefore focused on processing generated text to optimize matching efficiency and improve analytical reliability. For example, Son et al. [132] quantitatively evaluated the degree of deviation between generated content and a knowledge base by modeling hallucination risk. However, generated text is usually lengthy and does not delimit individual facts at a clear, fine-grained level; meanwhile, the fact-checking process also faces the problem of insufficient evidence. To address these challenges, Min et al. [133] first segmented generated text into atomic factual units and used a retrieval system to extract relevant evidence from a specified knowledge base. They then verified whether each factual unit was supported, thereby achieving fine-grained detection of factual-consistency hallucinations. Similarly, Chern et al. [134] developed FacTool, a task- and domain-agnostic framework. By decomposing generated content into independent factual claims and combining retrieval from external knowledge bases, text-similarity matching, and NLI techniques, this framework checks the claims against the facts one by one, thereby detecting hallucinations in generated content. In addition, the RefChecker framework [135] introduces a triplet extraction method, which decomposes complex text into semantically independent units and verifies, item by item, whether the triplets are supported by references. From the perspective of hallucination classification, Mishra et al. [136] proposed a fine-grained hallucination classification method covering multiple complex error types.
This method relies on external knowledge-base retrieval and comparison, and combines an editing model to accurately detect and repair errors, further improving the quality-control mechanism for generated content. Li et al. [137] used logical reasoning rules to transform and extend facts in a knowledge base. Such logical reasoning can construct more complex and more comprehensive inferred facts for generating more scenario-based test cases.
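The decompose-then-verify pattern shared by several of the methods above (segmenting text into atomic factual units and checking each against a knowledge base) can be sketched as follows. The sketch is a toy stand-in: sentence splitting replaces LLM-based decomposition, and a strict token-containment test replaces retrieval and NLI. The function names and the containment rule are assumptions for illustration only.

```python
import re
from typing import List, Sequence, Tuple

def split_into_claims(text: str) -> List[str]:
    """Toy decomposition: treat each sentence as one factual unit."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_supported(claim: str, knowledge_base: Sequence[str]) -> bool:
    """Toy verifier: a claim counts as supported only if every one of
    its tokens appears in a single knowledge-base entry."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return False
    return any(claim_tokens <= _tokens(entry) for entry in knowledge_base)

def verify_claims(text: str,
                  knowledge_base: Sequence[str]) -> Tuple[List[Tuple[str, bool]], float]:
    """Return per-claim verdicts and the fraction of supported claims."""
    claims = split_into_claims(text)
    verdicts = [(claim, is_supported(claim, knowledge_base)) for claim in claims]
    supported = sum(ok for _, ok in verdicts)
    score = supported / len(claims) if claims else 0.0
    return verdicts, score
```

Reporting the fraction of supported units makes the result comparable across texts of different lengths, and the per-claim verdicts localize which statements lack support.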

In addition, the quality of external databases is crucial to the effectiveness of hallucination detection. To improve hallucination detection accuracy, Bayat et al. [138] proposed a detection method that relies on external knowledge bases and Web retrieval. This method uses knowledge graphs for structured querying to obtain directly supported, high-confidence evidence, and supplements fact verification for knowledge-graph deficiencies through Web retrieval. This dual-verification strategy combines structured knowledge with dynamic open-domain information, enabling effective detection of factual-consistency hallucinations across multiple generation tasks and providing a basis for revision. The above methods usually assume that the retrieved evidence is reliable and do not subdivide evidence categories during analysis, which may lead to misjudgment. Taking this into account, Halu-J filters out completely irrelevant content through evidence classification, extracts useful parts from partially irrelevant evidence, and conducts in-depth analysis of highly relevant evidence [139]. For misleading evidence, the model design allows it to be understood as “highly relevant but confusing content,” avoiding misjudgment while improving detection robustness.

Existing detection methods based on external databases often rely on a single knowledge source, leading to problems such as insufficient knowledge coverage and inability to handle multiple evidence types. To this end, Zhao et al. [140] improved the accuracy and reliability of detection results by retrieving and integrating information from multiple evidence sources. Zhang et al. [141] solved the problem of a single knowledge source by combining multiple types of knowledge, thereby enhancing the generalizability of detection. By using structured or reliable external knowledge bases, hallucination detection capability is significantly improved, especially in open-domain and highly complex tasks, where authoritative and verifiable evidence is provided. However, current hallucination detection methods usually need to retrieve a large amount of relevant evidence, resulting in the drawback of excessively long response times. To reduce computational cost, Wang et al. [142], based on Bayesian sequential analysis, evaluated in real time whether the current evidence was sufficient by progressively retrieving documents, thereby dynamically deciding whether to continue retrieving more documents and reducing the average number of retrieved documents under the premise of the same accuracy.
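The idea of retrieving documents progressively and stopping once the evidence is sufficient can be sketched with a simple Bayesian update. This is not the procedure of Ref. [142]; it assumes that each retrieved document yields a binary support signal and that a document agrees with the truth with a fixed, assumed probability `reliability`. The thresholds are likewise illustrative.

```python
from typing import Iterable, Tuple

def sequential_verify(signals: Iterable[bool],
                      reliability: float = 0.8,
                      prior: float = 0.5,
                      upper: float = 0.95,
                      lower: float = 0.05) -> Tuple[float, int]:
    """Consume support signals one document at a time and stop as soon
    as the posterior that the claim is true leaves (lower, upper).

    Returns (posterior, number of documents actually used)."""
    posterior = prior
    used = 0
    for supports in signals:
        used += 1
        # Likelihood of this signal if the claim is true / false.
        like_true = reliability if supports else 1.0 - reliability
        like_false = 1.0 - reliability if supports else reliability
        # Bayes' rule for a binary hypothesis.
        numerator = like_true * posterior
        posterior = numerator / (numerator + like_false * (1.0 - posterior))
        if posterior >= upper or posterior <= lower:
            break  # evidence is sufficient; skip further retrieval
    return posterior, used
```

Because `signals` may be a lazy iterator, documents after the stopping point are never retrieved, which is where the reduction in average retrieval cost comes from.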

2) Hallucination detection methods based on classifiers

Hallucination detection methods based on classifiers train classifiers by constructing appropriate hallucination datasets, and then use the trained classifiers for hallucination detection. Current research mainly improves classification performance from dimensions such as constructing high-quality and diversified training datasets and improving classifiers.


In constructing training datasets, some studies focus on how to perform high-quality data annotation. Zhou et al.[143] constructed a dataset for training hallucination-detection classifiers through two approaches: synthetic data generation and manual annotation. To improve the efficiency of data annotation, Wojciech et al.[144] proposed a method for automatically generating annotated data: sentences are extracted from source documents, and data that are factually correct and factually incorrect are generated through semantic transformations. To detect factual hallucinations in content generated for cross-lingual summarization tasks, Qiu et al.[145] proposed a hallucination-detection method based on a multilingual faithfulness metric. Specifically, the method extracts key facts from “document–summary” pairs, uses an English faithfulness-measurement tool to annotate faithfulness scores, and translates the annotated dataset into target languages to generate multilingual training datasets. Subsequently, a multilingual BERT-based classifier is used to detect the faithfulness scores of cross-lingual summaries, thereby quantifying whether the text contains hallucinations and addressing the problems that existing hallucination-detection methods suffer from insufficient annotation in low-resource languages and poor generalization in cross-lingual generated-text detection.

Other studies focus on how to generate high-quality hallucinated and non-hallucinated data so as to train more accurate hallucination-detection classifiers. HaloScope[146] combines embedding decomposition with a binary classifier, using data generated by unlabeled large language models to achieve efficient hallucination detection. Its core method is built on discovering and exploiting the hallucinated-sentence subspace within the embedding space, providing a new idea for factual-consistency hallucination detection. Quevedo et al.[147] analyzed the probability distribution of generated text, extracted four key features—the minimum token probability, the average token probability, the maximum probability deviation, and the minimum probability dispersion—and combined them with supervised learning to train a classifier. This processing method has advantages such as strong generalization ability and is applicable to multiple language models and generation tasks. Cao et al.[148], targeting the problem of entity hallucination in summarization tasks, proposed a detection method based on prior and posterior probabilities. The method uses an unconditional autoregressive language model to compute the prior probability of an entity, namely the likelihood that the entity appears in the generated summary without considering the source document; at the same time, through a conditional autoregressive language model, it combines source-document information to compute the entity’s posterior probability, indicating the probability that the entity is generated with support from the context and the source document. Based on the difference between prior and posterior probabilities, a $K$-nearest-neighbor classifier is trained to distinguish the hallucinated and factual states of a given entity. 
Santhanam et al.[149] used data-augmentation techniques, such as random pairing, negation, and entity replacement, to generate dialogue responses containing both factually consistent and factually inconsistent content, which are used to train and test detection models.
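The four probability-based features attributed to Quevedo et al.[147] can be illustrated with a small feature extractor. The exact definitions of the last two features are assumptions made here: the deviation is taken as the gap between the most probable token and the generated token at a position, and the dispersion as the gap between the most and least probable tokens at a position. The input format is also hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple

# One generation step: (probability of the generated token,
#                       full probability distribution over the vocabulary)
Step = Tuple[float, Sequence[float]]

def probability_features(steps: Sequence[Step]) -> Dict[str, float]:
    """Compute four sequence-level features from per-token probabilities."""
    chosen = [p for p, _ in steps]
    return {
        "min_token_prob": min(chosen),
        "avg_token_prob": mean(chosen),
        # Assumed definition: largest gap between the top token and the
        # token that was actually generated.
        "max_prob_deviation": max(max(dist) - p for p, dist in steps),
        # Assumed definition: smallest gap between the top and bottom
        # tokens of the distribution.
        "min_prob_dispersion": min(max(dist) - min(dist) for _, dist in steps),
    }
```

The resulting feature vector can then be passed to any supervised classifier trained on labeled hallucinated and non-hallucinated outputs.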

In improving classifiers, many studies have innovated classifier forms and perform hallucination detection from different dimensions and granularities. Zha et al.[150] proposed a unified alignment evaluation function for assessing semantic, factual, and logical consistency between generated text and input text. Specifically, it treats text alignment as a process of computing a continuous alignment score, segments long text by contextual chunks, and evaluates sentence by sentence the degree of alignment between each sentence in the generated text and the input text. Sentences with low alignment scores are usually regarded as hallucinated content. To address the hallucination problem in news-title generation, where the title is inconsistent with the news content, Shen et al.[151] proposed ExHalder, a hallucination-detection method based on NLI and natural language explanation. This method models the relationship between the title and the news content as an NLI task, constructs a unified inference classifier to evaluate whether the title is supported by the news content, and at the same time enhances classifier performance by generating natural-language explanations. Choi et al.[152] used Monte Carlo tree search to simulate future generation paths, computed a knowledge-consistency score for each path, and guided the optimal decoding strategy by evaluating the combined consistency of the current and future states. Meanwhile, by training a classifier to detect the starting points of hallucinations in generated sequences, all tokens from the inflection point onward are marked as potentially hallucinated, thereby providing fine-grained token-level knowledge-consistency scores. Qiu et al.[153] proposed a decoding method that combines hypothesis verification to detect hallucination problems in generated text. 
At each decoding step, the currently generated sequence (called the “backward hypothesis”) and possible future sequences that may be generated (called the “forward hypothesis”) are regarded as hypotheses. A hypothesis-verification model is used to evaluate how well these hypotheses match the input facts, and decoding candidates are ranked according to the generated confidence scores.
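The chunk-and-sentence alignment scoring described for Zha et al.[150] can be sketched as follows. Only the aggregation structure is illustrated: the input is split into chunks, each generated sentence is scored against every chunk, and the best chunk score is kept per sentence. The `overlap_alignment` function is a toy placeholder for a trained alignment model, and the chunk size and threshold are arbitrary assumptions.

```python
import re
from typing import Callable, List, Tuple

def chunk_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def overlap_alignment(chunk: str, sentence: str) -> float:
    """Toy alignment score: share of sentence tokens found in the chunk."""
    sent = set(re.findall(r"\w+", sentence.lower()))
    ctx = set(re.findall(r"\w+", chunk.lower()))
    return len(sent & ctx) / len(sent) if sent else 0.0

def sentence_alignment(context: str, generated: str,
                       align: Callable[[str, str], float] = overlap_alignment,
                       chunk_size: int = 50) -> List[Tuple[str, float]]:
    """Score each generated sentence by its best-aligned context chunk."""
    chunks = chunk_words(context, chunk_size)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", generated)
                 if s.strip()]
    return [(s, max((align(c, s) for c in chunks), default=0.0))
            for s in sentences]

def flag_low_alignment(scored: List[Tuple[str, float]],
                       threshold: float = 0.5) -> List[str]:
    """Return sentences whose alignment score falls below the threshold."""
    return [s for s, score in scored if score < threshold]
```

Taking the maximum over chunks reflects the assumption that a sentence is acceptable if at least one part of the input supports it.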

In addition, Himmi et al.[154], addressing the problem that most hallucination-detection methods rely on a single type of detector, proposed an unsupervised multi-detector aggregation framework. This framework combines the scores of multiple external and internal detectors and fully exploits the complementary advantages of different detectors to capture the characteristics of different types of hallucinations.
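One simple way to combine the scores of several detectors without labels is to standardize each detector's scores and average them per sample. The sketch below shows this generic scheme; it is an assumption for illustration and not necessarily the aggregation rule used in Ref. [154].

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

def zscores(values: Sequence[float]) -> List[float]:
    """Standardize scores so detectors with different scales are comparable."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

def aggregate_detectors(score_table: Dict[str, Sequence[float]]) -> List[float]:
    """score_table maps a detector name to one score per sample, where a
    higher score means 'more likely hallucinated'. Returns the mean
    standardized score for each sample."""
    normalised = [zscores(scores) for scores in score_table.values()]
    return [mean(column) for column in zip(*normalised)]
```

Standardization needs no labels, which keeps the scheme unsupervised; samples can then be ranked by the combined score.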

In summary, non-zero-resource hallucination-detection methods introduce external knowledge sources or auxiliary tools, thereby effectively compensating for the limitations of the model’s internal knowledge and significantly improving the reliability and accuracy of detection. However, the effectiveness and applicability of such methods are limited to some extent by the quality of external resources and computational conditions. For example, when the knowledge base is outdated, incomplete, or insufficiently authoritative, misjudgments may occur; and because the introduction of external resources increases system complexity, the computational cost and time overhead of the detection process increase significantly.

To help readers understand the detection objects, detection principles, and characteristics of the above hallucination-detection methods, the reviewed literature is classified and summarized, as shown in Table 3 and Table 4.

From Table 3, it can be seen that different types of hallucinations require different detection methods. Overall, because white-box model detection methods can directly access the internal state of a model, they exhibit strong adaptability and detection capability when identifying the four types of hallucinations. However, existing black-box model detection methods infer only from input–output behavior and are limited...

Table 3 Classification of Hallucination Detection Methods

Applicable model	Category	Detection approach	Context-consistency hallucination	Semantic-faithfulness hallucination	Factual-consistency hallucination	External-dependence hallucination
White-box model		Feature-based	Refs. [92, 95, 100]	Refs. [95, 99, 103]	Refs. [89–90, 93–98, 101–102, 104]	Ref. [105]
Black-box model	Zero-resource	Transfer of traditional methods		Refs. [107–108, 110]	Ref. [109]
Black-box model	Zero-resource	Generating multiple responses	Ref. [112]	Refs. [111–112]	Refs. [36, 112–113]
Black-box model	Zero-resource	Reverse verification			Refs. [113–120]
Black-box model	Zero-resource	Based on large language models	Ref. [125]		Refs. [123, 125–130]
Black-box model	Non-zero-resource	Based on external databases			Refs. [132–142]
Black-box model	Non-zero-resource	Based on classifiers		Refs. [145, 150]	Refs. [143–144, 146–154]	Ref. [72]

Table 4 Summary of Hallucination Detection Methods

Applicable model	Category	Detection approach	Detection principle	Advantages	Limitations
White-box model	Single feature	Hidden-layer activation values	Uses the model’s internal state information to identify hallucinations in generated content	Captures deep semantic changes and offers strong fine-grained detection capability	High computational overhead; weak model transferability and generality
White-box model	Single feature	Logits values	Uses the model’s internal state information to identify hallucinations in generated content	Simple to implement and computationally efficient	Difficult to capture problems such as deep-level semantic inconsistency
White-box model	Single feature	Entropy values	Uses the model’s internal state information to identify hallucinations in generated content	Entropy values can be combined with derived indicators such as conditional entropy to enable fine-grained assessment and dynamic adjustment	Overly dependent on the accuracy of the probability distribution
White-box model	Single feature	Attention weights	Uses the model’s internal state information to identify hallucinations in generated content	Can explicitly measure the degree to which generated content depends on the input context; relatively strong interpretability	Bias in the attention mechanism itself can easily cause misjudgment
White-box model	Single feature	Gradients	Uses the model’s internal state information to identify hallucinations in generated content	Reflects, at fine granularity, the influence of input features on the output; sensitive detection	High computational overhead; weak model transferability and generality
White-box model	Multiple features		Combines multiple features of the model’s internal states for hallucination detection	Can use rich information and has relatively high detection accuracy	Information redundancy may exist among different features
Black-box model	Zero-resource	Transfer of traditional methods	Transfers the ideas of traditional hallucination-detection methods to large language models	Technically simple and reduces the cost of redesigning detection mechanisms	Weak generalization capability across domains and tasks
Black-box model	Zero-resource	Generating multiple responses	Generates responses multiple times and analyzes sample consistency to evaluate the reliability of generated content	Requires no additional annotated data; simple and easy to implement; relatively low computational cost	Noise introduced by random sampling itself may affect the analysis results, leading to detection errors
Black-box model	Zero-resource	Reverse verification	Reconstructs a query from the generated answer and evaluates the consistency between the generated query and the original query	Requires no support from external knowledge bases and has relatively strong reliability	The reverse-verification process involves highly complex syntactic or logical design, and errors are also easily introduced
Black-box model	Zero-resource	Based on large language models	Uses the internal knowledge of large language models to detect generated content	Avoids dependence on external databases or knowledge bases; highly automated, easy to implement, and strongly extensible	Highly dependent on the reasoning capability and generation quality of the language model itself
Black-box model	Non-zero-resource	Based on external databases	Compares and verifies generated content against external databases to identify possible hallucinations in the generated content	Has relatively high accuracy and credibility	Limited by the quality of external resources and computational conditions
Black-box model	Non-zero-resource	Based on classifiers	Constructs an appropriate hallucination dataset to train a classifier, and uses the trained classifier for hallucination detection	Has relatively high detection accuracy and flexibility, and a high degree of automated detection	Dataset construction is costly, and the classifier’s generalization ability may be constrained by the coverage of the dataset

In terms of detection principles, these methods often have certain limitations when identifying some specific types of hallucinations.

Specifically, at present, research on semantic-faithfulness hallucination detection based on reverse verification, large language models, and external databases remains relatively scarce. The core of reverse-verification methods is a closed-loop process of generation–reconstruction–verification, focusing mainly on the factual accuracy of generated content. Once the reconstructed content is consistent with the input facts, it is judged to be free of hallucination. However, this mechanism essentially ignores possible deviations between the semantics of the user prompt and the generated content, and therefore has limited capability in identifying semantic-faithfulness hallucinations. By contrast, when detection methods based on large language models are used to detect semantic-faithfulness hallucinations, they involve more complex semantic understanding and intent modeling, which exceeds the direct capabilities of current large language models. To effectively detect this type of hallucination, models with stronger semantic-understanding capabilities or more effective semantic-matching methods must be employed. In addition, although external databases provide explicit factual knowledge, knowledge bases usually exist in the form of factual triples or simplified text, lacking the ability to model complex semantic relations and contextual reasoning processes. Therefore, methods based on external databases have relatively weak applicability in semantic-faithfulness hallucination detection. Although current research on external-dependence hallucinations remains limited, because RAG technology can effectively support the vertical application of large language models, detection methods for external-dependence hallucinations should not be ignored.

As can be seen from Table 4, selecting an appropriate hallucination-detection method according to the specific application scenario is crucial. For example, for open-ended question answering or long-text generation tasks, priority should be given to methods capable of handling semantic-faithfulness and context-consistency hallucinations, whereas in knowledge-intensive domains such as medicine, law, and finance, greater attention must be paid to detecting factual-consistency and external-dependence hallucinations.

4 Hallucination-Detection Benchmarks

With the broad deployment of vertical large language models across multiple domains and the continuously increasing demand for reliability, developing scientific and comprehensive hallucination-detection benchmarks has become an important research direction. By defining standardized datasets, metrics, and detection procedures, such benchmarks can help researchers effectively analyze and locate unreliable content output by large language models. High-quality benchmark design must fully reflect the task-specific requirements and account for different types of hallucinations, so as to ensure both the accuracy and comprehensiveness of detection. Therefore, this section provides an in-depth discussion of existing hallucination-detection benchmarks.

It is worth noting that Ref. [15] divides hallucination benchmarks into hallucination-evaluation benchmarks and hallucination-detection benchmarks. The former focus on assessing the degree to which large language models produce hallucinations, whereas the latter are mainly used to evaluate the performance of existing hallucination-detection methods. Our survey finds that the two have similar data sources and construction processes, differing only slightly in evaluation objects and evaluation metrics. Therefore, for ease of understanding and use, hallucination-evaluation benchmarks and hallucination-detection benchmarks are reviewed together, and the characteristics of various benchmarks and their applicable scenarios are summarized.

Benchmark construction usually involves steps such as data collection, hallucination generation, and hallucination annotation. Most benchmarks directly use existing datasets. For example, the HELM benchmark[155] takes high-quality Wikipedia documents as its primary source, randomly sampling 50 000 articles from them to provide authentic and reliable corpus support for language-model generation tasks. In the hallucination-generation stage, prompt templates are usually designed and large language models are used to generate texts containing hallucinations. However, because it is impossible to directly determine whether model-generated content contains hallucinations, subsequent verification often needs to rely on an annotation stage. In this process, annotation is usually completed manually, thereby ensuring the accuracy of the results. Therefore, during benchmark construction, most workflows adopt a semi-automated form of “manual + automated” processing. In addition, only a small number of studies construct datasets in a purely manual manner, such as TruthfulQA[156].

Detection granularity can be divided into six levels, including token-level, knowledge-triplet-level, sentence-level, passage-level, dialogue-level, and semantic-level. Each level corresponds to a different depth of analysis and application scenario. At present, most research focuses on sentence-level and passage-level hallucination detection. However, when a sentence is superficially correct in grammar and logic but certain subjects or objects contain concrete factual errors, these benchmarks may have difficulty accurately identifying and locating the hallucination. Therefore, some studies attempt to construct more fine-grained hallucination-detection benchmarks. For example, the HADES benchmark[157] analyzes whether each token in the generated text accurately reflects information in the input data or knowledge base, thereby enabling the localization of specific erroneous words. Similarly, the UHGEval[158] and HalOmi[159] benchmarks not only support sentence-level hallucination detection, but can also be extended to the token level to improve detection precision. Considering that generated content often contains complex, multi-level factual statements, sentence- or token-level detection granularity may be insufficiently explicit and may easily lead to overlapping cross-factual issues. To this end, the RefChecker benchmark[160] uses structured representations of knowledge triples, such as subject, predicate, and object, to provide clear boundaries and semantic independence, thereby identifying hallucinated content more effectively. In addition, traditional sentence-level and passage-level benchmarks have difficulty capturing contextual dependencies and the global consistency of dialogue logic in multi-turn interaction scenarios. To solve this problem, the DiaHalu[161] and HalluDial[162] benchmarks consider the global context of dialogue and the logic of multi-turn interaction, evaluating hallucination-detection capability from a dialogue-level perspective. 
At the same time, the HaluBench benchmark[163] focuses not only on errors at the textual surface or syntactic level, but also on the consistency of semantic content. This benchmark constructs extremely difficult-to-detect hallucinated texts through semantic perturbation in order to test hallucination-detection capability.

In terms of detection languages, most benchmarks are constructed on the basis of English datasets. Among the many benchmarks, only UHGEval[158] and HalluQA[164] focus on Chinese data. This phenomenon arises from factors such as the language preferences of model training and the common language used in academia. As a result, hallucination-detection benchmarks have certain limitations in their applicability to Chinese and other languages. On the one hand, because there are significant differences in language structure, expression habits, entity-boundary segmentation, and other aspects, directly transferring English benchmarks to a Chinese environment may cause evaluation bias[165]. On the other hand, hallucinations may take different forms in different languages[166]; for example, phenomena that occur in Chinese, such as ambiguous entities and simplified–traditional Chinese conversion errors, are not common in English. In addition, the uneven language distribution of training corpora also causes models to be more prone to hallucinations in low-resource languages such as Basque[167]. Although benchmarks such as UHGEval and HalluQA have explored Chinese, overall, current hallucination-detection benchmarks still show clear deficiencies in cross-lingual transfer and generalization capability.

In terms of evaluation metrics, most studies use general classification metrics, such as the $F1$ score, accuracy (Acc), precision, and recall. Other studies have proposed new metrics, such as FActScore[133]. However, these metrics still fall short in covering the complexity of generation tasks and the multidimensional nature of errors, and they cannot dynamically adapt to requirements such as the complexity and real-time nature of generation tasks. Specifically, content generated by large language models may involve continued generation in multi-turn dialogue or reasonable inference based on limited input, which makes it difficult to comprehensively measure the quality of generated results by relying only on traditional metrics such as accuracy, precision, and recall. In addition, errors in generation tasks are usually multidimensional, whereas these metrics mainly focus on single-dimensional issues such as semantic consistency or factual consistency, making it difficult to effectively capture the diverse and complex hallucination phenomena that arise during generation. Therefore, researchers are increasingly inclined to develop benchmarks that cover multidimensional evaluation criteria, in order to capture potential hallucination problems in generated content.
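For reference, the general classification metrics mentioned above can be computed for a binary hallucination detector as in the following sketch. It assumes that label 1 denotes "hallucinated" and label 0 denotes "not hallucinated"; the function name is illustrative.

```python
from typing import Dict, Sequence

def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with 'hallucinated' (1) as
    the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # hallucination caught
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarm
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # hallucination missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly passed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

All four values collapse a detector's behavior into single numbers per dataset, which is why they cannot by themselves distinguish between different hallucination types or error dimensions.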

From the perspective of hallucination types, the vast majority of current benchmarks focus on detecting factual-consistency hallucinations: all benchmarks except RAGTruth[71] cover this category. This reflects the high level of attention currently paid to the truthfulness of content generated by large language models. By comparison, detection benchmarks for semantic-faithfulness, context-consistency, and external-dependence hallucinations are still in the exploratory stage; only benchmarks such as HaluEval[168] and AutoHall[169] involve these types of hallucinations in some tasks. Moreover, the overall coverage of these benchmarks is relatively low, and the detection granularity is mostly limited to the sentence level or concept level, making it difficult to accurately capture hallucination phenomena such as semantic detachment, contextual conflict, or shifts in retrieved knowledge.

In terms of application tasks, current hallucination detection benchmarks cover multiple types of generation tasks, mainly mainstream NLP tasks such as question answering, summarization, multi-turn dialogue, and machine translation. At the same time, benchmarks such as HaLoGen[170] introduce complex generation tasks such as code generation, scientific citation, and long-text generation, expanding the application scenarios of hallucination detection. Overall, however, most current hallucination detection benchmarks are still dominated by general-domain tasks, and systematic benchmarks for specialized domains such as medical question answering and legal document generation are lacking. This gap stems mainly from factors such as privacy protection for specialized-domain data and the high cost of annotation.

Based on the above analysis, hallucination detection benchmarks can be classified according to dimensions such as benchmark category, detection granularity, and evaluation metrics, as shown in Table 5.

By counting the Google Scholar citation frequencies of the papers corresponding to 23 benchmarks, it can be found that the three most commonly used benchmarks at present are TruthfulQA, FActScore, and HaluEval, as shown in Fig. 7. Combined with Table 5, the widespread use of these three benchmarks reflects the research community’s strong concern for the factual consistency and semantic faithfulness of generated content. Specifically, TruthfulQA mainly focuses on the truthfulness and consistency of language models when answering questions, especially the accuracy of models’ answers to factual questions; FActScore evaluates models by quantifying the factuality of generated content and is widely applied to factual-consistency hallucination detection for long texts; HaluEval places greater emphasis on detecting hallucination phenomena in multiple generation tasks, covering complex situations such as semantic bias and factual deviation. Benchmarks targeting context-consistency or external-dependence hallucinations, such as HADES and RAGTruth, likewise provide more precise detection tools for specific application scenarios.

Benchmark selection should therefore be adapted to the characteristics of the application scenario and to the target hallucination type, so that detection remains effective and well targeted.
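As a minimal sketch of this selection principle, the snippet below filters a few rows transcribed from Table 5 by target hallucination type and, optionally, by detection granularity. The `Benchmark` class, the `select_benchmarks` helper, and the shortened granularity strings are hypothetical names introduced for illustration; only the type and granularity assignments come from Table 5.

```python
from dataclasses import dataclass

# Abbreviations follow the note under Table 5.
HALLUCINATION_TYPES = {"SFH", "FCH", "CCH", "EDH"}


@dataclass(frozen=True)
class Benchmark:
    name: str
    types: frozenset      # hallucination types covered
    granularity: tuple    # detection granularities offered


# A subset of Table 5 rows, not the full table.
TABLE5_SUBSET = [
    Benchmark("TruthfulQA", frozenset({"FCH"}), ("sentence",)),
    Benchmark("HaluEval", frozenset({"SFH", "FCH", "CCH"}), ("sentence",)),
    Benchmark("HaluBench", frozenset({"FCH", "CCH", "EDH"}), ("sentence", "semantic")),
    Benchmark("RefChecker", frozenset({"FCH", "EDH"}), ("knowledge-triple",)),
    Benchmark("HADES", frozenset({"SFH", "FCH", "CCH"}), ("token",)),
    Benchmark("RAGTruth", frozenset({"EDH"}), ("sentence", "paragraph")),
]


def select_benchmarks(target_type, granularity=None, table=TABLE5_SUBSET):
    """Return names of benchmarks covering `target_type` (and `granularity`, if given)."""
    if target_type not in HALLUCINATION_TYPES:
        raise ValueError(f"unknown hallucination type: {target_type!r}")
    return [
        b.name
        for b in table
        if target_type in b.types
        and (granularity is None or granularity in b.granularity)
    ]


# Example: a retrieval-augmented application is mainly concerned with EDH.
print(select_benchmarks("EDH"))
# Example: token-level detection of contextual-consistency hallucinations.
print(select_benchmarks("CCH", granularity="token"))
```

In practice the filter would also consider language, task type, and scale, which Table 5 records as well.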

5 Future Research Directions and Challenges

Large language models have received extensive attention from industry and academia in recent years and have achieved many breakthrough advances. However, research on hallucination detection for large language models is still in its infancy and continues to face many challenges. Based on the in-depth analysis of the current status of hallucination detection research in this paper, future research in this field should focus on four directions:

1) Define the boundaries of hallucinations in large language models, and establish clearer classification criteria and measurement systems. Hallucination is a special type of error in large language models, but not all errors are hallucinations. Unlike simple grammatical or logical errors, hallucinations often involve more complex knowledge conflicts or information loss, and under the influence of prompts they become more concealed and more varied. Identifying where “false guidance” ends and “other errors” begin is the foundation for studying hallucination in large language models. In addition, existing research still lacks a fine-grained classification of hallucinations, and multidimensional analysis of hallucination characteristics remains incomplete. Clarifying the boundary between hallucinations and other errors, and constructing scientific classification criteria and measurement systems, is therefore one of the important theoretical directions for future research on hallucination detection.

2) Explore the disentanglement mechanisms of composite hallucinations, reveal the associations and generation patterns among different hallucination types, and propose detection methods for composite hallucinations. Existing hallucination detection methods usually focus on specific types of hallucinations. However, in practical applications, hallucinations generated by large language models are not of a single type; rather, different hallucination types intertwine and overlap, presenting more complex forms of error. This makes it difficult for traditional single-type detection methods to effectively capture hallucinations. For example, knowledge-verification-based detection methods may miss logical contradictions, whereas logic-analysis-based methods may overlook factual errors. By disentangling the associations and generation mechanisms among hallucination types through multidimensional causal analysis, and by exploring methods for detecting composite hallucinations, comprehensive coverage and accurate identification of complex hallucination phenomena can be achieved; this is one of the future research directions at the methodological level.

3) Study hallucination detection methods in cross-modal, cross-language, and cross-domain scenarios. With the broad application of large language models in multilingual, multimodal, and multidomain environments, hallucination problems exhibit more complex characteristics. On the one hand, in cross-language scenarios, differences in semantic expression, uneven knowledge coverage, and imbalanced resource distribution among different languages increase the difficulty of hallucination detection. On the other hand, in cross-modal scenarios, such as image-text generation, visual question answering, and audio generation, textual hallucinations often conflict with the actual content of other modalities; traditional methods based only on textual reasoning have difficulty covering this type of composite hallucination. Therefore, it is urgently necessary to construct a detection framework capable of uniformly processing multilingual and multimodal inputs

Table 5 Benchmark for Hallucination Detection in Large Language Models

Benchmark	Type	Scale	Granularity	Language	Hallucination type: SFH	Hallucination type: FCH	Hallucination type: CCH	Hallucination type: EDH	Task type (including but not limited to)	Evaluation metrics
TruthfulQA[156]	Hallucination evaluation	817	Sentence level	English		√			Knowledge QA, multiple choice	Acc
HaluEval[168]	Hallucination evaluation	35 000	Sentence level	English	√	√	√		Knowledge QA, multi-turn dialogue, summary generation	Acc
UHGEval[158]	Hallucination evaluation	5 141	Token level, sentence level	Chinese		√			Text generation	Acc, BLEU-4, ROUGE-L, etc.
HalluQA[164]	Hallucination evaluation	450	Sentence level	Chinese		√			Knowledge QA, multi-turn dialogue	Non-Hallucination Rate, etc.
AutoHall[169]	Hallucination detection	2 844	Sentence level	English		√	√		Knowledge QA, summary generation, multi-turn dialogue	Acc, $F1$
DiaHalu[161]	Hallucination evaluation, hallucination detection	1 103	Dialogue level	English	√	√			Multi-turn dialogue	Precision, Recall, $F1$
FactCHD[170]	Hallucination evaluation	58 343	Sentence level	English		√			Knowledge QA, multi-turn dialogue	Micro $F1$, Expmatch
FaithBench[171]	Hallucination evaluation	660	Sentence level	English	√	√			Summary generation	Micro $F1$, Balanced Accuracy
HALoGen[172]	Hallucination evaluation	150 000	Sentence level	English	√	√			Code generation, summary generation, scientific citation	Hallucination Score, etc.
HELM[155]	Hallucination detection	3 342	Sentence level, paragraph level	English	√	√			Knowledge QA, summary generation, text generation	AUC, PCCs
HalOmi[159]	Hallucination detection	<3 546	Token level, sentence level	Multilingual	√	√			Machine translation	AUC, AOC
HalluDial[162]	Hallucination evaluation	146 856	Dialogue level	English	√	√			Knowledge QA	Acc, Macro $F1$
HaluBench[163]	Hallucination evaluation	15 000	Sentence level, semantic level	English		√	√	√	Knowledge QA, summary generation	Acc, Macro $F1$, etc.
HaluEval 2.0[173]	Hallucination evaluation	8 770	Sentence level, paragraph level	English		√	√		Knowledge QA, summary generation	Micro hallucination rate, etc.
RefChecker[160]	Hallucination detection	11 000	Knowledge-triple level	English		√		√	Knowledge QA, summary generation	Acc, Macro $F1$
HADES[157]	Hallucination detection	12 719	Token level	English	√	√	√		Knowledge QA, multi-turn dialogue	Precision, Recall, $F1$, etc.
PHD[114]	Hallucination detection	300	Paragraph level	English		√			Knowledge QA	Acc, Precision, $F1$, Recall
FELM[174]	Hallucination detection	3 948	Sentence level, paragraph level	English		√			Multiple tasks	Precision, $F1$, Recall, etc.
REALTIMEQA[175]	Hallucination evaluation		Sentence level	English		√			Knowledge QA	Acc, Exact Match, $F1$
FACTOR[176]	Hallucination evaluation	4 266	Sentence level	English		√			Knowledge QA, summary generation	Acc
BAMBOO[177]	Hallucination evaluation	3 004	Sentence level	English		√	√		Long-form text generation	Concordance Index, Acc, $F1$
Poly-FEVER[25]	Hallucination detection	77 973	Sentence level	Multilingual		√			Fact verification	Acc
RAGTruth[85]	Hallucination evaluation	18 000	Sentence level, paragraph level	English				√	Knowledge QA, summary generation	Precision, Recall, $F1$

Note: SFH denotes semantic-faithfulness hallucination, FCH denotes factual-consistency hallucination, CCH denotes contextual-consistency hallucination, and EDH denotes external-dependence hallucination.

Fig. 7 Citation count of literature corresponding to the hallucination detection benchmarks

while taking into account the knowledge characteristics and reasoning requirements of different domains and improving the model’s robustness and generalization in cross-domain applications. Breaking the current dependence of hallucination detection methods on a single language, a single modality, and a single domain, and establishing a unified hallucination detection system for multilingual, multimodal, and multidomain settings, is therefore one of the future research directions at the methodological level.

4) Study the collaborative mechanism between hallucination detection and mitigation to achieve end-to-end generation optimization. Hallucination detection and mitigation in large language models are usually designed as two independent processing stages. This makes it difficult for the mitigation process to exploit the fine-grained information produced by the detection stage, which limits how much mitigation strategies can improve the quality of generated content. The two-stage approach also significantly increases the time and computational overhead of the overall workflow, a clear limitation in application scenarios with strict real-time requirements. Constructing a collaborative mechanism for hallucination detection and correction that dynamically evaluates and adjusts the output during generation is therefore one of the future research directions at the application level.

Author Contribution Statement: Li Zituo was responsible for the literature review, content design, manuscript writing, and revision of the final version; Sun Jianbin was responsible for providing guidance, framework design, and revision of the full text; Chen Guangzhou, Fang Xinyue, and Cui Ruijing were responsible for manuscript revision; Tian Zhiliang and Huang Zhen were responsible for manuscript review; Yang Kewei provided guidance and revised the manuscript. Among them, Sun Jianbin and Tian Zhiliang are the co-corresponding authors of this paper.

References

[1] Min B, Ross H, Sulem E, et al. Recent advances in natural language processing via large pre-trained language models: A survey[J]. ACM Computing Surveys, 2023, 56(2): 1–40

[2] Fan Lizhou, Li Lingyao, Ma Zihui, et al. A bibliometric review of large language models research from 2017 to 2023[J]. ACM Transactions on Intelligent Systems and Technology, 2024, 15(5): 1–25

[3] Navigli R, Conia S, Ross B. Biases in large language models: Origins, inventory, and discussion[J]. ACM Journal of Data and Information Quality, 2023, 15(2): 1–21

[4] Zhao Xin Wayne, Zhou Kun, Li Junyi, et al. A survey of large language models[J]. arXiv preprint, arXiv: 2303.18223, 2023

[5] Chen Huimin, Liu Zhiyuan, Sun Maosong. The social opportunities and challenges in the era of large language models[J]. Journal of Computer Research and Development, 2024, 61(5): 1094–1103 (in Chinese)

[6] Ye Wentao, Hu Jiaqi, Wang Haobo, et al. A trusted evaluation system for safe deployment of large language models[J]. Journal of Computer Research and Development, 2025, 62(7): 1668–1684 (in Chinese)

[7] Kalai A T, Vempala S S. Calibrated language models must hallucinate[C]//Proc of the 56th Annual ACM Symp on Theory of Computing. New York: ACM, 2024: 160–171

[8] Lin Zichao, Guan Shuyan, Zhang Wending, et al. Towards trustworthy LLMs: A review on debiasing and dehallucinating in large language models[J]. Artificial Intelligence Review, 2024, 57(9): 1–50

[9] Hu Songlin, Li Juanzi, Qin Bing, et al. The good and evil big model: A special topic on big models and security[J]. Journal of Computer Research and Development, 2024, 61(5): 1085–1093 (in Chinese)

[10] Venkit P N, Chakravorti T, Gupta V, et al. An audit on the perspectives and challenges of hallucinations in NLP[C]//Proc of the 2024 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2024: 6528–6548

[11] Li Xu, Zhu Rui, Chen Xiaolei, et al. A survey of hallucinations in large vision-language models: Causes, evaluations and mitigations[J]. Journal of Computer Research and Development, 2025, 62(12): 2929–2950 (in Chinese)

[12] Dahl M, Magesh V, Suzgun M, et al. Large legal fictions: Profiling legal hallucinations in large language models[J]. Journal of Legal Analysis, 2024, 16(1): 64–93

[13] Liu Zhuang, Huang Degen, Huang Kaiyu, et al. FinBERT: A pre-trained financial language representation model for financial text mining[C]//Proc of the 29th Int Joint Conf on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 2021: 4513–4519

[14] Liu Zeyuan, Wang Pengjiang, Song Xiaobin, et al. Survey on hallucinations in large language models[J]. Journal of Software, 2025, 36(3): 1152–1185 (in Chinese)

[15] Ji Ziwei, Lee N, Frieske R, et al. Survey of hallucination in natural language generation[J]. ACM Computing Surveys, 2023, 55(12): 1–38

[16] Zhang Yue, Li Yafu, Cui Leyang, et al. Siren’s song in the AI ocean: A survey on hallucination in large language models[J]. arXiv preprint, arXiv: 2309.01219, 2023

[17] Huang Lei, Yu Weijiang, Ma Weitao, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[J]. ACM Transactions on Information Systems, 2025, 43(2): 1–55

[18] Abbasi Yadkori Y, Kuzborskij I, György A, et al. To believe or not to believe your LLM: Iterative prompting for estimating epistemic uncertainty[C]//Proc of the 37th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2024: 58077–58117

[19] Wu Junchao, Shu Yang, Zhan Runzhe, et al. A survey on LLM-generated text detection: Necessity, methods, and future directions[J]. Computational Linguistics, 2025, 51(1): 275–338

[20] Black S, Biderman S, Hallahan E, et al. GPT-NeoX-20B: An open-source autoregressive language model[J]. arXiv preprint, arXiv: 2204.06745, 2022

[21] Sun Weiwei, Shi Zhengliang, Gao Shen, et al. Contrastive learning reduces hallucination in conversations[C]//Proc of the 37th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2023: 13618–13626

[22] Chen Sihao, Zhang Fan, Sone K, et al. Improving faithfulness in abstractive summarization with contrast candidate generation and selection[C]//Proc of the 2021 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 5935–5941

[23] Dziri N, Madotto A, Zaiane O, et al. Neural path hunter: Reducing hallucination in dialogue systems via path grounding[C]//Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 2197–2214

[24] Huang Yichong, Feng Xiachong, Feng Xiaocheng. The factual inconsistency problem in abstractive text summarization: A survey[J]. arXiv preprint, arXiv: 2104.14839, 2023

[25] Chen Xinxi, Wang Li, Wu Wei, et al. Honest AI: Fine-tuning “small” language models to say “I Don’t Know”, and reducing hallucination in RAG[J]. arXiv preprint, arXiv: 2410.09699, 2024

[26] Huang Yizheng, Huang J. A survey on retrieval-augmented text generation for large language models[J]. arXiv preprint, arXiv: 2404.10981, 2024

[27] Xie Jinheng, Mao Weijia, Bai Zechen, et al. Show-o: One single transformer to unify multimodal understanding and generation[J]. arXiv preprint, arXiv: 2408.12528, 2024

[28] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 30th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2017: 5998–6008

[29] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proc of the 2019 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2019: 4171–4186

[30] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[EB/OL]. (2018-06-11) [2024-12-21]. https://openai.com/index/language-unsupervised/

[31] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[EB/OL]. [2024-12-21]. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[32] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[C]//Proc of the 33rd Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2020: 1877–1901

[33] OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023

[34] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text Transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1–67

[35] Yang Zhilin, Dai Zihang, Yang Yiming, et al. XLNet: Generalized autoregressive pretraining for language understanding[J]. arXiv preprint, arXiv: 1906.08237, 2019

[36] Elaraby M, Lu Mengyin, Dunn J, et al. HaLo: Estimation and reduction of hallucinations in open-source weak large language models[J]. arXiv preprint, arXiv: 2308.11764, 2023

[37] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023

[38] Yin Ziqi, Zhang Mingxin, Kawahara D. Harmony: A home agent for responsive management and action optimization with a locally deployed large language model[J]. arXiv preprint, arXiv: 2410.14252, 2024

[39] Li Zuchao, Zhang Shitou, Zhao Hai, et al. BatGPT: A bidirectional autoregressive talker from generative pre-trained transformer[J]. arXiv preprint, arXiv: 2307.00360, 2023

[40] DeRose J F, Wang Jiayao, Berger M. Attention flows: Analyzing and comparing attention mechanisms in language models[J]. IEEE Transactions on Visualization and Computer Graphics, 2020, 27(2): 1160–1170

[41] Saxena A, Bhattacharyya P. Hallucination detection in machine generated text: A survey[EB/OL]. (2025-01-21) [2025-01-22]. https://www.cfilt.iitb.ac.in/resources/surveys/2024/survey_ashita_hallucination_detection_in_machine_generated_text_2024.pdf

[42] Hahn M. Theoretical limitations of self-attention in neural sequence models[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 156–171

[43] Chiang D, Cholak P. Overcoming a theoretical limitation of self-attention[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 7654–7664

[44] Naveed H, Khan A U, Qiu Shi, et al. A comprehensive overview of large language models[J]. arXiv preprint, arXiv: 2307.06435, 2023

[45] Annepaka Y, Pakray P. Large language models: A survey of their development, capabilities, and applications[J]. Knowledge and Information Systems, 2025, 67(3): 2967–3022

[46] Xu Weijia, Agrawal S, Briakou E, et al. Understanding and detecting hallucinations in neural machine translation via model introspection[J]. Transactions of the Association for Computational Linguistics, 2023, 11: 546–564

[47] Filippova K. Controlled hallucinations: Learning to generate faithfully from noisy data[C]//Proc of the Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA: ACL, 2020: 864–870

[48] Carlini N, Tramer F, Wallace E, et al. Extracting training data from large language models[C]//Proc of the 30th USENIX Security Symp. Berkeley, CA: USENIX Association, 2021: 2633–2650

[49] Carlini N, Ippolito D, Jagielski M, et al. Quantifying memorization across neural language models[J]. arXiv preprint, arXiv: 2202.07646, 2022

[50] Chowdhery A, Narang S, Devlin J, et al. PaLM: Scaling language modeling with pathways[J]. Journal of Machine Learning Research, 2023, 24(240): 1–113

[51] Lee K, Ippolito D, Nystrom A, et al. Deduplicating training data makes language models better[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 8424–8445

[52] Kandpal N, Deng Haikang, Roberts A, et al. Large language models struggle to learn long-tail knowledge[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 15696–15707

[53] Hernandez D, Brown T, Conerly T, et al. Scaling laws and interpretability of learning from repeated data[J]. arXiv preprint,

[56] Wan Yixin, Pu G, Sun Jiao, et al. “Kelly is a warm person, Joseph is a role model”: Gender biases in LLM-generated reference letters[C]//Proc of the Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA: ACL, 2023: 3730–3748

[57] Karpowicz M. On the fundamental impossibility of hallucination control in large language models[J]. arXiv preprint, arXiv: 2506.06382, 2025

[58] Liu Yinqiu, Liu Guangyuan, Zhang Ruichen, et al. Hallucination-aware optimization for large language model-empowered communications[J]. arXiv preprint, arXiv: 2412.06007, 2024

[59] Pal A, Umapathi L K, Sankarasubbu M. Med-HALT: Medical domain hallucination test for large language models[C]//Proc of the 27th Conf on Computational Natural Language Learning. Stroudsburg, PA: ACL, 2023: 314–334

[60] Roychowdhury S. Journey of hallucination-minimized generative AI solutions for financial decision makers[C]//Proc of the 17th ACM Int Conf on Web Search and Data Mining. New York: ACM, 2024: 1180–1181

[61] Wang Mengru, Yao Yunzhi, Xu Ziwen, et al. Knowledge mechanisms in large language models: A survey and perspective[C]//Proc of the Findings of the Association for Computational Linguistics: EMNLP 2024. Stroudsburg, PA: ACL, 2024: 7097–7135

[62] Zhang Ningyu, Yao Yunzhi, Tian Bozhong, et al. A comprehensive study of knowledge editing for large language models[J]. arXiv preprint, arXiv: 2401.01286, 2024

[63] Feng Zhangyin, Ma Weitao, Yu Weijiang, et al. Trends in integration of knowledge and large language models: A survey and taxonomy of methods, benchmarks, and applications[J]. arXiv preprint, arXiv: 2311.05876, 2023

[64] Schulman J. Reinforcement learning from human feedback: Progress and challenges[EB/OL]. [2025-01-01]. https://eecs.berkeley.edu/research/colloquium/230419-2/

[65] Pal A, Sankarasubbu M. Gemini goes to med school: Exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations[J]. arXiv preprint, arXiv: 2402.07023, 2024

[66] Wang Haochun, Zhao Sendong, Qiang Zewen, et al. Knowledge-tuning large language models with structured medical knowledge bases for trustworthy response generation in Chinese[J]. ACM Transactions on Knowledge Discovery from Data, 2025, 19(2): 1–17

[67] Wei J, Huang Da, Lu Yifeng, et al. Simple synthetic data reduces sycophancy in large language models[J]. arXiv preprint, arXiv: 2308.03958, 2023

[68] Sharma M, Tong M, Korbak T, et al. Towards understanding sycophancy in language models[J]. arXiv preprint, arXiv: 2310.13548, 2023

[69] Perez E, Ringer S, Lukošiūtė K, et al. Discovering language model behaviors with model-written evaluations[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 13387–13434

[70] Lu Taiming, Shen Lingfeng, Yang Xinyu, et al. It takes two: On the seamlessness between reward and policy model in RLHF[J]. arXiv preprint, arXiv: 2406.07971, 2024

[71] Lee N, Wei Ping, Xu Peng, et al. Factuality enhanced language models for open-ended text generation[C]//Proc of the 35th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2022: 34586–34599

[72] Renze M, Guven E. The effect of sampling temperature on problem solving in large language models[J]. arXiv preprint, arXiv: 2402.05201, 2024

[73] Chang H S, Peng Nanyun, Bansal M, et al. Real sampling: Boosting factuality and diversity of open-ended generation via asymptotic entropy[J]. arXiv preprint, arXiv: 2406.07735, 2024

[74] Zhang Muru, Press O, Merrill W, et al. How language model hallucinations can snowball[J]. arXiv preprint, arXiv: 2305.13534, 2023

[75] Kang Haoqiang, Ni Juntong, Yao Huaxiu. Ever: Mitigating hallucination in large language models through real-time verification and rectification[J]. arXiv preprint, arXiv: 2311.09114, 2023

[76] Wei J, Wang Xuezhi, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Proc of the 35th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2022: 24824–24837

[77] Kojima T, Gu S S, Reid M, et al. Large language models are zero-shot reasoners[C]//Proc of the 35th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2022: 22199–22213

[78] Yee E, Li A, Tang Chenyu, et al. Faithful and unfaithful error recovery in chain of thought[EB/OL]. (2024-07-10)[2024-12-21]. https://openreview.net/forum?id=IPZ28ZqD4I

[79] Agarwal C, Tanneru S H, Lakkaraju H. Faithfulness vs. plausibility: On the (un) reliability of explanations from large language models[J]. arXiv preprint, arXiv: 2402.04614, 2024

[80] Turpin M, Michael J, Perez E, et al. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting[C]//Proc of the 37th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2024: 74952–74965

[81] Lanham T, Chen A, Radhakrishnan A, et al. Measuring faithfulness in chain-of-thought reasoning[J]. arXiv preprint, arXiv: 2307.13702, 2023

[82] Chen Qiguang, Chen Libo, Liu Jinhao, et al. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models[J]. arXiv preprint, arXiv: 2503.09567, 2025

[83] Ren Ruiyang, Wang Yuhao, Qu Yingqi, et al. Investigating the factual knowledge boundary of large language models with retrieval augmentation[J]. arXiv preprint, arXiv: 2307.11019, 2023

[84] Wen Bingbing, Xu Chenjun, Han Bin, et al. Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration[C]//Proc of the 37th Advances in Neural Information Processing Systems Workshop. Cambridge, MA: MIT, 2024: 1877–1901

[85] Niu Cheng, Wu Yuanhao, Zhu Juno, et al. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models[J]. arXiv preprint, arXiv: 2401.00396, 2023

[86] Hu Haichuan, Sun Yuhan, Zhang Quanjun. LRP4RAG: Detecting hallucinations in retrieval-augmented generation via layer-wise relevance propagation[J]. arXiv preprint, arXiv: 2408.15533, 2024

[87] Barnett S, Kurniawan S, Thudumu S, et al. Seven failure points when engineering a retrieval augmented generation system[C]//Proc of the 3rd IEEE/ACM Int Conf on AI Engineering-Software Engineering for AI. Piscataway, NJ: IEEE, 2024: 194–199

[88] Liu N F, Lin K, Hewitt J, et al. Lost in the middle: How language models use long contexts[J]. Transactions of the Association for Computational Linguistics, 2024, 12: 157–173

[89] Rateike M, Cintas C, Wamburu J, et al. Weakly supervised detection of hallucinations in LLM activations[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2023: 1877–1901

[90] Tenney I, Das D, Pavlick E. BERT rediscovers the classical NLP pipeline[C]//Proc of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 4593–4601

[91] Chuang Y S, Xie Yujia, Luo Hongyin, et al. DoLa: Decoding by contrasting layers improves factuality in large language models[EB/OL]. (2024-01-16)[2024-12-21]. https://openreview.net/forum?id=Th6NyL07na

[92] Varshney N, Yao Wenlin, Zhang Hongming, et al. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation[J]. arXiv preprint, arXiv: 2307.03987, 2023

[93] Chen Kedi, Chen Qin, Zhou Jie, et al. Enhancing uncertainty modeling with semantic graph for hallucination detection[C]//Proc of the 39th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2025: 23586–23594

[94] Fu Jinlan, Ng S K, Jiang Zhengbao, et al. GPTScore: Evaluate as you desire[C]//Proc of the 2024 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 6556–6576

[95] Guerreiro N M, Voita E, Martins A F. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation[C]//Proc of the 17th Conf of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 1059–1075

[96] Yuan Weizhe, Neubig G, Liu Pengfei. BARTScore: Evaluating generated text as text generation[C]//Proc of the 34th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2021: 27263–27277

[97] Xiao Yijun, Wang W Y. On hallucination and predictive uncertainty in conditional language generation[C]//Proc of the 16th Conf of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2021: 2734–2744

[98] Su Weihang, Tang Yichen, Ai Qingyao, et al. Mitigating entity-level hallucination in large language models[C]//Proc of the 2024 Annual Int ACM SIGIR Conf on Research and Development in Information Retrieval in the Asia Pacific Region. New York: ACM, 2024: 23–31

[99] Van Der Poel L, Cotterell R, Meister C. Mutual information alleviates hallucinations in abstractive summarization[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 5956–5965

[100] Chuang Y S, Qiu Linlu, Hsieh C Y, et al. Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps[C]//Proc of the 2024 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2024: 1419–1436

[101] Sriramanan G, Bharti S, Sadasivan V S, et al. LLM-check: Investigating detection of hallucinations in large language models[C]//Proc of the 37th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2024: 34188–34216

[102] Zablocki P, Gajewska Z. Assessing hallucination risks in large language models through internal state analysis[EB/OL]. (2024-07-17)[2025-07-11]. https://www.authorea.com/doi/full/10.22541/au.172124175.55788724

[103] Hu Xiaomeng, Zhang Yiming, Peng Ru, et al. Embedding and gradient say wrong: A white-box method for hallucination detection[C]//Proc of the 2024 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2024: 1950–1959

[104] Snyder B, Moisescu M, Zafar M B. On early detection of hallucinations in factual question answering[C]//Proc of the 30th ACM SIGKDD Conf on Knowledge Discovery and Data Mining. New York: ACM, 2024: 2721–2732

[105] Sun Zhongxiang, Zang Xiaoxue, Zheng Kai, et al. ReDeEP: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability[J]. arXiv preprint, arXiv: 2410.11414, 2024

[106] Team G, Anil R, Borgeaud S, et al. Gemini: A family of highly capable multimodal models[J]. arXiv preprint, arXiv: 2312.11805, 2023

[107] Bhamidipati P, Malladi A, Shrivastava M, et al. Zero-shot multi-task hallucination detection[J]. arXiv preprint, arXiv: 2403.12244, 2024

[108] Rashad M, Zahran A, Amin A, et al. FactAlign: Fact-level hallucination detection and classification through knowledge graph alignment[C]//Proc of the 4th Workshop on Trustworthy Natural Language Processing. Stroudsburg, PA: ACL, 2024: 79–84

[109] Sansford H, Richardson N, Maretic H P, et al. GraphEval: A knowledge-graph based LLM hallucination evaluation framework[J]. arXiv preprint, arXiv: 2407.10793, 2024

[110] Durmus E, He H, Diab M. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 5055–5070

[111] Farquhar S, Kossen J, Kuhn L, et al. Detecting hallucinations in large language models using semantic entropy[J]. Nature, 2024, 630 (8017): 625–630

[112] Manakul P, Liusie A, Gales M J. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 9004–9017

[113] Fang Xinyue, Huang Zhen, Tian Zhiliang, et al. Zero-resource hallucination detection for text generation via graph-based contextual knowledge triples modeling[C]//Proc of the 39th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2025: 23868–23877

[114] Yang Shiping, Sun Renliang, Wan Xiaojun. A new benchmark and reverse validation method for passage-level hallucination detection[C]//Proc of the Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA: ACL, 2023: 3898–3908

[115] Honovich O, Choshen L, Aharoni R, et al. Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering[C]//Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 7856–7870

[116] Scialom T, Dray P A, Gallinari P, et al. QuestEval: Summarization asks for fact-based evaluation[C]//Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 6594–6604

[117] Wang A, Cho K, Lewis M. Asking and answering questions to evaluate the factual consistency of summaries[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 5008–5020

[118] Fabbri A R, Wu C S, Liu Wenhao, et al. QAFactEval: Improved QA-based factual consistency evaluation for summarization[C]//Proc of the 2022 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2022: 2587–2601

[119] Yehuda Y, Malkiel I, Barkan O, et al. InterrogateLLM: Zero-resource hallucination detection in LLM-generated answers[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024: 9333–9347

[120] Cohen R, Hamri M, Geva M, et al. LM vs LM: Detecting factual errors via cross examination[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 12621–12640

[121] Chiang C H, Lee H. Can large language models be an alternative to human evaluations?[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 15607–15631

[122] Liu Yang, Iter D, Xu Yichong, et al. G-EVAL: NLG evaluation using GPT-4 with better human alignment[J]. arXiv preprint, arXiv: 2303.16634, 2023

[123] Adlakha V, BehnamGhader P, Lu Xinghan, et al. Evaluating correctness and faithfulness of instruction-following models for question answering[J]. Transactions of the Association for Computational Linguistics, 2024, 12: 775–793

[124] Gao Mingqi, Ruan Jie, Sun Renliang. Human-like summarization evaluation with ChatGPT[J]. arXiv preprint, arXiv: 2304.02554, 2023

[125] Jain S, Keshava V, Sathyendra S M, et al. Multi-dimensional evaluation of text summarization with in-context learning[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 8487–8495

[126] Luo Zheheng, Xie Qianqian, Ananiadou S. ChatGPT as a factual inconsistency evaluator for text summarization[J]. arXiv preprint, arXiv: 2303.15621, 2023

[127] Dhuliawala S, Komeili M, Xu Jing, et al. Chain-of-verification reduces hallucination in large language models[C]//Proc of the Findings of the Association for Computational Linguistics: ACL 2024. Stroudsburg, PA: ACL, 2024: 3563–3578

[128] Luo Jianjun, Xiao Cong, Ma Feng. Zero-resource hallucination prevention for large language models[J]. arXiv preprint, arXiv: 2309.02654, 2023

[129] Agrawal A, Suzgun M, Mackey L, et al. Do language models know when they’re hallucinating references?[J]. arXiv preprint, arXiv: 2305.18248, 2023

[130] Das S, Srihari R K. Compos Mentis at SemEval-2024 Task 6: A multi-faceted role-based large language model ensemble to detect hallucination[C]//Proc of the 18th Int Workshop on Semantic Evaluation (SemEval-2024). Stroudsburg, PA: ACL, 2024: 1449–1454

[131] Zheng D, Lapata M, Pan J Z. Large language models as reliable knowledge bases?[J]. arXiv preprint, arXiv: 2407.13578, 2024

[132] Son S S, Park J, Hwang J I, et al. HaRiM+: Evaluating summary quality with hallucination risk[J]. arXiv preprint, arXiv: 2211.12118, 2022

[133] Min S, Krishna K, Lyu X, et al. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 12076–12100

[134] Chern I, Chern S, Chen Shiqi, et al. FacTool: Factuality detection in generative AI —A tool augmented framework for multi-task and multi-domain scenarios[J]. arXiv preprint, arXiv: 2307.13528, 2023

[135] Hu Xiangkun, Ru Dongyu, Qiu Lin, et al. Knowledge-centric hallucination detection[C]//Proc of the 2024 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2024: 6953–6975

[136] Mishra A, Asai A, Balachandran V, et al. Fine-grained hallucination detection and editing for language models[J]. arXiv preprint, arXiv: 2401.06855, 2024

[137] Li Ningke, Li Yuekang, Liu Yi, et al. Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models[C]//Proc of the ACM on Programming Languages. New York: ACM, 2024: 1843–1872

[138] Bayat F F, Qian Kun, Han Benjamin, et al. Fleek: Factual error detection and correction with evidence retrieved from external knowledge[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 124–130

[139] Wang Binjie, Chern S, Chern E, et al. Halu-J: Critique-based hallucination judge[J]. arXiv preprint, arXiv: 2407.12943, 2024

[140] Zhao Xinping, Yu Jindi, Liu Zhenyu, et al. Medico: Towards hallucination detection and correction with multi-source evidence fusion[C]//Proc of the 2024 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2024: 34–45

[141] Zhang Jiawei, Xu Chejian, Gai Yu, et al. KnowHalu: Hallucination detection via multi-form knowledge based factual checking[J]. arXiv preprint, arXiv: 2404.02935, 2024

[142] Wang Xiaohua, Yan Yuliang, Huang Longtao, et al. Hallucination detection for generative large language models by Bayesian sequential estimation[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 15361–15371

[143] Zhou Chunting, Neubig G, Gu Jiatao, et al. Detecting hallucinated content in conditional neural sequence generation[J]. arXiv preprint, arXiv: 2011.02593, 2020

[144] Kryściński W, McCann B, Xiong Caiming, et al. Evaluating the factual consistency of abstractive text summarization[C]//Proc of the 2020 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 9332–9346

[145] Qiu Yifu, Ziser Y, Korhonen A, et al. Detecting and mitigating hallucinations in multilingual summarisation[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 8914–8932

[146] Du Xuefeng, Xiao Chaowei, Li Yixuan. HaloScope: Harnessing unlabeled LLM generations for hallucination detection[J]. arXiv preprint, arXiv: 2409.17504, 2024

[147] Quevedo E, Yero J, Koerner R, et al. Detecting hallucinations in large language model generation: A token probability approach[J]. arXiv preprint, arXiv: 2405.19648, 2024

[148] Cao Meng, Dong Yue, Cheung J C K. Hallucinated but factual! Inspecting the factuality of hallucinations in abstractive summarization[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 3340–3354

[149] Santhanam S, Hedayatnia B, Gella S, et al. Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation[J]. arXiv preprint, arXiv: 2110.05456, 2021

[150] Zha Yuheng, Yang Yichi, Li Ruichen, et al. AlignScore: Evaluating factual consistency with a unified alignment function[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 11328–11348

[151] Shen Jiaming, Liu Jialu, Finnie D, et al. “Why is this misleading?”: Detecting news headline hallucinations with explanations[C]//Proc of the ACM Web Conf. New York: ACM, 2023: 1662–1672

[152] Choi S, Fang Tianqing, Wang Zhaowei, et al. KCTS: Knowledge-constrained tree search decoding with token-level hallucination detection[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 14035–14053

[153] Qiu Yifu, Embar V, Shay B, et al. Think while you write: Hypothesis verification promotes faithful knowledge-to-text generation[C]//Proc of the Findings of the Association for Computational Linguistics: NAACL 2024. Stroudsburg, PA: ACL, 2024: 1628–1644

[154] Himmi A, Staerman G, Picot M, et al. Enhanced hallucination detection in neural machine translation through simple detector aggregation[J]. arXiv preprint, arXiv: 2402.13331, 2024

[155] Su Weihang, Wang Changyue, Ai Qingyao, et al. Unsupervised real-time hallucination detection based on the internal states of large language models[J]. arXiv preprint, arXiv: 2403.06448, 2024

[156] Lin S, Hilton J, Evans O. TruthfulQA: Measuring how models mimic human falsehoods[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 3214–3252

[157] Liu Tianyu, Zhang Yizhe, Brockett C, et al. A token-level reference-free hallucination detection benchmark for free-form text generation[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 6723–6737

[158] Liang Xun, Song Shichao, Niu Simin, et al. UHGEval: Benchmarking the hallucination of Chinese large language models via unconstrained generation[J]. arXiv preprint, arXiv: 2311.15296, 2023

[159] Dale D, Voita E, Lam J, et al. HalOmi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation[J]. arXiv preprint, arXiv: 2305.11746, 2023

[160] Hu Xiangkun, Ru Dongyu, Qiu Lin. RefChecker: Reference-based fine-grained hallucination checker and benchmark for large language models[J]. arXiv preprint, arXiv: 2405.14486, 2024

[161] Chen Kedi, Chen Qin, Zhou Jie, et al. DiaHalu: A dialogue-level hallucination evaluation benchmark for large language models[J]. arXiv preprint, arXiv: 2403.00896, 2024

[162] Luo Wen, Shen Tianshu, Li Wei, et al. HalluDial: A large-scale benchmark for automatic dialogue-level hallucination evaluation[J]. arXiv preprint, arXiv: 2406.07070, 2024

[163] Ravi S S, Mielczarek B, Kannappan A, et al. Lynx: An open source hallucination evaluation model[J]. arXiv preprint, arXiv: 2407.08488, 2024

[164] Cheng Qinyuan, Sun Tianxiang, Zhang Wenwei, et al. Evaluating hallucinations in Chinese large language models[J]. arXiv preprint, arXiv: 2310.03368, 2023

[165] Dale D, Costa-jussà M. Blaser 2.0: A metric for evaluation and quality estimation of massively multilingual speech and text translation[C]//Proc of the 2024 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2024: 16075–16085

[166] Zhang Hanzhi, Anjum S, Fan Heng, et al. Poly-FEVER: A multilingual fact verification benchmark for hallucination detection in large language models[J]. arXiv preprint, arXiv: 2503.16541, 2025

[167] Etxaniz J, Sainz O, Miguel N, et al. Latxa: An open language model and evaluation suite for Basque[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024: 14952–14972

[168] Li Junyi, Cheng Xiaoxue, Zhao Xin, et al. HaluEval: A large-scale hallucination evaluation benchmark for large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 6449–6464

[169] Cao Zouying, Yang Yifei, Zhao Hai. AutoHall: Automated hallucination dataset generation for large language models[J]. arXiv preprint, arXiv: 2310.00259, 2023

[170] Ravichander A, Ghela S, Wadden D, et al. The HALoGen benchmark: Fantastic LLM hallucinations and where to find them[EB/OL]. (2024-10-15)[2024-12-30]. https://openreview.net/pdf?id=pQ9QDzckB7

[171] Chen Xiang, Song Duanzheng, Gui Honghao, et al. FactCHD: Benchmarking fact-conflicting hallucination detection[C]//Proc of the 33rd Int Joint Conf on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann, 2024: 6216–6224

[172] Bao F S, Li M, Qu Renyi, et al. FaithBench: A diverse hallucination benchmark for summarization by modern LLMs[J]. arXiv preprint, arXiv: 2410.13210, 2024

[173] Li Junyi, Chen Jie, Ren Ruiyang, et al. The dawn after the dark: An empirical study on factuality hallucination in large language models[J]. arXiv preprint, arXiv: 2401.03205, 2024

[174] Chen Shiqi, Zhao Yiran, Zhang Jinghan, et al. FELM: Benchmarking factuality evaluation of large language models[C]//Proc of the 37th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2024: 44502–44523

[175] Kasai J, Sakaguchi K, Takahashi Y, et al. REALTIME QA: What’s the answer right now?[C]//Proc of the 37th Int Conf on Neural Information Processing Systems. New York: Curran Associates, 2024: 49025–49043

[176] Muhlgay D, Ram O, Magar I, et al. Generating benchmarks for factuality evaluation of language models[C]//Proc of the 18th Conf of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 49–66

[177] Dong Zican, Tang Tianyi, Li Junyi, et al. Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models[J]. arXiv preprint, arXiv: 2309.13345, 2023

Submission history

[v1] 2026-01-01
