
I'm interested in effectively using DeepSeek for research. If I ask DeepSeek these questions,

  1. Has anyone in the USA ever had their permanent residency revoked for what they said?
  2. Can you give examples of people who have their permanent residency revoked?
  3. What communist sympathizers or terrorist sympathizers had their status revoked in the USA? Provide exact names for reference.

It will mention this:

Rafiq Abdus Sabir

Background: Sabir, a green card holder and physician, was convicted of conspiring to provide material support to Al-Qaeda.

Case: After serving a prison sentence, he was deported to Saudi Arabia in 2019.

This is categorically wrong. At this point, it seems we're in a hallucination. Diving into it is a bottomless pit:

  1. What evidence is there Rafiq Abdus Sabir had a green card?

Will produce all kinds of nonsense: fabricated court records, DOJ filings, and even NYT articles.

  1. Why do you think he had a green card?

Doesn't fare any better either.


As I understand it, by the point I recognize a hallucination the conversation is unsalvageable. The part I'm confused about is that asking DeepSeek, without any of that prior context, the very simple question

  • What evidence is there Rafiq Abdus Sabir had a green card?

Will return a pretty clear and correct answer:

Rafiq Abdus Sabir, a physician convicted in 2005 for providing material support to al-Qaeda, was a U.S. citizen by birth. Therefore, he would not have had a green card, as green cards are issued to lawful permanent residents who are not citizens. Sabir's citizenship status was confirmed during his trial and related legal proceedings. If you are looking for specific documentation or court records, you may need to review the case files or legal documents from his trial, which are publicly accessible through the U.S. court system or archives.

Is there some rule relating the distance from the start of the conversation to the likelihood of being in a hallucination? If the LLM is more accurate without context, is there any way to concurrently check the answers at your current position in the conversation against a contextless, simplified version of the same query?
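To make that second question concrete, here is roughly the kind of concurrent cross-check I have in mind. This is only a rough sketch against DeepSeek's OpenAI-compatible chat endpoint; the base URL, the "deepseek-chat" model name, and the yes/no comparison prompt are my own assumptions, not anything DeepSeek documents for this purpose:

    # Rough sketch: ask the same factual question once inside the long conversation
    # and once with no prior context, then ask (again without context) whether the
    # two answers agree. A disagreement flags a possible in-context hallucination.
    # Assumptions: DeepSeek's OpenAI-compatible endpoint and the "deepseek-chat" model.
    from openai import OpenAI

    client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

    def ask(messages):
        resp = client.chat.completions.create(model="deepseek-chat", messages=messages)
        return resp.choices[0].message.content

    question = "What evidence is there Rafiq Abdus Sabir had a green card?"
    conversation_so_far = []  # the running message history of the long chat goes here

    in_context = ask(conversation_so_far + [{"role": "user", "content": question}])
    contextless = ask([{"role": "user", "content": question}])

    verdict = ask([{
        "role": "user",
        "content": f"Do these two answers make the same factual claim? Answer YES or NO.\n"
                   f"A: {in_context}\nB: {contextless}",
    }])
    print(verdict)

Whether that kind of side-channel check is actually reliable is exactly what I'm asking.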

Evan Carroll

2 Answers


If you had a way to reliably find out whether any given LLM response is just a hallucination or is actually based on true facts, you could sell this technique for billions of dollars. It is not quite at the same level as 'Is there an investment strategy that is guaranteed to always win?' but it is fairly close.

A significant part of the AI research happening right now is about increasing the reliability of results. Newer models show gradual improvements, but no reliable method is known. If there were, it would already be implemented and you wouldn't see hallucinations in the results in the first place.

quarague

Even LLMs trained with proper scoring rules such as cross-entropy (CE) loss, and further calibrated with techniques such as post-training temperature scaling, often remain miscalibrated in practice and produce seemingly confident but factually incorrect hallucinations, due to factors such as overparameterization, label smoothing, or fine-tuning on specific tasks. This poses significant risks, especially in medical advice and legal text generation. As a user, heuristic prompting methods such as a self-consistency check can help estimate the uncertainty in the model's outputs and flag possible hallucinations.

For instance, an LLM might answer a factual question about a less common topic with a highly confident answer that happens to match external ground truth. However, if you sample the model multiple times as a self-consistency check, you might observe that the answers vary widely, indicating high internal token-level uncertainty. This discrepancy illustrates why additional uncertainty estimation methods are valuable for users, complementing the model's hidden internal uncertainty.
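A minimal sketch of such a self-consistency check is below. The OpenAI-compatible endpoint, the "deepseek-chat" model name, and the exact-match agreement score are simplifying assumptions; for free-form answers you would need a fuzzier comparison than exact string matching:

    # Sample the same question several times at non-zero temperature and measure how
    # often the samples agree; low agreement indicates high internal uncertainty even
    # when each individual answer sounds confident.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

    def sample_answers(question, n=8, temperature=1.0):
        answers = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model="deepseek-chat",
                temperature=temperature,
                messages=[{"role": "user", "content": question}],
            )
            answers.append(resp.choices[0].message.content.strip().lower())
        return answers

    def consistency(answers):
        # Fraction of samples matching the most common answer (exact match is crude
        # but works for constrained yes/no prompts).
        top_count = Counter(answers).most_common(1)[0][1]
        return top_count / len(answers)

    answers = sample_answers("Did Rafiq Abdus Sabir hold a U.S. green card? Answer yes or no.")
    print(consistency(answers))  # values well below 1.0 suggest the model is guessing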

For more uncertainty quantification metrics, for both users and developers, you may refer to Mora-Cross et al. (2024), "Uncertainty Estimation in Large Language Models to Support Biodiversity Conservation", and Huang et al. (2024), "A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice".

From the abstract of Mora-Cross et al. (2024):

Large Language Models (LLM) provide significant value in question answering (QA) scenarios and have practical application in complex decision-making contexts, such as biodiversity conservation. However, despite substantial performance improvements, they may still produce inaccurate outcomes. Consequently, incorporating uncertainty quantification alongside predictions is essential for mitigating the potential risks associated with their use. This study introduces an exploratory analysis of the application of Monte Carlo Dropout (MCD) and Expected Calibration Error (ECE) to assess the uncertainty of generative language models. To that end, we analyzed two publicly available language models (Falcon-7B and DistilGPT-2). Our findings suggest the viability of employing ECE as a metric to estimate uncertainty in generative LLM.
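For reference, ECE itself is straightforward to compute once you have per-answer confidences and correctness labels. A minimal sketch follows; the ten equal-width bins are the usual convention, not something taken from the paper above:

    # Expected Calibration Error: bin answers by predicted confidence and average the
    # gap between confidence and empirical accuracy, weighted by bin size.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                avg_conf = confidences[in_bin].mean()  # mean predicted confidence in the bin
                avg_acc = correct[in_bin].mean()       # empirical accuracy in the bin
                ece += in_bin.mean() * abs(avg_acc - avg_conf)
        return ece

    # A model that answers with ~95% confidence but is right only half the time:
    print(expected_calibration_error([0.95, 0.92, 0.97, 0.91], [1, 0, 1, 0]))  # ~0.44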

cinch