How to Fix LLM Hallucinations About Past Conversations
LLMs make things up about what you told them before. Here are three practical techniques to fix memory hallucinations, with Python code.
Your AI assistant keeps making things up about what you told it before. You never said you prefer dark mode, but it insists you do. You told it your project deadline moved from March to April, and it is still giving you March milestones weeks after the correction. It is merging details from your colleague's conversation with yours and presenting them as facts about you.
This is not a niche edge case. It is one of the most common failure modes in AI systems that operate across sessions, and it is getting worse as more applications add memory features. The Stanford HAI AI Index Report 2025 found that hallucination rates in production LLM systems remain between 3% and 15% depending on the domain. For memory-dependent applications, the problem is more severe because hallucinations compound over time: a wrong fact stored today becomes a source of wrong answers for every future conversation.
This post explains why LLMs hallucinate about past conversations and gives you three concrete techniques to fix the problem. Every technique includes runnable Python code you can adapt to your own projects.
Why LLMs Hallucinate About Past Context
Understanding the root causes matters because each cause demands a different fix. LLMs hallucinate about past conversations for three structural reasons.
Token Limits Create Information Gaps
Every LLM has a fixed context window. GPT-4 Turbo supports 128K tokens, Claude 3 supports 200K tokens, Gemini 1.5 Pro supports 2M tokens. These sound large until you consider that a busy user generates thousands of tokens per conversation. After a few dozen sessions, the full history exceeds the window.
When you truncate history to fit, you lose information. The model then fills gaps with plausible-sounding fabrications. The OpenAI Cookbook on managing long conversations documents this pattern explicitly: models perform worse on recent context when forced to process very long histories.
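To make the information loss concrete, here is a minimal sketch of budget-based truncation. The `truncate_history` helper and the whitespace-based token estimate are illustrative assumptions, not how any particular provider counts tokens.

```python
def estimate_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def truncate_history(messages: list[str], budget: int) -> tuple[list[str], list[str]]:
    """Keep the newest messages that fit in `budget`; return (kept, dropped)."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    dropped = messages[: len(messages) - len(kept)]
    return kept, dropped


history = [
    "My project deadline is April 30",
    "Can you review this function?",
    "Thanks, what about the tests?",
]
kept, dropped = truncate_history(history, budget=10)
print(dropped)  # the deadline message no longer reaches the model
```

The oldest message is the one that gets cut, so a later question about the deadline has nothing in the prompt to ground it.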
No Retrieval Means No Verification
Without a retrieval mechanism, the model has no way to verify its claims about the past. It relies entirely on what you provided in the current prompt. If you ask "What did I say about the API design?" and the relevant conversation happened three sessions ago, the model either guesses or makes something up.
Research from Li et al. (2024) at Google DeepMind measured this directly. They found that LLMs hallucinate approximately 27% more often when asked about information that was present in earlier context but not in the current prompt. The model "knows" it discussed something with the user but cannot retrieve the specific details, so it generates plausible-sounding but incorrect details.
No Cross-Session State
API calls are stateless. Each request starts fresh. The model has no persistent memory between sessions unless you explicitly provide it. This means every conversation is an isolated event, and any "memory" the model appears to have comes from what you injected into the prompt.
When developers try to solve this by stuffing conversation summaries into the system prompt, they create a new problem: the model treats summaries as ground truth without distinguishing between confirmed facts and model-generated summaries. A summary that says "User prefers TypeScript" might have been generated by the model from ambiguous signals — but once it is in the system prompt, it is treated as an established fact.
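One way to keep the distinction visible is to tag each memory with its provenance before it reaches the prompt. The sketch below assumes a hypothetical `MemoryEntry` record and two illustrative source labels, `user_stated` and `model_inferred`; neither comes from a specific library.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    content: str
    source: str  # hypothetical labels: "user_stated" or "model_inferred"


def render_memory_block(entries: list[MemoryEntry]) -> str:
    """Render memories so the prompt separates confirmed facts from guesses."""
    confirmed = [e.content for e in entries if e.source == "user_stated"]
    inferred = [e.content for e in entries if e.source != "user_stated"]
    lines = ["Confirmed by the user:"]
    lines += [f"- {c}" for c in confirmed] or ["- (none)"]
    lines.append("Inferred by the assistant (unverified, confirm before relying on it):")
    lines += [f"- {c}" for c in inferred] or ["- (none)"]
    return "\n".join(lines)


entries = [
    MemoryEntry("User works in fintech", "user_stated"),
    MemoryEntry("User prefers TypeScript", "model_inferred"),
]
print(render_memory_block(entries))
```

With this layout, "User prefers TypeScript" still reaches the model, but it arrives labeled as a guess rather than as an established fact.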
The 3 Types of Memory Hallucination
Memory hallucinations are not random. They fall into three distinct categories, each with its own cause and fix.
Fabricated Facts
This is the most visible type. The model states something about the user's past that never happened. "You mentioned you worked at Google in 2024." The user never said that. The model generated a plausible-sounding detail to fill a gap.
Fabricated facts typically occur when the model has partial context about a user. It knows some facts (name, profession, interests) and uses them to construct fictional details that seem consistent. The user might not even notice — the fabrication often fits the pattern of their other known facts.
Wrong Timeline
The model correctly identifies that a fact changed but gets the order wrong. "You switched from dark mode to light mode last month." In reality, the user switched two months ago and switched back last month. The facts are real, but the temporal relationship is wrong.
Wrong timelines are particularly dangerous because they are harder to detect. The individual facts check out when the user verifies them — they did switch to light mode at some point — but the reasoning built on the timeline is incorrect. The model might recommend light mode themes based on "your recent preference" when the preference actually changed back.
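The dark mode example can be sketched with plain timestamped events. The dates and the `current_value` helper are made up for illustration; the point is that the latest event per topic decides the current state.

```python
from datetime import date

# (day the user said it, topic, value); dates are illustrative
events = [
    (date(2026, 1, 10), "theme", "light mode"),  # switched to light two months ago
    (date(2026, 2, 12), "theme", "dark mode"),   # switched back last month
]


def current_value(events: list[tuple], topic: str):
    """Return the most recently stated value for a topic, or None if unknown."""
    matching = [e for e in events if e[1] == topic]
    if not matching:
        return None
    return max(matching, key=lambda e: e[0])[2]


print(current_value(events, "theme"))
```

A system that only remembers "the user switched to light mode" has a true fact and still reaches the wrong conclusion, because it lacks the ordering that this lookup relies on.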
Merged Users
In multi-user systems, the model mixes up which user said what. Details from User A's conversations appear in User B's profile. This happens when memory systems lack proper user isolation — either at the storage layer or the retrieval layer.
Merged users are the most dangerous type because they can expose private information across user boundaries. A user's medical details might appear in another user's conversation. Their financial preferences might be attributed to someone else. Beyond the privacy violation, it destroys trust in the system.
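As a rough illustration of isolation at the storage layer, the sketch below partitions memories by user so that a lookup cannot see another user's entries. `IsolatedMemoryStore` is a hypothetical in-memory class, not a real library API, and keyword matching stands in for real retrieval.

```python
from collections import defaultdict


class IsolatedMemoryStore:
    """Toy store that keeps one partition per user."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = defaultdict(list)

    def add(self, user_id: str, content: str) -> None:
        if not user_id:
            raise ValueError("user_id is required for every write")
        self._by_user[user_id].append(content)

    def search(self, user_id: str, keyword: str) -> list[str]:
        # Only this user's partition is ever scanned.
        own = self._by_user.get(user_id, [])
        return [m for m in own if keyword.lower() in m.lower()]


store = IsolatedMemoryStore()
store.add("user_a", "My doctor adjusted my medication")
store.add("user_b", "I use Vim as my editor")
print(store.search("user_b", "medication"))
```

Because the user ID selects the partition before any matching happens, there is no query that can return User A's medical detail to User B.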
Fix 1: Retrieval-Augmented Memory
The first fix attacks the root cause: without retrieval, the model has nothing to base its claims on. Retrieval-augmented memory stores past conversations externally and retrieves relevant segments when needed.
Here is a complete implementation using ChromaDB for vector storage and OpenAI embeddings:
```python
import chromadb
from datetime import datetime
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./conversation_memory")
collection = chroma.get_or_create_collection(
    name="conversations",
    metadata={"hnsw:space": "cosine"},
)


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def store_conversation(
    user_id: str,
    message: str,
    role: str,
    session_id: str,
) -> str:
    # count() is enough for a demo; switch to a UUID if entries can be deleted.
    mem_id = f"{user_id}_{session_id}_{collection.count()}"
    embedding = get_embedding(message)
    collection.add(
        ids=[mem_id],
        embeddings=[embedding],
        documents=[message],
        metadatas=[{
            "user_id": user_id,
            "session_id": session_id,
            "role": role,
            "timestamp": datetime.now().isoformat(),
        }],
    )
    return mem_id


def retrieve_relevant_memories(
    user_id: str,
    query: str,
    top_k: int = 5,
) -> list[dict]:
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
        where={"user_id": user_id},  # never search across users
    )
    memories = []
    if results["ids"][0]:
        for i, mem_id in enumerate(results["ids"][0]):
            memories.append({
                "id": mem_id,
                "content": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                # Cosine distance -> similarity, expressed as a percentage
                "relevance": max(0, (1 - results["distances"][0][i]) * 100),
            })
    return memories


def chat_with_memory(user_id: str, message: str) -> str:
    relevant = retrieve_relevant_memories(user_id, message)
    context_parts = []
    if relevant:
        context_parts.append("Relevant past conversations:")
        for mem in relevant:
            date = mem["metadata"]["timestamp"][:10]
            context_parts.append(
                f"- [{date}] {mem['metadata']['role']}: {mem['content']} "
                f"(relevance: {mem['relevance']:.0f}%)"
            )
    context = "\n".join(context_parts) if context_parts else "No prior conversation history available."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    f"You are a helpful assistant with access to conversation history.\n\n"
                    f"Retrieved memories:\n{context}\n\n"
                    f"Use these memories to provide accurate, grounded responses. "
                    f"If the retrieved memories do not contain relevant information, "
                    f"say so clearly. Do not fabricate details about past conversations."
                ),
            },
            {"role": "user", "content": message},
        ],
    )
    reply = response.choices[0].message.content
    # Persist both sides of the exchange so later sessions can retrieve them.
    store_conversation(user_id, message, "user", "current")
    store_conversation(user_id, reply, "assistant", "current")
    return reply


# Demo
store_conversation("user_1", "I'm working on a Rust web server with Actix", "user", "session_1")
store_conversation("user_1", "Great! Actix is a solid choice for Rust web services.", "assistant", "session_1")
store_conversation("user_1", "I decided to switch from Actix to Axum", "user", "session_2")

# Later: retrieve relevant context
memories = retrieve_relevant_memories("user_1", "What web framework am I using?")
for m in memories:
    print(f"{m['content']} ({m['relevance']:.0f}%)")
# Example output (exact scores will vary):
# → "I decided to switch from Actix to Axum" (92%)
# → "I'm working on a Rust web server with Actix" (78%)
```
The key design decisions in this implementation:
- User-scoped retrieval. The where={"user_id": user_id} filter ensures memories never leak across users. This is the primary defense against merged-user hallucinations.
- Timestamped storage. Every memory carries its creation timestamp, making it possible to reason about when things were said.
- Relevance scoring. The system returns confidence scores so the model (and your application logic) can decide whether to trust retrieved memories or disclose uncertainty.
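The relevance scores above can also gate what reaches the prompt. This is a minimal sketch: `build_memory_context` is a hypothetical helper, and the cutoff of 60 is an arbitrary assumption that you would tune against your own data.

```python
def build_memory_context(memories: list[dict], min_relevance: float = 60.0) -> str:
    """Keep only memories at or above the cutoff; otherwise disclose the gap."""
    trusted = [m for m in memories if m["relevance"] >= min_relevance]
    if not trusted:
        return "No sufficiently relevant memories were found."
    return "\n".join(
        f"- {m['content']} (relevance: {m['relevance']:.0f}%)" for m in trusted
    )


memories = [
    {"content": "I decided to switch from Actix to Axum", "relevance": 92.0},
    {"content": "I like hiking on weekends", "relevance": 31.0},
]
print(build_memory_context(memories))
```

Dropping weak matches before prompting means the model is not handed loosely related text that it might stretch into an answer.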
This approach sharply reduces fabricated facts because the model now has grounded evidence. It can cite specific past conversations rather than guessing. But it does not solve timeline problems. We need Fix 2 for that.
Fix 2: Timestamped Fact Storage
Retrieval-augmented memory helps the model know what was said, but it does not help it know when things changed. For that, you need structured fact storage with explicit timestamps and supersession logic.
The idea is simple: instead of storing raw conversation messages, extract facts and track their validity over time. When a fact changes, you mark the old version as superseded and store the new one. Retrieval always returns the current version.
```python
import chromadb
import json
from datetime import datetime
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./fact_store")
facts_collection = chroma.get_or_create_collection(
    name="facts",
    metadata={"hnsw:space": "cosine"},
)


def extract_facts(message: str) -> list[dict]:
    # JSON mode requires a model that supports response_format, such as gpt-4o.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract factual claims from this message. Return JSON:\n"
                    '{"facts": [{"content": "...", "topic": "...", '
                    '"fact_type": "personal_info|preference|project|fact"}]}\n'
                    "Only extract explicit facts, not opinions or questions."
                ),
            },
            {"role": "user", "content": message},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("facts", [])


def store_fact(
    user_id: str,
    content: str,
    topic: str,
    fact_type: str,
    session_id: str,
) -> str:
    fact_id = f"fact_{facts_collection.count() + 1}"
    facts_collection.add(
        ids=[fact_id],
        documents=[content],
        embeddings=[client.embeddings.create(
            model="text-embedding-3-small",
            input=content,
        ).data[0].embedding],
        metadatas=[{
            "user_id": user_id,
            "topic": topic,
            "fact_type": fact_type,
            "session_id": session_id,
            "created_at": datetime.now().isoformat(),
            "superseded": False,
            # Empty string instead of None: some Chroma versions reject None values.
            "superseded_by": "",
        }],
    )
    return fact_id


def find_existing_fact(user_id: str, topic: str) -> dict | None:
    # Look up the current (not superseded) fact for this user and topic.
    results = facts_collection.get(
        where={
            "$and": [
                {"user_id": user_id},
                {"topic": topic},
                {"superseded": False},
            ]
        },
        include=["documents", "metadatas"],
    )
    if results["ids"]:
        return {
            "id": results["ids"][0],
            "content": results["documents"][0],
            "metadata": results["metadatas"][0],
        }
    return None


def supersede_fact(
    old_fact_id: str,
    new_content: str,
    user_id: str,
    topic: str,
    fact_type: str,
    session_id: str,
) -> str:
    # Store the replacement first so the old record can point at its real ID.
    new_id = store_fact(user_id, new_content, topic, fact_type, session_id)
    old_meta = dict(facts_collection.get(
        ids=[old_fact_id],
        include=["metadatas"],
    )["metadatas"][0])
    old_meta["superseded"] = True
    old_meta["superseded_by"] = new_id
    old_meta["superseded_at"] = datetime.now().isoformat()
    facts_collection.update(ids=[old_fact_id], metadatas=[old_meta])
    return new_id


def store_message_facts(user_id: str, message: str, session_id: str) -> list[dict]:
    facts = extract_facts(message)
    actions = []
    for fact in facts:
        existing = find_existing_fact(user_id, fact["topic"])
        if existing:
            # Same topic, different content: the new fact replaces the old one.
            if existing["content"] != fact["content"]:
                new_id = supersede_fact(
                    existing["id"],
                    fact["content"],
                    user_id,
                    fact["topic"],
                    fact["fact_type"],
                    session_id,
                )
                actions.append({
                    "action": "superseded",
                    "old_id": existing["id"],
                    "new_id": new_id,
                    "topic": fact["topic"],
                })
        else:
            new_id = store_fact(
                user_id,
                fact["content"],
                fact["topic"],
                fact["fact_type"],
                session_id,
            )
            actions.append({
                "action": "created",
                "new_id": new_id,
                "topic": fact["topic"],
            })
    return actions


def get_current_facts(user_id: str) -> list[dict]:
    results = facts_collection.get(
        where={
            "$and": [
                {"user_id": user_id},
                {"superseded": False},
            ]
        },
        include=["documents", "metadatas"],
    )
    facts = []
    for i, fact_id in enumerate(results["ids"]):
        facts.append({
            "id": fact_id,
            "content": results["documents"][i],
            "metadata": results["metadatas"][i],
        })
    return facts


# Demo
store_message_facts("user_1", "I work as a backend engineer at Stripe", "s1")
store_message_facts("user_1", "I use Python for most of my projects", "s1")

# User changes jobs
actions = store_message_facts("user_1", "I just started a new job as a staff engineer at Vercel", "s2")
for a in actions:
    print(f"{a['action']}: {a['topic']}")
# → "created: employment" (or "superseded" if topic matched)

# Get current facts
current = get_current_facts("user_1")
for f in current:
    print(f"  {f['metadata']['topic']}: {f['content']}")
# Example output (topic labels and wording depend on the LLM extraction):
# → "employment: I just started a new job as a staff engineer at Vercel"
# → "programming: I use Python for most of my projects"
```
This implementation solves the timeline problem. When the user changes jobs, the old fact is not deleted — it is marked as superseded with a timestamp. The current facts always reflect reality. If you need historical queries ("What did the user's job used to be?"), you query with superseded=True and sort by timestamp.
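A historical lookup can be sketched over the same metadata fields without touching the database. The `fact_at` helper and the sample records below are hypothetical; they assume each record carries `created_at` and, once replaced, `superseded_at` as ISO timestamps.

```python
from datetime import datetime


def fact_at(records: list[dict], topic: str, when: datetime):
    """Return the content that was current for `topic` at time `when`, if any."""
    for r in records:
        if r["topic"] != topic:
            continue
        start = datetime.fromisoformat(r["created_at"])
        ended = r.get("superseded_at")
        end = datetime.fromisoformat(ended) if ended else None
        # A fact is valid from its creation until it is superseded.
        if start <= when and (end is None or when < end):
            return r["content"]
    return None


records = [
    {"topic": "employment", "content": "Backend engineer at Stripe",
     "created_at": "2025-06-01T09:00:00", "superseded_at": "2026-02-01T09:00:00"},
    {"topic": "employment", "content": "Staff engineer at Vercel",
     "created_at": "2026-02-01T09:00:00", "superseded_at": None},
]
print(fact_at(records, "employment", datetime(2025, 12, 1)))
print(fact_at(records, "employment", datetime(2026, 3, 1)))
```

Treating each fact as valid over a half-open interval keeps the answer unambiguous at the exact moment of a change.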
The fact extraction step uses an LLM to identify factual claims from natural language. This adds latency and cost, but it transforms unstructured conversation text into structured, queryable facts. The trade-off is worth it for applications where temporal accuracy matters.
Fix 3: Deduplication and Contradiction Detection
The first two fixes address fabricated facts and wrong timelines. The third fix addresses the subtler problem: redundant and contradictory memories that confuse retrieval.
When a user says "I prefer dark mode" in January and "I like light mode now" in March, a naive system stores both as current facts. Both appear in retrieval results. The model then has to decide which to trust — a judgment call it frequently gets wrong.
Contradiction detection catches these conflicts before they reach the model. It compares new facts against existing ones and handles conflicts automatically.
```python
import chromadb
import json
from datetime import datetime
from openai import OpenAI

client = OpenAI()
chroma = chromadb.PersistentClient(path="./dedup_store")
facts_collection = chroma.get_or_create_collection(
    name="facts",
    metadata={"hnsw:space": "cosine"},
)


def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding


def store_fact(user_id: str, content: str, topic: str, session_id: str) -> str:
    fact_id = f"fact_{facts_collection.count() + 1}"
    facts_collection.add(
        ids=[fact_id],
        documents=[content],
        embeddings=[get_embedding(content)],
        metadatas=[{
            "user_id": user_id,
            "topic": topic,
            "session_id": session_id,
            "created_at": datetime.now().isoformat(),
            "superseded": False,
        }],
    )
    return fact_id


def find_similar_facts(user_id: str, content: str, threshold: float = 0.7) -> list[dict]:
    embedding = get_embedding(content)
    results = facts_collection.query(
        query_embeddings=[embedding],
        n_results=5,
        where={
            "$and": [
                {"user_id": user_id},
                {"superseded": False},
            ]
        },
        include=["documents", "metadatas", "distances"],
    )
    similar = []
    if results["ids"][0]:
        for i, fact_id in enumerate(results["ids"][0]):
            distance = results["distances"][0][i]
            similarity = 1 - distance  # cosine distance -> cosine similarity
            if similarity >= threshold:
                similar.append({
                    "id": fact_id,
                    "content": results["documents"][0][i],
                    "metadata": results["metadatas"][0][i],
                    "similarity": similarity,
                })
    return similar


def detect_contradiction(new_fact: str, existing_facts: list[dict]) -> dict | None:
    if not existing_facts:
        return None
    facts_text = "\n".join(
        f"- ID {f['id']}: {f['content']}" for f in existing_facts
    )
    # JSON mode requires a model that supports response_format, such as gpt-4o.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-checking system. Given a new statement and "
                    "existing facts, determine if the new statement contradicts, "
                    "duplicates, or is consistent with any existing fact.\n\n"
                    "Return JSON:\n"
                    '{"verdict": "contradiction|duplicate|new", '
                    '"related_fact_id": "...", '
                    '"explanation": "...", '
                    '"supersedes": true/false}'
                ),
            },
            {
                "role": "user",
                "content": (
                    f"New statement: {new_fact}\n\n"
                    f"Existing facts:\n{facts_text}"
                ),
            },
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result


def supersede_fact(old_id: str):
    # Mark the old fact as outdated; it stays in the store for history.
    old_meta = dict(facts_collection.get(
        ids=[old_id],
        include=["metadatas"],
    )["metadatas"][0])
    old_meta["superseded"] = True
    old_meta["superseded_at"] = datetime.now().isoformat()
    facts_collection.update(ids=[old_id], metadatas=[old_meta])


def smart_store_fact(
    user_id: str,
    content: str,
    topic: str,
    session_id: str,
) -> dict:
    similar = find_similar_facts(user_id, content)
    contradiction = detect_contradiction(content, similar)
    if contradiction and contradiction["verdict"] == "contradiction":
        related_id = contradiction.get("related_fact_id")
        if related_id:
            supersede_fact(related_id)
        new_id = store_fact(user_id, content, topic, session_id)
        return {
            "action": "contradiction_resolved",
            "new_id": new_id,
            "superseded_id": related_id,
            "explanation": contradiction["explanation"],
        }
    elif contradiction and contradiction["verdict"] == "duplicate":
        return {
            "action": "duplicate_skipped",
            "existing_id": contradiction.get("related_fact_id"),
            "explanation": contradiction["explanation"],
        }
    else:
        new_id = store_fact(user_id, content, topic, session_id)
        return {
            "action": "created",
            "new_id": new_id,
        }


# Demo: Duplicate detection
store_fact("user_1", "I prefer dark mode for my IDE", "preferences", "s1")
result = smart_store_fact(
    "user_1",
    "I like dark mode in my code editor",
    "preferences",
    "s2",
)
print(result)
# Example output (explanation wording varies; IDs assume a fresh store):
# → {
#   "action": "duplicate_skipped",
#   "existing_id": "fact_1",
#   "explanation": "The new statement is a restatement of the existing preference for dark mode."
# }

# Demo: Contradiction resolution
store_fact("user_1", "I use Vim as my primary editor", "tools", "s1")
result = smart_store_fact(
    "user_1",
    "I switched from Vim to VS Code",
    "tools",
    "s3",
)
print(result)
# The duplicate above was skipped, so the Vim fact is fact_2 and its replacement is fact_3.
# → {
#   "action": "contradiction_resolved",
#   "new_id": "fact_3",
#   "superseded_id": "fact_2",
#   "explanation": "The user states they switched editors, directly contradicting the previous Vim preference."
# }
```
This three-layer defense works as follows:
- Semantic similarity finds candidate facts that might conflict. The embedding-based search catches paraphrases and rewordings that string matching would miss.
- LLM judgment determines the relationship between new and existing facts. It distinguishes between genuine contradictions, restatements of the same fact, and genuinely new information.
- Automatic supersession handles contradictions by marking old facts as outdated and storing new ones. This keeps the fact store clean and current.
The duplicate detection step is particularly valuable. Without it, your fact store accumulates redundant entries that dilute retrieval quality. "I prefer dark mode," "I like dark mode," and "Dark mode is my preference" all compete in results, wasting context window space on repetition.
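The candidate-selection step reduces to a cosine similarity check against a threshold. The sketch below uses tiny hand-written vectors in place of real embeddings, so the numbers are purely illustrative; `is_candidate` is a hypothetical helper that mirrors the 0.7 threshold used above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_candidate(new_vec: list[float], existing_vec: list[float], threshold: float = 0.7) -> bool:
    """True when two facts are close enough to be checked for conflict or duplication."""
    return cosine_similarity(new_vec, existing_vec) >= threshold


# Toy 2-d "embeddings": the first two point in nearly the same direction.
prefers_dark = [1.0, 0.0]
likes_dark = [0.9, 0.1]
uses_vim = [0.0, 1.0]

print(is_candidate(likes_dark, prefers_dark))
print(is_candidate(uses_vim, prefers_dark))
```

Only pairs that pass this cheap check need the more expensive LLM judgment, which keeps the number of contradiction-detection calls bounded.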
Fix 4: Use a Purpose-Built Memory Engine (No More DIY)
The three fixes above work. They solve fabricated facts, wrong timelines, and merged users. But they also require you to build and maintain a significant amount of infrastructure: ChromaDB for vector storage, OpenAI for embeddings and contradiction detection, timestamp management, supersession logic, user isolation filters, and periodic consolidation scripts. Each piece is individually simple. Together, they are a distributed system you need to operate.
Tellodb is an open-source memory engine that provides all three fixes in a single binary. Instead of stitching together ChromaDB, OpenAI, and custom supersession logic, you get fact supersession, temporal ranking, and user isolation through one API.
Fact Supersession Prevents Timeline Hallucinations
The most common hallucination — "you said X, but actually you said Y" — happens because the retrieval layer returns both old and new facts with no mechanism to suppress the outdated one. Tellodb handles this at the engine level:
```python
from tellodb import TellodbClient
from datetime import datetime

client = TellodbClient.from_local()

# User sets a preference
client.ingest(
    entity_id="user-1",
    text="I prefer dark mode for my IDE",
    timestamp=datetime(2026, 1, 15),
)

# User changes their mind two months later
client.ingest(
    entity_id="user-1",
    text="I switched to light mode for my IDE",
    timestamp=datetime(2026, 3, 20),
)

# Query returns the current preference; the old fact is automatically suppressed
hits = client.query(
    "What is my display preference?",
    entity_id="user-1",
)
# → "I switched to light mode for my IDE"
```
Compare this to the DIY approach. With ChromaDB and manual supersession (Fix 2), you need ~100 lines of Python to detect topics, search for existing facts, compare content, mark old entries as superseded, and store new ones. With Tellodb, the engine handles temporal ordering automatically: newer facts on the same topic supersede older ones. No LLM-based contradiction detection needed for the common case.
Entity Isolation Prevents Merged-User Hallucinations
Merged-user hallucinations happen when details from one user's conversation leak into another's. The defense is user-scoped retrieval. In the DIY approach (Fix 1), you apply where={"user_id": user_id} filters manually and hope the database enforces them:
```python
# DIY approach: ChromaDB with manual user filter
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=top_k,
    where={"user_id": user_id},  # ← easy to forget, hard to verify
)
```
With Tellodb, entity isolation is enforced by the engine, not by your query:
```python
# Tellodb: entity isolation is built into the query API
hits = client.query("What editor do I use?", entity_id="user-1")
hits = client.query("What editor do I use?", entity_id="user-2")
# Each returns only their own data: no filter to write, no filter to forget
```
And because the engine ranks results by temporal recency in addition to semantic relevance, the most recent facts always surface first:
```python
# Tellodb returns results with temporal + semantic ranking
hits = client.query("What project am I working on?", entity_id="user-1")
for hit in hits:
    print(f"[{hit.created_at_ms}] {hit.textual_content} (score: {hit.similarity})")
# Results are ordered by recency-weighted relevance, not just embedding similarity
```
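To show what recency-weighted relevance can mean in practice, here is a generic sketch that multiplies similarity by an exponential decay. This is not Tellodb's actual ranking function, which the text above does not specify; the `recency_weighted_score` helper and the 30-day half-life are assumptions for illustration.

```python
def recency_weighted_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Halve a memory's weight every `half_life_days` days."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay


candidates = [
    {"text": "I'm working on a Rust web server with Actix", "similarity": 0.90, "age_days": 60},
    {"text": "I decided to switch from Actix to Axum", "similarity": 0.80, "age_days": 5},
]
ranked = sorted(
    candidates,
    key=lambda c: recency_weighted_score(c["similarity"], c["age_days"]),
    reverse=True,
)
for c in ranked:
    print(c["text"])
```

The older memory has the higher raw similarity, yet the newer one ranks first once age is taken into account.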
DIY vs. Tellodb: Side-by-Side
Here is what you need to build yourself versus what you get with one SDK call:
| Capability | DIY (ChromaDB + OpenAI) | Tellodb |
|---|---|---|
| Semantic retrieval | ~20 lines of embedding + query code | client.query(query, entity_id=user_id) |
| User isolation | Manual where={"user_id": ...} filter per query | entity_id parameter, enforced by engine |
| Fact supersession | ~100 lines of LLM-based contradiction detection | Automatic — newer facts supersede older ones |
| Temporal ranking | Manual sort by timestamp after retrieval | Built into query ranking function |
| Contradiction handling | LLM call per new fact (~500ms, costs tokens) | Engine-level — no extra LLM calls |
| Cloud deployment | Docker + managed ChromaDB + embedding service | from_cloud() — same API, managed infra |
| Total implementation | ~400+ lines of Python, 3 dependencies, ongoing maintenance | ~10 lines of Python, 1 dependency |
Building this yourself works for demos, but production needs engine-level guarantees.
Testing Your Fixes
Building a memory system is not enough. You need to verify it actually prevents hallucinations. Here is a testing framework that measures memory accuracy:
```python
from dataclasses import dataclass


@dataclass
class TestCase:
    user_id: str
    setup_messages: list[dict]
    query: str
    expected_answer: str
    category: str  # "fact", "timeline", "contradiction"


def run_memory_accuracy_test(
    store_message_facts_fn,
    get_current_facts_fn,
    chat_fn,
    test_cases: list[TestCase],
) -> dict:
    correct = 0
    total = len(test_cases)
    category_scores = {}
    for case in test_cases:
        # Load the setup messages into memory before asking the question.
        for msg in case.setup_messages:
            store_message_facts_fn(
                case.user_id,
                msg["content"],
                msg.get("session", "test"),
            )
        response = chat_fn(case.user_id, case.query)
        # Simple substring match; swap in a stricter grader if you need one.
        is_correct = case.expected_answer.lower() in response.lower()
        if is_correct:
            correct += 1
        cat = case.category
        if cat not in category_scores:
            category_scores[cat] = {"correct": 0, "total": 0}
        category_scores[cat]["total"] += 1
        if is_correct:
            category_scores[cat]["correct"] += 1
    accuracy = correct / total if total > 0 else 0
    category_accuracies = {
        cat: scores["correct"] / scores["total"]
        for cat, scores in category_scores.items()
    }
    return {
        "overall_accuracy": accuracy,
        "category_accuracies": category_accuracies,
        "total_cases": total,
        "correct_cases": correct,
    }


def run_hallucination_test(
    chat_fn,
    user_id: str,
    no_context_queries: list[str],
) -> dict:
    hallucinated = 0
    disclosed = 0
    for query in no_context_queries:
        response = chat_fn(user_id, query)
        claims_past = any(
            phrase in response.lower()
            for phrase in ["you said", "you mentioned", "you told me", "last time"]
        )
        discloses_unknown = any(
            phrase in response.lower()
            for phrase in ["i don't have", "i don't recall", "no record", "cannot find"]
        )
        # Check disclosure first: "I don't have a record of what you said"
        # contains "you said" but is the behavior we want.
        if discloses_unknown:
            disclosed += 1
        elif claims_past:
            hallucinated += 1
    total = len(no_context_queries)
    return {
        "total_queries": total,
        "hallucinations": hallucinated,
        "correct_disclosures": disclosed,
        "hallucination_rate": hallucinated / total if total > 0 else 0,
    }


test_cases = [
    TestCase(
        user_id="test_user",
        setup_messages=[
            {"content": "I work as a data scientist at Netflix", "session": "s1"},
        ],
        query="Where do I work?",
        expected_answer="Netflix",
        category="fact",
    ),
    TestCase(
        user_id="test_user",
        setup_messages=[
            {"content": "I prefer tabs over spaces", "session": "s1"},
            {"content": "Actually, I switched to spaces", "session": "s2"},
        ],
        query="Do I prefer tabs or spaces?",
        expected_answer="spaces",
        category="contradiction",
    ),
]

# store_message_facts and get_current_facts come from Fix 2, chat_with_memory from Fix 1.
# The chat function must read from the same store the storage function writes to.
results = run_memory_accuracy_test(
    store_message_facts,
    get_current_facts,
    chat_with_memory,
    test_cases,
)
print(f"Accuracy: {results['overall_accuracy']:.1%}")
print(f"By category: {results['category_accuracies']}")
```
The hallucination test is particularly important. It verifies that when the system has no relevant memory, it says "I don't know" rather than making something up. This is the single most important behavior for preventing fabricated facts.
Run these tests regularly. Memory systems degrade as data grows — a system that works perfectly with 100 facts might hallucinate with 10,000 facts because retrieval quality drops. Set up automated tests that catch accuracy regressions before they reach users.
Production Checklist
Before deploying a memory system that handles real users, verify each item:
Data Isolation
- User-scoped retrieval with server-side filters (not client-side filtering)
- No shared embeddings across user boundaries
- Audit logs for cross-user data access attempts
Temporal Correctness
- Facts stored with creation timestamps
- Supersession logic that marks old facts as outdated
- Retrieval that returns current facts by default
- Historical query support for "what was true at time X?"
Hallucination Prevention
- System prompt explicitly instructs the model to cite retrieved evidence
- Graceful handling when no relevant memory exists (disclose, don't fabricate)
- Confidence scoring on retrieved memories
- Regular testing with no-context queries to verify disclosure behavior
Deduplication
- Semantic similarity detection for incoming facts
- LLM-based contradiction detection for ambiguous cases
- Automatic supersession of contradicted facts
- Periodic consolidation of redundant entries
Monitoring
- Track hallucination rate on a sample of conversations
- Monitor retrieval latency (p50, p95, p99)
- Alert on sudden increases in memory store size
- Log contradiction detection decisions for audit
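For the latency item, percentiles can be computed from raw samples with the standard library. This is a minimal sketch; `latency_percentiles` is a hypothetical helper, and a production setup would more likely rely on its metrics system.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 for a list of retrieval latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1 and 99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


samples = [float(ms) for ms in range(1, 101)]  # synthetic latencies: 1 ms to 100 ms
print(latency_percentiles(samples))
```

Tracking p95 and p99 alongside the median matters because slow retrievals on a large store show up in the tail long before they move the average.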
Cost Management
- Embedding cache for frequently accessed memories
- Batch processing for fact extraction during low-traffic periods
- Token budget limits for contradiction detection calls
- Storage size monitoring and archival policies
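The embedding cache from this list can be sketched as a small wrapper keyed by a hash of the text. `EmbeddingCache` is a hypothetical class, and the stand-in embedding function exists only so the example runs without an API key; in practice you would pass your real embedding call.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings so repeated texts do not trigger repeated API calls."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying function was called

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding call; returns a trivial one-dimensional vector.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("What web framework am I using?")
cache.get("What web framework am I using?")
print(cache.misses)
```

An unbounded dictionary is fine for a sketch; a long-running service would want an eviction policy so the cache does not grow without limit.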
Conclusion
LLM hallucinations about past conversations are not mysterious. They have clear structural causes: limited context windows, no retrieval mechanism, and no cross-session state. Each cause has a corresponding fix.
Retrieval-augmented memory eliminates fabricated facts by giving the model grounded evidence to cite. Timestamped fact storage solves wrong timelines by tracking when facts change and always returning current versions. Deduplication and contradiction detection clean up redundant and conflicting entries that confuse retrieval.
None of these fixes are complex in isolation. The challenge is implementing all three together and testing them thoroughly. A system that retrieves memories but does not track timelines will hallucinate about when things happened. A system that tracks timelines but does not deduplicate will confuse the model with contradictory evidence.
Start with retrieval-augmented memory if you have nothing today. It provides the biggest improvement for the least effort. Add timestamped fact storage when your users have evolving preferences or information. Add deduplication and contradiction detection when redundant or conflicting memories start to degrade retrieval.
The OpenAI Cookbook has additional patterns for managing conversational memory at scale. For deeper research on hallucination measurement, the HaluEval benchmark and FActScore paper provide frameworks for evaluating factual accuracy in language models.
Memory hallucinations are solvable. The techniques in this post give you the foundation. Test them with your data, measure the results, and iterate.