
Why Clinical AI Forgets Your Patients: Memory Research

Abdus Muwwakkil – Chief Executive Officer
Visualization of AI navigating patient records across time, representing the shift from limited memory to intelligent information retrieval in healthcare.

The takeaway: MIT researchers have introduced a new AI architecture that processes 10+ million tokens without forgetting. On reasoning tasks, it achieved 91% accuracy where current retrieval systems hit 51%. For healthcare, where patient histories span decades and missed context creates safety risks, this research suggests the current generation of clinical AI tools may be fundamentally limited, and it points toward what comes next.


Why Clinical AI Keeps Forgetting Your Patients

You’ve seen it happen.

The AI scribe generates a note that misses the medication change from six months ago. The clinical decision support tool recommends a drug the patient already failed. The ambient documentation system produces a note that reads like a first visit when this is the patient’s fourteenth.

The tools aren’t broken. They’re working exactly as designed. And that’s the problem.

Every clinical AI product on the market today operates within the same constraint: a fixed window of text the model can “see” at once. Feed it more than fits in that window and it starts forgetting. Details from early in a document fade. Information in the middle disappears entirely. Researchers have a name for this: context rot.

As one of the MIT researchers put it: “The model gets…dumber? It’s sort of this well-known but hard to describe failure mode that we don’t talk about in papers.”

For clinical applications, where patient histories span decades and a single missed detail can change a treatment plan, this isn’t a minor limitation. It’s structural.

In late 2025, MIT researchers published something that could matter: a new architecture called Recursive Language Models that sidesteps the memory problem entirely. Instead of trying to remember everything at once, the AI learns to search for what it needs.

The results were striking: on tasks requiring reasoning across 1,000 documents (roughly 10 million tokens), RLMs achieved 91% accuracy where standard models scored 0% because they literally couldn’t process inputs that large. And here’s the kicker: RLMs were actually cheaper per query than brute-force approaches.

Here’s what this means for healthcare.


The Problem Hiding in Plain Sight

How Clinical AI Actually Works Today

Pick any AI documentation tool, scribe, or clinical decision support system. Under the hood, they all work the same way.

The model receives a prompt containing whatever context fits in its window. For most systems, that’s 8,000 to 200,000 tokens, roughly 6,000 to 150,000 words. Everything else gets summarized, truncated, or ignored.

When you ask an AI to generate a note for a patient with 15 years of records, it’s not reading those 15 years. It’s reading whatever excerpt someone decided to stuff into the prompt, plus the current encounter.

For a routine visit with a healthy patient, this works fine. For a complex patient with multiple chronic conditions, medication changes, specialist consultations, and prior adverse events, critical details get lost.
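
To see how details fall out, here is a minimal sketch of the kind of prompt assembly such pipelines perform. Everything in it is illustrative: the build_prompt function, the word-count tokenizer, and the newest-first packing order are assumptions, not any vendor's actual code.

```python
def build_prompt(current_encounter: str, notes: list[str],
                 budget: int = 8_000) -> str:
    """Pack the newest history into a fixed token budget; drop the rest."""
    def count_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    parts = [current_encounter]
    used = count_tokens(current_encounter)
    for note in reversed(notes):       # walk the history newest-first
        cost = count_tokens(note)
        if used + cost > budget:
            break                      # everything older is silently dropped
        parts.append(note)
        used += cost
    return "\n\n".join(parts)
```

The failure mode is built in: once the budget fills, the specialist note from eight months ago never reaches the model at all.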

Context Rot in Clinical Terms

Chroma Research tested 18 major AI models in 2025, including the systems underlying most clinical AI products. They found performance degradation across every model as input length increased, even on simple tasks.

The implications for healthcare are concrete:

The medication reconciliation that misses interactions. The model sees the current med list and the new prescription. It doesn’t see the adverse reaction documented eight months ago in a specialist note that never made it into the prompt.

The clinical decision support that ignores history. The system recommends a standard workup. It doesn’t know the patient already had that workup, with negative results, two years ago at a different facility.

The documentation that lacks longitudinal awareness. The AI generates a note for today’s visit. It has no sense of how this visit relates to the trajectory of care over months or years.

The risk stratification that misses patterns. The algorithm scores based on discrete data points. It can’t recognize that this patient’s slow decline across multiple domains matches a pattern that warrants intervention.

These aren’t hypotheticals. They’re the daily reality of clinical AI operating within architectural constraints that guarantee incomplete context.

Why Bigger Windows Don’t Fix It

The AI industry’s response has been to scale context windows. Models now advertise 128K, 1M, even 10M token contexts.

Two problems.

First, cost. Processing a million-token prompt costs $70+ and runs slowly. For real-time clinical workflows, this isn’t practical.

Second, it doesn’t actually work. Even with massive context windows, models exhibit “lost in the middle” effects. Stanford researchers found accuracy dropped from 75% to 55% based purely on where information appeared in the prompt. More tokens don’t mean better recall.

The current generation of clinical AI isn’t limited by engineering constraints that better hardware will eventually erase. Its memory problem comes from how transformer attention mechanisms fundamentally work.


What MIT Figured Out

The Core Insight

Recursive Language Models flip the paradigm.

Instead of cramming everything into a fixed window, RLMs treat the input as external storage the model can search programmatically. The full patient record, whether 10,000 tokens or 10 million, lives in an environment the model can query, slice, and examine piece by piece.

The model never tries to hold everything in active memory. It navigates the data like a clinician would: searching for relevant sections, reading them carefully, cross-referencing with other parts, synthesizing findings.

From the MIT paper: “We propose treating long prompts as part of an external environment and allowing the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt.”

A note on terminology: “Recursive Language Models” here refers to this 2025 agentic paradigm where LLMs programmatically call themselves. This is distinct from the older “Recursive Neural Networks” (RvNNs) of the 2010s, which processed tree-structured parse data. Same word, completely different concept.

How It Works in Practice

The architecture has four components:

External storage. The full context loads into a searchable environment. The model’s prompt contains only the query, not the data itself.

Programmatic search. The model operates through a persistent Python environment where it can inspect, filter, and transform data. It might filter by date range, search for keywords, or identify relevant document types using actual code execution.

Recursive sub-queries. When the model finds relevant sections, it dispatches focused sub-LLM calls to analyze just those sections. Critically, these sub-models can run in parallel via batch processing, analyzing multiple documents simultaneously. A sub-model reads a specific note, lab result, or document segment, analyzes it, and returns findings.

Iterative synthesis. The main model collects results from sub-queries, reasons over them, and can search further if needed. It builds the answer incrementally.

The key architectural insight: the main model never directly accesses tools or raw data. It delegates all context-heavy operations to sub-models, keeping its own context window lean. This delegation pattern achieves 2x token efficiency compared to standard approaches, with the main model’s context reduced substantially while sub-models handle verbose content.

Think of the difference between a clinician who tries to memorize a patient’s entire chart before seeing them versus one who knows how to efficiently find what they need in the record. RLMs give AI the second capability.
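
To make the four components concrete, here is a minimal sketch of the delegation loop, assuming a hypothetical llm callable that wraps a model API. The single fixed keyword-filter step is a simplification for brevity; the MIT implementation gives the root model a persistent Python REPL and lets it choose its own strategies.

```python
from concurrent.futures import ThreadPoolExecutor

def recursive_query(question: str, record: list[str], llm) -> str:
    # External storage: the record never enters the root model's prompt;
    # the root model sees only the question and summary statistics.
    keyword = llm(f"Question: {question}\n"
                  f"The record holds {len(record)} documents. "
                  f"Reply with one keyword to filter on.")

    # Programmatic search: narrow the record without reading every token.
    relevant = [doc for doc in record if keyword.lower() in doc.lower()]

    # Recursive sub-queries: parallel sub-model calls over the snippets.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(
            lambda doc: llm(f"Question: {question}\nDocument:\n{doc}"),
            relevant))

    # Iterative synthesis: the root model reasons over findings alone.
    return llm(f"Question: {question}\nFindings:\n" + "\n".join(findings))
```

Note what the root model never does: read the record. Its context holds only the question, the plan, and the sub-models’ findings.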

Emergent Behaviors (No Training Required)

Perhaps the most fascinating finding: models naturally develop sophisticated navigation strategies without explicit training. When given the RLM scaffolding, they spontaneously exhibit:

Peeking. Before processing, models inspect the structure of the context, understanding what types of documents they’re dealing with.

Grepping. Models use regex patterns and keyword searches to narrow the search space without reading every token.

Partition and map. When faced with large contexts, models chunk the data and dispatch parallel sub-calls to process sections simultaneously.

Verification loops. Models spawn sub-calls to double-check their answers, catching errors through redundant validation.

These behaviors emerged from the architecture alone. No one programmed “use regex to filter.” The models figured it out. This suggests RLMs tap into capabilities that frontier models already possess but can’t express through standard prompting.
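
Inside an RLM, those strategies take the form of short programs the model writes for itself in its Python environment. The sketch below imitates that style; the navigate function, the regex, and the chunk size are invented for illustration, not output from the actual system.

```python
import re

def navigate(context: str, chunk_size: int = 50_000) -> list[str]:
    """Choose which snippets of a huge context deserve sub-model calls."""
    # Peeking: inspect size and structure before committing to a plan.
    print(f"{len(context):,} chars; opens with: {context[:120]!r}")

    # Grepping: a regex narrows the space without reading every token.
    hits = [m.start() for m in
            re.finditer(r"adverse reaction|allergy", context, re.IGNORECASE)]
    if hits:
        return [context[h:h + 2_000] for h in hits]   # focused snippets

    # Partition and map: no hits, so chunk everything for parallel sub-calls.
    return [context[i:i + chunk_size]
            for i in range(0, len(context), chunk_size)]
```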

The Numbers

MIT tested RLMs across five benchmarks ranging from 32K to 11M tokens:

| Benchmark | Task Type | RLM (GPT-5) | Base GPT-5 |
|---|---|---|---|
| BrowseComp+ (6-11M tokens) | Web research across 1,000 docs | 91.3% | 0% (can’t process) |
| CodeQA (23K-4.2M tokens) | Code reasoning | 62% | 24% |
| OOLONG (131K tokens) | Aggregation tasks | 56.5% | 44% |
| OOLONG-Pairs (32K tokens) | Pairwise comparison | 58% | 0.04% |

The standard model scores zero on large inputs because it literally cannot process them. Even on tasks within its context window, RLMs outperformed by 12-58 percentage points.

The most striking finding: RLM with GPT-5-mini (a smaller, cheaper model) outperformed base GPT-5 by over 33% on aggregation tasks. A smaller model that knows how to navigate information beats a larger model that tries to memorize everything.

This has profound implications: you don’t necessarily need bigger models. You need smarter information retrieval strategies.

Where RLMs Fall Short (Important Caveats)

The research isn’t uniformly positive. Honest assessment requires acknowledging where RLMs underperform:

Mathematical reasoning. On math benchmarks, standard LLMs with Python tools actually outperformed RLM architectures. The recursive overhead doesn’t help when the task fits comfortably in a single context window.

Untrained scaffolding waste. Without specific training or detailed prompting, models underutilize the RLM scaffolding. They provide minimal context to sub-models despite having access, suggesting the architecture requires reinforcement learning to reach full potential.

Recursion depth limits. Current implementations are locked at depth-1, meaning sub-models can’t spawn their own sub-models. Multi-level recursion remains future work.

Wall-clock time. RLMs consistently increase processing time due to multiple model calls and correction cycles. For real-time clinical workflows, latency matters.

These results represent a performance floor, not a ceiling. The researchers emphasize that “the true potential of RLM and context folding will be unleashed after being trained via RL.” Current API-based evaluations test untrained models using novel scaffolding. Production systems with trained models should perform substantially better.


What This Means for Healthcare

The Current Generation Is Fundamentally Limited

This isn’t about any specific vendor. Abridge, Nuance DAX, Freed, Suki, and every other clinical AI product operates within the same constraints. They’re building excellent tools within an architectural paradigm that guarantees incomplete context.

When Abridge generates a note, it works primarily from the current encounter plus whatever summary fits in its window. When DAX integrates with Epic, it pulls what it can but can’t reason across a patient’s complete longitudinal record. When any clinical AI tool misses something, it’s often because that something wasn’t in the context it could see.

RLMs suggest these limitations aren’t permanent. They’re artifacts of the current architectural generation.

What Changes When AI Can Navigate Records

Consider what becomes possible when clinical AI can programmatically search and reason across complete patient histories:

Documentation that understands trajectory. Notes that reflect not just what happened today, but how today fits the patient’s journey. “Patient reports improved energy, consistent with response to thyroid medication initiated three months ago. TSH normalized from 8.2 to 2.1.”

Clinical decision support with memory. Systems that know what’s already been tried, what worked, what didn’t, without someone manually encoding that history. “Alert: This medication class previously caused adverse reaction (see note 3/15/2024). Consider alternative.” (A sketch of this kind of check appears after these examples.)

Risk stratification that sees patterns. Algorithms that can identify concerning trajectories across visits, not just score discrete snapshots. “This patient’s functional decline across three domains over six months matches patterns associated with early cognitive impairment.”

Care coordination with full context. Handoffs that don’t lose information because the summary couldn’t capture everything. The receiving clinician’s AI can search the complete record for what’s relevant.
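
To make the decision-support scenario concrete, here is a hypothetical sketch of how an RLM-style check might dispatch focused sub-queries across the complete record. The record structure, the llm helper, and the alert text are all invented for illustration.

```python
import re

def check_prior_failures(drug_class: str, record: list[dict], llm) -> str | None:
    """Search the complete record for prior trouble with a drug class."""
    # Programmatic search over the full history, not a truncated summary.
    candidates = [note for note in record
                  if re.search(drug_class, note["text"], re.IGNORECASE)]

    # Focused sub-queries: one cheap sub-model call per candidate note.
    for note in candidates:
        verdict = llm(
            f"Does this note document an adverse reaction to or failure of "
            f"{drug_class}? Answer yes or no, with a quote.\n\n{note['text']}")
        if verdict.lower().startswith("yes"):
            return (f"Alert: prior issue with {drug_class} "
                    f"documented {note['date']}. Consider alternative.")
    return None  # nothing found anywhere in the record
```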

The Cost Equation

Here’s the counterintuitive finding: RLMs are actually cheaper than brute-force approaches.

MIT’s cost analysis on the BrowseComp+ benchmark:

| Approach | Cost per Query |
|---|---|
| Direct context ingestion | $1.50 - $2.75 |
| Summarization baseline | ~$3.00 |
| RLM | $0.99 |

RLMs achieved up to 3x cost reduction compared to summarization approaches while delivering superior accuracy. The recursive architecture pays for itself.

Why? The main model stays lean, orchestrating rather than processing. Sub-models handle the heavy lifting at lower per-token costs. You’re not paying frontier-model prices to read every token of a massive record.
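
A back-of-the-envelope sketch shows the shape of the saving. Every price and token count below is an assumption chosen for illustration, not a figure from the paper.

```python
# Illustrative only: assumed per-token prices, not MIT's benchmark numbers.
FRONTIER = 2.50e-6   # assumed $ per input token, frontier model
MINI     = 0.25e-6   # assumed $ per input token, small sub-model

record_tokens = 800_000

# Brute force: the frontier model ingests the entire record.
brute_force = record_tokens * FRONTIER                # $2.00

# RLM: the root model reads ~50K orchestration tokens; cheap sub-models
# read only the ~600K tokens the searches actually surfaced.
rlm = 50_000 * FRONTIER + 600_000 * MINI              # $0.28

print(f"brute force: ${brute_force:.2f}  vs  RLM: ${rlm:.2f}")
```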

For health systems evaluating AI costs, this changes the calculus. The question isn’t just “can the AI handle our complex patients?” but “can it do so more economically than current approaches?” The answer appears to be yes.

The Patient Safety Dimension

Context limitations aren’t just an inconvenience. They’re a safety issue.

When retrieval-augmented generation (RAG) fails to surface a documented allergy, the consequence isn’t a lower benchmark score. It’s a potential adverse event. When clinical AI misses a prior medication failure, someone might retry a drug that already didn’t work. When documentation lacks longitudinal awareness, the care plan loses continuity.

Current systems mitigate these risks through human oversight. Clinicians review AI output, catch errors, add context. This works, but it means the AI isn’t fully trusted. It’s a tool that helps but requires verification.

RLM-style architectures suggest a future where clinical AI could be trusted with more complete context, reducing the cognitive burden on clinicians to catch what the AI missed.

Timeline and Practical Implications

RLMs are research prototypes today, not production systems. The MIT paper demonstrates potential but notes limitations: models need careful prompting to use recursive tools effectively, and some tasks required additional training.

Realistic timeline:

12-24 months: RLM capabilities begin appearing in foundation models from OpenAI, Anthropic, Google. Initially as improved long-context handling rather than explicit “recursive” features.

2-3 years: Healthcare-specific AI products begin incorporating these capabilities. Early adopters gain advantage in complex use cases requiring longitudinal reasoning.

3-5 years: RLM-style navigation becomes standard expectation for clinical AI. Products without it seem limited.

What to Watch For

If you’re evaluating clinical AI products or building healthcare technology, here’s what matters:

Ask about context handling. How does the system handle patients with extensive records? What gets summarized or excluded? How do they ensure nothing critical gets lost?

Evaluate longitudinal reasoning. Can the AI connect information across visits? Does it demonstrate awareness of patient trajectory, or does it treat each encounter as isolated?

Understand the retrieval architecture. If the system uses RAG, how does it decide what to retrieve? What failure modes exist? How do they validate retrieval quality?

Watch vendor roadmaps. Companies discussing “intelligent context management,” “active memory,” or “recursive reasoning” are likely tracking this research. Those still focused purely on larger context windows may fall behind.


The Bigger Picture

For a decade, the AI industry treated memory limitations as an engineering problem to route around. Build better embeddings. Design smarter retrieval. Expand context windows.

MIT’s research suggests the problem was architectural. The solution isn’t giving AI more to memorize. It’s teaching AI to navigate information intelligently.

For healthcare, where the information that matters most is often buried in years of records across multiple systems, this shift matters enormously.

The current generation of clinical AI has delivered real value: reduced documentation burden, faster note generation, improved capture of visit content. The next generation could do something more fundamental: reason across complete patient histories the way clinicians do, but faster and without fatigue.

We’re not there yet. But the research path is visible.

The question for healthcare organizations isn’t whether AI memory matters. It’s whether you’re evaluating your current tools and future investments with this trajectory in mind.


Key Takeaways

For health systems evaluating AI: Current clinical AI products are architecturally constrained in ways that limit longitudinal reasoning. This isn’t a vendor quality issue—it’s a paradigm limitation. RLM-style capabilities are 12-24 months from appearing in foundation models, potentially sooner in specialized applications. Factor this trajectory into procurement decisions.

For clinical AI vendors: RAG-based retrieval is a waypoint, not a destination. Companies developing expertise in intelligent context navigation will have structural advantages as foundation models evolve. Consider how your architecture would leverage recursive capabilities when they’re available.

For clinicians using AI tools: Understand that current AI has limited visibility into patient history. Continue reviewing and supplementing AI output with your longitudinal knowledge. The tools will improve, but human oversight remains essential during this transition.

For healthcare IT leaders: The limiting factor for clinical AI may be shifting from model capability to data accessibility. Organizations with well-structured, accessible longitudinal records will be positioned to benefit more from RLM-style architectures than those with fragmented data.


Further Reading

  • Zhang, A. L., Kraska, T., & Khattab, O. (2025). “Recursive Language Models.” arXiv:2512.24601. MIT CSAIL.
  • Zhang, A. L. (2025). “Recursive Language Models.” Author’s blog with implementation insights.
  • RLM GitHub Repository. Official implementation from MIT OASYS lab.
  • Prime Intellect. (2026). “Recursive Language Models: The Paradigm of 2026.” Independent multi-model testing.
  • Hong, K., et al. (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” Chroma Research.
  • Liu, N., et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” Stanford University.

OrbDoc publishes research and analysis on healthcare AI developments. For questions about clinical documentation technology, contact us at admin@orbdoc.com or visit orbdoc.com/research.