A Scientific Approach to Context Engineering

Large Language Models now boast context windows of 200K, 1M, even 10M tokens. But here's the uncomfortable truth the marketing doesn't tell you: the ability to accept long contexts does not equal the ability to effectively utilize them.

This post synthesizes findings from 11 key academic studies and industry research papers to provide an evidence-based framework for context engineering.


The Studies

| Study | Authors/Source | Year | Focus |
|---|---|---|---|
| Lost in the Middle | Liu et al. (Stanford) | 2023 | Position bias in long contexts |
| RULER | NVIDIA Research | 2024 | Long-context benchmarking |
| RAG vs Long-Context | Li et al. | 2024 | Retrieval vs context tradeoffs |
| Context Rot | Chroma Research | 2025 | Performance degradation patterns |
| NoLiMa | arXiv:2502.05167 | 2025 | Non-lexical retrieval evaluation |
| BABILong | NeurIPS 2024 | 2024 | Reasoning in long contexts |
| Position-Agnostic Training | arXiv:2311.09198 | 2023 | Mitigating position bias |
| Attention Sorting | arXiv:2310.01427 | 2023 | Combating recency bias |
| Multimodal NIAH | arXiv:2406.11230 | 2024 | Cross-modal retrieval |
| Recursive Summarization | arXiv:2308.15022 | 2023 | Long-term dialogue memory |
| Effective Context Engineering | Anthropic | 2025 | Practical agent guidance |

Finding 1: The Lost in the Middle Problem

The Research

Liu et al. at Stanford conducted a foundational study published in TACL 2023 (arXiv:2307.03172). They tested multiple LLMs on multi-document question answering, systematically varying where relevant information appeared in the context.

Methodology

  • Placed a relevant document at different positions within a set of 20 documents
  • Measured retrieval accuracy across context positions
  • Tested both open-source and closed-source models
  • Controlled for document content and question difficulty

Results

The study revealed a U-shaped performance curve:

  • Beginning positions: ~75% accuracy
  • Middle positions: ~50% accuracy
  • End positions: ~70% accuracy

That's a 20+ percentage point drop simply from changing where information appears.
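
A minimal sketch of this kind of position sweep, assuming a hypothetical `ask_model()` wrapper around whichever LLM API you use; the documents and scoring here are stand-ins, not Liu et al.'s exact setup:

```python
# Sketch: measure QA accuracy as a function of needle position.
# ask_model() is a hypothetical wrapper; wire it to a real LLM client.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM API")

def build_context(distractors: list[str], needle: str, position: int) -> str:
    """Insert the relevant document at `position` among the distractors."""
    docs = distractors[:position] + [needle] + distractors[position:]
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))

def accuracy_by_position(distractors, needle, question, answer, trials=10):
    results = {}
    for pos in range(len(distractors) + 1):
        hits = 0
        for _ in range(trials):
            prompt = f"{build_context(distractors, needle, pos)}\n\nQuestion: {question}"
            hits += answer.lower() in ask_model(prompt).lower()
        results[pos] = hits / trials
    return results  # expect a U-shape: high at the edges, low in the middle
```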

Why It Happens

Transformer attention mechanisms exhibit two biases:

  1. Primacy bias: Attention concentrates on beginning tokens
  2. Recency bias: Causal language modeling makes recent tokens more influential

Middle positions receive diffuse attention from both directions but strong focus from neither.
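
These biases can be observed directly on a small open model. A rough diagnostic, assuming the transformers library and GPT-2 (chosen only because it is small and exposes attention weights):

```python
# Rough diagnostic: total attention mass each position receives, averaged
# over layers and heads. GPT-2 is used only because it is small; any model
# that returns output_attentions works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = " ".join(["The quick brown fox jumps over the lazy dog."] * 40)
inputs = tok(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # -> (query, key)
received = att.sum(dim=0)  # attention each key position receives in total

# Causal masking means early keys are visible to more queries, so divide
# by the number of queries that can actually attend to each position.
n = received.shape[0]
normalized = received / torch.arange(n, 0, -1)
print("first 5:", normalized[:5].tolist())
print("middle 5:", normalized[n // 2 - 2 : n // 2 + 3].tolist())
print("last 5:", normalized[-5:].tolist())
```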


Finding 2: Your Context Window Is Smaller Than Advertised

The Research

NVIDIA's RULER benchmark (COLM 2024, arXiv:2404.06654) systematically evaluated whether models actually perform at their claimed context sizes.

Methodology

RULER goes beyond simple retrieval:

  • Multi-hop tracing: Following chains of information
  • Aggregation tasks: Synthesizing across the full context
  • Variable complexity: Multiple targets, not just single needles
  • Tested at various context lengths up to and beyond claimed limits

Results

"Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K."

The effective context window is often 50-70% of the marketed size.
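
You can probe this for your own model in the spirit of RULER's variable-tracking task. The generator below, the 0.8 accuracy threshold, and the reuse of the hypothetical `ask_model()` wrapper from the earlier sketch are all assumptions, not RULER's exact setup:

```python
# Sketch: estimate effective context length by sweeping haystack size with
# a RULER-style variable-tracking needle. ask_model() is the hypothetical
# API wrapper from the earlier sketch; the threshold is an arbitrary choice.
import random

def make_task(n_filler: int, chain_len: int = 3):
    """Bury a chain of variable assignments in filler sentences."""
    value = str(random.randint(10000, 99999))
    names = [f"X{i}" for i in range(1, chain_len + 1)]
    facts = [f"VAR {names[0]} = {value}."]
    facts += [f"VAR {names[i]} = VAR {names[i-1]}." for i in range(1, chain_len)]
    words = ["The sky is blue and the grass is green."] * n_filler
    for fact in facts:  # scatter the chain through the filler
        words.insert(random.randrange(len(words) + 1), fact)
    question = f"Which variables hold the value {value}? List all of them."
    return " ".join(words), question, names

def effective_context(filler_sizes, trials=5, threshold=0.8):
    """Return the first haystack size where accuracy drops below threshold."""
    for n in filler_sizes:
        hits = 0
        for _ in range(trials):
            haystack, question, expected = make_task(n)
            reply = ask_model(f"{haystack}\n\n{question}")
            hits += all(name in reply for name in expected)
        if hits / trials < threshold:
            return n
    return None  # held up at every tested size
```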

Corroboration: The Context Rot Study

Chroma Research (2025) confirmed this with their "Context Rot" study testing 18 models including GPT-4.1 and Claude 4:

"Even under minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways."

Their methodology added crucial variables:

  • Needle-question similarity: Lower semantic similarity accelerates degradation
  • Distractor effects: Related-but-wrong information compounds errors
  • Haystack structure: Coherent vs shuffled text affects processing

Finding 3: NIAH Tests Are Misleading

The Problem

The popular Needle-in-a-Haystack test has become the default benchmark. Modern models ace it. But it measures the wrong thing.

The Research

NoLiMa (arXiv:2502.05167) introduced non-lexical matching requirements:

  • Question: "Which character has been to Helsinki?"
  • Needle: "Actually, Yuki lives next to the Kiasma museum."
  • Requires knowing Kiasma museum is in Helsinki

Results

Llama3.1-70B performance:

| Benchmark | Accuracy at 32K |
|---|---|
| RULER | 94.8% |
| NoLiMa | 43.2% |

The model that "passes" standard benchmarks fails when actual reasoning is required.
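
If you want to run this check against your own stack, the probe is easy to construct: pair a lexical needle with a non-lexical one for the same question. A sketch, with illustrative strings rather than NoLiMa's data, again assuming the hypothetical `ask_model()` wrapper:

```python
# Sketch: lexical vs. non-lexical needles for the same question. A model
# that passes the first but fails the second is string-matching, not
# reasoning. Strings are illustrative, not taken from NoLiMa.
QUESTION = "Which character has been to Helsinki?"
LEXICAL = "Yuki once visited Helsinki."                # shares "Helsinki"
NON_LEXICAL = "Yuki lives next to the Kiasma museum."  # needs a latent fact

def probe(haystack: str, needle: str, depth: int = 2000) -> bool:
    prompt = f"{haystack[:depth]}\n{needle}\n{haystack[depth:]}\n\nQ: {QUESTION}"
    return "yuki" in ask_model(prompt).lower()
```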


Finding 4: RAG vs Long Context

The Research

Li et al. (EMNLP 2024, arXiv:2407.16833) conducted a systematic comparison between Retrieval Augmented Generation and long-context approaches.

Methodology

  • Controlled comparison across multiple task types
  • Measured both performance (accuracy) and efficiency (cost, latency)
  • Tested hybrid approaches

Results

| Metric | RAG | Long Context | Winner |
|---|---|---|---|
| Raw performance | Good | Better | Long Context |
| Cost | Much lower | Higher | RAG |
| Latency consistency | Variable | Predictable | Long Context |
| Setup complexity | Higher | Lower | Long Context |

Long context wins on accuracy when resources are unlimited. RAG wins on economics.

The Hybrid Solution: Self-Route

The paper proposes letting the model self-reflect on each query to choose its own strategy (a routing sketch follows the list):

  • Simple, targeted queries → RAG (cheaper)
  • Complex, multi-faceted queries → Long Context (more accurate)
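
A minimal sketch of this pattern, simplified from the paper's Self-Route; `retrieve()` and `ask_model()` are hypothetical stand-ins for your retriever and LLM client:

```python
# Sketch of Self-Route-style routing: try cheap RAG first, let the model
# decline, and pay for the full long-context pass only when it does.
DECLINE = "unanswerable"

def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("plug in your vector store")

def self_route(query: str, full_context: str) -> str:
    chunks = "\n".join(retrieve(query, k=5))
    answer = ask_model(
        f"Context:\n{chunks}\n\n"
        f"Answer the question, or reply '{DECLINE}' if the context "
        f"is insufficient.\n\nQuestion: {query}"
    )
    if DECLINE in answer.lower():
        # Fall back to the expensive long-context path.
        answer = ask_model(f"Context:\n{full_context}\n\nQuestion: {query}")
    return answer
```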

Finding 5: Position Engineering Works

The Research

Anthropic's internal testing with Claude 2.1 found a striking result:

Adding "Here is the most relevant sentence in the context:" to the start of Claude's response raised accuracy from 27% to 98%.

Same context, same question—just a different prompt structure.
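
The mechanism is response prefilling: the Anthropic Messages API lets you end the messages list with a partial assistant turn, and the model continues from it. A sketch (the model name and inputs are placeholders; the original experiment used Claude 2.1):

```python
# Sketch: prefill the assistant turn so the model begins by quoting the
# most relevant sentence. Model name and inputs are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = "..."  # your document set
question = "..."      # your question

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; pick your model
    max_tokens=512,
    messages=[
        {"role": "user", "content": f"{long_context}\n\n{question}"},
        # Prefilled assistant turn: generation continues from this text.
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ],
)
print(response.content[0].text)
```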

Mitigation Approaches

Two papers specifically address fixing the lost-in-the-middle problem:

  1. Position-Agnostic Decompositional Training (arXiv:2311.09198): Training approach that reduces position sensitivity
  2. Attention Sorting (arXiv:2310.01427): Inference-time technique to combat recency bias

Both show meaningful improvements, but neither is purely a prompting fix: one requires retraining, the other access to the model's attention internals at inference time.

What You Can Do Now

For existing models, strategic positioning remains your primary lever:

  • Put critical information at the beginning
  • Repeat key constraints at the end
  • Never bury important details in the middle

Finding 6: Context Management Is Essential

The Research

Anthropic's "Effective Context Engineering for AI Agents" documentation and the Recursive Summarization paper (arXiv:2308.15022) both address managing context over extended interactions.

Strategies That Work

Compaction: When approaching context limits, summarize and restart. Preserve architectural decisions, unresolved issues, key details. Discard redundant tool outputs and resolved discussions.
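
A minimal compaction sketch, assuming tiktoken for token counting and the hypothetical `ask_model()` wrapper; the ~70% trigger matches the guidance later in this post:

```python
# Sketch: summarize-and-restart once the history nears the token budget.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
BUDGET = 100_000          # assumed model context limit
TRIGGER = 0.7 * BUDGET    # compact at ~70% capacity

def maybe_compact(history: list[str]) -> list[str]:
    if sum(len(ENC.encode(turn)) for turn in history) < TRIGGER:
        return history
    summary = ask_model(
        "Summarize this conversation for a fresh session. Preserve "
        "architectural decisions, unresolved issues, and key details; "
        "drop redundant tool output and resolved discussions:\n\n"
        + "\n".join(history)
    )
    return [f"[Summary of earlier conversation]\n{summary}"]
```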

Structured Note-Taking: The Recursive Summarization paper shows models can maintain coherence across context resets by persisting structured notes for objective tracking, key decisions, and strategic context.

Just-in-Time Retrieval: Rather than pre-loading everything, maintain lightweight identifiers (file paths, query templates) and load data dynamically at runtime.
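
A sketch of that pattern: the context carries only cheap references, and a loader resolves them on demand (the registry shape and paths are assumptions):

```python
# Sketch: keep lightweight identifiers in context; load content on demand.
from pathlib import Path

REFS = {  # what lives in the context window: references, not content
    "auth_module": "src/auth/login.py",
    "error_log": "logs/app.log",
}

def load(ref: str, max_chars: int = 4_000) -> str:
    """Resolve a reference to its content only when the agent asks."""
    return Path(REFS[ref]).read_text()[:max_chars]
```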


Methodology Notes

How These Studies Were Conducted

The research uses several evaluation paradigms:

  • Controlled Position Testing: Systematically varying where target information appears while holding content constant (Liu et al., Chroma)
  • Multi-task Benchmarking: Testing beyond retrieval to include aggregation, multi-hop reasoning, and semantic inference (RULER, NoLiMa, BABILong)
  • Ablation Studies: Isolating variables like distractor presence, similarity levels, and structural patterns (Chroma)
  • Comparative Analysis: Head-to-head evaluation of architectural approaches with controlled conditions (Li et al.)

Limitations

  • Most studies use English text; cross-lingual effects are underexplored
  • Benchmark tasks may not reflect all production use cases
  • Model capabilities evolve faster than publications
  • Industry studies may have undisclosed methodological details

Key Insights

Based on this synthesis of 11 studies:

  1. Effective context is ~50-70% of marketed context: Don't plan for the full window. Budget for degradation.

2.  Position matters more than you think: A swing of 20+ percentage points from position alone. Put critical info at the edges, never in the middle.

  3. Standard benchmarks lie: NIAH tests show near-perfect scores while semantic retrieval fails. Evaluate with realistic queries.

  4. The U-curve is real and consistent: Primacy and recency biases appear across models, architectures, and time. This is structural, not a bug being fixed.

  5. RAG and long context serve different purposes: Performance-critical → long context. Cost-critical → RAG. Best results → hybrid routing.

  6. Prompt structure creates 3x+ performance swings: "Here is the most relevant sentence..." took Claude from 27% to 98%. Engineering the prompt is not optional.

  7. Active context management is required: Compaction, note-taking, and just-in-time retrieval are necessities for extended interactions.

  8. Distractors compound degradation: Related-but-wrong information in context makes errors more likely. Clean context > more context.

  9. Semantic similarity affects robustness: Lower similarity between query and target accelerates performance loss. Design for the hard cases.

  10. Position-agnostic training helps but isn't available everywhere: Research solutions exist but require model-level changes. Prompt engineering remains your primary tool.


Practical Application

An evidence-based prompt structure:

[System Prompt - Clear, specific instructions]
[CRITICAL CONTEXT - Most important information first]
[Supporting details - Moderately important]
[Background information - Least critical, middle zone]
[Reference material - Relevant but lower priority]
[KEY CONSTRAINTS - Repeat critical requirements here]
[User Query]
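
As a concrete sketch, the template reduces to a simple assembly function (the names are illustrative):

```python
# Sketch: assemble a prompt following the edge-loaded template above.
def build_prompt(system: str, critical: str, supporting: str,
                 background: str, reference: str, constraints: str,
                 query: str) -> str:
    return "\n\n".join([
        system,
        f"CRITICAL CONTEXT:\n{critical}",       # edge position: high attention
        f"Supporting details:\n{supporting}",
        f"Background:\n{background}",           # middle zone: least critical
        f"Reference material:\n{reference}",
        f"KEY CONSTRAINTS (repeated):\n{constraints}",  # edge position again
        f"Question: {query}",
    ])
```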

For extended sessions:

  • Monitor token consumption
  • Implement compaction at ~70% capacity
  • Use retrieval tools rather than stuffing context
  • Test your specific use case at realistic lengths

References

  1. Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172

  2. NVIDIA Research. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? COLM. arXiv:2404.06654

  3. Li, Z., et al. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP. arXiv:2407.16833

  4. Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. https://research.trychroma.com/context-rot

  5. NoLiMa: Long-Context Evaluation Beyond Literal Matching. (2025). arXiv:2502.05167

  6. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. (2024). NeurIPS.

  7. Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training. (2023). arXiv:2311.09198

  8. Attention Sorting Combats Recency Bias In Long Context Language Models. (2023). arXiv:2310.01427

  9. Multimodal Needle in a Haystack. (2024). arXiv:2406.11230

  10. Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. (2023). arXiv:2308.15022

  11. Anthropic. (2025). Effective Context Engineering for AI Agents. Anthropic Engineering Documentation.