A Scientific Approach to Context Engineering

Large Language Models now boast context windows of 200K, 1M, even 10M tokens. But here's the uncomfortable truth the marketing doesn't tell you: the ability to accept long contexts does not equal the ability to effectively utilize them.

This post synthesizes findings from 11 key academic studies and industry research papers to provide an evidence-based framework for context engineering.


The Studies

| Study | Authors/Source | Year | Focus |
|---|---|---|---|
| Lost in the Middle | Liu et al. (Stanford) | 2023 | Position bias in long contexts |
| RULER | NVIDIA Research | 2024 | Long-context benchmarking |
| RAG vs Long-Context | Li et al. | 2024 | Retrieval vs context tradeoffs |
| Context Rot | Chroma Research | 2025 | Performance degradation patterns |
| NoLiMa | arXiv:2502.05167 | 2025 | Non-lexical retrieval evaluation |
| BABILong | NeurIPS 2024 | 2024 | Reasoning in long contexts |
| Position-Agnostic Training | arXiv:2311.09198 | 2023 | Mitigating position bias |
| Attention Sorting | arXiv:2310.01427 | 2023 | Combating recency bias |
| Multimodal NIAH | arXiv:2406.11230 | 2024 | Cross-modal retrieval |
| Recursive Summarization | arXiv:2308.15022 | 2023 | Long-term dialogue memory |
| Effective Context Engineering | Anthropic | 2025 | Practical agent guidance |

Finding 1: The Lost in the Middle Problem

The Research

Liu et al. at Stanford conducted a foundational study published in TACL 2023 (arXiv:2307.03172). They tested multiple LLMs on multi-document question answering, systematically varying where relevant information appeared in the context.

Methodology

  • Placed a relevant document at different positions within a set of 20 documents
  • Measured retrieval accuracy across context positions
  • Tested both open-source and closed-source models
  • Controlled for document content and question difficulty

Results

The study revealed a U-shaped performance curve:

  • Beginning positions: ~75% accuracy
  • Middle positions: ~50% accuracy
  • End positions: ~70% accuracy

That's a 20+ percentage point drop simply from changing where information appears.
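
A minimal sketch of this kind of position sweep, assuming a hypothetical `ask_model()` wrapper around whichever LLM API you use; the documents and scoring here are stand-ins, not Liu et al.'s exact setup:

```python
# Sketch: measure QA accuracy as a function of needle position.
# ask_model() is a hypothetical wrapper; wire it to a real LLM client.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM API")

def build_context(distractors: list[str], needle: str, position: int) -> str:
    """Insert the relevant document at `position` among the distractors."""
    docs = distractors[:position] + [needle] + distractors[position:]
    return "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))

def accuracy_by_position(distractors, needle, question, answer, trials=10):
    results = {}
    for pos in range(len(distractors) + 1):
        hits = 0
        for _ in range(trials):
            prompt = f"{build_context(distractors, needle, pos)}\n\nQuestion: {question}"
            hits += answer.lower() in ask_model(prompt).lower()
        results[pos] = hits / trials
    return results  # expect a U-shape: high at the edges, low in the middle
```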

Why It Happens

Transformer attention mechanisms exhibit two biases:

  1. Primacy bias: Attention concentrates on beginning tokens
  2. Recency bias: Causal language modeling makes recent tokens more influential

Middle positions receive diffuse attention from both directions but strong focus from neither.
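
These biases can be observed directly on a small open model. A rough diagnostic, assuming the transformers library and GPT-2 (chosen only because it is small and exposes attention weights):

```python
# Rough diagnostic: total attention mass each position receives, averaged
# over layers and heads. GPT-2 is used only because it is small; any model
# that returns output_attentions works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = " ".join(["The quick brown fox jumps over the lazy dog."] * 40)
inputs = tok(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # -> (query, key)
received = att.sum(dim=0)  # attention each key position receives in total

# Causal masking means early keys are visible to more queries, so divide
# by the number of queries that can actually attend to each position.
n = received.shape[0]
normalized = received / torch.arange(n, 0, -1)
print("first 5:", normalized[:5].tolist())
print("middle 5:", normalized[n // 2 - 2 : n // 2 + 3].tolist())
print("last 5:", normalized[-5:].tolist())
```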


Finding 2: Your Context Window Is Smaller Than Advertised

The Research

NVIDIA's RULER benchmark (COLM 2024, arXiv:2404.06654) systematically evaluated whether models actually perform at their claimed context sizes.

Methodology

RULER goes beyond simple retrieval:

  • Multi-hop tracing: Following chains of information
  • Aggregation tasks: Synthesizing across the full context
  • Variable complexity: Multiple targets, not just single needles
  • Tested at various context lengths up to and beyond claimed limits

Results

"Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K."

The effective context window is often 50-70% of the marketed size.
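
You can probe this for your own model in the spirit of RULER's variable-tracking task. The generator below, the 0.8 accuracy threshold, and the reuse of the hypothetical `ask_model()` wrapper from the earlier sketch are all assumptions, not RULER's exact setup:

```python
# Sketch: estimate effective context length by sweeping haystack size with
# a RULER-style variable-tracking needle. ask_model() is the hypothetical
# API wrapper from the earlier sketch; the threshold is an arbitrary choice.
import random

def make_task(n_filler: int, chain_len: int = 3):
    """Bury a chain of variable assignments in filler sentences."""
    value = str(random.randint(10000, 99999))
    names = [f"X{i}" for i in range(1, chain_len + 1)]
    facts = [f"VAR {names[0]} = {value}."]
    facts += [f"VAR {names[i]} = VAR {names[i-1]}." for i in range(1, chain_len)]
    words = ["The sky is blue and the grass is green."] * n_filler
    for fact in facts:  # scatter the chain through the filler
        words.insert(random.randrange(len(words) + 1), fact)
    question = f"Which variables hold the value {value}? List all of them."
    return " ".join(words), question, names

def effective_context(filler_sizes, trials=5, threshold=0.8):
    """Return the first haystack size where accuracy drops below threshold."""
    for n in filler_sizes:
        hits = 0
        for _ in range(trials):
            haystack, question, expected = make_task(n)
            reply = ask_model(f"{haystack}\n\n{question}")
            hits += all(name in reply for name in expected)
        if hits / trials < threshold:
            return n
    return None  # held up at every tested size
```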

Corroboration: The Context Rot Study

Chroma Research (2025) confirmed this with their "Context Rot" study testing 18 models including GPT-4.1 and Claude 4:

"Even under minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways."

Their methodology added crucial variables:

  • Needle-question similarity: Lower semantic similarity accelerates degradation
  • Distractor effects: Related-but-wrong information compounds errors
  • Haystack structure: Coherent vs shuffled text affects processing

Finding 3: NIAH Tests Are Misleading

The Problem

The popular Needle-in-a-Haystack test has become the default benchmark. Modern models ace it. But it measures the wrong thing.

The Research

NoLiMa (arXiv:2502.05167) introduced non-lexical matching requirements:

  • Question: "Which character has been to Helsinki?"
  • Needle: "Actually, Yuki lives next to the Kiasma museum."
  • Requires knowing Kiasma museum is in Helsinki

Results

Llama3.1-70B performance:

| Benchmark | Accuracy at 32K |
|---|---|
| RULER | 94.8% |
| NoLiMa | 43.2% |

The model that "passes" standard benchmarks fails when actual reasoning is required.
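
If you want to run this check against your own stack, the probe is easy to construct: pair a lexical needle with a non-lexical one for the same question. A sketch, with illustrative strings rather than NoLiMa's data, again assuming the hypothetical `ask_model()` wrapper:

```python
# Sketch: lexical vs. non-lexical needles for the same question. A model
# that passes the first but fails the second is string-matching, not
# reasoning. Strings are illustrative, not taken from NoLiMa.
QUESTION = "Which character has been to Helsinki?"
LEXICAL = "Yuki once visited Helsinki."                # shares "Helsinki"
NON_LEXICAL = "Yuki lives next to the Kiasma museum."  # needs a latent fact

def probe(haystack: str, needle: str, depth: int = 2000) -> bool:
    prompt = f"{haystack[:depth]}\n{needle}\n{haystack[depth:]}\n\nQ: {QUESTION}"
    return "yuki" in ask_model(prompt).lower()
```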


Finding 4: RAG vs Long Context

The Research

Li et al. (EMNLP 2024, arXiv:2407.16833) conducted a systematic comparison between Retrieval Augmented Generation and long-context approaches.

Methodology

  • Controlled comparison across multiple task types
  • Measured both performance (accuracy) and efficiency (cost, latency)
  • Tested hybrid approaches

Results

| Metric | RAG | Long Context | Winner |
|---|---|---|---|
| Raw performance | Good | Better | Long Context |
| Cost | Much lower | Higher | RAG |
| Latency consistency | Variable | Predictable | Long Context |
| Setup complexity | Higher | Lower | Long Context |

Long context wins on accuracy when resources are unlimited. RAG wins on economics.

The Hybrid Solution: Self-Route

The paper proposes letting the model self-reflect on each query to choose its own strategy (a routing sketch follows the list):

  • Simple, targeted queries → RAG (cheaper)
  • Complex, multi-faceted queries → Long Context (more accurate)
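
A minimal sketch of this pattern, simplified from the paper's Self-Route; `retrieve()` and `ask_model()` are hypothetical stand-ins for your retriever and LLM client:

```python
# Sketch of Self-Route-style routing: try cheap RAG first, let the model
# decline, and pay for the full long-context pass only when it does.
DECLINE = "unanswerable"

def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("plug in your vector store")

def self_route(query: str, full_context: str) -> str:
    chunks = "\n".join(retrieve(query, k=5))
    answer = ask_model(
        f"Context:\n{chunks}\n\n"
        f"Answer the question, or reply '{DECLINE}' if the context "
        f"is insufficient.\n\nQuestion: {query}"
    )
    if DECLINE in answer.lower():
        # Fall back to the expensive long-context path.
        answer = ask_model(f"Context:\n{full_context}\n\nQuestion: {query}")
    return answer
```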

Finding 5: Position Engineering Works

The Research

Anthropic's internal testing with Claude 2.1 found a striking result:

Adding "Here is the most relevant sentence in the context:" to the start of Claude's response raised accuracy from 27% to 98%.

Same context, same question—just a different prompt structure.
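
The mechanism is response prefilling: the Anthropic Messages API lets you end the messages list with a partial assistant turn, and the model continues from it. A sketch (the model name and inputs are placeholders; the original experiment used Claude 2.1):

```python
# Sketch: prefill the assistant turn so the model begins by quoting the
# most relevant sentence. Model name and inputs are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = "..."  # your document set
question = "..."      # your question

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; pick your model
    max_tokens=512,
    messages=[
        {"role": "user", "content": f"{long_context}\n\n{question}"},
        # Prefilled assistant turn: generation continues from this text.
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ],
)
print(response.content[0].text)
```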

Mitigation Approaches

Two papers specifically address fixing the lost-in-the-middle problem:

  1. Position-Agnostic Decompositional Training (arXiv:2311.09198): Training approach that reduces position sensitivity
  2. Attention Sorting (arXiv:2310.01427): Inference-time technique to combat recency bias

Both show meaningful improvements, but neither is purely a prompting fix: one requires retraining, the other access to the model's attention internals at inference time.

What You Can Do Now

For existing models, strategic positioning remains your primary lever:

  • Put critical information at the beginning
  • Repeat key constraints at the end
  • Never bury important details in the middle

Finding 6: Context Management Is Essential

The Research

Anthropic's "Effective Context Engineering for AI Agents" documentation and the Recursive Summarization paper (arXiv:2308.15022) both address managing context over extended interactions.

Strategies That Work

Compaction: When approaching context limits, summarize and restart. Preserve architectural decisions, unresolved issues, key details. Discard redundant tool outputs and resolved discussions.
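
A minimal compaction sketch, assuming tiktoken for token counting and the hypothetical `ask_model()` wrapper; the ~70% trigger matches the guidance later in this post:

```python
# Sketch: summarize-and-restart once the history nears the token budget.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
BUDGET = 100_000          # assumed model context limit
TRIGGER = 0.7 * BUDGET    # compact at ~70% capacity

def maybe_compact(history: list[str]) -> list[str]:
    if sum(len(ENC.encode(turn)) for turn in history) < TRIGGER:
        return history
    summary = ask_model(
        "Summarize this conversation for a fresh session. Preserve "
        "architectural decisions, unresolved issues, and key details; "
        "drop redundant tool output and resolved discussions:\n\n"
        + "\n".join(history)
    )
    return [f"[Summary of earlier conversation]\n{summary}"]
```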

Structured Note-Taking: The Recursive Summarization paper shows models can maintain coherence across context resets by persisting structured notes for objective tracking, key decisions, and strategic context.

Just-in-Time Retrieval: Rather than pre-loading everything, maintain lightweight identifiers (file paths, query templates) and load data dynamically at runtime.
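
A sketch of that pattern: the context carries only cheap references, and a loader resolves them on demand (the registry shape and paths are assumptions):

```python
# Sketch: keep lightweight identifiers in context; load content on demand.
from pathlib import Path

REFS = {  # what lives in the context window: references, not content
    "auth_module": "src/auth/login.py",
    "error_log": "logs/app.log",
}

def load(ref: str, max_chars: int = 4_000) -> str:
    """Resolve a reference to its content only when the agent asks."""
    return Path(REFS[ref]).read_text()[:max_chars]
```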


Methodology Notes

How These Studies Were Conducted

The research uses several evaluation paradigms:

  • Controlled Position Testing: Systematically varying where target information appears while holding content constant (Liu et al., Chroma)
  • Multi-task Benchmarking: Testing beyond retrieval to include aggregation, multi-hop reasoning, and semantic inference (RULER, NoLiMa, BABILong)
  • Ablation Studies: Isolating variables like distractor presence, similarity levels, and structural patterns (Chroma)
  • Comparative Analysis: Head-to-head evaluation of architectural approaches with controlled conditions (Li et al.)

Limitations

  • Most studies use English text; cross-lingual effects are underexplored
  • Benchmark tasks may not reflect all production use cases
  • Model capabilities evolve faster than publications
  • Industry studies may have undisclosed methodological details

Key Insights

Based on this synthesis of 11 studies:

  1. Effective context is ~50-70% of marketed context: Don't plan for the full window. Budget for degradation.

2.  Position matters more than you think: A swing of 20+ percentage points from position alone. Put critical info at the edges, never in the middle.

  3. Standard benchmarks lie: NIAH tests show near-perfect scores while semantic retrieval fails. Evaluate with realistic queries.

  4. The U-curve is real and consistent: Primacy and recency biases appear across models, architectures, and time. This is structural, not a bug being fixed.

  5. RAG and long context serve different purposes: Performance-critical → long context. Cost-critical → RAG. Best results → hybrid routing.

  6. Prompt structure creates 3x+ performance swings: "Here is the most relevant sentence..." took Claude from 27% to 98%. Engineering the prompt is not optional.

  7. Active context management is required: Compaction, note-taking, and just-in-time retrieval are necessities for extended interactions.

  8. Distractors compound degradation: Related-but-wrong information in context makes errors more likely. Clean context > more context.

  9. Semantic similarity affects robustness: Lower similarity between query and target accelerates performance loss. Design for the hard cases.

  10. Position-agnostic training helps but isn't available everywhere: Research solutions exist but require model-level changes. Prompt engineering remains your primary tool.


Practical Application

An evidence-based prompt structure:

[System Prompt - Clear, specific instructions]
[CRITICAL CONTEXT - Most important information first]
[Supporting details - Moderately important]
[Background information - Least critical, middle zone]
[Reference material - Relevant but lower priority]
[KEY CONSTRAINTS - Repeat critical requirements here]
[User Query]
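
As a concrete sketch, the template reduces to a simple assembly function (the names are illustrative):

```python
# Sketch: assemble a prompt following the edge-loaded template above.
def build_prompt(system: str, critical: str, supporting: str,
                 background: str, reference: str, constraints: str,
                 query: str) -> str:
    return "\n\n".join([
        system,
        f"CRITICAL CONTEXT:\n{critical}",       # edge position: high attention
        f"Supporting details:\n{supporting}",
        f"Background:\n{background}",           # middle zone: least critical
        f"Reference material:\n{reference}",
        f"KEY CONSTRAINTS (repeated):\n{constraints}",  # edge position again
        f"Question: {query}",
    ])
```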

For extended sessions:

  • Monitor token consumption
  • Implement compaction at ~70% capacity
  • Use retrieval tools rather than stuffing context
  • Test your specific use case at realistic lengths

References

  1. Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172

  2. NVIDIA Research. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? COLM. arXiv:2404.06654

  3. Li, Z., et al. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP. arXiv:2407.16833

  4. Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. https://research.trychroma.com/context-rot

  5. NoLiMa: Long-Context Evaluation Beyond Literal Matching. (2025). arXiv:2502.05167

  6. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. (2024). NeurIPS.

  7. Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training. (2023). arXiv:2311.09198

  8. Attention Sorting Combats Recency Bias In Long Context Language Models. (2023). arXiv:2310.01427

  9. Multimodal Needle in a Haystack. (2024). arXiv:2406.11230

  10. Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. (2023). arXiv:2308.15022

  11. Anthropic. (2025). Effective Context Engineering for AI Agents. Anthropic Engineering Documentation.