A Scientific Approach to Context Engineering
Author: Avasdream (@avasdream_)
Large Language Models now boast context windows of 200K, 1M, even 10M tokens. But here's the uncomfortable truth the marketing doesn't tell you: the ability to accept long contexts does not equal the ability to effectively utilize them.
This post synthesizes findings from 11 key academic studies and industry research papers to provide an evidence-based framework for context engineering.
The Studies
| Study | Authors/Source | Year | Focus |
|---|---|---|---|
| Lost in the Middle | Liu et al. (Stanford) | 2023 | Position bias in long contexts |
| RULER | NVIDIA Research | 2024 | Long-context benchmarking |
| RAG vs Long-Context | Li et al. | 2024 | Retrieval vs context tradeoffs |
| Context Rot | Chroma Research | 2025 | Performance degradation patterns |
| NoLiMa | arXiv:2502.05167 | 2025 | Non-lexical retrieval evaluation |
| BABILong | NeurIPS 2024 | 2024 | Reasoning in long contexts |
| Position-Agnostic Training | arXiv:2311.09198 | 2023 | Mitigating position bias |
| Attention Sorting | arXiv:2310.01427 | 2023 | Combating recency bias |
| Multimodal NIAH | arXiv:2406.11230 | 2024 | Cross-modal retrieval |
| Recursive Summarization | arXiv:2308.15022 | 2023 | Long-term dialogue memory |
| Effective Context Engineering | Anthropic | 2025 | Practical agent guidance |
Finding 1: The Lost in the Middle Problem
The Research
Liu et al. at Stanford conducted a foundational study published in TACL 2023 (arXiv:2307.03172). They tested multiple LLMs on multi-document question answering, systematically varying where relevant information appeared in the context.
Methodology
- Placed a relevant document at different positions within a set of 20 documents
- Measured retrieval accuracy across context positions
- Tested both open-source and closed-source models
- Controlled for document content and question difficulty
Results
The study revealed a U-shaped performance curve:
- Beginning positions: ~75% accuracy
- Middle positions: ~50% accuracy
- End positions: ~70% accuracy
That's a 20+ percentage point drop simply from changing where information appears.
Why It Happens
Transformer attention mechanisms exhibit two biases:
- Primacy bias: Attention concentrates on beginning tokens
- Recency bias: Causal language modeling makes recent tokens more influential
Middle positions receive diffuse attention from both directions but strong focus from neither.
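If you want to check whether your own model and task show the same curve, here is a minimal sketch of a position sweep in the spirit of Liu et al.'s methodology. `query_model` stands in for whatever LLM client you use, and the documents, question, and substring-match scoring are placeholders, not the paper's setup.

```python
# Minimal position-sweep harness: place the relevant document at the start,
# middle, and end of a stack of distractors and compare accuracy.
import random

def build_context(relevant_doc: str, distractor_docs: list[str], position: int) -> str:
    """Insert the relevant document at a fixed position among distractors."""
    docs = distractor_docs[:position] + [relevant_doc] + distractor_docs[position:]
    return "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))

def position_sweep(relevant_doc, distractor_docs, question, expected_answer,
                   query_model, trials: int = 20) -> dict[str, float]:
    """Measure accuracy with the relevant document at the start, middle, and end."""
    results = {}
    for name, position in [("start", 0),
                           ("middle", len(distractor_docs) // 2),
                           ("end", len(distractor_docs))]:
        correct = 0
        for _ in range(trials):
            random.shuffle(distractor_docs)  # fresh haystack ordering each trial
            context = build_context(relevant_doc, distractor_docs, position)
            answer = query_model(f"{context}\n\nQuestion: {question}")
            correct += expected_answer.lower() in answer.lower()
        results[name] = correct / trials
    return results  # e.g. {"start": 0.75, "middle": 0.50, "end": 0.70}
```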
Finding 2: Your Context Window Is Smaller Than Advertised
The Research
NVIDIA's RULER benchmark (COLM 2024, arXiv:2404.06654) systematically evaluated whether models actually perform at their claimed context sizes.
Methodology
RULER goes beyond simple retrieval:
- Multi-hop tracing: Following chains of information
- Aggregation tasks: Synthesizing across the full context
- Variable complexity: Multiple targets, not just single needles
- Tested at various context lengths up to and beyond claimed limits
Results
"Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K."
The effective context window is often 50-70% of the marketed size.
Corroboration: The Context Rot Study
Chroma Research (2025) confirmed this with their "Context Rot" study testing 18 models including GPT-4.1 and Claude 4:
"Even under minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways."
Their methodology added crucial variables:
- Needle-question similarity: Lower semantic similarity accelerates degradation
- Distractor effects: Related-but-wrong information compounds errors
- Haystack structure: Coherent vs shuffled text affects processing
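These sweeps are straightforward to reproduce. Below is a rough sketch, in the spirit of RULER and Context Rot, of probing an "effective context length": the same retrieval task is rerun at growing input sizes, and we record the largest length that still clears a target accuracy. `query_model`, the filler text, and the 0.85 threshold are assumptions, not any paper's exact protocol.

```python
# Probe how far accuracy holds up as the input grows, varying needle depth as well.

def run_at_length(needle: str, question: str, expected: str,
                  filler: str, length_chars: int, query_model, trials: int = 10) -> float:
    correct = 0
    for i in range(trials):
        haystack = (filler * (length_chars // len(filler)))[:length_chars]
        insert_at = (i * length_chars) // trials          # vary where the needle sits
        context = haystack[:insert_at] + "\n" + needle + "\n" + haystack[insert_at:]
        answer = query_model(f"{context}\n\nQuestion: {question}")
        correct += expected.lower() in answer.lower()
    return correct / trials

def effective_length(needle, question, expected, filler, query_model,
                     lengths=(8_000, 32_000, 128_000, 512_000), threshold=0.85):
    effective = 0
    for length in lengths:
        accuracy = run_at_length(needle, question, expected, filler, length, query_model)
        print(f"{length:>8} chars: {accuracy:.0%}")
        if accuracy >= threshold:
            effective = length   # largest length that still passes
    return effective             # often well below the advertised window
```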
Finding 3: NIAH Tests Are Misleading
The Problem
The popular Needle-in-a-Haystack test has become the default benchmark. Modern models ace it. But it measures the wrong thing.
The Research
NoLiMa (arXiv:2502.05167) introduced non-lexical matching requirements:
- Question: "Which character has been to Helsinki?"
- Needle: "Actually, Yuki lives next to the Kiasma museum."
- Answering requires knowing that the Kiasma museum is in Helsinki
Results
Llama3.1-70B performance:
| Benchmark | Accuracy at 32K |
|---|---|
| RULER | 94.8% |
| NoLiMa | 43.2% |
The model that "passes" standard benchmarks fails when actual reasoning is required.
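To make the distinction concrete, here is what the two styles of test case look like side by side. The dictionary schema is an illustration, not NoLiMa's actual format.

```python
# A lexical NIAH probe can be solved by surface matching on "Helsinki";
# the NoLiMa-style probe requires the world knowledge that Kiasma is in Helsinki.
test_cases = [
    {
        "style": "lexical",
        "needle": "Yuki has visited Helsinki twice.",
        "question": "Which character has been to Helsinki?",
        "expected": "Yuki",
    },
    {
        "style": "non-lexical",
        "needle": "Actually, Yuki lives next to the Kiasma museum.",
        "question": "Which character has been to Helsinki?",
        "expected": "Yuki",
    },
]
```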
Finding 4: RAG vs Long Context
The Research
Li et al. (EMNLP 2024, arXiv:2407.16833) conducted a systematic comparison between Retrieval Augmented Generation and long-context approaches.
Methodology
- Controlled comparison across multiple task types
- Measured both performance (accuracy) and efficiency (cost, latency)
- Tested hybrid approaches
Results
| Metric | RAG | Long Context | Winner |
|---|---|---|---|
| Raw Performance | Good | Better | Long Context |
| Cost Efficiency | Much Lower | Higher | RAG |
| Latency Consistency | Variable | Predictable | Long Context |
| Setup Complexity | Higher | Lower | Long Context |
Long context wins on accuracy when resources are unlimited. RAG wins on economics.
The Hybrid Solution: Self-Route
The paper proposes letting the model self-reflect on each query and choose a strategy:
- Simple, targeted queries → RAG (cheaper)
- Complex, multi-faceted queries → Long Context (more accurate)
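In code, the routing logic is small. The sketch below follows the general Self-Route idea of trying the cheap RAG path first and falling back to the full document only when the model declines; `retrieve`, `query_model`, and the exact UNANSWERABLE convention are assumptions, not the paper's implementation.

```python
# Try RAG first; escalate to the full long context only when the model says
# the retrieved excerpts are not enough.

def self_route(query: str, full_document: str, retrieve, query_model, top_k: int = 5) -> str:
    chunks = retrieve(query, k=top_k)                     # cheap path: a few chunks
    rag_prompt = (
        "Answer the question using only the excerpts below. "
        "If they are not sufficient, reply exactly UNANSWERABLE.\n\n"
        + "\n\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = query_model(rag_prompt)
    if "UNANSWERABLE" not in answer:
        return answer                                     # simple query: RAG was enough
    # Expensive path: complex, multi-faceted query gets the whole document.
    return query_model(f"{full_document}\n\nQuestion: {query}")
```

The payoff is economic: most queries never pay the long-context price, while the hard ones still get the accuracy benefit.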
Finding 5: Position Engineering Works
The Research
Anthropic's internal testing with Claude 2.1 found a striking result:
Adding "Here is the most relevant sentence in the context:" to the start of Claude's response raised accuracy from 27% to 98%.
Same context, same question—just a different prompt structure.
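With today's Messages API you can reproduce the same trick by prefilling the assistant turn. The sketch below is illustrative: the model name and prompt wording are assumptions, and the original 27%-to-98% result was measured on Claude 2.1, not this exact setup.

```python
# Prefill the assistant turn so the model continues from "Here is the most
# relevant sentence in the context:" before answering.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = open("context.txt").read()            # the assembled documents
question = "Which character has been to Helsinki?"   # example query

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model choice
    max_tokens=512,
    messages=[
        {"role": "user", "content": f"{long_context}\n\nQuestion: {question}"},
        # Prefilled assistant turn: the model continues from this exact text.
        {"role": "assistant", "content": "Here is the most relevant sentence in the context:"},
    ],
)
print(response.content[0].text)
```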
Mitigation Approaches
Two papers specifically address fixing the lost-in-the-middle problem:
- Position-Agnostic Decompositional Training (arXiv:2311.09198): Training approach that reduces position sensitivity
- Attention Sorting (arXiv:2310.01427): Inference-time technique to combat recency bias
Both show meaningful improvements, but they require retraining or a custom inference pipeline rather than a simple prompt change.
What You Can Do Now
For existing models, strategic positioning remains your primary lever:
- Put critical information at the beginning
- Repeat key constraints at the end
- Never bury important details in the middle
Finding 6: Context Management Is Essential
The Research
Anthropic's "Effective Context Engineering for AI Agents" documentation and the Recursive Summarization paper (arXiv:2308.15022) both address managing context over extended interactions.
Strategies That Work
Compaction: When approaching context limits, summarize and restart. Preserve architectural decisions, unresolved issues, key details. Discard redundant tool outputs and resolved discussions.
Structured Note-Taking: The Recursive Summarization paper shows models can maintain coherence across context resets by persisting structured notes for objective tracking, key decisions, and strategic context.
Just-in-Time Retrieval: Rather than pre-loading everything, maintain lightweight identifiers (file paths, query templates) and load data dynamically at runtime.
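A compaction pass can be as simple as a token check plus a summarization call. The sketch below is one way to wire it up, assuming helper functions `count_tokens` and `summarize` and that the first message holds the system prompt; the 70% threshold mirrors the guidance later in this post.

```python
# When the running transcript nears the usable budget, distill it into
# structured notes and restart the context from those notes.

COMPACTION_THRESHOLD = 0.7   # start compacting at ~70% of the usable window

def maybe_compact(messages: list[dict], context_limit: int,
                  count_tokens, summarize) -> list[dict]:
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < COMPACTION_THRESHOLD * context_limit:
        return messages   # plenty of room, keep the full transcript

    # Persist what the research says is worth keeping; drop the rest.
    notes = summarize(
        messages,
        instructions="Keep architectural decisions, unresolved issues, and key "
                     "constraints. Discard resolved discussions and raw tool output.",
    )
    return [
        messages[0],                                           # system prompt
        {"role": "user", "content": f"Session notes so far:\n{notes}"},
    ]
```

Whatever the implementation, what matters is what survives the reset: decisions, open issues, and constraints rather than raw transcript.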
Methodology Notes
How These Studies Were Conducted
The research uses several evaluation paradigms:
- Controlled Position Testing: Systematically varying where target information appears while holding content constant (Liu et al., Chroma)
- Multi-task Benchmarking: Testing beyond retrieval to include aggregation, multi-hop reasoning, and semantic inference (RULER, NoLiMa, BABILong)
- Ablation Studies: Isolating variables like distractor presence, similarity levels, and structural patterns (Chroma)
- Comparative Analysis: Head-to-head evaluation of architectural approaches with controlled conditions (Li et al.)
Limitations
- Most studies use English text; cross-lingual effects are underexplored
- Benchmark tasks may not reflect all production use cases
- Model capabilities evolve faster than publications
- Industry studies may have undisclosed methodological details
Key Insights
Based on this synthesis of 11 studies:
- Effective context is ~50-70% of marketed context: Don't plan for the full window. Budget for degradation.
- Position matters more than you think: A 20%+ accuracy swing from position alone. Put critical info at the edges, never in the middle.
- Standard benchmarks lie: NIAH tests show near-perfect scores while semantic retrieval fails. Evaluate with realistic queries.
- The U-curve is real and consistent: Primacy and recency biases appear across models, architectures, and time. This is structural, not a bug that is about to be fixed.
- RAG and long context serve different purposes: Performance-critical → long context. Cost-critical → RAG. Best results → hybrid routing.
- Prompt structure creates 3x+ performance swings: "Here is the most relevant sentence..." took Claude from 27% to 98%. Engineering the prompt is not optional.
- Active context management is required: Compaction, note-taking, and just-in-time retrieval are necessities for extended interactions.
- Distractors compound degradation: Related-but-wrong information in context makes errors more likely. Clean context > more context.
- Semantic similarity affects robustness: Lower similarity between query and target accelerates performance loss. Design for the hard cases.
- Position-agnostic training helps but isn't available everywhere: Research solutions exist but require model-level changes. Prompt engineering remains your primary tool.
Practical Application
An evidence-based prompt structure:
```
[System Prompt - Clear, specific instructions]
[CRITICAL CONTEXT - Most important information first]
[Supporting details - Moderately important]
[Background information - Least critical, middle zone]
[Reference material - Relevant but lower priority]
[KEY CONSTRAINTS - Repeat critical requirements here]
[User Query]
```
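Assembled programmatically, that structure might look like the following sketch; the section names mirror the template above, and the function is illustrative rather than prescriptive.

```python
# Build a prompt with critical material at the edges, background in the
# middle, and key constraints repeated just before the query.

def build_prompt(system: str, critical: str, supporting: str,
                 background: str, reference: str, constraints: str, query: str) -> str:
    sections = [
        system,                                          # clear, specific instructions
        f"CRITICAL CONTEXT:\n{critical}",                # most important info first
        f"Supporting details:\n{supporting}",
        f"Background:\n{background}",                    # least critical, middle zone
        f"Reference material:\n{reference}",
        f"KEY CONSTRAINTS (repeated):\n{constraints}",   # restate near the end
        f"User query:\n{query}",
    ]
    return "\n\n".join(s for s in sections if s.strip())
```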
For extended sessions:
- Monitor token consumption
- Implement compaction at ~70% capacity
- Use retrieval tools rather than stuffing context
- Test your specific use case at realistic lengths
References
Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL. arXiv:2307.03172
NVIDIA Research. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? COLM. arXiv:2404.06654
Li, Z., et al. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. EMNLP. arXiv:2407.16833
Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. https://research.trychroma.com/context-rot
NoLiMa: Long-Context Evaluation Beyond Literal Matching. (2025). arXiv:2502.05167
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. (2024). NeurIPS.
Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training. (2023). arXiv:2311.09198
Attention Sorting Combats Recency Bias In Long Context Language Models. (2023). arXiv:2310.01427
Multimodal Needle in a Haystack. (2024). arXiv:2406.11230
Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models. (2023). arXiv:2308.15022
Anthropic. (2025). Effective Context Engineering for AI Agents. Anthropic Engineering Documentation.