Why Your RAG System May Be Underperforming

This is part 1 of a three-part series:

Part 1: Why Your RAG System May Be Underperforming

Part 2: Using Tool Calling to Go Beyond RAG

Part 3: Agentic AI Is a Systems Design Problem

‍

Faced with the question “What is the capital of the Netherlands?” you have a few possible responses:

‍Answer confidently : If you know it‍
Look it up : If uncertain ‍
Take a guess : Might be wrong

Large Language Models (LLMs) face the same challenge. They excel when a question falls inside their training data, but when it doesn’t, they may “hallucinate,” producing an answer that sounds plausible but is wrong.

The key difference is that LLMs don’t have direct access to your enterprise data or knowledge bases without additional retrieval methods. That’s where Retrieval-Augmented Generation (RAG) comes in.

RAG in a Nutshell

RAG is the process of giving an LLM access to relevant, external information so it can answer queries more accurately. The typical RAG workflow looks like this:

User query: A user asks a question.
Retrieval: A separate system searches a knowledge base for relevant documents or data.
Augmentation: The retrieved content is combined with the query and sent to the LLM, which generates a response.

The value of RAG is that it allows models of any size to deliver high-quality, context-aware answers, whether it’s the latest company policy, current product details, or niche industry knowledge. But RAG doesn’t operate in isolation. For RAG to deliver consistently, it needs to be part of a well-designed information environment, also known as context engineering.

The Shift from Prompt to Context Engineering

In the early days, “prompt engineering” was the art of crafting the right wording to get the right answer. But as AI systems have grown more complex, the industry has realized that context quality of context matters more than the cleverness of the prompt.

Context engineering builds the full information environment around the LLM, not just the immediate instruction, but also system settings, past conversation history, retrieved documents, tools, and output formats.

Prompt Engineering : Shaping single-turn prompts for answers
‍Context Engineering : Shaping context for multi-step tasks

RAG is a critical part of context engineering, ensuring that the model’s “world” includes the exact information needed for the task.

It’s Not Your RAG, It’s Your Context

In real-world deployments, many RAG systems disappoint, and the issue is almost never the model. It’s bad context engineering. Common pitfalls include:

Irrelevant retrieval: Pulling the wrong documents wastes tokens and distracts the model.
Excessive retrieval: Overloading the context window with too much data.
Token limits and truncation: Cutting off content can cause the model to miss critical context.
Incomplete context: Missing critical information like user profiles or prior steps.

Imagine an AI system reviewing legal contracts that confidently reports a key clause is missing. In reality, the clause exists, but the retrieval process never pulled it into the model’s context. This kind of gap shows why careful retrieval design is essential.

Engineering Retrieval for Success

Preventing these failures starts with designing retrieval around the business use case:

Score for relevance: Don’t just match keywords; ensure retrieved content truly answers the question.
Chunk intelligently: Break documents into logical, searchable segments.
Compress when needed: Summarize or strip redundancy to avoid token waste.
Preserve essentials: Keep high-priority context like instructions and user state intact.

Done well, RAG produces grounded, fresh, scalable, and personalized AI outputs. But in many real-world environments, not all the information you need is text. From images and videos to audio clips and charts, handling different content formats introduces new retrieval challenges — and that’s where multi-modal context comes in.

Handling Multi-Modal Context

Most embedding models are optimized for a single type of data, and text models usually outperform others. Multi-modal embeddings (for example, image plus text models) often underdeliver in production.

A surprisingly effective solution is to convert all content to text before retrieval.

For example:

Images: Use a vision-language model to generate captions.
Videos with speech: Transcribe audio using a tool like Whisper.
Videos without speech: Extract keyframes and caption them.

By indexing text representations, retrieval accuracy for non-text content improves dramatically.

RAG in the Real World

OneSix built an AI-powered chatbot for a higher education client to help students get answers faster.

By applying RAG, the chatbot summarized thousands of unstructured documents, giving students accurate answers instantly and helping the university better serve its community.

Real-world RAG success comes from context engineering, feeding models the right information to deliver accurate, reliable, business-ready answers.

Beyond the Prompt: Why Your RAG System May Be Underperforming

This is part 1 of a three-part series:

RAG in a Nutshell

The Shift from Prompt to Context Engineering

It’s Not Your RAG, It’s Your Context

Engineering Retrieval for Success

Handling Multi-Modal Context

RAG in the Real World

Related Insights

Snowflake Cortex: Bringing ML and AI Solutions to Your Data

Preparing Your Data for AI: 5 Key Requirements

OneSix and CTI Data Unite to Help Enterprises Scale AI Responsibly

Every engagement starts with a conversation.