The Self-Correcting RAG: Implementing Agentic and Recursive Retrieval Loops
Why Static RAG Falls Short, and Why Your Brain Doesn't
Let's be blunt: the standard RAG pipeline, while a significant leap, is fundamentally limited. You embed a query, retrieve a handful of chunks, cram them into a prompt, and pray the LLM can synthesize a coherent answer. It's a single-shot, often naive approach that assumes your initial retrieval is perfectly comprehensive and relevant.
But think about how your brain works. When you're trying to understand a complex topic or solve a problem, you don't just recall one memory and call it a day. You probe, you realize you're missing information, you formulate follow-up questions, you search for new data, you cross-reference, you synthesize. This iterative, self-correcting process is how we build a robust understanding.
My fascination with replicating aspects of the human brain, particularly through architectures like MoE (Mixture of Experts), drives me to push past these simplistic models. Our goal shouldn't be to just retrieve, but to reason about retrieval. To identify knowledge gaps, articulate new search queries, and recursively build a more complete, nuanced answer.
This isn't just about throwing more context at an LLM; it's about empowering the LLM to become an intelligent orchestrator of its own knowledge acquisition. This is the next frontier of RAG: agentic and recursive retrieval loops, inspired by cutting-edge concepts from papers like FLARE and Self-RAG.
And no, we won't be using bloated, opinionated frameworks that obscure the underlying mechanics. We're going raw, direct, and performant. TypeScript, raw APIs, and precise control are our tools of choice.
Architecture: The Recursive Retrieval Agent
Our self-correcting RAG system operates as a recursive agent. The core idea is that an LLM evaluates the currently retrieved context against the original query. If the context is insufficient or incomplete, the LLM intelligently formulates a new search query and initiates another retrieval step. This loop continues until the LLM deems the answer satisfactory or a maximum iteration limit is reached.
Here’s the high-level architecture (a compact code sketch of this loop follows the list):
- Initial Query & Retrieval: The process begins with a user query. We generate an embedding and perform an initial search against our vector store.
- Context Evaluation Agent (LLM): This is the brain of our system. It receives the original query, all currently accumulated context, and, critically, the history of searches already performed. It decides:
- Is the current context sufficient to answer the query comprehensively?
- If not, what specific information is missing?
- What new search query should be formulated to bridge this knowledge gap?
- Recursive Retrieval & Context Accumulation: If the agent decides more information is needed, the new search query is executed against the vector store. The newly retrieved documents are added to our accumulating context.
- Synthesis & Output: Once the agent signals completion, the LLM performs a final synthesis, generating the comprehensive answer based on all gathered information.
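To make that control flow concrete before building each piece, here's a minimal, self-contained sketch of the loop. The Decision type and the retrieve/evaluate callbacks here are simplified stand-ins for the AgentDecision interface and the functions implemented in the sections below.

// Skeleton of the recursive retrieval loop; retrieve/evaluate are injected for illustration.
type Decision =
  | { status: 'COMPLETE' }
  | { status: 'NEEDS_MORE_INFO'; followUpQuery: string };

async function retrievalLoop(
  query: string,
  retrieve: (q: string) => Promise<string[]>,
  evaluate: (query: string, context: string[]) => Promise<Decision>,
  maxIterations = 3
): Promise<string[]> {
  let context: string[] = [];
  let searchQuery = query;
  for (let i = 0; i < maxIterations; i++) {
    context = [...context, ...(await retrieve(searchQuery))]; // accumulate evidence
    const decision = await evaluate(query, context); // the LLM judges sufficiency
    if (decision.status === 'COMPLETE') break; // enough context to answer
    searchQuery = decision.followUpQuery; // otherwise refine the query and loop again
  }
  return context; // handed to a final synthesis prompt
}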
Core Components & Implementation Details
1. The Vector Store & Embedding Layer
We'll assume a basic interface for our vector database. Whether you're using Pinecone, Weaviate, Qdrant, or a local FAISS index, the interaction will boil down to embedding a query and fetching nearest neighbors.
// For brevity, we'll abstract away the specific Vector DB client
// In a real project, this would wrap your specific client (e.g., Pinecone, Weaviate)
interface VectorDB {
search: (embedding: number[], k: number) => Promise<Document[]>;
}
interface Document {
id: string;
content: string;
metadata?: Record<string, any>;
embedding?: number[]; // Potentially store embeddings, or generate on-the-fly
}
// A simple abstraction for an embedding service
const getEmbedding = async (text: string): Promise<number[]> => {
// Example using OpenAI's embedding API directly
const response = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
},
body: JSON.stringify({
input: text,
model: 'text-embedding-ada-002', // Or a more performant model like 'text-embedding-3-small'
}),
});
if (!response.ok) {
throw new Error(`Embedding API error: ${response.statusText}`);
}
const data = await response.json();
return data.data[0].embedding;
};
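Before wiring up a hosted vector database, a naive in-memory implementation of the VectorDB interface is handy for exercising the loop locally. This is a sketch of my own, assuming each Document already has its embedding populated; the brute-force cosine-similarity scan is obviously not how a production store like Pinecone or Qdrant works under the hood.

// Naive in-memory VectorDB for local experiments (assumes Document.embedding is populated).
const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
};

const createInMemoryVectorDB = (docs: Document[]): VectorDB => ({
  search: async (embedding: number[], k: number): Promise<Document[]> =>
    docs
      .filter(d => Array.isArray(d.embedding))
      .map(d => ({ doc: d, score: cosineSimilarity(embedding, d.embedding!) }))
      .sort((a, b) => b.score - a.score) // highest similarity first
      .slice(0, k)
      .map(({ doc }) => doc),
});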
2. The LLM Interaction Layer
We need a clean, direct way to interact with our LLM provider (e.g., OpenAI, Anthropic, Agno). No unnecessary abstractions; just raw HTTP calls.
type LLMCallOptions = {
  model: string;
  temperature?: number;
  max_tokens?: number;
  json?: boolean; // request structured JSON output (used by the evaluation agent)
};
// Generic LLM call function
const callLLM = async (messages: Array<{ role: 'system' | 'user' | 'assistant', content: string }>, options: LLMCallOptions): Promise<string> => {
// Example for OpenAI API
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
},
    body: JSON.stringify({
      model: options.model,
      messages: messages,
      temperature: options.temperature ?? 0.7,
      max_tokens: options.max_tokens,
      // Only request structured output when the caller asks for it; forcing JSON mode on the
      // final synthesis call would break the free-form prose answer.
      ...(options.json ? { response_format: { type: 'json_object' } } : {}),
    }),
});
if (!response.ok) {
throw new Error(`LLM API error: ${response.statusText}`);
}
const data = await response.json();
return data.choices[0].message.content;
};
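Both getEmbedding and callLLM are thin fetch wrappers, so rate limits and transient 5xx responses surface as thrown errors. A small retry helper with exponential backoff (a generic sketch, not tied to any provider SDK) can wrap either call; we'll revisit error handling in the considerations at the end.

// Generic retry with exponential backoff for transient API failures.
const withRetry = async <T>(
  fn: () => Promise<T>,
  attempts: number = 3,
  baseDelayMs: number = 500
): Promise<T> => {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Wait 500ms, 1000ms, 2000ms, ... before the next attempt.
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
};

// Usage: const embedding = await withRetry(() => getEmbedding(query));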
3. The Context Evaluation Agent (Prompt Engineering)
This is where the magic happens. We'll use a structured JSON output for the LLM to make it easier to parse and control the flow.
interface AgentDecision {
status: 'COMPLETE' | 'NEEDS_MORE_INFO';
followUpQuery?: string;
critique?: string;
}
const getAgentDecisionPrompt = (originalQuery: string, currentContext: string[], accumulatedSearchHistory: string[], maxTokens: number): Array<{ role: 'system' | 'user', content: string }> => [
{
role: 'system',
content: `You are an expert research agent. Your task is to evaluate a user's query against provided context.
Your goal is to determine if the current context is sufficient to answer the original query comprehensively.
If it is, output status: 'COMPLETE'.
If not, you must identify the knowledge gaps, provide a 'critique' of what's missing, and formulate a precise 'followUpQuery' to find the missing information.
The follow-up query should be specific and target the identified gap.
The final answer will be limited to ${maxTokens} tokens, so manage that budget wisely.
Output your decision as a JSON object with 'status', 'followUpQuery' (if status is 'NEEDS_MORE_INFO'), and 'critique' (if status is 'NEEDS_MORE_INFO').`
},
{
role: 'user',
content: `Original Query: "${originalQuery}"
---
Currently Accumulated Context:
${currentContext.length > 0 ? currentContext.map((doc, i) => `Document ${i + 1}:\n${doc}`).join('\n---\n') : 'No context yet.'}
---
Previous Search History (to avoid redundant searches):
${accumulatedSearchHistory.length > 0 ? accumulatedSearchHistory.join('\n') : 'No previous searches.'}
---
Based on the above, is the current context sufficient? If not, what should the next search query be?
JSON Output:`
}
];
const evaluateContext = async (
originalQuery: string,
currentContext: string[],
accumulatedSearchHistory: string[],
llmOptions: LLMCallOptions,
maxTokensForFinalAnswer: number
): Promise<AgentDecision> => {
const prompt = getAgentDecisionPrompt(originalQuery, currentContext, accumulatedSearchHistory, maxTokensForFinalAnswer);
const responseJson = await callLLM(prompt, { ...llmOptions, json: true }); // structured output for the decision
try {
return JSON.parse(responseJson) as AgentDecision;
} catch (error) {
console.error("Failed to parse agent decision JSON:", responseJson, error);
// Fallback: If LLM screws up JSON, assume it needs more info with a generic query
return {
status: 'NEEDS_MORE_INFO',
followUpQuery: `Refine search for "${originalQuery}" based on current context.`,
critique: "LLM failed to output valid JSON. Assuming more info needed."
};
}
};
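JSON.parse succeeding doesn't guarantee the parsed object actually matches AgentDecision, so it's worth validating the shape before trusting it. Here's a minimal hand-rolled guard (my own sketch; no schema library assumed) that evaluateContext could apply before returning, falling back to the generic "needs more info" decision when it returns null:

// Narrow an unknown parsed value to an AgentDecision, or return null if malformed.
const asAgentDecision = (value: unknown): AgentDecision | null => {
  if (typeof value !== 'object' || value === null) return null;
  const candidate = value as Partial<AgentDecision>;
  if (candidate.status === 'COMPLETE') {
    return { status: 'COMPLETE', critique: candidate.critique };
  }
  if (
    candidate.status === 'NEEDS_MORE_INFO' &&
    typeof candidate.followUpQuery === 'string' &&
    candidate.followUpQuery.trim().length > 0
  ) {
    return {
      status: 'NEEDS_MORE_INFO',
      followUpQuery: candidate.followUpQuery,
      critique: candidate.critique,
    };
  }
  return null; // malformed: missing status or follow-up query
};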
4. The Recursive RAG Loop
This orchestrates the entire process. Efficiency here is key, especially managing token windows and preventing infinite loops.
const MAX_RAG_ITERATIONS = 3; // Limit the number of recursive searches
const CONTEXT_CHUNK_SIZE = 4000; // Max tokens per context chunk before summarization/truncation (not enforced in this minimal loop; see Optimizations below)
const FINAL_ANSWER_MAX_TOKENS = 1500; // Max tokens for the final generated answer
interface RecursiveRAGResult {
answer: string;
iterations: number;
finalContext: string[];
searchHistory: string[];
}
const runRecursiveRAG = async (
userQuery: string,
vectorDB: VectorDB,
llmOptions: LLMCallOptions,
k_initial: number = 5, // Number of docs for initial search
k_followup: number = 3 // Number of docs for follow-up searches
): Promise<RecursiveRAGResult> => {
let accumulatedContext: string[] = [];
let searchHistory: string[] = [];
let currentIteration = 0;
let currentSearchQuery = userQuery;
while (currentIteration < MAX_RAG_ITERATIONS) {
currentIteration++;
console.log(`--- Iteration ${currentIteration}: Searching for "${currentSearchQuery}" ---`);
searchHistory.push(currentSearchQuery);
const embedding = await getEmbedding(currentSearchQuery);
const docs = await vectorDB.search(embedding, currentIteration === 1 ? k_initial : k_followup);
const newContexts = docs.map(d => d.content);
// Simple context management: Add new context, deduplicate, and maybe truncate if too large
// For production, this would involve more sophisticated summarization or re-ranking
accumulatedContext = Array.from(new Set([...accumulatedContext, ...newContexts]));
// Prompt LLM to evaluate context and decide next steps
const decision = await evaluateContext(
userQuery,
accumulatedContext,
searchHistory,
llmOptions,
FINAL_ANSWER_MAX_TOKENS
);
if (decision.status === 'COMPLETE') {
console.log(`Agent decided context is COMPLETE. Critique: ${decision.critique || 'N/A'}`);
break; // Exit loop, context is deemed sufficient
} else {
console.log(`Agent needs more info. Critique: ${decision.critique}. Follow-up Query: "${decision.followUpQuery}"`);
currentSearchQuery = decision.followUpQuery!; // LLM must provide a query if status is 'NEEDS_MORE_INFO'
}
}
// Final synthesis step
console.log("--- Final Synthesis ---");
const finalSynthesisPrompt: Array<{ role: 'system' | 'user' | 'assistant'; content: string }> = [
{
role: 'system',
content: `You are an expert answer generator. Based on the provided context, answer the original query comprehensively and concisely.
Ensure your answer directly addresses all aspects of the query. If the context is insufficient for a part of the query, state that explicitly.
Your final answer should not exceed ${FINAL_ANSWER_MAX_TOKENS} tokens.`
},
{
role: 'user',
content: `Original Query: "${userQuery}"
---
Context for Synthesis:
${accumulatedContext.map((doc, i) => `Document ${i + 1}:\n${doc}`).join('\n---\n')}
---
Please provide the most comprehensive answer possible based on the above context.`
}
];
const finalAnswer = await callLLM(finalSynthesisPrompt, { ...llmOptions, max_tokens: FINAL_ANSWER_MAX_TOKENS });
return {
answer: finalAnswer,
iterations: currentIteration,
finalContext: accumulatedContext,
searchHistory: searchHistory
};
};
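Wiring it together, a call site might look like the following. The documents, their contents, and the model name are placeholders, and the in-memory store sketched earlier stands in for a real vector DB:

// Example invocation with placeholder documents and model name.
const main = async () => {
  const rawDocs = [
    { id: '1', content: 'FLARE triggers retrieval when the model is uncertain about upcoming content...' },
    { id: '2', content: 'Self-RAG teaches the model to critique its own retrievals and generations...' },
  ];
  const docs: Document[] = await Promise.all(
    rawDocs.map(async d => ({ ...d, embedding: await getEmbedding(d.content) }))
  );
  const vectorDB = createInMemoryVectorDB(docs);

  const result = await runRecursiveRAG(
    'How do FLARE and Self-RAG decide when to retrieve more context?',
    vectorDB,
    { model: 'gpt-4o-mini', temperature: 0.2 } // any chat model that supports JSON output
  );

  console.log(`Answer after ${result.iterations} iteration(s):\n${result.answer}`);
};

main().catch(console.error);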
Optimizations and Considerations
- Context Window Management: As accumulatedContext grows, we hit LLM token limits (a minimal truncation sketch follows this list).
  - Summarization: Before passing the context to the evaluation agent or final synthesis, summarize individual documents or groups of documents.
  - Prioritization/Re-ranking: Use a smaller LLM or a heuristic to re-rank documents based on their relevance to the original query and current knowledge gaps, discarding less relevant ones.
  - Windowing: Only pass the most recent N documents and a summary of older documents.
- LLM Choice: For the agentic decision-making, a smaller, faster model (e.g., gpt-3.5-turbo, or an equivalent from Agno/other providers) might be more cost-effective and performant than a large, expensive model, especially across multiple iterations. The final synthesis can then use a more capable model. This hints at a micro-MoE-like setup.
- Prompt Robustness: The getAgentDecisionPrompt is critical. Extensive testing and fine-tuning with few-shot examples (not included here for brevity) will be necessary to ensure the LLM consistently outputs the desired JSON format and makes sound decisions.
- Concurrency: Can follow-up queries be parallelized if the agent identifies multiple independent knowledge gaps? This would require a more sophisticated agent capable of decomposing the problem.
- Error Handling & Fallbacks: What if the embedding service fails? What if the LLM produces malformed JSON? Graceful degradation is crucial.
- Evaluation Metrics: How do we objectively measure if the "self-correction" is actually improving the answer quality compared to a static RAG? RAGAS, human evaluation, or specific fact-checking metrics are essential.
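As a concrete illustration of the windowing idea from the context-management bullet above, here is a minimal character-budget helper. The roughly-four-characters-per-token ratio and the budget value in the usage comment are assumptions; a real implementation would use a proper tokenizer and probably summarize rather than drop older documents.

// Keep the newest documents that fit within a rough character budget.
const fitContextToBudget = (context: string[], maxTokens: number): string[] => {
  if (context.length === 0) return [];
  const charBudget = maxTokens * 4; // crude ~4 chars/token heuristic (an assumption)
  const kept: string[] = [];
  let used = 0;
  // Walk from newest to oldest so the most recent retrievals survive truncation.
  for (let i = context.length - 1; i >= 0; i--) {
    const doc = context[i];
    if (used + doc.length > charBudget) break;
    kept.unshift(doc); // preserve original ordering among kept documents
    used += doc.length;
  }
  // If even the newest document blows the budget, clip it rather than return nothing.
  return kept.length > 0 ? kept : [context[context.length - 1].slice(0, charBudget)];
};

// Example (budget value is an assumption):
// accumulatedContext = fitContextToBudget(accumulatedContext, 8000);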
What I Learned & The Road Ahead
Building this recursive RAG system was a profound lesson in moving beyond static, linear pipelines. It's exhilarating to see an LLM not just generate text, but reason about its own information state and actively pursue missing knowledge.
- LLMs as Orchestrators: The shift from viewing LLMs purely as text generators to intelligent control planes is powerful. They can manage complex workflows, identify strategic next steps, and adapt dynamically. This is a critical step towards more autonomous AI systems.
- The True Cost of Intelligence: More LLM calls mean higher latency and increased cost. Performance optimization (model choice, prompt efficiency, context management) becomes paramount. This is where my competitive programming mindset kicks in – every token, every API call counts.
- The MoE Connection: This agentic loop is a primitive form of a Mixture of Experts. The "expert" here is the recursive RAG agent itself. In the future, I envision specialized agents (different prompts, different models, different retrieval strategies) that could be called upon based on the type of knowledge gap identified (e.g., a "fact-checker expert," a "summarizer expert," a "deep-dive expert"). This modularity is key to building truly brain-like AI.
- The "Mind" of the System: The
accumulatedSearchHistoryand thecritiquefrom the LLM represent a rudimentary form of the system's "thought process." Exposing and leveraging this internal monologue can lead to even more intelligent behaviors and debuggability.
The journey to building AI systems that mimic human-level understanding and problem-solving is long. But by embracing agentic, recursive, and self-correcting architectures, we're taking significant strides. This isn't just about better answers; it's about building smarter, more resilient, and ultimately, more "aware" digital intelligences. And we'll do it with clean code, direct APIs, and a relentless focus on performance.