Symfony DOCS RAG experiments. PART 2
Author: Illia Vasylevskyi
In https://blog.ineersa.com/post/symfony-docs-rag-experiments-part-1 we implemented a simple RAG with embeddings and vector search.
Since this project is more about exploration and ideas, I went on to explore https://github.com/VectifyAI/PageIndex.
Building a tree
Since we have the toctree directive in our RST files, we don't need to use their approach to build a tree.
Basically, they parse the document and ask a model to generate a TOC for it, but we already have that structure.
1. Scan Toctrees (File Hierarchy)
The script first scans all RST files to understand how they relate to each other.
- It looks for `.. toctree::` directives.
- It parses the file paths listed under these directives.
- It builds a `parent_map` where `child_file -> parent_file` (sketched below).
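In essence this is a simple directive scan. A rough sketch of the idea, not the project's actual code (it assumes 4-space indentation for toctree entries and ignores glob patterns and options):

```python
import re
from pathlib import Path

TOCTREE_RE = re.compile(r"^\.\. toctree::")

def build_parent_map(docs_root: Path) -> dict[str, str]:
    """Map each document listed in a toctree to the file that lists it."""
    parent_map: dict[str, str] = {}
    for rst_file in docs_root.rglob("*.rst"):
        parent = str(rst_file.relative_to(docs_root).with_suffix(""))
        in_toctree = False
        for line in rst_file.read_text(encoding="utf-8").splitlines():
            if TOCTREE_RE.match(line):
                in_toctree = True
                continue
            if not in_toctree:
                continue
            if not line.strip():
                continue                    # blank lines are allowed inside the block
            if not line.startswith("    "):
                in_toctree = False          # a dedent ends the toctree block
                continue
            entry = line.strip()
            if entry.startswith(":"):
                continue                    # options such as :maxdepth: or :hidden:
            parent_map[entry] = parent      # e.g. "setup/flex" -> "setup"
    return parent_map
```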
2. Parse RST Files (Internal Structure)
Each RST file is processed individually to extract its internal structure.
- Resolution: `resolve_includes` handles `.. include::` directives to create a single text stream.
- Docutils Parsing: The content is parsed into a Docutils document tree.
- Node Extraction:
  - File Node: The root node for the file.
  - Top Node: Represents content at the top of the file before the first section title.
  - Section Nodes: Formed by RST headers. These are nested recursively based on header levels.
- Content: Text is cleaned and extracted for each section. Line numbers are recorded.
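The Docutils part boils down to something like this minimal sketch (the real parser also builds the file/top nodes and nests sections by level; error levels are raised because plain Docutils doesn't know Sphinx-only directives):

```python
from docutils import nodes
from docutils.core import publish_doctree

def extract_sections(rst_text: str) -> list[dict]:
    """Parse RST and collect (title, text, line) for every section."""
    doctree = publish_doctree(
        rst_text,
        # silence errors from Sphinx-specific directives (toctree, ...)
        settings_overrides={"report_level": 5, "halt_level": 5},
    )
    sections = []
    for section in doctree.findall(nodes.section):   # docutils >= 0.18
        title = section.next_node(nodes.title)
        sections.append({
            "title": title.astext() if title else "",
            "text": section.astext(),
            "line": section.line,                    # may be None
        })
    return sections
```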
3. Attach File Hierarchy
The individual file trees are assembled into a global tree using the parent_map from Step 1.
- A Synthetic Root ("Symfony Docs") is created.
- If File A is a parent of File B (via toctree), File B's root node becomes a child of File A's root node.
- Files without parents are attached directly to the Synthetic Root.
- Result: a nested structure `Root -> File -> [Sections, Sub-Files]`.
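Conceptually the assembly is just re-parenting dictionaries. A minimal sketch, assuming a simple `{"id", "title", "children"}` node shape:

```python
def attach_file_hierarchy(file_trees: dict[str, dict], parent_map: dict[str, str]) -> dict:
    """Hang each file's local tree under its toctree parent."""
    root = {"id": "pageindex_root", "title": "Symfony Docs", "children": []}
    for path, node in file_trees.items():
        parent_path = parent_map.get(path)
        if parent_path and parent_path in file_trees:
            file_trees[parent_path]["children"].append(node)   # File B under File A
        else:
            root["children"].append(node)                      # orphan -> synthetic root
    return root
```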
4. Summarization
Nodes are enriched with summaries to assist in retrieval.
- LLM Mode: Sends text chunks to an LLM to generate concise summaries.
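The summarization call is roughly shaped like this (the client and model name here are assumptions, not necessarily what the project uses):

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works

def summarize(text: str, max_chars: int = 6000) -> str:
    """Ask the model for a 1-2 sentence summary of a node's text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever model you have configured
        messages=[
            {"role": "system", "content": "Summarize this documentation excerpt in 1-2 sentences."},
            {"role": "user", "content": text[:max_chars]},
        ],
    )
    return response.choices[0].message.content.strip()
```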
5. Flattening & Output
- The tree is traversed to create a flat list of nodes, linking children to parents via `parent_id`.
- The full nested tree is saved to `tree.json`.
- The flat list is saved to `nodes.jsonl`.
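Flattening is a depth-first walk; the field names below are illustrative rather than the exact schema:

```python
import json

def flatten(node: dict, parent_id=None, out=None) -> list[dict]:
    """Depth-first walk producing flat, parent-linked records."""
    out = [] if out is None else out
    out.append({
        "id": node["id"],
        "parent_id": parent_id,
        "title": node.get("title", ""),
        "summary": node.get("summary", ""),
    })
    for child in node.get("children", []):
        flatten(child, node["id"], out)
    return out

def save(tree: dict) -> None:
    with open("tree.json", "w", encoding="utf-8") as f:
        json.dump(tree, f, ensure_ascii=False, indent=2)            # full nested tree
    with open("nodes.jsonl", "w", encoding="utf-8") as f:
        for record in flatten(tree):
            f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one node per line
```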
Logic Diagram
[ Start ]
|
v
[ Scan RST Files for Toctrees ]
|
v
[ Build Parent Map (Child -> Parent) ]
|
v
[ Parse Each RST File ]
|
+---> [ Resolve Includes ]
|
+---> [ Docutils Parse ]
|
+---> [ Extract Sections & Content ]
|
+---> [ Create Local File Tree ]
|
v
[ Attach File Hierarchy ]
|
v
[ Create Synthetic Root ]
|
v
[ Enrich Summaries (Heuristic or LLM) ]
|
v
[ Flatten Tree ]
|
+---> [ Save tree.json ]
|
+---> [ Save nodes.jsonl ]
|
v
[ End ]
Retrieval Part
For the retrieval part's hybrid scoring we implemented a poor man's BM25-style lexical scorer.
It combines hierarchical exploration (Beam Search) with a global lexical safety net.
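The lexical side is roughly a hand-rolled BM25; the tokenization and parameters below are assumptions, and the project also appears to normalize scores (given the 0.35 threshold used later), which this sketch skips:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())

class LexicalScorer:
    """Toy BM25-style scorer over the node texts."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.n = len(docs)
        self.avgdl = sum(len(t) for t in self.doc_tokens) / max(self.n, 1)
        self.df = Counter()
        for toks in self.doc_tokens:
            self.df.update(set(toks))   # document frequency per term

    def score(self, query: str, doc_index: int) -> float:
        toks = self.doc_tokens[doc_index]
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (self.n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            denom = tf[term] + self.k1 * (1 - self.b + self.b * len(toks) / self.avgdl)
            score += idf * tf[term] * (self.k1 + 1) / denom
        return score
```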
Phase A: Tree Traversal (Beam Search)
Starting from the synthetic root (`pageindex_root`):
- Expand: Get children of the current frontier nodes.
- Filter:
  - Standard Mode: The LLM selects the `beam_width` most relevant children given the query (`_llm_pick_children`).
  - LLM-Final-Only Mode (`--llm-final-only`): Children are selected purely by `lexical_score` to save LLM calls.
- Loop: Repeat for `max_depth` iterations or until no children remain.
- Collect: All nodes visited during traversal are added to a `visited` set (see the sketch after this list).
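Stripped down, Phase A looks like this sketch (the real code tracks more state; `pick_children` stands in for either `_llm_pick_children` or the lexical-only selection):

```python
def beam_search(root: dict, query: str, pick_children,
                beam_width: int = 4, max_depth: int = 4) -> set[str]:
    """Hierarchical traversal: at each level keep only the best `beam_width` children."""
    frontier = [root]
    visited: set[str] = set()
    for _ in range(max_depth):
        children = [c for node in frontier for c in node.get("children", [])]
        if not children:
            break                       # nothing left to expand
        frontier = pick_children(query, children, beam_width)
        visited.update(node["id"] for node in frontier)
    return visited
```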
Phase B: Candidate Pool Construction
Relying strictly on tree traversal can miss relevant nodes if a high-level parent is misjudged. To mitigate this:
- Global Lexical Search: The top 80 nodes from the entire index (scored lexically) are retrieved.
- Merge: The `visited` nodes from the tree traversal are merged with these top 80 global matches.
- Deduplicate: Creates a unified `candidates` pool.
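The merge itself is just an order-preserving deduplication, roughly:

```python
def build_candidate_pool(visited_ids: set[str], global_top: list[dict],
                         nodes_by_id: dict[str, dict]) -> list[dict]:
    """Union of tree-traversal hits and the global lexical top-80, deduplicated by id."""
    ordered_ids = list(dict.fromkeys([*visited_ids, *(n["id"] for n in global_top)]))
    return [nodes_by_id[i] for i in ordered_ids]
```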
Phase C: Final LLM Reranking
- Pre-rank: The candidate pool is sorted lexically to pick the top `final_candidate_limit` (default 24) items.
- LLM Rank: The LLM is asked to rank these final candidates based on relevance to the query (`_llm_rank_candidates`).
  - The LLM is given summaries and text excerpts.
  - It returns a JSON list of IDs in best-first order.
- Fallback/Safety:
  - If the LLM omits relevant items, high-scoring lexical matches (score >= 0.35) are appended to fill `top_k`.
  - If the LLM fails, it falls back to lexical ranking.
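Phase C then has roughly this shape (a sketch; the real `_llm_rank_candidates` builds a prompt from summaries and excerpts and parses the model's JSON answer):

```python
def final_rank(query: str, candidates: list[dict], lexical_score, llm_rank,
               top_k: int = 5, final_candidate_limit: int = 24) -> list[dict]:
    """Pre-rank lexically, let the LLM reorder, then back-fill / fall back."""
    pool = sorted(candidates, key=lambda n: lexical_score(query, n), reverse=True)
    pool = pool[:final_candidate_limit]
    try:
        ranked_ids = llm_rank(query, pool)                 # best-first list of node ids
        hits = [n for i in ranked_ids for n in pool if n["id"] == i]
    except Exception:
        hits = list(pool)                                  # LLM failure: keep lexical order
    # Safety net: back-fill with strong lexical matches the LLM may have dropped.
    for node in pool:
        if len(hits) >= top_k:
            break
        if node not in hits and lexical_score(query, node) >= 0.35:
            hits.append(node)
    return hits[:top_k]
```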
Logic Diagram
[ Start ]
|
v
< LLM Enabled? >
| |
| No | Yes
| v
| [ Start at Root ]
| |
| v
| [ Expand Children ] <---------------+
| | |
| v |
| < Pick Children > |
| (LLM or Lexical only) |
| | |
| v |
| [ Update Frontier ] |
| | |
| v |
| < Max Depth? > -------- No ---------+
| |
| | Yes
| v
| [ Collect Visited ]
| |
| v
| [ Merge Candidates ] <--- [ Top 80 Global Lexical ]
| |
| v
| [ Pre-rank by Lexical ]
| |
| v
| [ Take Top N Candidates ]
| |
| v
| [ LLM Final Rerank ]
| |
| v
| [ Fill/Fallback (Lexical > 0.35) ]
| |
v v
[ Return Top K Hits ]
Results
Most interesting part =)
Benchmark Scores
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Mode ┃ Hit@1 ┃ Hit@5 ┃ Count ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ strict │ 60/100 (60.0%) │ 72/100 (72.0%) │ 100 │
│ relaxed │ 64/100 (64.0%) │ 76/100 (76.0%) │ 100 │
└─────────┴────────────────┴────────────────┴───────┘
It actually performs very well; I tested it on some very tricky questions and it handles them really well. But it's slow: with a 4-level hierarchy in the tree it makes 5 LLM requests, each with quite a large context.
So while this works, the speed makes it pretty much unusable for real-life work. This naive tree traversal approach should be reserved for deep research, where speed doesn't matter and you need the highest quality.
