Symfony DOCS RAG experiments. PART 2
Author: Illia Vasylevskyi
In https://blog.ineersa.com/post/symfony-docs-rag-experiments-part-1 we implemented a simple RAG with embeddings and vector search.
Since this project is more about exploration and ideas, I went on to explore https://github.com/VectifyAI/PageIndex.
Building a tree
Since we have the toctree directive in our RST files, we don't need to use their approach to build a tree.
Basically, they parse the document and ask a model to generate a TOC for it, but we already have that structure.
1. Scan Toctrees (File Hierarchy)
The script first scans all RST files to understand how they relate to each other.
- It looks for `.. toctree::` directives.
- It parses the file paths listed under these directives.
- It builds a `parent_map` where `child_file -> parent_file` (sketched below).
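In essence this is a simple directive scan. A rough sketch of the idea, not the project's actual code (it assumes 4-space indentation for toctree entries and ignores glob patterns and options):

```python
import re
from pathlib import Path

TOCTREE_RE = re.compile(r"^\.\. toctree::")

def build_parent_map(docs_root: Path) -> dict[str, str]:
    """Map each document listed in a toctree to the file that lists it."""
    parent_map: dict[str, str] = {}
    for rst_file in docs_root.rglob("*.rst"):
        parent = str(rst_file.relative_to(docs_root).with_suffix(""))
        in_toctree = False
        for line in rst_file.read_text(encoding="utf-8").splitlines():
            if TOCTREE_RE.match(line):
                in_toctree = True
                continue
            if not in_toctree:
                continue
            if not line.strip():
                continue                    # blank lines are allowed inside the block
            if not line.startswith("    "):
                in_toctree = False          # a dedent ends the toctree block
                continue
            entry = line.strip()
            if entry.startswith(":"):
                continue                    # options such as :maxdepth: or :hidden:
            parent_map[entry] = parent      # e.g. "setup/flex" -> "setup"
    return parent_map
```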
2. Parse RST Files (Internal Structure)
Each RST file is processed individually to extract its internal structure.
- Resolution: `resolve_includes` handles `.. include::` directives to create a single text stream.
- Docutils Parsing: The content is parsed into a Docutils document tree.
- Node Extraction:
  - File Node: The root node for the file.
  - Top Node: Represents content at the top of the file before the first section title.
  - Section Nodes: Formed by RST headers. These are nested recursively based on header levels.
- Content: Text is cleaned and extracted for each section. Line numbers are recorded.
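The Docutils part boils down to something like this minimal sketch (the real parser also builds the file/top nodes and nests sections by level; error levels are raised because plain Docutils doesn't know Sphinx-only directives):

```python
from docutils import nodes
from docutils.core import publish_doctree

def extract_sections(rst_text: str) -> list[dict]:
    """Parse RST and collect (title, text, line) for every section."""
    doctree = publish_doctree(
        rst_text,
        # silence errors from Sphinx-specific directives (toctree, ...)
        settings_overrides={"report_level": 5, "halt_level": 5},
    )
    sections = []
    for section in doctree.findall(nodes.section):   # docutils >= 0.18
        title = section.next_node(nodes.title)
        sections.append({
            "title": title.astext() if title else "",
            "text": section.astext(),
            "line": section.line,                    # may be None
        })
    return sections
```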
3. Attach File Hierarchy
The individual file trees are assembled into a global tree using the parent_map from Step 1.
- A Synthetic Root ("Symfony Docs") is created.
- If File A is a parent of File B (via toctree), File B's root node becomes a child of File A's root node.
- Files without parents are attached directly to the Synthetic Root.
- Result: a nested structure `Root -> File -> [Sections, Sub-Files]`.
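Conceptually the assembly is just re-parenting dictionaries. A minimal sketch, assuming a simple `{"id", "title", "children"}` node shape:

```python
def attach_file_hierarchy(file_trees: dict[str, dict], parent_map: dict[str, str]) -> dict:
    """Hang each file's local tree under its toctree parent."""
    root = {"id": "pageindex_root", "title": "Symfony Docs", "children": []}
    for path, node in file_trees.items():
        parent_path = parent_map.get(path)
        if parent_path and parent_path in file_trees:
            file_trees[parent_path]["children"].append(node)   # File B under File A
        else:
            root["children"].append(node)                      # orphan -> synthetic root
    return root
```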
4. Summarization
Nodes are enriched with summaries to assist in retrieval.
- LLM Mode: Sends text chunks to an LLM to generate concise summaries.
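The summarization call is roughly shaped like this (the client and model name here are assumptions, not necessarily what the project uses):

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works

def summarize(text: str, max_chars: int = 6000) -> str:
    """Ask the model for a 1-2 sentence summary of a node's text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever model you have configured
        messages=[
            {"role": "system", "content": "Summarize this documentation excerpt in 1-2 sentences."},
            {"role": "user", "content": text[:max_chars]},
        ],
    )
    return response.choices[0].message.content.strip()
```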
5. Flattening & Output
- The tree is traversed to create a flat list of nodes, linking children to parents via `parent_id`.
- The full nested tree is saved to `tree.json`.
- The flat list is saved to `nodes.jsonl`.
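Flattening is a depth-first walk; the field names below are illustrative rather than the exact schema:

```python
import json

def flatten(node: dict, parent_id=None, out=None) -> list[dict]:
    """Depth-first walk producing flat, parent-linked records."""
    out = [] if out is None else out
    out.append({
        "id": node["id"],
        "parent_id": parent_id,
        "title": node.get("title", ""),
        "summary": node.get("summary", ""),
    })
    for child in node.get("children", []):
        flatten(child, node["id"], out)
    return out

def save(tree: dict) -> None:
    with open("tree.json", "w", encoding="utf-8") as f:
        json.dump(tree, f, ensure_ascii=False, indent=2)            # full nested tree
    with open("nodes.jsonl", "w", encoding="utf-8") as f:
        for record in flatten(tree):
            f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one node per line
```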
Logic Diagram
[ Start ]
|
v
[ Scan RST Files for Toctrees ]
|
v
[ Build Parent Map (Child -> Parent) ]
|
v
[ Parse Each RST File ]
|
+---> [ Resolve Includes ]
|
+---> [ Docutils Parse ]
|
+---> [ Extract Sections & Content ]
|
+---> [ Create Local File Tree ]
|
v
[ Attach File Hierarchy ]
|
v
[ Create Synthetic Root ]
|
v
[ Enrich Summaries (Heuristic or LLM) ]
|
v
[ Flatten Tree ]
|
+---> [ Save tree.json ]
|
+---> [ Save nodes.jsonl ]
|
v
[ End ]
Retrieval Part
For the retrieval part's hybrid scoring we implemented a poor man's BM25-style lexical scorer.
It combines hierarchical exploration (Beam Search) with a global lexical safety net.
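The lexical side is roughly a hand-rolled BM25; the tokenization and parameters below are assumptions, and the project also appears to normalize scores (given the 0.35 threshold used later), which this sketch skips:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())

class LexicalScorer:
    """Toy BM25-style scorer over the node texts."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.n = len(docs)
        self.avgdl = sum(len(t) for t in self.doc_tokens) / max(self.n, 1)
        self.df = Counter()
        for toks in self.doc_tokens:
            self.df.update(set(toks))   # document frequency per term

    def score(self, query: str, doc_index: int) -> float:
        toks = self.doc_tokens[doc_index]
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (self.n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            denom = tf[term] + self.k1 * (1 - self.b + self.b * len(toks) / self.avgdl)
            score += idf * tf[term] * (self.k1 + 1) / denom
        return score
```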
Phase A: Tree Traversal (Beam Search)
Starting from the synthetic root (`pageindex_root`):
- Expand: Get children of the current frontier nodes.
- Filter:
  - Standard Mode: The LLM selects the `beam_width` most relevant children given the query (`_llm_pick_children`).
  - LLM-Final-Only Mode (`--llm-final-only`): Children are selected purely by `lexical_score` to save LLM calls.
- Loop: Repeat for `max_depth` iterations or until no children remain.
- Collect: All nodes visited during traversal are added to a `visited` set (see the sketch after this list).
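Stripped down, Phase A looks like this sketch (the real code tracks more state; `pick_children` stands in for either `_llm_pick_children` or the lexical-only selection):

```python
def beam_search(root: dict, query: str, pick_children,
                beam_width: int = 4, max_depth: int = 4) -> set[str]:
    """Hierarchical traversal: at each level keep only the best `beam_width` children."""
    frontier = [root]
    visited: set[str] = set()
    for _ in range(max_depth):
        children = [c for node in frontier for c in node.get("children", [])]
        if not children:
            break                       # nothing left to expand
        frontier = pick_children(query, children, beam_width)
        visited.update(node["id"] for node in frontier)
    return visited
```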
Phase B: Candidate Pool Construction
Relying strictly on tree traversal can miss relevant nodes if a high-level parent is misjudged. To mitigate this:
- Global Lexical Search: The top 80 nodes from the entire index (scored lexically) are retrieved.
- Merge: The `visited` nodes from the tree traversal are merged with these top 80 global matches.
- Deduplicate: Creates a unified `candidates` pool.
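The merge itself is just an order-preserving deduplication, roughly:

```python
def build_candidate_pool(visited_ids: set[str], global_top: list[dict],
                         nodes_by_id: dict[str, dict]) -> list[dict]:
    """Union of tree-traversal hits and the global lexical top-80, deduplicated by id."""
    ordered_ids = list(dict.fromkeys([*visited_ids, *(n["id"] for n in global_top)]))
    return [nodes_by_id[i] for i in ordered_ids]
```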
Phase C: Final LLM Reranking
- Pre-rank: The candidate pool is sorted lexically to pick the top `final_candidate_limit` (default 24) items.
- LLM Rank: The LLM is asked to rank these final candidates based on relevance to the query (`_llm_rank_candidates`).
  - The LLM is given summaries and text excerpts.
  - It returns a JSON list of IDs in best-first order.
- Fallback/Safety:
  - If the LLM omits relevant items, high-scoring lexical matches (score >= 0.35) are appended to fill `top_k`.
  - If the LLM fails, it falls back to lexical ranking.
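Phase C then has roughly this shape (a sketch; the real `_llm_rank_candidates` builds a prompt from summaries and excerpts and parses the model's JSON answer):

```python
def final_rank(query: str, candidates: list[dict], lexical_score, llm_rank,
               top_k: int = 5, final_candidate_limit: int = 24) -> list[dict]:
    """Pre-rank lexically, let the LLM reorder, then back-fill / fall back."""
    pool = sorted(candidates, key=lambda n: lexical_score(query, n), reverse=True)
    pool = pool[:final_candidate_limit]
    try:
        ranked_ids = llm_rank(query, pool)                 # best-first list of node ids
        hits = [n for i in ranked_ids for n in pool if n["id"] == i]
    except Exception:
        hits = list(pool)                                  # LLM failure: keep lexical order
    # Safety net: back-fill with strong lexical matches the LLM may have dropped.
    for node in pool:
        if len(hits) >= top_k:
            break
        if node not in hits and lexical_score(query, node) >= 0.35:
            hits.append(node)
    return hits[:top_k]
```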
Logic Diagram
[ Start ]
|
v
< LLM Enabled? >
| |
| No | Yes
| v
| [ Start at Root ]
| |
| v
| [ Expand Children ] <---------------+
| | |
| v |
| < Pick Children > |
| (LLM or Lexical only) |
| | |
| v |
| [ Update Frontier ] |
| | |
| v |
| < Max Depth? > -------- No ---------+
| |
| | Yes
| v
| [ Collect Visited ]
| |
| v
| [ Merge Candidates ] <--- [ Top 80 Global Lexical ]
| |
| v
| [ Pre-rank by Lexical ]
| |
| v
| [ Take Top N Candidates ]
| |
| v
| [ LLM Final Rerank ]
| |
| v
| [ Fill/Fallback (Lexical > 0.35) ]
| |
v v
[ Return Top K Hits ]
Results
Most interesting part =)
Benchmark Scores
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Mode ┃ Hit@1 ┃ Hit@5 ┃ Count ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ strict │ 60/100 (60.0%) │ 72/100 (72.0%) │ 100 │
│ relaxed │ 64/100 (64.0%) │ 76/100 (76.0%) │ 100 │
└─────────┴────────────────┴────────────────┴───────┘
It actually performs very well; I tested it on some very tricky questions and it handles them really well. But it's slow: with a 4-level hierarchy in the tree it makes 5 LLM requests, each with quite a large context.
So while this works, the speed makes it pretty much unusable for real-life work. This naive tree traversal approach should be reserved for deep research, where speed doesn't matter and you need the highest quality.
