How Dhi Understands Your Entire Codebase: Multi-Language Tree-sitter, BM25, and Hybrid Search
In Post 1 we built a working FIM autocomplete engine: Tree-sitter chunking for Python and TypeScript, nomic-embed-text embeddings in Chroma, and StarCoder2-3B generating ghost text in VS Code. The engine works โ but only for two languages, and only if the extension sends files one by one.
Today we fix both limits.
By the end of this post, Dhi understands six languages, uses a hybrid lexical + semantic search to retrieve context, and can index an entire workspace in one request via a new /index-dir endpoint. The search quality difference is significant: BM25 catches exact function names and identifiers that vector search misses; vector search catches semantic similarity that keyword search misses. Together they catch both.
Everything is at github.com/sochaty/dhi, tag post-2.
What Changed Between Post 1 and Post 2
| Component | Post 1 | Post 2 |
| Languages | Python, TypeScript | + Go, Rust, Java, JavaScript |
| Search | Vector only (Chroma cosine) | Hybrid: vector + BM25, merged with RRF |
| Indexing | One file per /index request | Entire workspace via /index-dir |
.gitignore | Ignored | Respected (via pathspec) |
| New endpoints | โ | POST /index-dir, POST /search |
Part 1: Expanding Tree-sitter to Six Languages
The Post 1 chunker loaded grammars eagerly at import time:
# Post 1 โ fails at import if a grammar package is missing
import tree_sitter_python as tspython
PY_LANGUAGE = Language(tspython.language())
That approach breaks when you add optional grammars. If a user's environment doesn't have tree-sitter-rust installed, the entire server crashes at startup. Post 2 uses lazy loading with a per-language cache:
_LANGUAGE_CACHE: dict[str, Language | None] = {}
def _get_language(name: str) -> Language | None:
if name not in _LANGUAGE_CACHE:
_LANGUAGE_CACHE[name] = _load_grammar(name)
return _LANGUAGE_CACHE[name]
def _load_grammar(name: str) -> Language | None:
try:
if name == "python":
import tree_sitter_python as m
return Language(m.language())
if name == "typescript":
import tree_sitter_typescript as m
return Language(m.language_typescript())
if name == "javascript":
import tree_sitter_javascript as m
return Language(m.language())
if name == "go":
import tree_sitter_go as m
return Language(m.language())
if name == "rust":
import tree_sitter_rust as m
return Language(m.language())
if name == "java":
import tree_sitter_java as m
return Language(m.language())
except (ImportError, AttributeError):
return None
return None
_load_grammar returns None if the package is missing. The chunker skips that language gracefully instead of crashing. The first call for each language pays the import cost; subsequent calls are instant (dict lookup).
Node Types Per Language
Each language has different AST node types for the same semantic concepts. Here is the full mapping:
CHUNK_NODE_TYPES: dict[str, set[str]] = {
"python": {
"import_statement", "import_from_statement",
"function_definition", "class_definition",
},
"typescript": {
"import_declaration", "import_statement",
"function_declaration", "arrow_function",
"class_declaration", "method_definition",
"interface_declaration", "type_alias_declaration",
"export_statement",
},
"javascript": {
"import_declaration", "import_statement",
"function_declaration", "arrow_function",
"class_declaration", "method_definition",
"export_statement",
},
"go": {
"function_declaration", "method_declaration",
"type_declaration", "import_declaration",
"const_declaration",
},
"rust": {
"function_item", "impl_item", "struct_item",
"enum_item", "use_declaration",
"trait_item", "mod_item",
},
"java": {
"class_declaration", "interface_declaration",
"method_declaration", "import_declaration",
"enum_declaration",
},
}
One non-obvious thing: Go uses function_declaration for package-level functions and method_declaration for methods on types โ these are separate node types, unlike Python where all def nodes are function_definition. Rust uses function_item and impl_item โ an impl block is not unwrapped into its methods, it is kept as a single chunk so the self type context is preserved.
File Extension Mapping
LANGUAGE_BY_SUFFIX: dict[str, str] = {
".py": "python",
".ts": "typescript",
".tsx": "typescript",
".js": "javascript",
".jsx": "javascript",
".go": "go",
".rs": "rust",
".java": "java",
}
def detect_language(path: str) -> str | None:
suffix = Path(path).suffix.lower()
return LANGUAGE_BY_SUFFIX.get(suffix)
detect_language is what iter_source_files uses to decide whether a file is worth indexing โ if the extension isn't in this table, the file is skipped.
The Chunking Pipeline
%%{init: {'theme': 'dark'}}%%
flowchart LR
FILE["Source File\n.py .ts .js .go .rs .java"]
LANG["detect_language()\nextension โ language name"]
GRAM["_get_language()\nlazy-load grammar\n(cached after first call)"]
PARSE["tree_sitter.Parser\nโ Concrete Syntax Tree"]
WALK["_parse_and_walk()\ndepth-first traversal"]
FILTER["CHUNK_NODE_TYPES filter\nskip nested nodes"]
CHUNK["Chunk dataclass\ntext ยท file_path ยท start/end line\nlanguage ยท node_type"]
FILE --> LANG --> GRAM --> PARSE --> WALK --> FILTER --> CHUNK
The walker visits every node in the CST. When it hits a node whose type is in CHUNK_NODE_TYPES[language], it emits a chunk and does not recurse into that node's children. This prevents double-chunking โ a method inside a class becomes part of the class chunk, not a separate chunk.
Part 2: Workspace-Level File Discovery
Post 1 required the VS Code extension to iterate over open files and send each one to /index individually. That works for active editing but can't index a 10,000-file repository on startup.
Post 2 adds server/rag/indexer.py โ a file-system walker that produces paths suitable for chunking, with two important guardrails:
- Skip-list pruning:
node_modules,.git,__pycache__,.venv,dist,build,target, and other directories that are never worth indexing are pruned fromos.walkin-place. .gitignoreawareness: whenpathspecis installed, files matching the root.gitignoreare filtered out.
_SKIP_DIRS: frozenset[str] = frozenset({
".git", ".hg", ".svn", "__pycache__", "node_modules",
".venv", "venv", "env", "dist", "build", "target",
".next", ".nuxt", "out", "vendor", ".cache",
"coverage", ".pytest_cache", ".mypy_cache", ".ruff_cache",
})
def iter_source_files(
root: str, *, respect_gitignore: bool = True,
) -> Iterator[str]:
root_path = Path(root).resolve()
if not root_path.is_dir():
raise ValueError(f"directory not found: {root}")
spec = _load_gitignore(root_path) if respect_gitignore else None
for dirpath, dirnames, filenames in os.walk(root_path, topdown=True):
# Prune in-place โ controls os.walk recursion
dirnames[:] = [
d for d in dirnames
if d not in _SKIP_DIRS and not d.startswith(".")
]
for filename in filenames:
abs_path = Path(dirpath) / filename
if spec is not None:
rel = abs_path.relative_to(root_path)
if spec.match_file(str(rel)):
continue
if detect_language(str(abs_path)) is not None:
yield str(abs_path)
The in-place dirnames[:] mutation is the key. os.walk with topdown=True respects in-place modifications to dirnames โ setting it to an empty list stops recursion into that directory entirely. This is the correct way to prune os.walk; deleting entries after the walk starts is too late.
.gitignore Loading
def _load_gitignore(root: Path) -> object | None:
gi_file = root / ".gitignore"
if not gi_file.is_file():
return None
try:
import pathspec
with gi_file.open() as fh:
return pathspec.PathSpec.from_lines("gitwildmatch", fh)
except ImportError:
return None
pathspec is an optional dependency. If it is not installed, .gitignore filtering is silently skipped โ the indexer still works, it just indexes a few more files than necessary. This is the right tradeoff: a missing optional dependency should degrade gracefully, not crash.
Part 3: Hybrid Search with Reciprocal Rank Fusion
Vector search is excellent at semantic similarity. If you have save_to_disk() in your codebase and query persist data, vector search will find it. But if you query the exact function name save_to_disk, keyword search (BM25) will rank it higher โ because BM25 is an exact lexical match.
Real developer queries are a mix: sometimes you remember a function name, sometimes you describe what you need semantically. The optimal retrieval system uses both.
BM25 Okapi
BM25 (Best Match 25) is a probabilistic keyword ranking function from 1994 that still beats many neural retrievers on exact-match tasks. The scoring formula for a document D against query Q:
score(D, Q) = ฮฃ_i IDF(q_i) ยท (tf(q_i, D) ยท (kโ + 1)) / (tf(q_i, D) + kโ ยท (1 - b + b ยท |D|/avgdl))
Where:
IDF(q_i)โ inverse document frequency: rare terms score highertf(q_i, D)โ term frequency in the documentkโ = 1.5โ term frequency saturation (diminishing returns after kโ occurrences)b = 0.75โ length normalization (long documents don't dominate)avgdlโ average document length in the corpus
The implementation uses rank_bm25.BM25Okapi. The corpus is maintained in memory as a dict of chunk_id โ text:
def _tokenize(text: str) -> list[str]:
return text.lower().split() # simple โ language-agnostic
class ChunkStore:
def __init__(self, ...) -> None:
...
self._bm25_corpus: dict[str, str] = {}
self._bm25: Any = None # BM25Okapi | None
self._init_bm25() # seed from Chroma on startup
def _rebuild_bm25(self) -> None:
if not self._bm25_corpus:
self._bm25 = None
return
from rank_bm25 import BM25Okapi
corpus = list(self._bm25_corpus.values())
self._bm25 = BM25Okapi([_tokenize(t) for t in corpus])
self._bm25_ids = list(self._bm25_corpus.keys())
def bm25_query(self, text: str, n_results: int = 5) -> list[str]:
if self._bm25 is None:
return []
scores = self._bm25.get_scores(_tokenize(text))
ranked = sorted(
zip(self._bm25_ids, scores),
key=lambda x: x[1],
reverse=True,
)
return [
self._bm25_corpus[cid]
for cid, score in ranked[:n_results]
if score > 0 and cid in self._bm25_corpus
]
One important implementation note: _init_bm25 seeds the in-memory corpus from Chroma when the server starts. This means that even after a server restart, the BM25 index is recovered from the persistent Chroma store โ you don't lose lexical search on restart.
Reciprocal Rank Fusion
The problem: vector search returns a ranked list and BM25 returns a different ranked list. How do you combine them into one ranking without knowing the absolute relevance scores?
Reciprocal Rank Fusion (RRF) is the answer. It was introduced in a 2009 SIGIR paper and remains the de facto standard for combining ranked lists in hybrid search. The formula:
score(doc) = ฮฃ_r 1 / (k + rank_r(doc) + 1)
Where k = 60 (the original paper's default), rank_r(doc) is the 0-indexed rank of the document in result list r, and the sum is over all result lists.
The key insight: rank matters, absolute scores don't. A document ranked 1st in both lists scores 1/61 + 1/61 โ 0.033. A document ranked 1st in one list and absent from the other scores 1/61 โ 0.016. The k constant dampens the advantage of being ranked #1 vs #2 in a single list.
def _rrf_merge(
vector_hits: list[str],
bm25_hits: list[str],
n: int,
k: int = 60,
) -> list[str]:
scores: dict[str, float] = {}
for rank, doc in enumerate(vector_hits):
scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
for rank, doc in enumerate(bm25_hits):
scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
return sorted(scores, key=lambda d: scores[d], reverse=True)[:n]
The Full Hybrid Pipeline
%%{init: {'theme': 'dark'}}%%
flowchart TB
QUERY["Query text\n(last 200 chars of prefix)"]
subgraph vector["Vector search"]
EMBED["nomic-embed-text\n768-dim embedding"]
CHROMA["Chroma ANN\ncosine similarity\ntop-2N results"]
end
subgraph lexical["Lexical search"]
TOK["Whitespace tokenizer"]
BM25["BM25Okapi\nIDF ยท TF ยท length norm\ntop-2N results"]
end
RRF["RRF merge\nscore = ฮฃ 1/(k + rank + 1)\nk = 60"]
OUT["Top-N chunks\ninjected into FIM prefix"]
QUERY --> EMBED --> CHROMA --> RRF
QUERY --> TOK --> BM25 --> RRF
RRF --> OUT
def hybrid_query(self, text: str, n_results: int = 5) -> list[str]:
fetch = n_results * 2 # over-fetch from each source before merging
vector_hits = self._vector_query(text, fetch)
bm25_hits = self.bm25_query(text, fetch)
if not bm25_hits:
return vector_hits[:n_results] # graceful fallback
return _rrf_merge(vector_hits, bm25_hits, n_results)
The 2ร over-fetch is a standard trick: each retriever fetches 2N results instead of N, then RRF picks the best N from the union. This gives RRF enough candidates to work with โ if both retrievers returned exactly N, any documents that are only in one list would have no overlap to merge.
When Does Hybrid Outperform Vector-Only?
| Query type | Vector search | BM25 | Hybrid |
"parse_config" (exact name) | May miss if semantically distant | Exact match | โ |
"parse configuration file" (semantic) | Strong | May miss | โ |
"JWT token expiry" (mixed) | Good on "token expiry" | Good on "JWT" | โ |
"def foo" (syntax fragment) | Poor | Poor (too generic) | Poor |
The last row is important: neither approach handles syntax fragments well. This is expected โ the FIM context assembler uses the last 200 characters of prefix, not the raw def foo token.
Part 4: New API Endpoints
POST /index-dir
Index an entire workspace directory in one request. The endpoint walks the directory, reads each source file, chunks it with Tree-sitter, and upserts into the store. Files that fail to parse are logged and skipped โ a single malformed file cannot abort the whole indexing run.
class IndexDirRequest(BaseModel):
dir_path: str
respect_gitignore: bool = True
class IndexDirResponse(BaseModel):
indexed_files: int
indexed_chunks: int
@app.post("/index-dir", response_model=IndexDirResponse)
def index_dir_endpoint(req: IndexDirRequest) -> IndexDirResponse:
try:
total_files = 0
total_chunks = 0
for file_path in iter_source_files(
req.dir_path, respect_gitignore=req.respect_gitignore
):
try:
chunks = list(chunk_file(file_path))
if chunks:
store.upsert(chunks)
total_chunks += len(chunks)
total_files += 1
except Exception:
logging.exception("index_dir_endpoint: skipping %s", file_path)
return IndexDirResponse(indexed_files=total_files, indexed_chunks=total_chunks)
except Exception as exc:
raise HTTPException(status_code=500, detail=str(exc)) from exc
Typical usage from the extension on workspace open:
await dhiClient.indexDir(workspaceRoot, { respectGitignore: true });
This single call replaces the extension iterating over every file with vscode.workspace.findFiles. For a 500-file TypeScript monorepo, this takes about 4 seconds on an M2 MacBook Air โ versus 12+ seconds when the extension sends files one by one over HTTP.
POST /search
An explicit search endpoint for the chat panel and future agent context assembly. Supports three modes:
class SearchRequest(BaseModel):
query: str
n_results: int = 5
mode: str = "hybrid" # "hybrid" | "vector" | "bm25"
@app.post("/search", response_model=SearchResponse)
def search_endpoint(req: SearchRequest) -> SearchResponse:
n = max(1, min(req.n_results, 20))
if req.mode == "bm25":
results = store.bm25_query(req.query, n)
elif req.mode == "vector":
results = store.query(req.query, n)
else:
results = store.hybrid_query(req.query, n)
return SearchResponse(results=results)
The n_results cap at 20 is deliberate โ FIM context windows are small, and returning 20 chunks risks blowing the model's token budget on retrieval noise. The bm25 mode is useful for debugging: if hybrid returns unexpected results, switching to bm25 shows whether the lexical index is the culprit.
Part 5: BM25 Corpus Lifecycle
The BM25 index is in-memory only โ it is not persisted to disk. But the corpus data (the raw text of each chunk) is persisted in Chroma. So on startup:
%%{init: {'theme': 'dark'}}%%
sequenceDiagram
participant S as Server startup
participant C as Chroma (persistent)
participant B as BM25 (in-memory)
S->>C: get all documents + IDs
C-->>S: {ids: [...], documents: [...]}
S->>B: BM25Okapi(tokenized corpus)
Note over B: Index ready for queries
Note over S: On /index (upsert)
S->>C: col.upsert(...)
S->>B: corpus[id] = text; rebuild BM25
Note over S: On /index-dir
loop each file
S->>C: col.upsert(chunks)
S->>B: corpus[id] = text
end
S->>B: rebuild BM25 once after all files
The full rebuild on each upsert is O(corpus_size). For a 10,000-chunk corpus, BM25Okapi.__init__ takes about 200ms on a laptop. This is acceptable for the current scale โ each file save triggers one rebuild. If the corpus grows to 100,000+ chunks (large monorepo), the rebuild time becomes noticeable and should be batched. That optimization is not in scope for post-2.
Part 6: Updated Directory Structure
dhi/
โโโ docker-compose.yml
โโโ extension/
โ โโโ src/completion/provider.ts
โโโ server/
โโโ main.py โ added /index-dir, /search
โโโ requirements.txt โ added new grammar packages + rank-bm25 + pathspec
โโโ inference/
โ โโโ fim.py โ store.hybrid_query() when FIM_USE_RAG=True
โโโ rag/
โโโ chunker.py โ 6 languages, lazy grammar loading
โโโ store.py โ BM25 corpus, hybrid_query(), _rrf_merge()
โโโ indexer.py โ iter_source_files(), .gitignore support
The new requirements.txt additions:
tree-sitter-javascript==0.23.0
tree-sitter-go==0.23.0
tree-sitter-rust==0.23.0
tree-sitter-java==0.23.2
rank-bm25>=0.2.2
pathspec>=0.12.1
All six grammar packages pin to 0.23.x to match tree-sitter==0.23.0. Mixing tree-sitter==0.23 with a tree-sitter-python==0.21 grammar causes silent parsing failures โ the Language() constructor accepts the wrong-version language object without raising.
Part 7: Tests
Post 2 adds 55 new test cases across four files:
| Test file | New tests | What they cover |
tests/rag/test_chunker.py | 28 | All 6 languages, detect_language, node types, line numbers |
tests/rag/test_store.py | 18 | _rrf_merge, bm25_query, hybrid_query, BM25 corpus lifecycle |
tests/rag/test_indexer.py | 20 | iter_source_files, skip-dir pruning, .gitignore filtering |
tests/test_api.py | 12 | /index-dir, /search (3 modes, n_results capping, error handling) |
One interesting testing challenge: BM25Okapi gives IDF = 0 for any term that appears in 50% or more of the corpus. With a 2-document test corpus where each term appears in exactly one document, the IDF is log(1) = 0 and BM25 scores every query at zero. Tests need at least 3 documents with the target term appearing in fewer than half:
def test_returns_matching_doc(self):
# 3-doc corpus โ "greet" appears in 1/3 โ IDF > 0
store = self._store_with_corpus({
"id1": "greet hello world function",
"id2": "add subtract multiply divide",
"id3": "loop iterate repeat cycle",
})
results = store.bm25_query("greet hello", n_results=1)
assert len(results) == 1
assert "greet" in results[0]
The full test suite runs in under 6 seconds with no external services (Ollama, Chroma, Redis all mocked at layer boundaries).
Performance
On a 2,000-file Python monorepo (typical Django app):
| Operation | Time |
/index-dir cold (first index) | ~8s |
/index single file save | ~40ms |
| BM25 rebuild on single file save | ~15ms |
hybrid_query latency | ~25ms |
bm25_query latency | ~2ms |
vector_query latency | ~23ms |
The FIM completion round-trip (including context retrieval) stays under 200ms for the vector + BM25 fetch โ the bottleneck is Ollama generation, not retrieval.
What's Next: Post 3 โ Chat Panel with Streaming RAG
Post 2 gives Dhi repository-level awareness at autocomplete time. The next post builds the chat panel: a VS Code webview where you can ask questions about your codebase and get streamed answers with code diffs.
The interesting engineering problems in Post 3:
- Context slot management: a 4K context window has 6 competing slots (system prompt, active file, diagnostics, retrieved chunks, history, user message). How do you fill them without overflowing?
- SSE streaming: FastAPI sends tokens as server-sent events; the extension renders them incrementally into the webview
- Code block detection: prose goes to the chat panel; code blocks auto-populate a diff preview buffer
The full series code is at github.com/sochaty/dhi. Check out post-2 to reproduce this post exactly. Star the repo if you want to build a production-grade open-source AI IDE without paying $20/month for the privilege.
Sources:
Did you find this helpful?
