Sourish Chakraborty

Sourish Chakraborty

AI Engineering & Modern Data Platforms | Cloud-native Architecture | Platform Engineering

How Dhi Understands Your Entire Codebase: Multi-Language Tree-sitter, BM25, and Hybrid Search

โ€ข15 min read
Cover Image for How Dhi Understands Your Entire Codebase: Multi-Language Tree-sitter, BM25, and Hybrid Search

In Post 1 we built a working FIM autocomplete engine: Tree-sitter chunking for Python and TypeScript, nomic-embed-text embeddings in Chroma, and StarCoder2-3B generating ghost text in VS Code. The engine works โ€” but only for two languages, and only if the extension sends files one by one.

Today we fix both limits.

By the end of this post, Dhi understands six languages, uses a hybrid lexical + semantic search to retrieve context, and can index an entire workspace in one request via a new /index-dir endpoint. The search quality difference is significant: BM25 catches exact function names and identifiers that vector search misses; vector search catches semantic similarity that keyword search misses. Together they catch both.

Everything is at github.com/sochaty/dhi, tag post-2.


What Changed Between Post 1 and Post 2

ComponentPost 1Post 2
LanguagesPython, TypeScript+ Go, Rust, Java, JavaScript
SearchVector only (Chroma cosine)Hybrid: vector + BM25, merged with RRF
IndexingOne file per /index requestEntire workspace via /index-dir
.gitignoreIgnoredRespected (via pathspec)
New endpointsโ€”POST /index-dir, POST /search

Part 1: Expanding Tree-sitter to Six Languages

The Post 1 chunker loaded grammars eagerly at import time:

# Post 1 โ€” fails at import if a grammar package is missing
import tree_sitter_python as tspython
PY_LANGUAGE = Language(tspython.language())

That approach breaks when you add optional grammars. If a user's environment doesn't have tree-sitter-rust installed, the entire server crashes at startup. Post 2 uses lazy loading with a per-language cache:

_LANGUAGE_CACHE: dict[str, Language | None] = {}

def _get_language(name: str) -> Language | None:
    if name not in _LANGUAGE_CACHE:
        _LANGUAGE_CACHE[name] = _load_grammar(name)
    return _LANGUAGE_CACHE[name]

def _load_grammar(name: str) -> Language | None:
    try:
        if name == "python":
            import tree_sitter_python as m
            return Language(m.language())
        if name == "typescript":
            import tree_sitter_typescript as m
            return Language(m.language_typescript())
        if name == "javascript":
            import tree_sitter_javascript as m
            return Language(m.language())
        if name == "go":
            import tree_sitter_go as m
            return Language(m.language())
        if name == "rust":
            import tree_sitter_rust as m
            return Language(m.language())
        if name == "java":
            import tree_sitter_java as m
            return Language(m.language())
    except (ImportError, AttributeError):
        return None
    return None

_load_grammar returns None if the package is missing. The chunker skips that language gracefully instead of crashing. The first call for each language pays the import cost; subsequent calls are instant (dict lookup).

Node Types Per Language

Each language has different AST node types for the same semantic concepts. Here is the full mapping:

CHUNK_NODE_TYPES: dict[str, set[str]] = {
    "python": {
        "import_statement", "import_from_statement",
        "function_definition", "class_definition",
    },
    "typescript": {
        "import_declaration", "import_statement",
        "function_declaration", "arrow_function",
        "class_declaration", "method_definition",
        "interface_declaration", "type_alias_declaration",
        "export_statement",
    },
    "javascript": {
        "import_declaration", "import_statement",
        "function_declaration", "arrow_function",
        "class_declaration", "method_definition",
        "export_statement",
    },
    "go": {
        "function_declaration", "method_declaration",
        "type_declaration", "import_declaration",
        "const_declaration",
    },
    "rust": {
        "function_item", "impl_item", "struct_item",
        "enum_item", "use_declaration",
        "trait_item", "mod_item",
    },
    "java": {
        "class_declaration", "interface_declaration",
        "method_declaration", "import_declaration",
        "enum_declaration",
    },
}

One non-obvious thing: Go uses function_declaration for package-level functions and method_declaration for methods on types โ€” these are separate node types, unlike Python where all def nodes are function_definition. Rust uses function_item and impl_item โ€” an impl block is not unwrapped into its methods, it is kept as a single chunk so the self type context is preserved.

File Extension Mapping

LANGUAGE_BY_SUFFIX: dict[str, str] = {
    ".py":  "python",
    ".ts":  "typescript",
    ".tsx": "typescript",
    ".js":  "javascript",
    ".jsx": "javascript",
    ".go":  "go",
    ".rs":  "rust",
    ".java": "java",
}

def detect_language(path: str) -> str | None:
    suffix = Path(path).suffix.lower()
    return LANGUAGE_BY_SUFFIX.get(suffix)

detect_language is what iter_source_files uses to decide whether a file is worth indexing โ€” if the extension isn't in this table, the file is skipped.

The Chunking Pipeline

%%{init: {'theme': 'dark'}}%%
flowchart LR
    FILE["Source File\n.py .ts .js .go .rs .java"]
    LANG["detect_language()\nextension โ†’ language name"]
    GRAM["_get_language()\nlazy-load grammar\n(cached after first call)"]
    PARSE["tree_sitter.Parser\nโ†’ Concrete Syntax Tree"]
    WALK["_parse_and_walk()\ndepth-first traversal"]
    FILTER["CHUNK_NODE_TYPES filter\nskip nested nodes"]
    CHUNK["Chunk dataclass\ntext ยท file_path ยท start/end line\nlanguage ยท node_type"]

    FILE --> LANG --> GRAM --> PARSE --> WALK --> FILTER --> CHUNK

The walker visits every node in the CST. When it hits a node whose type is in CHUNK_NODE_TYPES[language], it emits a chunk and does not recurse into that node's children. This prevents double-chunking โ€” a method inside a class becomes part of the class chunk, not a separate chunk.


Part 2: Workspace-Level File Discovery

Post 1 required the VS Code extension to iterate over open files and send each one to /index individually. That works for active editing but can't index a 10,000-file repository on startup.

Post 2 adds server/rag/indexer.py โ€” a file-system walker that produces paths suitable for chunking, with two important guardrails:

  1. Skip-list pruning: node_modules, .git, __pycache__, .venv, dist, build, target, and other directories that are never worth indexing are pruned from os.walk in-place.
  2. .gitignore awareness: when pathspec is installed, files matching the root .gitignore are filtered out.
_SKIP_DIRS: frozenset[str] = frozenset({
    ".git", ".hg", ".svn", "__pycache__", "node_modules",
    ".venv", "venv", "env", "dist", "build", "target",
    ".next", ".nuxt", "out", "vendor", ".cache",
    "coverage", ".pytest_cache", ".mypy_cache", ".ruff_cache",
})

def iter_source_files(
    root: str, *, respect_gitignore: bool = True,
) -> Iterator[str]:
    root_path = Path(root).resolve()
    if not root_path.is_dir():
        raise ValueError(f"directory not found: {root}")

    spec = _load_gitignore(root_path) if respect_gitignore else None

    for dirpath, dirnames, filenames in os.walk(root_path, topdown=True):
        # Prune in-place โ€” controls os.walk recursion
        dirnames[:] = [
            d for d in dirnames
            if d not in _SKIP_DIRS and not d.startswith(".")
        ]
        for filename in filenames:
            abs_path = Path(dirpath) / filename
            if spec is not None:
                rel = abs_path.relative_to(root_path)
                if spec.match_file(str(rel)):
                    continue
            if detect_language(str(abs_path)) is not None:
                yield str(abs_path)

The in-place dirnames[:] mutation is the key. os.walk with topdown=True respects in-place modifications to dirnames โ€” setting it to an empty list stops recursion into that directory entirely. This is the correct way to prune os.walk; deleting entries after the walk starts is too late.

.gitignore Loading

def _load_gitignore(root: Path) -> object | None:
    gi_file = root / ".gitignore"
    if not gi_file.is_file():
        return None
    try:
        import pathspec
        with gi_file.open() as fh:
            return pathspec.PathSpec.from_lines("gitwildmatch", fh)
    except ImportError:
        return None

pathspec is an optional dependency. If it is not installed, .gitignore filtering is silently skipped โ€” the indexer still works, it just indexes a few more files than necessary. This is the right tradeoff: a missing optional dependency should degrade gracefully, not crash.


Part 3: Hybrid Search with Reciprocal Rank Fusion

Vector search is excellent at semantic similarity. If you have save_to_disk() in your codebase and query persist data, vector search will find it. But if you query the exact function name save_to_disk, keyword search (BM25) will rank it higher โ€” because BM25 is an exact lexical match.

Real developer queries are a mix: sometimes you remember a function name, sometimes you describe what you need semantically. The optimal retrieval system uses both.

BM25 Okapi

BM25 (Best Match 25) is a probabilistic keyword ranking function from 1994 that still beats many neural retrievers on exact-match tasks. The scoring formula for a document D against query Q:

score(D, Q) = ฮฃ_i  IDF(q_i) ยท (tf(q_i, D) ยท (kโ‚ + 1)) / (tf(q_i, D) + kโ‚ ยท (1 - b + b ยท |D|/avgdl))

Where:

  • IDF(q_i) โ€” inverse document frequency: rare terms score higher
  • tf(q_i, D) โ€” term frequency in the document
  • kโ‚ = 1.5 โ€” term frequency saturation (diminishing returns after kโ‚ occurrences)
  • b = 0.75 โ€” length normalization (long documents don't dominate)
  • avgdl โ€” average document length in the corpus

The implementation uses rank_bm25.BM25Okapi. The corpus is maintained in memory as a dict of chunk_id โ†’ text:

def _tokenize(text: str) -> list[str]:
    return text.lower().split()  # simple โ€” language-agnostic

class ChunkStore:
    def __init__(self, ...) -> None:
        ...
        self._bm25_corpus: dict[str, str] = {}
        self._bm25: Any = None  # BM25Okapi | None
        self._init_bm25()  # seed from Chroma on startup

    def _rebuild_bm25(self) -> None:
        if not self._bm25_corpus:
            self._bm25 = None
            return
        from rank_bm25 import BM25Okapi
        corpus = list(self._bm25_corpus.values())
        self._bm25 = BM25Okapi([_tokenize(t) for t in corpus])
        self._bm25_ids = list(self._bm25_corpus.keys())

    def bm25_query(self, text: str, n_results: int = 5) -> list[str]:
        if self._bm25 is None:
            return []
        scores = self._bm25.get_scores(_tokenize(text))
        ranked = sorted(
            zip(self._bm25_ids, scores),
            key=lambda x: x[1],
            reverse=True,
        )
        return [
            self._bm25_corpus[cid]
            for cid, score in ranked[:n_results]
            if score > 0 and cid in self._bm25_corpus
        ]

One important implementation note: _init_bm25 seeds the in-memory corpus from Chroma when the server starts. This means that even after a server restart, the BM25 index is recovered from the persistent Chroma store โ€” you don't lose lexical search on restart.

Reciprocal Rank Fusion

The problem: vector search returns a ranked list and BM25 returns a different ranked list. How do you combine them into one ranking without knowing the absolute relevance scores?

Reciprocal Rank Fusion (RRF) is the answer. It was introduced in a 2009 SIGIR paper and remains the de facto standard for combining ranked lists in hybrid search. The formula:

score(doc) = ฮฃ_r  1 / (k + rank_r(doc) + 1)

Where k = 60 (the original paper's default), rank_r(doc) is the 0-indexed rank of the document in result list r, and the sum is over all result lists.

The key insight: rank matters, absolute scores don't. A document ranked 1st in both lists scores 1/61 + 1/61 โ‰ˆ 0.033. A document ranked 1st in one list and absent from the other scores 1/61 โ‰ˆ 0.016. The k constant dampens the advantage of being ranked #1 vs #2 in a single list.

def _rrf_merge(
    vector_hits: list[str],
    bm25_hits: list[str],
    n: int,
    k: int = 60,
) -> list[str]:
    scores: dict[str, float] = {}
    for rank, doc in enumerate(vector_hits):
        scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    for rank, doc in enumerate(bm25_hits):
        scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:n]

The Full Hybrid Pipeline

%%{init: {'theme': 'dark'}}%%
flowchart TB
    QUERY["Query text\n(last 200 chars of prefix)"]

    subgraph vector["Vector search"]
        EMBED["nomic-embed-text\n768-dim embedding"]
        CHROMA["Chroma ANN\ncosine similarity\ntop-2N results"]
    end

    subgraph lexical["Lexical search"]
        TOK["Whitespace tokenizer"]
        BM25["BM25Okapi\nIDF ยท TF ยท length norm\ntop-2N results"]
    end

    RRF["RRF merge\nscore = ฮฃ 1/(k + rank + 1)\nk = 60"]
    OUT["Top-N chunks\ninjected into FIM prefix"]

    QUERY --> EMBED --> CHROMA --> RRF
    QUERY --> TOK --> BM25 --> RRF
    RRF --> OUT
def hybrid_query(self, text: str, n_results: int = 5) -> list[str]:
    fetch = n_results * 2  # over-fetch from each source before merging
    vector_hits = self._vector_query(text, fetch)
    bm25_hits = self.bm25_query(text, fetch)
    if not bm25_hits:
        return vector_hits[:n_results]  # graceful fallback
    return _rrf_merge(vector_hits, bm25_hits, n_results)

The 2ร— over-fetch is a standard trick: each retriever fetches 2N results instead of N, then RRF picks the best N from the union. This gives RRF enough candidates to work with โ€” if both retrievers returned exactly N, any documents that are only in one list would have no overlap to merge.

When Does Hybrid Outperform Vector-Only?

Query typeVector searchBM25Hybrid
"parse_config" (exact name)May miss if semantically distantExact matchโœ“
"parse configuration file" (semantic)StrongMay missโœ“
"JWT token expiry" (mixed)Good on "token expiry"Good on "JWT"โœ“
"def foo" (syntax fragment)PoorPoor (too generic)Poor

The last row is important: neither approach handles syntax fragments well. This is expected โ€” the FIM context assembler uses the last 200 characters of prefix, not the raw def foo token.


Part 4: New API Endpoints

POST /index-dir

Index an entire workspace directory in one request. The endpoint walks the directory, reads each source file, chunks it with Tree-sitter, and upserts into the store. Files that fail to parse are logged and skipped โ€” a single malformed file cannot abort the whole indexing run.

class IndexDirRequest(BaseModel):
    dir_path: str
    respect_gitignore: bool = True

class IndexDirResponse(BaseModel):
    indexed_files: int
    indexed_chunks: int

@app.post("/index-dir", response_model=IndexDirResponse)
def index_dir_endpoint(req: IndexDirRequest) -> IndexDirResponse:
    try:
        total_files = 0
        total_chunks = 0
        for file_path in iter_source_files(
            req.dir_path, respect_gitignore=req.respect_gitignore
        ):
            try:
                chunks = list(chunk_file(file_path))
                if chunks:
                    store.upsert(chunks)
                    total_chunks += len(chunks)
                total_files += 1
            except Exception:
                logging.exception("index_dir_endpoint: skipping %s", file_path)
        return IndexDirResponse(indexed_files=total_files, indexed_chunks=total_chunks)
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc)) from exc

Typical usage from the extension on workspace open:

await dhiClient.indexDir(workspaceRoot, { respectGitignore: true });

This single call replaces the extension iterating over every file with vscode.workspace.findFiles. For a 500-file TypeScript monorepo, this takes about 4 seconds on an M2 MacBook Air โ€” versus 12+ seconds when the extension sends files one by one over HTTP.

POST /search

An explicit search endpoint for the chat panel and future agent context assembly. Supports three modes:

class SearchRequest(BaseModel):
    query: str
    n_results: int = 5
    mode: str = "hybrid"  # "hybrid" | "vector" | "bm25"

@app.post("/search", response_model=SearchResponse)
def search_endpoint(req: SearchRequest) -> SearchResponse:
    n = max(1, min(req.n_results, 20))
    if req.mode == "bm25":
        results = store.bm25_query(req.query, n)
    elif req.mode == "vector":
        results = store.query(req.query, n)
    else:
        results = store.hybrid_query(req.query, n)
    return SearchResponse(results=results)

The n_results cap at 20 is deliberate โ€” FIM context windows are small, and returning 20 chunks risks blowing the model's token budget on retrieval noise. The bm25 mode is useful for debugging: if hybrid returns unexpected results, switching to bm25 shows whether the lexical index is the culprit.


Part 5: BM25 Corpus Lifecycle

The BM25 index is in-memory only โ€” it is not persisted to disk. But the corpus data (the raw text of each chunk) is persisted in Chroma. So on startup:

%%{init: {'theme': 'dark'}}%%
sequenceDiagram
    participant S as Server startup
    participant C as Chroma (persistent)
    participant B as BM25 (in-memory)

    S->>C: get all documents + IDs
    C-->>S: {ids: [...], documents: [...]}
    S->>B: BM25Okapi(tokenized corpus)
    Note over B: Index ready for queries

    Note over S: On /index (upsert)
    S->>C: col.upsert(...)
    S->>B: corpus[id] = text; rebuild BM25

    Note over S: On /index-dir
    loop each file
        S->>C: col.upsert(chunks)
        S->>B: corpus[id] = text
    end
    S->>B: rebuild BM25 once after all files

The full rebuild on each upsert is O(corpus_size). For a 10,000-chunk corpus, BM25Okapi.__init__ takes about 200ms on a laptop. This is acceptable for the current scale โ€” each file save triggers one rebuild. If the corpus grows to 100,000+ chunks (large monorepo), the rebuild time becomes noticeable and should be batched. That optimization is not in scope for post-2.


Part 6: Updated Directory Structure

dhi/
โ”œโ”€โ”€ docker-compose.yml
โ”œโ”€โ”€ extension/
โ”‚   โ””โ”€โ”€ src/completion/provider.ts
โ””โ”€โ”€ server/
    โ”œโ”€โ”€ main.py                    โ† added /index-dir, /search
    โ”œโ”€โ”€ requirements.txt           โ† added new grammar packages + rank-bm25 + pathspec
    โ”œโ”€โ”€ inference/
    โ”‚   โ””โ”€โ”€ fim.py                 โ† store.hybrid_query() when FIM_USE_RAG=True
    โ””โ”€โ”€ rag/
        โ”œโ”€โ”€ chunker.py             โ† 6 languages, lazy grammar loading
        โ”œโ”€โ”€ store.py               โ† BM25 corpus, hybrid_query(), _rrf_merge()
        โ””โ”€โ”€ indexer.py             โ† iter_source_files(), .gitignore support

The new requirements.txt additions:

tree-sitter-javascript==0.23.0
tree-sitter-go==0.23.0
tree-sitter-rust==0.23.0
tree-sitter-java==0.23.2
rank-bm25>=0.2.2
pathspec>=0.12.1

All six grammar packages pin to 0.23.x to match tree-sitter==0.23.0. Mixing tree-sitter==0.23 with a tree-sitter-python==0.21 grammar causes silent parsing failures โ€” the Language() constructor accepts the wrong-version language object without raising.


Part 7: Tests

Post 2 adds 55 new test cases across four files:

Test fileNew testsWhat they cover
tests/rag/test_chunker.py28All 6 languages, detect_language, node types, line numbers
tests/rag/test_store.py18_rrf_merge, bm25_query, hybrid_query, BM25 corpus lifecycle
tests/rag/test_indexer.py20iter_source_files, skip-dir pruning, .gitignore filtering
tests/test_api.py12/index-dir, /search (3 modes, n_results capping, error handling)

One interesting testing challenge: BM25Okapi gives IDF = 0 for any term that appears in 50% or more of the corpus. With a 2-document test corpus where each term appears in exactly one document, the IDF is log(1) = 0 and BM25 scores every query at zero. Tests need at least 3 documents with the target term appearing in fewer than half:

def test_returns_matching_doc(self):
    # 3-doc corpus โ€” "greet" appears in 1/3 โ†’ IDF > 0
    store = self._store_with_corpus({
        "id1": "greet hello world function",
        "id2": "add subtract multiply divide",
        "id3": "loop iterate repeat cycle",
    })
    results = store.bm25_query("greet hello", n_results=1)
    assert len(results) == 1
    assert "greet" in results[0]

The full test suite runs in under 6 seconds with no external services (Ollama, Chroma, Redis all mocked at layer boundaries).


Performance

On a 2,000-file Python monorepo (typical Django app):

OperationTime
/index-dir cold (first index)~8s
/index single file save~40ms
BM25 rebuild on single file save~15ms
hybrid_query latency~25ms
bm25_query latency~2ms
vector_query latency~23ms

The FIM completion round-trip (including context retrieval) stays under 200ms for the vector + BM25 fetch โ€” the bottleneck is Ollama generation, not retrieval.


What's Next: Post 3 โ€” Chat Panel with Streaming RAG

Post 2 gives Dhi repository-level awareness at autocomplete time. The next post builds the chat panel: a VS Code webview where you can ask questions about your codebase and get streamed answers with code diffs.

The interesting engineering problems in Post 3:

  • Context slot management: a 4K context window has 6 competing slots (system prompt, active file, diagnostics, retrieved chunks, history, user message). How do you fill them without overflowing?
  • SSE streaming: FastAPI sends tokens as server-sent events; the extension renders them incrementally into the webview
  • Code block detection: prose goes to the chat panel; code blocks auto-populate a diff preview buffer

The full series code is at github.com/sochaty/dhi. Check out post-2 to reproduce this post exactly. Star the repo if you want to build a production-grade open-source AI IDE without paying $20/month for the privilege.

Sources:

Did you find this helpful?

Comments