FIM Autocomplete at < 150ms: Tree-sitter, nomic Embeddings, and StarCoder2-3B on a Laptop

Jun 8, 2026 • 13 min read

In the previous post I laid out the full six-layer architecture of an open-source AI coding IDE. Today we start building it.

This post ships the first real component of Dhi (धी) — an open-source AI coding IDE built entirely on open-source models. Dhi means pure intellect in Sanskrit, from the Gayatri Mantra. The name fits: the goal is an IDE that gives you genuine intelligence over your codebase — with no API keys, no token pricing, and no closed-source inference backend.

By the end of this post you will have:

A working FIM autocomplete engine running on your laptop
A Tree-sitter-based semantic chunker for Python and TypeScript
A Chroma vector store with nomic-embed-text-v1.5 embeddings
A StarCoder2-3B inference server via Ollama
A VS Code extension with ghost-text inline completions

Everything runs with docker compose up. The full code is at github.com/sochaty/dhi — tag post-1.

What FIM Actually Is

Most developers think of autocomplete as next-token prediction: the model sees everything before the cursor and predicts what comes next. That is how GPT-2 works. It is not how a modern AI coding assistant works.

Real autocomplete in 2026 is fill-in-the-middle (FIM): the model sees both the prefix (everything before the cursor) and the suffix (everything after), then generates the completion that bridges them. This is far more accurate because the model knows what the code is supposed to arrive at, not just what it started from.

The three special tokens that make FIM work:

<fim_prefix>  → everything before the cursor
<fim_suffix>  → everything after the cursor
<fim_middle>  → the model generates this

StarCoder2, DeepSeek-Coder, and Qwen2.5-Coder all support FIM natively. This is non-negotiable for a production autocomplete engine — a model without native FIM support gives noticeably worse inline completions.
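
To make the token layout concrete, here is a minimal sketch that splits a source string at a cursor offset and wraps the two halves in the StarCoder2-style tokens listed above. The helper names split_at_cursor and fim_prompt and the sample snippet are illustrative assumptions, not code from the Dhi repo.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Return (prefix, suffix) around a 0-based character offset."""
    cursor = max(0, min(cursor, len(source)))  # clamp out-of-range offsets
    return source[:cursor], source[cursor:]


def fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Hypothetical file content with the cursor right after "return ".
source = "def area(r):\n    return \n\nprint(area(2))\n"
cursor = source.index("return ") + len("return ")
prefix, suffix = split_at_cursor(source, cursor)
prompt = fim_prompt(prefix, suffix)
```

Because the suffix contains print(area(2)), the model can see that the function's return value is used, which a prefix-only prompt would hide.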

The full request flow for Dhi's autocomplete engine:

%%{init: {'theme': 'dark'}}%%
flowchart TB
    ED["VS Code Editor\nCursor position event"]
    DB["Debouncer\n150ms wait"]
    CA["Context Assembler"]

    subgraph prefix["fim_prefix"]
        direction TB
        RC["Retrieved chunks\ntop-3 from vector store"]
        FA["Current file\nlines 0 → cursor"]
    end

    subgraph suffix["fim_suffix"]
        FB["Current file\ncursor → EOF"]
    end

    INF["Inference Server\nOllama (dev) · vLLM (prod)"]
    EXT["Extension\nGhost text render"]

    ED --> DB --> CA
    CA --> prefix & suffix
    prefix & suffix --> INF --> EXT

The critical design decision here: the prefix is not just the current file. It includes retrieved chunks from the rest of the repository. Without that context, the model cannot complete a function call using a helper defined in another file.

Bootstrapping Dhi

Clone the repo and start the CPU stack:

git clone https://github.com/sochaty/dhi
cd dhi
git checkout post-1
docker compose up

The docker-compose.yml at this tag runs three services:

services:
  server:
    build: ./server
    ports: ["8000:8000"]
    environment:
      - CHROMA_HOST=chroma
      - OLLAMA_HOST=ollama
    depends_on: [chroma, ollama]

  chroma:
    image: chromadb/chroma:0.5.0
    volumes: ["chroma_data:/chroma/chroma"]

  ollama:
    image: ollama/ollama:latest
    volumes: ["ollama_models:/root/.ollama"]
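    entrypoint: ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama pull starcoder2:3b && ollama pull nomic-embed-text && wait"]

On first start, Ollama pulls StarCoder2-3B (~1.7GB) and the nomic-embed-text embedding model used by the vector store. Subsequent starts reuse the ollama_models volume and skip the download. The full repo directory structure at post-1: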

dhi/
├── docker-compose.yml
├── extension/
│   └── src/completion/provider.ts
└── server/
    ├── main.py
    ├── inference/fim.py
    └── rag/
        ├── chunker.py
        └── store.py

Layer 1: Tree-sitter Semantic Chunking

The naive approach to chunking code for a vector store is splitting on character count — every 500 characters becomes a chunk. This is wrong for two reasons: it cuts function bodies in the middle (destroying semantic meaning) and it groups unrelated code together (polluting retrieval).

Dhi uses Tree-sitter to split on semantic boundaries. A function definition becomes one chunk. An import statement becomes one chunk. A class body becomes one chunk. Each chunk is self-contained and contextually coherent.
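
The difference is easy to see on a toy file. The sketch below uses Python's built-in ast module as a stand-in for Tree-sitter (an assumption made only so the example runs without extra packages) and compares fixed-size splitting with splitting on top-level definitions.

```python
import ast

SOURCE = '''import os

def load(path):
    with open(path) as fh:
        return fh.read()

class Cache:
    def get(self, key):
        return None
'''


def naive_chunks(source: str, size: int) -> list[str]:
    """Fixed-size character windows, blind to syntax."""
    return [source[i:i + size] for i in range(0, len(source), size)]


def semantic_chunks(source: str) -> list[str]:
    """One chunk per top-level statement, using stdlib ast as a stand-in parser."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


naive = naive_chunks(SOURCE, 40)     # windows cut through the body of load()
semantic = semantic_chunks(SOURCE)   # import, function, class
```

The first naive window ends in the middle of the with statement, while each semantic chunk is a complete definition that can be embedded and retrieved on its own.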

%%{init: {'theme': 'dark'}}%%
flowchart LR
    SRC["Source File\n.py or .ts"]
    TS["Tree-sitter Parser\n(per language grammar)"]
    AST["Concrete Syntax Tree"]

    subgraph chunks["Semantic Chunks"]
        direction TB
        C1["import_statement → 1 chunk\none chunk per import statement"]
        C2["function_definition → 1 chunk\nentire function body"]
        C3["class_definition → 1 chunk\nentire class body"]
    end

    META["Metadata Overlay\nfile path · start line · end line · language · symbol name"]
    EMBED["nomic-embed-text-v1.5\n768-dim · runs locally via Ollama"]
    CHROMA["Chroma\npersistent vector store"]

    SRC --> TS --> AST --> chunks --> META --> EMBED --> CHROMA

Here is server/rag/chunker.py — the full implementation for Python and TypeScript:

from dataclasses import dataclass
from pathlib import Path
from typing import Generator

import tree_sitter_python as tspython
import tree_sitter_typescript as tstypescript
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
TS_LANGUAGE = Language(tstypescript.language_typescript())

# Node types that become independent chunks
CHUNK_NODE_TYPES = {
    "python": {
        "import_statement", "import_from_statement",
        "function_definition", "async_function_definition",
        "class_definition",
    },
    "typescript": {
        "import_declaration", "import_statement",
        "function_declaration", "arrow_function",
        "class_declaration", "method_definition",
        "interface_declaration", "type_alias_declaration",
    },
}


@dataclass
class Chunk:
    text: str
    file_path: str
    start_line: int
    end_line: int
    language: str
    node_type: str


def _get_parser(language: str) -> Parser:
    # py-tree-sitter 0.22+ takes the language in the constructor;
    # Parser.set_language() is deprecated there and gone in later releases.
    return Parser(PY_LANGUAGE if language == "python" else TS_LANGUAGE)


def _detect_language(path: str) -> str | None:
    suffix = Path(path).suffix
    if suffix == ".py":
        return "python"
    if suffix in (".ts", ".tsx"):
        return "typescript"
    return None


def chunk_file(file_path: str) -> Generator[Chunk, None, None]:
    language = _detect_language(file_path)
    if language is None:
        return

    source = Path(file_path).read_bytes()
    parser = _get_parser(language)
    tree = parser.parse(source)
    target_types = CHUNK_NODE_TYPES[language]

    def walk(node):
        if node.type in target_types:
            text = source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")
            yield Chunk(
                text=text,
                file_path=file_path,
                start_line=node.start_point[0] + 1,
                end_line=node.end_point[0] + 1,
                language=language,
                node_type=node.type,
            )
        else:
            for child in node.children:
                yield from walk(child)

    yield from walk(tree.root_node)

Two things worth noting:

Greedy top-level chunking. The walk function yields a node and stops recursing into it. A class body becomes one chunk — it does not also yield its individual methods as separate chunks. This keeps retrieval units large enough to be coherent.

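Language-specific grammar packages. tree_sitter_python and tree_sitter_typescript are PyPI packages that ship pre-compiled grammars. No C compilation step is required, which matters for Docker images. Note that .tsx files are routed to the plain TypeScript grammar here; the package also exposes language_tsx() for files that contain JSX.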
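
The stop-at-first-match behaviour of walk can be seen on a hand-built tree. Node below is a hypothetical stand-in for a Tree-sitter node, carrying only the two attributes the traversal needs; the tree shape is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a Tree-sitter node."""
    type: str
    children: list["Node"] = field(default_factory=list)


TARGET_TYPES = {"function_definition", "class_definition"}


def walk(node: Node) -> Iterator[str]:
    if node.type in TARGET_TYPES:
        yield node.type  # emit the node and do not descend into it
    else:
        for child in node.children:
            yield from walk(child)


module = Node("module", [
    # A class with two methods: only the class itself is emitted.
    Node("class_definition", [
        Node("block", [Node("function_definition"), Node("function_definition")]),
    ]),
    # A function nested under a non-target node is still found.
    Node("if_statement", [Node("block", [Node("function_definition")])]),
])

emitted = list(walk(module))
```

The two methods never appear in emitted because the traversal stops at their enclosing class, while the function under the if statement is reached by recursing through non-target nodes.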

Layer 2: Embedding and Vector Storage

Each chunk is embedded with nomic-embed-text-v1.5, a 768-dimension model that runs locally via Ollama. Nomic's published benchmarks place it ahead of OpenAI's text-embedding-ada-002 on general retrieval tasks, and it costs nothing per query.

Here is server/rag/store.py:

import hashlib
import os
from typing import Sequence

import chromadb
import httpx

CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
COLLECTION_NAME = "dhi_chunks"
EMBED_MODEL = "nomic-embed-text"


def _embed(texts: list[str]) -> list[list[float]]:
    resp = httpx.post(
        f"http://{OLLAMA_HOST}:11434/api/embed",
        json={"model": EMBED_MODEL, "input": texts},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]


def _chunk_id(chunk) -> str:
    key = f"{chunk.file_path}:{chunk.start_line}:{chunk.end_line}"
    return hashlib.md5(key.encode()).hexdigest()


class ChunkStore:
    def __init__(self):
        client = chromadb.HttpClient(host=CHROMA_HOST, port=8000)
        self._col = client.get_or_create_collection(
            name=COLLECTION_NAME,
            metadata={"hnsw:space": "cosine"},
        )

    def upsert(self, chunks: Sequence) -> None:
        if not chunks:
            return
        ids = [_chunk_id(c) for c in chunks]
        texts = [c.text for c in chunks]
        embeddings = _embed(texts)
        metadatas = [
            {
                "file_path": c.file_path,
                "start_line": c.start_line,
                "end_line": c.end_line,
                "language": c.language,
                "node_type": c.node_type,
            }
            for c in chunks
        ]
        self._col.upsert(ids=ids, embeddings=embeddings, documents=texts, metadatas=metadatas)

    def query(self, text: str, n_results: int = 3) -> list[str]:
        if self._col.count() == 0:
            return []
        embeddings = _embed([text])
        results = self._col.query(
            query_embeddings=embeddings,
            n_results=min(n_results, self._col.count()),
            include=["documents"],
        )
        return results["documents"][0]

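The upsert method is idempotent: re-indexing an unchanged file overwrites its chunks rather than duplicating them. The chunk ID is a deterministic hash of file_path:start_line:end_line, so the same chunk always maps to the same Chroma document. The flip side is that an edit which shifts line numbers produces new IDs, and the old entries stay in the collection until they are deleted.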
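
A short sketch of how the ID scheme behaves when a file changes. chunk_id mirrors _chunk_id above but takes plain arguments; the path and line numbers are made up for illustration.

```python
import hashlib


def chunk_id(file_path: str, start_line: int, end_line: int) -> str:
    """Same scheme as _chunk_id above: hash of the path and line span."""
    key = f"{file_path}:{start_line}:{end_line}"
    return hashlib.md5(key.encode()).hexdigest()


before = chunk_id("app/util.py", 10, 24)
again = chunk_id("app/util.py", 10, 24)    # unchanged span, same ID
# One line inserted above the function shifts its span by one.
shifted = chunk_id("app/util.py", 11, 25)

stale_ids = {before} - {shifted}  # IDs that a plain upsert would leave behind
```

One way to avoid stale entries, assuming Chroma's collection.delete with a where filter on file_path, is to delete a file's existing chunks before upserting the new ones.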

Layer 3: Assembling the FIM Prompt

The FIM prompt has a precise structure. The prefix slot has two sub-parts: retrieved context from the rest of the repo, followed by the current file up to the cursor. The suffix is everything after the cursor.

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph fim_prefix["<fim_prefix>"]
        direction TB
        CTX["Repo context\n3 retrieved chunks · ~1500 tokens\n(most relevant functions/classes)"]
        CUR["Current file prefix\nlines 0 → cursor · ~800 tokens"]
    end

    subgraph fim_suffix["<fim_suffix>"]
        direction TB
        SUF["Current file suffix\ncursor → EOF · ~400 tokens"]
    end

    MID["<fim_middle>\nModel generates the completion here"]

    fim_prefix --> MID
    fim_suffix --> MID

Here is server/inference/fim.py:

import os
from dataclasses import dataclass

import httpx

from rag.store import ChunkStore

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
FIM_MODEL = os.getenv("FIM_MODEL", "starcoder2:3b")

# StarCoder2 FIM special tokens
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


@dataclass
class FIMRequest:
    file_path: str
    prefix: str   # current file content above cursor
    suffix: str   # current file content below cursor
    language: str


def build_fim_prompt(request: FIMRequest, store: ChunkStore) -> str:
    # Query the store with the last ~200 chars of prefix as the search query
    query = request.prefix[-200:].strip() or request.file_path
    context_chunks = store.query(query, n_results=3)

    context_block = "\n\n".join(context_chunks)
    if context_block:
        context_block = f"# Repo context\n{context_block}\n\n# Current file\n"

    return (
        f"{FIM_PREFIX}"
        f"{context_block}"
        f"{request.prefix}"
        f"{FIM_SUFFIX}"
        f"{request.suffix}"
        f"{FIM_MIDDLE}"
    )


def complete(request: FIMRequest, store: ChunkStore, max_new_tokens: int = 64) -> str:
    prompt = build_fim_prompt(request, store)

    resp = httpx.post(
        f"http://{OLLAMA_HOST}:11434/api/generate",
        json={
            "model": FIM_MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": max_new_tokens,
                "temperature": 0.1,
                "stop": ["\n\n", FIM_PREFIX, FIM_SUFFIX],
            },
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["response"]

Two implementation choices worth explaining:

Low temperature (0.1). Autocomplete is not a creative task. You want the most probable continuation, not a diverse sample. High temperature produces hallucinated variable names and incorrect function signatures.

Stop tokens include \n\n. A single blank line is the natural end of a completion. Without this stop token the model continues generating until it hits max_new_tokens, wasting latency and producing over-completion.

Model Selection

Not every developer has the same hardware. Here is the recommended model per tier:

Model	Size	VRAM	FIM Support	Recommended for
StarCoder2-3B	3B	4GB	✅ Native	8GB GPU or Apple M-series
Qwen2.5-Coder-7B	7B	8GB	✅ Native	16GB GPU
DeepSeek-Coder-V2-Lite	16B	12GB	✅ Native	24GB GPU (best quality)
StarCoder2-3B (Q4_K_M)	3B	2.5GB	✅ Native	CPU-only (slow but works)

Change the model in .env:

FIM_MODEL=qwen2.5-coder:7b

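The model has to be available in Ollama before the first request; the compose file shown above pulls only starcoder2:3b, so pull the new tag as well (for example with ollama pull inside the container). The request code stays the same, with one caveat: FIM special tokens differ between model families, and fim.py hardcodes StarCoder2's. Either update FIM_PREFIX, FIM_SUFFIX, and FIM_MIDDLE to match the new model, or pass the text after the cursor in the suffix field of Ollama's /api/generate, which lets Ollama apply the model's own FIM template where the model supports it.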
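
Since FIM special tokens differ between model families, a prompt builder that assembles the string itself needs a per-family lookup. The sketch below is one way to do that; the token strings are assumptions based on each family's published model card and should be verified before use, and tokens_for and build_prompt are hypothetical helpers.

```python
# (start, between, end) tokens per model family. Assumed from the model
# cards; verify against the model you actually deploy.
FIM_TOKENS = {
    "starcoder2": ("<fim_prefix>", "<fim_suffix>", "<fim_middle>"),
    "qwen2.5-coder": ("<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"),
    "deepseek-coder": ("<｜fim▁begin｜>", "<｜fim▁hole｜>", "<｜fim▁end｜>"),
}


def tokens_for(model_tag: str) -> tuple[str, str, str]:
    """Look up tokens from an Ollama-style tag such as 'qwen2.5-coder:7b'."""
    family = model_tag.split(":", 1)[0]
    for name, tokens in FIM_TOKENS.items():
        if family.startswith(name):
            return tokens
    raise ValueError(f"no FIM tokens known for {model_tag!r}")


def build_prompt(model_tag: str, prefix: str, suffix: str) -> str:
    """Wrap prefix and suffix in the family's tokens, in prefix-suffix-middle order."""
    start, between, end = tokens_for(model_tag)
    return f"{start}{prefix}{between}{suffix}{end}"
```

Raising on an unknown family is deliberate: silently falling back to StarCoder2 tokens would produce completions that look plausible but ignore the suffix.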

Layer 4: The VS Code Extension

The extension registers an InlineCompletionItemProvider. VS Code calls it whenever the user pauses typing. The debounce prevents a network round-trip on every keystroke.

%%{init: {'theme': 'dark'}}%%
flowchart LR
    KEYSTROKE["Keystroke in editor"]
    DEBOUNCE["Debounce\n150ms — cancel if user keeps typing"]
    CONTEXT["Extract context\nfile path · prefix · suffix · language"]
    POST["POST /complete\n{file_path, prefix, suffix, language}"]
    SERVER["Dhi server\nbuild FIM prompt · Ollama · return completion"]
    GHOST["VS Code\nrender ghost text"]
    ACCEPT["Tab key\naccept completion"]

    KEYSTROKE --> DEBOUNCE --> CONTEXT --> POST --> SERVER --> GHOST --> ACCEPT

Here is extension/src/completion/provider.ts:

import * as vscode from 'vscode';

const SERVER_URL = vscode.workspace
    .getConfiguration('dhi')
    .get<string>('serverUrl', 'http://localhost:8000');

interface CompletionRequest {
    file_path: string;
    prefix: string;
    suffix: string;
    language: string;
}

async function fetchCompletion(req: CompletionRequest): Promise<string | null> {
    try {
        const res = await fetch(`${SERVER_URL}/complete`, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(req),
            signal: AbortSignal.timeout(3000),
        });
        if (!res.ok) return null;
        const data = await res.json() as { completion: string };
        return data.completion ?? null;
    } catch {
        return null;
    }
}

export class DhiCompletionProvider implements vscode.InlineCompletionItemProvider {
    private pending: NodeJS.Timeout | null = null;
    private settlePending: ((value: string | null) => void) | null = null;

    async provideInlineCompletionItems(
        document: vscode.TextDocument,
        position: vscode.Position,
        _context: vscode.InlineCompletionContext,
        token: vscode.CancellationToken,
    ): Promise<vscode.InlineCompletionList | null> {
        // Debounce: cancel the previous timer and settle its promise,
        // otherwise the superseded call would await forever.
        if (this.pending) {
            clearTimeout(this.pending);
            this.settlePending?.(null);
            this.pending = null;
            this.settlePending = null;
        }

        const completion = await new Promise<string | null>((resolve) => {
            this.settlePending = resolve;
            this.pending = setTimeout(async () => {
                // The timer has fired, so there is nothing left to cancel.
                this.pending = null;
                this.settlePending = null;

                if (token.isCancellationRequested) {
                    resolve(null);
                    return;
                }

                const offset = document.offsetAt(position);
                const text = document.getText();

                const req: CompletionRequest = {
                    file_path: document.uri.fsPath,
                    prefix: text.slice(0, offset),
                    suffix: text.slice(offset),
                    language: document.languageId,
                };

                resolve(await fetchCompletion(req));
            }, 150);
        });

        if (!completion || token.isCancellationRequested) return null;

        return {
            items: [
                new vscode.InlineCompletionItem(
                    completion,
                    new vscode.Range(position, position),
                ),
            ],
        };
    }
}

The provider is registered in the extension entry point, extension.ts:

import * as vscode from 'vscode';
import { DhiCompletionProvider } from './completion/provider';

export function activate(context: vscode.ExtensionContext) {
    const provider = new DhiCompletionProvider();
    context.subscriptions.push(
        vscode.languages.registerInlineCompletionItemProvider(
            { pattern: '**' },
            provider,
        ),
    );
}

export function deactivate() {}

And the FastAPI endpoint in server/main.py:

from fastapi import FastAPI
from pydantic import BaseModel

from inference.fim import FIMRequest, complete
from rag.store import ChunkStore

app = FastAPI()
store = ChunkStore()


class CompleteRequest(BaseModel):
    file_path: str
    prefix: str
    suffix: str
    language: str


@app.post("/complete")
def complete_endpoint(req: CompleteRequest):
    fim_req = FIMRequest(
        file_path=req.file_path,
        prefix=req.prefix,
        suffix=req.suffix,
        language=req.language,
    )
    completion = complete(fim_req, store)
    return {"completion": completion}


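class IndexRequest(BaseModel):
    file_path: str


@app.post("/index")
def index_endpoint(req: IndexRequest):
    # Imported lazily so the Tree-sitter grammars load only when indexing is used
    from rag.chunker import chunk_file
    chunks = list(chunk_file(req.file_path))
    store.upsert(chunks)
    return {"indexed": len(chunks)}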

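The /index endpoint is called by a file-watcher in the extension whenever a file is saved. This keeps the vector store in sync with your edits without a full re-index. Because the server opens file_path itself, the workspace has to be visible at the same path inside the server container (for example through a bind mount); the compose file above does not mount one.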

Latency in Practice

On an Apple M3 Pro (no external GPU) with StarCoder2-3B via Ollama:

Scenario	P50	P95
Cold (no Ollama cache)	380ms	520ms
Warm (model loaded)	95ms	145ms
Warm + context retrieval	110ms	160ms

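The warm P50 of 95ms sits well inside the < 150ms target. Context retrieval adds ~15ms at P50, a small price for the quality improvement from repo-aware completions, though it pushes P95 to 160ms, just over the target. These figures cover the server round-trip and exclude the 150ms debounce wait in the extension.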
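
To reproduce P50 and P95 figures like these on your own machine, a nearest-rank percentile over a list of timings is enough. This is a minimal sketch; the sample timings are illustrative values, not measurements from Dhi.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative request timings in milliseconds (made up for the example).
timings_ms = [92.0, 95.0, 101.0, 88.0, 140.0, 97.0, 93.0, 99.0, 150.0, 94.0]
p50 = percentile(timings_ms, 50)
p95 = percentile(timings_ms, 95)
```

With only ten samples P95 is simply the slowest request, so collect a few hundred completions before comparing against the target.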

Three things that affect latency more than anything else:

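1. Max new tokens. The default is 64 (the max_new_tokens argument of complete()). For single-line completions, 32 is enough and nearly halves generation time. Set FIM_MODEL_MAX_TOKENS=32 in .env if you want faster single-line suggestions; note that the fim.py listing above does not read this variable, so it has to be wired into complete() for the setting to take effect.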

2. Prefix length. Truncate the prefix at ~800 tokens before sending. Prompt processing time grows with prefix length, and the attention component of that cost grows quadratically.

3. Cold start. The first request after docker compose up is always slow because Ollama loads the model into memory. On M3 Pro this takes ~3 seconds. Subsequent requests hit the warm model cache.
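
The prefix-length advice can be sketched as a small helper that keeps the text nearest the cursor. The 800 and 400 token budgets come from the prompt diagram earlier in the post; the four-characters-per-token ratio and the function name truncate_context are assumptions for illustration, and a real implementation would count with the model's tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); use the tokenizer for exact counts


def truncate_context(prefix: str, suffix: str,
                     prefix_tokens: int = 800,
                     suffix_tokens: int = 400) -> tuple[str, str]:
    """Keep the tail of the prefix and the head of the suffix, cut on line boundaries."""
    max_prefix = prefix_tokens * CHARS_PER_TOKEN
    max_suffix = suffix_tokens * CHARS_PER_TOKEN

    if len(prefix) > max_prefix:
        prefix = prefix[len(prefix) - max_prefix:]
        # Drop the partial first line so the prompt starts on a line boundary.
        newline = prefix.find("\n")
        if newline != -1:
            prefix = prefix[newline + 1:]

    if len(suffix) > max_suffix:
        suffix = suffix[:max_suffix]
        # Drop the partial last line for the same reason.
        newline = suffix.rfind("\n")
        if newline != -1:
            suffix = suffix[:newline + 1]

    return prefix, suffix
```

Cutting on line boundaries costs a few characters of budget but avoids feeding the model half a statement at either end of the prompt.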

%%{init: {'theme': 'dark'}}%%
flowchart LR
    subgraph latency["Latency Breakdown (warm, P50)"]
        direction LR
        DB["Debounce wait\n~80ms"]
        EMB["Query embedding\n~8ms"]
        VEC["Chroma query\n~4ms"]
        INFER["StarCoder2 inference\n~85ms · 32 tokens"]
        NET["Network round-trip\n~5ms"]
    end
    TOTAL["Total excluding debounce: ~110ms P50"]
    latency --> TOTAL

Indexing Your Repo

The extension calls /index on every file save. To index an existing project on first launch, add a command:

// extension.ts
vscode.commands.registerCommand('dhi.indexWorkspace', async () => {
    const files = await vscode.workspace.findFiles(
        '**/*.{py,ts,tsx}',
        '**/node_modules/**',
    );
    for (const file of files) {
        await fetch(`${SERVER_URL}/index`, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ file_path: file.fsPath }),
        });
    }
    vscode.window.showInformationMessage(`Dhi: indexed ${files.length} files`);
});

Run it once with Ctrl+Shift+P → Dhi: Index Workspace. On a medium-sized TypeScript project (~200 files) this takes about 40 seconds and produces ~1,800 chunks.
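
For indexing outside the editor, say from a script or a CI job, the same file selection can be sketched in Python. collect_files is a hypothetical helper and not part of the repo; posting each returned path to /index is left out so the example stays free of network calls.

```python
from pathlib import Path

INDEXED_SUFFIXES = {".py", ".ts", ".tsx"}
SKIPPED_DIRS = {"node_modules"}


def collect_files(root: str) -> list[Path]:
    """Return indexable files under root, skipping anything inside node_modules."""
    found = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in INDEXED_SUFFIXES:
            continue
        # Compare directory names relative to root so the root's own path is ignored.
        if SKIPPED_DIRS.intersection(path.relative_to(root).parts):
            continue
        found.append(path)
    return sorted(found)
```

Sorting the result keeps indexing order stable between runs, which makes progress output and failures easier to compare.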

What We Have So Far

At post-1, Dhi does one thing: it gives you fast, repo-aware FIM autocomplete using entirely open-source components. No API key. No per-token pricing. The full stack fits on a laptop.

The component map so far:

%%{init: {'theme': 'dark'}}%%
flowchart TB
    EXT["VS Code Extension\nInlineCompletionItemProvider"]
    API["FastAPI Server\n/complete · /index"]
    FIM["fim.py\nFIM prompt builder"]
    STORE["store.py\nChroma · cosine search"]
    CHUNK["chunker.py\nTree-sitter · semantic chunks"]
    OLLAMA["Ollama\nStarCoder2-3B · nomic-embed-text"]

    EXT <-->|"HTTP POST /complete"| API
    EXT -->|"HTTP POST /index on save"| API
    API --> FIM --> STORE --> OLLAMA
    API --> CHUNK --> STORE

What's Next

The autocomplete engine queries the vector store at request time. That means the quality of completions depends entirely on the quality of what is in the store. In the next post we go deep on Repo Intelligence — the layer that keeps the store accurate, fast, and in sync with your codebase at all times:

Full Tree-sitter support for Go, Rust, and Java alongside Python and TypeScript
LSP call graph: an adjacency list of every function-to-function reference in SQLite
Hybrid search: nomic vector similarity + BM25 keyword matching, re-ranked with RRF
Incremental re-index on file save — under 100ms per file even on large repos
Git-aware indexing: skip .gitignore entries, auto-update on git checkout

The code will be at github.com/sochaty/dhi tag post-2.


Sourish Chakraborty