FIM Autocomplete at < 150ms: Tree-sitter, nomic Embeddings, and StarCoder2-3B on a Laptop
In the previous post I laid out the full six-layer architecture of an open-source AI coding IDE. Today we start building it.
This post ships the first real component of Dhi (धी), an open-source AI coding IDE built entirely on open-source models. Dhi means pure intellect in Sanskrit, from the Gayatri Mantra. The name fits: the goal is an IDE that gives you genuine intelligence over your codebase, with no API keys, no token pricing, and no closed-source inference backend.
By the end of this post you will have:
- A working FIM autocomplete engine running on your laptop
- A Tree-sitter-based semantic chunker for Python and TypeScript
- A Chroma vector store with nomic-embed-text-v1.5 embeddings
- A StarCoder2-3B inference server via Ollama
- A VS Code extension with ghost-text inline completions
Everything runs with docker compose up. The full code is at github.com/sochaty/dhi, tag post-1.
What FIM Actually Is
Most developers think of autocomplete as next-token prediction: the model sees everything before the cursor and predicts what comes next. That is how GPT-2 works. It is not how a modern AI coding assistant works.
Real autocomplete in 2026 is fill-in-the-middle (FIM): the model sees both the prefix (everything before the cursor) and the suffix (everything after), then generates the completion that bridges them. This is far more accurate because the model knows what the code is supposed to arrive at, not just what it started from.
The three special tokens that make FIM work:
<fim_prefix>: everything before the cursor
<fim_suffix>: everything after the cursor
<fim_middle>: the model generates this
StarCoder2, DeepSeek-Coder, and Qwen2.5-Coder all support FIM natively, although each family spells its FIM tokens differently; the three above are StarCoder2's. Native support is non-negotiable for a production autocomplete engine: a model without it gives noticeably worse inline completions.
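To make the token layout concrete, here is a minimal sketch that splits a buffer at the cursor and wraps the two halves in StarCoder2-style FIM tokens. The helper names split_at_cursor and fim_prompt are illustrative, not part of Dhi.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_at_cursor(text: str, cursor: int) -> tuple[str, str]:
    """Split the buffer into (prefix, suffix) at a character offset."""
    return text[:cursor], text[cursor:]


def fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so the model generates the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The cursor sits right after "return " inside the function body
source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
prefix, suffix = split_at_cursor(source, cursor)
prompt = fim_prompt(prefix, suffix)
```

The model sees the call print(add(1, 2)) in the suffix before it generates anything, which is the extra signal that plain next-token prediction lacks.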
The full request flow for Dhi's autocomplete engine:
%%{init: {'theme': 'dark'}}%%
flowchart TB
ED["VS Code Editor\nCursor position event"]
DB["Debouncer\n150ms wait"]
CA["Context Assembler"]
subgraph prefix["fim_prefix"]
direction TB
RC["Retrieved chunks\ntop-3 from vector store"]
FA["Current file\nlines 0 → cursor"]
end
subgraph suffix["fim_suffix"]
FB["Current file\ncursor → EOF"]
end
INF["Inference Server\nOllama (dev) · vLLM (prod)"]
EXT["Extension\nGhost text render"]
ED --> DB --> CA
CA --> prefix & suffix
prefix & suffix --> INF --> EXT
The critical design decision here: the prefix is not just the current file. It includes retrieved chunks from the rest of the repository. Without that context, the model cannot complete a function call using a helper defined in another file.
Bootstrapping Dhi
Clone the repo and start the CPU stack:
git clone https://github.com/sochaty/dhi
cd dhi
git checkout post-1
docker compose up
The docker-compose.yml at this tag runs three services:
services:
  server:
    build: ./server
    ports: ["8000:8000"]
    environment:
      - CHROMA_HOST=chroma
      - OLLAMA_HOST=ollama
    depends_on: [chroma, ollama]
  chroma:
    image: chromadb/chroma:0.5.0
    volumes: ["chroma_data:/chroma/chroma"]
  ollama:
    image: ollama/ollama:latest
    volumes: ["ollama_models:/root/.ollama"]
    # Pull the completion model and the embedding model that store.py calls
    entrypoint: ["/bin/sh", "-c", "ollama serve & sleep 5 && ollama pull starcoder2:3b && ollama pull nomic-embed-text && wait"]
# Named volumes must be declared at the top level
volumes:
  chroma_data:
  ollama_models:
On first start, Ollama pulls StarCoder2-3B (~1.7GB). Subsequent starts reuse the copy cached in the ollama_models volume, so there is no second download. The full repo directory structure at post-1:
dhi/
├── docker-compose.yml
├── extension/
│   └── src/completion/provider.ts
└── server/
    ├── main.py
    ├── inference/fim.py
    └── rag/
        ├── chunker.py
        └── store.py
Layer 1: Tree-sitter Semantic Chunking
The naive approach to chunking code for a vector store is splitting on character count: every 500 characters becomes a chunk. This is wrong for two reasons: it cuts function bodies in the middle (destroying semantic meaning) and it groups unrelated code together (polluting retrieval).
Dhi uses Tree-sitter to split on semantic boundaries. A function definition becomes one chunk. An import block becomes one chunk. A class body becomes one chunk. Each chunk is self-contained and contextually coherent.
%%{init: {'theme': 'dark'}}%%
flowchart LR
SRC["Source File\n.py or .ts"]
TS["Tree-sitter Parser\n(per language grammar)"]
AST["Concrete Syntax Tree"]
subgraph chunks["Semantic Chunks"]
direction TB
C1["import_block → 1 chunk\nall imports as a unit"]
C2["function_definition → 1 chunk\nentire function body"]
C3["class_definition → 1 chunk\nentire class body"]
end
META["Metadata Overlay\nfile path · start line · end line · language · symbol name"]
EMBED["nomic-embed-text-v1.5\n768-dim · runs locally via Ollama"]
CHROMA["Chroma\npersistent vector store"]
SRC --> TS --> AST --> chunks --> META --> EMBED --> CHROMA
Here is server/rag/chunker.py, the full implementation for Python and TypeScript:
from dataclasses import dataclass
from pathlib import Path
from typing import Generator

import tree_sitter_python as tspython
import tree_sitter_typescript as tstypescript
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
TS_LANGUAGE = Language(tstypescript.language_typescript())

# Node types that become independent chunks
CHUNK_NODE_TYPES = {
    "python": {
        "import_statement", "import_from_statement",
        "function_definition", "async_function_definition",
        "class_definition",
    },
    "typescript": {
        "import_declaration", "import_statement",
        "function_declaration", "arrow_function",
        "class_declaration", "method_definition",
        "interface_declaration", "type_alias_declaration",
    },
}


@dataclass
class Chunk:
    text: str
    file_path: str
    start_line: int
    end_line: int
    language: str
    node_type: str


def _get_parser(language: str) -> Parser:
    # Pass the language to the constructor: this pairs with the
    # Language(capsule) API used above, where set_language() is deprecated.
    return Parser(PY_LANGUAGE if language == "python" else TS_LANGUAGE)


def _detect_language(path: str) -> str | None:
    suffix = Path(path).suffix
    if suffix == ".py":
        return "python"
    if suffix in (".ts", ".tsx"):
        return "typescript"
    return None


def chunk_file(file_path: str) -> Generator[Chunk, None, None]:
    language = _detect_language(file_path)
    if language is None:
        return
    source = Path(file_path).read_bytes()
    parser = _get_parser(language)
    tree = parser.parse(source)
    target_types = CHUNK_NODE_TYPES[language]

    def walk(node):
        if node.type in target_types:
            # Yield the whole node and do not descend into it
            text = source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")
            yield Chunk(
                text=text,
                file_path=file_path,
                start_line=node.start_point[0] + 1,  # Tree-sitter rows are 0-based
                end_line=node.end_point[0] + 1,
                language=language,
                node_type=node.type,
            )
        else:
            for child in node.children:
                yield from walk(child)

    yield from walk(tree.root_node)
Two things worth noting:
Greedy top-level chunking. The walk function yields a node and stops recursing into it. A class body becomes one chunk; it does not also yield its individual methods as separate chunks. This keeps retrieval units large enough to be coherent.
Language-specific grammar packages. tree_sitter_python and tree_sitter_typescript are PyPI packages that ship pre-compiled grammars. No C compilation step required, which matters for Docker images.
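The greedy traversal is easy to try without installing any grammar. The sketch below swaps Tree-sitter for Python's built-in ast module, so it only handles Python and is not Dhi's chunker. It shows the same rule: once a node becomes a chunk, the walk does not descend into it. The function name greedy_chunks and the sample source are illustrative.

```python
import ast

# Stand-ins for the Tree-sitter node types used in chunker.py
CHUNK_TYPES = (
    ast.Import, ast.ImportFrom,
    ast.FunctionDef, ast.AsyncFunctionDef,
    ast.ClassDef,
)


def greedy_chunks(source: str) -> list[tuple[str, int, int]]:
    """Return (node_type, start_line, end_line) for the outermost chunkable nodes."""
    found: list[tuple[str, int, int]] = []

    def walk(node: ast.AST) -> None:
        if isinstance(node, CHUNK_TYPES):
            found.append((type(node).__name__, node.lineno, node.end_lineno))
            return  # greedy: a node that became a chunk is not searched further
        for child in ast.iter_child_nodes(node):
            walk(child)

    walk(ast.parse(source))
    return found


SAMPLE = '''import os

class Greeter:
    def hello(self):
        return "hi"

if os.name:
    def helper():
        return 1
'''

chunks = greedy_chunks(SAMPLE)
```

The method hello is absorbed into the Greeter chunk, while helper is still found because the if statement around it is not a chunk type, so the walk continues through it.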
Layer 2: Embedding and Vector Storage
Each chunk is embedded with nomic-embed-text-v1.5, a 768-dimension model that runs locally via Ollama. Nomic reports that it outperforms OpenAI's text-embedding-ada-002 on retrieval benchmarks, and it costs nothing per query.
Here is server/rag/store.py:
import hashlib
import os
from typing import Sequence

import chromadb
import httpx

CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
COLLECTION_NAME = "dhi_chunks"
EMBED_MODEL = "nomic-embed-text"


def _embed(texts: list[str]) -> list[list[float]]:
    resp = httpx.post(
        f"http://{OLLAMA_HOST}:11434/api/embed",
        json={"model": EMBED_MODEL, "input": texts},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"]


def _chunk_id(chunk) -> str:
    key = f"{chunk.file_path}:{chunk.start_line}:{chunk.end_line}"
    return hashlib.md5(key.encode()).hexdigest()


class ChunkStore:
    def __init__(self):
        client = chromadb.HttpClient(host=CHROMA_HOST, port=8000)
        self._col = client.get_or_create_collection(
            name=COLLECTION_NAME,
            metadata={"hnsw:space": "cosine"},
        )

    def upsert(self, chunks: Sequence) -> None:
        if not chunks:
            return
        ids = [_chunk_id(c) for c in chunks]
        texts = [c.text for c in chunks]
        embeddings = _embed(texts)
        metadatas = [
            {
                "file_path": c.file_path,
                "start_line": c.start_line,
                "end_line": c.end_line,
                "language": c.language,
                "node_type": c.node_type,
            }
            for c in chunks
        ]
        self._col.upsert(ids=ids, embeddings=embeddings, documents=texts, metadatas=metadatas)

    def query(self, text: str, n_results: int = 3) -> list[str]:
        if self._col.count() == 0:
            return []
        embeddings = _embed([text])
        results = self._col.query(
            query_embeddings=embeddings,
            n_results=min(n_results, self._col.count()),
            include=["documents"],
        )
        return results["documents"][0]
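The upsert method is idempotent: re-indexing an unchanged file overwrites its chunks rather than duplicating them. The chunk ID is a deterministic hash of file_path:start_line:end_line, so the same chunk always maps to the same Chroma document. The flip side is that a chunk whose line range shifts after an edit gets a new ID, and its old entry stays in the store until something deletes it.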
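A small sketch of how these IDs behave when a file is edited. It reuses the hashing scheme of _chunk_id above. The stale_ids helper and the app.py line ranges are hypothetical, for illustration only.

```python
import hashlib


def chunk_id(file_path: str, start_line: int, end_line: int) -> str:
    """Same scheme as _chunk_id in store.py: md5 of path:start:end."""
    key = f"{file_path}:{start_line}:{end_line}"
    return hashlib.md5(key.encode()).hexdigest()


def stale_ids(previous: set[str], current: set[str]) -> set[str]:
    """IDs written by an earlier index pass that the new pass no longer produces."""
    return previous - current


# First index pass: two functions at lines 1-3 and 5-9
before = {chunk_id("app.py", 1, 3), chunk_id("app.py", 5, 9)}
# One line is inserted above the second function, shifting it to lines 6-10
after = {chunk_id("app.py", 1, 3), chunk_id("app.py", 6, 10)}

to_delete = stale_ids(before, after)
```

The untouched function keeps its ID, while the shifted one leaves a stale entry behind. One possible cleanup, not shown in the post, is to pass such IDs to the collection's delete method before upserting the new chunks.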
Layer 3: Assembling the FIM Prompt
The FIM prompt has a precise structure. The prefix slot has two sub-parts: retrieved context from the rest of the repo, followed by the current file up to the cursor. The suffix is everything after the cursor.
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph fim_prefix["<fim_prefix>"]
direction TB
CTX["Repo context\n3 retrieved chunks · ~1500 tokens\n(most relevant functions/classes)"]
CUR["Current file prefix\nlines 0 → cursor · ~800 tokens"]
end
subgraph fim_suffix["<fim_suffix>"]
direction TB
SUF["Current file suffix\ncursor → EOF · ~400 tokens"]
end
MID["<fim_middle>\nModel generates the completion here"]
fim_prefix --> MID
fim_suffix --> MID
Here is server/inference/fim.py:
import os
from dataclasses import dataclass

import httpx

from rag.store import ChunkStore

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
FIM_MODEL = os.getenv("FIM_MODEL", "starcoder2:3b")
# Generation cap, read from the environment so it can be tuned without code changes
MAX_NEW_TOKENS = int(os.getenv("FIM_MODEL_MAX_TOKENS", "64"))

# StarCoder2 FIM special tokens
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


@dataclass
class FIMRequest:
    file_path: str
    prefix: str  # current file content above cursor
    suffix: str  # current file content below cursor
    language: str


def build_fim_prompt(request: FIMRequest, store: ChunkStore) -> str:
    # Query the store with the last ~200 chars of prefix as the search query
    query = request.prefix[-200:].strip() or request.file_path
    context_chunks = store.query(query, n_results=3)
    context_block = "\n\n".join(context_chunks)
    if context_block:
        context_block = f"# Repo context\n{context_block}\n\n# Current file\n"
    return (
        f"{FIM_PREFIX}"
        f"{context_block}"
        f"{request.prefix}"
        f"{FIM_SUFFIX}"
        f"{request.suffix}"
        f"{FIM_MIDDLE}"
    )


def complete(request: FIMRequest, store: ChunkStore, max_new_tokens: int = MAX_NEW_TOKENS) -> str:
    prompt = build_fim_prompt(request, store)
    resp = httpx.post(
        f"http://{OLLAMA_HOST}:11434/api/generate",
        json={
            "model": FIM_MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": max_new_tokens,
                "temperature": 0.1,
                "stop": ["\n\n", FIM_PREFIX, FIM_SUFFIX],
            },
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["response"]
Two implementation choices worth explaining:
Low temperature (0.1). Autocomplete is not a creative task. You want the most probable continuation, not a diverse sample. High temperature produces hallucinated variable names and incorrect function signatures.
Stop tokens include \n\n. A single blank line is the natural end of a completion. Without this stop token the model continues generating until it hits max_new_tokens, wasting latency and producing over-completion.
Model Selection
Not every developer has the same hardware. Here is the recommended model per tier:
| Model | Size | VRAM | FIM Support | Recommended for |
|---|---|---|---|---|
| StarCoder2-3B | 3B | 4GB | ✓ Native | 8GB GPU or Apple M-series |
| Qwen2.5-Coder-7B | 7B | 8GB | ✓ Native | 16GB GPU |
| DeepSeek-Coder-V2-Lite | 16B | 12GB | ✓ Native | 24GB GPU (best quality) |
| StarCoder2-3B (Q4_K_M) | 3B | 2.5GB | ✓ Native | CPU-only (slow but works) |
Change the model in .env:
FIM_MODEL=qwen2.5-coder:7b
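The new model also has to be pulled into the Ollama container; the entrypoint shown earlier pulls only the models it names. One caveat: the FIM special tokens differ between model families, and fim.py hard-codes StarCoder2's. Switching to another family therefore means updating FIM_PREFIX, FIM_SUFFIX, and FIM_MIDDLE to that model's tokens, or letting Ollama apply the model's own FIM template instead of sending raw tokens.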
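One way to avoid per-family token strings is to hand Ollama the prefix and suffix separately. This is a sketch, not what fim.py does. It assumes the installed Ollama version accepts a suffix field on /api/generate and that the chosen model's template supports infill, both of which should be checked against the Ollama documentation for your version. The function name build_generate_payload is hypothetical.

```python
def build_generate_payload(
    model: str,
    prefix: str,
    suffix: str,
    max_new_tokens: int = 64,
) -> dict:
    """Build a request body for Ollama's /api/generate that uses its suffix field.

    Assumption: the model's Ollama template supports fill-in-the-middle;
    models without infill support may be rejected by the server.
    """
    return {
        "model": model,
        "prompt": prefix,   # text before the cursor, no FIM tokens added here
        "suffix": suffix,   # text after the cursor
        "stream": False,
        "options": {
            "num_predict": max_new_tokens,
            "temperature": 0.1,
            "stop": ["\n\n"],
        },
    }


payload = build_generate_payload(
    "qwen2.5-coder:7b",
    prefix="def add(a, b):\n    return ",
    suffix="\n",
)
```

With this shape the same payload works for any model name, and the family-specific token layout stays inside the model's template rather than in application code.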
Layer 4: The VS Code Extension
The extension registers an InlineCompletionItemProvider. VS Code calls it as the user types, and the provider debounces those calls so that a network round-trip happens only after a pause in typing rather than on every keystroke.
%%{init: {'theme': 'dark'}}%%
flowchart LR
KEYSTROKE["Keystroke in editor"]
DEBOUNCE["Debounce\n150ms, cancel if user keeps typing"]
CONTEXT["Extract context\nfile path · prefix · suffix · language"]
POST["POST /complete\n{file_path, prefix, suffix, language}"]
SERVER["Dhi server\nbuild FIM prompt · Ollama · return completion"]
GHOST["VS Code\nrender ghost text"]
ACCEPT["Tab key\naccept completion"]
KEYSTROKE --> DEBOUNCE --> CONTEXT --> POST --> SERVER --> GHOST --> ACCEPT
Here is extension/src/completion/provider.ts:
import * as vscode from 'vscode';

const SERVER_URL = vscode.workspace
  .getConfiguration('dhi')
  .get<string>('serverUrl', 'http://localhost:8000');

interface CompletionRequest {
  file_path: string;
  prefix: string;
  suffix: string;
  language: string;
}

async function fetchCompletion(req: CompletionRequest): Promise<string | null> {
  try {
    const res = await fetch(`${SERVER_URL}/complete`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(req),
      signal: AbortSignal.timeout(3000),
    });
    if (!res.ok) return null;
    const data = await res.json() as { completion: string };
    return data.completion ?? null;
  } catch {
    return null;
  }
}
export class DhiCompletionProvider implements vscode.InlineCompletionItemProvider {
  private pending: NodeJS.Timeout | null = null;
  // Settles the promise of the call that is currently waiting on the timer
  private cancelPending: (() => void) | null = null;

  async provideInlineCompletionItems(
    document: vscode.TextDocument,
    position: vscode.Position,
    _context: vscode.InlineCompletionContext,
    token: vscode.CancellationToken,
  ): Promise<vscode.InlineCompletionList | null> {
    // Debounce: cancel the previous pending request and resolve its promise,
    // otherwise the superseded call would await a timer that never fires
    if (this.pending) {
      clearTimeout(this.pending);
      this.pending = null;
      this.cancelPending?.();
      this.cancelPending = null;
    }

    const completion = await new Promise<string | null>((resolve) => {
      this.cancelPending = () => resolve(null);
      this.pending = setTimeout(async () => {
        this.pending = null;
        this.cancelPending = null;
        if (token.isCancellationRequested) {
          resolve(null);
          return;
        }
        const offset = document.offsetAt(position);
        const text = document.getText();
        const req: CompletionRequest = {
          file_path: document.uri.fsPath,
          prefix: text.slice(0, offset),
          suffix: text.slice(offset),
          language: document.languageId,
        };
        resolve(await fetchCompletion(req));
      }, 150);
    });

    if (!completion || token.isCancellationRequested) return null;
    return {
      items: [
        new vscode.InlineCompletionItem(
          completion,
          new vscode.Range(position, position),
        ),
      ],
    };
  }
}
Register it in extension.ts:
import * as vscode from 'vscode';
import { DhiCompletionProvider } from './completion/provider';

export function activate(context: vscode.ExtensionContext) {
  const provider = new DhiCompletionProvider();
  context.subscriptions.push(
    vscode.languages.registerInlineCompletionItemProvider(
      { pattern: '**' },
      provider,
    ),
  );
}

export function deactivate() {}
And the FastAPI endpoint in server/main.py:
from fastapi import FastAPI
from pydantic import BaseModel

from inference.fim import FIMRequest, complete
from rag.store import ChunkStore

app = FastAPI()
store = ChunkStore()


class CompleteRequest(BaseModel):
    file_path: str
    prefix: str
    suffix: str
    language: str


@app.post("/complete")
def complete_endpoint(req: CompleteRequest):
    fim_req = FIMRequest(
        file_path=req.file_path,
        prefix=req.prefix,
        suffix=req.suffix,
        language=req.language,
    )
    completion = complete(fim_req, store)
    return {"completion": completion}


@app.post("/index")
def index_endpoint(body: dict):
    from rag.chunker import chunk_file

    chunks = list(chunk_file(body["file_path"]))
    store.upsert(chunks)
    return {"indexed": len(chunks)}
The /index endpoint is called by a file-watcher in the extension whenever a file is saved. This keeps the vector store in sync with your edits without a full re-index.
Latency in Practice
On an Apple M3 Pro (no external GPU) with StarCoder2-3B via Ollama:
| Scenario | P50 | P95 |
|---|---|---|
| Cold (no Ollama cache) | 380ms | 520ms |
| Warm (model loaded) | 95ms | 145ms |
| Warm + context retrieval | 110ms | 160ms |
The warm P50 of 95ms sits well inside the < 150ms target. Context retrieval adds ~15ms at P50, a small price for the quality improvement from repo-aware completions, although it pushes P95 to 160ms, just past the target.
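For readers who want to produce comparable numbers on their own machine, here is a minimal sketch of computing P50 and P95 with the nearest-rank method. The post does not say which percentile method was used for the table, and the sample latencies below are made up for illustration, not measurements.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds
latencies_ms = [90, 95, 102, 88, 140, 97, 93, 110, 99, 91]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With only ten samples the P95 is simply the slowest request, so a meaningful P95 needs a few hundred requests at least.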
Three things that affect latency more than anything else:
1. Max new tokens. The default is 64. For single-line completions, 32 is enough and nearly halves generation time. Set FIM_MODEL_MAX_TOKENS=32 in .env if you want faster single-line suggestions.
2. Prefix length. Truncate the prefix at ~800 tokens before sending. Prompt processing time grows with prefix length, and the attention part of that cost grows quadratically on transformer models.
3. Cold start. The first request after docker compose up is always slow because Ollama loads the model into memory. On M3 Pro this takes ~3 seconds. Subsequent requests hit the warm model cache.
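The prefix cap from item 2 can be sketched as follows. The code in this post does not truncate the prefix, so this is an illustration rather than Dhi's implementation. It approximates tokens as four characters each, which is a rough heuristic and not the model's tokenizer, and the name truncate_prefix is hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def truncate_prefix(prefix: str, max_tokens: int = 800) -> str:
    """Keep roughly the last max_tokens tokens of the prefix, cut at a line boundary."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(prefix) <= max_chars:
        return prefix
    # Keep the text closest to the cursor, which matters most for the completion
    tail = prefix[-max_chars:]
    # Drop the first (probably partial) line so the prompt starts on a clean line
    newline = tail.find("\n")
    return tail[newline + 1:] if newline != -1 else tail


long_prefix = "\n".join(f"line {i}" for i in range(1000))
short = truncate_prefix(long_prefix, max_tokens=10)
```

Truncating from the front keeps the lines nearest the cursor and discards the top of the file, which is the part the retrieved repo context is most likely to cover anyway.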
%%{init: {'theme': 'dark'}}%%
flowchart LR
subgraph latency["Latency Breakdown (warm, P50)"]
direction LR
DB["Debounce wait\n~80ms"]
EMB["Query embedding\n~8ms"]
VEC["Chroma query\n~4ms"]
INFER["StarCoder2 inference\n~85ms · 32 tokens"]
NET["Network round-trip\n~5ms"]
end
TOTAL["Total: ~110ms P50"]
latency --> TOTAL
Indexing Your Repo
The extension calls /index on every file save. To index an existing project on first launch, add a command:
// extension.ts
vscode.commands.registerCommand('dhi.indexWorkspace', async () => {
  const files = await vscode.workspace.findFiles(
    '**/*.{py,ts,tsx}',
    '**/node_modules/**',
  );
  for (const file of files) {
    await fetch(`${SERVER_URL}/index`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ file_path: file.fsPath }),
    });
  }
  vscode.window.showInformationMessage(`Dhi: indexed ${files.length} files`);
});
Run it once with Ctrl+Shift+P → Dhi: Index Workspace. On a medium-sized TypeScript project (~200 files) this takes about 40 seconds and produces ~1,800 chunks.
What We Have So Far
At post-1, Dhi does one thing: it gives you fast, repo-aware FIM autocomplete using entirely open-source components. No API key. No per-token pricing. The full stack fits on a laptop.
The component map so far:
%%{init: {'theme': 'dark'}}%%
flowchart TB
EXT["VS Code Extension\nInlineCompletionItemProvider"]
API["FastAPI Server\n/complete · /index"]
FIM["fim.py\nFIM prompt builder"]
STORE["store.py\nChroma · cosine search"]
CHUNK["chunker.py\nTree-sitter · semantic chunks"]
OLLAMA["Ollama\nStarCoder2-3B · nomic-embed-text"]
EXT <-->|"HTTP POST /complete"| API
EXT -->|"HTTP POST /index on save"| API
API --> FIM --> STORE --> OLLAMA
API --> CHUNK --> STORE
What's Next
The autocomplete engine queries the vector store at request time. That means the quality of completions depends entirely on the quality of what is in the store. In the next post we go deep on Repo Intelligence โ the layer that keeps the store accurate, fast, and in sync with your codebase at all times:
- Full Tree-sitter support for Go, Rust, and Java alongside Python and TypeScript
- LSP call graph: an adjacency list of every function-to-function reference in SQLite
- Hybrid search: nomic vector similarity + BM25 keyword matching, re-ranked with RRF
- Incremental re-index on file save: under 100ms per file even on large repos
- Git-aware indexing: skip .gitignore entries, auto-update on git checkout
The code will be at github.com/sochaty/dhi tag post-2.
