Building an AI Coding IDE from Scratch: A Full Open-Source Architecture

Jun 7, 2026 • 10 min read

AI coding assistants have fundamentally changed how developers write software. The best ones do one thing well: collapse the distance between the developer's intent and the code that expresses it. But under the hood, they are largely orchestration layers — the individual primitives they use (language servers, vector search, tool-calling LLMs, sandboxed execution) are all open-source or replicable.

This post lays out a complete open-source architecture that covers every capability a modern AI coding IDE ships: autocomplete, chat-in-editor, multi-file agent editing, repo understanding, system design reasoning, and a safe code execution loop. No proprietary APIs required.

The Full Picture

Before going layer by layer, here is how the system fits together.

Each vertical slice is independently deployable. You can run only the autocomplete engine on a laptop, or scale the agent engine on a GPU cluster. The layers communicate over a local JSON-RPC bus (same protocol VS Code uses for LSP).
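
To make the bus concrete, here is a minimal sketch of the message framing, assuming the layers reuse the LSP base protocol (a Content-Length header followed by a JSON-RPC 2.0 body). The method name and parameters are hypothetical placeholders, not part of any real server.

```python
import json


def encode_message(method: str, params: dict, msg_id: int) -> bytes:
    """Frame a JSON-RPC 2.0 request with an LSP-style Content-Length header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(frame: bytes) -> dict:
    """Parse one complete frame (header plus body) back into a dict."""
    header, _, rest = frame.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# "autocomplete/complete" is a made-up method name for illustration only.
frame = encode_message("autocomplete/complete", {"path": "app.py", "offset": 120}, msg_id=1)
request = decode_message(frame)
```

Because every layer speaks the same framing, a slice can be swapped out or moved to another process without the others noticing.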

Layer 1 — Repo Understanding

Every other capability is only as good as the system's understanding of the codebase. This is where most open-source IDE projects are weakest. Getting it right means going beyond naive file-splitting.

Parsing with Tree-sitter

Tree-sitter has grammars for 40+ languages and parses a typical source file into a concrete syntax tree in a few milliseconds. Rather than splitting on character count, we split on semantic boundaries: functions, classes, and method bodies. This keeps each chunk self-contained and reduces context fragmentation at retrieval time.
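
The idea of splitting on semantic boundaries can be sketched without any dependencies. The example below uses Python's built-in ast module as a stand-in for Tree-sitter (an assumption made purely so the snippet runs anywhere, and it only handles Python source); a real indexer would walk Tree-sitter nodes the same way, one chunk per definition.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(
                {"name": node.name, "start": node.lineno, "end": node.end_lineno, "text": text}
            )
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_by_definition(SAMPLE)
```

Each chunk carries its name and line range, which is the metadata the retrieval layer needs to map a hit back to a location in the editor.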

The Call Graph Layer

Pure vector search finds semantically similar code but misses structural relationships. A symbol reference graph (built from LSP textDocument/references calls) lets you answer questions like: "find every function that touches the auth middleware" with a graph traversal rather than a fuzzy search.

Store this as an adjacency list in SQLite: it is lightweight, needs no extra infrastructure, and is easy to keep in sync with the repo by rewriting a file's edges whenever that file is re-indexed.
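
A minimal sketch of that adjacency list, assuming a single refs table of caller/callee pairs (the table name, column names, and symbols are illustrative). The "everything that touches the auth middleware" question becomes a recursive query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refs ("
    "caller TEXT NOT NULL, callee TEXT NOT NULL, PRIMARY KEY (caller, callee))"
)
conn.executemany(
    "INSERT INTO refs VALUES (?, ?)",
    [
        ("login_view", "auth_middleware"),
        ("api_router", "login_view"),
        ("main", "api_router"),
        ("main", "render_home"),
    ],
)


def transitive_callers(conn: sqlite3.Connection, symbol: str) -> list[str]:
    """Return every symbol that reaches `symbol` directly or indirectly."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name) AS (
            SELECT caller FROM refs WHERE callee = ?
            UNION  -- UNION (not UNION ALL) deduplicates, so cycles terminate
            SELECT refs.caller FROM refs JOIN reach ON refs.callee = reach.name
        )
        SELECT name FROM reach ORDER BY name
        """,
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]


callers = transitive_callers(conn, "auth_middleware")
```

In a real index the rows would come from textDocument/references responses rather than a hard-coded list.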

Layer 2 — Autocomplete (Fill-in-the-Middle)

Autocomplete in a modern AI IDE is not plain left-to-right next-token prediction. It is fill-in-the-middle (FIM): the model sees the prefix (everything before the cursor) and the suffix (everything after), and generates the completion that bridges them.
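
In practice FIM is a prompt layout: the prefix and suffix are wrapped in sentinel tokens and the model continues after the "middle" marker. The sketch below assumes the sentinel strings published for StarCoder2 and Qwen2.5-Coder (verify them against the tokenizer config of the model you actually serve); the character budgets are arbitrary illustrative defaults.

```python
# Sentinel tokens per model family. Treat these as assumptions to check
# against the served model's tokenizer before relying on them.
FIM_TOKENS = {
    "starcoder2": ("<fim_prefix>", "<fim_suffix>", "<fim_middle>"),
    "qwen2.5-coder": ("<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"),
}


def build_fim_prompt(
    buffer: str,
    cursor: int,
    family: str = "starcoder2",
    max_prefix_chars: int = 4000,
    max_suffix_chars: int = 1000,
) -> str:
    """Build a prefix-suffix-middle prompt around the cursor offset."""
    pre, suf, mid = FIM_TOKENS[family]
    # Keep the text closest to the cursor when the file exceeds the budget.
    prefix = buffer[max(0, cursor - max_prefix_chars) : cursor]
    suffix = buffer[cursor : cursor + max_suffix_chars]
    return f"{pre}{prefix}{suf}{suffix}{mid}"


buffer = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = buffer.index("return ") + len("return ")
prompt = build_fim_prompt(buffer, cursor)
```

The model's output is then inserted at the cursor; generation stops at the end-of-text token or a line-count limit, whichever comes first.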

Model choices

Model	Parameters	FIM Support	Runs on
StarCoder2-3B	3B	✅ native	Apple M2 / 8GB GPU
DeepSeek-Coder-V2-Lite	16B	✅ native	24GB GPU
Qwen2.5-Coder-7B	7B	✅ native	16GB GPU
CodeLlama-13B	13B	✅ native	24GB GPU

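Serve them with Ollama for local dev or vLLM in production. vLLM's PagedAttention allocates the KV cache in fixed-size blocks, which sharply reduces the memory lost to fragmentation and over-reservation, and its continuous batching admits new requests as soon as a slot frees up, so completions spend less time queued under concurrent load.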

Speculative Decoding

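Pair a small draft model with a large verifier from the same family, so that both share a tokenizer: for example, Qwen2.5-Coder-0.5B drafting for Qwen2.5-Coder-7B. (StarCoder2 ships in 3B, 7B, and 15B sizes, so there is no 1B draft, and a StarCoder2 draft cannot be paired directly with DeepSeek-Coder-V2-Lite because their vocabularies differ.) The draft generates K tokens; the verifier scores all K positions in a single forward pass and keeps the longest prefix it agrees with. The speedup depends on how often the draft is accepted: 2–3× over the large model alone is a commonly reported range, and the 3–5× figure should be read as a best case for short, predictable completions.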
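
The accept/reject logic is easier to see on a toy example. In the sketch below both "models" are hypothetical integer functions standing in for greedy decoding (the draft deliberately disagrees with the verifier on some inputs), and the verifier's single batched forward pass is simulated with a loop. This is the greedy variant of the technique; sampling-based verification uses a probabilistic acceptance rule instead.

```python
def target_next(tokens: list[int]) -> int:
    """Stand-in for the large verifier's greedy next token."""
    return (tokens[-1] * 3 + 1) % 11


def draft_next(tokens: list[int]) -> int:
    """Stand-in for the draft model: wrong whenever the last token is a multiple of 4."""
    guess = (tokens[-1] * 3 + 1) % 11
    return guess if tokens[-1] % 4 != 0 else (guess + 1) % 11


def speculative_step(tokens: list[int], k: int) -> list[int]:
    # 1. The draft proposes k tokens autoregressively (cheap).
    proposal: list[int] = []
    for _ in range(k):
        proposal.append(draft_next(tokens + proposal))
    # 2. The verifier scores every position. A real system does this in one
    #    batched forward pass; the loop here only simulates it.
    verified = [target_next(tokens + proposal[:i]) for i in range(k + 1)]
    # 3. Keep the longest agreeing prefix, then append the verifier's own
    #    token at the first mismatch (or its bonus token if all k matched).
    accepted = 0
    while accepted < k and proposal[accepted] == verified[accepted]:
        accepted += 1
    return tokens + proposal[:accepted] + [verified[accepted]]


def generate_speculative(prompt: list[int], n_new: int, k: int = 4):
    tokens, steps = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        tokens = speculative_step(tokens, k)
        steps += 1
    return tokens[: len(prompt) + n_new], steps


def generate_greedy(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


spec_tokens, verifier_passes = generate_speculative([1], 12)
greedy_tokens = generate_greedy([1], 12)
```

The output is identical to what the verifier alone would produce; the gain is that the verifier runs fewer passes than the number of tokens generated.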

Layer 3 — Chat-in-Editor

Chat works differently from autocomplete. The latency bar is 2–5 seconds (acceptable for a conversational exchange), but the context window must be carefully assembled to fit within model limits while including what's most relevant.
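
One simple way to assemble that context is greedy packing: rank candidate snippets by relevance and keep adding them until the budget is spent. The sketch below assumes each candidate already carries a relevance score, and it estimates tokens at roughly four characters each, which is a crude assumption rather than a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token; not a tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(candidates, budget_tokens: int):
    """Greedily keep the highest-scoring snippets that still fit.

    candidates: iterable of (score, label, text) tuples.
    Returns the chosen labels in rank order and the tokens used.
    """
    chosen, used = [], 0
    for score, label, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(label)
            used += cost
    return chosen, used


# Labels and scores are illustrative placeholders.
candidates = [
    (0.9, "open_file", "x" * 400),
    (0.7, "retrieved_chunk", "y" * 800),
    (0.5, "recent_edit", "z" * 200),
]
chosen, used = assemble_context(candidates, budget_tokens=160)
```

Here the large retrieved chunk is skipped because it would blow the budget, while the smaller, lower-ranked snippet still fits.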

The critical UX insight: stream tokens to the chat panel in real time, but buffer code blocks and only apply them to the editor after the complete block arrives. Partial code blocks applied live cause flickering and make the diff unreadable.
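
A sketch of the buffering side, assuming the model emits Markdown-style fenced code blocks and that every fence sits on its own line followed by a newline. The class name is hypothetical. Chunks can split anywhere, including in the middle of a fence, so the buffer works on completed lines only.

```python
FENCE = "`" * 3  # the three-backtick Markdown fence marker


class CodeBlockBuffer:
    """Collects streamed text and releases a code block only once it is complete."""

    def __init__(self) -> None:
        self._pending = ""   # trailing partial line, not yet newline-terminated
        self._block = None   # list of lines while inside a fence, else None

    def feed(self, chunk: str) -> list[str]:
        """Consume one streamed chunk; return the code blocks it completed."""
        completed = []
        self._pending += chunk
        while "\n" in self._pending:
            line, self._pending = self._pending.split("\n", 1)
            if line.strip().startswith(FENCE):
                if self._block is None:
                    self._block = []                      # opening fence
                else:
                    completed.append("\n".join(self._block))
                    self._block = None                    # closing fence
            elif self._block is not None:
                self._block.append(line)
        return completed


stream = [
    "Here is the fix:\n" + "`" * 2,   # the fence is split across two chunks
    "`python\nx = ",
    "1\ny = 2\n" + "`" * 2,
    "`\nDone.\n",
]
buf = CodeBlockBuffer()
results = [buf.feed(chunk) for chunk in stream]
```

The chat panel can render every chunk as it arrives; only the editor-apply step waits for feed to return a finished block.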

For the model, any instruction-tuned model with a large context window works here: Qwen2.5-Coder-32B-Instruct, DeepSeek-V3, or Llama-3.3-70B-Instruct via Ollama / vLLM.

Layer 4 — Multi-File Agent Editing

This is the hardest layer to get right. The agent must plan, act across multiple files, observe outcomes (compiler errors, test failures), and revise — all without losing context of the original goal.

The Plan-Act-Observe Loop

Tool Set

read_file(path)                 → returns file contents
write_file(path, content)       → applies diff
search_codebase(query)          → vector + keyword hybrid search
run_command(cmd)                → sandboxed shell (Docker)
list_directory(path)            → file tree
get_diagnostics()               → LSP errors/warnings
get_references(symbol)          → call graph lookup
create_file(path, content)      → new file
delete_file(path)               → with undo stack

Orchestration: LangGraph

LangGraph models the agent loop as a directed graph of nodes (think, act, observe, plan, verify). Edges are conditional — the observe node routes back to think on errors, or forwards to verify on success.

The key advantage over a simple while loop: checkpointing. LangGraph can pause the loop mid-execution, serialize state to disk, and resume — critical for long refactors that might span dozens of file edits.
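
To show what checkpointing buys you, here is a plain-Python stand-in for the graph. It deliberately does not use the LangGraph API, and the node bodies are hypothetical stubs: each node returns the name of the next node, and the state is written to disk after every step so a paused run can be resumed.

```python
import json
import os
import tempfile


def think(state):
    state["plan"] = f"attempt {state['attempts'] + 1}: fix {state['goal']}"
    return "act"


def act(state):
    state["attempts"] += 1  # a real node would call tools such as write_file here
    return "observe"


def observe(state):
    # Hypothetical outcome: the first attempt fails, the second passes.
    state["errors"] = [] if state["attempts"] >= 2 else ["test_login failed"]
    return "think" if state["errors"] else "verify"  # conditional edge


def verify(state):
    state["done"] = True
    return None  # terminal node


NODES = {"think": think, "act": act, "observe": observe, "verify": verify}


def run(state, checkpoint_path, max_steps=20):
    """Execute up to max_steps nodes, checkpointing after each one."""
    for _ in range(max_steps):
        node = state["next"]
        if node is None:
            break
        state["next"] = NODES[node](state)
        state["trace"].append(node)
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)
    return state


def resume(checkpoint_path, max_steps=20):
    """Reload the last checkpoint and continue from the recorded next node."""
    with open(checkpoint_path) as fh:
        return run(json.load(fh), checkpoint_path, max_steps)


path = os.path.join(tempfile.mkdtemp(), "agent_state.json")
initial = {"goal": "login bug", "attempts": 0, "errors": [], "plan": "",
           "trace": [], "done": False, "next": "think"}
paused = run(initial, path, max_steps=2)  # stop mid-loop, as if interrupted
final = resume(path)
```

LangGraph provides its own checkpointer abstractions for this; the sketch only illustrates why persisting state per node matters for long refactors.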

Layer 5 — System Design and Reasoning

Architecture-level questions ("should I use event sourcing here?", "draw the service dependency graph") require a different mode: long-horizon reasoning over the entire codebase context, not just a few files.

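The critical artifact is a repo summary: a compact, per-module description of the codebase that the reasoning model can hold in context in place of the raw source. Build it once on first index, then update incrementally using git diff, re-summarising only the modules that changed in the last commit.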
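
The incremental update can be sketched as two small functions: one asks git which files changed, the other maps those files to the modules whose summaries are now stale. Treating the first two path components as "the module" is an assumption; pick whatever granularity your summaries use.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(repo_dir: str, base: str = "HEAD~1", head: str = "HEAD") -> list[str]:
    """List the files that differ between two commits."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def modules_to_resummarise(paths: list[str], depth: int = 2) -> list[str]:
    """Map changed files to module directories (first `depth` path components)."""
    modules = set()
    for p in paths:
        parent_parts = PurePosixPath(p).parts[:-1]
        # Files at the repo root belong to the top-level "." module.
        modules.add("/".join(parent_parts[:depth]) or ".")
    return sorted(modules)


stale = modules_to_resummarise([
    "src/auth/middleware.py",
    "src/auth/oauth/google.py",
    "src/billing/invoice.py",
    "README.md",
])
```

Only the modules in stale are sent back to the summarisation model; everything else keeps its cached summary.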

Layer 6 — Safe Code Execution Loop

Agents that can write code must be able to run it. But running arbitrary LLM-generated code on the host machine is a hard no. The execution layer must be:

Isolated: no access to host filesystem, network, or env vars outside the project
Ephemeral: container torn down after each run
Auditable: all stdin/stdout/stderr captured and shown to the developer
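
The three requirements above map fairly directly onto docker run flags. The sketch below builds the command line and captures output through subprocess; the image name is a hypothetical placeholder for an image that already contains your test dependencies, and the resource limits are arbitrary illustrative values.

```python
import subprocess


def sandbox_argv(project_dir: str, command: list[str],
                 image: str = "my-test-runner:latest") -> list[str]:
    """Build a docker run invocation for one isolated, throwaway execution."""
    return [
        "docker", "run",
        "--rm",                                   # ephemeral: remove on exit
        "--network", "none",                      # isolated: no network
        "--read-only",                            # read-only root filesystem
        "--cap-drop", "ALL",                      # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--memory", "512m",
        "--pids-limit", "128",
        "--tmpfs", "/tmp",                        # scratch space that vanishes
        "-v", f"{project_dir}:/workspace:ro",     # only the project is visible
        "-w", "/workspace",
        image,
        *command,
    ]


def run_in_sandbox(project_dir: str, command: list[str], timeout: int = 120) -> dict:
    """Run the command and capture everything for the developer (auditable)."""
    proc = subprocess.run(
        sandbox_argv(project_dir, command),
        capture_output=True, text=True, timeout=timeout,
    )
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


argv = sandbox_argv("/home/dev/project", ["pytest", "-q"])
```

Host environment variables are not forwarded unless passed explicitly with -e, so secrets in the developer's shell stay out of the container by default.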

The Self-Healing Loop

When tests fail, the output becomes the next observation in the agent loop. The agent sees the exact error, reasons about the fix, edits the file, and re-runs — typically converging in 2–3 iterations for straightforward bugs.

For an even tighter sandbox, use gVisor (Google's container runtime that intercepts syscalls in user space) or Firecracker (AWS's micro-VM used in Lambda) instead of vanilla Docker.

The Full Open-Source Stack

Capability	Component	Notes
Editor	Monaco Editor	MIT, same engine as VS Code
Syntax parsing	Tree-sitter	MIT, 40+ languages
Code intelligence	LSP servers (clangd, pylsp, typescript-language-server)	Per-language
Embeddings	nomic-embed-text-v1.5	Apache 2.0, 768-dim, runs locally
Vector store	Chroma (dev) / Qdrant (prod)	Both open-source
FIM autocomplete	StarCoder2-3B / Qwen2.5-Coder-7B	BigCode OpenRAIL-M / Apache 2.0
Chat model	Qwen2.5-Coder-32B-Instruct	Apache 2.0
Reasoning model	QwQ-32B / DeepSeek-R1-Distill-Qwen-32B	Apache 2.0 / MIT
Model serving	Ollama (local) / vLLM (production)	MIT / Apache 2.0
Agent orchestration	LangGraph	MIT
Execution sandbox	Docker + seccomp / gVisor	Apache 2.0
Backend API	FastAPI	MIT
Frontend	Next.js + Tailwind	MIT

What You Don't Get For Free

An honest architecture post should name the hard parts:

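Latency at low VRAM. A 32B model needs roughly 64GB for its weights at FP16, so it only fits on a single 24GB GPU once quantised to around 4 bits (for example GGUF Q4). In that configuration, expect on the order of 15–20 tokens/second: acceptable for most workflows, but noticeably slower than cloud-hosted alternatives. The options for going faster are speculative decoding, more aggressive quantisation, or offloading to a small cloud GPU when needed.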
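
The VRAM arithmetic is worth doing explicitly. This back-of-the-envelope sketch counts weights only; the KV cache and activations come on top, and real 4-bit formats store scaling factors too, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


fp16_gb = weight_memory_gb(32, 16)   # 32B parameters at 16 bits each
q4_gb = weight_memory_gb(32, 4)      # the same model at 4 bits per weight
fits_fp16 = fp16_gb < 24
fits_q4 = q4_gb < 24
```

At FP16 the weights alone come to 64 GB, far beyond a 24GB card; at 4 bits they drop to 16 GB, which leaves some headroom for the KV cache.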

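Prompt cache invalidation. Managed AI coding services almost certainly implement prompt caching across requests. vLLM offers automatic prefix caching, which reuses the key-value cache for requests that share a prompt prefix, but getting useful hit rates means keeping that prefix stable (system prompt and repo summary first, volatile context last) and deciding when edits should invalidate it. Possible, but non-trivial.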

Index freshness. Keeping the vector store in sync with active edits (buffers change on every keystroke) requires debounced incremental re-indexing, which is easy to get wrong and leaves you with stale retrieval.
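
A minimal debounce sketch, assuming the editor reports each edit with a millisecond timestamp. The class name and the 500 ms delay are illustrative. A file is re-indexed only after it has been quiet for the full delay, so a burst of keystrokes triggers one re-index rather than dozens.

```python
class DebouncedIndexer:
    """Tracks edited files and reports those that have gone quiet."""

    def __init__(self, delay_ms: int = 500) -> None:
        self.delay_ms = delay_ms
        self._last_edit: dict[str, int] = {}  # path -> time of latest edit

    def on_edit(self, path: str, now_ms: int) -> None:
        # Every new edit resets the quiet period for that file.
        self._last_edit[path] = now_ms

    def due(self, now_ms: int) -> list[str]:
        """Return files quiet for at least delay_ms and stop tracking them."""
        ready = sorted(
            p for p, t in self._last_edit.items() if now_ms - t >= self.delay_ms
        )
        for p in ready:
            del self._last_edit[p]
        return ready


indexer = DebouncedIndexer(delay_ms=500)
indexer.on_edit("a.py", 0)
indexer.on_edit("b.py", 100)
indexer.on_edit("a.py", 300)      # a.py is still being typed in
first = indexer.due(600)          # only b.py has been quiet for 500 ms
second = indexer.due(900)         # a.py is ready now
```

Passing the clock in explicitly keeps the logic testable; in the editor the same calls would be driven by a timer.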

Security surface. The Docker sandbox is a reasonable boundary for test runners. But agents that can write_file anywhere in the repo, modify CI configs, or touch secrets files are a different risk level. Implement a path allowlist and require developer confirmation for writes outside the current working directory.
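
A sketch of that write guard, assuming a two-outcome policy ("allow" or "confirm") and a small, hypothetical set of protected names; the real list would come from project configuration. Paths are resolved before checking so that ".." segments cannot escape the working directory.

```python
from pathlib import Path

# Hypothetical deny list: secrets and CI configuration always need confirmation.
PROTECTED_NAMES = {".env", ".git", ".github"}


def classify_write(target: str, workdir: str) -> str:
    """Return "allow" for ordinary in-tree writes, "confirm" for everything else."""
    root = Path(workdir).resolve()
    resolved = (root / target).resolve()  # normalises ".." and absolute targets
    if not resolved.is_relative_to(root):  # requires Python 3.9+
        return "confirm"
    if any(part in PROTECTED_NAMES for part in resolved.relative_to(root).parts):
        return "confirm"
    return "allow"


decisions = {
    target: classify_write(target, "/tmp/project")
    for target in [
        "src/app.py",
        "../other/secrets.txt",
        ".github/workflows/ci.yml",
        "/etc/passwd",
    ]
}
```

The agent's write_file and delete_file tools would call this check first and pause for the developer whenever it returns "confirm".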

Closing Thoughts

The individual components here — Tree-sitter, vLLM, LangGraph, Docker — are each battle-tested in production at scale. The architecture challenge is the orchestration: assembling the right context, routing to the right model at the right latency budget, and designing a UX that keeps the developer in control of what the agent actually touches.

The moat of any great AI coding tool is not its architecture. It's the years of UX iteration on top of this architecture. The open-source community now has every primitive it needs to build something just as capable.

The next post in this series will walk through implementing the FIM autocomplete engine end-to-end: Tree-sitter chunking, nomic embeddings, and a StarCoder2-3B server with speculative decoding — all running on a single laptop.

Sourish Chakraborty