Sourish Chakraborty

AI Engineering & Modern Data Platforms | Cloud-native Architecture | Platform Engineering

Building an AI Coding IDE from Scratch: A Full Open-Source Architecture

AI coding assistants have fundamentally changed how developers write software. The best ones do one thing well: collapse the distance between the developer's intent and the code that expresses it. But under the hood, they are largely orchestration layers — the individual primitives they use (language servers, vector search, tool-calling LLMs, sandboxed execution) are all open-source or replicable.

This post lays out a complete open-source architecture that covers every capability a modern AI coding IDE ships: autocomplete, chat-in-editor, multi-file agent editing, repo understanding, system design reasoning, and a safe code execution loop. No proprietary APIs required.


The Full Picture

Before going layer by layer, here is how the system fits together:

Architecture diagram

Each vertical slice is independently deployable. You can run only the autocomplete engine on a laptop, or scale the agent engine on a GPU cluster. The layers communicate over a local JSON-RPC bus (same protocol VS Code uses for LSP).
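To make the bus concrete, here is a minimal sketch of how one layer might frame a request for another, assuming the LSP convention of a `Content-Length` header followed by a JSON-RPC 2.0 body. The method name `autocomplete/complete` and its parameters are hypothetical, chosen only for illustration.

```python
import json


def frame_request(request_id: int, method: str, params: dict) -> bytes:
    """Encode a JSON-RPC 2.0 request with LSP-style Content-Length framing."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def parse_frame(frame: bytes) -> dict:
    """Decode one framed message back into a dict."""
    header, _, body = frame.partition(b"\r\n\r\n")
    length = int(header.split(b":", 1)[1].decode("ascii").strip())
    return json.loads(body[:length].decode("utf-8"))


# "autocomplete/complete" is a hypothetical method name for this sketch.
frame = frame_request(1, "autocomplete/complete", {"uri": "file:///app.py", "line": 10})
message = parse_frame(frame)
```

Because every layer speaks the same framing, any slice can be swapped for a process on another machine without changing its callers.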


Layer 1 — Repo Understanding

Every other capability is only as good as the system's understanding of the codebase. This is where most open-source IDE projects are weakest. Getting it right means going beyond naive file-splitting.

Parsing with Tree-sitter

Tree-sitter produces a concrete syntax tree for 40+ languages, typically within a few milliseconds per file, and re-parses incrementally after each edit. Rather than splitting on character count, we split on semantic boundaries: functions, classes, method bodies. This keeps each chunk self-contained and reduces context fragmentation at retrieval time.

Architecture diagram
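The idea of splitting on semantic boundaries can be sketched without any dependencies. The example below swaps Tree-sitter for Python's standard-library `ast` module, so it only handles Python source; a real indexer would walk Tree-sitter nodes in the same way for every supported language. The chunk dictionary layout is an assumption of this sketch.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, never cutting mid-body."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(
                {
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    "text": text,
                }
            )
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Cache:
    def get(self, key):
        return None
'''

chunks = chunk_by_definition(SAMPLE)
```

Each chunk carries its line range, which is what lets retrieval results link back to an exact location in the editor.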

The Call Graph Layer

Pure vector search finds semantically similar code but misses structural relationships. A symbol reference graph (built from LSP textDocument/references calls) lets you answer questions like: "find every function that touches the auth middleware" with a graph traversal rather than a fuzzy search.

Store this as an adjacency list in SQLite — lightweight, zero infrastructure, always in sync with the repo.
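A minimal sketch of that adjacency list, assuming a single `refs` table of caller/callee pairs (the table and symbol names are made up for the example). A recursive query answers the "everything that touches the auth middleware" question directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refs (caller TEXT NOT NULL, callee TEXT NOT NULL, "
    "PRIMARY KEY (caller, callee))"
)
conn.executemany(
    "INSERT INTO refs VALUES (?, ?)",
    [
        ("login_view", "require_auth"),
        ("require_auth", "auth_middleware"),
        ("admin_view", "require_auth"),
        ("health_check", "ping_db"),
    ],
)


def transitive_callers(conn: sqlite3.Connection, symbol: str) -> list[str]:
    """Every symbol that reaches `symbol` directly or through other callers."""
    rows = conn.execute(
        """
        WITH RECURSIVE callers(name) AS (
            SELECT caller FROM refs WHERE callee = ?
            UNION  -- UNION (not UNION ALL) de-duplicates, so cycles terminate
            SELECT refs.caller FROM refs JOIN callers ON refs.callee = callers.name
        )
        SELECT name FROM callers ORDER BY name
        """,
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]


touching_auth = transitive_callers(conn, "auth_middleware")
```

In a real index the rows would come from the language server's reference results rather than being inserted by hand.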


Layer 2 — Autocomplete (Fill-in-the-Middle)

Autocomplete in a modern AI IDE is not plain left-to-right next-token prediction. It is fill-in-the-middle (FIM): the model sees the prefix (everything before the cursor) and the suffix (everything after), and generates the completion that bridges them.

Architecture diagram
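A FIM request is ultimately just a prompt with the prefix and suffix wrapped in sentinel tokens. The sketch below assumes StarCoder-style sentinels (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); other model families spell them differently, so check the model card before reusing it. The character limits are arbitrary placeholders.

```python
def build_fim_prompt(
    prefix: str,
    suffix: str,
    max_prefix_chars: int = 4000,
    max_suffix_chars: int = 1000,
) -> str:
    """Wrap the text around the cursor in FIM sentinels (StarCoder-style, assumed)."""
    prefix = prefix[-max_prefix_chars:]  # keep the text closest to the cursor
    suffix = suffix[:max_suffix_chars]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


document = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = document.index("return ") + len("return ")
prompt = build_fim_prompt(document[:cursor], document[cursor:])
```

The model's output after `<fim_middle>` is the text to insert at the cursor; generation should stop at the model's end-of-text or end-of-middle token.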

Model choices

| Model | Parameters | FIM Support | Runs on |
| --- | --- | --- | --- |
| StarCoder2-3B | 3B | ✅ native | Apple M2 / 8GB GPU |
| DeepSeek-Coder-V2-Lite | 16B | ✅ native | 24GB GPU |
| Qwen2.5-Coder-7B | 7B | ✅ native | 16GB GPU |
| CodeLlama-13B | 13B | ✅ native | 24GB GPU |

Serve them with Ollama for local dev or vLLM in production: PagedAttention reduces the KV-cache memory lost to fragmentation, and continuous batching lets new requests join an in-flight batch instead of waiting for it to drain.

Speculative Decoding

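Pair a small draft model with a larger verifier from the same family, because standard speculative decoding assumes both share a tokenizer (for example, Qwen2.5-Coder-0.5B drafting for Qwen2.5-Coder-7B). The draft generates K tokens; the verifier scores all K positions in a single forward pass and keeps the longest prefix it agrees with. Effective throughput depends on the acceptance rate: speedups of 2–3× over the large model alone are commonly reported, with higher figures possible on highly predictable completions.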
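The accept/reject step is easier to see in code. This is a toy, greedy-only sketch: the "models" are plain Python functions over a fixed token sequence, the verification loop stands in for what would be one batched forward pass, and the bonus token a real implementation gains when all K drafts are accepted is omitted. All names here are illustrative.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[str], max_new: int, k: int = 4
) -> List[str]:
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position (one batched pass in a real system).
        accepted: List[str] = []
        for token in proposal:
            expected = target(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # take the target's token and stop
                break
        out.extend(accepted)
    return out


def greedy_decode(target: NextToken, prompt: List[str], max_new: int) -> List[str]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


REFERENCE = "for i in range ( n ) : total += i".split()


def target_model(ctx: List[str]) -> str:
    return REFERENCE[len(ctx) % len(REFERENCE)]


def draft_model(ctx: List[str]) -> str:
    token = REFERENCE[len(ctx) % len(REFERENCE)]
    return "j" if token == "i" else token  # the draft is wrong whenever "i" is due


fast = speculative_decode(target_model, draft_model, ["for"], max_new=10)
slow = greedy_decode(target_model, ["for"], max_new=10)
```

The property worth noticing is that the output is identical to what the large model would have produced alone; the draft only changes how many target passes are needed to get there.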


Layer 3 — Chat-in-Editor

Chat works differently from autocomplete. The latency bar is 2–5 seconds (acceptable for a conversational exchange), but the context window must be carefully assembled to fit within model limits while including what's most relevant.

Architecture diagram
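One simple way to assemble that context is greedy packing: sort candidate snippets by retrieval score and add them until the token budget is spent. The sketch below assumes a crude four-characters-per-token estimate; a real system would count with the model's own tokenizer. The `Snippet` type is invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snippet:
    source: str
    text: str
    score: float  # retrieval relevance, higher is better


def estimate_tokens(text: str) -> int:
    # Rough heuristic only; swap in the serving model's tokenizer for real use.
    return max(1, len(text) // 4)


def assemble_context(snippets: List[Snippet], budget_tokens: int) -> List[Snippet]:
    """Pick the highest-scoring snippets that fit, skipping any that overflow."""
    chosen: List[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.score, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget_tokens:
            chosen.append(snippet)
            used += cost
    return chosen


candidates = [
    Snippet("a.py", "x" * 400, 0.9),  # about 100 tokens
    Snippet("b.py", "y" * 800, 0.8),  # about 200 tokens, too big once a.py is in
    Snippet("c.py", "z" * 200, 0.5),  # about 50 tokens
]
picked = assemble_context(candidates, budget_tokens=160)
```

Skipping an oversized snippet rather than stopping at it lets smaller, lower-ranked snippets still use the remaining budget.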

The critical UX insight: stream tokens to the chat panel in real time, but buffer code blocks and only apply them to the editor after the complete block arrives. Partial code blocks applied live cause flickering and make the diff unreadable.
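A minimal sketch of that split, assuming the model wraps code in triple-backtick fences: every token goes to the chat panel immediately, while a small line-based parser holds code back until the closing fence arrives. The class and attribute names are hypothetical.

```python
FENCE = "`" * 3  # built programmatically so this example can itself sit in a fence


class StreamRouter:
    def __init__(self) -> None:
        self.chat_tokens: list[str] = []       # rendered in the chat panel at once
        self.completed_blocks: list[str] = []  # safe to apply to the editor
        self._pending = ""                     # current partial line
        self._block: list[str] | None = None   # lines of the open code block, if any

    def feed(self, token: str) -> None:
        self.chat_tokens.append(token)
        self._pending += token
        # Fences can be split across tokens, so only act on complete lines.
        while "\n" in self._pending:
            line, self._pending = self._pending.split("\n", 1)
            self._handle_line(line)

    def _handle_line(self, line: str) -> None:
        if line.strip().startswith(FENCE):
            if self._block is None:
                self._block = []  # opening fence
            else:
                self.completed_blocks.append("\n".join(self._block))
                self._block = None  # closing fence
        elif self._block is not None:
            self._block.append(line)


router = StreamRouter()
for token in ["Here is the fix:\n", FENCE[:2], FENCE[2:] + "python\n",
              "def f():\n", "    return 1\n"]:
    router.feed(token)
blocks_before_close = list(router.completed_blocks)
router.feed(FENCE + "\n")
```

The editor only ever sees entries in `completed_blocks`, so a diff is computed once per block instead of once per token.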

For the model, any instruction-tuned model with a large context window works here: Qwen2.5-Coder-32B-Instruct, DeepSeek-V3, or Llama-3.3-70B-Instruct via Ollama / vLLM.


Layer 4 — Multi-File Agent Editing

This is the hardest layer to get right. The agent must plan, act across multiple files, observe outcomes (compiler errors, test failures), and revise — all without losing context of the original goal.

The Plan-Act-Observe Loop

Architecture diagram

Tool Set

read_file(path)                 → returns file contents
write_file(path, content)       → overwrites file, shown to the developer as a diff
search_codebase(query)          → vector + keyword hybrid search
run_command(cmd)                → sandboxed shell (Docker)
list_directory(path)            → file tree
get_diagnostics()               → LSP errors/warnings
get_references(symbol)          → call graph lookup
create_file(path, content)      → new file
delete_file(path)               → with undo stack
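How these tools are exposed to the model matters as much as the list itself. Below is a small sketch of a registry and dispatcher, assuming the model emits tool calls as JSON objects with `name` and `arguments` keys (a common convention, not something the stack above mandates). The in-memory `FAKE_FS` stands in for the real repository, and only two of the tools are wired up.

```python
import json
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., object]] = {}
FAKE_FS = {"app.py": "print('hi')\n"}  # in-memory stand-in for the repo


def tool(fn: Callable[..., object]) -> Callable[..., object]:
    """Register a function under its own name so the model can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def read_file(path: str) -> str:
    return FAKE_FS[path]


@tool
def write_file(path: str, content: str) -> str:
    FAKE_FS[path] = content
    return f"wrote {len(content)} characters to {path}"


def dispatch(call_json: str) -> dict:
    """Run one tool call and always return a result the model can read."""
    call = json.loads(call_json)
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {call.get('name')}"}
    try:
        return {"ok": True, "result": fn(**call.get("arguments", {}))}
    except Exception as exc:  # errors become observations, not crashes
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


ok = dispatch('{"name": "read_file", "arguments": {"path": "app.py"}}')
missing = dispatch('{"name": "read_file", "arguments": {"path": "nope.py"}}')
unknown = dispatch('{"name": "format_disk", "arguments": {}}')
```

Returning failures as structured results keeps the agent loop alive: a bad path is just another observation to reason about.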

Orchestration: LangGraph

LangGraph models the agent loop as a directed graph of nodes (think, act, observe, plan, verify). Edges are conditional — the observe node routes back to think on errors, or forwards to verify on success.

The key advantage over a simple while loop: checkpointing. LangGraph can pause the loop mid-execution, serialize state to disk, and resume — critical for long refactors that might span dozens of file edits.
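To show what checkpointing buys, here is a dependency-free sketch of the same idea. It deliberately does not use LangGraph's API: the nodes are plain functions that return the name of the next node, state is a JSON-serialisable dict, and a checkpoint is written after every step. The node logic is a stand-in (real nodes would call the model and the tools).

```python
import json
import os
import tempfile


def think(state: dict) -> str:
    state["plan"] = f"edit {state['pending'][0]}"
    return "act"


def act(state: dict) -> str:
    state["edited"].append(state["pending"].pop(0))
    return "observe"


def observe(state: dict) -> str:
    # Conditional edge: loop back while work remains, otherwise move on.
    return "think" if state["pending"] else "verify"


def verify(state: dict) -> None:
    state["done"] = True
    return None  # no next node: the graph is finished


NODES = {"think": think, "act": act, "observe": observe, "verify": verify}


def run(state: dict, checkpoint_path: str, max_steps=None) -> dict:
    steps = 0
    while state["next"] is not None:
        if max_steps is not None and steps >= max_steps:
            break  # paused; the last checkpoint already holds this state
        state["next"] = NODES[state["next"]](state)
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)
        steps += 1
    return state


def resume(checkpoint_path: str) -> dict:
    with open(checkpoint_path) as fh:
        return run(json.load(fh), checkpoint_path)


path = os.path.join(tempfile.mkdtemp(), "agent_state.json")
initial = {"next": "think", "pending": ["a.py", "b.py"], "edited": [], "done": False}
paused = run(initial, path, max_steps=4)
paused_snapshot = (paused["next"], list(paused["edited"]))
final = resume(path)
```

The run above stops after four steps with one file edited, then picks up from disk and finishes the second file without repeating any work.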


Layer 5 — System Design and Reasoning

Architecture-level questions ("should I use event sourcing here?", "draw the service dependency graph") require a different mode: long-horizon reasoning over the entire codebase context, not just a few files.

Architecture diagram

The repo summary is the critical artifact. Build it once on first index, then update incrementally using git diff — only re-summarise modules that changed in the last commit.
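A sketch of the incremental step, assuming each top-level directory is treated as one "module" with its own summary (that mapping is an assumption of this example, not a requirement). The `git diff --name-only` call lists changed paths; the pure function decides which summaries are stale.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(repo_dir: str, base: str = "HEAD~1", head: str = "HEAD") -> list[str]:
    """Paths touched between two commits, as reported by git."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", base, head],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def modules_to_resummarise(paths: list[str], known_modules: set[str]) -> set[str]:
    """Map changed paths to the top-level modules whose summaries are now stale."""
    stale = set()
    for path in paths:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in known_modules:
            stale.add(parts[0])
    return stale


stale = modules_to_resummarise(
    ["auth/middleware.py", "auth/tokens.py", "billing/invoice.py", "README.md"],
    {"auth", "billing", "search"},
)
```

Only the modules in `stale` are sent back through the summarisation model; everything else keeps its cached summary.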


Layer 6 — Safe Code Execution Loop

Agents that can write code must be able to run it. But running arbitrary LLM-generated code on the host machine is a hard no. The execution layer must be:

  • Isolated: no access to host filesystem, network, or env vars outside the project
  • Ephemeral: container torn down after each run
  • Auditable: all stdin/stdout/stderr captured and shown to the developer

Architecture diagram
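The three properties above map fairly directly onto `docker run` flags. The sketch below only builds the argument list and wraps `subprocess.run` for capture; the image name, resource limits, and `/workspace` mount point are example choices, and the project directory is mounted writable so test runners can create caches.

```python
import subprocess
from typing import List


def sandbox_argv(
    project_dir: str,
    image: str,
    command: List[str],
    memory: str = "512m",
    cpus: str = "1.0",
) -> List[str]:
    return [
        "docker", "run",
        "--rm",                                 # ephemeral: removed after the run
        "--network", "none",                    # isolated: no network access
        "--read-only",                          # root filesystem is read-only
        "--tmpfs", "/tmp",                      # scratch space that vanishes on exit
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--memory", memory,
        "--cpus", cpus,
        "--pids-limit", "256",
        "--volume", f"{project_dir}:/workspace",  # only the project is visible
        "--workdir", "/workspace",
        image,
        *command,
    ]


def run_sandboxed(argv: List[str], timeout_s: int = 120) -> subprocess.CompletedProcess:
    """Auditable: stdout and stderr are captured for display to the developer."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)


argv = sandbox_argv("/home/dev/repo", "python:3.12-slim", ["pytest", "-q"])
```

Host environment variables are not forwarded because no `--env` flags are passed; anything the tests need has to be listed explicitly.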

The Self-Healing Loop

Architecture diagram

When tests fail, the output becomes the next observation in the agent loop. The agent sees the exact error, reasons about the fix, edits the file, and re-runs — typically converging in 2–3 iterations for straightforward bugs.
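The loop needs a budget, otherwise a confused agent can burn tokens indefinitely. A minimal sketch, with the test runner and the fix step passed in as callables (both are fakes here; in the real system they would be the sandbox and the agent):

```python
from typing import Callable, Tuple


def self_heal(
    run_tests: Callable[[], Tuple[bool, str]],
    propose_fix: Callable[[str], None],
    max_fixes: int = 3,
) -> dict:
    """Run tests, feed failures back as observations, stop after max_fixes attempts."""
    fixes = 0
    while True:
        passed, output = run_tests()
        if passed:
            return {"passed": True, "fixes": fixes}
        if fixes == max_fixes:
            return {"passed": False, "fixes": fixes, "last_output": output}
        propose_fix(output)  # the agent sees the exact error text
        fixes += 1


# Fakes for illustration: each "fix" removes one bug.
state = {"bugs": 2}
seen_errors = []


def fake_run_tests() -> Tuple[bool, str]:
    if state["bugs"] == 0:
        return True, "all tests passed"
    return False, f"AssertionError: {state['bugs']} tests failing"


def fake_propose_fix(output: str) -> None:
    seen_errors.append(output)
    state["bugs"] -= 1


result = self_heal(fake_run_tests, fake_propose_fix)
```

When the budget runs out, the last failing output is returned so the developer can take over with full context.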

For an even tighter sandbox, use gVisor (Google's container runtime that intercepts syscalls in user space) or Firecracker (AWS's micro-VM used in Lambda) instead of vanilla Docker.


The Full Open-Source Stack

| Capability | Component | Notes |
| --- | --- | --- |
| Editor | Monaco Editor | MIT, same engine as VS Code |
| Syntax parsing | Tree-sitter | MIT, 40+ languages |
| Code intelligence | LSP servers (clangd, pylsp, ts-ls) | Per-language |
| Embeddings | nomic-embed-text-v1.5 | Apache 2.0, 768-dim, runs locally |
| Vector store | Chroma (dev) / Qdrant (prod) | Both open-source |
| FIM autocomplete | StarCoder2-3B / Qwen2.5-Coder-7B | BigCode OpenRAIL-M / Apache 2.0 |
| Chat model | Qwen2.5-Coder-32B-Instruct | Apache 2.0 |
| Reasoning model | QwQ-32B / DeepSeek-R1-Distill-Qwen-32B | Apache 2.0 / MIT |
| Model serving | Ollama (local) / vLLM (production) | MIT / Apache 2.0 |
| Agent orchestration | LangGraph | MIT |
| Execution sandbox | Docker + seccomp / gVisor | Apache 2.0 |
| Backend API | FastAPI | MIT |
| Frontend | Next.js + Tailwind | MIT |

What You Don't Get For Free

An honest architecture post should name the hard parts:

Latency at low VRAM. A 32B model only fits on a single 24GB GPU once it is quantised to roughly 4 bits (for example GGUF Q4), and in that configuration chat hits around 15–20 tokens/second. Acceptable for most workflows, but noticeably slower than cloud-hosted alternatives. The mitigations are speculative decoding, a smaller model for routine requests, or offloading to a small cloud GPU when needed.
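The arithmetic behind that constraint is worth spelling out. This back-of-the-envelope helper counts weight memory only; the KV cache and activations come on top, and practical 4-bit formats spend somewhat more than 4 bits per weight, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


fp16_gb = weight_memory_gb(32, 16)  # 64.0 GB: far beyond a 24GB card
q4_gb = weight_memory_gb(32, 4)     # 16.0 GB: fits, with limited room for KV cache
```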

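Prompt cache invalidation. Managed AI coding services almost certainly implement prompt caching across requests. vLLM offers automatic prefix caching, which reuses the key-value cache for requests that share a prompt prefix, but getting a high hit rate means keeping the prompt layout stable (system prompt and repo summary first, volatile editor state last). Possible, but non-trivial.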

Index freshness. Keeping the vector store in sync with active edits (every keystroke rewrites files) requires debounced incremental re-indexing — easy to get wrong and end up with stale retrieval.
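A sketch of the debouncing part, assuming a polling design: edits only mark a file dirty, and a periodic `tick` re-indexes files that have been quiet for long enough. The clock is injectable so the behaviour can be tested without sleeping; the class name and the 0.5 second window are illustrative.

```python
import time
from typing import Callable, Dict, List


class DebouncedIndexer:
    def __init__(
        self,
        reindex: Callable[[str], None],
        quiet_seconds: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._reindex = reindex
        self._quiet = quiet_seconds
        self._clock = clock
        self._dirty: Dict[str, float] = {}  # path -> time of most recent edit

    def on_edit(self, path: str) -> None:
        self._dirty[path] = self._clock()  # each keystroke resets the timer

    def tick(self) -> List[str]:
        """Re-index every file whose last edit is older than the quiet window."""
        now = self._clock()
        ready = [p for p, t in self._dirty.items() if now - t >= self._quiet]
        for path in ready:
            del self._dirty[path]
            self._reindex(path)
        return ready


now = [0.0]
indexed: List[str] = []
indexer = DebouncedIndexer(indexed.append, quiet_seconds=0.5, clock=lambda: now[0])

indexer.on_edit("a.py")
now[0] = 0.2
indexer.on_edit("a.py")  # still typing
now[0] = 0.6
first = indexer.tick()   # only 0.4s since the last edit: too soon
now[0] = 0.8
second = indexer.tick()  # 0.6s of quiet: re-index now
```

The failure mode to watch for is the opposite one: a file that is edited continuously never goes quiet, so a production version would also want a maximum staleness limit.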

Security surface. The Docker sandbox is safe for test runners. But agents that can write_file anywhere in the repo, modify CI configs, or touch secrets files are a different risk level. Implement a path allowlist and require developer confirmation for writes outside the current working directory.
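A minimal sketch of such a policy, returning one of three decisions for each proposed write. The list of protected paths is hypothetical, and the check is purely lexical: it normalises `..` segments but does not resolve symlinks, which a production guard would also need to do.

```python
import posixpath

PROTECTED = (".github", ".env", "secrets")  # hypothetical sensitive locations


def classify_write(repo_root: str, working_dir: str, target: str) -> str:
    """Return "allow", "confirm" (ask the developer), or "deny"."""
    root = posixpath.normpath(repo_root)
    work = posixpath.normpath(posixpath.join(root, working_dir))
    path = posixpath.normpath(posixpath.join(work, target))

    if not path.startswith(root + "/"):
        return "deny"  # escapes the repository entirely
    relative = path[len(root) + 1 :]
    if any(relative == p or relative.startswith(p + "/") for p in PROTECTED):
        return "confirm"  # CI configs and secrets always need a human
    if path.startswith(work + "/"):
        return "allow"  # inside the current working directory
    return "confirm"  # elsewhere in the repo


decisions = {
    "inside": classify_write("/repo", "services/api", "handlers.py"),
    "sibling": classify_write("/repo", "services/api", "../web/app.ts"),
    "ci": classify_write("/repo", "services/api", "../../.github/workflows/ci.yml"),
    "escape": classify_write("/repo", "services/api", "../../../etc/passwd"),
}
```

Running this check inside the `write_file` tool itself, rather than in the prompt, means the policy holds even when the model ignores its instructions.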


Closing Thoughts

The individual components here — Tree-sitter, vLLM, LangGraph, Docker — are each battle-tested in production at scale. The architecture challenge is the orchestration: assembling the right context, routing to the right model at the right latency budget, and designing a UX that keeps the developer in control of what the agent actually touches.

The moat of any great AI coding tool is not its architecture. It's the years of UX iteration on top of this architecture. The open-source community now has every primitive it needs to build something just as capable.

The next post in this series will walk through implementing the FIM autocomplete engine end-to-end: Tree-sitter chunking, nomic embeddings, and a StarCoder2-3B server with speculative decoding — all running on a single laptop.