LLM Wiki: persistent knowledge bases for AI agents
Today we're launching LLM Wiki — a feature that lets AI agents build and maintain persistent knowledge bases on top of Coregit's Git infrastructure. Every page is a commit. Every revision is traceable. Every concept is searchable by meaning, not just keywords.
The problem with agent memory
AI agents have a fundamental storage problem. Context windows are finite. Conversation history gets truncated. RAG systems work, but they're read-only — the agent can retrieve information, but it can't organize it, refine it, or build on it over time.
What agents need is a persistent, structured, version-controlled knowledge base that they can both read and write. Something that grows and improves with every interaction.
The Karpathy pattern
In early 2026, Andrej Karpathy described a pattern he called "LLM Wiki" — a knowledge base where LLMs ingest raw sources, synthesize wiki pages, and maintain an interconnected graph of knowledge. The key insight: the wiki is the agent's long-term memory.
But Karpathy described it as a concept, not an implementation. When you try to build it, the questions pile up: Where do you store the pages? How do you version them? How do you search across thousands of pages? How does an agent actually read and write them?
Coregit's answer: Git.
Three layers, one repo
Every LLM Wiki is a standard Git repository with a specific directory structure:
my-wiki/
├── raw/                         ← Immutable source documents
│   ├── paper-1.pdf
│   ├── meeting-notes.md
│   └── api-docs.html
├── wiki/                        ← LLM-generated pages
│   ├── attention-mechanism.md
│   ├── transformer-architecture.md
│   └── training-strategies.md
├── schema.md                    ← Agent instructions
├── index.md                     ← Auto-generated index
└── log.md                       ← Changelog of wiki operations

raw/ contains immutable source material — papers, transcripts, documentation, notes. These are the ground truth. They never change after ingestion.
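As a rough illustration, the layout above could be laid down locally with a few lines of Python. This is only a sketch of the directory structure; the actual `cgt wiki init` command presumably writes richer defaults (the file contents below are placeholders, not Coregit's):

```python
import os

def init_wiki(root):
    """Create the default LLM Wiki layout: raw/, wiki/, and the
    three top-level files (schema.md, index.md, log.md)."""
    os.makedirs(os.path.join(root, "raw"), exist_ok=True)
    os.makedirs(os.path.join(root, "wiki"), exist_ok=True)
    for name, text in [
        ("schema.md", "# Schema\nRules the agent follows when writing pages.\n"),
        ("index.md", "# Index\n"),
        ("log.md", "# Log\n"),
    ]:
        with open(os.path.join(root, name), "w") as f:
            f.write(text)

init_wiki("my-wiki")
```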
wiki/ contains LLM-generated pages. Each page has YAML frontmatter with structured metadata:
---
title: Attention Mechanism
summary: Core building block of transformer architectures
related:
- transformer-architecture
- self-attention-variants
confidence: 0.92
sources:
- raw/attention-is-all-you-need.pdf
- raw/flash-attention-paper.pdf
tags:
- deep-learning
- nlp
- architecture
---
The attention mechanism allows models to focus on relevant parts
of the input sequence...

The related field creates an explicit knowledge graph. The sources field maintains provenance. The confidence field lets agents flag pages that need revision.
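To make the page format concrete, here is a minimal sketch of splitting a page into frontmatter and body. It handles only the simple scalar and list fields shown above (a real implementation would use a proper YAML parser); `parse_page` is a hypothetical helper, not part of Coregit's API:

```python
def parse_page(text):
    """Split a wiki page into (frontmatter dict, body).
    Handles simple 'key: value' scalars and '- item' lists only."""
    assert text.startswith("---")
    _, fm, body = text.split("---", 2)   # frontmatter sits between the first two --- markers
    meta, key = {}, None
    for line in fm.strip().splitlines():
        if line.startswith("- ") and key:
            meta[key].append(line[2:].strip())   # list item under the last key
        elif ":" in line:
            key, _, val = line.partition(":")
            key = key.strip()
            meta[key] = val.strip() if val.strip() else []   # bare key starts a list
    return meta, body.strip()

page = """---
title: Attention Mechanism
related:
- transformer-architecture
confidence: 0.92
---
The attention mechanism allows models to focus..."""
meta, body = parse_page(page)
```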
schema.md is the agent's instruction set — rules for how to process raw sources, what frontmatter fields to include, naming conventions, and quality standards.
Why Git is the right foundation
Git gives LLM Wiki four properties that no other storage layer provides simultaneously:
Immutable history. Every version of every page is preserved forever. You can diff any two versions to see exactly what changed and when. If an agent introduces an error in a wiki page, you can trace it back to the exact commit and the exact source that caused it.
Atomic operations. Coregit's commit endpoint lets an agent update 50 wiki pages in a single atomic commit. Either all changes land, or none do. No partial updates, no inconsistent state.
Content-addressed storage. If two pages quote the same paragraph from a source document, Git stores it once. If an agent regenerates a page and produces identical content, the blob SHA matches and no storage is wasted.
Branch-based workflows. An agent can create a branch, experiment with restructuring the wiki, and merge only if the result is better. This enables speculative knowledge refinement without risking the main knowledge base.
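The content-addressing property is easy to verify directly: a Git blob's id is the SHA-1 of a small header plus the raw content, so identical content always yields the same object id regardless of which page it appears in. A minimal sketch:

```python
import hashlib

def blob_sha(content: bytes) -> str:
    """Git's blob object id: SHA-1 over 'blob <len>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# A regenerated page with byte-identical content maps to the same
# object id, so Git stores the blob exactly once.
a = blob_sha(b"The attention mechanism allows models to focus...\n")
b = blob_sha(b"The attention mechanism allows models to focus...\n")
assert a == b
```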
Semantic search over knowledge
Wiki search isn't keyword matching. When an agent asks "how does attention work," it should find pages about self-attention, multi-head attention, cross-attention, and scaled dot-product attention — even if none of them contain the exact phrase "how does attention work."
Coregit's search pipeline for wikis:
- Embed the query via Voyage AI voyage-code-3 (1024-dim vectors)
- Search both raw/ and wiki/ — sometimes the answer is in a source document, sometimes in a synthesized page
- Post-filter by tree membership — results are version-aware, tied to a specific commit
- Rerank via Voyage rerank-2.5 for relevance ordering
- MMR diversification — spread results across different pages instead of clustering on one
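The MMR (maximal marginal relevance) step can be sketched with plain similarity scores: greedily pick results that balance relevance to the query against similarity to results already picked. This is the standard algorithm in miniature, not Coregit's implementation, and the similarity values below are made up for illustration:

```python
def mmr(query_sim, doc_sims, k, lam=0.5):
    """query_sim[i]   = similarity(query, doc i)
    doc_sims[i][j] = similarity(doc i, doc j)
    Returns indices of k picks balancing relevance vs. redundancy."""
    picked, candidates = [], list(range(len(query_sim)))
    while candidates and len(picked) < k:
        best = max(
            candidates,
            key=lambda i: lam * query_sim[i]
            - (1 - lam) * max((doc_sims[i][j] for j in picked), default=0.0),
        )
        picked.append(best)
        candidates.remove(best)
    return picked

# Three near-duplicate attention pages plus one distinct page:
q = [0.9, 0.88, 0.87, 0.5]
d = [[1, .95, .94, .1], [.95, 1, .96, .1], [.94, .96, 1, .1], [.1, .1, .1, 1]]
print(mmr(q, d, 2))  # -> [0, 3]: skips the near-duplicates, surfaces the distinct page
```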
Embeddings are content-addressed (keyed by blob SHA), so re-indexing unchanged pages is instant — the cache hit rate improves as the wiki matures.
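Content-addressed caching falls out naturally from the blob SHA: key the cache by the object id, and unchanged pages never trigger a second embedding call. A sketch under that assumption, with a stand-in for the Voyage AI call:

```python
import hashlib

cache = {}  # blob SHA -> embedding vector

def embed_cached(content: bytes, embed_fn):
    """Embed content at most once per distinct blob; identical content
    shares a Git blob SHA, so it shares a cache entry."""
    sha = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
    if sha not in cache:
        cache[sha] = embed_fn(content)
    return cache[sha]

calls = []
fake_embed = lambda c: (calls.append(c), [0.1, 0.2])[1]  # stand-in for the real embedder
v1 = embed_cached(b"page text", fake_embed)
v2 = embed_cached(b"page text", fake_embed)  # cache hit: no second call
```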
Knowledge graph from frontmatter
The related field in each page's frontmatter creates an explicit, machine-readable knowledge graph. Coregit parses these links and exposes them via API:
# Get the knowledge graph for a wiki
cgt wiki graph my-research

This enables agents to traverse related concepts, find gaps in coverage, and identify clusters of knowledge that could be consolidated. The graph is stored in PostgreSQL and updated on every commit — no separate graph database needed.
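Gap-finding from the related links is simple to sketch: any slug that is referenced but has no page of its own is a hole an agent could fill. The function and data below are illustrative, not Coregit's API:

```python
def build_graph(pages):
    """pages: {slug: list of 'related' slugs from frontmatter}.
    Returns (adjacency map, slugs referenced but missing a page)."""
    graph = {slug: sorted(set(rel)) for slug, rel in pages.items()}
    referenced = {r for rel in pages.values() for r in rel}
    gaps = sorted(referenced - set(pages))
    return graph, gaps

pages = {
    "attention-mechanism": ["transformer-architecture", "self-attention-variants"],
    "transformer-architecture": ["attention-mechanism"],
}
graph, gaps = build_graph(pages)
print(gaps)  # -> ['self-attention-variants']: referenced, but no page yet
```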
11 API endpoints
The full wiki API surface:
# Initialize a wiki with default structure
cgt wiki init my-research --title "AI Research"
# List pages with parsed frontmatter
cgt wiki pages my-research
# Read a specific page (content + metadata)
cgt wiki page my-research attention-mechanism
# Create or update a page
cgt wiki write my-research transformer-architecture --content "..."
# Semantic search across pages and sources
cgt wiki search my-research "how does attention work"
# Auto-generated llms.txt for LLM consumption
cgt wiki llms-txt my-research
# Knowledge graph as JSON
cgt wiki graph my-research
# Wiki health stats (page count, word count, staleness)
cgt wiki stats my-research

The llms.txt standard
Every wiki auto-generates an llms.txt file — a structured, LLM-readable summary of the entire knowledge base. This follows the emerging llms.txt convention: a single file that an agent can ingest to understand the scope and structure of available knowledge before making targeted queries.
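A generated llms.txt for a wiki might look like the output of this sketch, which follows the convention's common shape (an H1, then link lines of the form "- [name](url): description"). The exact fields Coregit emits are an assumption here:

```python
def render_llms_txt(title, pages):
    """Render a minimal llms.txt: title heading, then one
    '- [slug](path): summary' line per wiki page."""
    lines = [f"# {title}", "", "## Pages", ""]
    for slug, summary in sorted(pages.items()):
        lines.append(f"- [{slug}](wiki/{slug}.md): {summary}")
    return "\n".join(lines) + "\n"

print(render_llms_txt("AI Research", {
    "attention-mechanism": "Core building block of transformer architectures",
}))
```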
Use cases
Research synthesis. Feed papers into raw/, let an agent build wiki pages that connect findings across papers. The knowledge compounds — each new paper enriches the entire graph.
Codebase documentation. Point an agent at your repo. It reads the code, writes wiki pages explaining architecture, data flows, and design decisions. The wiki stays in sync because it's version-controlled alongside the code.
Customer support knowledge. Ingest support tickets, product docs, and internal guides. The wiki becomes a semantic search layer over everything your team knows.
Personal knowledge management. Meeting notes, articles, bookmarks — all ingested, synthesized, and searchable by meaning.
Get started
npx coregit-wizard@latest
cgt wiki init my-wiki --title "My Knowledge Base"

Start feeding it sources. Let the agent build the wiki. Watch the knowledge compound.