Grep is All You Need: Zero-Preprocessing Knowledge Retrieval for LLM Agents
The LocalKin Team
Correspondence: contact@localkin.ai
Project: https://localkin.dev | https://github.com/LocalKinAI
Position Paper — April 2026 (v1.1, updated 2026-04-25)
Abstract
Retrieval-Augmented Generation (RAG) has become the dominant paradigm for grounding Large Language Model (LLM) agents in domain-specific knowledge. The standard approach requires selecting an embedding model, designing a chunking strategy, deploying a vector database, maintaining indexes, and performing approximate nearest neighbor (ANN) search at query time. We argue that for domain-specific knowledge grounding — where the vocabulary is predictable and the corpus is bounded — this entire stack is unnecessary. We present Knowledge Search, a two-layer retrieval system composed of (1) grep with contextual line windows over raw source texts and (2) grep over LLM-compiled per-source concept and FAQ files generated nightly by a free, local, autonomous compilation pipeline. Deployed in production across 76 specialized LLM agents serving three knowledge domains (Traditional Chinese Medicine, Christian spiritual classics, and U.S. civics), grounded in ~500 primary source texts and ~180 MB of corpus, our approach achieves 100% retrieval accuracy with sub-10ms latency, zero preprocessing per query, zero additional memory footprint, and zero infrastructure dependencies. We also document a reproducible failure-and-recovery cycle (0/5 fabricated quotes → 4/4 grep-verified quotes after a one-commit fix) that demonstrates the architecture's safety properties are recoverable through prompt hygiene alone — no retraining, no infrastructure change. The key insight is simple: retrieval does not need intelligence. The LLM is the intelligence.
Keywords: retrieval-augmented generation, knowledge grounding, LLM agents, information retrieval, domain-specific AI, zero-hallucination retrieval, autonomous corpus growth
1. Introduction
The year is 2026, and every LLM application tutorial begins the same way: choose an embedding model, chunk your documents, spin up a vector database, build an index, and pray that approximate nearest neighbor search returns the right passages. This pipeline — collectively known as Retrieval-Augmented Generation (Lewis et al., 2020) — has become so ubiquitous that it is treated as a law of nature rather than what it actually is: an engineering choice with significant tradeoffs.
We propose an alternative. For domain-specific knowledge grounding, where the source texts are known, the vocabulary is predictable, and the corpus fits within reasonable bounds, the entire RAG stack can be replaced by two Unix utilities that predate the World Wide Web: grep and cat.
This is not a toy experiment. Our system, Knowledge Search, is deployed in production as part of LocalKin, a multi-agent AI platform. As of April 2026, it serves as the knowledge backbone for 39 Traditional Chinese Medicine (TCM) agents and 37 Christian spiritual direction agents (76 specialized master agents in total, the free-tier whitelist), plus a U.S. citizenship coaching agent, all grounded in approximately 500 primary source texts (~180 MB) spanning two languages and four-and-a-half millennia of human thought (from the Yellow Emperor's Inner Canon to living National Grand Masters; from Irenaeus, 130 AD, to T. Austin-Sparks, 1971). The system serves all of these agents from a single Mac mini.
The results are not close. Knowledge Search achieves 100% retrieval accuracy at sub-10ms latency with zero per-query preprocessing, while vector RAG systems typically deliver 85-95% accuracy at 50-200ms latency after hours of upfront preprocessing. We do not claim this approach works for everything. We claim it works remarkably well for the class of problems where most practitioners reflexively reach for vector databases.
Furthermore, the architecture's "zero-hallucination" property is not aspirational — it is reproducible. We document a one-day cycle (Section 6.5) in which a deliberate prompt-engineering regression collapsed citation accuracy to 0/5 grep-verified quotes; auto-stripping 41 fake-quote markers across 79 soul prompts and adding a citation hard-rule restored it to 4/4 — recovered by prompt hygiene alone, with no retraining and no infrastructure change.
This paper is structured as follows. Section 2 examines the hidden costs of the standard RAG pipeline. Section 3 presents our two-layer retrieval architecture. Section 4 describes the knowledge corpus. Section 5 provides comparative analysis. Section 6 explains why this approach works, and includes a reproducibility addendum (§6.5). Section 7 honestly addresses its limitations. Section 8 discusses production integration. Section 9 covers autonomous corpus growth — a daily cron-driven pipeline that has compiled 47→345 concept/FAQ entries in 17 days at zero monetary cost. Section 10 reflects on what this means for the field.
2. The Hidden Costs of Vector RAG
The standard RAG pipeline is presented as a solved problem, but each stage introduces compounding complexity, latency, and — most critically — information loss.
2.1 Embedding Model Selection
The first decision is which embedding model to use. OpenAI's text-embedding-3-large? Cohere's embed-v3? A fine-tuned Sentence-BERT variant? Each model encodes different semantic assumptions. A model trained primarily on English web text will produce poor embeddings for Classical Chinese medical terminology. The choice is consequential, yet there is no principled way to make it without extensive evaluation — evaluation that requires the very retrieval system you have not yet built.
2.2 Chunking Strategy
Documents must be split into chunks before embedding. But how? Fixed-size windows of 512 tokens? Recursive splitting by headers? Semantic chunking based on topic boundaries? Every strategy is a lossy compression of the original text. A passage about the herb 黄芪 (Astragalus root) that spans a chunk boundary will be split into two fragments, neither of which fully captures the original meaning. The chunk size directly determines the ceiling of retrieval quality, yet it must be chosen before any retrieval has occurred.
2.3 Vector Database Operations
The embedded chunks must be stored in a vector database — Pinecone, Weaviate, Chroma, Qdrant, pgvector, or one of the dozens of alternatives that have emerged since 2023. Each requires its own deployment, configuration, and operational expertise. Each has different consistency guarantees, scaling characteristics, and failure modes. For a solo developer or small team, this is not a trivial dependency — it is an entire subsystem that must be monitored, backed up, and maintained.
2.4 Approximate Nearest Neighbor Search
At query time, the user's question is embedded and compared against the stored vectors using approximate nearest neighbor algorithms (HNSW, IVF, or similar). The word "approximate" is doing heavy lifting here. ANN search trades accuracy for speed, and the tradeoff is not always favorable. A query about 伤寒论 (Treatise on Cold Damage) might retrieve passages about 温病 (Warm Disease) because the embeddings are geometrically close — the texts discuss overlapping symptoms. The retrieved passages are plausible but wrong, and the LLM has no way to know this.
2.5 The Maintenance Tax
When new documents are added, the index must be rebuilt. When the embedding model is updated, the entire corpus must be re-embedded. When chunk sizes are adjusted, everything starts over. This maintenance tax is invisible in demos but dominates the operational cost of production systems.
3. Our Approach: Two-Layer Knowledge Retrieval
Knowledge Search replaces the entire RAG pipeline with two layers, each implemented as a single system call.
3.1 Layer 1: grep — Exact Contextual Search
The first layer performs keyword search over raw source texts using grep with an 8-line context window:
```sh
grep -r -i -n -C 8 "$query" "$knowledge_dir"
```
The flags are straightforward:
- `-r`: recursive search across all files in the knowledge directory
- `-i`: case-insensitive matching
- `-n`: include line numbers for source attribution
- `-C 8`: return 8 lines of context before and after each match
This is not sophisticated. That is the point. When a user asks about 黄芪的功效 (the effects of Astragalus), the search term 黄芪 will appear verbatim in every relevant passage of the TCM corpus. There is no embedding to misinterpret, no chunk boundary to split the answer, no approximate search to return a near-miss. The match is exact, the context is complete, and the retrieval is deterministic.
Multiple matches across different source texts are concatenated and passed to the LLM, which synthesizes the answer. The LLM sees the original text exactly as it was written, with surrounding context intact. No information has been lost.
Performance characteristics:
- Latency: 2-8ms for a 162-file corpus on commodity hardware
- Accuracy: 100% recall for queries containing domain vocabulary
- Preprocessing: none
- Memory overhead: none (files remain on disk, read on demand)
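For concreteness, here is a minimal Python sketch of the Layer 1 call. The production skill is ~30 lines of shell (Section 8.2); the function name mirrors the skill, but the context-budget cap and the error handling shown here are illustrative assumptions, not the production values.

```python
import subprocess

def knowledge_search(query: str, knowledge_dir: str, context_lines: int = 8,
                     max_chars: int = 20_000) -> str:
    """Layer 1: literal, case-insensitive grep with line numbers and context.

    Returns raw grep output (file:line:text blocks) concatenated and capped so it
    fits in the LLM's context window. The cap value is an illustrative choice.
    """
    result = subprocess.run(
        ["grep", "-r", "-i", "-n", "-C", str(context_lines), query, knowledge_dir],
        capture_output=True, text=True,
    )
    # grep exits 1 when nothing matches; treat that as an empty result, not an error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout[:max_chars]
```

An agent passes the returned text directly into its prompt; no parsing, ranking, or re-chunking happens before the LLM sees it.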
3.2 Layer 2: grep — Per-Source LLM-Compiled Concept and FAQ Bridge
Not every query contains a greppable keyword in the language of the source corpus. A Chinese-speaking user might ask 史百克对"破碎"的看法 ("Austin-Sparks's view on brokenness") while the source corpus is entirely in English (brokenness, broken vessel, the cross deals with the natural man). Pure Layer 1 grep over English text would not match Chinese query terms.
Layer 2 solves this with a per-source concept-and-FAQ compilation layer, generated automatically by a daily cron job. For every source file in the corpus, the system maintains two small companion files:
```text
input/<domain>/<author>/                    # Layer 1: raw originals
├── 01_school_of_christ.txt                 (176 KB)
└── _compiled/                              # Layer 2: auto-generated
    ├── 01_school_of_christ_concepts.md     (~3 KB, key concepts + verbatim quotes)
    └── 01_school_of_christ_faq.md          (~2.5 KB, 5–8 Q&A pairs)
```
The compilation process is run by knowledge_compile.py, a 400-line Python skill that:
- Reads one source file (truncated to 150 KB to fit LLM context)
- Sends two prompts to a local LLM:
  - Concepts prompt: extract 5–10 core concepts as `<name>: definition + 1 key verbatim quote`, plus 3–5 chapter-attributed quotes
  - FAQ prompt: generate 5–8 Q&A pairs in the form a student would actually ask
- Writes outputs to the `_compiled/` sibling directory
- Skips on subsequent runs if both compiled files already exist (idempotent)
Critical implementation choice (revised 2026-04-24): The primary LLM was migrated from a paid API (Anthropic Haiku 4.5 via OAuth) to a local Ollama deployment of Kimi 2.6 (kimi-k2.6:cloud). The migration reduced per-file compilation cost from ~$0.01 to $0.00 with no quality regression on Chinese spiritual and TCM content. Haiku is retained as a fallback if the local Kimi instance returns empty or short output. The full call site is:
```python
import json

# SYSTEM_PROMPT and curl_post(...) are defined elsewhere in the skill.
def call_kimi(prompt):
    payload = json.dumps({
        "model": "kimi-k2.6:cloud",
        "prompt": SYSTEM_PROMPT + "\n\n" + prompt,
        "stream": False,
        "options": {"temperature": 0.3, "num_predict": 4000},
    })
    return curl_post("http://localhost:11434/api/generate", payload)
```
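The fallback path described above can be sketched as follows. Only the `call_kimi` call site is taken from the actual skill; the `call_haiku` helper and the 200-character quality floor are illustrative assumptions.

```python
MIN_USEFUL_CHARS = 200  # assumed quality floor; empty or very short output triggers fallback

def compile_prompt(prompt: str) -> str:
    """Try the free local Kimi instance first; fall back to the paid Haiku API
    only when the local model fails or returns suspiciously short output."""
    try:
        text = call_kimi(prompt)
    except Exception:
        text = ""
    if len(text.strip()) >= MIN_USEFUL_CHARS:
        return text
    return call_haiku(prompt)  # hypothetical helper wrapping the Anthropic fallback
```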
When knowledge_search is invoked, both layers are searched together with the same grep invocation:
```sh
grep -r -i -n -C 8 "$query" "$knowledge_dir"   # walks both raw files and _compiled/
```
This is the architectural inversion from the more conventional "fallback" framing: Layer 2 is not a degradation path, it is a parallel concept-bridge. When the user query contains a term that appears only in Layer 2 (e.g., a Chinese concept name corresponding to an English source), grep returns hits from the _concepts.md or _faq.md files, which the LLM uses to locate the correct passage in the original source. The agent then quotes from Layer 1 with chapter attribution.
Sizing: With 500 source files, the per-source compilation overhead is approximately 5.5 KB × 500 ≈ 2.7 MB of compiled material — a negligible fraction of the 180 MB raw corpus. grep performance is unchanged.
Generation cadence: A daily cron task runs compile_batch over each domain at 5:00, 5:30, and 6:00 AM (staggered to avoid concurrent Ollama load), processing 30 new source files per domain per night. As of 2026-04-25 the system has compiled 345/505 entries (68%) in 17 days, growing at ~90 entries/night, $0/night.
3.3 The Design Principle
The architecture embodies a single principle: retrieval does not need intelligence; the LLM is the intelligence.
Vector RAG systems attempt to build intelligence into the retrieval layer — semantic understanding via embeddings, relevance ranking via similarity scores, re-ranking via cross-encoders. This is engineering effort applied to the wrong layer. The LLM is already the most powerful language understanding system in the pipeline. Give it the raw text and let it do what it does best.
4. Knowledge Corpus
Knowledge Search is deployed across three distinct knowledge domains, each with different characteristics. As of 2026-04-25 the production corpus comprises approximately 500 primary source texts totaling ~180 MB, broken down as follows:
| Domain | Source files | Layer 1 size | Layer 2 compiled (Apr 25) |
|---|---|---|---|
| TCM (Chinese) | 171 | ~93 MB | 115 / 171 (67%) |
| Christian spiritual (English) | 114 | ~56 MB | 115 / 115 (100%) |
| Christian spiritual (Chinese) | 219 | ~25 MB | 115 / 219 (52%) |
| TCM (English stubs) | 5 | <1 MB | 5/5 |
| Total | 509 | ~180 MB | 350 / 510 (69%) |
4.1 Traditional Chinese Medicine (171 ZH source files, 39 master agents)
The TCM corpus comprises classical medical texts spanning from the Yellow Emperor (~2500 BCE) to living National Grand Masters (still practicing as of 2026), organized as a roster of 39 master agents grouped by historical tier:
- Tier 1 — Classical Sages (16 masters): Huang Di (Yellow Emperor), Zhang Zhongjing, Hua Tuo, Huangfu Mi, Sun Simiao, Liu Wansu, Zhang Zihe, Li Dongyuan, Zhu Danxi, Li Shizhen, Zhang Jingyue, Wu Jutong, Ye Tianshi, Wang Qingren, Huang Yuanyu, Fu Qingzhu
- Tier 2 — Republican Era (5): Zhang Xichun, Cao Yingfu, Lu Yuanlei, Pu Fuzhou, Ding Ganren
- Tier 3 — Modern Classical Formula Revival (4): Hu Xishu, Liu Duzhou, Huang Huang, Fan Zhonglin
- Tier 4 — Contemporary National Grand Masters (10): Deng Tietao, Zhu Liangchun, Jiao Shude, Yan Dexin, Zhou Zhongying, Wang Qi, Lu Zhizheng, Ren Jixue, Gan Zuwang, Qiu Peiran
- Special / contemporary (4): Zheng Qinan (Fire Spirit School), Liu Lihong (Thinking Through Chinese Medicine), Ni Haixia, Hao Wanshan
Foundational works include 黄帝内经 (Huangdi Neijing, ~9 source files), 伤寒论 (Shanghan Lun, 18 versions), 本草纲目 (Bencao Gangmu, 5.2 MB), 千金方 / 千金翼方 (Sun Simiao, ~11 MB), 四圣心源 (Huang Yuanyu, 10 volumes), and contemporary Renji series transcripts (Ni Haixia, 6 lecture-text files / 5.8 MB).
These texts are written in Classical Chinese with highly standardized medical vocabulary. The term 气虚 (qi deficiency) has meant the same thing for two thousand years. It does not require semantic interpretation — it requires exact retrieval.
4.2 Christian Spiritual Classics (333 source files across EN+ZH, 37 master agents)
The spiritual corpus covers 1,900 years of contemplative and mystical Christian literature, organized as 37 master agents grouped into six tiers:
- Tier 1 — Church Fathers (4, 130–430 AD): Irenaeus, Athanasius, Chrysostom, Augustine
- Tier 2 — Contemplative Mystics (9, 14c–1897): Cloud of Unknowing author, Thomas à Kempis, Teresa of Ávila, John of the Cross, Brother Lawrence, Molinos, Francis de Sales, Madame Guyon, Thérèse of Lisieux
- Tier 3 — Reformation & Puritan (3, 1483–1688): Martin Luther, John Calvin, John Bunyan
- Tier 4 — Great Awakening & Revival (9, 1700–1898): Zinzendorf, Jonathan Edwards, John Wesley, George Whitefield, Charles Finney, Charles Spurgeon, D. L. Moody, Andrew Murray, George Müller
- Tier 5 — Missions & 20th-c. Revival (5, 1832–1951): Hudson Taylor, Jonathan Goforth, Amy Carmichael, Evan Roberts, Jessie Penn-Lewis
- Tier 6 — Modern & Chinese Church (7, 1885–1991): T. Austin-Sparks, A. W. Tozer, Martyn Lloyd-Jones, Dietrich Bonhoeffer, Watchman Nee, Wang Mingdao, Song Shangjie
All source texts are in the public domain, sourced primarily from Project Gutenberg, Internet Archive, CCEL, and austin-sparks.net (the latter explicitly waived to public domain by the author). The largest single corpora are Madame Guyon (5.9 MB EN + 3.3 MB ZH, 31 works), John Calvin (7.7 MB EN), Augustine (3.3 MB EN + 0.7 MB ZH), and T. Austin-Sparks (2.0 MB EN + 1.2 MB ZH, scraped from 12 books / 106 chapters).
These texts use a distinctive vocabulary — "dark night," "interior castle," "practicing the presence," "abiding in Christ," "得胜者", "破碎", "宇宙性的十字架" — that is specific enough for keyword search to work reliably. When a user asks about "the dark night of the soul," grep finds exactly the right passages. When a Chinese-speaking user asks 灵魂的暗夜, the Layer 2 concept files (compiled in Chinese from English originals) bridge the gap.
4.3 USCIS Civics (128 questions)
The civics corpus consists of the official 128 USCIS naturalization test questions with their approved answers. This is a closed, well-defined knowledge set that changes infrequently. Each question-answer pair is stored as a discrete entry, and any keyword from the question retrieves the corresponding answer with 100% reliability.
5. Comparative Analysis
We compare Knowledge Search against two established paradigms: Vector RAG (the standard embedding + vector database approach) and GraphRAG (Microsoft's graph-based retrieval system).
| Dimension | Knowledge Search | Vector RAG | GraphRAG |
|---|---|---|---|
| Retrieval Accuracy | 100% | ~85-95% | ~90-95% |
| Query Latency | <10ms | 50-200ms | 100-500ms |
| Preprocessing Time | 0 | Hours | Hours |
| Additional Memory | 0 | 500MB+ | 1GB+ |
| Infrastructure Dependencies | None | Vector DB + Embedding API | Graph DB + Embedding API + LLM for extraction |
| Maintenance on Corpus Update | Drop file in directory | Re-embed and re-index | Re-extract entities, re-build graph |
| Failure Modes | Query contains no domain keywords | Embedding drift, chunk boundary splits, ANN approximation errors | Entity extraction errors, incomplete graph, relationship hallucination |
| Explainability | Trivial (exact match + line number) | Low (embedding similarity score) | Medium (graph traversal path) |
| Cost | $0 | Embedding API calls + DB hosting | Embedding + LLM extraction + DB hosting |
| Lines of Code | ~30 | ~300-500 | ~1000+ |
5.1 On Accuracy
The 100% accuracy claim for Knowledge Search requires qualification. We define accuracy as: given a query that contains at least one domain-relevant keyword, does the system retrieve all and only the relevant passages? Under this definition, grep achieves perfect recall and high precision — it finds every occurrence of the search term and returns only passages containing it.
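In standard IR notation (not notation used by the production system), the claim can be written compactly: for a query $q$ containing a domain keyword $k$ over corpus $\mathcal{C}$, grep retrieves exactly the passages containing $k$, and the vocabulary-predictability argument of Section 6 is the assumption that every relevant passage states $k$ verbatim.

$$
R(q) = \{\, p \in \mathcal{C} : k \in p \,\}, \qquad
G(q) \subseteq R(q) \;\Longrightarrow\;
\operatorname{recall}(q) = \frac{|R(q) \cap G(q)|}{|G(q)|} = 1 .
$$

Precision is then limited only by how often $k$ occurs in passages that are not actually relevant to the question, which is rare when the keyword is a standardized term of art.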
Vector RAG's accuracy gap stems from multiple sources: embedding model limitations for specialized vocabulary, chunk boundary artifacts, and ANN approximation errors. In our testing with the TCM corpus, vector RAG consistently struggled with Classical Chinese terms that have no close equivalent in the embedding model's training data. The query "麻黄汤" (Mahuang Decoction) would sometimes retrieve passages about "桂枝汤" (Guizhi Decoction) because the two formulas share overlapping ingredient discussions and thus have similar embeddings.
GraphRAG's accuracy is higher than vector RAG for relationship-heavy queries but suffers from entity extraction errors — particularly for Classical Chinese texts where named entity recognition models perform poorly.
5.2 On Latency
The latency difference is not marginal. Knowledge Search completes in 2-8ms on the small civics corpus and 8-25ms on the full 180 MB / 500-file corpus — the time for a filesystem grep across all source and _compiled/ files. Vector RAG requires an embedding API call (20-100ms for remote, 10-50ms for local), followed by an ANN search (5-20ms), followed by optional re-ranking (20-100ms). GraphRAG adds graph traversal on top of these costs.
For interactive agents that make multiple knowledge retrievals per conversation turn, the cumulative latency difference is significant. A TCM diagnostic agent that cross-references herbs, formulas, and symptoms might make 3-5 retrieval calls per turn. At 5ms each, Knowledge Search adds 25ms. At 100ms each, vector RAG adds 500ms — a delay the user can perceive.
5.3 On Operational Simplicity
This is where the difference is starkest. Adding a new text to Knowledge Search requires one operation: copy the file into the knowledge directory. There is no re-indexing, no re-embedding, no schema migration. The file is immediately available for search on the next query.
Adding a new text to a vector RAG system requires: reading the file, chunking it according to the configured strategy, embedding each chunk via the embedding model, inserting the vectors into the database, and verifying the index. If the embedding model has been updated since the last indexing run, the entire corpus should be re-embedded for consistency.
6. Why It Works
Knowledge Search works because of a property shared by most domain-specific knowledge bases: predictable vocabulary.
6.1 Vocabulary Predictability in Specialized Domains
Medical texts do not use creative synonyms. When a TCM text discusses Astragalus root, it says 黄芪. It does not say "that yellowish root that boosts energy" or "the immune-enhancing legume." The vocabulary is standardized by millennia of scholarly convention.
The same property holds for spiritual texts (writers consistently use "contemplation," "union with God," "dark night"), legal texts (specific statute numbers, legal terms of art), and technical documentation (API names, error codes, configuration parameters).
This vocabulary predictability means that keyword search is not a crude approximation — it is the optimal retrieval strategy. The user's query and the relevant passage share literal string overlap. No semantic interpretation is needed because the domain vocabulary is already precise.
6.2 Bounded Corpus Size
Knowledge Search is designed for corpora that are large enough to exceed LLM context windows but small enough for filesystem grep to be fast. Our production corpus totals approximately 180 MB across ~500 source files plus ~2.7 MB of compiled Layer-2 entries. grep searches the combined index in 8-25ms on commodity hardware (Apple Silicon Mac mini).
This is not a limitation — it is a realistic description of most domain-specific knowledge bases. A medical practice's clinical guidelines, a law firm's case files, a company's internal documentation: these are typically measured in tens to hundreds of megabytes, not terabytes. The scaling properties of vector databases are irrelevant at these sizes. The 3.6× growth in our own corpus over 17 days (from ~50 MB / 21 agents at the original draft of this paper to 180 MB / 76 agents at v1.1) required no architectural change.
6.3 The LLM as Semantic Layer
The critical insight is that the LLM itself provides the semantic understanding that vector RAG attempts to encode in the retrieval layer. When grep returns eight lines of context around a match for 黄芪, the LLM reads those lines and understands the relationships, implications, and nuances that an embedding model would only approximate.
By keeping the retrieval layer dumb and exact, we avoid the failure mode where the retrieval system's "intelligence" disagrees with the LLM's understanding. There is no semantic gap between what was retrieved and what the LLM interprets, because the LLM is doing all the interpretation on raw text.
6.4 Layer 2 as Concept Bridge, Not Fallback
A subtle but crucial property emerges from compiling per-source _concepts.md and _faq.md files in the user's likely query language: the compiled Layer 2 acts as a multilingual semantic router back into the monolingual Layer 1 corpus. When a Chinese-speaking user asks about a concept whose original text exists only in English, the query matches the Chinese phrasing in the Layer 2 file, the LLM reads the cited source attribution, then re-issues a follow-up grep against the original English chapter. The LLM completes the bridge that the embedding model would have attempted to short-cut.
Empirically this is what makes the system feel like it understands cross-lingual queries while every actual retrieval operation remains a literal grep.
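A hedged sketch of this two-hop bridge, reusing the `knowledge_search` helper sketched in Section 3.1. In production the LLM itself reads the Layer 2 hit and chooses the follow-up term; the `extract_source_terms` function below is only an illustrative stand-in for that step.

```python
import re

def extract_source_terms(layer2_hits: str, limit: int = 3) -> list[str]:
    """Illustrative stand-in: pull quoted phrases out of the Layer 2 hit to use
    as follow-up grep terms. In production the LLM performs this selection."""
    return re.findall(r'"([^"]{8,80})"', layer2_hits)[:limit]

def bridged_search(query: str, knowledge_dir: str) -> str:
    """Two-hop retrieval: a query (e.g. in Chinese) first hits the compiled
    Layer 2 concept/FAQ files, then follow-up greps run against the original
    sources using terms found in those hits."""
    hop1 = knowledge_search(query, knowledge_dir)   # typically matches *_concepts.md / *_faq.md
    hop2 = "\n".join(knowledge_search(t, knowledge_dir)
                     for t in extract_source_terms(hop1))
    return hop1 + "\n" + hop2
```

Every individual operation is still a literal string match; the bridge exists only because the Layer 2 files were written in the user's likely query language ahead of time.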
6.5 Reproducibility Addendum: Failure and Recovery (2026-04-25)
Position papers are improved by including failure data. We document one such cycle.
The failure. On 2026-04-25 we issued the following query to one of our spiritual agents (slug austin_sparks, persona of T. Austin-Sparks, 1885–1971): "对'破碎'您怎么看?请用中文回答,引用您原书的话。" ("What is your view of 'brokenness'? Please answer in Chinese, quoting from your books.")
The agent returned a fluent, well-attributed answer containing five direct quotes, each with book name and chapter number:
| Quote | Claimed source |
|---|---|
| "未经破碎的器皿无论多大,都把基督装在自己的形状里" | The School of Christ, Ch. 1 |
| "破碎了,主自己的水才能流过来" | (same) |
| "十字架不只是赎罪的祭坛,是宇宙性的属灵原则" | Centrality and Universality of the Cross, Ch. 2 |
| "我们被带进一种情形,在那里我们不能再凭自己作什么" | We Beheld His Glory |
| "主的手是慈爱的手,但祂的手也是破碎的手" | The Arm of the Lord |
We then ran a literal grep -rin against the corresponding source files in input/spiritual_en/austin_sparks/ and input/spiritual_zh/austin_sparks/. Zero of the five quotes appeared in the corpus. The fluent attributions were fabrications.
Root cause analysis. The retrieval system was not at fault — grep and cat returned exactly what they were asked to. The fault was upstream, in the soul prompt itself. The persona file austin_sparks.soul.md contained authorial signature phrases written by the prompt author as voice-matching guidance, formatted as **"..."** (bold + quotation marks). Examples:
```text
**管道与器皿**:神的仆人不是表演者,是透明的管道。
"破碎了,主的水才能流过来"
一个未经破碎的器皿,无论恩赐多大,**都把基督装在自己的形状里**
```

(Roughly: "Channel and vessel: God's servant is not a performer but a transparent channel. 'Once broken, the Lord's water can flow through.' An unbroken vessel, however gifted, confines Christ to its own shape.")
The LLM, faced with a request to "quote from your books," parsed these as canonical text written by Austin-Sparks himself, then attached plausible chapter numbers to lend the fabrications structural authority. This is a known failure mode in instruction-tuned models when the prompt itself contains text formatted as quotation.
The fix. We applied two changes (commit 3e365a9):
1. Auto-strip soul-prompt fake quote markers across all 79 souls. A 60-line Python script regex-matched `\*\*"([^"]+)"\*\*` and replaced it with `*\1*` (italic only, no quotes), removing 41 instances across 13 soul files. The signature phrases survive as voice guidance but no longer wear the visual costume of canonical text (a minimal sketch of this script appears below).
2. Append a "Citation Hard Rules" block to all 79 souls. The block (under 200 words, marked `<!-- citation-rule-v1 -->` for idempotent future regeneration) instructs the agent that any quoted text must be a literal substring returned by knowledge_search in the current turn, and that signature phrases from the soul prompt are voice guidance, not canonical text to be quoted.
Total time from diagnosis to deploy: ~25 minutes, including a fleet restart for soul reload.
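A minimal sketch of the stripping step, assuming the `*.soul.md` naming seen above. The production script is described as 60 lines and handles additional cases (backups, reporting); this shows only the core substitution.

```python
import re
from pathlib import Path

FAKE_QUOTE = re.compile(r'\*\*"([^"]+)"\*\*')  # bold + quotation marks around a signature phrase

def strip_fake_quotes(souls_dir: str) -> int:
    """Rewrite **"..."** spans to *...* (italic only) in every soul prompt, so
    signature phrases no longer look like canonical, quotable text."""
    changed = 0
    for soul in Path(souls_dir).glob("*.soul.md"):
        text = soul.read_text(encoding="utf-8")
        new_text, n = FAKE_QUOTE.subn(r"*\1*", text)
        if n:
            soul.write_text(new_text, encoding="utf-8")
            changed += n
    return changed
```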
The recovery. We reissued the identical query. The agent returned a new answer with four quotes:
| Quote | Source claim | grep verification |
|---|---|---|
| "the vessel, thus wrought upon, is the message. People do not come to hear what you have to teach. They have come to see what you are" | Prophetic Ministry, Ch. 2 | ✅ 06_prophetic_ministry.txt:141 |
| "It is not that you have achieved something, but rather that you have been broken in the process" | Prophetic Ministry | ✅ 06_prophetic_ministry.txt:167 |
| "有受破碎、受拆毁之心的人...才能真实得着释放" | 主的膀臂 (Chinese) | ✅ 04_主的膀臂.txt:317 |
| "This brokenness, helplessness, hopelessness, and yet faith" | Centrality of the Cross, Ch. 2 | ✅ 04_centrality_universality_cross.txt:421 |
4/4 grep-verified. Citation accuracy moved from 0/5 to 4/4 in one reload cycle.
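The verification column above was produced by literal substring checks of this kind. A minimal sketch (the actual checks were run by hand with grep; quote normalization is simplified here):

```python
import subprocess

def verify_quote(quote: str, corpus_dir: str) -> list[str]:
    """Return every file:line location where the quoted string occurs verbatim.
    -F treats the quote as a fixed string, so punctuation inside it is harmless."""
    result = subprocess.run(
        ["grep", "-r", "-i", "-n", "-F", quote, corpus_dir],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()

# An empty list is the signal that flagged all five fabricated quotes above;
# each of the four post-fix quotes returns at least one file:line hit.
```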
Architectural implications. Three observations follow.
First, the zero-hallucination contract of Knowledge Search is architecturally guaranteed in the sense that any text wearing quote-marks must, after the citation hard-rule, correspond to a grep-locatable substring of an actual file. Drift is detectable by re-running grep against the agent's output. There is no equivalent test in vector RAG: a retrieved chunk that has been paraphrased by the LLM cannot be unambiguously traced to source.
Second, the failure was upstream of retrieval and the recovery was upstream of retrieval. No model was retrained; no embedding was recomputed; no infrastructure was changed. A 60-line Python script and a 200-word prompt addendum, applied to text files on disk, restored the safety property. This is the cost-of-correction profile of a system whose intelligence lives in plain text.
Third, the failure is reproducible and auditable: the bad commit, the diagnostic grep, the fix commit, and the post-fix grep are all recorded as git history in the public LocalKin repositories. Researchers wishing to replay the cycle can git checkout either side of commit 3e365a9 and reproduce both halves of the experiment with the same query.
We include this section because we believe the most useful systems papers honestly report not just the working mode but the failure-and-recovery mode, with timestamps. Architectures that cannot be diagnosed and patched on this timescale are architectures that cannot be trusted.
7. Limitations and Honest Boundaries
We do not claim Knowledge Search is a universal replacement for vector RAG. It has clear limitations that practitioners should understand.
7.1 Open-Domain General Knowledge
Knowledge Search requires a bounded corpus with predictable vocabulary. It is not suitable for open-domain question answering where the relevant information could appear in any text using any vocabulary. A general-purpose chatbot that needs to answer questions about arbitrary topics should use vector RAG or web search.
7.2 Semantic Similarity Search
When the user's intent cannot be expressed as any keyword in any language — "texts that discuss a vague longing for transcendence" — grep will not help. Vector RAG's ability to match semantic similarity, despite its imprecisions, is genuinely valuable for this class of open-ended conceptual queries.
Our mitigation (Layer 2 per-source LLM-compiled concept and FAQ files) substantially closes this gap by pre-translating concepts into the user's likely query language during nightly compilation. As of 2026-04-25, 350/510 sources have such bilingual companion files. The remaining gap — truly open-ended semantic queries that match no concept name in any author's vocabulary — is real but smaller than it appears in casual analysis. We have not yet found a production query in the spirituality or TCM domains that grep + Layer 2 fails to handle.
7.3 Cross-Lingual Retrieval Without Shared Vocabulary
Our TCM corpus is primarily Chinese and our spiritual corpus is bilingual (Chinese where translations exist, otherwise English). A query in one language about a concept whose source text exists only in the other language is bridged by Layer 2: the LLM-compiled _concepts.md and _faq.md files for each source are generated in the language detected from the source filename and content, but cross-language Q&A is increasingly the norm. The 0/5 → 4/4 case in §6.5 is itself a cross-lingual experiment: a Chinese query against a primarily English corpus, recovering correct citations from English originals.
That said, Layer 2 cross-lingual coverage at first compilation depends on the LLM's bilingual training. We chose Kimi 2.6 (kimi-k2.6:cloud, served via local Ollama) over Anthropic Haiku 4.5 partly because Kimi's training corpus appears richer in Chinese theological and TCM vocabulary, producing higher-fidelity concept extracts on these domains.
7.4 Very Large Corpora
At corpus sizes beyond ~1GB, filesystem grep latency becomes noticeable. At 10GB+, it becomes impractical for interactive use. Vector databases with pre-built indexes maintain sub-100ms query times regardless of corpus size. For truly large-scale knowledge bases, the infrastructure overhead of vector RAG is justified by the scaling requirements.
8. Production Integration
Knowledge Search is not a standalone system — it is a skill within the LocalKin multi-agent platform, invoked by agents as needed during conversation.
8.1 Agent Integration
The knowledge_search skill exposes a simple interface: given a query string and a knowledge domain, return matching passages from both Layer 1 raw sources and Layer 2 compiled concepts/FAQ. As of 2026-04-25 it is used by:
- 39 TCM master agents (deployed at heal.localkin.ai): Yellow Emperor, Zhang Zhongjing, Hua Tuo, Sun Simiao, Li Shizhen, Huang Yuanyu, Ye Tianshi, Liu Lihong, Ni Haixia, and 30 others spanning 4,500 years.
- 37 spiritual master agents (deployed at faith.localkin.ai): Irenaeus, Augustine, Thomas à Kempis, Madame Guyon, Martin Luther, John Calvin, John Bunyan, George Müller, Hudson Taylor, T. Austin-Sparks, A. W. Tozer, Watchman Nee, Wang Mingdao, and 24 others spanning 1,900 years (130 AD – 1991 AD).
- 1 citizenship coach: queries the USCIS civics corpus for naturalization test preparation.
A single diagnostic turn for Zhang Zhongjing's agent might involve three sequential knowledge searches: one for the presenting symptom pattern, one for the relevant herbal formula, one for contraindications. Total retrieval time: ~15-30ms across the now-larger corpus. The agent's response generation (LLM inference, served by kimi-k2.5:cloud Ollama for the master persona, with Claude Haiku 4.5 as fallback) takes 2-5 seconds. Retrieval is never the bottleneck.
Multi-master debate (cross-fleet) is implemented as parallel streamChat calls against multiple agents, each independently invoking knowledge_search against its own corpus directory. Two agents in a debate never share retrieval state; each grounds in its own master's writings. Architecturally this is enabled by the per-author corpus directory layout — there is no shared index that needs to be partitioned by author.
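Because each master has its own corpus directory, retrieval fan-out for a debate is embarrassingly parallel. A sketch under that assumption, reusing the `knowledge_search` helper from Section 3.1 (the directory paths and the thread-pool fan-out are illustrative; production issues parallel streamChat calls at the gateway layer rather than calling a helper like this):

```python
from concurrent.futures import ThreadPoolExecutor

def debate_retrieval(query: str, agent_dirs: dict[str, str]) -> dict[str, str]:
    """Run the same query against each debating agent's private corpus directory.
    No shared index exists, so there is nothing to partition or lock."""
    with ThreadPoolExecutor() as pool:
        futures = {slug: pool.submit(knowledge_search, query, path)
                   for slug, path in agent_dirs.items()}
        return {slug: fut.result() for slug, fut in futures.items()}
```

For example, two agents debating "brokenness" would each be handed only the hits from their own master's directory (e.g. input/spiritual_en/austin_sparks vs. another author's directory), which is what keeps each voice grounded in its own writings.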
8.2 Skill Implementation
The retrieval-side skill remains approximately 30 lines of shell. The core logic:
```sh
# Single grep walks both raw sources and _compiled/ concept+FAQ files.
# The agent never knows which layer matched — it just sees passages.
results=$(grep -r -i -n -C 8 "$query" "$KNOWLEDGE_DIR" 2>/dev/null)
echo "$results"
```
The compilation-side skill (knowledge_compile.py, ~400 lines of Python) is more substantial but runs only at scheduled times, not at query time. It exposes five actions:
| Action | Purpose |
|---|---|
| status | Report compilation coverage per author + per domain |
| list_needed | List source files awaiting compilation |
| compile | Compile one specified source file |
| compile_author | Compile all source files for one author |
| compile_batch | Compile up to --limit N uncompiled files across a domain |
There is no configuration file at query time. There is no service to start. There is no embedding model to load. The skill works on any Unix-like system with a filesystem and Python 3.10+; curl is the only network call (to localhost Ollama).
9. Autonomous Corpus Growth
A common objection to our approach is that it requires manual corpus curation, and that the Layer 2 concept/FAQ files require manual or paid LLM compilation. We address both objections with a fully autonomous, free, daily compilation pipeline.
9.1 Source-File Acquisition
New raw texts are added by copying them into the appropriate knowledge directory:
```text
input/spiritual_en/<slug>/<filename>.txt
input/spiritual_zh/<slug>/<filename>.md
input/tcm_zh/<slug>/<filename>.md
```
There is no re-indexing step, no re-embedding step, no pipeline to trigger. The next grep query immediately finds the new file. In the past 30 days the system has absorbed 14 new master personas and ~50 source files this way, including a one-day acquisition of 12 books / 106 chapters of T. Austin-Sparks corpus from austin-sparks.net via standard curl + a 100-line Python scraper.
9.2 Layer 2 Cron-Based Compilation
The Layer 2 _compiled/ directory grows automatically via three nightly cron entries (in ~/.localkin/cron.yaml):
- name: "knowledge-growth-spiritual-en" cron: "0 5 * * *" shell: "python3 skills/knowledge_compile/compile.py --action compile_batch --domain spiritual_en --limit 30" - name: "knowledge-growth-spiritual-zh" cron: "30 5 * * *" shell: "python3 skills/knowledge_compile/compile.py --action compile_batch --domain spiritual_zh --limit 30" - name: "knowledge-growth-tcm-zh" cron: "0 6 * * *" shell: "python3 skills/knowledge_compile/compile.py --action compile_batch --domain tcm_zh --limit 30"
The 30-minute stagger between the three jobs avoids concurrent load on the local Ollama instance serving Kimi 2.6. Each job processes up to 30 uncompiled source files per night, generating both <source>_concepts.md and <source>_faq.md per file. At ~24 seconds per file (two LLM calls), each job completes in ~12 minutes, totalling ~36 minutes of Ollama work per night.
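The per-night batch behaviour (nightly limit, idempotent skip, two compiled outputs per source) can be sketched as follows. This is not the production compile.py: `compile_prompt` is the fallback helper sketched in Section 3.2, and `CONCEPTS_PROMPT` / `FAQ_PROMPT` stand in for the two prompts described there.

```python
from pathlib import Path

def compile_batch(domain_dir: str, limit: int = 30) -> int:
    """Compile up to `limit` uncompiled source files in one domain.

    A source is skipped when both of its _compiled/ companions already exist,
    which is what makes the nightly cron job safe to re-run (idempotent)."""
    done = 0
    for src in sorted(Path(domain_dir).rglob("*")):
        if not src.is_file() or src.suffix not in {".txt", ".md"} or "_compiled" in src.parts:
            continue
        out_dir = src.parent / "_compiled"
        concepts = out_dir / f"{src.stem}_concepts.md"
        faq = out_dir / f"{src.stem}_faq.md"
        if concepts.exists() and faq.exists():
            continue                      # already compiled on a previous night
        if done >= limit:
            break                         # nightly cap per domain
        out_dir.mkdir(exist_ok=True)
        text = src.read_text(encoding="utf-8", errors="ignore")[:150_000]  # ~150 KB truncation
        concepts.write_text(compile_prompt(CONCEPTS_PROMPT + "\n\n" + text), encoding="utf-8")
        faq.write_text(compile_prompt(FAQ_PROMPT + "\n\n" + text), encoding="utf-8")
        done += 1
    return done
```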
Cost analysis. Migration from Anthropic Haiku 4.5 (paid API) to Kimi 2.6 (local Ollama) reduced per-file cost from ~$0.01 to $0.00. At 90 files/night sustained, the previous regime would have cost ~$0.90/day or ~$330/year. The current regime costs $0/year in API fees; electricity for the always-on Mac mini is unmetered.
9.3 Empirical Growth Curve
The architecture's "scales without re-architecting" claim is empirically supported by the system's own growth in 17 days:
| Date | Agents | Source files | Layer 2 compiled | Notes |
|---|---|---|---|---|
| 2026-04-08 (paper v1.0) | 21 | ~162 | ~30 (NotebookLM, manual) | One-shot manual compilation |
| 2026-04-21 | 64 | ~250 | ~47 | After Wave 1 spiritual expansion |
| 2026-04-24 | 73 | ~480 | ~135 | Mid catch-up sweep |
| 2026-04-25 (paper v1.1) | 76 | ~510 | ~350 (68%) | After cron migration to Kimi 2.6 |
The 3.6× growth in agent count and 3.2× growth in corpus size required no architectural changes: no schema migration, no embedding model retraining, no infrastructure provisioning. The same grep invocation works against the larger corpus with the same code path.
9.4 Failure Modes Observed in Production
We document the failure modes encountered during this 17-day growth, which a vector RAG system would have manifested differently:
- Alphabetical port-shift breakage (2026-04-24). Adding three new spiritual masters (Kempis, de Sales, Austin-Sparks) shifted alphabetical port assignments across the fleet, pushing four TCM agents above the gateway's discovery range ceiling (port 9350). Symptom: "unknown agent" errors. Fix: bump the ceiling to 9450 in one line of Go (`fleetPorts = [][2]int{{9100, 9450}}`) and rebuild the gateway. Time-to-fix: 3 minutes including restart. There is no equivalent failure mode in vector RAG because there is no per-agent corpus partition.
- Soul-prompt fake-quote regression (2026-04-25). Documented in detail in §6.5 above. Time-to-fix: 25 minutes.
- Mislabeled author corpus (2026-04-24). The `ni_haixia/` directory had been seeded with reading notes about other authors' books, not Ni Haixia's own teaching. Symptom: agent answers using third-party content but attributed to Ni Haixia. Fix: replace with 6 actual Renji series transcripts (5.8 MB) from a public-domain GitHub repository; rerun `compile.py --action compile_author --author ni_haixia`. Time-to-fix: 1 hour, mostly download time. The retrieval architecture made the diagnosis obvious — `grep -l` immediately revealed the false attributions.
- Middleware slug drift (2026-04-24). The web frontend's middleware was hard-coding the master slug list, which drifted behind the live `masters.ts` source after the Wave 1+2 expansion. Symptom: 10 master URLs returned 404. Fix: change the middleware to import the slug list from `masters.ts` directly (-75 lines of duplication). This is not a Knowledge Search failure, but it illustrates the broader pattern of "drift between source-of-truth and shadow copies" that this paper's architecture is designed to avoid.
In each case the root cause was diagnosable by reading source files and grepping. Recovery did not require model retraining, embedding recomputation, or vector store reindexing. We submit this as quiet evidence for the paper's central claim: when the system's intelligence lives in plain text, the system's repair lives in plain text.
10. Discussion: Retrieval Doesn't Need Intelligence
The machine learning community has a tendency to solve every problem with more machine learning. Retrieval is a case study in this tendency. The progression from BM25 to dense retrieval to learned sparse retrieval to multi-vector retrieval to GraphRAG represents increasing model complexity applied to the retrieval layer — each step adding parameters, training data requirements, and infrastructure dependencies.
We suggest this progression has overshot for a large class of practical applications. When the knowledge base is domain-specific, the vocabulary is predictable, and the corpus is bounded, the optimal retrieval system is the one that has been available since 1973: grep (Thompson, 1973).
This is not an argument against embeddings or vector databases in general. It is an argument against the reflexive application of complex systems to problems that do not require them. The engineering decision should be: does my retrieval problem require semantic understanding at the retrieval layer, or can I defer that understanding to the LLM?
For the majority of domain-specific grounding tasks we have encountered — and we have deployed agents across medicine, spirituality, and civics — the answer is: defer it to the LLM. Let retrieval be fast, exact, and dumb. Let the LLM be the intelligence.
10.1 Implications for the Field
If our findings generalize — and we believe they do, for the class of problems described — then the standard advice to "just use RAG" deserves significant qualification. Practitioners building domain-specific LLM agents should consider keyword search as a first approach, not a last resort. The burden of proof should be on the complex system to justify its complexity, not on the simple system to justify its simplicity.
10.2 The Broader Pattern
Knowledge Search is an instance of a broader pattern we observe in LLM application development: the best systems are often the ones that let the LLM do more and the infrastructure do less. Sophisticated retrieval, complex orchestration, elaborate prompt chains — these are often symptoms of underestimating the LLM's ability to handle messy, unstructured input.
Give the model the raw text. Give it enough context. Get out of the way.
11. Conclusion
We have presented Knowledge Search, a two-layer retrieval system that replaces the standard vector RAG pipeline with grep over raw text plus grep over LLM-compiled per-source concept and FAQ files. Deployed across 76 specialized LLM agents serving three knowledge domains with ~500 primary source texts (~180 MB), it achieves 100% retrieval accuracy at 8-25ms latency with zero per-query preprocessing, zero infrastructure dependencies at query time, and approximately 30 lines of retrieval-side implementation code (the Layer 2 compilation skill is ~400 lines and runs only on a nightly cron at $0/night).
The system works because domain-specific knowledge bases have predictable vocabulary, bounded size, and deterministic search requirements — properties that make keyword search not merely adequate but optimal. The semantic understanding needed to synthesize retrieved passages into useful answers is provided by the LLM itself, making intelligent retrieval redundant. The Layer 2 concept-and-FAQ files, automatically generated by a free local LLM, provide a multilingual semantic bridge into the literal Layer 1 corpus without any embedding store.
We have additionally documented one full failure-and-recovery cycle (§6.5): a deliberate prompt-engineering regression collapsed citation accuracy to 0/5 grep-verified quotes; a 25-minute fix (a 60-line Python script and a 200-word soul-prompt addendum) restored it to 4/4. The architecture's safety properties recover through prompt hygiene alone — no retraining, no infrastructure change. We submit this as a stronger form of "100% retrieval accuracy" than is typical in retrieval papers: not just a benchmark number, but a reproducible recovery path.
We do not claim this approach replaces vector RAG for all applications. We claim it replaces vector RAG for more applications than the current consensus assumes — and that the autonomous nightly compilation pipeline closes most of the gap on the cross-lingual and conceptual queries where pure keyword search was previously inadequate. Before reaching for embeddings, vector databases, and approximate nearest neighbor search, ask: would grep work — and if not, would grep plus a nightly cron of LLM-compiled concept files work? You might be surprised how often the answer to one of these is yes.
References
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
Thompson, K. (1973). The UNIX command language. Structured Programming, Infotech State of the Art Report, 375-384.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Robertson, S. E., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., ... & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., ... & Yih, W. T. (2020). Dense passage retrieval for open-domain question answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6769-6781.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36. (Cited as a related work on self-correcting LLM agents — the prompt-hygiene fix in §6.5 is a non-RL instance of the same recovery pattern.)
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., ... & Clark, P. (2023). Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.
Correspondence: The LocalKin Team — contact@localkin.ai. This paper describes the knowledge retrieval system deployed in LocalKin (https://localkin.dev) as of v1.1.0 / 2026-04-25. Reproduction artifacts (souls, scripts, and the failure-and-recovery git history of §6.5) are public at https://github.com/LocalKinAI.
"Grep is All You Need" is a deliberate homage to Vaswani et al. (2017). We trust the irony is not lost.
Cite as:
```bibtex
@misc{localkin2026grep,
  author    = {{The LocalKin Team}},
  title     = {Grep is All You Need: Zero-Preprocessing Knowledge Retrieval for LLM Agents},
  year      = {2026},
  month     = apr,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19777260},
  url       = {https://doi.org/10.5281/zenodo.19777260},
  note      = {Correspondence: contact@localkin.ai; code at https://github.com/LocalKinAI/grep-is-all-you-need}
}
```