How We Built a Full-Text Search Engine in 400 Lines of Node.js — No Elasticsearch, No Dependencies
Every search tutorial starts the same way: "First, spin up Elasticsearch." Or Meilisearch. Or Typesense. Or Algolia at $1/1000 searches. We needed search across 244 documents generated by our autonomous agent fleet — research reports, daily briefs, competitive intelligence, training materials, blog posts — and we needed it to run on the same VPS already hosting 25+ microservices. Adding another database wasn't an option. So we built our own.
The whole thing is ~400 lines of Node.js with zero npm dependencies. It indexes 20,697 unique words, searches in under 5ms, and has been running in production at Ledd Consulting for months with hourly re-indexing. Here's how we built it, what scoring decisions we made, and what surprised us along the way.
The Problem — Search Across a Living Knowledge Base
Our platform generates documents continuously. Research agents produce daily briefs. Analysis pipelines write competitive intelligence reports. Training documents accumulate. Blog posts get published. Action items get extracted. By the time we needed search, we had seven distinct document sources spread across the filesystem plus a CMS database:
const SOURCES = [
{ dir: 'knowledge', pattern: 'KNOWLEDGE-BASE.md', type: 'knowledge', label: 'Knowledge Base' },
{ dir: 'research/reports', pattern: '.md', type: 'report', label: 'Research Report' },
{ dir: 'research/briefs', pattern: '.md', type: 'brief', label: 'Daily Brief' },
{ dir: 'research/actions', pattern: '.json', type: 'action', label: 'Action Item' },
{ dir: 'training', pattern: '.md', type: 'training', label: 'Training' },
{ dir: 'skill-reports', pattern: '.md', type: 'skill', label: 'Skill Report' },
{ dir: 'competitor-reports', pattern: '.md', type: 'competitor', label: 'Competitor Intel' },
];
The requirements were straightforward: agents and dashboards need to query "what do we know about X?" and get ranked results with context snippets — fast. Elasticsearch would have cost us 512MB+ of RAM on a VPS already running 35 services. SQLite FTS5 was an option, but we wanted scoring we could fully control — recency boosts, source-type weighting, phrase matching, topic extraction. We wanted to own every line of the ranking logic.
Architecture Overview
The system is a single Node.js HTTP server with four in-memory data structures:
┌──────────────────────────────────────────────────────┐
│ Knowledge Search (port 8080) │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐│
│ │ Inverted │ │ Document │ │ Topic ││
│ │ Index │ │ Store │ │ Counts ││
│ │ word→docs[] │ │ path→meta │ │ topic→count ││
│ └─────────────┘ └──────────────┘ └──────────────┘│
│ ┌──────────────────────────────────────────────────┐│
│ │ File Mtimes (incremental reindex tracking) ││
│ └──────────────────────────────────────────────────┘│
│ │
│ REST API: /search /topics /timeline /stats /recent │
└──────────────────────────────────────────────────────┘
▲ ▲ ▲
│ │ │
┌─────┘ ┌───────┘ ┌────────┘
│ │ │
Markdown JSON Actions Blog CMS
Files Files (SQLite)
No external dependencies. No npm install. The http, fs, path, and url modules from Node.js core are the entire dependency tree. This was a deliberate choice: we've been burned before by transitive dependency breakage in node_modules, and we weren't willing to risk one of our 25+ services going down that way.
Implementation Walkthrough
The Inverted Index — Simple but Sufficient
The core data structure is a textbook inverted index: a JavaScript object mapping each word to an array of locations where it appears.
let invertedIndex = {}; // word -> [{ file, line, context, date, title, source }]
let documents = {}; // filepath -> { title, date, source, label, wordCount, lineCount, preview }
let topicCounts = {}; // topic -> count
let fileMtimes = {}; // filepath -> mtime (for incremental reindex)
When a document gets indexed, we strip Markdown formatting, tokenize the plain text, then walk every line building position entries:
function indexDocument(filepath, content, title, date, source, label) {
const stripped = stripMarkdown(content);
const words = tokenize(stripped);
const lines = stripped.split('\n');
documents[filepath] = {
title, date, source, label,
wordCount: words.length,
lineCount: lines.length,
preview: stripped.substring(0, 300).replace(/\n/g, ' ').trim(),
};
// Build inverted index
const wordPositions = {};
for (let i = 0; i < lines.length; i++) {
const lineWords = tokenize(lines[i]);
if (lineWords.length === 0) continue;
// Context is the line plus its neighbors — compute once per line, not per word
const contextStart = Math.max(0, i - 1);
const contextEnd = Math.min(lines.length - 1, i + 1);
const context = lines.slice(contextStart, contextEnd + 1).join(' ').substring(0, 250);
for (const word of lineWords) {
if (!wordPositions[word]) wordPositions[word] = [];
wordPositions[word].push({
file: filepath, line: i + 1, context, date, title, source, label,
});
}
}
// Merge into global index — cap at 3 occurrences per doc per word
for (const [word, positions] of Object.entries(wordPositions)) {
if (!invertedIndex[word]) invertedIndex[word] = [];
invertedIndex[word].push(...positions.slice(0, 3));
}
}
Two decisions here worth explaining. First, we store context (the surrounding lines) at index time, not query time. This costs more memory but eliminates file I/O during search — every result comes back with a snippet without touching the filesystem. Second, we cap at 3 position entries per word per document. A 5,000-word research report might mention "agent" 200 times. We don't need 200 index entries for it; 3 gives us enough context for result snippets while keeping the index manageable. With 244 documents, our index holds 20,697 unique words comfortably in memory.
Tokenization and Stop Words — Where Search Quality Lives
The tokenizer is brutally simple, and that's intentional:
function tokenize(text) {
return text.toLowerCase()
.replace(/[^a-z0-9\s-]/g, ' ')
.split(/\s+/)
.filter(w => w.length > 2 && !STOP_WORDS.has(w));
}
Lowercase everything. Strip non-alphanumeric characters (except hyphens, so "multi-agent" stays intact). Split on whitespace. Drop words under 3 characters. Drop stop words. That's it. No stemming. No lemmatization.
We considered adding a Porter stemmer — "monitoring" and "monitor" should match. But in practice, our prefix matching in the search function (we'll get to that) handles most of these cases, and stemming introduced weird false positives in our domain. "Agent" and "agency" are completely different concepts in our world; a stemmer collapses them.
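To make that behavior concrete, here is the tokenizer run standalone with a trimmed stop-word list (the production list has 80+ entries):

```javascript
// Minimal sketch: the article's tokenizer with a trimmed stop-word list.
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'is', 'are', 'of', 'for']);

function tokenize(text) {
  return text.toLowerCase()
    .replace(/[^a-z0-9\s-]/g, ' ')   // keep hyphens so "multi-agent" survives
    .split(/\s+/)
    .filter(w => w.length > 2 && !STOP_WORDS.has(w));
}

console.log(tokenize('The Multi-Agent Fleet is monitoring 24 pipelines!'));
// → [ 'multi-agent', 'fleet', 'monitoring', 'pipelines' ]
```

Note that "24" disappears because it is under three characters, and "the"/"is" fall to the stop-word filter.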
The stop word list is 80+ common English words stored in a Set for O(1) lookup:
const STOP_WORDS = new Set([
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
'by', 'from', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might', 'shall',
'can', 'this', 'that', 'these', 'those', 'it', 'its', 'not', 'no', 'nor', 'as',
// ... 40+ more
]);
The Scoring Algorithm — Where We Spent the Most Time
This is where a toy project becomes a useful tool. Raw inverted index lookups give you matching documents. Scoring determines whether the user sees the right one first. Our scoring is a weighted point system with five signals:
function search(query, sourceFilter = null, limit = 50) {
const queryWords = tokenize(query);
const scores = {};
const contexts = {};
for (const word of queryWords) {
// Exact word match: +10 points per occurrence
if (invertedIndex[word]) {
for (const entry of invertedIndex[word]) {
if (sourceFilter && entry.source !== sourceFilter) continue;
const key = entry.file;
if (!scores[key]) { scores[key] = 0; contexts[key] = []; }
scores[key] += 10;
if (contexts[key].length < 3) contexts[key].push(entry.context);
}
}
// Prefix match: +3 points (handles "monitor" matching "monitoring")
for (const indexWord of Object.keys(invertedIndex)) {
if (indexWord !== word && indexWord.startsWith(word) && indexWord.length <= word.length + 4) {
for (const entry of invertedIndex[indexWord]) {
if (sourceFilter && entry.source !== sourceFilter) continue;
const key = entry.file;
if (!scores[key]) { scores[key] = 0; contexts[key] = []; }
scores[key] += 3;
}
}
}
}
Exact matches get 10 points. Prefix matches get 3 — enough to surface related content, not enough to outrank exact hits. The indexWord.length <= word.length + 4 guard caps prefix expansion at 4 additional characters, so "net" would match "network" but not "networking". And since the tokenizer drops words under three characters, a two-letter query like "AI" never reaches the index at all.
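The effect of the cap is easiest to see in isolation. This sketch filters a handful of hypothetical index keys the same way the search loop does:

```javascript
// Hypothetical index keys, to illustrate the +4-character prefix cap.
const indexWords = ['monitor', 'monitoring', 'monitored', 'network', 'networking'];

function prefixMatches(word) {
  return indexWords.filter(iw =>
    iw !== word && iw.startsWith(word) && iw.length <= word.length + 4);
}

console.log(prefixMatches('monitor')); // → [ 'monitoring', 'monitored' ]
console.log(prefixMatches('net'));     // → [ 'network' ] ('networking' is 10 chars, over 3 + 4)
```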
Then we layer on boosts:
for (const [filepath, score] of Object.entries(scores)) {
const doc = documents[filepath];
if (!doc) continue;
// Phrase match: +50 (the full query appears verbatim in context).
// queryLower and isPhrase are derived from the raw query earlier in search() (not shown).
if (isPhrase) {
const hasPhrase = (contexts[filepath] || []).some(ctx =>
ctx.toLowerCase().includes(queryLower)
);
if (hasPhrase) scores[filepath] += 50;
}
// Title match: +40
if (doc.title && doc.title.toLowerCase().includes(queryLower)) {
scores[filepath] += 40;
}
// Recency boost: +5 to +20 based on document age
if (doc.date) {
const daysAgo = (Date.now() - new Date(doc.date).getTime()) / (1000 * 60 * 60 * 24);
if (daysAgo <= 1) scores[filepath] += 20;
else if (daysAgo <= 3) scores[filepath] += 15;
else if (daysAgo <= 7) scores[filepath] += 10;
else if (daysAgo <= 14) scores[filepath] += 5;
}
// Source type boost: canonical sources rank higher
if (doc.source === 'knowledge') scores[filepath] += 25;
else if (doc.source === 'brief') scores[filepath] += 15;
else if (doc.source === 'action') scores[filepath] += 10;
}
The score weights were tuned empirically. The phrase match bonus (+50) is the highest single signal because when someone searches "multi-agent orchestration," a document containing that exact phrase is almost certainly what they want. Title matches (+40) come next — if the query appears in the title, that document is probably about the query, not just mentioning it. Recency decays in steps: +20 for documents from the last day, down through +15 and +10 to +5 within two weeks, then zero for anything older. This keeps search results fresh without burying evergreen content.
Source-type boosting was the most debated decision. Our cumulative knowledge base (+25) is a curated, deduplicated synthesis. Daily briefs (+15) are summarized intelligence. Individual reports get no boost. This means a search for "vector embeddings" returns the knowledge base entry first, the latest brief second, and individual reports after — matching how our agents actually consume information.
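Putting the signals together, here is the arithmetic for a hypothetical best case: the query "vector embeddings" hitting a knowledge-base entry updated yesterday that also carries the phrase in its title.

```javascript
// Worked example with hypothetical match counts; weights are the article's.
const score =
    2 * 10  // one exact index entry each for "vector" and "embeddings"
  + 50      // the full phrase appears verbatim in a stored context snippet
  + 40      // the phrase also appears in the document title
  + 20      // document is less than a day old
  + 25;     // source === 'knowledge' (curated knowledge base)
console.log(score); // → 155
```

An individual report matching only the two words would score 20, so the curated entry wins by a wide margin.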
The API Layer — Seven Endpoints, Zero Dependencies
The HTTP server is raw Node.js. No Express, no Fastify:
const server = http.createServer((req, res) => {
const parsed = url.parse(req.url, true);
const params = new URLSearchParams(parsed.search || '');
switch (parsed.pathname) {
case '/health': return handleHealth(req, res);
case '/search': return handleSearch(req, res, params);
case '/topics': return handleTopics(req, res);
case '/timeline': return handleTimeline(req, res, params);
case '/stats': return handleStats(req, res);
case '/recent': return handleRecent(req, res);
case '/reindex': return handleReindex(req, res);
default:
return sendJSON(res, 404, { error: 'Not found',
endpoints: ['/health', '/search', '/topics', '/timeline', '/stats', '/recent', '/reindex'] });
}
});
The /timeline endpoint deserves a mention — it groups search results by date, letting our dashboards show "evolution of topic X over time." When an agent asks "what has our research said about quantum computing this month?", the timeline endpoint returns results bucketed by day. It's search + aggregation in one call.
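The handler itself isn't shown in the article, but the core of it is a date-bucketing pass over ordinary search results. A sketch, assuming each result carries the date field stored in the index entries:

```javascript
// Group search results into per-day buckets, as /timeline might do it.
function bucketByDate(results) {
  const buckets = {};
  for (const r of results) {
    const day = (r.date || 'undated').substring(0, 10); // normalize to YYYY-MM-DD
    if (!buckets[day]) buckets[day] = [];
    buckets[day].push(r);
  }
  return buckets;
}

const sample = [
  { title: 'Quantum brief', date: '2025-06-03T08:00:00Z' },
  { title: 'Quantum report', date: '2025-06-03T17:30:00Z' },
  { title: 'Older note', date: '2025-05-28' },
];
console.log(Object.keys(bucketByDate(sample))); // → [ '2025-06-03', '2025-05-28' ]
```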
The server re-indexes automatically every hour. The full rebuild across 244 documents and a blog CMS (queried via SQLite CLI to avoid adding any npm dependency) completes in under a second.
What Surprised Us
Prefix matching is expensive without a trie. Our prefix search iterates over all 20,697 keys in the inverted index for every query word. With a small corpus, this is fine — it runs in single-digit milliseconds. But we watched it closely as the index grew. At our current scale, iterating 20K keys per query word is negligible. At 200K keys, we'd need a trie or sorted array with binary search. We left the simple implementation and documented the threshold.
Markdown stripping is more important than stemming. Our first prototype skipped Markdown stripping, and search quality was terrible. A query for "agent" would rank a document higher because it appeared in [agent](https://...) links, table separators |agent|, and header markup ## Agent Architecture. The stripping function handles headers, bold/italic, links, code blocks, lists, and table syntax:
function stripMarkdown(text) {
return text
.replace(/#{1,6}\s*/g, '')
.replace(/\*{1,3}([^*]+)\*{1,3}/g, '$1')
.replace(/\[([^\]]+)\]\([^)]+\)/g, '$1')
.replace(/`{1,3}[^`]*`{1,3}/g, '')
.replace(/^\s*[-*+]\s+/gm, '')
.replace(/\|/g, ' ')
.replace(/---+/g, '')
.replace(/\n{3,}/g, '\n\n');
}
This single function improved search result relevance more than any scoring change we made.
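A quick run shows the effect: markup-only occurrences of a term vanish while prose survives. The function is the one above, repeated here so the snippet runs standalone:

```javascript
// The article's stripMarkdown, reproduced so this snippet is self-contained.
function stripMarkdown(text) {
  return text
    .replace(/#{1,6}\s*/g, '')
    .replace(/\*{1,3}([^*]+)\*{1,3}/g, '$1')
    .replace(/\[([^\]]+)\]\([^)]+\)/g, '$1')
    .replace(/`{1,3}[^`]*`{1,3}/g, '')
    .replace(/^\s*[-*+]\s+/gm, '')
    .replace(/\|/g, ' ')
    .replace(/---+/g, '')
    .replace(/\n{3,}/g, '\n\n');
}

const noisy = '## Agent Architecture\nSee the [agent](https://example.com) docs.\n|agent|status|';
console.log(stripMarkdown(noisy));
// Prints plain text: the header marks, link markup, and table pipes are gone.
```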
Topic extraction became a product feature. We added regex-based topic tracking (31 patterns covering AI, blockchain, compliance, cloud infrastructure, etc.) as a debugging tool — we wanted to know what our research pipeline was actually covering. Our dashboards started consuming the /topics endpoint to show coverage heatmaps, and agents started using it to identify knowledge gaps. A debugging feature became a core capability.
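The article doesn't list the 31 patterns, so the three below are illustrative, but the mechanism is just one regex test per topic at index time:

```javascript
// Hedged sketch of regex-based topic counting; patterns are illustrative.
const TOPIC_PATTERNS = {
  ai: /\b(ai|machine learning|llm|neural network)\b/i,
  blockchain: /\b(blockchain|smart contract|web3)\b/i,
  compliance: /\b(gdpr|hipaa|sox|compliance)\b/i,
};

const topicCounts = {}; // topic -> number of documents mentioning it

function countTopics(text) {
  for (const [topic, pattern] of Object.entries(TOPIC_PATTERNS)) {
    if (pattern.test(text)) topicCounts[topic] = (topicCounts[topic] || 0) + 1;
  }
}

countTopics('New LLM benchmark released; GDPR implications still unclear.');
console.log(topicCounts); // → { ai: 1, compliance: 1 }
```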
Capping index entries per document matters. Without the positions.slice(0, 3) cap, a single 10,000-word knowledge base document would dominate the index for common terms. Three entries per word per document gives enough context for snippets while preventing any single document from drowning out the rest.
Lessons Learned
You don't need a database for search under 10K documents. A JavaScript object is a hash map. An inverted index built from hash maps gives you sub-10ms search at our scale (244 documents, ~21K unique terms). The break-even point where Elasticsearch or Meilisearch becomes worth the operational overhead is probably around 50K-100K documents, or when you need features like faceting, geospatial queries, or distributed indexing.
Scoring is the product, not matching. Getting matching documents is trivial — a basic inverted index does that. Getting them in the right order is where you spend 80% of your time. Our five-signal scoring (exact match, prefix, phrase, recency, source type) was tuned over weeks of real usage. Start with simple scores, then adjust based on "the right document wasn't first" complaints.
Zero-dependency services are easier to operate. This service has been running for months without a single dependency-related incident. No npm audit warnings, no breaking changes in transitive packages, no supply chain concerns. For a 400-line service, the Node.js standard library is enough.
Persist metadata, not the full index. We save document metadata (titles, dates, word counts, topic counts) to disk but rebuild the inverted index from source files on startup. The metadata file is useful for dashboards and debugging; the inverted index is fast enough to rebuild from scratch (under a second for 244 documents). This avoids the complexity of index serialization, versioning, and corruption recovery.
Build search for your consumers, not for general use. Our agents and dashboards are the only consumers. That's why source-type boosting exists — it reflects how agents prefer curated knowledge over raw reports. A general-purpose search engine wouldn't make that choice. Knowing your consumers lets you build a better product with less code.
Conclusion
We've been running this search engine in production for months across a fleet of 25+ services. It handles queries from dashboards, agents, and internal tools. It indexes 244 documents containing 20,697 unique words. It runs on a server already hosting 35 ports worth of microservices without any noticeable resource impact. And the entire implementation fits in a single file with zero external dependencies.
The lesson isn't "never use Elasticsearch." It's that for a surprisingly large class of problems — internal knowledge bases, document search across hundreds or low thousands of files, agent memory retrieval — you can build exactly what you need in an afternoon and operate it for years without touching it again.
Need help building AI agent systems or designing multi-agent architectures? Ledd Consulting specializes in autonomous workflow design and agent orchestration for enterprise teams.