What is Lighthouse?

Two products sharing one engine. A hosted SaaS for engineers who want their coding agent grounded in canonical sources, and an open-source engine for teams that want to run their own.

The umbrella

Lighthouse is two products. The SaaS (this site) is a grounding layer for coding agents — we run the index, you point your agent at the MCP endpoint, your agent stops hallucinating RFC numbers. The Engine is the retrieval stack underneath, open-source under Apache 2.0, for platform teams that want a private knowledge layer with their own corpus. Same code, two corpora: ours and yours.

Hosted product

Lighthouse

Curated SDLC corpus served over MCP

  • Audience: engineers using coding agents
  • Price: Free 200/day · Pro $12/mo · Team $9/seat
  • Corpus: 71K chunks · 14K sources · 21 recipes (curated by us)
See pricing →

Open-source engine

Lighthouse Engine

Run your own retrieval stack on your own corpus

  • Audience: platform / internal-tools / AI infra teams
  • Price: Free · Apache-2.0 · your hardware
  • Corpus: empty; you bring sources + recipes
/engine →

Why Lighthouse exists

Coding agents are good at writing code, bad at remembering specs. Ask an agent for "the OAuth 2.0 PKCE flow" and you get a confident paraphrase that may or may not match RFC 7636. Lighthouse closes that gap — every agent query goes through a corpus of canonical SDLC reference, and answers come back with a pointer to the source you can actually cite. No training-data recall, no hallucinated section numbers.

Why a finder, not a republisher

Even on permissive licences, hosting full text behind a subscription makes us a content middleman. We don't want that. We retrieve enough text to identify the right doc — title, summary, topic keywords, a few lines of context — and send you to the source. The AI agent gets a precise pointer; the original publisher keeps the traffic.

How retrieval works

Every chunk has three text fields beyond the body: a Qwen-7B one-sentence summary, 3–5 topic tags, and 6–12 search-relevant keywords (synonyms and alternative phrasings). Search runs BM25 over a weighted tsvector (summary & keywords A, tags B, content C) and cosine similarity over the chunk embedding. A reciprocal-rank-fusion merge passes the top candidates through a gpt-4o-mini reranker that scores each against the query directly.

On the hosted index, weekly audit mean is 3.07/5 — just above our 3.0 useful-result threshold; we publish the score and the failure modes alongside the corpus stats.

What's indexed on the SaaS

The hosted index pulls from four buckets — canonical standards (RFCs, OWASP, NIST SP-800, MITRE ATT&CK), Tier-1 framework docs, practitioner literature, and post-cutoff streams. 21 role recipes assemble those into per-role rosters: Developer, DevOps, Security, ML, SRE, Testing, Mobile, Architecture, Designer, Reviewer, Planning, Decomposition, Clarification, PM, Data Eng, Embedded, Web3, Gamedev, Performance, Network, Self-heal.

Self-host gets none of this — the engine ships empty.

How it differs from Context7

Context7 ships a wide library-doc snapshot — hundreds of framework docs, thousands of snippets each. Lighthouse is narrower on library docs (the most-used per role) and adds three layers Context7 doesn't index: canonical standards (RFCs, OWASP, NIST), practitioner literature, and post-cutoff streams (RSS feeds, release notes). Where the two overlap, our bet is depth via summarisation; where they don't, we're the only place to find it.

Pricing or self-host

Want our curated corpus? That's the SaaS — anon 30/day per IP, signed-in free 200/day, Pro 1,500/day for $12/mo. See pricing. Want the engine itself? See /engine — Apache-2.0, your corpus, your hardware.