The umbrella
Lighthouse is two products. The SaaS (this site) is a grounding layer for coding agents — we run the index, you point your agent at the MCP endpoint, your agent stops hallucinating RFC numbers. The Engine is the retrieval stack underneath, open-source under Apache 2.0, for platform teams that want a private knowledge layer with their own corpus. Same code, two corpora: ours and yours.
Hosted product
Lighthouse
Curated SDLC corpus served over MCP
- Audience: engineers using coding agents
- Price: Free 200/day · Pro $12/mo · Team $9/seat
- Corpus: 71K chunks · 14K sources · 21 recipes (curated by us)
Open-source engine
Lighthouse Engine
Run your own retrieval stack on your own corpus
- Audience: platform / internal-tools / AI infra teams
- Price: Free · Apache-2.0 · your hardware
- Corpus: empty; you bring sources + recipes
Why Lighthouse exists
Coding agents are good at writing code, bad at remembering specs. Ask an agent for "the OAuth 2.0 PKCE flow" and you get a confident paraphrase that may or may not match RFC 7636. Lighthouse closes that gap — every agent query goes through a corpus of canonical SDLC reference, and answers come back with a pointer to the source you can actually cite. No training-data recall, no hallucinated section numbers.
Why a finder, not a republisher
Even on permissive licences, hosting full text behind a subscription makes us a content middleman. We don't want that. We retrieve enough text to identify the right doc — title, summary, topic keywords, a few lines of context — and send you to the source. The AI agent gets a precise pointer; the original publisher keeps the traffic.
How retrieval works
Every chunk has three text fields beyond the body: a Qwen-7B one-sentence summary, 3–5 topic tags, and 6–12 search-relevant keywords (synonyms and alternative phrasings). Search runs BM25 over a weighted tsvector (summary & keywords A, tags B, content C) and cosine similarity over the chunk embedding. A reciprocal-rank-fusion merge passes the top candidates through a gpt-4o-mini reranker that scores each against the query directly.
On the hosted index, weekly audit mean is 3.07/5 — just above our 3.0 useful-result threshold; we publish the score and the failure modes alongside the corpus stats.
What's indexed on the SaaS
The hosted index pulls from four buckets — canonical standards (RFCs, OWASP, NIST SP-800, MITRE ATT&CK), Tier-1 framework docs, practitioner literature, and post-cutoff streams. 21 role recipes assemble those into per-role rosters: Developer, DevOps, Security, ML, SRE, Testing, Mobile, Architecture, Designer, Reviewer, Planning, Decomposition, Clarification, PM, Data Eng, Embedded, Web3, Gamedev, Performance, Network, Self-heal.
Self-host gets none of this — the engine ships empty.
How it differs from Context7
Context7 ships a wide library-doc snapshot — hundreds of framework docs, thousands of snippets each. Lighthouse is narrower on library docs (the most-used per role) and adds three layers Context7 doesn't index: canonical standards (RFCs, OWASP, NIST), practitioner literature, and post-cutoff streams (RSS feeds, release notes). Where the two overlap, our bet is depth via summarisation; where they don't, we're the only place to find it.
Pricing or self-host
Want our curated corpus? That's the SaaS — anon 30/day per IP, signed-in free 200/day, Pro 1,500/day for $12/mo. See pricing. Want the engine itself? See /engine — Apache-2.0, your corpus, your hardware.