Open-source retrieval engine

Run your own grounding layer.

Lighthouse Engine is the retrieval stack behind lighthouse.harborgang.com — ingest pipeline, chunker, summariser, BM25 + pgvector retrieval, cross-encoder reranker, MCP server, admin panel. Apache-2.0. Bring your own corpus.

~50 MB image · runs on a $20/mo VPS · Postgres + pgvector required

What's in the box

Ingest pipeline

Sitemap + trafilatura crawler, hash-based delta-ingest, role-tagged chunkers. YAML source manifests; one cron per role.

Indexer

BM25 over weighted tsvector (summary + keywords A, tags B, content C) plus pgvector cosine on chunk embeddings. Async summary worker via OpenRouter or local LFM2.

Reranker

Cross-encoder rerank on top-N candidates. Reciprocal-rank-fusion merge. Pluggable model.

MCP server

search and fetch_source tools out of the box. Token-gated, rate-limited per subject. Admin panel for source review.

Engine vs. SaaS

The hosted Lighthouse at lighthouse.harborgang.com is the engine plus our corpus — 71K chunks of canonical SDLC reference, 14K sources, 21 role recipes, refreshed by our ingest crons. The engine itself ships empty. You bring sources, you bring recipes, you run the crons. If you want the curated corpus, that's the SaaS — see /pricing.

Deploy

1

Clone the repo and pull the image.

git clone https://github.com/ElMundiUA/lighthouse
docker pull ghcr.io/elmundiua/lighthouse:latest
2

Provision Postgres with pgvector, point the engine at it.

export DATABASE_URL=postgres://...
docker run -e DATABASE_URL -p 8000:8000 ghcr.io/elmundiua/lighthouse
3

Drop YAML source manifests into ./recipes/ and run the ingest.

docker exec lighthouse python -m lighthouse.ingest --role security

Full guide →

Bring your own corpus

The engine doesn't pick what's canonical for your team — you do. A "recipe" is a YAML manifest listing sources (URLs, sitemaps, RSS feeds, GitHub repos), a role tag, and a refresh cadence. The ingest cron crawls each source, hashes the result, chunks new content, embeds and summarises, writes to Postgres. Your private repos, internal runbooks, vendor docs — whatever you'd want a coding agent grounded on. Recipes are versioned with the engine; the SaaS uses the same format internally.

Community + by-arrangement

Issues and PRs on GitHub. For stand-up help, source curation, or private-corpus consulting, email us — we'll quote a one-off engagement.

lighthouse@harborgang.com

Looking for the hosted product? → lighthouse.harborgang.com