Ingest pipeline
Sitemap + trafilatura crawler, hash-based delta-ingest, role-tagged chunkers. YAML source manifests; one cron per role.
Open-source retrieval engine
Lighthouse Engine is the retrieval stack behind lighthouse.harborgang.com — ingest pipeline, chunker, summariser, BM25 + pgvector retrieval, cross-encoder reranker, MCP server, admin panel. Apache-2.0. Bring your own corpus.
~50 MB image · runs on a $20/mo VPS · Postgres + pgvector required
What's in the box
Ingest pipeline
Sitemap + trafilatura crawler, hash-based delta-ingest, role-tagged chunkers. YAML source manifests; one cron per role.
Indexer
BM25 over weighted tsvector (summary + keywords A, tags B, content C) plus pgvector cosine on chunk embeddings. Async summary worker via OpenRouter or local LFM2.
Reranker
Cross-encoder rerank on top-N candidates. Reciprocal-rank-fusion merge. Pluggable model.
MCP server
search and fetch_source
tools out of the box. Token-gated, rate-limited per subject. Admin
panel for source review.
Engine vs. SaaS
The hosted Lighthouse at lighthouse.harborgang.com is the engine plus our corpus — 71K chunks of canonical SDLC reference, 14K sources, 21 role recipes, refreshed by our ingest crons. The engine itself ships empty. You bring sources, you bring recipes, you run the crons. If you want the curated corpus, that's the SaaS — see /pricing.
Deploy
Clone the repo and pull the image.
git clone https://github.com/ElMundiUA/lighthouse
docker pull ghcr.io/elmundiua/lighthouse:latest Provision Postgres with pgvector, point the engine at it.
export DATABASE_URL=postgres://...
docker run -e DATABASE_URL -p 8000:8000 ghcr.io/elmundiua/lighthouse Drop YAML source manifests into ./recipes/ and run the ingest.
docker exec lighthouse python -m lighthouse.ingest --role security Bring your own corpus
The engine doesn't pick what's canonical for your team — you do. A "recipe" is a YAML manifest listing sources (URLs, sitemaps, RSS feeds, GitHub repos), a role tag, and a refresh cadence. The ingest cron crawls each source, hashes the result, chunks new content, embeds and summarises, writes to Postgres. Your private repos, internal runbooks, vendor docs — whatever you'd want a coding agent grounded on. Recipes are versioned with the engine; the SaaS uses the same format internally.
Apache-2.0
Use commercially, modify, redistribute, sublicense. No patent-grant gotchas.
LICENSE on GitHub ↗Community + by-arrangement
Issues and PRs on GitHub. For stand-up help, source curation, or private-corpus consulting, email us — we'll quote a one-off engagement.
lighthouse@harborgang.comLooking for the hosted product? → lighthouse.harborgang.com