I built a search engine that runs on Node + SQLite + FTS5.
BM25 + 384-dim vector + FTS5 hybrid ranking
Mesh network with RSA crypto identity (no central auth)
Remote nodes contribute crawl data through P2P WebSocket
930 bytes per doc (2M docs = ~2GB)
Currently indexing 52K+ domains
Runs on 2 servers for $22/month
Patent pending
Why: I wanted search infrastructure anyone could own and run. No Elasticsearch cluster. No cloud dependency. No vendor lock-in.
Demo: https://www.qwikwit.com
Stack: Node, JavaScript, SQLite, FTS5, WebSocket mesh
Happy to answer questions about the architecture.
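Not the project's actual protocol, just a rough sketch of the idea behind "RSA crypto identity, no central auth": a remote node signs each crawl contribution with its own private key, and the receiving peer verifies the sender by public key alone. The message shape, field names, and endpoint URL below are assumptions for illustration.

```javascript
// Hypothetical sketch: sign crawl contributions with an RSA key so peers can
// verify the sender without any central authentication service.
const crypto = require('crypto');
const WebSocket = require('ws'); // npm install ws

// Each node generates a keypair once; the public key acts as its identity.
const { publicKey, privateKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });

function signContribution(doc) {
  const payload = JSON.stringify(doc);
  const signature = crypto.sign('sha256', Buffer.from(payload), privateKey).toString('base64');
  return {
    payload,
    signature,
    sender: publicKey.export({ type: 'spki', format: 'pem' }),
  };
}

// Receiving side: verify the signature against the sender's public key.
function verifyContribution({ payload, signature, sender }) {
  return crypto.verify('sha256', Buffer.from(payload), sender, Buffer.from(signature, 'base64'));
}

// Example: a remote crawler pushes one signed doc to a mesh peer.
const ws = new WebSocket('wss://mesh-peer.example/contribute'); // placeholder URL
ws.on('open', () => {
  ws.send(JSON.stringify(signContribution({ url: 'https://example.com/', text: '...' })));
});
```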
Cool! Does ranking use term proximity of any kind, or anything similar to pagerank?
I only get 45 results for cheese. I should get far more results for cheese from an index of 2 million pages. How many pages are indexed? Is it actually 52K pages? Is the demo smaller on purpose?
What is your long term goal, to develop this into a general web search tool, or to have this be an Elasticsearch competitor that a bunch of people use for their own data?
The index is building toward 2M; at present there are around 2.8 million entries in the system, but only ~52k searchable results...
The long-term goal is to provide a local search solution: a competitor for searching your own data, without the complexity, cost, distribution overhead, and so on.
And yes, site search was the original feature set, which you aptly identified.
I suppose the answer here is that there are around 52k results, not by design but by limitation: we are actively crawling toward 2M with 10 local crawlers, 5 remote crawlers, 0 mesh crawlers, and so on.
In general, it is a huge target set but a small sample at present (growing, but small for the moment). My goal here was to showcase the site as an example.
- Index Progress: 2.7% (53,507 / 2,000,000 docs)
- Avg CPU / Memory: 2.5% / 956MB
- Avg Doc Size: 1KB
- Users (Active): 3 (3)
- Active Sessions: 0
- Local Crawlers: 11/11
- Remote Crawlers: 4/5
- Index Rate: 187/min
- Total Documents: 53,507
- Active Domains: 2,341
Ranking Algorithm:
The search uses SQLite FTS5 with BM25 ranking (Okapi BM25 probabilistic retrieval model), not PageRank. Current ranking factors:
- Field boosts: Title gets 20x weight, Tags 5x, Content 1x
- Quality boosts: +2 for meta descriptions, +3 for well-structured content (200-10k chars)
- No term proximity currently: FTS5 handles boolean and phrase matching, but its default BM25 ranking does not score by term distance (though it would be easy enough to add if necessary)
- No link graph/PageRank - we don't analyze inbound links between pages
Term proximity and link-based authority scoring (like our WitRank domain scoring system, which is built and has scoring implemented but is not yet used in ranking) are potential future enhancements. A sketch of the field-weighted BM25 query is below.
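For reference, here is a minimal sketch of how field boosts like these map onto SQLite FTS5's per-column bm25() weights in Node (via better-sqlite3); the table name, schema, and exact weights are illustrative rather than the project's actual code.

```javascript
// Sketch only: an FTS5 table with title/tags/content columns and the
// 20x/5x/1x boosts applied through bm25()'s per-column weights.
const Database = require('better-sqlite3');
const db = new Database('index.db');

db.exec(`CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, tags, content);`);

// bm25() takes one weight per column, in declaration order.
// SQLite's bm25() returns lower-is-better scores, so sort ascending.
const search = db.prepare(`
  SELECT rowid, title, bm25(docs, 20.0, 5.0, 1.0) AS score
  FROM docs
  WHERE docs MATCH ?
  ORDER BY score
  LIMIT 50
`);

console.log(search.all('cheese'));
```

Quality boosts such as the +2/+3 adjustments above would then presumably be layered on top of this base score in application code.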
---
Index Size:
The current index is ~53,500 pages, not 2 million. Only 56 pages actually contain "cheese" in the crawled content, so getting 45 results is accurate. The demo is smaller because:
1. The index is live and actively growing (currently ~100 docs/min crawl rate)
2. We're crawling real web content organically, not seeding from a dump
3. Distributed mesh crawlers are still ramping up
---
Long-term vision: in a sense that is your call to make, since the architecture currently supports both use cases:
- General web search: public crawling, distributed mesh nodes, browser-based PWA crawlers
- Private/enterprise search: SQLite-based, self-hostable, single-writer architecture
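For the private/self-hosted case, here is a rough sketch of what a single-writer SQLite setup can look like: WAL mode so readers never block the writer, with all writes serialized through one queue and one connection. The file name, schema, and queue shape are assumptions for illustration, not the project's code.

```javascript
// Hypothetical single-writer setup: one process owns all writes, readers
// open the same database file separately in WAL mode.
const Database = require('better-sqlite3');

const db = new Database('search.db');
db.pragma('journal_mode = WAL'); // readers don't block the single writer

db.exec('CREATE TABLE IF NOT EXISTS docs (url TEXT, title TEXT, content TEXT)');
const insertDoc = db.prepare('INSERT INTO docs (url, title, content) VALUES (?, ?, ?)');

// Crawlers push docs onto an in-process queue; one loop drains it in batches.
const writeQueue = [];
function enqueue(doc) { writeQueue.push(doc); }

const insertBatch = db.transaction((docs) => {
  for (const d of docs) insertDoc.run(d.url, d.title, d.content);
});

setInterval(() => {
  if (writeQueue.length === 0) return;
  insertBatch(writeQueue.splice(0, writeQueue.length)); // one transaction per drain
}, 250);
```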
Also since this is about search engines, I need to share this interesting article: https://archive.org/details/search-timeline
Also, when I try to click on the about pages on the website, it says {"error":"Authentication required"}
Yes, some of the pages require authentication; this may be a bug, or they may in fact be intentionally secured for now.