7 comments

  • joeg_usa 11 hours ago

    I built a search engine that runs on Node + SQLite + FTS5.

    - BM25 + 384-dim vector + FTS5 hybrid ranking
    - Mesh network with RSA crypto identity (no central auth)
    - Remote nodes contribute crawl data through P2P WebSocket
    - 930 bytes per doc (2M docs = ~2GB)
    - Currently indexing 52K+ domains
    - Runs on 2 servers for $22/month
    - Patent pending
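
    As a rough check of the storage figure, the arithmetic can be sketched in Python. The function name is a hypothetical helper; the 930 bytes/doc and 2M docs figures come from the post, and any SQLite page or index overhead beyond that per-doc average is an assumption left unmodeled here.

```python
def index_size_bytes(docs: int, bytes_per_doc: int = 930) -> int:
    """Raw size estimate: document count times average bytes per document."""
    return docs * bytes_per_doc


size = index_size_bytes(2_000_000)
print(f"{size / 1e9:.2f} GB (decimal)")    # 1.86 GB (decimal)
print(f"{size / 2**30:.2f} GiB (binary)")  # 1.73 GiB (binary)
```

    So "~2GB" is a rounded-up figure for roughly 1.86 GB of document data.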

    Why: I wanted search infrastructure anyone could own and run. No Elasticsearch cluster. No cloud dependency. No vendor lock-in.

    Demo: https://www.qwikwit.com

    Stack: Node, JavaScript, SQLite, FTS5, WebSocket mesh

    Happy to answer questions about the architecture.

    • n1xis10t 10 hours ago

      Cool! Does ranking use term proximity of any kind, or anything similar to pagerank?

      I only get 45 results for cheese. I should get far more results for cheese from an index of 2 million pages. How many pages are indexed? Is it actually 52K pages? Is the demo smaller on purpose?

      What is your long term goal, to develop this into a general web search tool, or to have this be an Elasticsearch competitor that a bunch of people use for their own data?

      Also since this is about search engines, I need to share this interesting article: https://archive.org/details/search-timeline

      • joeg_usa 3 hours ago

        The index is running up to 2M and at present there are 2.8 million in the index, but only 52k results...

        The long-term goal is to provide a local search solution that competes as a way to search your own data, without the complexity, cost, distribution, etc.

        And yes, site search was the original feature set, which you aptly identified.

        • joeg_usa 3 hours ago

          I suppose the answer here is that there are around 52k results, not by design but by limitation. We are actively running toward 2M with 10 local crawlers, 5 remote crawlers, 0 mesh crawlers, etc.

          In general, the target set is huge, but the sample is small for the moment (and growing). My goal here was to showcase the site as an example.

          - Index Progress: 2.7% | 53,507 / 2,000,000 docs
          - Avg CPU / Memory: 2.5% / 956MB
          - Avg Doc Size: 1KB
          - Users (Active): 3 (3)
          - Active Sessions: 0
          - Local Crawlers: 11/11
          - Remote Crawlers: 4/5
          - Index Rate: 187/min
          - Total Documents: 53,507
          - Active Domains: 2,341
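
          For a sense of how long the remaining crawl might take, here is a back-of-the-envelope sketch in Python. It assumes the index rate stays constant, which a live crawl will not guarantee; `days_to_target` is a hypothetical helper, and the 100/min case is only a slower rate included for comparison.

```python
def days_to_target(current: int, target: int, docs_per_min: float) -> float:
    """Days needed to index the remaining documents at a constant rate."""
    remaining = max(target - current, 0)
    return remaining / docs_per_min / (60 * 24)


current, target = 53_507, 2_000_000
print(f"progress: {100 * current / target:.1f}%")  # progress: 2.7%
for rate in (187, 100):
    print(f"{rate}/min -> {days_to_target(current, target, rate):.1f} days")
# 187/min -> 7.2 days
# 100/min -> 13.5 days
```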

          • joeg_usa 3 hours ago

              Ranking Algorithm:
            
              The search uses SQLite FTS5 with BM25 ranking (Okapi BM25 probabilistic retrieval model), not PageRank. Current ranking factors:
            
              - Field boosts: Title gets 20x weight, Tags 5x, Content 1x
              - Quality boosts: +2 for meta descriptions, +3 for well-structured content (200-10k chars)
              - No term proximity currently - FTS5 does boolean matching but not phrase distance scoring (though it is easy enough to do if necessary).
              - No link graph/PageRank - we don't analyze inbound links between pages
            
              Term proximity and link-based authority scoring (like our WitRank domain scoring system) are potential future enhancements. (WitRank is built and its scores are computed, but they are not used in ranking yet, though they could be.)
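
              To make the ranking factors above concrete, here is a minimal sketch using Python's sqlite3 module rather than the site's actual Node code. It relies on FTS5's built-in bm25() function, which accepts one weight per column. The table layout, the toy pages, the `quality_boost` and `search` helpers, and the choice to add the boosts directly to the negated BM25 score are all assumptions for illustration; the post does not say how the boosts are combined. It also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 table holding the three weighted fields.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, tags, content)")

# Toy corpus: (title, tags, content, has_meta_description)
pages = [
    ("Cheese guide", "food", "an overview of dairy products and how they age " * 6, True),
    ("Dairy notes", "food", "an overview of cheese products and how they age " * 6, False),
    ("Bread basics", "baking", "flour water salt yeast and time", True),
    ("Garden log", "plants", "tomatoes need sun and regular watering", False),
    ("Bike repair", "cycling", "how to true a wheel and adjust brakes", False),
]
has_meta = {}
for rowid, (title, tags, content, meta) in enumerate(pages, start=1):
    conn.execute(
        "INSERT INTO docs(rowid, title, tags, content) VALUES (?, ?, ?, ?)",
        (rowid, title, tags, content),
    )
    has_meta[rowid] = meta


def quality_boost(has_meta_description: bool, content: str) -> int:
    """+2 for a meta description, +3 for content between 200 and 10k chars."""
    boost = 0
    if has_meta_description:
        boost += 2
    if 200 <= len(content) <= 10_000:
        boost += 3
    return boost


def search(query: str):
    """Return (score, title) pairs, best match first."""
    rows = conn.execute(
        # Column weights: title 20x, tags 5x, content 1x.
        "SELECT rowid, title, content, bm25(docs, 20.0, 5.0, 1.0) "
        "FROM docs WHERE docs MATCH ?",
        (query,),
    ).fetchall()
    # FTS5's bm25() is lower (more negative) for better matches, so negate it.
    # Adding the boosts to that value is an assumption, not the site's formula.
    scored = [
        (-bm + quality_boost(has_meta[rid], content), title)
        for rid, title, content, bm in rows
    ]
    return sorted(scored, reverse=True)


for score, title in search("cheese"):
    print(f"{score:.3f}  {title}")
```

              In this toy corpus the page with "cheese" in its title ranks above the page that only mentions it in the body. Note that with additive boosts of this size the quality terms can outweigh the BM25 component, so a real system would need to scale them deliberately.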
            
              ---
              Index Size:
            
              The current index is ~53,500 pages, not 2 million. Only 56 pages actually contain "cheese" in the crawled content, so getting 45 results is accurate. The demo is smaller because:
            
              1. The index is live and actively growing (currently ~100 docs/min crawl rate)
              2. We're crawling real web content organically, not seeding from a dump
              3. Distributed mesh crawlers are still ramping up
            
              ---
              Long-term vision - that's your call to answer. The architecture currently supports both use cases:
              - General web search: public crawling, distributed mesh nodes, browser-based PWA crawlers
              - Private/enterprise search: SQLite-based, self-hostable, single-writer architecture
    • n1xis10t 9 hours ago

      Also, when I try to click on the about pages on the website, it says {"error":"Authentication required"}

      • joeg_usa 3 hours ago

        Yes, some of the pages require authentication; this may be a bug, or they may in fact be secured for now.