Show HN: TeaTime – distributed book library powered by SQLite, IPFS and GitHub

(github.com)

307 points | by yoavm 7 months ago ago

60 comments

2color 7 months ago

This looks neat!

IPFS contributor here.

I had a look at your code and saw how you handle downloading from multiple gateways.

There's a better way to this which also leverages direct P2P retrieval in browsers, which is now a thing with IPFS! [0] If you just want to fetch, checkout @helia/verified-fetch[1] which gives you a Fetch like API that accepts CIDs. It handles all the content routing and retrieval p2p magic and can help reduce reliance on gateways.

You can also pass it gateways in case it can connect to providers directly (due to transports)

[0] https://blog.ipfs.tech/2024-shipyard-improving-ipfs-on-the-w... [1] https://github.com/ipfs/helia-verified-fetch/tree/main/packa...

[-]

yoavm 7 months ago

Thanks! I really wanted to use verified-fetch, but it seemed like it doesn't really support all types of CIDs? In my case, most of the files use blake2b-256, and the last time I checked verified-fetch didn't like that :( Did something change?

Edit: I now realised we previously talked on the Discord channel, thank you for the help there!

[-]

2color 7 months ago

With some extra config you can pass custom hashers.

Any reason you chose BLAKE2b-256?

BLAKE2/3 are amazing hashing functions and have a lot of cool properties, but they really suck on the web in terms of performance. This came up literally yesterday https://bsky.app/profile/norman.life/post/3lbw2qltokc2i and has been reported by multiple devs recently (who also shared excitement about BLAKE)

[-]

yoavm 7 months ago

The main reason I wanted to support it is that it happens to be what Library Genesis uses. I'll have a look into passing these custom headers! Would definitely be more elegant than hitting all gateways together and hoping for the best.

2color 7 months ago

Small correction:

You can also pass [@helia/verified-fetch] gateways as a fallback for when it *cannot* connect to providers directly (due to transports).

aeyes 7 months ago

The Pear P2P framework might be worth a look if you want to get off of GitHub and turn it into a truly distributed system. If the index must be on GitHub then what good is it to have the files on IPFS?

https://docs.pears.com/

[-]

yoavm 7 months ago

Thanks, I shall have a look into that. Technically the index too can live on IPFS, but the real challenge was how to know which databases exist, and where. This is what I'm using the GitHub topics for - everyone can create a repo and assign a topic to it, and the GitHub free API lets you search for matching repos. In a way, what I'm looking is to be able to make a request and get a different result when a new database is up, and the immutable nature of IPFS for example isn't great for that. GitHub stars and accounts also give some basic concept of content moderation. I should see what Pears offers in this space.

Generally, I see the distributed nature of TeaTime as an instrument for being resilient, and less so a goal by itself. My thinking is that the GitHub database repositories are "resilient enough", especially if they can contain just a config.json file pointing to database files living elsewhere. But I'm not a lawyer and could be wrong.

[-]

boramalper 7 months ago

> Technically the index too can live on IPFS

Besides, HTTP requests against indices should be fast enough for a decent user experience and IPFS (and/or its gateways) aren’t great at that in my experience. I think using GitHub (or any other static hosting providers) was a good call in that regard.

mdaniel 7 months ago

> (I'm looking into using Helia so that users are also contributing nodes in the network)

I had to look that term up <https://github.com/ipfs/helia#readme> but while sniffing around in their <https://github.com/ipfs/helia/wiki/Projects-using-Helia> I was reminded of https://github.com/orbitdb/orbitdb#readme which seems like it may be much less rolling your own parts

[-]

yoavm 7 months ago

Thanks, I haven't heard of it before. I wonder if it's possible to use OrbitDB to download a file from IPFS though? Based on it's CID, I mean, because that was my intention with Helia. I thought that one of the nice thing about reading books is that it takes time, and if users could be seeding their files from IndexedDB to the network, they could automatically and effortlessly become "contributing citizens".

Another interesting use case would be to see if this can replace (or be an addition to) SQLite as the database in which the queries are ran.

[-]

zwnow 7 months ago

Seeding sounds like what's done by torrenting too no? How do you prevent people from distributing content they do not own?

[-]

yoavm 7 months ago

Yes it is, and you're absolutely right with raising this concern. I was planning to make seeding (if I get it working at all) an opt-in feature, firstly to prevent excessive data usage on mobile connections.

[-]

zwnow 7 months ago

I see, nice project nonetheless! Hope it gets where you want it :-)

boramalper 7 months ago

Also see: Show HN: Fully-searchable Library Genesis on IPFS

https://news.ycombinator.com/item?id=28585208

https://libgen-crypto.ipns.dweb.link/

viborcip 7 months ago

"Everything is done in the browser - no users, no cookies, no tracking. LocalStorage and IndexedDB are used for saving your last readings, your position in each file."

I love this! Thanks for making this!

mushufasa 7 months ago

> The databases used in TeaTime are GitHub repositories tagged with the teatime-database topic, which are published on GitHub Pages.

Couldn't this be a security issue, for a bad actors to use this tag?

[-]

yoavm 7 months ago

That's a fair point - I guess it could be abused. Databases are sorted by their number of GitHub stars, so I was hoping that with the power of the crowds it will be possible to minimize the bad effect such actor might have, by simply not voting them up.

[-]

mushufasa 7 months ago

there's been several attacks recently where a bad actor takes over a repo where the original maintainer wants to take a step back, then launch a supply chain attack. in recent cases, the attack came from obfuscated binary files in the repo rather than code. given we are dealing with documents here (books) that would be easy to hide malicious code in a file. pdfs have all sorts of execution vulnerabilities for example

[-]

yoavm 7 months ago

Interesting - I'm kinda counting on PDF.js, which is used for PDF rendering, on doing it safely, but of course that doesn't always have to be the case. Do you have any thoughts on how to make this safer?

[-]

mushufasa 7 months ago

some other method of collection where you can hav eknown trust of the files contributed, some method of 'registering' a submission to create trust,

cicko 7 months ago

PDFs are not executables.

[-]

samatman 7 months ago

Other than the JavaScript, you mean.

https://cloudmersive.com/article/Understanding-Embedded-Java...

taneq 7 months ago

You’d be surprised what’s executable with the right attitude.

crest 7 months ago

May I recommend the old 27C3 talk "OMG WTF PDF"?

macintux 7 months ago

[dead]

garganzol 7 months ago

It does not work for me. For example:

> Access to fetch at 'https://bafykbzacedvzdo4hru3wul5excjthwjuz5ggd6jcs4wg77tg7bk...' (redirected from 'https://w3s.link/ipfs/bafykbzacedvzdo4hru3wul5excjthwjuz5ggd...') from origin 'https://bjesus.github.io' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

[-]

yoavm 7 months ago

The CORS headers seem fine, though I'm a little surprised about the content-type:

  $ curl -I https://bafykbzacedvzdo4hru3wul5excjthwjuz5ggd6jcs4wg77tg7bkn3c7hlvohq.ipfs.dweb.link/
  HTTP/2 200
  date: Thu, 28 Nov 2024 16:34:03 GMT
  content-type: application/x-rar-compressed
  content-length: 192215114
  access-control-allow-headers: Content-Type
  access-control-allow-headers: Range
  access-control-allow-headers: User-Agent
  access-control-allow-headers: X-Requested-With
  access-control-allow-methods: GET
  access-control-allow-methods: HEAD
  access-control-allow-methods: OPTIONS
  access-control-allow-origin: *

Can you try with another file? I do think that once I implement Helia's verified-fetch, issues like this should be less common.

slagfart 7 months ago

Sorry if I missed this, but is there an example instance we can play with?

[-]

yoavm 7 months ago

Yes, https://bjesus.github.io/teatime/

sram1337 7 months ago

I don't know what's in the database, so I don't know what to search for.

How about a browse feature?

[-]

yoavm 7 months ago

Ha, that's a great point... I can probably work on a meta "table of content" for each database, shouldn't be hard to do. But generally the database descriptions say what they have - public domain books, library genesis non-fiction, etc.

pizza 7 months ago

This is awesome. Would it be able to integrate Pocket bookmarks into this somehow? Would love to be able to keep a cache offline of all the things I wanted to read (esp if I can prevent link-rot), and to be able to query them.

[-]

yoavm 7 months ago

I need to check about Pocket - do you have any relevant documentation? Note however that TeaTime does keep an offline cache, both a list of your recent books and the files themselves, but it does it using LocalStorage and IndexedDB because I really wanted to prevent the need for having user logins. Maybe I can add an option to download a "state dump" kind of thing, so that you can a file with your history and can always import it. I too always feel that LocalStorage could be a little flaky, in the sense that you don't always realize that switching a browser would mean losing all your data.

evolve2k 7 months ago

Hi I’m here looking for a different offering. I have a personal library of physical books that our small group wants to be able to lend out to others.

I want people to be able to see which books are available and reserve books that are out and get notifications when due back.

And preferable self host as a small site.

Anyone know a small open source physical library management tool like this?

nyolfen 7 months ago

beautiful! i made a much crappier version of the same kind of thing a year or two ago with a manually converted libgen db from annas-archive's index, but the best i could come up with was webtorrenting a sqlite db of hashes and importing it into indexeddb then distributing the static fontend bundle as an urbit app; it was much uglier than this

[-]

yoavm 7 months ago

Thank you! I've seen quite a lot of similar projects as I was working on this, and they all served as much inspiration and also a proof that this can in fact be done.

Also, thanks to whoever came up with thebookbay.org instance :)

clueless 7 months ago

Is this like an open source distributed libgen?

[-]

yoavm 7 months ago

Libgen is one of the databases that are currently available (and was probably a good fit because they already had their files on IPFS), but I think that this architecture, in which the UI is decoupled from the database, the database doesn't hold any copyrighted materials, and the files are downloaded from IPFS, is quite resilient and could be used for serving all sorts of content.

[-]

7 months ago

[deleted]

7 months ago

[deleted]

westurner 7 months ago

Do static sites built with sphinx-build or jupyter-book or hugo or other jamstack static site generators work with TeaTime?

sphinx-build: https://www.sphinx-doc.org/en/master/man/sphinx-build.html

There may need to be a different Builder or an extension of sphinxcontrib.serializinghtml.JSONHTMLBuilder which serializes a doctree (basically a DOM document object model) to the output representation: https://www.sphinx-doc.org/en/master/usage/builders/#sphinxc...

datasette and datasette-lite can load CSV, JSON, Parquet, and SQLite databases; supports Full-Text Search; and supports search Faceting. datasette-lite is a WASM build of datasette with the pyodide python distribution.

datasette-lite > Loading SQLite databases: https://github.com/simonw/datasette-lite#loading-sqlite-data...

jupyter-lite is a WASM build of jupyter which also supports sqlite in notebooks in the browser with `import sqlite3` with the python kernel and also with a sqlite kernel: https://jupyter.org/try-jupyter/lab/

jupyterlite/xeus-sqlite-kernel: https://github.com/jupyterlite/xeus-sqlite-kernel

(edit)

xeus-sqlite-kernel > "Loading SQLite databases from a remote URL" https://github.com/jupyterlite/xeus-sqlite-kernel/issues/6#i...

%FETCH <url> <filename> https://github.com/jupyter-xeus/xeus-sqlite/blob/ce5a598bdab...

xlite.cpp > void fetch(const std::string url, const std::string filename) https://github.com/jupyter-xeus/xeus-sqlite/blob/main/src/xl...

[-]

yoavm 7 months ago

> Do static sites built with sphinx-build or jupyter-book or hugo or other jamstack static site generators work with TeaTime?

I guess it depends on what you mean by "work with TeaTime". TeaTime itself is a static site, generated using Nuxt. Nothing that it does cannot be achieved with another stack - at the end it's just HTML, CSS and JS. I haven't tried sphinx-build or jupyter-book, but there isn't a technical reason why Hugo wouldn't be able to build a TeaTime like website, using the same databases.

> datasette-lite > Loading SQLite databases: https://github.com/simonw/datasette-lite#loading-sqlite-data...

I haven't seen datasette before. What are the biggest benefits you think it has over sql.js-httpvfs (which I'm using now)? Is it about the ability to also use other formats, in addition to SQLite? I got the impression that sql.js-httpvfs was a bit more of a POC, and later some possibly better solutions came out, but I haven't really went that rabbit hole to figure out which one would be best.

Edit: looking a little more into datasette-lite, it seems like one of the nice benefits of sql.js-httpvfs is that it doesn't download the whole SQLite database in order to query it. This makes it possible have a 2GB database but still read it in chunks, skipping around efficiently until you find your data.

[-]

westurner 7 months ago

So the IPFS part is instead of a CDN? Does it replace asset URLs in templates with e.g. cf IPFS gateway URLs with SRI hashes?

datasette-lite > "Could this use the SQLite range header trick?" https://github.com/simonw/datasette-lite/issues/28

From xeus-sqlite-kernel > "Loading SQLite databases from a remote URL" https://github.com/jupyterlite/xeus-sqlite-kernel/issues/6#i... re: "Loading partial SQLite databases over HTTP":

> sql.js-httpvfs: https://github.com/phiresky/sql.js-httpvfs

> sqlite-wasm-http: https://github.com/mmomtchev/sqlite-wasm-http

>> This project is inspired from @phiresky/sql.js-httpvfs but uses the new official SQLite WASM distribution.

Datasette creates a JSON API from a SQLite database, has an optional SQL query editor with canned queries, multi DB query support, docs; https://docs.datasette.io/en/stable/sql_queries.html#cross-d... :

> SQLite has the ability to run queries that join across multiple databases. Up to ten databases can be attached to a single SQLite connection and queried together.

[-]

yoavm 7 months ago

The IPFS part is where the actual files are downloaded from. I wouldn't necessarily say that it's instead of a CDN - it just seems to be a reliable enough way of getting the files (without the pain of setting up a server with all these files myself). Indeed, it fetches them through IPFS Gateways (the Cloudflare IPFS Gateway is unfortunately no longer a thing) using their CID hash.

lakomen 7 months ago

I absolutely don't understand this.

Doesn't ipfs require a server running? Where is the server code?

[-]

2color 7 months ago

It does require an IPFS node, i.e. server to provide tha data. The point is more that data isn't tied to any specific server. By content addressing with CIDs you can fetch from multiple IPFS nodes and even become a "provider" for a CID once your IPFS node downloads a copy.

[-]

lakomen 7 months ago

Yes, I'm aware how ipfs works, but that doesn't really answer the question :)

[-]

2color 7 months ago

Sorry for misunderstanding.

I think the app was built on the assumption that someone is providing these CIDs.

yoavm 7 months ago

I'm currently using public IPFS Gateways to fetch the files from IPFS.

noamikotamir 7 months ago

super cool

user_james92 7 months ago

IPFS needs server running or am I wrong?

[-]

yoavm 7 months ago

For fetching files from IPFS, you can use a public IPFS Gateway, and the Gateway is essentially your server. That's what TeaTime is doing.

rausr 7 months ago

I'm slightly disappointed that this is not the Tea-Time that formed the comms/sync protocol used in the Croquet VR system (on Squeak Smalltalk) some years back. :)

7 months ago

[deleted]

evantbyrne 7 months ago

Any particular reason for choosing IPFS instead of bittorrent or other p2p protocols? It feels like every time I try an IPFS tool it just crashes, whereas I rarely have issues with torrents.

[-]

2color 7 months ago

IPFS contributor here.

IPFS certainly has its flaws, but it's been getting much better in the last year. If you want to do P2P retrieval in browsers, it's practically the most advanced from all protocols.

We just published an update on this topic https://blog.ipfs.tech/2024-shipyard-improving-ipfs-on-the-w...

[-]

boramalper 7 months ago

> If you want to do P2P retrieval in browsers

My understanding is that IPFS always requires an IPFS → HTTP gateway to work in browsers, so I wonder what you mean by "P2P retrival in browsers".

Verified Fetch is about verifying content retrieved from gateways on client-side but the retrival is still very much client-server (vs P2P) today, and Service Worker Gateway seems to be relying on gateways too (<trustless-gateway.link> in the demo linked).

Based on IPFS & HTTP: The Blueprint for Radical Interoperability - Lidel [0], I believe the situation is acknowledged by IPFS devs too. In my unsolicited opinion, using a custom protocol (libp2p transports?) for a project targetting web browsers was a mistake, so I'm glad to see that HTTP is being considered now.

[0] https://www.youtube.com/watch?v=4GwxrQ9Z3Bk

[-]

2color 7 months ago

> My understanding is that IPFS always requires an IPFS → HTTP gateway to work in browsers, so I wonder what you mean by "P2P retrival in browsers".

By P2P retrieval, I mean retrieval directly from a peer that has the data without additional hops.

Historically HTTP gateways were the only way, because you couldn't dial (most) peers directly from a browser unless they had a CA-signed TLS certificate (needed in secure contexts.)

A couple of things changed that:

- WebTransport and WebRTC-direct allow browser-server(peer) communication without a CA signed certificate. WebTransport is very new and still has some problems in browser implementations.

- We just launched AutoTLS which help public IPFS node get a wildcard let's encrypt TLS certificate, making it "dialable" from browsers

With these, it becomes a whole lot easier to establish connections from browsers to IPFS nodes. But it takes time for the network to upgrade.

Note that Libp2p transports are wrappers over network transports to allow interoperability across runtimes, e.g. TCP, QUIC, WebSockets, WebTransport. These are not custom protocols.

Now you point about custom protocols, by which I presume you are referring to data exchange/retrieval protocols.

There are two IPFS data transfer protocols: - Bitswap: the first and default data transfer protocol which requires a streaming network transport (so HTTP isn't ideal). - HTTP Gateways: Initially, HTTP gateways were servers that would handle retrieval (over Bitswap) for you if they didn't have the data locally (sometimes we refer to these as Recursive Gateways, like trustless-gateway.link and ipfs.io). For all the reasons in Lidel's talk (caching, interop, composability), we extended this notion and made HTTP Gateway a general interface for data retrieval.

Today, there are large providers in the network (pinning services) that expose their data with an HTTP Gateway, which allows browsers to retrieve from directly.

We still have more work to do to expose the gateway in Kubo with the new AutoTLS certificate, but for the time being, Bitswap should work well enough even in browsers over a WebSockets connection. ---

Verified Fetch's aims to ensure you get data is it's available. If it fails to find a direct provider of the data, it will fall back on a recursive gateway like trustless-gateway.link which is configured in the defaults. As more IPFS nodes upgrade to newer versions, reliance on such centralised recursive gateways will become unnecessary.

TL;DR: We're all in on HTTP. The real constraint to P2P is browser vendors and slow standard bodies.

7 months ago

[deleted]

yoavm 7 months ago

Yeah - the desire to have TeaTime run as a normal website. BitTorrent doesn't run over HTTP, unless you use WebTorrent, which most BitTorrent clients aren't 100% compatible with. This means you can basically only download from other WebTorrent nodes - and there aren't many.

I do think this might change with the introduction of things like the Direct Sockets API, but for now they are too restricted and not widely supported.

It's my first time working with IPFS and I agree it hasn't always been 100% reliable, but I do hope that if I manage to get TeaTime users to also be contributing nodes, this might actually improve the reliability of the whole network. Once it's possible to use BitTorrent in the browser, I do think it would be a great addition (https://github.com/bjesus/teatime/issues/3).