How do you use Bloom filters to replace your current logs? Bloom filters are very good at knowing for sure that something does not exist in a set. What exactly is your set in this case? In other words, how can you query a dataset that's behind a bloom filter?
There are three kinds of queries supported for keywords:
- field
- term
- term in field
Each file, and each row group within the file, has 3 bloom filters to handle these queries.
So something like:
{"user": {"name": "John", "tags": [{"type": "user"}, {"role": "admin"}]}}
Gets turned into queryable pairs of:
[{Path: "user.name", Values: ["John"]}, {Path: "user.tags.type", Values: ["user"]}, {Path: "user.tags.role", Values: ["admin"]}]
Then you can search for:
- any record that has "john" in it
- any record that has the "user.tags.type" key
- any record that has "user.tags.type"="user" and "user.tags.role"="admin"
Which bloom filters are used depends on how you build the query, but they all test the same thing: whether a row matching the condition(s) might be in the file/row group.
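To make the three-filter idea concrete, here's a toy illustration (not the library's code) using a minimal bloom filter built on FNV double hashing. The field filter holds paths, the token filter holds values, and the field:token filter holds their combination:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal bloom filter: m bits, k probes via double hashing.
type bloom struct {
	bits []uint64
	m    uint64
	k    int
}

func newBloom(m uint64, k int) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *bloom) hashes(s string) (uint64, uint64) {
	h1 := fnv.New64a()
	h1.Write([]byte(s))
	h2 := fnv.New64()
	h2.Write([]byte(s))
	return h1.Sum64(), h2.Sum64() | 1 // odd stride stays coprime with power-of-two m
}

func (b *bloom) Add(s string) {
	a, c := b.hashes(s)
	for i := 0; i < b.k; i++ {
		pos := (a + uint64(i)*c) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

func (b *bloom) MightContain(s string) bool {
	a, c := b.hashes(s)
	for i := 0; i < b.k; i++ {
		pos := (a + uint64(i)*c) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	fields := newBloom(1024, 4)
	tokens := newBloom(1024, 4)
	fieldTokens := newBloom(1024, 4)

	// Index one flattened pair: user.tags.type = "user"
	fields.Add("user.tags.type")
	tokens.Add("user")
	fieldTokens.Add("user.tags.type:user")

	// The three query kinds map onto the three filters:
	fmt.Println(fields.MightContain("user.tags.type"))           // field exists?
	fmt.Println(tokens.MightContain("user"))                     // term anywhere?
	fmt.Println(fieldTokens.MightContain("user.tags.type:user")) // term in field?
	// An absent term almost always tests false: bloom filters never
	// false-negative, and only rarely false-positive.
	fmt.Println(tokens.MightContain("admin"))
}
```

A "no" from any filter lets you skip the file or row group entirely; a "yes" only means the scan is worth doing.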
Does that mean that you can't query substrings or do fuzzy searches?
The BloomSearchEngine takes a TokenizerFunc so you can determine how JSON values are tokenized (that's why each path always returns an array of strings).
The default tokenizer is a whitespace one: https://github.com/danthegoodman1/bloomsearch/blob/148a79967...
So {"name": "John Smith"} is tokenized to [{Path: "name", Values: ["john", "smith"]}], and the bloom filters will store:
- field: "name"
- token: "john"
- token: "smith"
- fieldtoken: "name:john"
- fieldtoken: "name:smith"
The same tokenizer must be used at query time too.
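A simplified sketch of what such a whitespace tokenizer might look like, and the five bloom entries it yields for `{"name": "John Smith"}` (the helper names here are assumptions, not the library's API):

```go
package main

import (
	"fmt"
	"strings"
)

// whitespaceTokenize lowercases a value and splits it on whitespace.
func whitespaceTokenize(value string) []string {
	return strings.Fields(strings.ToLower(value))
}

// bloomEntries lists every key inserted for one field/value pair:
// the field path, each token, and each "field:token" combination.
func bloomEntries(path, value string) []string {
	entries := []string{path}
	for _, tok := range whitespaceTokenize(value) {
		entries = append(entries, tok, path+":"+tok)
	}
	return entries
}

func main() {
	for _, e := range bloomEntries("name", "John Smith") {
		fmt.Println(e)
	}
	// name
	// john
	// name:john
	// smith
	// name:smith
}
```

Running the same function over a query like `name:"smith"` produces the same keys, which is why the tokenizer has to match at query time.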
Fuzzy searches and sub-word searches could be supported with custom tokenizers (e.g. trigrams, stemming), but it's more generally targeted at "I know some exact subset of the record, I need all records that have exactly this" searches.
Not OP, but to me, this reads fairly similar to how ClickHouse can be set up, with Bloom filters, MinMax indexes, etc.
A way to “handle” partial substrings is to break your input data up into tokens (like substrings split on spaces or dashes) and then break your search string up in the same way.
If you want to adapt the technique to full-text search, you can index trigrams instead of full keywords.
haha you beat me to it! Yes, tokenizing with trigrams is a very simple way to get this functionality. That's how systems like Postgres have historically done it.
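The trigram idea sketched in Go (a minimal illustration, not how any particular system implements it): index every 3-character window of each token, tokenize the query the same way, and a block is a candidate only if all of the query's trigrams might be present.

```go
package main

import "fmt"

// trigrams returns every 3-byte window of s. Strings shorter than 3
// characters produce no trigrams (real systems pad or special-case these).
func trigrams(s string) []string {
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

func main() {
	fmt.Println(trigrams("substring"))
	// [sub ubs bst str tri rin ing]

	// A substring query like "strin" becomes str, tri, rin — all of
	// which were also indexed for "substring", so the block matches.
	fmt.Println(trigrams("strin"))
	// [str tri rin]
}
```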
Doesn't this mean you have to do a row scan though? With a B-tree you have an O(log N) index query and that's it.
To actually retrieve the row, yeah, but a B-tree index's size scales ~linearly with the dataset size.
You can prune based on partitions, minmax indexes, then bloom filters first. By that point the row group scan, if all other checks suggest that the row you are after is in the block, covers a very small amount of data.
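That pruning cascade can be sketched like so (with entirely hypothetical types; a real bloom filter stands in here as a plain set, and a real one may false-positive but never false-negative): each cheap check only rules blocks out, so the expensive row-group scan runs on very few blocks.

```go
package main

import "fmt"

// block is a hypothetical file/row group with its cheap metadata.
type block struct {
	partition string
	min, max  int             // minmax index over some column
	tokens    map[string]bool // stand-in for a bloom filter's MightContain
}

// candidates keeps only blocks that survive every cheap check;
// only these would get the actual row-group scan.
func candidates(blocks []block, partition string, value int, token string) []block {
	var out []block
	for _, b := range blocks {
		if b.partition != partition {
			continue // partition pruning
		}
		if value < b.min || value > b.max {
			continue // minmax pruning
		}
		if !b.tokens[token] {
			continue // bloom pruning
		}
		out = append(out, b)
	}
	return out
}

func main() {
	blocks := []block{
		{"2024-01", 0, 99, map[string]bool{"alice": true}},
		{"2024-02", 100, 199, map[string]bool{"bob": true}},
		{"2024-02", 200, 299, map[string]bool{"alice": true}},
	}
	// Only the second block survives all three checks.
	fmt.Println(len(candidates(blocks, "2024-02", 150, "bob"))) // 1
}
```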
https://itnext.io/how-do-open-source-solutions-for-logs-work... covers this very well
Super! Bloom filters are smart. I created a hierarchical bloom filter for a revisit log for an indexer almost 20 years ago. It saved us $$$ and I'm still kind of proud of it.
Library/package with AGPL license, not a great thing even for a lot of FOSS projects.
Not true, it's very permissive as long as it's not a "feature" of a product you're building or offering as a service.
Otherwise you can happily use it in indirect backend services (e.g. your own logging) without license concerns.