the embedded decoder may as well grow into a full-blown SQL execution engine (since data size dwarfs executable size these days). Push that execution onto memory chips and into SSD/HDD controllers. Something similar is happening in filesystem development too, where instead of straight access to raw data you may access some execution API over the data. A modern take on the IBM mainframe's filesystem database.
The proliferation of open-source file formats (e.g., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today.
Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.
Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode data on the platform it's running on.
And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.
> Data files are useless without a program that knows how to utilize the data.
As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.
Consider a program that reads CSV files and processes the data in them. First column contains a timestamp, second column contains a filename, third column contains a size.
As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.
Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.
I think the counter argument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.
How many different storage format implementations will there realistically be?
It does open up the possibility for specialized compressors for the data in the file, which might be interesting for archiving where improved compression ratio is worth a lot.
That makes sense. I think fundamentally you’re trading off space between the compressed data and the lookup tables stored in your decompression code. I can see that amortizing well if the compressed payloads are large or if there are a lot of payloads with the same distribution of sequences though.
Giving it a quick look, seems like they've addressed a lot of the shortcomings of Parquet which is very exciting. In no particular order:
- Parquet metadata is Thrift, but with comments saying "if this field exists, this other field must exist", and no code actually verifying the fact, so I'm pretty sure you could feed it bogus Thrift metadata and crash the reader.
- Parquet metadata must be parsed out, meaning you have to: allocate a buffer, read the metadata bytes, and then dynamically keep allocating a whole bunch of stuff as you parse the metadata bytes, since you don't know the size of the materialized metadata! Too many heap allocations! This file format's Flatbuffers approach seems to solve this as you can interpret Flatbuffer bytes directly.
- The encodings are much more powerful. I think a lot of people in the database community have been saying that we need composable/recursive lightweight encodings for a long time. BtrBlocks was the first such format that was open in my memory, and then FastLanes followed up. Both of these were much better than Parquet by itself, so I'm glad ideas from those two formats are being taken up (a sketch of such a cascade follows this comment).
- Parquet did the Dremel record-shredding thing which just made my brain explode and I'm glad they got rid of it. It seemed to needlessly complicate the format with no real benefit.
- Parquet datapages might contain different numbers of rows, so you have to scan the whole ColumnChunk to find the row you want. Here it seems like you can just jump to the DataPage (IOUnit) you want.
- They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
Overall great improvement, I'm looking forward to this taking over the data analytics space.
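To make the composable-encodings point concrete, here is a minimal sketch (my illustration, not code from F3, BtrBlocks, or FastLanes) of a two-level cascade: dictionary-encode a string column, then run-length encode the resulting code stream.

    // Minimal sketch of a composable "lightweight encoding" cascade:
    // dictionary encoding followed by run-length encoding of the codes.
    // Names and structure are illustrative, not from any real format.
    use std::collections::HashMap;

    // Dictionary-encode: unique strings -> dict, values -> u32 codes.
    fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
        let mut dict: Vec<String> = Vec::new();
        let mut index: HashMap<&str, u32> = HashMap::new();
        let codes = values
            .iter()
            .map(|&v| {
                *index.entry(v).or_insert_with(|| {
                    dict.push(v.to_string());
                    (dict.len() - 1) as u32
                })
            })
            .collect();
        (dict, codes)
    }

    // Run-length encode the code stream into (code, run_length) pairs.
    fn rle_encode(codes: &[u32]) -> Vec<(u32, u32)> {
        let mut runs: Vec<(u32, u32)> = Vec::new();
        for &c in codes {
            match runs.last_mut() {
                Some((prev, n)) if *prev == c => *n += 1,
                _ => runs.push((c, 1)),
            }
        }
        runs
    }

    fn main() {
        let col = ["de", "de", "de", "us", "us", "de"];
        let (dict, codes) = dict_encode(&col);
        let runs = rle_encode(&codes);
        println!("{dict:?} {runs:?}"); // ["de", "us"] [(0, 3), (1, 2), (0, 1)]
    }

A real format would pick the cascade per column chunk based on sampled statistics; the point is that each stage is cheap and the stages compose.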
> - They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
"Heavyweight compression" as in zstd and brotli? That's very useful for columns of non-repeated strings. I get compression ratios in the order of 1% on some of those columns, because they are mostly ASCII and have lots of common substrings.
I think the wasm compiler is going to bring in more dependencies than the ‘heavy’ compression would have.
I think that more expensive compression may have made more of a difference 15 years ago, when CPU was more plentiful compared to network or disk bandwidth.
For compression, has the world settled with zstd now?
I think it’s a pretty common choice when you want compression in a new format or protocol. It works better for compressing chunks of your data rather than large files where you want to maintain some kind of index or random access. Similarly, if you have many chunks then you can parallelise decompression (I’m not sure any kind of parallelism support should have been built in to the zstd format, though it is useful for command line uses).
A big problem for some people is that Java support is hard as it isn’t portable, so e.g. making a Java web server compress its responses with zstd isn’t so easy.
Java can just use native libraries, there are plenty of Java projects that do that.
It's not like it's 1999 and there is still some Sun dogma against doing this.
Depends on the use-case. For transparent filesystem compression I would still recommend lz4 over zstd because speed matters more than compression ratio in that use case.
Most definitely not settled, but it's a good default
https://stackoverflow.com/questions/31812780/append-a-new-co...
> I'm looking forward to this taking over the data analytics space
Parquet is surprisingly arcane. There are a lot of unpleasant and poorly documented details one has to be aware of in order to use it efficiently.
> Wes McKinney
Sold
For those who don't know: Wes McKinney is the creator of Pandas, the go-to tabular analysis library for Python. That gives his format widespread buy-in from the outset, as well as a couple of decades of Caring About The Problem, which makes his insights unusually valuable.
Andy Pavlo also deserves a shout out; he's an authority on databases and lives a "data oriented lifestyle". See his two "What goes around comes around ..." papers for great overviews of the past and future 50 years in databases respectively (as well as all the amazing CMU Seminar Series). I'm excited to see him involved in this.
My apologies to the Chinese coauthors who I'm not familiar with.
Bookmarked for thorough reading!
I've been looking for an article that I think I found 10+ years ago on the web about a very similar topic, something like "the evolution of database systems". It was a very well written article that I just skimmed through and planned to read properly but could never find it again. I remember it had hand-drawn-style diagrams of database architectures with blue backgrounds, sometimes squiggly lines and maybe even some bees on them (could be just my mind mixing it with something). I'd be eternally grateful if someone here could find it.
Not sure if this is what you had in mind but here are the links to the two papers that I was referencing.
[1] covers the first 40 years of databases. [2] fills in the gap of the last 20 years and gives their thoughts on the future.
My apologies, the first one was actually Stonebraker & Hellerstein and didn't involve Pavlo. They're both excellent papers though for anyone working with data.
Stonebraker, for those who don't know, is the creator of Postgres, a database you might have heard of.
1: Stonebraker & Hellerstein, "What Goes Around Comes Around", 2005, https://people.csail.mit.edu/tdanford/6830papers/stonebraker...
2: Stonebraker & Pavlo, "What Goes Around Comes Around... And Around...", 2024, https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec20...
I'm a big fan of Wes' work and Pandas was incredibly influential. However technically it wasn't his best work. In terms of selling points, I think that the Arrow data format is technically much better and more influential on the data ecosystem as a whole, see DataFusion, etc...
That said, now let me see what F3 is actually about (and yes, your comment is what actually made me want to click through to the link) ...
Also the creator of Apache Arrow, a core component of modern data analytics.
His work on Parquet probably stands out as a better appeal to authority
Mixing data and code is a classic security mistake. Having one somewhat known individual involved doesn’t magically make it less of a mistake.
I was also concerned about the wasted overhead. However I guess it's just there for compatibility (since space is cheap) and for common encodings you'll be able to skip reading it with range requests and use your trusted codec to decode the data. Smart move imho.
I’m not concerned about the overhead; there are always more and larger pieces of iron. Still not a good idea to mix executable code with data.
It really depends on the order of priorities. If the overall goal is to allow digital archeologists to make sense of some file they found, it would be prudent to give them some instructions on how it is decoded.
I just hope that people will not just execute that code in an unconfined environment.
Hope is not an adequate security best practice. ;)
Not in physicists' future! /jk. Exabytes of data produced in the next two decades at the Large Hadron Collider will be stored in a format homemade by CERN: https://cds.cern.ch/record/2923186/
CERN has been dealing with large data sets since before most of the people reading this were born.
https://cds.cern.ch/record/2296399/files/zebra.pdf
Forgive me for not understanding the difference between columnar storage and otherwise, but why is this so cool? Is the big win that you can send around custom little vector embedding databases with a built in sandbox?
I get the "step change/foundational building block" vibe from the paper - and the project name itself implies more than a little "je ne sais quoi" - but unfortunately I only understand a few sentences per page. The pictures are pretty though, and the colour choices are tasteful yet bold. Two thumbs up from the easily swayed.
> Is the big win that you can send around custom little vector embedding databases with a built in sandbox?
No, this is a compatibility layer for future encoding changes.
For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support and then finally flip the writers to write the new format.
Specifically, there was a new flipped bit version of float encoding which sent the exponent, mantissa and sign as integers for maximum compression - this would've been so much easier to ship if I could ship a wasm shim with the new file and skip the year+ wait for all readers to support it.
We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.
Today, something like Spark's variant type would benefit from this - the sub-columnarization that it does would be so much easier to ship as bytecode instead of as an interpreter that contains support for all possible recombinations from split up columns.
PS: having spent a lot of nights tweaking tpc-h with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold up those bits in the benchmark
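For intuition, the float-splitting idea is plain bit twiddling. A hedged sketch of the general technique (my illustration, not the actual ORCv2 proposal):

    // Illustrative only: split each f32 into sign / exponent / mantissa
    // streams so that each stream can be encoded as small integers.
    fn split_floats(xs: &[f32]) -> (Vec<u8>, Vec<u8>, Vec<u32>) {
        let (mut signs, mut exps, mut mants) = (Vec::new(), Vec::new(), Vec::new());
        for x in xs {
            let bits = x.to_bits();                  // raw IEEE-754 layout
            signs.push((bits >> 31) as u8);          // 1 sign bit
            exps.push(((bits >> 23) & 0xFF) as u8);  // 8 exponent bits
            mants.push(bits & 0x007F_FFFF);          // 23 mantissa bits
        }
        (signs, exps, mants)
    }

Runs of similar-magnitude values then yield nearly constant exponent streams, which the lightweight integer encodings squeeze very well.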
I'm less confident. Your description highlights a real problem, but this particular solution looks like an attempt to shoehorn a technical solution onto a political/people problem. It feels like one of those great ideas that, years later, results in thousands of different decoders, breakages, and a maintenance nightmare. Then someone starts an initiative to move away from bundled decoding and back to just defining the data format.
Sometimes the best option is to do the hard political work: improve the standard and get everyone moving with it. People have pushed Parquet and Arrow, which are absolutely great technologies that I use regularly, but 8 years after someone asked how to write Parquet in Java, the best answer is still to use DuckDB: https://stackoverflow.com/questions/47355038/how-to-generate...
Not having a good Parquet writer for Java shows a poor attempt at pushing forward a standard. Similarly, Arrow has problems in Java land. If they can't be bothered to consider how to actually implement and roll out standards in a top-5 language, I'm not sure throwing WASM into the mix will fix it.
Columnar storage is great because a complex nested schema can be decomposed into its leaf values and stored as primitives. You can directly access leaf values, avoiding a ton of IO and parsing. Note that all these formats are actually partitioned into groups of rows at the top level.
One big win here is that it's possible to get Apache Arrow buffers directly from the data pages - either by using the provided WASM or bringing a native decoder.
In Parquet this is currently very complicated. Parquet uses the Dremel encoding which stores primitive values alongside two streams of integers (repetition and definition levels) that drive a state machine constructed from the schema to reconstruct records. Even getting those integer streams is hard - Parquet has settled on “RLE” which is a mixture of bit-packing and run-length encoding and the reference implementation uses 74,000 lines of generated code just for the bit-packing part.
So to get Arrow buffers from Parquet is a significant amount of work. F3 should make this much easier and future-proof.
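For readers who haven't met Dremel shredding, here is a tiny worked example of those two integer streams (my construction, following the published Dremel rules for a single repeated field):

    // Schema: message Doc { repeated binary tags; }  (max rep = max def = 1)
    // Rows:   ["a", "b"]    []    ["c"]
    //
    // values:  a  b  -  c         (the empty list stores no value)
    // rep:     0  1  0  0         (0 = first entry of a new row)
    // def:     1  1  0  1         (0 = the list itself is empty)
    //
    // A reader replays rep/def through a state machine derived from the
    // schema to reassemble the nested rows from the flat value stream.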
One of the suggested wins here is random access to metadata. When using GeoParquet I index the metadata in SQLite, otherwise it would take about 10 minutes as opposed to a few milliseconds to run a spatial query on e.g. Overture Maps - I'd need to parse the footer of ~500 files meaning ~150MB of Thrift would need to be parsed and queried.
However the choice of Google's Flatbuffers is an odd one. Memory safety with FlatBuffers is a known problem [1]. I seriously doubt that generated code that must mitigate these threats will show any real-world performance benefits. Actually - why not just embed a SQLite database?
[1] https://rustsec.org/advisories/RUSTSEC-2021-0122.html
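For reference, the SQLite workaround mentioned above can be as small as one table mapping each file/row-group to its bounding box, populated once by parsing the footers. A hedged sketch using the rusqlite crate; the table and column names are my invention:

    // One-time index of per-row-group bounding boxes, so a spatial query
    // doesn't have to re-parse ~500 Parquet footers.
    use rusqlite::{Connection, Result};

    fn create_index(conn: &Connection) -> Result<()> {
        conn.execute_batch(
            "CREATE TABLE IF NOT EXISTS rg_index (
                 file      TEXT NOT NULL,
                 row_group INTEGER NOT NULL,
                 xmin REAL, ymin REAL, xmax REAL, ymax REAL
             );
             CREATE INDEX IF NOT EXISTS rg_bbox
                 ON rg_index (xmin, xmax, ymin, ymax);",
        )
    }

    fn add_row_group(
        conn: &Connection,
        file: &str,
        row_group: i64,
        bbox: (f64, f64, f64, f64),
    ) -> Result<()> {
        conn.execute(
            "INSERT INTO rg_index VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
            (file, row_group, bbox.0, bbox.1, bbox.2, bbox.3),
        )?;
        Ok(())
    }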
Afaik the win of columnar storage comes from the fact that you can very quickly scan the entire column across all rows, making very efficient use of OS buffering etc., so queries like select a where b = 'x' are very quick.
> so queries like select a where b = 'x' are very quick
I wouldn’t say “very quick”. They only need to read and look at the data for columns a and b, whereas, with a row-oriented approach, with storage being block-based, you will read additional data, often the entire dataset.
That’s faster, but for large datasets you need an index to make things “very quick”. This format supports that, but whether to have that is orthogonal to being row/column oriented.
Sum(x) is a better example. Indexing x won’t help when you need all the values.
When b = 'x' is true for many rows and you select * or multiple columns, then it's the opposite, because reading all row data is slower in column based data structures than in row based ones.
IMO it's easier to explain in terms of workload:
- OLTP (T = transactional workloads), row based, for operating on rows
- OLAP (A = analytical workloads), column based, for operating on columns (sum/min/max/...)
Another useful one is aggregations. Think sum(), concat(), max(), etc. You can operate on the column.
This is in contrast to row based. You have to scan the full row, to get a column. Think how you'd usually read a CSV (read line, parse line).
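The layout difference is easy to see in code. A minimal sketch (my illustration) of the same aggregation over the two layouts:

    // Same table, two layouts. The aggregation only needs `price`.
    struct RowStore { rows: Vec<(i64, String, f64)> }  // (id, name, price)
    struct ColStore { ids: Vec<i64>, names: Vec<String>, prices: Vec<f64> }

    fn sum_prices_rows(s: &RowStore) -> f64 {
        // Drags id and name through the cache just to reach each price.
        s.rows.iter().map(|(_, _, p)| p).sum()
    }

    fn sum_prices_cols(s: &ColStore) -> f64 {
        // Sequential scan of exactly the bytes needed; SIMD-friendly.
        s.prices.iter().sum()
    }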
this is messed up
> The decoding performance slowdown of Wasm is minimal (10–30%) compared to a native implementation.
so... you take a 10-30% performance hit _right away_, and you perpetually give up any opportunity to improve the decoder in the future. And you also give up any advanced decoding functions other than "decode whole block and store into memory".
I have no idea why anyone would do this. If you care about speed, then wasm is not going to cut it. If you don't care about speed, you don't need super-fancy encoding algorithms; just use any of the well-known ones.
> so... you take 10%-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future.
The WASM is meant as a backup. If you have the native decoder installed (e.g., as a crate), then a system will prefer to use that; otherwise, it falls back to WASM. A 10-30% performance hit is worth it over not being able to read a file at all.
It even says so right in the abstract:
"Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."
The idea that software I write today can decode a data file written in ten years using new encodings is quite appealing.
And the idea that new software written to make use of the new encodings doesn't have to carry the burden of implementing the whole history of encoders for backwards compatibility likewise.
Now you have code stored in your database, and you don't know what it will do when you execute it.
Sounds very much like the security pain from macros in Excel and Microsoft Word that could do anything.
This is why most PDF readers will ignore any javascript embedded inside PDF files.
Is this relevant in practice? Say I go to a website to download some data, but a malicious actor has injected an evil decoder (that does what exactly?). They could just have injected the wasm into the website I am visiting to get the data!
In fact, wasm was explicitly designed for me to run unverified wasm blobs from random sources safely on my computer.
It gets even better further down the paper!
"In case users prefer native decoding speed over Wasm, F3 plans to offer an option to associate a URL with each Wasm binary, pointing to source code or a precompiled library."
They are not suggesting that the code at the url would be automatically downloaded. It would be up to you to get the code and build it into your application like any other library.
Excel, Word and PDF readers weren’t properly sandboxed.
I kind of agree with you, but there's more to the picture.
The situation you describe is kind of already the case with various approaches to compression. For example, perhaps we decide to bitpack instead of using the generic compressor. Or change compressors entirely.
This sort of thing exists without WASM, and it means you have to "transcode" i.e. rewrite the file after updating your software with the new techniques.
With WASM, it's the same. You just rewrite the file.
I do agree that this pushes the costs of iteration up the stack in a vastly less efficient way. Overall this seems way more expensive, very unclear that future proofing is worth it. I've worked with exabyte-scale systems and re-encoding swaths of data regularly would not be good.
The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).
But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?
The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over Meta's NDA stuff to get access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So the plan fell through after about a year, and then everyone released their own format:
→ Meta's Nimble: https://github.com/facebookincubator/nimble
→ CWI's FastLanes: https://github.com/cwida/FastLanes
→ SpiralDB's Vortex: https://vortex.dev
→ CMU + Tsinghua F3: https://github.com/future-file-format/f3
On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.
I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:
→ Germans: https://github.com/AnyBlox
Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context of why all these multiple formats were born.
Thank you for the explanation! But what a mess.
I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.
... Are you saying that there's 5 competing "universal" file format projects? Each with different non-compatible approaches? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?
Also, back on topic - is your file format encryptable via that WASM embedding?
Kinda off topic but I'm really curious:
Why is there a chess move mentioned towards the end of the paper? (page 24)
CMU-DB vs. MIT-DB: #1 pe4
It took me a while to find the source code, but it's available here: https://github.com/future-file-format/F3
Requiring any implementation who wants to read the data to be able to run webassembly means the format is ill-suited to any environment which aims to reduce dependencies or bloat.
Wasm is pretty simple; Don't confuse it with the "web" part of the acronym. And as others point out; wasm is a backup. You can use native binding without going through wasm, which provide better performance,
it needs a compiler to convert it to machine code, and it also has a non-trivial runtime...
You don't necessarily need to compile; you might want to, but interpreting seems reasonable if size/complexity is a concern.
Is the runtime really that large? I know wasm 2.0, with garbage collection and exceptions, is a bit of a beast, but wasm 1.0? What's needed (I'm speaking from a place of ignorance here, I haven't implemented a WASM runtime)? Some contiguous memory, a stack machine, IEEE float math and some utf-8 operations. I think you can add some reasonable limitations like only a single module and a handful of available imports relevant to the domain.
I know that feature creep would almost inevitably follow, but if someone cares about minimizing complexity it seems possible.
If you view WASM as a worst case fallback, then you could implement a simple interpreter and transcode the file to something that your program has native decoders for. Yes, that might be slow, but at least the data remains usable. At least, that’s my understanding. I don’t have a dog in the hunt.
If you have a native decoder you don't need to run WASM. It's literally in the abstract.
But given a file in this format, you cannot know whether your native decoder can read it without running WASM; the file you read may use a feature that was invented after your library was written.
It seems one of the first file formats that embed WebAssembly modules. Is there any other prior work? I'm specifically interested from the compression perspective as a well-chosen WebAssembly preprocessor can greatly boost the compression ratio. (In fact, I'm currently working on such file format just in case.)
Alan Kay described [1] what he considers the first object oriented system, made in the 60s by an unknown programmer. It was a tape-based storage system, where the "format" of the tape was a set of routines to read, write, etc. at known offsets on the tape.
So, prior art! :)
[1] https://www.cs.tufts.edu/comp/150FP/archive/alan-kay/smallta... (page 4)
How do they prevent people from embedding malicious payloads in the WebAssembly?
By sandboxing:
> We first discuss the implementation considerations of the input to the Wasm-side Init() API call. The isolated linear memory space of Wasm instance is referred to as guest, while the program’s address space running the Wasm instance is referred to as host. The input to a Wasm instance consists of the contiguous bytes of an EncUnit copied from the host’s memory into the guest’s memory, plus any additional runtime options.
> Although research has shown the importance of minimizing the number of memory copies in analytical workloads, we consider the memory copy while passing input to Wasm decoders hard to avoid for several reasons. First, the sandboxed linear memory restricts the guest to accessing only its own memory. Prior work has modified Wasm runtimes to allow access to host memory for reduced copying, but such changes compromise Wasm’s security guarantees
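Concretely, the host-to-guest copy described in that passage looks roughly like this with the wasmtime crate; the "init" export name and signature are my assumption based on the paper's Init() description, and I assume a self-contained decoder module:

    // Host side of the copy-in step, using the wasmtime crate. A real
    // decoder module would also export an allocator rather than letting
    // the host write at offset 0.
    use wasmtime::{Engine, Instance, Module, Store};

    fn call_init(wasm: &[u8], enc_unit: &[u8]) -> anyhow::Result<i32> {
        let engine = Engine::default();
        let module = Module::new(&engine, wasm)?;
        let mut store = Store::new(&engine, ());
        // Assumes a module with no imports.
        let instance = Instance::new(&mut store, &module, &[])?;

        // The guest's linear memory is isolated, so the EncUnit bytes
        // must be copied into it before the decoder can see them.
        let memory = instance
            .get_memory(&mut store, "memory")
            .ok_or_else(|| anyhow::anyhow!("no exported memory"))?;
        let offset = 0usize;
        memory.write(&mut store, offset, enc_unit)?;

        let init = instance.get_typed_func::<(i32, i32), i32>(&mut store, "init")?;
        Ok(init.call(&mut store, (offset as i32, enc_unit.len() as i32))?)
    }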
It's "the future work". But possibly, allowlists might help:
> Security concerns. Despite the sandbox design of Wasm, it still has vulnerabilities, especially with the discovery of new attack techniques. [...] We believe there are opportunities for future work to improve Wasm security. One approach is for creators of Wasm decoding kernels to register their Wasm modules in a central repository to get the Wasm modules verified and tamper-resistant.
The only issue the article seems to raise is that their solution isn't that optimal because there's redundant copying of the input data into the sandbox, but this enables the sandbox to be secure as the Wasm code can't modify data outside of its sandbox. I'd assume the memory is protected at the CPU level, something akin to virtualisation. But then there's maybe some side-channel attack it could use to extract outside data somehow?
Because sandboxing has worked so well in preventing security issues?
For one, sandboxing can't solve the halting problem.
So you're suggesting a denial of service? I doubt the sandbox is running on cooperative scheduling. Ultimately the VM can be run for some number of cycles, enough to get work done but few enough to permit other programs to run, and if the decoder doesn't seem to be making any progress the user or an automated process can terminate it. Something like this already happens when js code takes up too much time in a browser and a little pop-down shows to let you stop it.
You can if you bound the program in time and space. In particular if you put a timeout on a computation you know for a fact it will halt at some point before the timeout.
Anything by my favorite Andy Pavlo is good in my book
Do we have a vetted 'optimal' format for column (this) and row (sqlite?) yet? I have to switch between the two all day; we have an in-memory format of our own that works, but nothing standard.
The embedded WASM makes this more extensible, but still, they have to choose, and freeze forever, interfaces that the WASM blocks can implement. Chances are that will, at some time, become a pain point.
They also have to specify the WASM version. In 6 years, we already have 3 versions there (https://webassembly.org/specs/). Are those 100% backwards compatible?
A format requiring a program to decode is nuts. Might as well bundle 7zip with every zip file.
not a bad idea if your average file size is a terabyte and there’s never a sure way to get a clean binary.
Didn’t this used to be done? I’m sure I remember finding .exe zip files and being surprised it wasn’t just a virus.
Self-extracting executables (zip files with the decoder included) have been around for decades.
7zip has a feature called `7-Zip self-extracting (SFX) archive`.
Inertia is a thing. Parquet and ORC and all the myriad of other formats all exist with ecosystems and things. For this to succeed they’d have to seed the community with connectors and libraries and hooks into things to make it easy to use.
Plugins for DuckDB, for example, to read and write to it. Heck, if you got Iceberg to use it instead of parquet, that could be a win.
Sometimes tech doesn’t win on the merits because of how entrenched existing stuff is and the high cost of switching.
This is why parquet is becoming standard even though it is suboptimal and every single high performance db has its own internal format.
If you need an optimized format you build it yourself; if you need a standard you use whatever everyone else is using, unless it is too bad.
These new formats seem like they would only be useful as research to build on when creating a specific tool or database.
> Sometimes tech doesn’t win on the merits because of how entrenched existing stuff is and the high cost of switching.
Yes! Path dependence: https://en.wikipedia.org/wiki/Path_dependence
Why not store the data directly as Arrow files, to allow for mmaping? I see F3 also supports such zero-copy mmap, and skimming through the paper, it actually seems that it uses Arrow buffers, so I wonder what is the difference to directly using Arrow files? (Arrow files is what is being used by HuggingFace datasets.)
The main reason is that arrow files are not compressed at all. So storing everything as Arrow would increase storage size by 10-100x (depending on data)
I think mmapping is less useful when you are expecting to get your data over a network. Not sure though.
For those complaining about the PDF here's the DOI:
https://dl.acm.org/doi/10.1145/3749163
"file format for the future" .... asterisk, footer note "until something better comes along"
If I had a dollar for all the file formats that got deprecated or never took-off over the years...
The neat thing about this is that the client could substitute their own decoder when they detect a supported format.
This would allow them to use a decoder optimized for the current hardware, e.g. multi-threaded.
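A hypothetical sketch of what that substitution could look like: fingerprint the embedded wasm and dispatch to a vetted native decoder when one is known. The registry, the hash, and both decoders here are invented for illustration:

```python
import hashlib

def native_decode(payload: bytes) -> bytes:
    """Stand-in for a hand-optimized (e.g. multi-threaded) native decoder."""
    return payload

def wasm_decode(payload: bytes) -> bytes:
    """Stand-in for executing the file's embedded wasm decoder."""
    return payload

# sha256 of wasm blobs we've vetted -> equivalent native implementation.
KNOWN_DECODERS = {
    hashlib.sha256(b"embedded-decoder-bytes").hexdigest(): native_decode,
}

def pick_decoder(embedded_wasm: bytes):
    # Fall back to running the embedded wasm when we don't recognize it.
    return KNOWN_DECODERS.get(hashlib.sha256(embedded_wasm).hexdigest(), wasm_decode)
```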
This feels like one of those ideas that seems like a good idea during a late-night LLM brainstorming sesh but not so good when you come back to it with a fresh brain in the morning.
LLM = Lager, lager, mead
What kind of embedded WASM malware are we going to be defending against with this?
The obvious concern would be data-dependent backdoors for malicious “decoding”, i.e. correctly decoding ordinary data, but manipulating the decoding of targeted data in some compromising way.
That relies on some rather far-fetched assumptions about what the attacker might reasonably be able to control undetected, and what goals they might reasonably be able to achieve through such low-level data corruption.
Maybe information leakage? Tweak some low-order float bits in the decoded results with high-order bits from data the decoder recognizes as “interesting”?
What's the attack vector in this case? The Wasm is loaded from the file itself. If they can compromise the file then it's cheaper to just compromise the data directly.
What I’m imagining is essentially a supply chain attack: The victim (mistakenly) trusts the attacker to supply an encoder. The encoder appears to function normally, but in fact will subtly leak information smuggled in decoded values of the victim’s data.
Far-fetched, indeed.
Providing an optional, optimized, native decoder which is much faster, but does something wicked when it sees the right data.
Reminds me of Anyblox, which is a columnar data-store file with attached wasm for encode/decode. Super fun read, showed up in the comments for Spiral. https://news.ycombinator.com/item?id=45212960#45214646
14 votes & no comments, 3 mo ago, for AnyBlox: A Framework for Self-Decoding Datasets [pdf] https://gienieczko.com/anyblox-paper https://news.ycombinator.com/item?id=44501743
Nothing screams "future proof" like WASM.
Old websites still run very well.
Is there any reason to believe that a major new browser tech, WASM, will ever have support dropped for its early versions?
Even if the old versions are not as loved as the new ones (i.e., engines optimized for them and immediately ready to execute), emulation methods work wonders and could easily be downloaded on demand by browsers needing to run "old WASM".
I'm quite optimistic for the forwards-compatibility proposed here.
Sure. Just recently Google was pushing to remove XSLT in Chrome. Flash is a bit of a different beast but it died long ago.
However I don't think it matters too much. Here WASM is not targeting the browser, and there are many more runtimes for WASM, in many diverse languages, and they outnumber browsers significantly. It won't die easily.
> Old websites still run very well.
<blink> is gone. In C, gets is gone. It may take only one similar minor change to make huge amounts of data unreadable by newer WASM versions.
Indeed, I guess in a while it is going to look pretty much like the older formats relying on obscure bytecode look now. Possibly the kind of data analytics it targets has some special requirements, but I would generally expect a simple format like CSV (or DSV, more broadly) to be more future-proof.
> in a while it is going to look pretty much like the older formats relying on obscure bytecode look now
Depends how obscure. You can play Z-machine games on pretty much anything now.
I'm looking forward to the announcement of F4 next year.
The irony of being a PDF file
Weird… for a format that says it’s a general future-proof file format, there’s not a single reference to cache-oblivious structures or other structures like VRF
End Adobe.
The embedded decoder may as well be closing in on a full-blown SQL execution engine (as data size dwarfs executable size these days). Push that execution onto memory chips and into SSD/HDD controllers. I think something similar is happening in filesystem development too, where instead of straight access to raw data you may access some execution API over the data. A modern take on the IBM mainframe's filesystem database.
SSD controllers can barely keep up with advertised read/write workloads AFAIK, so it is hard to understand how this could be useful.
data file format for the future [pdf]
:)
tl;dr:
The proliferation of open-source file formats (e.g., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today's.
Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.
I know sandboxes for wasm are very advanced, but decades of file formats with built-in scripting are flashing the danger lights at me.
“This time it’ll be different. Trust us.”
Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode the data on the platform it's running on.
And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.
> Data files are useless without a program that knows how to utilize the data.
As I see it, the point is that the exact details of how the bits are encoded is not really interesting from the perspective of the program reading the data.
Consider a program that reads CSV files and processes the data in them. First column contains a timestamp, second column contains a filename, third column contains a size.
As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.
Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.
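To make the parent's point concrete, here's a rough sketch of such an interface (not the actual F3 API; all names are invented), where the consumer only ever sees typed Arrow batches:

```python
from typing import Iterator, Protocol
import pyarrow as pa

class RowSource(Protocol):
    """Anything that yields (timestamp, filename, size) record batches."""
    def batches(self) -> Iterator[pa.RecordBatch]: ...

def total_size(source: RowSource) -> int:
    # Identical whether the batches were decoded from CSV, 7zip-compressed
    # JSON, or an embedded wasm decoder; only the schema is the contract.
    return sum(
        sum(batch.column("size").to_pylist()) for batch in source.batches()
    )
```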
I think the counterargument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking about, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.
How many different storage format implementations will there realistically be?
> How many different storage format implementations will there realistically be?
Apparently an infinite number, if we go with the approach in the paper /s
It does open up the possibility for specialized compressors for the data in the file, which might be interesting for archiving where improved compression ratio is worth a lot.
That makes sense. I think fundamentally you’re trading off space between the compressed data and the lookup tables stored in your decompression code. I can see that amortizing well if the compressed payloads are large or if there are a lot of payloads with the same distribution of sequences though.
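As an analogy for that trade-off (zstd dictionaries via the `zstandard` package, nothing F3-specific): train the shared lookup tables once, pay their size once, and amortize them across many small payloads with similar statistics. The sample data here is synthetic:

```python
import zstandard as zstd

# Many small, similar payloads; dictionary training wants a decent corpus.
samples = [f"2024-01-{i % 28 + 1:02d},server-{i % 7}.log,bytes={i * 100},ok".encode()
           for i in range(2000)]
dictionary = zstd.train_dictionary(4096, samples)   # the shared tables

compressor = zstd.ZstdCompressor(dict_data=dictionary)
blobs = [compressor.compress(s) for s in samples]   # each frame stays tiny

decompressor = zstd.ZstdDecompressor(dict_data=dictionary)
assert decompressor.decompress(blobs[0]) == samples[0]
```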