the embedded decoder may as well grow into a full-blown SQL execution engine (since data size dwarfs executable size these days). Push that execution onto memory chips and into SSD/HDD controllers. Something similar is happening in filesystem development too, where instead of straight access to raw data you may access some execution API over the data. A modern take on the IBM mainframe's filesystem database.
The proliferation of open-source file formats (e.g., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today.
Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.
Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode data on the platform it's running on.
And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.
> Data files are useless without a program that knows how to utilize the data.
As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.
Consider a program that reads CSV files and processes the data in them. First column contains a timestamp, second column contains a filename, third column contains a size.
As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.
Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.
I think the counter argument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.
How many different storage format implementations will there realistically be?
It does open up the possibility for specialized compressors for the data in the file, which might be interesting for archiving where improved compression ratio is worth a lot.
That makes sense. I think fundamentally you’re trading off space between the compressed data and the lookup tables stored in your decompression code. I can see that amortizing well if the compressed payloads are large or if there are a lot of payloads with the same distribution of sequences though.
Giving it a quick look, seems like they've addressed a lot of the shortcomings of Parquet which is very exciting. In no particular order:
- Parquet metadata is Thrift, but with comments saying "if this field exists, this other field must exist", and no code actually verifying the fact, so I'm pretty sure you could feed it bogus Thrift metadata and crash the reader.
- Parquet metadata must be parsed out, meaning you have to: allocate a buffer, read the metadata bytes, and then dynamically keep allocating a whole bunch of stuff as you parse the metadata bytes, since you don't know the size of the materialized metadata! Too many heap allocations! This file format's Flatbuffers approach seems to solve this as you can interpret Flatbuffer bytes directly.
- The encodings are much more powerful. I think a lot of people in the database community have been saying that we need composable/recursive lightweight encodings for a long time. BtrBlocks was the first such format that was open in my memory, and then FastLanes followed up. Both of these were much better than Parquet by itself, so I'm glad ideas from those two formats are being taken up (a sketch of such a cascade follows this comment).
- Parquet did the Dremel record-shredding thing which just made my brain explode and I'm glad they got rid of it. It seemed to needlessly complicate the format with no real benefit.
- Parquet datapages might contain different numbers of rows, so you have to scan the whole ColumnChunk to find the row you want. Here it seems like you can just jump to the DataPage (IOUnit) you want.
- They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
Overall great improvement, I'm looking forward to this taking over the data analytics space.
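To make the composable-encodings point concrete, here is a minimal sketch (my illustration, not code from F3, BtrBlocks, or FastLanes) of a two-level cascade: dictionary-encode a string column, then run-length encode the resulting code stream.

    // Minimal sketch of a composable "lightweight encoding" cascade:
    // dictionary encoding followed by run-length encoding of the codes.
    // Names and structure are illustrative, not from any real format.
    use std::collections::HashMap;

    // Dictionary-encode: unique strings -> dict, values -> u32 codes.
    fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
        let mut dict: Vec<String> = Vec::new();
        let mut index: HashMap<&str, u32> = HashMap::new();
        let codes = values
            .iter()
            .map(|&v| {
                *index.entry(v).or_insert_with(|| {
                    dict.push(v.to_string());
                    (dict.len() - 1) as u32
                })
            })
            .collect();
        (dict, codes)
    }

    // Run-length encode the code stream into (code, run_length) pairs.
    fn rle_encode(codes: &[u32]) -> Vec<(u32, u32)> {
        let mut runs: Vec<(u32, u32)> = Vec::new();
        for &c in codes {
            match runs.last_mut() {
                Some((prev, n)) if *prev == c => *n += 1,
                _ => runs.push((c, 1)),
            }
        }
        runs
    }

    fn main() {
        let col = ["de", "de", "de", "us", "us", "de"];
        let (dict, codes) = dict_encode(&col);
        let runs = rle_encode(&codes);
        println!("{dict:?} {runs:?}"); // ["de", "us"] [(0, 3), (1, 2), (0, 1)]
    }

A real format would pick the cascade per column chunk based on sampled statistics; the point is that each stage is cheap and the stages compose.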
> - They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
"Heavyweight compression" as in zstd and brotli? That's very useful for columns of non-repeated strings. I get compression ratios in the order of 1% on some of those columns, because they are mostly ASCII and have lots of common substrings.
I think the wasm compiler is going to bring in more dependencies than the ‘heavy’ compression would have.
I think that more expensive compression may have made more of a difference 15 years ago, when CPU was more plentiful compared to network or disk bandwidth.
For compression, has the world settled with zstd now?
I think it’s a pretty common choice when you want compression in a new format or protocol. It works better for compressing chunks of your data rather than large files where you want to maintain some kind of index or random access. Similarly, if you have many chunks then you can parallelise decompression (I’m not sure any kind of parallelism support should have been built in to the zstd format, though it is useful for command line uses).
A big problem for some people is that Java support is hard as it isn’t portable, so e.g. making a Java web server compress its responses with zstd isn’t so easy.
Java can just use native libraries, there are plenty of Java projects that do that.
It's not like it's 1999 and there is still some Sun dogma against doing this.
Depends on the use-case. For transparent filesystem compression I would still recommend lz4 over zstd because speed matters more than compression ratio in that use case.
Most definitely not settled, but it's a good default
https://stackoverflow.com/questions/31812780/append-a-new-co...
> I'm looking forward to this taking over the data analytics space
Parquet is surprisingly arcane. There are a lot of unpleasant and poorly documented details one has to be aware of in order to use it efficiently.
> Wes McKinney
Sold
For those who don't know: Wes McKinney is the creator of Pandas, the go-to tabular analysis library for Python. That gives his format widespread buy-in from the outset, as well as a couple of decades of Caring About The Problem, which makes his insights unusually valuable.
Andy Pavlo also deserves a shout out; he's an authority on databases and lives a "data oriented lifestyle". See his two "What goes around comes around ..." papers for great overviews of the past and future 50 years in databases respectively (as well as all the amazing CMU Seminar Series). I'm excited to see him involved in this.
My apologies to the Chinese coauthors who I'm not familiar with.
Bookmarked for thorough reading!
I've been looking for an article that I think I found 10+ years ago on the web about a very similar topic, something like "the evolution of database systems". It was a very well written article that I just skimmed through and planned to read properly but could never find it again. I remember it had hand-drawn-style diagrams of database architectures with blue backgrounds, sometimes squiggly lines and maybe even some bees on them (could be just my mind mixing it with something). I'd be eternally grateful if someone here could find it.
Not sure if this is what you had in mind but here are the links to the two papers that I was referencing.
[1] covers the first 40 years of databases. [2] fills in the gap of the last 20 years and gives their thoughts on the future.
My apologies, the first one was actually Stonebraker & Hellerstein and didn't involve Pavlo. They're both excellent papers though for anyone working with data.
Stonebraker, for those who don't know, is the creator of Postgres, a database you might have heard of.
1: Stonebraker & Hellerstein, "What Goes Around Comes Around", 2005, https://people.csail.mit.edu/tdanford/6830papers/stonebraker...
2: Stonebraker & Pavlo, "What Goes Around Comes Around... And Around...", 2024, https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec20...
I'm a big fan of Wes' work and Pandas was incredibly influential. However technically it wasn't his best work. In terms of selling points, I think that the Arrow data format is technically much better and more influential on the data ecosystem as a whole, see DataFusion, etc...
That said, now let me see what F3 is actually about (and yes, your comment is what actually made me want to click through to the link) ...
Also the creator of Apache Arrow, a core component of modern data analytics.
His work on Parquet probably stands out as a better appeal to authority
Mixing data and code is a classic security mistake. Having one somewhat known individual involved doesn’t magically make it less of a mistake.
I was also concerned about the wasted overhead. However I guess it's just there for compatibility (since space is cheap) and for common encodings you'll be able to skip reading it with range requests and use your trusted codec to decode the data. Smart move imho.
I’m not concerned about the overhead; there are always more and larger pieces of iron. Still not a good idea to mix executable code with data.
It really depends on the order of priorities. If the overall goal is to allow digital archeologists to make sense of some file they found, it would be prudent to give them some instructions on how it is decoded.
I just hope that people will not just execute that code in an unconfined environment.
Hope is not an adequate security best practice. ;)
Not in physicists' future! /jk. Exabytes of data produced in the next two decades at the Large Hadron Collider will be stored in a format homemade by CERN: https://cds.cern.ch/record/2923186/
CERN has been dealing with large data sets since before most of the people reading this were born.
https://cds.cern.ch/record/2296399/files/zebra.pdf
Forgive me for not understanding the difference between columnar storage and otherwise, but why is this so cool? Is the big win that you can send around custom little vector embedding databases with a built in sandbox?
I get the "step change/foundational building block" vibe from the paper - and the project name itself implies more than a little "je ne sais quoi" - but unfortunately I only understand a few sentences per page. The pictures are pretty though, and the colour choices are tasteful yet bold. Two thumbs up from the easily swayed.
> Is the big win that you can send around custom little vector embedding databases with a built in sandbox?
No, this is a compatibility layer for future encoding changes.
For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support and then finally flip the writers to write the new format.
Specifically, there was a new flipped bit version of float encoding which sent the exponent, mantissa and sign as integers for maximum compression - this would've been so much easier to ship if I could ship a wasm shim with the new file and skip the year+ wait for all readers to support it.
We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.
Today, something like Spark's variant type would benefit from this - the sub-columnarization that it does would be so much easier to ship as bytecode instead of as an interpreter that contains support for all possible recombinations from split up columns.
PS: having spent a lot of nights tweaking tpc-h with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold up those bits in the benchmark
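For intuition, the float-splitting idea is plain bit twiddling. A hedged sketch of the general technique (my illustration, not the actual ORCv2 proposal):

    // Illustrative only: split each f32 into sign / exponent / mantissa
    // streams so that each stream can be encoded as small integers.
    fn split_floats(xs: &[f32]) -> (Vec<u8>, Vec<u8>, Vec<u32>) {
        let (mut signs, mut exps, mut mants) = (Vec::new(), Vec::new(), Vec::new());
        for x in xs {
            let bits = x.to_bits();                  // raw IEEE-754 layout
            signs.push((bits >> 31) as u8);          // 1 sign bit
            exps.push(((bits >> 23) & 0xFF) as u8);  // 8 exponent bits
            mants.push(bits & 0x007F_FFFF);          // 23 mantissa bits
        }
        (signs, exps, mants)
    }

Runs of similar-magnitude values then yield nearly constant exponent streams, which the lightweight integer encodings squeeze very well.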
I'm less confident. Your description highlights a real problem, but this particular solution looks like an attempt to shoehorn a technical solution onto a political/people problem. It feels like one of those great ideas that, years later, results in thousands of different decoders, breakages, and a maintenance nightmare. Then someone starts an initiative to move away from bundled decoding and back to just defining the data format.
Sometimes the best option is to do the hard political work: improve the standard and get everyone moving with it. People have pushed Parquet and Arrow, which are absolutely great technologies that I use regularly, but 8 years after someone asked how to write Parquet in Java, the best answer is still to use DuckDB: https://stackoverflow.com/questions/47355038/how-to-generate...
Not having a good Parquet writer for Java shows a poor attempt at pushing forward a standard. Similarly, Arrow has problems in Java land. If they can't be bothered to consider how to actually implement and roll out standards in a top-5 language, I'm not sure throwing WASM into the mix will fix it.
Columnar storage is great because a complex nested schema can be decomposed into its leaf values and stored as primitives. You can directly access leaf values, avoiding a ton of IO and parsing. Note that all these formats are actually partitioned into groups of rows at the top level.
One big win here is that it's possible to get Apache Arrow buffers directly from the data pages - either by using the provided WASM or bringing a native decoder.
In Parquet this is currently very complicated. Parquet uses the Dremel encoding which stores primitive values alongside two streams of integers (repetition and definition levels) that drive a state machine constructed from the schema to reconstruct records. Even getting those integer streams is hard - Parquet has settled on “RLE” which is a mixture of bit-packing and run-length encoding and the reference implementation uses 74,000 lines of generated code just for the bit-packing part.
So to get Arrow buffers from Parquet is a significant amount of work. F3 should make this much easier and future-proof.
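For readers who haven't met Dremel shredding, here is a tiny worked example of those two integer streams (my construction, following the published Dremel rules for a single repeated field):

    // Schema: message Doc { repeated binary tags; }  (max rep = max def = 1)
    // Rows:   ["a", "b"]    []    ["c"]
    //
    // values:  a  b  -  c         (the empty list stores no value)
    // rep:     0  1  0  0         (0 = first entry of a new row)
    // def:     1  1  0  1         (0 = the list itself is empty)
    //
    // A reader replays rep/def through a state machine derived from the
    // schema to reassemble the nested rows from the flat value stream.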
One of the suggested wins here is random access to metadata. When using GeoParquet I index the metadata in SQLite, otherwise it would take about 10 minutes as opposed to a few milliseconds to run a spatial query on e.g. Overture Maps - I'd need to parse the footer of ~500 files meaning ~150MB of Thrift would need to be parsed and queried.
However the choice of Google's Flatbuffers is an odd one. Memory safety with FlatBuffers is a known problem [1]. I seriously doubt that generated code that must mitigate these threats will show any real-world performance benefits. Actually - why not just embed a SQLite database?
[1] https://rustsec.org/advisories/RUSTSEC-2021-0122.html
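For reference, the SQLite workaround mentioned above can be as small as one table mapping each file/row-group to its bounding box, populated once by parsing the footers. A hedged sketch using the rusqlite crate; the table and column names are my invention:

    // One-time index of per-row-group bounding boxes, so a spatial query
    // doesn't have to re-parse ~500 Parquet footers.
    use rusqlite::{Connection, Result};

    fn create_index(conn: &Connection) -> Result<()> {
        conn.execute_batch(
            "CREATE TABLE IF NOT EXISTS rg_index (
                 file      TEXT NOT NULL,
                 row_group INTEGER NOT NULL,
                 xmin REAL, ymin REAL, xmax REAL, ymax REAL
             );
             CREATE INDEX IF NOT EXISTS rg_bbox
                 ON rg_index (xmin, xmax, ymin, ymax);",
        )
    }

    fn add_row_group(
        conn: &Connection,
        file: &str,
        row_group: i64,
        bbox: (f64, f64, f64, f64),
    ) -> Result<()> {
        conn.execute(
            "INSERT INTO rg_index VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
            (file, row_group, bbox.0, bbox.1, bbox.2, bbox.3),
        )?;
        Ok(())
    }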
Afaik the win of columnar storage comes from the fact that you can very quickly scan the entire column across all rows, making very efficient use of OS buffering etc., so queries like select a where b = 'x' are very quick.
> so queries like select a where b = 'x' are very quick
I wouldn’t say “very quick”. They only need to read and look at the data for columns a and b, whereas, with a row-oriented approach, with storage being block-based, you will read additional data, often the entire dataset.
That’s faster, but for large datasets you need an index to make things “very quick”. This format supports that, but whether to have that is orthogonal to being row/column oriented.
Sum(x) is a better example. Indexing x won’t help when you need all the values.
When b = 'x' is true for many rows and you select * or multiple columns, then it's the opposite, because reading all row data is slower in column based data structures than in row based ones.
IMO it's easier to explain in terms of workload:
- OLTP (T = transactional workloads), row based, for operating on rows
- OLAP (A = analytical workloads), column based, for operating on columns (sum/min/max/...)
Another useful one is aggregations. Think sum(), concat(), max(), etc. You can operate on the column.
This is in contrast to row based. You have to scan the full row, to get a column. Think how you'd usually read a CSV (read line, parse line).
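The layout difference is easy to see in code. A minimal sketch (my illustration) of the same aggregation over the two layouts:

    // Same table, two layouts. The aggregation only needs `price`.
    struct RowStore { rows: Vec<(i64, String, f64)> }  // (id, name, price)
    struct ColStore { ids: Vec<i64>, names: Vec<String>, prices: Vec<f64> }

    fn sum_prices_rows(s: &RowStore) -> f64 {
        // Drags id and name through the cache just to reach each price.
        s.rows.iter().map(|(_, _, p)| p).sum()
    }

    fn sum_prices_cols(s: &ColStore) -> f64 {
        // Sequential scan of exactly the bytes needed; SIMD-friendly.
        s.prices.iter().sum()
    }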
this is messed up
> The decoding performance slowdown of Wasm is minimal (10–30%) compared to a native implementation.
so... you take a 10-30% performance hit _right away_, and you perpetually give up any opportunity to improve the decoder in the future. And you also give up any advanced decoding functions other than "decode whole block and store into memory".
I have no idea why anyone would do this. If you care about speed, then wasm is not going to cut it. If you don't care about speed, you don't need super-fancy encoding algorithms; just use any of the well-known ones.
> so... you take 10%-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future.
The WASM is meant as a backup. If you have the native decoder installed (e.g., as a crate), then a system will prefer to use that; otherwise, it falls back to WASM. A 10-30% performance hit is worth it over not being able to read a file at all.
It even says so right in the abstract:
"Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."
The idea that software I write today can decode a data file written in ten years using new encodings is quite appealing.
And the idea that new software written to make use of the new encodings doesn't have to carry the burden of implementing the whole history of encoders for backwards compatibility likewise.
Now you have code stored in your database, and you don't know what it will do when you execute it.
Sounds very much like the security pain from macros in Excel and Microsoft Word that could do anything.
This is why most PDF readers will ignore any javascript embedded inside PDF files.
Is this relevant in practice? Say I go to a website to download some data, but a malicious actor has injected an evil decoder (that does what exactly?). They could just have injected the wasm into the website I am visiting to get the data!
In fact, wasm was explicitly designed for me to run unverified wasm blobs from random sources safely on my computer.
It gets even better further down the paper!
"In case users prefer native decoding speed over Wasm, F3 plans to offer an option to associate a URL with each Wasm binary, pointing to source code or a precompiled library."
They are not suggesting that the code at the url would be automatically downloaded. It would be up to you to get the code and build it into your application like any other library.
Excel, Word and PDF readers weren’t properly sandboxed.
I kind of agree with you, but there's more to the picture.
The situation you describe is kind of already the case with various approaches to compression. For example, perhaps we decide to bitpack instead of using the generic compressor. Or change compressors entirely.
This sort of thing exists without WASM, and it means you have to "transcode" i.e. rewrite the file after updating your software with the new techniques.
With WASM, it's the same. You just rewrite the file.
I do agree that this pushes the costs of iteration up the stack in a vastly less efficient way. Overall this seems way more expensive, very unclear that future proofing is worth it. I've worked with exabyte-scale systems and re-encoding swaths of data regularly would not be good.
The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).
But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?
The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over Meta's NDA stuff to get access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So the plan fell through after about a year, and then everyone released their own format:
→ Meta's Nimble: https://github.com/facebookincubator/nimble
→ CWI's FastLanes: https://github.com/cwida/FastLanes
→ SpiralDB's Vortex: https://vortex.dev
→ CMU + Tsinghua F3: https://github.com/future-file-format/f3
On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.
I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:
→ Germans: https://github.com/AnyBlox
Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context of why all these multiple formats were born.
Thank you for the explanation! But what a mess.
I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.
... Are you saying that there's 5 competing "universal" file format projects? Each with different non-compatible approaches? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?
Also, back on topic - is your file format encryptable via that WASM embedding?
Kinda off topic but I'm really curious:
Why is there a chess move mentioned towards the end of the paper? (page 24)
CMU-DB vs. MIT-DB: #1 pe4
It took me a while to find the source code, but it's available here: https://github.com/future-file-format/F3
Requiring any implementation who wants to read the data to be able to run webassembly means the format is ill-suited to any environment which aims to reduce dependencies or bloat.
Wasm is pretty simple; Don't confuse it with the "web" part of the acronym. And as others point out; wasm is a backup. You can use native binding without going through wasm, which provide better performance,
it needs a compiler to convert it to machine code, and it also has a non-trivial runtime...
You don't necessarily need to compile; you might want to, but interpreting seems reasonable if size/complexity is a concern.
Is the runtime really that large? I know wasm 2.0, with garbage collection and exceptions, is a bit of a beast, but wasm 1.0? What's needed (I'm speaking from a place of ignorance here, I haven't implemented a WASM runtime)? Some contiguous memory, a stack machine, IEEE float math and some utf-8 operations. I think you can add some reasonable limitations like only a single module and a handful of available imports relevant to the domain.
I know that feature creep would almost inevitably follow, but if someone cares about minimizing complexity it seems possible.
If you view WASM as a worst case fallback, then you could implement a simple interpreter and transcode the file to something that your program has native decoders for. Yes, that might be slow, but at least the data remains usable. At least, that’s my understanding. I don’t have a dog in the hunt.
If you have a native decoder you don't need to run WASM. It's literally in the abstract.
But given a file in this format, you cannot know whether your native decoder can read it without running WASM; the file you read may use a feature that was invented after your library was written.
It seems one of the first file formats that embed WebAssembly modules. Is there any other prior work? I'm specifically interested from the compression perspective as a well-chosen WebAssembly preprocessor can greatly boost the compression ratio. (In fact, I'm currently working on such file format just in case.)
Alan Kay described [1] what he considers the first object oriented system, made in the 60s by an unknown programmer. It was a tape-based storage system, where the "format" of the tape was a set of routines to read, write, etc. at known offsets on the tape.
So, prior art! :)
[1] https://www.cs.tufts.edu/comp/150FP/archive/alan-kay/smallta... (page 4)
How do they prevent people from embedding malicious payloads in the WebAssembly?
By sandboxing:
> We first discuss the implementation considerations of the input to the Wasm-side Init() API call. The isolated linear memory space of Wasm instance is referred to as guest, while the program’s address space running the Wasm instance is referred to as host. The input to a Wasm instance consists of the contiguous bytes of an EncUnit copied from the host’s memory into the guest’s memory, plus any additional runtime options.
> Although research has shown the importance of minimizing the number of memory copies in analytical workloads, we consider the memory copy while passing input to Wasm decoders hard to avoid for several reasons. First, the sandboxed linear memory restricts the guest to accessing only its own memory. Prior work has modified Wasm runtimes to allow access to host memory for reduced copying, but such changes compromise Wasm’s security guarantees
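Concretely, the host-to-guest copy described in that passage looks roughly like this with the wasmtime crate; the "init" export name and signature are my assumption based on the paper's Init() description, and I assume a self-contained decoder module:

    // Host side of the copy-in step, using the wasmtime crate. A real
    // decoder module would also export an allocator rather than letting
    // the host write at offset 0.
    use wasmtime::{Engine, Instance, Module, Store};

    fn call_init(wasm: &[u8], enc_unit: &[u8]) -> anyhow::Result<i32> {
        let engine = Engine::default();
        let module = Module::new(&engine, wasm)?;
        let mut store = Store::new(&engine, ());
        // Assumes a module with no imports.
        let instance = Instance::new(&mut store, &module, &[])?;

        // The guest's linear memory is isolated, so the EncUnit bytes
        // must be copied into it before the decoder can see them.
        let memory = instance
            .get_memory(&mut store, "memory")
            .ok_or_else(|| anyhow::anyhow!("no exported memory"))?;
        let offset = 0usize;
        memory.write(&mut store, offset, enc_unit)?;

        let init = instance.get_typed_func::<(i32, i32), i32>(&mut store, "init")?;
        Ok(init.call(&mut store, (offset as i32, enc_unit.len() as i32))?)
    }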
It's "the future work". But possibly, allowlists might help:
> Security concerns. Despite the sandbox design of Wasm, it still has vulnerabilities, especially with the discovery of new attack techniques. [...] We believe there are opportunities for future work to improve Wasm security. One approach is for creators of Wasm decoding kernels to register their Wasm modules in a central repository to get the Wasm modules verified and tamper-resistant.
The only issue the article seems to raise is that their solution isn't that optimal because there's redundant copying of the input data into the sandbox, but this enables the sandbox to be secure as the Wasm code can't modify data outside of its sandbox. I'd assume the memory is protected at the CPU level, something akin to virtualisation. But then there's maybe some side-channel attack it could use to extract outside data somehow?
Because sandboxing has worked so well in preventing security issues?
For one, sandboxing can't solve the halting problem.
So you're suggesting a denial of service? I doubt the sandbox is running on cooperative scheduling. Ultimately the VM can be run for some number of cycles, enough to get work done but few enough to permit other programs to run, and if the decoder doesn't seem to be making any progress the user or an automated process can terminate it. Something like this already happens when js code takes up too much time in a browser and a little pop-down shows to let you stop it.
You can if you bound the program in time and space. In particular if you put a timeout on a computation you know for a fact it will halt at some point before the timeout.
Anything by my favorite Andy Pavlo is good in my book
Do we have a vetted 'optimal' format for column (this) and row (sqlite?) yet? I have to switch between the two all day; we have an in-memory format of our own that works, but nothing standard.
The embedded WASM makes this more extensible, but still, they have to choose, and freeze forever, interfaces that the WASM blocks can implement. Chances are that will, at some time, become a pain point.
They also have to specify the WASM version. In 6 years, we already have 3 versions there (https://webassembly.org/specs/). Are those 100% backwards compatible?
A format requiring a program to decode is nuts. Might as well bundle 7zip with every zip file.
not a bad idea if your average file size is a terabyte and there’s never a sure way to get a clean binary.
Didn’t this used to be done? I’m sure I remember finding .exe zip files and being surprised it wasn’t just a virus.
Self-extracting executables (zip files with the decoder included) have been around for decades.
7zip has a feature called `7-Zip self-extracting (SFX) archive`.
Inertia is a thing. Parquet and ORC and all the myriad of other formats all exist with ecosystems and things. For this to succeed they’d have to seed the community with connectors and libraries and hooks into things to make it easy to use.
Plugins for DuckDB, for example, to read and write to it. Heck, if you got Iceberg to use it instead of parquet, that could be a win.
Sometimes tech doesn’t win on the merits because of how entrenched existing stuff is and the high cost of switching.
This is why parquet is becoming standard even though it is suboptimal and every single high performance db has its own internal format.
If you need an optimized format you build it yourself; if you need a standard you use whatever everyone else is using, unless it is too bad.
These new formats seem like they would only be useful as research to build on when creating a specific tool or database.
> Sometimes tech doesn’t win on the merits because of how entrenched existing stuff is and the high cost of switching.
Yes! Path dependence: https://en.wikipedia.org/wiki/Path_dependence
Why not store the data directly as Arrow files, to allow for mmaping? I see F3 also supports such zero-copy mmap, and skimming through the paper, it actually seems that it uses Arrow buffers, so I wonder what is the difference to directly using Arrow files? (Arrow files is what is being used by HuggingFace datasets.)
The main reason is that arrow files are not compressed at all. So storing everything as Arrow would increase storage size by 10-100x (depending on data)
I think mmapping is less useful when you are expecting to get your data over a network. Not sure though.
For those complaining about the PDF here's the DOI:
https://dl.acm.org/doi/10.1145/3749163
"file format for the future" .... asterisk, footer note "until something better comes along"
If I had a dollar for all the file formats that got deprecated or never took-off over the years...
The neat thing about this is that the client could substitute their own decoder when they detect a supported format.
This would allow them to use a decoder optimized for the current hardware, e.g. multi-threaded.
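A hypothetical sketch of what that substitution could look like: fingerprint the embedded wasm and dispatch to a vetted native decoder when one is known. The registry, the hash, and both decoders here are invented for illustration:

```python
import hashlib

def native_decode(payload: bytes) -> bytes:
    """Stand-in for a hand-optimized (e.g. multi-threaded) native decoder."""
    return payload

def wasm_decode(payload: bytes) -> bytes:
    """Stand-in for executing the file's embedded wasm decoder."""
    return payload

# sha256 of wasm blobs we've vetted -> equivalent native implementation.
KNOWN_DECODERS = {
    hashlib.sha256(b"embedded-decoder-bytes").hexdigest(): native_decode,
}

def pick_decoder(embedded_wasm: bytes):
    # Fall back to running the embedded wasm when we don't recognize it.
    return KNOWN_DECODERS.get(hashlib.sha256(embedded_wasm).hexdigest(), wasm_decode)
```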
This feels like one of those ideas that seems like a good idea during a late-night LLM brainstorming sesh but not so good when you come back to it with a fresh brain in the morning.
LLM = Lager, lager, mead
What kind of embedded WASM malware are we going to be defending against with this?
The obvious concern would be data-dependent backdoors for malicious “decoding”, i.e. correctly decoding ordinary data, but manipulating the decoding of targeted data in some compromising way.
That relies on some rather far-fetched assumptions about what the attacker might reasonably be able to control undetected, and what goals they might reasonably be able to achieve through such low-level data corruption.
Maybe information leakage? Tweak some low-order float bits in the decoded results with high-order bits from data the decoder recognizes as “interesting”?
What's the attack vector in this case? The Wasm is loaded from the file itself. If they can compromise the file then it's cheaper to just compromise the data directly.
What I’m imagining is essentially a supply chain attack: The victim (mistakenly) trusts the attacker to supply an encoder. The encoder appears to function normally, but in fact will subtly leak information smuggled in decoded values of the victim’s data.
Far-fetched, indeed.
Providing an optional, optimized, native decoder which is much faster, but does something wicked when it sees the right data.
Reminds me of Anyblox, which is a columnar data-store file with attached wasm for encode/decode. Super fun read, showed up in the comments for Spiral. https://news.ycombinator.com/item?id=45212960#45214646
14 votes & no comments, 3 mo ago, for AnyBlox: A Framework for Self-Decoding Datasets [pdf] https://gienieczko.com/anyblox-paper https://news.ycombinator.com/item?id=44501743
Nothing screams "future proof" like WASM.
Old websites still run very well.
Is there any reason to believe that a major new browser tech, WASM, will ever have support dropped for its early versions?
Even if the old versions are not as loved as the new ones (i.e., engines optimized for them and immediately ready to execute), emulation methods work wonders and could easily be downloaded on demand by browsers needing to run "old WASM".
I'm quite optimistic for the forwards-compatibility proposed here.
Sure. Just recently Google was pushing to remove XSLT in Chrome. Flash is a bit of a different beast but it died long ago.
However I don't think it matters too much. Here WASM is not targeting the browser, and there are many more runtimes for WASM, in many diverse languages, and they outnumber browsers significantly. It won't die easily.
> Old websites still run very well.
<blink> is gone. In C, gets is gone. It may take only one similar minor change to make huge amounts of data unreadable by newer WASM versions.
Indeed, I guess in a while it is going to look pretty much like the older formats relying on obscure bytecode look now. Possibly the kind of data analytics it targets has some special requirements, but I would generally expect a simple format like CSV (or DSV, more broadly) to be more future-proof.
> in a while it is going to look pretty much like the older formats relying on obscure bytecode look now
Depends how obscure. You can play Z-machine games on pretty much anything now.
I'm looking forward to the announcement of F4 next year.
The irony of being a PDF file
Weird… for a format that says it’s a general future-proof file format, there’s not a single reference to cache-oblivious structures or other structures like VRF
End Adobe.
The embedded decoder may as well be closing in on a full-blown SQL execution engine (as data size dwarfs executable size these days). Push that execution onto memory chips and into SSD/HDD controllers. I think something similar is happening in filesystem development too, where instead of straight access to raw data you may access some execution API over the data. A modern take on the IBM mainframe's filesystem database.
SSD controllers can barely keep up with advertised read/write workloads AFAIK, so it is hard to understand how this could be useful.
data file format for the future [pdf]
:)
tl;dr:
The proliferation of open-source file formats (e.g., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today's.
Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.
I know sandboxes for wasm are very advanced, but decades of file formats with built-in scripting are flashing the danger lights at me.
“This time it’ll be different. Trust us.”
Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode the data on the platform it's running on.
And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.
> Data files are useless without a program that knows how to utilize the data.
As I see it, the point is that the exact details of how the bits are encoded is not really interesting from the perspective of the program reading the data.
Consider a program that reads CSV files and processes the data in them. First column contains a timestamp, second column contains a filename, third column contains a size.
As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.
Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.
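To make the parent's point concrete, here's a rough sketch of such an interface (not the actual F3 API; all names are invented), where the consumer only ever sees typed Arrow batches:

```python
from typing import Iterator, Protocol
import pyarrow as pa

class RowSource(Protocol):
    """Anything that yields (timestamp, filename, size) record batches."""
    def batches(self) -> Iterator[pa.RecordBatch]: ...

def total_size(source: RowSource) -> int:
    # Identical whether the batches were decoded from CSV, 7zip-compressed
    # JSON, or an embedded wasm decoder; only the schema is the contract.
    return sum(
        sum(batch.column("size").to_pylist()) for batch in source.batches()
    )
```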
I think the counterargument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking about, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.
How many different storage format implementations will there realistically be?
> How many different storage format implementations will there realistically be?
Apparently an infinite number, if we go with the approach in the paper /s
It does open up the possibility for specialized compressors for the data in the file, which might be interesting for archiving where improved compression ratio is worth a lot.
That makes sense. I think fundamentally you’re trading off space between the compressed data and the lookup tables stored in your decompression code. I can see that amortizing well if the compressed payloads are large or if there are a lot of payloads with the same distribution of sequences though.
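As an analogy for that trade-off (zstd dictionaries via the `zstandard` package, nothing F3-specific): train the shared lookup tables once, pay their size once, and amortize them across many small payloads with similar statistics. The sample data here is synthetic:

```python
import zstandard as zstd

# Many small, similar payloads; dictionary training wants a decent corpus.
samples = [f"2024-01-{i % 28 + 1:02d},server-{i % 7}.log,bytes={i * 100},ok".encode()
           for i in range(2000)]
dictionary = zstd.train_dictionary(4096, samples)   # the shared tables

compressor = zstd.ZstdCompressor(dict_data=dictionary)
blobs = [compressor.compress(s) for s in samples]   # each frame stays tiny

decompressor = zstd.ZstdDecompressor(dict_data=dictionary)
assert decompressor.decompress(blobs[0]) == samples[0]
```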