One thing I found interesting is the logical type system doesn't seem to include sum types or unions, unlike Arrow etc.
I'd generally encourage new type systems to include sum types as a first-class concept.
I wonder if a columnar storage format should implement sum types with a struct of arrays where only one array has a non-null value for each index.
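To make the struct-of-arrays idea concrete, here is a minimal sketch in plain Python. It is not how Vortex or Arrow implement anything; the variant names (`int`, `str`) and both helper functions are made up for illustration. Each variant gets its own nullable child array, and exactly one child is non-null at every index.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical variant names, chosen only for this illustration.
VARIANTS = ("int", "str")


def encode_sum_column(values: List[Tuple[str, Any]]) -> Dict[str, List[Optional[Any]]]:
    """Split (tag, payload) pairs into one nullable child array per variant."""
    columns: Dict[str, List[Optional[Any]]] = {
        name: [None] * len(values) for name in VARIANTS
    }
    for i, (tag, payload) in enumerate(values):
        if tag not in columns:
            raise ValueError(f"unknown variant: {tag}")
        columns[tag][i] = payload
    return columns


def decode_sum_column(columns: Dict[str, List[Optional[Any]]]) -> List[Tuple[str, Any]]:
    """Rebuild (tag, payload) pairs, checking that exactly one child is set per row."""
    length = len(next(iter(columns.values())))
    rows = []
    for i in range(length):
        present = [(name, col[i]) for name, col in columns.items() if col[i] is not None]
        if len(present) != 1:
            raise ValueError(f"row {i} has {len(present)} non-null variants")
        rows.append(present[0])
    return rows


rows = [("int", 1), ("str", "a"), ("int", 3)]
cols = encode_sum_column(rows)
print(cols)  # {'int': [1, None, 3], 'str': [None, 'a', None]}
```

One weakness of this layout is that a variant whose payload is itself null cannot be told apart from an absent variant, which is presumably why Arrow's union types carry an explicit per-row type id alongside the child arrays.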
The cuDF interop in the roadmap [1] will be huge for my workloads. XGBoost has the fastest inference time on GPUs, so a fast path straight from these Vortex files to GPU memory seems promising.
[1] https://github.com/vortex-data/vortex/issues/2116
Can you explain how it’s faster? GPU memory is just a blob with an address. Is it because the loading algorithms for vortex align better with XGBoost or just plain uploading to the GPU?
Can you append new columns to a file stored on disk without reading it all into memory? Somehow this is beyond Parquet's capabilities.
The default writer will decompress the values, however, right now you can implement your own write strategy that will avoid doing it. We plan on adding that as an option since it’s quite common.
how does this compare to Arrow IPC / Feather v2?
It is wildly more complex
I've never understood why people say Feather file format isn't meant for "long-term" storage and prefer Parquet for that. Access is much faster from Feather, compression better with Parquet but Feather is really good.
Honestly I think Arrow makes Feather redundant. To answer your question, Parquet is optimized for storage on disk: it can compress data to take less space, and may include clever tricks or some form of indices for querying data from the file. Feather, on the other hand, is optimized for loading into memory. It uses the same representation on disk as it does in memory, with very little in the way of compression (if any). It is not optimized for disk at all. BUT you can memory-map a Feather file and randomly access any part of it in O(1) time (I believe, but do your own due diligence :)
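A rough illustration of why "same representation on disk as in memory" gives constant-time random access, using only the Python standard library. This is not the Feather or Arrow IPC layout; the file here is just a hypothetical column of raw little-endian int64 values, and `write_column` / `read_value` are names invented for the sketch. The point is that with fixed-width, uncompressed values the byte offset of element `i` is simply `i * 8`, so a memory-mapped read touches only the page it needs.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<q")  # one little-endian int64 per value (assumed layout)


def write_column(path, values):
    """Write values back to back with no header and no compression."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(v))


def read_value(path, index):
    """Memory-map the file and read one element at a computed offset."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ITEM.unpack_from(mm, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, range(1000))
value = read_value(path, 742)
print(value)  # 742
```

Once a general-purpose compressor or a variable-width encoding is applied, that offset arithmetic no longer works without extra index structures, which is the trade-off being described above.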
Vortex and Lance both seem really cool but will have to infiltrate either the Delta or Iceberg specs to become mainstream.
Can’t wait for https://github.com/apache/iceberg/issues/12225 to merge so there’s an api to integrate against
Can we stop with the cringe emojis at the start of every heading?
I guess not surprising from a project that combines Polars & Vortex
How does this compare with delta lake and iceberg?
Vortex is a file format, whereas Delta Lake and Iceberg are table formats. It should be compared to Parquet rather than to Delta Lake and Iceberg. This guest lecture by a maintainer of Vortex provides a good overview of the file format, the motivations for its creation, and its key features.
https://www.youtube.com/watch?v=zyn_T5uragA
The website could use a comparison with Parquet and a clearer motivation (beyond just stating it's 100x better).
Agreed, really need a tl;dr here, because Parquet is boring technology. Going to require quite the sales pitch to move. At minimum, I assume it will be years before I could expect native integration in pandas/polars/etc which would make it low effort enough to consider.
Parquet is ..fine, I guess. It is good enough. Why invoke churn? Sell me on the vision.
DuckDB just added support for Vortex in their last release using the Vortex Python package, so hopefully other tools won't be too far behind.
> Going to require quite the sales pitch to move.
Mutability would be one such pitch I would like to see ...
I think it would still make sense to compare with those table formats, or is the idea that you would only use this if you could not use a table format?
That’s like comparing words with characters.
Vortex is, roughly, how you save data to files and Iceberg is the database-like manager of those files. You’ll soon be able to run Iceberg using Vortex because they are complementary, not competing, technologies.
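A toy sketch of that division of labor, in plain Python. It does not reflect Iceberg's actual metadata layout or any Vortex API; the `Snapshot` and `Table` classes and the file names are hypothetical. The table layer only tracks which data files belong to the table at each snapshot, and it does not care how the bytes inside each file are encoded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    snapshot_id: int
    data_files: List[str]  # paths only; the file format is opaque to the table layer


@dataclass
class Table:
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit_append(self, new_files: List[str]) -> Snapshot:
        """Create a new snapshot listing the previous files plus the new ones."""
        current = self.snapshots[-1].data_files if self.snapshots else []
        snap = Snapshot(len(self.snapshots) + 1, current + list(new_files))
        self.snapshots.append(snap)
        return snap

    def files_at(self, snapshot_id: int) -> List[str]:
        """Return the data files that made up the table at a given snapshot."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


table = Table()
table.commit_append(["part-0.parquet"])  # hypothetical file names
table.commit_append(["part-1.vortex"])
print(table.files_at(1))  # ['part-0.parquet']
print(table.files_at(2))  # ['part-0.parquet', 'part-1.vortex']
```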
As others said, Vortex is complementary to the table formats you mentioned.
There are other formats though that it can be compared to.
The Lance columnar format is one: https://github.com/lancedb/lancedb
And Nimble from Meta is another: https://github.com/facebookincubator/nimble
Parquet is so core to data infra and so widespread that removing it from its throne is a really, really hard task.
The people behind these projects who are willing to try to do this have my total respect.