Optimize the Data Layout

(cedardb.com)

45 points | by napsterbr a year ago

7 comments

o11c a year ago

Where this gets interesting is when you have patterns of sparse data. Often a mix is best - AoS fails badly for partially-sparse rows, and SoA requires duplicating the key, but SoAoS only has to duplicate the key per struct, and SoAoSoA only has to duplicate it once per sequence of adjacent structs.

jakozaur a year ago

A similar explanation could be applied to columnar vs row databases.

makmanalp a year ago

If you like this, you might enjoy: "Column-Stores vs. Row-Stores: How Different Are They Really?"

> The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query.

> This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false.

https://faculty.cc.gatech.edu/~jarulraj/courses/4420-s19/pap...

bunderbunder a year ago

They're broadly the same thing as AoS vs SoA. Just relax the "array" bit to allow other kinds of core data structures such as B-trees.
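To make the analogy concrete, here is a minimal sketch (not from the article or the commenter) of the same records held as an array of structs and as a struct of arrays. The record fields `id`, `price`, and `qty` and all function names are made up for illustration.

```python
from array import array

# Hypothetical three-field records, stored row-wise (array of structs).
rows_aos = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 20.0, "qty": 1},
    {"id": 3, "price": 5.0, "qty": 7},
]

# The same data column-wise (struct of arrays): one contiguous array per field.
cols_soa = {
    "id": array("q", [r["id"] for r in rows_aos]),
    "price": array("d", [r["price"] for r in rows_aos]),
    "qty": array("q", [r["qty"] for r in rows_aos]),
}

def total_price_aos(rows):
    # Visits every record even though only one field is needed.
    return sum(r["price"] for r in rows)

def total_price_soa(cols):
    # Reads only the "price" array and never touches the other fields.
    return sum(cols["price"])

def fetch_row_soa(cols, i):
    # Rebuilding one full record needs one access per column.
    return {name: col[i] for name, col in cols.items()}
```

The single-column aggregate is where the columnar layout helps, while `fetch_row_soa` shows the cost it pays when a whole record is needed. A row store or column store makes the same trade, only with pages or B-tree nodes in place of plain arrays.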

durner a year ago

The idea of comparing column and row storage actually inspired this whole post.

tomnipotent a year ago

Your read performance test is biased towards a struct of arrays; an array of structs should outperform it for random, non-contiguous lookups. In the context of fixed-page databases, this is an important distinction, since row-based and hybrid storage (PAX) will need to read fewer pages than a pure columnar store.
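A rough way to see the page-count argument is the sketch below. It is a toy model, not a measurement: it assumes 4096-byte pages, fixed 8-byte values, rows that never straddle a page, and one separate file of pages per column in the columnar case. All names are hypothetical.

```python
PAGE_SIZE = 4096   # bytes per page (assumed)
VALUE_SIZE = 8     # bytes per fixed-width value (assumed)

def pages_row_store(row_ids, num_cols):
    # A page holds complete rows, so one page serves every column of a row.
    # A PAX page holds the same set of rows, just grouped by column inside
    # the page, so this count applies to PAX as well.
    rows_per_page = PAGE_SIZE // (VALUE_SIZE * num_cols)
    return {r // rows_per_page for r in row_ids}

def pages_column_store(row_ids, num_cols):
    # Each column lives in its own pages, so fetching a full row
    # touches one page per column.
    values_per_page = PAGE_SIZE // VALUE_SIZE
    return {(c, r // values_per_page) for r in row_ids for c in range(num_cols)}

lookups = [5, 70_000, 900_000]   # scattered row ids
print(len(pages_row_store(lookups, num_cols=8)))     # 3
print(len(pages_column_store(lookups, num_cols=8)))  # 24
```

Under these assumptions, fetching three scattered rows with eight columns touches 3 pages in the row or PAX layout and 24 in the pure columnar one. The picture reverses for a scan over a single column, where the columnar layout reads only that column's pages.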

durner a year ago

Sure, a filtering scan or an index lookup is better in chunked SoA (or PAX as we database people say) than without chunks due to the metadata filter options. We briefly talk about that in the out-of-memory optimization section. Most column-based formats/databases are actually inspired by the ideas of PAX; they often just use a somewhat coarser granularity (e.g. Parquet's row groups).
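As an illustration of what chunk metadata buys a filtering scan, here is a small sketch. It assumes the metadata is a per-chunk min/max pair, which is one common choice; the thread itself does not say which metadata is kept. `Chunk`, `build_chunks`, and `filter_greater_than` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    values: list      # one column's values for the rows in this chunk
    min_value: int    # assumed per-chunk metadata
    max_value: int

def build_chunks(column, chunk_size):
    # Split one column into fixed-size chunks and record min/max for each.
    chunks = []
    for start in range(0, len(column), chunk_size):
        part = column[start:start + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks

def filter_greater_than(chunks, threshold):
    # Returns the matching values and the number of chunks actually scanned.
    matches, scanned = [], 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # metadata proves nothing in this chunk can match
        scanned += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, scanned

chunks = build_chunks(list(range(100)), chunk_size=10)
matches, scanned = filter_greater_than(chunks, threshold=84)
print(len(matches), scanned)  # 15 2
```

Only two of the ten chunks are read here because the data is sorted; on unsorted data the min/max ranges overlap more and fewer chunks can be skipped.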