Postgres data stored in Parquet on S3: LTAP architecture explained

(databricks.com)

29 points | by andrenotgiant 3 days ago ago

10 comments

Here's what I don't understand:

Part of the value of doing an ETL pipeline via streaming replication is you get the full history of data in a table. An SCD type 2 table where each row also has a valid_from and valid_to timestamp column.

How would someone do the same thing with this architecture?

[-]

hasyimibhar an hour ago

It wouldn't be possible to do this with LTAP architecture since (I'm assuming) the individual logical changes are not visible. But honestly I've always seen SCD type 2 table as a workaround due to lack of data modeling experience in the source database. If you design your tables correctly, you shouldn't need SCD type 2 downstream.

For example, if you know your user can change emails, and there might be events from another source that is keyed by user email (e.g. marketing-related events), then naturally you will need some sort of email_history table that has historical mapping of user id to email (you probably need it for audit purposes too). Then in this case there is no need to build SCD type 2 table of user from CDC, it's already there.

khurs an hour ago

Both Iceberg and Delta Lake support 'time travel' so you can query data as it was at a certain date.

SELECT count * FROM my_table AS OF "2025-01-01"

https://delta.io/blog/2023-02-01-delta-lake-time-travel/

https://iceberg.apache.org/docs/latest/spark-queries/#spark-...

[-]

_zoltan_ an hour ago

up to a limited number of snapshots. it's not arbitrary times, it's when there was an atomic snapshot created in the catalog.

eveningtree 2 hours ago

Rather than answering directly, I'm thinking about this problem from the other end altogether ever since I saw the dbricks rt demo. Apologies for the rambling response, as I haven't yet finished thinking about this problem...

We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.

(1) Limited by computation - there's only so much data that we can store on disks and nvme

(2) Limited by wallet - disks and nvme are EXPENSIVE

Also, the tight coupling of compute and data didn't help. It limited the size of databases on the individual expensive compute nodes.

So, another question will be -

What's currently stopping me from keeping the scd history tables right in my oltp db? what's forcing me to copy state into my etl/elt pipeline and the process it into scd into a dedicated olap db?

To some extent,the answer is still the same - the oltp cannot scale for the storage size required for keeping historical data. So, I've had to take out the 'cold' historical data and keep it in my olap freezer.

Now, if oltp itself is scaling, I'm not gonna bother with the copying step. I'll just prefer to store the history in oltp itself.

In my perspective (majorly from handling IoT systems), I need olap for 2 reasons - (1) storage scalability, and (2) analytical processing speed

I now consider (1) to be a solved problem

As for (2), I'm still not sure how this architecture ends up matching the query processing speeds of column-oriented storages. But again, I need to study more.

The SCD pipeline still remains in some form. Either in the form of (1) scd rows that we currently keep (etl pipeline) , or (2) as older lsn rows that simply don't get deleted (existing db engine).

I've done quite a lot of experimentation with (2), and it is a pretty solid concept to work with.

I've spent quite a lot of years hammering my brain at databases and datastores in general. And I've now got a feeling that this is it. Finally.

PunchyHamster 2 hours ago

I don't wanna see that S3 bandwidth bill after running some big query

[-]

khurs 2 hours ago

There are self hosted object stores which use the same protocol as S3. One example: https://github.com/minio/minio

Parquet files are smaller than row based storage in a database (but not those databases with focus on strong compression).

And for backup - the files are probably easier to just copy to multiple disks for redundancy, as opposed to database dumps and incremental backups which at the Petabyte scale will be a pain.

[-]

_zoltan_ an hour ago

your reply makes zero sense.

there is a reason why people develop for S3: a lot of enterprise data is there. people ingest there from various sources. and it's not just parquet usually, it's multivendor sources writing to an iceberg catalog.

nobody will run minio on AWS other than hobby projects and small demos.

I regularly work with iceberg datasets in the double digit TB range per dataset. keep that in mind when you think about sizes. databricks, snowflake, large enterprise vendors: they are targeting these sizes.

[-]

khurs an hour ago

> nobody will run minio on AWS other than hobby projects and small demos.

You realise not every company uses AWS for any/all its needs?

There are datacenters around the world owned by individual companies or co-located. And many companies still have servers on prem.

Compute and disks are getting more dense & liquid cooled, so less rack space is needed for same power.

And Minio and others can handle Petabytes+

https://www.cisco.com/c/en/us/products/collateral/servers-un...

Backblaze, Cloudflare R2 and other cheaper S3 compatible competitors also exist.

tux3 28 minutes ago

The large enterprise vendors are not prise-sensitive. They're on AWS because you never get fired for picking AWS, and there isn't really any other choice for these vendors regardless of AWS ripping you off.

At this point S3 is a standard interface. All sorts of cloud providers and open-source projects provide S3. If you're on AWS, price isn't the reason. You pick AWS because you don't see your company taking a risk with anything else.

S3 doesn't mean expensive. AWS does. But AWS users are fully locked-in, they'll pay whatever the price is.