The number of things I've seen explode in cost when using Beam and its managed variants is insane.
The general technique of performing ETL by streaming saved data to a compute resource, writing and running a program in the company's lingua franca, and then loading it back is nearly always inefficient. This article underlines just how impactful minor issues with node sizing can be - and it's something to stick in your calendar to revisit every six months (and after data and program changes).
We don't get the context here, but it's generally more cost-efficient to keep the data in a real database (e.g. BigQuery) and operate on it using SQL as much as possible. You can perform in-database ETL by loading into different tables and operating there. For some tasks you will want to use UDFs, and in rarer instances you will need external ETL - but if those instances aren't first powered by a non-trivial internal query I would be very concerned!
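As a rough sketch of what I mean by in-database ETL (assuming BigQuery and Python with the google-cloud-bigquery client; the dataset, table names, and the query itself are made up for illustration):

    # Minimal sketch of in-database ETL on BigQuery: the transform runs as
    # SQL inside the warehouse and writes to another table, so no rows ever
    # leave the database. Dataset/table names and the query are made up.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    sql = """
    CREATE OR REPLACE TABLE analytics.events_clean AS
    SELECT
      user_id,
      TIMESTAMP_TRUNC(event_ts, DAY) AS event_day,
      COUNT(*) AS events
    FROM analytics.events_raw
    WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY user_id, event_day
    """

    job = client.query(sql)  # the work happens on BigQuery's side
    job.result()             # only job metadata comes back to the client
    print(f"Processed {job.total_bytes_processed} bytes in-database")

Only when a query like that genuinely can't express the work would I reach for an external ETL job.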
One of the main reasons teams don't store data in a database is that its structure is currently considered incompatible, or they see big challenges with partial and duplicate storage. Another reason is issues around data loading - ingest can be very expensive or very cheap depending on exactly how you do it!
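For BigQuery specifically, the cheap path is usually a batch load job from GCS rather than the streaming insert API, which is billed per GB. A hedged sketch of the batch route, with the bucket and table names invented:

    # Sketch of the cheap ingest path: a BigQuery batch load job from GCS.
    # Batch loads aren't billed for ingest (you pay for storage), whereas
    # the streaming insert API is billed per GB. Names are invented.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,                   # infer the schema from the files
        write_disposition="WRITE_APPEND",
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/exports/events-*.json",  # hypothetical export path
        "my_project.analytics.events_raw",       # hypothetical destination
        job_config=job_config,
    )
    load_job.result()  # batch load completes; no per-row streaming charges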
A final note: the article was written on the 20th of June, and it's been a while. It would be great to know the real impact rather than the estimate!
> Presented figures are only estimates based on a single run (with only 3% of input data) and extrapolated to the whole year with the assumption that processing the whole dataset will result in the same relative savings as processing 3% of source data.
Would efficiency improve if they were to use larger machines? Same CPU/memory ratio, but more CPUs and memory? Assuming they have more than ~20 VMs for this.
It would also be interesting to know if they could get away with a single very large machine instead, like h3-standard-88. From a cost perspective it doesn't seem too far off from their final solution, which is why I assume a single VM might be able to handle the load.