-- "find the maximum value of 'views',
-- where 'sales' is greater than its mean, per 'id'".
select max(views), id -- "find the maximum value of 'views',
from example_table as et
where exists
(
SELECT *
FROM
(
SELECT id, avg(sales) as mean_sales
FROM example_table
GROUP by id
) as f --
where et.sales > f.mean_sales -- where 'sales' is greater than its mean
and et.id = f.id
)
group by id; -- per 'id'".
I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, Pandas just seemed frustrating and incomprehensible. Polars is much more like Spark, and I am very happy about that.
DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.
I had a similar experience with Spark; especially in the Scala API, it felt very expressive and concise once I got used to certain idioms. Also +1 on DuckDB, which is excellent.
There are some frustrations in Spark, however. I remember getting stuck on winsorizing over groups. Hilariously, there are identical functions called `percentile_approx` and `approx_percentile`, and it wasn't clear from the docs whether they were the same or at least did the same thing.
Given all that, the ergonomics of Julia for general-purpose data handling are really unmatched, IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple of years later and could easily understand, and making updates with new, more type-generic functions is a breeze. Very enjoyable.
Spark docs are way too minimal for my taste, at least the API docs.
I don't know how well the Polars implementation works, but what I love about PySpark is that sometimes Spark is able to push those groupings down to the database. Not always, but sometimes. However, I imagine that many people love Polars/Pandas performance for transactional queries: get me a result, start to finish, in less than a second, as long as the number of underlying rows is no greater than 20k-ish. PySpark will never be super great for that.
The power of having an API that allows usage of the Free monad. And in less-funny-FP-speak, the power of letting the user write a program (expressions) that a sufficiently smart backend later compiles/interprets.
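A minimal sketch of that idea using Polars' lazy API (the file path and column names here are made up):

    import polars as pl

    # Build a program out of expressions; nothing executes yet.
    plan = (
        pl.scan_parquet("example.parquet")  # hypothetical input file
          .filter(pl.col("sales") > 0)
          .group_by("id")
          .agg(pl.col("views").max())
    )

    print(plan.explain())  # the backend's optimized query plan
    df = plan.collect()    # only now is anything actually run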
Awesome! Didn't expect such a vast difference in usability at first.
Pandas sat alone in the Python ecosphere for a long time. Lack of competition is generally not a good thing. I'm thrilled to have Polars around to innovate on the API end (and push Pandas to be better).
And I say this as someone who makes much of their living from Pandas.
I think the pandas team is well aware of the unfortunate legacy API decisions, even without Polars. They are trapped by backwards compatibility. Wes' "10 Things I Hate About pandas" post covers the highlights, most of which boil down to not having put a layer between NumPy and pandas, which is why they were stuck with the unfortunate integer-null situation.
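To illustrate the integer-null situation with a quick, self-contained example:

    import pandas as pd

    # NaN is a float, so a missing value silently upcasts an
    # integer column to float64.
    s = pd.Series([1, 2, None])
    print(s.dtype)  # float64, not int64

    # The newer nullable extension dtype avoids this, but is opt-in:
    s2 = pd.Series([1, 2, None], dtype="Int64")
    print(s2.dtype)  # Int64, supports pd.NA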
Which is all stuff they could fix, if they'd be willing to, with a major version bump. They'd need a killer feature to encourage that migration though.
The really brutal thing is all of the code using Pandas written by researchers and non-software-engineers, running quietly in lab environments: difficult-to-reproduce environments, small or non-existent test suites, code written by grad students long gone. If the Pandas interface breaks for installs done via `pip install pandas`, it will cause a lot of pain.
With that acknowledged, it would make life a lot easier for everyone if a "fix the API" Pandas 3 had a different package name. Polars and others seem like exactly that solution, even if not literally Pandas.
I've wanted to convert a massive Pandas codebase to Polars for a long time. Probably 90% of the compute time is Pandas operations, especially creating new columns / resizing dataframes (which I understand to involve less of a speed difference compared to the grouping operations mentioned in the post, but still substantial). Anyone had success doing this and found it to be worth the effort?
I converted to DuckDB and Polars. It’s worth it for the speed improvement.
However, there are subtle differences between Pandas and Polars behavior, so regression testing is your friend. It's not a 1:1 mapping.
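As a sketch of what that regression testing can look like (the pipeline functions and column names are invented for the example):

    import pandas as pd
    import polars as pl
    import pandas.testing as pdt

    # Old and new implementations of the same (toy) pipeline.
    def legacy_pandas_pipeline(df: pd.DataFrame) -> pd.DataFrame:
        return df.groupby("id", as_index=False)["views"].max()

    def new_polars_pipeline(df: pl.DataFrame) -> pl.DataFrame:
        return df.group_by("id").agg(pl.col("views").max()).sort("id")

    inp = pd.DataFrame({"id": [1, 1, 2], "views": [10, 30, 20]})
    expected = legacy_pandas_pipeline(inp).reset_index(drop=True)
    actual = new_polars_pipeline(pl.from_pandas(inp)).to_pandas()
    pdt.assert_frame_equal(expected, actual)  # fails on any behavior drift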
There have been so many subtle changes across pandas-to-pandas upgrades (groupby especially is somehow always hit) that regression tests are always needed...
Which things did you decide to move to DuckDB?
The difference is a sanely and presciently designed expression API, which is a bit more verbose in some common cases, but is more predictable and much more expressive in more complex situations like this.
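Presumably the expression in question looks something like this in Polars (a guess, with toy data, since the original frame isn't shown):

    import polars as pl

    df = pl.DataFrame({
        "id": [1, 1, 1, 2, 2],
        "sales": [10, 20, 30, 5, 15],
        "views": [100, 200, 300, 50, 150],
    })

    # "Find the maximum value of 'views', where 'sales' is greater
    # than its mean, per 'id'" as one composable expression:
    result = df.group_by("id").agg(
        pl.col("views").filter(pl.col("sales") > pl.col("sales").mean()).max()
    )
    print(result)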
On a tangent, I wonder what this op would look like in SQL? It would probably need support for filtering in a window function, which I'm not sure is standardized?
Without having checked, maybe something like:

    -- "find the maximum value of 'views',
    --  where 'sales' is greater than its mean, per 'id'"
    SELECT MAX(et.views), et.id
    FROM example_table AS et
    WHERE EXISTS (
        SELECT *
        FROM (
            SELECT id, AVG(sales) AS mean_sales
            FROM example_table
            GROUP BY id
        ) AS f
        WHERE et.sales > f.mean_sales  -- where 'sales' is greater than its mean
          AND et.id = f.id
    )
    GROUP BY et.id;                    -- per 'id'
In dplyr, there is an ‘old style’ method which works on an intermediate ‘grouped data frame’ (group_by() followed by summarise()) and a new style which doesn’t (per-operation grouping via the .by argument).
Props to Ritchie Vink for designing polars.
But also props to Wes McKinney for giving us a dataframe library during a time when we had none. Java still doesn’t have a decent dataframe library, so we mustn’t take these things for granted.
The Pandas API is no longer the way things should be done today, nor should it appear in new tutorials. Pandas was the jQuery of its time: great, but no longer the state of the art. But I have much gratitude for it being around when it was needed.
Here's an example implementation in MSSQL - https://data.stackexchange.com/stackoverflow/query/edit/1873...
No need to filter within the window function if you use a subquery or CTE, which is supported just about everywhere.
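For instance (untested, run here through DuckDB's Python API; `example_table` is the hypothetical table from upthread):

    import duckdb

    # Compute the per-id mean with a window function in a subquery,
    # then filter and aggregate outside it.
    duckdb.sql("""
        SELECT id, MAX(views) AS max_views
        FROM (
            SELECT id, views, sales,
                   AVG(sales) OVER (PARTITION BY id) AS mean_sales
            FROM example_table
        ) AS t
        WHERE sales > mean_sales
        GROUP BY id
    """).show()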
https://en.wikipedia.org/wiki/SQL?useskin=vector#Standardiza...
According to Wikipedia, windowing was standardized back in 2003 (SQL:2003).
I’ve moved mostly to Polars. I still have some frameworks that demand Pandas, and Pandas is still a very solid dataframe library, but when I need to interpolate months into millions of lines of quarterly data, Polars just blows it away.
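That kind of upsampling looks roughly like this in Polars (a sketch with made-up names, assuming a sorted date column):

    import polars as pl
    from datetime import date

    df = pl.DataFrame({
        "date": [date(2023, 1, 1), date(2023, 4, 1), date(2023, 7, 1)],
        "value": [10.0, 16.0, 13.0],
    }).set_sorted("date")

    # Insert the missing monthly rows, then linearly interpolate values.
    monthly = (
        df.upsample(time_column="date", every="1mo")
          .with_columns(pl.col("value").interpolate())
    )
    print(monthly)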
Even better is using tools like Narwhals and Ibis, which can convert back and forth between whatever frame types you want.