-- "find the maximum value of 'views',
-- where 'sales' is greater than its mean, per 'id'".
select max(views), id -- "find the maximum value of 'views',
from example_table as et
where exists
(
SELECT *
FROM
(
SELECT id, avg(sales) as mean_sales
FROM example_table
GROUP by id
) as f --
where et.sales > f.mean_sales -- where 'sales' is greater than its mean
and et.id = f.id
)
group by id; -- per 'id'".
I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, Pandas just seemed frustrating and incomprehensible. Polars is much more like Spark, and I am very happy about that.
DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.
I had a similar experience with Spark; especially in the Scala API, it felt very expressive and concise once I got used to certain idioms. Also +1 on DuckDB, which is excellent.
There are some frustrations in Spark, however. I remember getting stuck on winsorizing over groups. Hilariously, there are identical functions called `percentile_approx` and `approx_percentile`, and it wasn't clear from the docs whether they were the same or at least did the same thing.
Given all that, the ergonomics of Julia for general-purpose data handling are really unmatched, IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple of years later and could easily understand, and making updates with new, more type-generic functions is a breeze. Very enjoyable.
Spark docs are way too minimal for my taste, at least the API docs.
I don't know how well the Polars implementation works, but what I love about PySpark is that sometimes Spark is able to push those groupings down to the database. Not always, but sometimes. However, I imagine that many people love Polars/Pandas performance for transactional queries: get me a result, start to finish, in less than a second, as long as the number of underlying rows is no greater than 20k-ish. PySpark will never be super great for that.
The power of having an API that allows usage of the Free monad. And in less-funny-FP-speak, the power of letting the user write a program (expressions) that a sufficiently smart backend later compiles/interprets.
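A minimal sketch of that idea using Polars' lazy API (the file path and column names here are made up):

    import polars as pl

    # Build a program out of expressions; nothing executes yet.
    plan = (
        pl.scan_parquet("example.parquet")  # hypothetical input file
          .filter(pl.col("sales") > 0)
          .group_by("id")
          .agg(pl.col("views").max())
    )

    print(plan.explain())  # the backend's optimized query plan
    df = plan.collect()    # only now is anything actually run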
Awesome! Didn't expect such a vast difference in usability at first.
Pandas sat alone in the Python ecosphere for a long time. Lack of competition is generally not a good thing. I'm thrilled to have Polars around to innovate on the API end (and push Pandas to be better).
And I say this as someone who makes much of their living from Pandas.
I think the pandas team is well aware of the unfortunate legacy API decisions, even without Polars. They are trapped by backwards compatibility. Wes' "10 Things I Hate About pandas" post covers the highlights, most of which boil down to not having put a layer between NumPy and pandas, which is why they were stuck with the unfortunate integer-null situation.
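To illustrate the integer-null situation with a quick, self-contained example:

    import pandas as pd

    # NaN is a float, so a missing value silently upcasts an
    # integer column to float64.
    s = pd.Series([1, 2, None])
    print(s.dtype)  # float64, not int64

    # The newer nullable extension dtype avoids this, but is opt-in:
    s2 = pd.Series([1, 2, None], dtype="Int64")
    print(s2.dtype)  # Int64, supports pd.NA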
Which is all stuff they could fix, if they'd be willing to, with a major version bump. They'd need a killer feature to encourage that migration though.
The really brutal thing is all of the code using Pandas written by researchers and non-software-engineers, running quietly in lab environments: difficult-to-reproduce environments, small or non-existent test suites, code written by grad students long gone. If the Pandas interface breaks for installs done via `pip install pandas`, it will cause a lot of pain.
With that acknowledged, it would make life a lot easier for everyone if a "fix the API" Pandas 3 had a different package name. Polars and others seem like exactly that solution, even if not literally Pandas.
I've wanted to convert a massive Pandas codebase to Polars for a long time. Probably 90% of the compute time is Pandas operations, especially creating new columns / resizing dataframes (which I understand to involve less of a speed difference compared to the grouping operations mentioned in the post, but still substantial). Anyone had success doing this and found it to be worth the effort?
I converted to DuckDB and Polars. It’s worth it for the speed improvement.
However, there are subtle differences between Pandas and Polars behavior, so regression testing is your friend. It's not a 1:1 mapping.
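As a sketch of what that regression testing can look like (the pipeline functions and column names are invented for the example):

    import pandas as pd
    import polars as pl
    import pandas.testing as pdt

    # Old and new implementations of the same (toy) pipeline.
    def legacy_pandas_pipeline(df: pd.DataFrame) -> pd.DataFrame:
        return df.groupby("id", as_index=False)["views"].max()

    def new_polars_pipeline(df: pl.DataFrame) -> pl.DataFrame:
        return df.group_by("id").agg(pl.col("views").max()).sort("id")

    inp = pd.DataFrame({"id": [1, 1, 2], "views": [10, 30, 20]})
    expected = legacy_pandas_pipeline(inp).reset_index(drop=True)
    actual = new_polars_pipeline(pl.from_pandas(inp)).to_pandas()
    pdt.assert_frame_equal(expected, actual)  # fails on any behavior drift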
There have been so many subtle changes across pandas-to-pandas upgrades (groupby especially is somehow always hit) that regression tests are always needed...
Which things did you decide to move to DuckDB?
The difference is a sanely and presciently designed expression API, which is a bit more verbose in some common cases, but is more predictable and much more expressive in more complex situations like this.
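Presumably the expression in question looks something like this in Polars (a guess, with toy data, since the original frame isn't shown):

    import polars as pl

    df = pl.DataFrame({
        "id": [1, 1, 1, 2, 2],
        "sales": [10, 20, 30, 5, 15],
        "views": [100, 200, 300, 50, 150],
    })

    # "Find the maximum value of 'views', where 'sales' is greater
    # than its mean, per 'id'" as one composable expression:
    result = df.group_by("id").agg(
        pl.col("views").filter(pl.col("sales") > pl.col("sales").mean()).max()
    )
    print(result)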
On a tangent, I wonder what this op would look like in SQL? It would probably need support for filtering in a window function, which I'm not sure is standardized?
Without having checked, maybe something like:

    -- "find the maximum value of 'views',
    --  where 'sales' is greater than its mean, per 'id'"
    SELECT MAX(et.views), et.id
    FROM example_table AS et
    WHERE EXISTS (
        SELECT *
        FROM (
            SELECT id, AVG(sales) AS mean_sales
            FROM example_table
            GROUP BY id
        ) AS f
        WHERE et.sales > f.mean_sales  -- where 'sales' is greater than its mean
          AND et.id = f.id
    )
    GROUP BY et.id;                    -- per 'id'
In dplyr, there is an ‘old style’ method which works on an intermediate ‘grouped data frame’ (group_by() followed by summarise()) and a new style which doesn’t (per-operation grouping via the .by argument).
Props to Ritchie Vink for designing polars.
But also props to Wes McKinney for giving us a dataframe library during a time when we had none. Java still doesn’t have a decent dataframe library, so we mustn’t take these things for granted.
The Pandas API is no longer the way things should be done today, nor should it appear in new tutorials. Pandas was the jQuery of its time: great, but no longer the state of the art. But I have much gratitude for it being around when it was needed.
Here's an example implementation in MSSQL - https://data.stackexchange.com/stackoverflow/query/edit/1873...
No need to filter within the window function if you use a subquery or CTE, which is supported just about everywhere.
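For instance (untested, run here through DuckDB's Python API; `example_table` is the hypothetical table from upthread):

    import duckdb

    # Compute the per-id mean with a window function in a subquery,
    # then filter and aggregate outside it.
    duckdb.sql("""
        SELECT id, MAX(views) AS max_views
        FROM (
            SELECT id, views, sales,
                   AVG(sales) OVER (PARTITION BY id) AS mean_sales
            FROM example_table
        ) AS t
        WHERE sales > mean_sales
        GROUP BY id
    """).show()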
https://en.wikipedia.org/wiki/SQL?useskin=vector#Standardiza...
According to Wikipedia, windowing was standardized back in 2003 (SQL:2003).
I’ve moved mostly to Polars. I still have some frameworks that demand Pandas, and Pandas is still a very solid dataframe library, but when I need to interpolate months into millions of lines of quarterly data, Polars just blows it away.
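That kind of upsampling looks roughly like this in Polars (a sketch with made-up names, assuming a sorted date column):

    import polars as pl
    from datetime import date

    df = pl.DataFrame({
        "date": [date(2023, 1, 1), date(2023, 4, 1), date(2023, 7, 1)],
        "value": [10.0, 16.0, 13.0],
    }).set_sorted("date")

    # Insert the missing monthly rows, then linearly interpolate values.
    monthly = (
        df.upsample(time_column="date", every="1mo")
          .with_columns(pl.col("value").interpolate())
    )
    print(monthly)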
Even better is using tools like Narwhals and Ibis, which can convert back and forth between whatever frame types you want.