19 comments

  • davidinosauro 3 days ago

    The NEP (NumPy Enhancement Proposal) link for those more curious about the details than the story: https://numpy.org/neps/nep-0055-string_dtype.html

    • davidinosauro 2 days ago

      In particular from the article I was confused by this:

      > NumPy doesn’t offer a way to store data outside of the array buffer—there’s no concept of “sidecar storage” in NumPy.

      But then it goes on to say that the strings are stored on the heap (which is clearly also possible with dtype=object) with an arena allocator. Reading the NEP now.

      • ngoldbaum 2 days ago

        Well, there was no concept of sidecar storage. Now we have the hack we came up with for StringDType: store the data on the DType instance, and make sure StringDType arrays don't share StringDType instances unless the array is a view.

        EDIT: looking back at the NEP, I'm not sure it does a great job explaining exactly how the per-array descriptor works. Ultimately it's powered by a hook in the DType API: https://github.com/numpy/numpy/pull/24988. There is only one spot in NumPy where array buffers are allocated, so we hooked there and made sure any arrays with newly allocated buffers get a new DType instance.
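
        Very roughly, the idea looks something like this Python sketch (illustrative only: the names here are made up, and the real implementation is in C with a lot more machinery than this):

            class StringArena:
                """Heap storage owned by a per-array dtype instance (the sidecar)."""
                def __init__(self):
                    self.buf = bytearray()

                def store(self, s):
                    data = s.encode("utf-8")
                    offset = len(self.buf)
                    self.buf += data
                    return offset, len(data)  # fixed-size entry kept in the array buffer

                def load(self, offset, size):
                    return self.buf[offset:offset + size].decode("utf-8")

            class FakeStringDTypeInstance:
                """One instance per owned array buffer; views share the owner's
                instance, so the arena lives exactly as long as the buffer."""
                def __init__(self):
                    self.arena = StringArena()

            dtype_instance = FakeStringDTypeInstance()
            array_buffer = [dtype_instance.arena.store(s) for s in ["hello", "world"]]
            print([dtype_instance.arena.load(off, size) for off, size in array_buffer])
            # ['hello', 'world']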

  • karteum 2 days ago

    There are aspects I didn't fully understand, maybe someone can explain. On one hand I understand that object arrays were very slow and raised issues, that PyArrow strings solved the issue (how?) at the expense of a huge dependency, and that the approach here also solves the issue in a lighter way. But it seems to be somewhat the same way as object arrays: "So, the actual array buffer doesn’t contain any string data—it contains pointers to the actual string data. This is similar to how object arrays work, except instead of pointers to Python objects, they’re pointers to strings" => OK so... if they are stored in a similar way, how is that more performant or better than regular object arrays?

    • ngoldbaum 2 days ago

      They’re stored on the DType instance. This requires that there’s only one DType instance per “owned” array buffer, which I figured out how to do along with Sebastian Berg and others using the new DType system.

    • thelastbender12 2 days ago

      PyArrow string arrays store the entries (string values) contiguously in memory, so access is quicker, while object arrays have pointers to scattered memory locations in the heap.

      I agree, I couldn't really figure out how the new NumPy string dtype makes this work, though.
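
      The user-facing difference is easy to see, though. A quick sketch, assuming NumPy >= 2.0 (where np.dtypes.StringDType and the np.strings ufuncs live):

          import numpy as np

          words = ["alpha", "beta", "gamma", "delta"] * 1000

          # Old approach: the buffer holds PyObject* pointers to separate
          # Python str objects scattered across the heap.
          obj_arr = np.array(words, dtype=object)

          # New approach: the buffer holds small fixed-size entries pointing
          # into heap storage owned by this array's StringDType instance.
          str_arr = np.array(words, dtype=np.dtypes.StringDType())

          print(obj_arr.dtype, str_arr.dtype)  # object StringDType()

          # String operations on the new dtype run as ufuncs in C, without
          # bouncing through Python objects:
          print(np.strings.str_len(str_arr).sum())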

  • jofer 3 days ago

    First off, major kudos and this is very very cool work. It's also a great article and really, everyone should go read it.

    Beyond "just" better string arrays, my favorite side effect of this is efficient NaN support in string arrays. The article talks about this a lot, but I had already started this comment before fully reading the article :p

    I mean, sure, the old approach was object arrays, and you can do it there because each element is an independent object, but they're super inefficient. This both makes things efficient _and_ has a really cool side effect of supporting something that had become common partly as an accident of the old object array approach - NaNs in arrays of strings.
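
    For example, assuming NumPy >= 2.0 and the na_object hook described in NEP 55 (a quick sketch, not an exhaustive demo):

        import numpy as np

        # Configure the dtype with a NaN-like missing-data sentinel.
        dt = np.dtypes.StringDType(na_object=np.nan)
        arr = np.array(["spam", np.nan, "eggs"], dtype=dt)

        print(np.isnan(arr))          # [False  True False]
        print(np.strings.upper(arr))  # the missing entry propagates through string ops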

    This is really really really useful work and it's _super_ cool!!

    • ngoldbaum 3 days ago

      Thank you, this really means a lot.

  • chrisjj 2 days ago

    > The idea we came up with was to use an array of pointers.

    > So, the actual array buffer doesn’t contain any string data—it contains pointers to the actual string data.

    "The idea /we/ came up with"?? :)

  • kristianp 2 days ago

    What's the compatibility story with the new type? Sounds like it is quite compatible with the object type for strings, i.e. not breaking existing code.

  • hnman10 2 days ago

    Would people not use Arrow for this now? Arrow has all these types and has also replaced NumPy in Pandas.

    • ngoldbaum 2 days ago

      This was a case of convergent evolution; both projects ended up working on similar ideas simultaneously.

      One issue with using Arrow directly in NumPy is PyArrow exposes an immutable 1D array, while NumPy exposes a mutable ND array.

      See also https://numpy.org/neps/nep-0055-string_dtype.html#related-wo...
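
      A small illustration of that difference, assuming NumPy >= 2.0:

          import numpy as np

          # A StringDType array is a regular NumPy array: N-dimensional and mutable.
          arr = np.array(["a", "bb", "ccc", "dddd"],
                         dtype=np.dtypes.StringDType()).reshape(2, 2)
          arr[0, 1] = "a replacement longer than the original"  # in-place write is fine
          print(arr)

          # An Arrow string array, by contrast, is 1D and immutable.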

      • hopfenspergerj 2 days ago

        Are the pandas people considering this as the default string type? Seems like it would be a slam dunk.

        • ngoldbaum 2 days ago

          That is something I’d like to see but I don’t want to wade into the already very complicated discussion around arrow strings in pandas. If a Pandas developer wanted to take this on I think that would make things easier since there’s so much complexity around strings in Pandas.

          That said there is a branch that gets most of the way there: https://github.com/pandas-dev/pandas/pull/58578. The remaining challenges are mostly around getting consensus around how to introduce this change.

          If NumPy had StringDType in 2019 instead of 2024 I think Pandas might have had an easier time. Sadly the timing didn’t quite work out.

  • wkjrahg 3 days ago

    [flagged]

    • gbbcey 3 days ago

      Please don't hijack a discussion with irrelevant flamebait. This thread isn't an invitation for you to prosecute your political agenda.

      This is not a discussion about Tim Peters, and your own link proves that this topic has already been aired in this community.

      I want to hear more about this project. I don't want to see it hijacked to rehash an irrelevant contentious issue.