> Trusted Publishing
Why do people come up with such unbelievably complex solutions that don’t actually achieve what a simple solution could do?
Trusted Publishing approximately involves a service like GitHub proving to somebody that some release artifact came from a GitHub Actions workflow file with a particular name, possibly in a particular commit. Never mind that GitHub Actions is an unbelievable security nightmare and that it’s probably not particularly hard for a malicious holder of GitHub credentials to stealthily or even completely silently compromise their own Actions workflow to produce malicious output.
But even ignoring that, it’s wildly unclear what is “trusted”. PyPI encourages developers to also use “attestations”. Read this and try to tell me what is being attested to:
https://docs.pypi.org/attestations/producing-attestations/
But I did learn that this is based on Sigstore. Sigstore is very impressive: it’s a system by which GitHub can attest via OIDC to various state, and a service called Fulcio (which we’re supposed to trust) uses its secret key to sign a message stating that GitHub did so at a certain time. (The OIDC transcript itself is not a durable attestation.) There’s even a transparency log (which is a separate system called Rekor maintained by the same organization). Except that, for some reason, Fulcio doesn’t do that at all. Instead it issues an X.509 certificate with an expiration in the near future where the certificate fields encode whatever GitHub attested to in its OIDC exchange, and the Sigstore client (which is hopefully a bit trustworthy) is supposed to use the private key (which it knows, in the clear, but is supposed to immediately forget) to sign a message that is associated with the release artifact or whatever else is being attested to. And then a separate transparency log records the signature and supposedly timestamps it so everyone can verify the attestation later even though the certificate is expired! Why not just sign the message on the Fulcio server (which has an HSM, hopefully) directly?
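To spell out the flow as I understand it, here’s a minimal sketch. The crypto primitives are real (Python’s `cryptography` package); the Fulcio and Rekor exchanges are only indicated in comments, since I’m going from the docs rather than the real client:

```python
# Rough sketch of the "keyless" signing flow described above.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def keyless_sign(artifact: bytes, oidc_token: str):
    # 1. The client generates a short-lived key pair, in memory only.
    ephemeral_key = ec.generate_private_key(ec.SECP256R1())
    public_key = ephemeral_key.public_key()

    # 2. (Not shown) The client sends oidc_token plus public_key to Fulcio,
    #    which returns a short-lived X.509 cert whose fields encode whatever
    #    the OIDC provider (e.g. GitHub) attested to.

    # 3. The client signs the artifact with the ephemeral private key...
    signature = ephemeral_key.sign(artifact, ec.ECDSA(hashes.SHA256()))

    # 4. (Not shown) ...uploads cert + signature to the Rekor transparency
    #    log, whose timestamped entry is what lets verification outlive the
    #    cert's expiry, and then discards the private key.
    return public_key, signature
```

Verification later checks the signature against the certificate and uses the transparency-log timestamp to establish that the signing happened while the certificate was still valid.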
All of this is trying to cryptographically tie a package on PyPI.org to a git tag. But: why not just do it directly? For most pure Python packages, which is a whole lot of packages, the distribution artifact is literally a zip file containing files from git, verbatim, plus some metadata. PyPI could check the GitHub immutable tag, read the commit hash, and verify the whole chain of hashes from the files to the tree to the commit. Or PyPI could even run the build process itself in a sandbox. (If people care about .pyc files, PyPI could regenerate them (again, in a sandbox), but omitting them might make sense too — after all, uv doesn’t even build them by default.) This would give much stronger security properties with a much more comprehensible system and no dependence on the rather awful security properties of GitHub Actions.
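One link in that chain is trivially checkable, for what it’s worth; here’s a sketch of the blob-level step (git object IDs really are just a hash over a typed header plus the content):

```python
# Check that a file shipped in an sdist is byte-for-byte the blob a git tree
# points at. Trees hash their entries, commits hash their tree, and a tag
# names a commit, so the same idea chains all the way up.
import hashlib

def git_blob_id(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

def sdist_file_matches_tree(file_bytes: bytes, blob_id_from_tree: str) -> bool:
    return git_blob_id(file_bytes) == blob_id_from_tree

# e.g. git_blob_id(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```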
One of the big companies making billions on Python software should step up and fund the infrastructure needed to enable PyPI package search via the CLI, like you could with `pip search` in the past.
Serious question: how important is `pip search` to your workflows? I don’t think I ever used it, back when PyPI still had an XMLRPC search endpoint.
(I think the biggest blocker on CLI search isn’t infrastructure, but that there’s no clear agreement on the value of CLI search without a clear scope of what that search would do. Just listing matches over the package names would be less useful than structured metadata search for example, but the latter makes a lot of assumptions about the availability of structured metadata!)
I upvoted you because I broadly agree with you, but search is never coming back in the API. They previously outlined the cost involved, and given how minimal the value it provides more broadly, there's no way it's coming back any time soon. It's basically an abuse vector because of the compute cost.
Funding could help, but it still requires PyPI/Warehouse to ship and operate a new public search interface that is safe at internet scale.
They operate a public package hosting interface; how is a search one any harder?
PyPI responses have a cache hit rate of 99% or higher, which means far less infrastructure to run.
Search is an unbounded problem and does not lend itself to caching very well, since every query can contain anything.
PyPI has fewer than one million projects. The searchable content for each package is what? 300 bytes? That's a 200 MB index. You don't even need fancy full-text search; you could literally split the query by word and do a grep over a text file (see the sketch below). No need for Elasticsearch or anything fancy.
And anyway, hit rates are going to be pretty good. You're not taking arbitrary queries, the domain is pretty narrow. Half the queries are going to be for requests, pytorch, numpy, httpx, and the other usual suspects.
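To make the "grep over a text file" idea concrete, a toy sketch (the flat name-plus-summary index format here is made up):

```python
# Toy search: split the query on whitespace and scan a flat index file whose
# (hypothetical) format is one "<name> <summary>" line per project.
def search(index_path: str, query: str) -> list[str]:
    terms = query.lower().split()
    hits = []
    with open(index_path, encoding="utf-8") as index:
        for line in index:
            haystack = line.lower()
            if all(term in haystack for term in terms):
                hits.append(line.split(" ", 1)[0])  # the project name
    return hits

# search("pypi-index.txt", "http client") -> ["httpx", "requests", ...]
```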
The searchable content for a distribution on PyPI is unbounded in the general case, assuming the goal is to allow search over READMEs, distribution metadata, etc.
(Which isn’t to say I disagree with you about scale not being the main issue, just to offer some nuance. Another piece of nuance is the fact that distributions are the source of metadata but users think in terms of projects/releases.)
I wonder how a PyPI search index could be statically served and evaluated locally by `pip search`?
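For the names-only case you can already get pretty close with the PEP 691 JSON index (rough sketch; the full listing is a sizable download, so a real client would cache it locally):

```python
# Fetch the static project listing once and match names client-side.
import json
import urllib.request

def fetch_project_names() -> list[str]:
    req = urllib.request.Request(
        "https://pypi.org/simple/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [p["name"] for p in json.load(resp)["projects"]]

def local_search(names: list[str], query: str) -> list[str]:
    q = query.lower()
    return [n for n in names if q in n.lower()]
```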
PyPI servers would have to be constantly rebuilding a central index and making it available for download. Seems inefficient
If you really need it, they publish a dump regularly and you can query that.
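If that means the public BigQuery dataset, something like this works (table and column names are from memory, so treat them as approximate and double-check the dataset):

```python
# Query PyPI's public BigQuery dataset for packages whose summary mentions
# "http client". Requires: pip install google-cloud-bigquery (and GCP auth).
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT name, ANY_VALUE(summary) AS summary
    FROM `bigquery-public-data.pypi.distribution_metadata`
    WHERE LOWER(summary) LIKE '%http client%'
    GROUP BY name
    LIMIT 20
"""
for row in client.query(sql).result():
    print(row["name"], "-", row["summary"])
```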
For simple use cases, you have the web search, and you can curl it.
PyPI has a search interface on its public website, though?
They probably don't need it. You can start a crowdfunding campaign if you do.
Great work Dustin and team!
> 1.92 exabytes of total data transferred
That's something like triple the amount from 2023, yes?
Great work!
Side issue: anyone else seeing that none of the links in the article work? They're all 404s.
Whoops, sorry about that. Should be fixed now. Happy New Year!
Are the compute and network required to serve PyPI funded entirely by donations, or do they have a business arm that generates income?
This seems to suggest that once the bubble pops, it will take Python down with it. The next AI winter will definitely replace Lisp with Python.
Replace Lisp with Python?
Edit: my bad, it seems you meant the opposite. Absolute fantasy, but a man can certainly dream lol
Appropriate username!