https://slsa.dev/ gives much clearer explanations about the why of this work. GitHub recently started offering a SaaS Sigstore implementation, including support for private repos. https://docs.github.com/en/actions/security-for-github-actio... Anyone working on OT should be moving towards this quickly.
I suggest reading this detailed article to understand why they built this: https://blog.trailofbits.com/2024/11/14/attestations-a-new-g...
The implementation is interesting - it's a static page built using GitHub Actions, and the key part of the implementation is this Python function here: https://github.com/trailofbits/are-we-pep740-yet/blob/a87a88...
If you read the code you can see that it's hitting pages like https://pypi.org/simple/pydantic/ - which return HTML - but sending this header instead:
Then scanning through the resulting JSON looking for files whose provenance isn't set to null. Here's an equivalent curl + jq incantation:
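A sketch of what that incantation might look like, assuming the header in question is PyPI's JSON Simple API content-negotiation header (`application/vnd.pypi.simple.v1+json`) and that, per the code above, PEP 740 provenance shows up as a non-null `provenance` field on each file entry; pydantic is just an example project:

```shell
# jq filter: keep only filenames whose provenance field is non-null
filter='[.files[] | select(.provenance != null) | .filename]'

# Ask PyPI's Simple index for JSON instead of HTML, then apply the filter
# (the `|| true` just tolerates offline runs)
curl -s -H 'Accept: application/vnd.pypi.simple.v1+json' \
     'https://pypi.org/simple/pydantic/' | jq "$filter" || true
```

If the project has published attestations, this prints the list of attested filenames; for a project with no attestations it prints an empty array.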
Why invest so much time and money in a feature that prevents such a small percentage of data breaches that it's not even categorized on the 2024 Verizon Data Breach Investigations Report?
The vast majority of breaches are caused by credential theft, phishing, and exploiting vulnerabilities.
It doesn't matter that you can cryptographically verify that a package came from a given commit if that commit has accidentally-vulnerable code, or someone just gets phished.
Probably because they got a government contract under which they receive funding for 3-5 FTEs over 24-36 months in return for quarterly reports - and a tool like this makes the DARPA PM happy. They're one of those "Cyber Defense Contractors".
The 2024 DBIR, for whatever it's worth, repeatedly mentions software supply chain attacks.
Because Verizon's report, while a good read, isn't the end-all-be-all of threat intelligence.
https://www.wired.com/story/notpetya-cyberattack-ukraine-rus...
https://krebsonsecurity.com/2020/12/u-s-treasury-commerce-de...
Software supply chain attacks are rare, but when they happen, they're usually high-impact ordeals.
Correct, software supply chain attacks may be rarer than a simple SQL injection attack, but they are much harder to detect and defend against, and can have very high impact.
Where I work (highly regulated financial utility), we would absolutely love to have signed attestations for the provenance of python packages.
I also have personal experience of NotPetya; I produced the forensic analysis of how it compromised the whole of Maersk. Took less than 30 minutes to take out their entire global IT infrastructure, with a cost of about $400 million.
They already went through requiring 2FA for the most popular packages: https://blog.pypi.org/posts/2023-05-25-securing-pypi-with-2f...
This is just another step in increasing security. And of course that is something you'd preferably do before breaches happen, not only as a reaction.
Because publishing for attestations mostly goes through GitHub Actions, the attack vector becomes getting access to GitHub, which might be easier at this point.
According to this page, urllib3 does not use trusted publishing. According to https://docs.pypi.org/project_metadata/#verified-details , trusted publishing and self-links are the only ways to have "verified details". However https://pypi.org/project/urllib3/ shows Changelog/Code/Issue tracker as "Verified details" even though they are not self-links. How come?
urllib3 does not have a recent release; that could explain https://trailofbits.github.io/are-we-pep740-yet/ lagging behind.
People don’t have to use GitHub, and certainly don’t have to use GitHub Actions even if they do.
Could someone explain why this is important? My uninformed feeling towards PEP 740 is 'who cares?'.
I believe this is a system where a human/system builds a package, uploads it, and cryptographically signs it, verifying end to end that the code uploaded to GitHub for widget-package 3.2.1 is the code you're downloading to your laptop for widget-package 3.2.1, with no chance it was modified/signed by an adversarial third party.
That's my understanding also, but I still feel like 'who cares' about that attack scenario. Am I just insufficiently paranoid? Is this kind of attack really likely? (How is it done, other than evil people at pypi?)
Yes, it is likely. It is done by evil intermediaries on the hosts used to create and upload the package. It is possible, for example, if the package is built and uploaded from a compromised developer laptop.
---
From the docs:
> PyPI's support for digital attestations defines a strong and verifiable association between a file on PyPI and the source repository, workflow, and even the commit hash that produced and uploaded the file.
It still doesn’t protect against rogue commits to packages by bad actors. Which, IMO, is the larger threat (and one that’s been actively exploited). So while a step in the right direction, it certainly doesn’t completely solve the supply chain risk.
It’s honestly a bit nuts that in 2024 a system as foundational as PyPI just accepts totally arbitrary, locally built archives for its “packages”.
I appreciate that it’s a community effort and compute isn’t free, but Launchpad did this correctly from the very beginning — dput your signed dsc and it will build and sign binary debs for you.
Build farms are expensive though. The PSF probably doesn’t have this kind of money and if they do, there’s a whole lot of other Python issues to fix.
https://xkcd.com/2347/ all over again.
Could you explain why you think it is a likely risk? Has this attack happened frequently?
Yes. SolarWinds.
Code checked out from the repository was not the same as the code used to build the actual release. These are not high-likelihood incidents, but when they do occur, they tend to have high impact.
And more recently, semantically similar code-vs-build mismatch: CrowdStrike. The broken release was modified after the actual build step, as per their own build process, and the modified artifact is what was released.
It is likely in the same way that aliens are statistically likely. There is no evidence so far afaik, but you will likely never find out when it happens, and it can compromise the whole world if it were to happen to a widely used package. It is not worth the risk to not have the feature. I even think it should ideally eventually become mandatory.
I'd encourage you to read the Verizon DBIR before making statements about whether a given attack is likely or not. Hijacking build systems is not likely: https://www.verizon.com/business/resources/reports/dbir/
A directed attack against the homegrown build system of a small company is unlikely. An attack against a high profile, centralized system or a commonly used package in such system is something to be prepared against.
It’s like HTTPS vs HTTP for your packages. It’s fine if you don’t care, but having more secure standards helps us all, and hopefully it doesn’t add too much of a headache for providers while staying mostly invisible to end users.
You are correct. Start distributing and requiring hashes with your Python dependencies instead.
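In pip terms, "distributing and requiring hashes" means `--hash=sha256:...` lines in requirements.txt plus `pip install --require-hashes` (both real pip features). The sketch below just demonstrates the underlying check by hand, with a throwaway file standing in for a downloaded wheel:

```shell
# Record the artifact's sha256 at vetting time (this is what goes into
# requirements.txt as --hash=sha256:...)
printf 'wheel contents as vetted' > pkg.whl
pinned=$(sha256sum pkg.whl | awk '{print $1}')

# Later, before installing the (re)downloaded artifact, recompute and compare;
# pip refuses to install on mismatch
actual=$(sha256sum pkg.whl | awk '{print $1}')
if [ "$actual" = "$pinned" ]; then
  echo "hash ok: installing"
else
  echo "hash mismatch: refusing to install"
fi
```

Tools like `pip-compile --generate-hashes` (from pip-tools) can produce the fully hash-pinned requirements file for you.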
This thing is a non-solution to a non-problem. (The adversaries won't be MiTM'ing Github or Pypi.)
The actual problem is developers installing "foopackage" by referencing the latest version instead of vetting their software supply chain. Fixing this implies a culture change where people stop using version numbers for software at all.
Nothing stopping version numbers from being 512 bits long. It’s nice if they can be obviously correlated with time of release and their predecessor, which hashes alone can’t do.
1. Why not compile it? 2. Does pip install x not guarantee that?
Yeah, because the problem with Python packaging is a lack of cryptographic signatures.
"Rolls eyes" would be an understatement.
https://en.m.wikipedia.org/wiki/XZ_Utils_backdoor
But that involved one of the developers of said package committing malicious code and it being accepted and then deployed. How would this prevent that from happening?
I thought this was about ensuring the code that developers pushed is what you end up downloading.
No, part of the malicious code is in a test data file, and the modified m4 file is not in the git repo. The package signed and published by Jia Tan is not reproducible from the source, and it was intentionally done that way.
You might want to revisit the script of xz backdoor.
An absolutely irrelevant detail here. While there was an additional flourish of obfuscation of questionable prudence, the attack was not at all dependent on that. It’s a library that justifies all kinds of seemingly innocuous test data, and there were plenty of creative ways to smuggle selective backdoors into the build without resorting to a compromised tarball. The main backdoor mechanism resided in test data in the git repo; the entire compromise could have.
>Using a Trusted Publisher is the easiest way to enable attestations, since they come baked in! See the PyPI user docs and official PyPA publishing action to get started.
For many smaller packages in this top 360 list, I could imagine this representing quite a learning curve.
Or it could see Microsoft tightening its proprietary grip over free software by not only generously offering gratis hosting, but now also it's a Trusted Publisher and you're not - why read those tricky docs? Move all your hosting to Microsoft today, make yourself completely dependent on it, and you'll be rewarded with a green tick!
I think it's a little rude to imply that the people who worked on this are serving an ulterior motive.
It's possible they're just naive.
Microsoft for sure has an ulterior motive here, and the PyPI devs are serving it. It's not a bad thing, it's a win-win for both parties. That kind of carrot is how you get buy-in from huge companies and in return they do free labor for you that secures your software supply chain.
Thankfully, the PyPI side of the hosting is done by a smaller, unrelated company (Fastly).
I think it's pretty hard to get a Python package into the top 360 list while not picking up any maintainers who could climb that learning curve pretty quickly. I wrote my own notes on how to use Trusted Publishers here: https://til.simonwillison.net/pypi/pypi-releases-from-github
The bigger problem is for projects that aren't hosting on GitHub and using GitHub Actions - I'm sure there are quite a few of those in the top 360.
I expect that implementing attestations without using the PyPA GitHub Actions script has a much steeper learning curve, at least for the moment.
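For comparison, the path with the PyPA action is roughly the following (a sketch only: the environment name and action version are illustrative, and attestation behavior can vary by action release, though recent versions generate and upload attestations by default):

```yaml
# .github/workflows/publish.yml -- sketch of a Trusted Publishing release job
name: Publish to PyPI
on:
  release:
    types: [published]
jobs:
  pypi-publish:
    runs-on: ubuntu-latest
    environment: pypi          # must match the Trusted Publisher config on PyPI
    permissions:
      id-token: write          # required for OIDC-based Trusted Publishing
    steps:
      - uses: actions/checkout@v4
      - run: pipx run build    # build sdist + wheel into dist/
      - uses: pypa/gh-action-pypi-publish@release/v1
        # publishes dist/* ; attestations come baked in on recent versions
```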
I suspect that most of the packages in the top 360 list are already hosted on GitHub, so this shouldn’t be a leap for many of them. This is one of the reasons we saw Trusted Publishing adopted relatively quickly: it required less work and was trivial to adopt within existing CI workflows.