https://slsa.dev/ gives much clearer explanations about the why of this work. GitHub recently started offering a SaaS Sigstore implementation, including support for private repos. https://docs.github.com/en/actions/security-for-github-actio... Anyone working on OT should be moving towards this quickly.
I suggest reading this detailed article to understand why they built this: https://blog.trailofbits.com/2024/11/14/attestations-a-new-g...
The implementation is interesting - it's a static page built using GitHub Actions, and the key part of the implementation is this Python function here: https://github.com/trailofbits/are-we-pep740-yet/blob/a87a88...
If you read the code you can see that it's hitting pages like https://pypi.org/simple/pydantic/ - which return HTML - but sending this header instead:
Then scanning through the resulting JSON looking for files whose provenance isn't set to null. Here's an equivalent curl + jq incantation:
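A sketch of what that incantation might look like, assuming the header in question is PyPI's JSON Simple API content-negotiation header (`application/vnd.pypi.simple.v1+json`) and that, per the code above, PEP 740 provenance shows up as a non-null `provenance` field on each file entry; pydantic is just an example project:

```shell
# jq filter: keep only filenames whose provenance field is non-null
filter='[.files[] | select(.provenance != null) | .filename]'

# Ask PyPI's Simple index for JSON instead of HTML, then apply the filter
# (the `|| true` just tolerates offline runs)
curl -s -H 'Accept: application/vnd.pypi.simple.v1+json' \
     'https://pypi.org/simple/pydantic/' | jq "$filter" || true
```

If the project has published attestations, this prints the list of attested filenames; for a project with no attestations it prints an empty array.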
Why invest so much time and money in a feature that prevents such a small percentage of data breaches that it's not even categorized on the 2024 Verizon Data Breach Investigations Report?
The vast majority of breaches are caused by credential theft, phishing, and exploiting vulnerabilities.
It doesn't matter that you can cryptographically verify that a package came from a given commit if that commit has accidentally-vulnerable code, or someone just gets phished.
Probably because they got a government contract under which they receive funding for 3-5 FTEs over 24-36 months in return for quarterly reports - and a tool like this makes the DARPA PM happy. They're one of those "Cyber Defense Contractors".
The 2024 DBIR, for whatever it's worth, repeatedly mentions software supply chain attacks.
Because Verizon's report, while a good read, isn't the end-all-be-all of threat intelligence.
https://www.wired.com/story/notpetya-cyberattack-ukraine-rus...
https://krebsonsecurity.com/2020/12/u-s-treasury-commerce-de...
Software supply chain attacks are rare, but when they happen, they're usually high-impact ordeals.
Correct, software supply chain attacks may be rarer than a simple SQL injection attack, but they are much harder to detect and defend against, and can have very high impact.
Where I work (highly regulated financial utility), we would absolutely love to have signed attestations for the provenance of python packages.
I also have personal experience of NotPetya; I produced the forensic analysis of how it compromised the whole of Maersk. Took less than 30 minutes to take out their entire global IT infrastructure, with a cost of about $400 million.
They already went through requiring 2FA for the most popular packages: https://blog.pypi.org/posts/2023-05-25-securing-pypi-with-2f...
This is just another step in increasing security. And of course that is something you'd preferably do before breaches happen, not only as a reaction.
Because publishing for attestations mostly goes through GitHub Actions, the attack vector becomes getting access to GitHub, which might be easier at this point.
According to this page, urllib3 does not use trusted publishing. According to https://docs.pypi.org/project_metadata/#verified-details , trusted publishing and self-links are the only ways to have "verified details". However https://pypi.org/project/urllib3/ shows Changelog/Code/Issue tracker as "Verified details" even though they are not self-links. How come?
urllib3 does not have a recent release; that could explain https://trailofbits.github.io/are-we-pep740-yet/ lagging behind.
People don’t have to use GitHub, and certainly don’t have to use GitHub Actions even if they do.
Could someone explain why this is important? My uninformed feeling towards PEP 740 is 'who cares?'.
I believe this is a system where a human/system builds a package, uploads it, and cryptographically signs it, verifying end to end that the code uploaded to GitHub for widget-package 3.2.1 is the code you're downloading to your laptop for widget-package 3.2.1, with no chance it was modified/signed by an adversarial third party.
That's my understanding also, but I still feel like 'who cares' about that attack scenario. Am I just insufficiently paranoid? Is this kind of attack really likely? (How is it done, other than evil people at pypi?)
Yes, it is likely. It is done by evil intermediaries on the hosts used to create and upload the package. It is possible, for example, if the package is built and uploaded from a compromised developer laptop.
---
From the docs:
> PyPI's support for digital attestations defines a strong and verifiable association between a file on PyPI and the source repository, workflow, and even the commit hash that produced and uploaded the file.
It still doesn’t protect against rogue commits to packages by bad actors. Which, IMO, is the larger threat (and one that’s been actively exploited). So while a step in the right direction, it certainly doesn’t completely solve the supply chain risk.
It’s honestly a bit nuts that in 2024 a system as foundational as PyPI just accepts totally arbitrary, locally built archives for its “packages”.
I appreciate that it’s a community effort and compute isn’t free, but Launchpad did this correctly from the very beginning — dput your signed dsc and it will build and sign binary debs for you.
Build farms are expensive though. The PSF probably doesn’t have this kind of money and if they do, there’s a whole lot of other Python issues to fix.
https://xkcd.com/2347/ all over again.
Could you explain why you think it is a likely risk? Has this attack happened frequently?
Yes. SolarWinds.
Code checked out from the repository was not the same as the code used to build the actual release. These are not high-likelihood incidents, but when they do occur, they tend to have high impact.
And more recently, semantically similar code-vs-build mismatch: CrowdStrike. The broken release was modified after the actual build step, as per their own build process, and the modified artifact is what was released.
It is likely in the same way that aliens are statistically likely. There is no evidence so far afaik, but you will likely never find out when it happens, and it can compromise the whole world if it were to happen to a widely used package. It is not worth the risk to not have the feature. I even think it should ideally eventually become mandatory.
I'd encourage you to read the Verizon DBIR before making statements about whether a given attack is likely or not. Hijacking build systems is not likely: https://www.verizon.com/business/resources/reports/dbir/
A directed attack against the homegrown build system of a small company is unlikely. An attack against a high profile, centralized system or a commonly used package in such system is something to be prepared against.
It’s like HTTPS vs HTTP for your packages. It’s fine if you don’t care, but having more secure standards helps us all, and hopefully it doesn’t add too much of a headache for providers while staying mostly invisible to end users.
You are correct. Start distributing and requiring hashes with your Python dependencies instead.
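In pip terms, "distributing and requiring hashes" means `--hash=sha256:...` lines in requirements.txt plus `pip install --require-hashes` (both real pip features). The sketch below just demonstrates the underlying check by hand, with a throwaway file standing in for a downloaded wheel:

```shell
# Record the artifact's sha256 at vetting time (this is what goes into
# requirements.txt as --hash=sha256:...)
printf 'wheel contents as vetted' > pkg.whl
pinned=$(sha256sum pkg.whl | awk '{print $1}')

# Later, before installing the (re)downloaded artifact, recompute and compare;
# pip refuses to install on mismatch
actual=$(sha256sum pkg.whl | awk '{print $1}')
if [ "$actual" = "$pinned" ]; then
  echo "hash ok: installing"
else
  echo "hash mismatch: refusing to install"
fi
```

Tools like `pip-compile --generate-hashes` (from pip-tools) can produce the fully hash-pinned requirements file for you.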
This thing is a non-solution to a non-problem. (The adversaries won't be MiTM'ing Github or Pypi.)
The actual problem is developers installing "foopackage" by referencing the latest version instead of vetting their software supply chain. Fixing this implies a culture change where people stop using version numbers for software at all.
Nothing stopping version numbers from being 512 bits long. It’s nice if they can be obviously correlated with time of release and their predecessor, which hashes alone can’t do.
1. Why not compile it? 2. Does pip install x not guarantee that?
Yeah, because the problem with Python packaging is a lack of cryptographic signatures.
"Rolls eyes" would be an understatement.
https://en.m.wikipedia.org/wiki/XZ_Utils_backdoor
But that involved one of the developers of said package committing malicious code and it being accepted and then deployed. How would this prevent that from happening?
I thought this was about ensuring the code that developers pushed is what you end up downloading.
No, part of the malicious code is in a test data file, and the modified m4 file is not in the git repo. The package signed and published by Jia Tan is not reproducible from the source, and it was intentionally done that way.
You might want to revisit the script of xz backdoor.
An absolutely irrelevant detail here. While there was an additional flourish of obfuscation of questionable prudence, the attack was not at all dependent on that. It’s a library that justifies all kinds of seemingly innocuous test data, and there were plenty of creative ways to smuggle selective backdoors into the build without resorting to a compromised tarball. The main backdoor mechanism resided in test data in the git repo; the entire compromise could have.
>Using a Trusted Publisher is the easiest way to enable attestations, since they come baked in! See the PyPI user docs and official PyPA publishing action to get started.
For many smaller packages in this top 360 list, I could imagine this representing quite a learning curve.
Or it could see Microsoft tightening its proprietary grip over free software by not only generously offering gratis hosting, but now also it's a Trusted Publisher and you're not - why read those tricky docs? Move all your hosting to Microsoft today, make yourself completely dependent on it, and you'll be rewarded with a green tick!
I think it's a little rude to imply that the people who worked on this are serving an ulterior motive.
It's possible they're just naive.
Microsoft for sure has an ulterior motive here, and the PyPI devs are serving it. It's not a bad thing, it's a win-win for both parties. That kind of carrot is how you get buy-in from huge companies and in return they do free labor for you that secures your software supply chain.
Thankfully, the PyPI side of the hosting is done by a smaller, unrelated company (Fastly).
I think it's pretty hard to get a Python package into the top 360 list while not picking up any maintainers who could climb that learning curve pretty quickly. I wrote my own notes on how to use Trusted Publishers here: https://til.simonwillison.net/pypi/pypi-releases-from-github
The bigger problem is for projects that aren't hosting on GitHub and using GitHub Actions - I'm sure there are quite a few of those in the top 360.
I expect that implementing attestations without using the PyPA GitHub Actions script has a much steeper learning curve, at least for the moment.
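For comparison, the path with the PyPA action is roughly the following (a sketch only: the environment name and action version are illustrative, and attestation behavior can vary by action release, though recent versions generate and upload attestations by default):

```yaml
# .github/workflows/publish.yml -- sketch of a Trusted Publishing release job
name: Publish to PyPI
on:
  release:
    types: [published]
jobs:
  pypi-publish:
    runs-on: ubuntu-latest
    environment: pypi          # must match the Trusted Publisher config on PyPI
    permissions:
      id-token: write          # required for OIDC-based Trusted Publishing
    steps:
      - uses: actions/checkout@v4
      - run: pipx run build    # build sdist + wheel into dist/
      - uses: pypa/gh-action-pypi-publish@release/v1
        # publishes dist/* ; attestations come baked in on recent versions
```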
I suspect that most of the packages in the top 360 list are already hosted on GitHub, so this shouldn’t be a leap for many of them. This is one of the reasons we saw Trusted Publishing adopted relatively quickly: it required less work and was trivial to adopt within existing CI workflows.