Codeberg Reconsidering OSI License Approval in Terms of Use

(codeberg.org)

52 points | by pabs3 8 months ago

16 comments

EU AI Act (https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:...) as of today (CTRL + F "open-source"):

> (89) Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, in particular towards the provider that has used or integrated them, when those tools, services, processes, or AI components are made accessible under a free and open-source licence. ...

> Article 2, 12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under Article 5 or 50.

Let's see if the EU AI Act will be adjusted in the same spirit as discussed in the linked discussion.

weebull 8 months ago

> > ..., should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, ...

What does that mean?

TheNewsIsHere 8 months ago

It reads to me that you can’t pass on upstream AI vendor/platform/service requirements downstream to users/customers/end parties.

Which would be separate from any legislated requirements or limitations.

pabs3 8 months ago

I like Debian's policy for libre AI:

https://salsa.debian.org/deeplearning-team/ml-policy/

rettichschnidi 8 months ago

«ToxicCandy Model» is a great term!

RobotToaster 8 months ago

Given that most AI is trained on data scraped from the internet (most of which isn't open source), isn't it basically impossible to release an entire training dataset under an open source licence?

regularfry 8 months ago

That would, I suspect, be the point. If your AI is trained on non-free content, the implication is that it would be impossible for it to be released with an open source licence. So don't do that, the argument goes: only use content that has been released with a sufficiently free licence that republishing it in your dataset is not a problem. And as a side effect, you have to show that there isn't any "misappropriated" content in your training set. That side effect is what gets some people excited here.

I don't agree with that position legally, but I do mechanically. The point of the GPL family (to pick one random type of licence) is that the end user should have the capability to modify the product to their own ends, and I don't think fine-tuning provides enough capability to qualify.

pabs3 8 months ago

It has been done before. For example, the original RNNoise was trained on proprietary data; later there was a crowd-sourced effort to record new data and release it under libre licenses.

https://github.com/xiph/rnnoise/

guerrilla 8 months ago

They could release as much as is necessary to recreate it: the crawlers or list of links they used, and the configuration or scripts used to drive the training. Nobody is asking for the entire web in their git repo, only the ability to retrain from scratch, possibly with modifications.

tourmalinetaco 8 months ago

This is incredibly unrealistic. Imagine hundreds to thousands of individuals scraping thousands of websites, DDoSing them because their mid-tier “open source” LLM project gave them a list of links to “recreate” their dataset. It’s far more sustainable to create a dataset filled with GPL, public domain, and other permissively licensed data rather than nuke half the Internet’s bandwidth. And that’s ignoring the fact that scraping it yourself does not actually grant you a permissive license. That’s like saying that by watching a movie in theaters you have a legal right to sell that movie.

echoangle 8 months ago

Not really, because there’s no guarantee that it will be available in the future. A script to download the data doesn’t mean I can reliably recreate the data in 5 years, so I wouldn’t call that open source. To me, the data itself needs to be published.

guerrilla 8 months ago

Oh well, they did their best. You can't expect them to do better than what is possible. Enough Nirvana fallacy here.

echoangle 8 months ago

I’m not expecting them to do the impossible, but they shouldn’t call it open source then. Either you provide all the data and call it open source, or you don’t provide the data because it is proprietary and don’t call the model open source.

guerrilla 8 months ago

Well, I'm glad you exist to push the Overton window even further anyway. A lot of people are trying to claim that what's being pushed now (opaque data) is open source. I'll be satisfied if I can at least approximate the training with whatever is online at the time I do it.

guerrilla 8 months ago

It's heartening to see people take this seriously. Let's hope many more stand up for the basic ontology and spirit of free software.
