Apertus 70B: Truly Open - Swiss LLM by ETH, EPFL and CSCS

(huggingface.co)

117 points | by denysvitali 3 days ago

17 comments

  • denysvitali 3 days ago

    Report: https://github.com/swiss-ai/apertus-tech-report/raw/refs/hea...

    Key features

    Fully open model: open weights + open data + full training details including all data and training recipes

    Massively Multilingual: 1811 natively supported languages

    Compliant: Apertus is trained while respecting opt-out consent of data owners (even retrospectively), and avoiding memorization of training data
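
    One way to operationalize "avoiding memorization" at the objective level is a goldfish-style loss, which excludes a deterministic subset of tokens from the cross-entropy so the model never gets a gradient toward reproducing a training sequence verbatim. Below is a minimal sketch of that general idea; the function name, fixed stride, and masking scheme are illustrative assumptions, not the exact Apertus recipe.

        import torch
        import torch.nn.functional as F

        def goldfish_style_loss(logits, targets, drop_every=4):
            """Cross-entropy that skips every k-th target token.

            Dropping a deterministic subset of tokens from the loss
            means no training sequence can be learned verbatim. The
            fixed stride here is a simplification of hash-based
            token-drop schemes.
            """
            batch, seq_len, vocab = logits.shape
            positions = torch.arange(seq_len, device=targets.device)
            keep = positions % drop_every != 0  # mask out every k-th position
            return F.cross_entropy(
                logits[:, keep].reshape(-1, vocab),
                targets[:, keep].reshape(-1),
            )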

    • lyu07282 2 days ago

      Their struggle with the Nvidia driver bugs they had to work around was very relatable. You'd think that if someone buys 10,752 of their high-end GPUs, they'd get some support with them.

      • _zoltan_ an hour ago

        did I miss a blog on this?

    • Bromeo 3 days ago

      Looks like the performance is pretty decent, somewhere around Llama 3.1 for general knowledge (Table 17) but still a bit behind in code and reasoning (Table 18). Llama 3.1 was released about a year ago.

    • esafak an hour ago

      There's an interesting "Swiss AI Charter" on pg. 107.

  • nickpsecurity 3 days ago

    Upvoting to encourage discussion of these differentiators:

    "Apertus is a 70B and 8B parameter language model designed to push the boundaries of fully-open multilingual and transparent models. The model supports over 1000 languages and long context, it uses only fully compliant and open training data, and achieves comparable performance to models trained behind closed doors."

    "pretrained on 15T tokens with a staged curriculum of web, code and math data"

    "open weights + open data + full training details including all data and training recipes"

    "Apertus is trained while respecting opt-out consent of data owners (even retrospectivey), and avoiding memorization of training data"

    • Mars008 3 days ago

      At least they don't call it "open source".

      > "open weights + open data + full training details including all data and training recipes"

      Is it reproducible?

      > respecting opt-out consent of data owners (even retrospectivey)

      Were they notified and given an option to opt out? Owners and authors are not the same. Data owners aren't copyright owners either.

      > avoiding memorization of training data

      Not convincing.

      • ujjkel9938 2 days ago

        I saw some of the pretraining code on GitHub, but not the post-training code.

  • lastdong 3 days ago

    In my opinion, we need more models trained on fully traceable and clean data instead of closed models that we later find out were trained on Reddit and Facebook discussion threads.

  • WanderPanda an hour ago

    Imagine regulators doing their job for once and creating clean regulation that removes the uncertainty about liability for such releases, so that they could just slap Apache or MIT on it and call it a day, without being required to collect personal data to comply with the "acceptable use policy".

  • SilverElfin 3 days ago

    Apparently a project of https://www.swiss-ai.org/

  • habi an hour ago

    https://apertus.org/ has existed for 15 years; interesting choice of name.

  • titaniumrain 3 days ago

    seems DOA

  • cmdrk 3 days ago

    Does their training corpus respect copyrights, or do you have to follow their opt-out procedure to keep them from consuming your data? Assuming it's the latter, it's more open, but still not quite there.

    • SparkyMcUnicorn 2 hours ago

      Your question is addressed in the opening abstract: https://github.com/swiss-ai/apertus-tech-report/raw/refs/hea...

      > Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for copyrighted, non-permissive, toxic, and personally identifiable content.

    • traspler 2 hours ago

      Afaik they respect robots.txt at crawl time, and later, when actually using the data, they re-check robots.txt and exclude the data if it has since been updated to deny access. They do further data filtering as well, but for that you'd better check the technical report.
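
      A minimal sketch of that re-check step, using Python's standard robots.txt parser (the user-agent string and helper name are placeholders, not the actual pipeline):

          from urllib import robotparser
          from urllib.parse import urlsplit

          def still_allowed(url, user_agent="ExampleCrawler"):
              """Re-check a site's current robots.txt before using crawled data.

              Even if a page was crawlable at fetch time, drop it when the
              live robots.txt now disallows the crawler.
              """
              parts = urlsplit(url)
              rp = robotparser.RobotFileParser()
              rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
              rp.read()  # fetch and parse the live robots.txt
              return rp.can_fetch(user_agent, url)

          # Keep only documents whose source still permits crawling:
          # corpus = [doc for doc in corpus if still_allowed(doc.url)]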