Dstack: An alternative to k8s for AI/ML tasks

(github.com)

32 points | by shcheklein a day ago ago

13 comments

Oh, excited to see dstack featured here. Founder and core contributor to dstack here. Yes, we aim to simplify container orchestration for AI and build an alternative to both K8S and Slurm - for both multi-cloud and on-prem.

Would love to hear feedback!

[-]

doctorpangloss a day ago

What are the differences in opinions between dstack and SkyPilot? Why should I try dstack first over SkyPilot? Same question could be posed to Modal.

Aren’t you worried about the amount of product development that goes into Kubernetes and its ecosystem? For example, is Keda poorly managed, or does flexible autoscaling require full time attention from a developer? What about democratic-csi? NVIDIA GPU Operator, AMD’s k8s device plugins, Intel’s? Calico versus Cilium?

[-]

cheptsov 21 hours ago

Thank you for the question!

> What are the differences in opinions between dstack and SkyPilot?

SkyPilot is great. I think there are many tiny details though. At dstack, we try to provide out-of-the-box and more high-level experience.

Examples:

1. Authorization built-into services

2. Dev environments with IDE integration

3. HTTPS out of the box with an ability to set up own domains

4. Projects for team management and resource isolation

5. Hardware metrics tracking

Also, we try to distance from Kubernetes and improve our own orchestrator that natively integrates with cloud providers

> Same question could be posed to Modal.

Modal is great too. Modal's strengths is Python decorators and their focus on cloldstarts/serverless kind of experience.

dstack here is more about flexibility/multi-cloud/on-prem/etc. For example, I personally prefer being able to run any code with dstack without changing my code. Otehr people may prefer Python decorators.

> Aren’t you worried about the amount of product development that goes into Kubernetes and its ecosystem?

A very good question. I think its both a strength and a weakness of Kubernetes. So far we see that for us it's a lot easier to bring AI-native experience, simplify it, and make it more out of the box.

We of course respect K8S though. But we think the community to deserve options!

> What about democratic-csi?

Haven't seen it yet. Will look into it. At dstack, we support volumes for both cloud and on-prem.

> NVIDIA GPU Operator, AMD’s k8s device plugins

We aim to support any accelerators out of the box.

> Calico versus Cilium?

Haven't heard of it yet.

P.S.: Whould love to hear your opinion too!

empath75 a day ago

Trying to understand what this gives me that kubeflow doesn't and I think it's just completely untrue that this does everything kubeflow does and more, as your documentation claims.

[-]

cheptsov a day ago

Well,

1. No changes in the code required; Works out of the box with any Docker image; Incl. support for distributed training

2. Out-of-the-box support for AMD/NVIDIA/TPU

3. Multi-cloud, incl. Neoclouds such as Lambda, RunPod, TensorDock, and more to come

4. 5 min to set up your own on-prem cluster

5. Easye to combine multi-cloud + on-prem

6. On top of task, you get dev environments and model inference

whinvik a day ago

Struggling to compare it to k8s but maybe that's just my lack of knowledge on k8s.

However, to the general HN crowd my question is what would be the correct abstraction for something that wants to replace k8s to make one think that this new abstraction is simple enough.

In my naive understanding, I would think that if I don't have to think beyond a docker-compose.yml that would be the right level of simple. To clarify, locally I work with the docker compose file to bring up my services, and I should be able to just deploy it to AWS for example.

[-]

cheptsov a day ago

I guess it depends on the use case. For example with dstack, we focus on AI.

Our abstractions include:

1. Dev environments - you need them often and need an easy way to get one with tight GOU resources - either using already provisioned resources or provision on-demand

2. Tasks. For example, in AI you may want to run distributed tasks over a cluster using your favorite framework like pytroch

3. Services - very close to Docker Compose. And you can use it with dstack. But for you may want to also manage GPU requirements; and of course auto-scaling

4. Managing clusters. As an AI user you may want to provision them on-demand. This is what we call fleets with dstack.

5. Ingress for public endpoints. Dstack also handles authorization and OpenAI endpoint mapping - as it’s important for AI.

6. Finally you need to manage tenancies - isolate resources across projects or teams. With dstack, we call it projects.

htrp a day ago

>dstack is a streamlined alternative to Kubernetes and Slurm, specifically designed for AI. It simplifies container orchestration for AI workloads both in the cloud and on-prem, speeding up the development, training, and deployment of AI models.

Any explanation for how it simplifies orchestration.

>dstack supports NVIDIA GPU, AMD GPU, and Google Cloud TPU out of the box.

Any plans for trainium ?

[-]

cheptsov a day ago

> Any explanation for how it simplifies orchestration.

There are many examples in the docs. Here are few:

1. ALl accelerators are supported natively (no operators are required).

2. Distributed task work out of the box [1]

3. For inference, OpenAI compatible gateway is provided automatically for any models deployed; along with authentication. [2]

4. Cluster management is a lot more convenient for AI compared to K8S [3]

More than anything else, dstack is more lightweight; no one need to know K8S.