Apple Releases Open Weights Video Model

(starflow-v.github.io)

130 points | by vessenes 5 hours ago

26 comments

  • devinprater 2 hours ago

    Apple has a video understanding model too. I can't wait to find out what accessibility stuff they'll do with the models. As a blind person, AI has changed my life.

    • densh an hour ago

      > As a blind person, AI has changed my life.

      Something one doesn't see in news headlines. Happy to see this comment.

      • tippa123 an hour ago

        +1 and I would be curious to read and learn more about it.

        • joedevon an hour ago

          If you want to see more on this topic, check out (google) the podcast I co-host called Accessibility and Gen. AI.

      • badmonster an hour ago

        What other accessibility features do you wish existed in video AI models? Real-time vs post-processing?

      • fguerraz 34 minutes ago

        > Something one doesn't see in news headlines.

        I hope this wasn't a terrible pun

    • phyzix5761 an hour ago

      Can you share some ways AI has changed your life?

      • darkwater an hour ago

        I guess that auto-generated audio descriptions for (almost?) any video you want are a very, very nice feature for a blind person.

        • tippa123 an hour ago

          My two cents, this seems like a case where it’s better to wait for the person’s response instead of guessing.

          • darkwater 22 minutes ago

            Fair enough. Anyway, I wasn't trying to say what actually changed GP's life; I was just expressing my opinion on what video models could potentially bring as an improvement to a blind person.

        • baq an hour ago

          Guessing that being able to hear a description of what the camera is seeing (basically a special case of a video) in any circumstances is indeed life-changing if you're blind...? Take a picture through the window and ask what the commotion is? A door that's normally open is closed - take a picture and tell me if there's a sign on it? Etc.

      • gostsamo 21 minutes ago

        Not the GP, but I'm currently reading a web novel with a card game where the author didn't include alt text in the card images. I contacted them about it and they started adding it, but in the meantime AI was a big help. The same goes for all kinds of other images on the internet when they're significant to understanding the surrounding text. It also makes for a better search experience when Google, DDG, and the like make finding answers difficult. I might use smart glasses for better outdoor orientation, though a good solution might take some time. Phone camera plus AI is also situationally useful.

  • RobotToaster an hour ago

    The license[0] seems quite restrictive, limiting its use to non-commercial research. It doesn't meet the open source definition, so it's more appropriate to call it weights-available.

    [0]https://github.com/apple/ml-starflow/blob/main/LICENSE_MODEL

  • yegle an hour ago

    Looking at the text-to-video examples (https://starflow-v.github.io/#text-to-video), I'm not impressed. They gave me the feeling of the early Will Smith noodles videos.

    Did I miss anything?

    • M4v3R an hour ago

      These are ~2 years behind state of the art from the looks of it. Still cool that they're releasing anything that's open for researchers to play with, but it's nothing groundbreaking.

      • Mashimo an hour ago

        But 7B is rather small, no? Are other open-weight video models also this small? Can this run on a single consumer card?

      • tomthe an hour ago

        No, it is not as good as Veo, but better than Grok, I would say. Definitely better than what was available 2 years ago. And it is only a 7B research model!

  • mdrzn 43 minutes ago

    "VAE: WAN2.2-VAE" so it's just a Wan2.2 edit, compressed to 7B.

    • kouteiheika 6 minutes ago

      This doesn't necessarily mean that it's Wan2.2. People often don't train their own VAEs and just reuse an existing one, because a VAE isn't really what's doing the image generation part.

      A little bit more background for those who don't know what a VAE is (I'm simplifying here, so bear with me): it's essentially a model which turns raw RGB images into something called a "latent space". You can think of it as a fancy "color" space, but on steroids.

      There are two main reasons for this. One is to make the model which does the actual useful work more computationally efficient: VAEs usually downscale the spatial dimensions of the images they ingest, so instead of having to process a 1024x1024 image, your model only needs to work on a 256x256 one. (However, they often increase the number of channels to compensate, but I digress.)
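
      As a rough, purely illustrative sketch of that shape change (the 4x downscale and 16 latent channels here are made-up numbers, not STARFlow-V's or WAN2.2's actual configuration):

        import torch
        import torch.nn as nn

        # Toy encoder: two stride-2 convolutions give a 4x spatial downscale
        # while widening the channel dimension from 3 (RGB) to 16.
        toy_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 16, kernel_size=3, stride=2, padding=1),
        )

        rgb = torch.randn(1, 3, 1024, 1024)  # one 1024x1024 RGB image
        latent = toy_encoder(rgb)
        print(latent.shape)                  # torch.Size([1, 16, 256, 256])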

      The other reason is that, unlike raw RGB space, the latent space is actually a higher level representation of the image.

      Training a VAE isn't the most interesting part of image models, and while it is tricky, it's done entirely in an unsupervised manner. You give the VAE an RGB image, have it convert it to latent space, then have it convert it back to RGB; you take a diff between the input RGB image and the output RGB image, and that's the signal you use when training them (in reality it's a little more complex, but, again, I'm simplifying to keep the explanation clear). So it makes sense to reuse them and concentrate on the actually interesting parts of an image generation model.
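
      A minimal sketch of that training signal, with a hypothetical toy encoder/decoder pair standing in for a real VAE (real ones add a KL/regularization term and usually perceptual or adversarial losses, which is the "little more complex" part):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        # Toy autoencoder: 4x spatial downscale into 16 latent channels, then back to RGB.
        encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 16, 3, stride=2, padding=1),
        )
        decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )
        opt = torch.optim.AdamW(
            list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
        )

        # Stand-in for a dataloader of unlabeled RGB images (no labels needed).
        dataloader = [torch.randn(2, 3, 256, 256) for _ in range(10)]

        for rgb in dataloader:
            latent = encoder(rgb)          # RGB -> latent space
            recon = decoder(latent)        # latent -> back to RGB
            loss = F.mse_loss(recon, rgb)  # the input/output diff is the training signal
            opt.zero_grad()
            loss.backward()
            opt.step()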

    • BoredPositron 32 minutes ago

      They used WAN's VAE, as many other models do. For image models you see a lot of them using the Flux VAE. That's perfectly fine: they're released under Apache 2.0 and save you time so you can focus on your transformer architecture...

  • satvikpendem 2 hours ago

    Looks good. I wonder what use case Apple has in mind though, or I suppose this is just what the researchers themselves were interested in, perhaps due to the current zeitgeist. I'm not really sure how research works at big tech companies - are there top-down mandates?

  • coolspot 2 hours ago

    > STARFlow-V is trained on 96 H100 GPUs using approximately 20 million videos.

    They don’t say for how long.

  • nothrowaways 2 hours ago

    Where do they get the video training data?

    • postalcoder 2 hours ago

      From the paper:

      > Datasets. We construct a diverse and high-quality collection of video datasets to train STARFlow-V. Specifically, we leverage the high-quality subset of Panda (Chen et al., 2024b) mixed with an in-house stock video dataset, with a total number of 70M text-video pairs.

  • camillomiller an hour ago

    Hopefully this will make it into some useful feature in the ecosystem and not contribute to just more terrible slop. Apple has saved itself from the destruction of quality and taste that these models enabled; I hope it stays that way.