With consistent representation of characters, are we now on the precipice of a Cambrian explosion of manga/graphic novels/comics?
not yet, still can't generate transparent images
Elegant architecture, trained from scratch, excels at image editing. This looks very interesting!
From https://arxiv.org/html/2409.11340v1
> Unlike popular diffusion models, OmniGen features a very concise structure, comprising only two main components: a VAE and a transformer model, without any additional encoders.
> OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, rather than text-only or image-only conditions.
> Additionally, we incorporate several classic computer vision tasks such as human pose estimation, edge detection, and image deblurring, thereby extending the model’s capability boundaries and enhancing its proficiency in complex image generation tasks.
This enables prompts for edits like: "|image_1| Put a smile face on the note." or "The canny edge of the generated picture should look like: |image_1|"
> To train a robust unified model, we construct the first large-scale unified image generation dataset X2I, which unifies various tasks into one format.
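For anyone wanting to try the interleaved prompts described above, here is a minimal sketch assuming the interface published in the OmniGen repository (OmniGenPipeline, input_images, guidance_scale, img_guidance_scale, and the <img><|image_1|></img> placeholder that the comment abbreviates as |image_1|). The exact argument names and placeholder syntax should be checked against the repo before use.

```python
# Hedged sketch: one prompt mixes text and an image reference, so the edit
# instruction and the source image travel together, with no extra encoders.
# Interface assumed from the OmniGen repo's usage examples; verify names there.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

images = pipe(
    # <img><|image_1|></img> refers to input_images[0]; the comment above
    # writes this placeholder in shorthand as |image_1|.
    prompt="<img><|image_1|></img> Put a smile face on the note.",
    input_images=["note.png"],   # hypothetical local path for this example
    height=1024,
    width=1024,
    guidance_scale=2.5,          # text guidance, per the repo's examples
    img_guidance_scale=1.6,      # image guidance, per the repo's examples
    seed=0,
)
images[0].save("edited_note.png")
```

The point for this thread is that the image condition lives inside the prompt string itself rather than passing through a separate ControlNet- or IP-Adapter-style encoder, which is what the "only a VAE and a transformer" claim amounts to in practice.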
I left all the defaults as-is, uploaded a small image, typed in "cafe," and 15 minutes later I am still waiting for it to finish.
I think this type of capability will make a lot of image generation stuff obsolete eventually. In a year or two, 75%+ of what people do with ComfyUI workflows might be built into models.
Love this idea -- you have a typo in the tools: "Satble Diffusion".
Curious what's the actual cost for each edit? Will this infra always be reliable?
I mean, I struggle even getting Dall-E to iterate on one image without changing everything, so this is pretty cool
It seems like there's a lot of potential for abuse if you can get it to generate AI images of real people reliably.
Hrmm, so this is how it's gonna be moving forward then? Use a smidgen of truth, to tell the whole falsehood, and nuttin' but the falsehoods. Sheesh, but at least the subject is real? And that's that, nuttin' else doh.
We've been manipulating photos as long as we've been taking them.