
  • fngjdflmdflg 2 days ago

    >Meissonic, with just 1B parameters, offers comparable or superior 1024×1024 high-resolution, aesthetically pleasing images while being able to run on consumer-grade GPUs with only 8GB VRAM without the need for any additional model optimizations. Moreover, Meissonic effortlessly generates images with solid-color backgrounds, a feature that usually demands model fine-tuning or noise offset adjustments in diffusion models.

    This looks really cool. Also nice to see another architecture besides diffusion being used for image generation. It seems like transformers can handle just about every generation/understanding problem now: text, images, translation, OCR. Perhaps llama 4/5 will have image generation as well. edit: llama 3.2 already has image understanding; they probably just don't want to release an image generator for other reasons.
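
    For context on the non-diffusion part: Meissonic generates with masked image modeling in the MaskGIT/MUSE family, where a transformer fills in VQ image tokens in parallel over a handful of refinement steps instead of denoising. Below is a minimal sketch of that decoding loop, assuming a cosine unmasking schedule; `predict_logits`, the codebook size, and the grid size are illustrative placeholders, not Meissonic's actual components.

    ```python
    import math
    import numpy as np

    MASK = -1           # sentinel id for "still masked"
    VOCAB = 8192        # assumed VQ codebook size (placeholder)
    SEQ = 32 * 32       # assumed token-grid length (placeholder)

    def predict_logits(tokens, rng):
        # Stand-in for the trained transformer's forward pass;
        # returns random scores here purely so the loop runs.
        return rng.random((len(tokens), VOCAB))

    def generate(steps=16, seed=0):
        rng = np.random.default_rng(seed)
        tokens = np.full(SEQ, MASK)
        for t in range(1, steps + 1):
            logits = predict_logits(tokens, rng)
            pred = logits.argmax(axis=-1)
            conf = logits.max(axis=-1)
            conf[tokens != MASK] = np.inf   # committed tokens never re-mask
            # Cosine schedule: fraction of the grid left masked after step t.
            n_mask = int(SEQ * math.cos(math.pi / 2 * t / steps))
            tokens = np.where(tokens == MASK, pred, tokens)
            if n_mask > 0:
                # Re-mask the least confident predictions for the next pass.
                tokens[np.argsort(conf)[:n_mask]] = MASK
        return tokens   # a VQ decoder would map these ids back to pixels

    print(generate()[:10])
    ```

    Parallel decoding like this needs only ~16 forward passes per image rather than one per token, which is a big part of the efficiency argument relative to autoregressive generation.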

  • mysteria 2 days ago

    Interesting how pretty much all the example images look like renders/paintings as opposed to photographs. Maybe that's what it's trained on?

  • jensenbox 2 days ago

    The images in the PDF are amazing.

  • littlestymaar a day ago

    > It’s crucial to highlight the resource efficiency of our training process. Our training is considerably more resource-efficient compared to Stable Diffusion (Podell et al., 2023). Meissonic is trained in approximately 48 H100 GPU days

    From-scratch training of an image synthesis model for the price of a graphics card isn't something I expected anytime soon!
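
    That comparison holds up as a back-of-the-envelope calculation; the hourly H100 rental rate below is an assumed figure, not from the paper:

    ```python
    # Rough cost of "48 H100 GPU days" at an assumed on-demand cloud rate.
    H100_GPU_DAYS = 48
    DOLLARS_PER_HOUR = 2.50   # assumed H100 rental price, not from the paper

    cost = H100_GPU_DAYS * 24 * DOLLARS_PER_HOUR
    print(f"~${cost:,.0f}")   # ~$2,880, in the ballpark of a high-end consumer GPU
    ```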