I'm missing something. Shouldn't any LLM that's 'natively multimodal' somehow include embeddings which are multimodal? For example, here's Google's blog post on Gemini:
Until now, the standard approach to creating multimodal models involved
training separate components for different modalities and then stitching them
together to roughly mimic some of this functionality. These models can
sometimes be good at performing certain tasks, like describing images, but
struggle with more conceptual and complex reasoning.
We designed Gemini to be natively multimodal, pre-trained from the start on
different modalities. Then we fine-tuned it with additional multimodal data to
further refine its effectiveness. This helps Gemini seamlessly understand and
reason about all kinds of inputs from the ground up, far better than existing
multimodal models — and its capabilities are state of the art in nearly every
domain.
LLM embeddings contain superpositions of many concepts, so while they may predict the next token well, they don't actually outperform contrastively pretrained embedding models.
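To unpack that a bit: dedicated embedding models are typically trained with an in-batch contrastive (InfoNCE-style) objective over paired queries and documents, rather than with next-token prediction. Below is a minimal sketch of that loss in PyTorch, purely as a generic illustration (the function name and temperature are my own choices; this is not Voyage's or Google's actual training code):

    import torch
    import torch.nn.functional as F

    def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
        """In-batch contrastive loss: row i of doc_emb is the positive for row i of query_emb."""
        q = F.normalize(query_emb, dim=-1)
        d = F.normalize(doc_emb, dim=-1)
        logits = q @ d.T / temperature                      # scaled cosine similarities against all in-batch candidates
        labels = torch.arange(q.size(0), device=q.device)   # the diagonal entries are the positives
        return F.cross_entropy(logits, labels)

The point is that this training signal directly rewards placing matching pairs close together in cosine space, which next-token pretraining never explicitly optimizes for.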
This does read as very impressive. Any critical perspectives on the presented evaluation? What about non-English text?
I understand the model is, like other commercial ones, available exclusively through their API, right?
Yes, voyage models are API only.
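To make that concrete, usage goes through the hosted API or the official Python client. A rough sketch of a text-embedding call, assuming the voyageai package and a VOYAGE_API_KEY in the environment (check the docs linked in the sibling comment for the exact parameters, and for how images are passed to voyage-multimodal-3):

    import voyageai

    # Assumes the official voyageai client; it reads VOYAGE_API_KEY from the
    # environment if no api_key is passed.
    vo = voyageai.Client()

    result = vo.embed(
        ["a query about indemnification clauses", "annual revenue grew 12%"],
        model="voyage-3",       # the domain-specific law/code/finance variants are selected the same way via model=
        input_type="document",  # "query" vs "document" for asymmetric retrieval
    )
    print(len(result.embeddings), len(result.embeddings[0]))  # number of vectors, vector dimension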
There was a part here about multilingualism but that was wrong! Sorry!
FWIW: Voyage also has separate `law`, `code`, and `finance` models. See [1]
Really cool results, anyway.
[1]: https://docs.voyageai.com/docs/embeddings
Glad you liked the results! We do have multilingual models (and rerankers) -- voyage-3, in particular, is multilingual: https://blog.voyageai.com/2024/09/18/voyage-3/
voyage-multimodal-3 is multilingual as well, supporting the same set of languages as voyage-3.
Sorry for spreading false information. I edited the post above.
It is interesting that you're not as up-front about multilingualism as Cohere. They seem to mention it a lot, which led to my confusion.
No worries at all. That's great feedback and an area of improvement for us when it comes to future posts -- we'll be more explicit about multilingualism in blogs and in our docs.
In the traditional Python API, the Voyage engine tokenizes blocks of text into a sequence of tokens before embedding them. This model seems to do the analogous thing for images, mapping them into the same vector space.
Common words like 'you' and 'apple' will each be a single token, while rarer terms like 'pikachu' may be split into subword pieces such as pik-a-chu. See [1].
[1]: https://docs.voyageai.com/docs/tokenization
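If you want to see that whole-word vs. subword splitting in action, any off-the-shelf subword tokenizer shows the same pattern. The snippet below uses a generic Hugging Face WordPiece tokenizer purely as a stand-in; Voyage's own tokenizer (docs at [1] above) may split these words differently:

    from transformers import AutoTokenizer

    # Generic WordPiece tokenizer, used only to illustrate subword splitting;
    # this is NOT Voyage's actual tokenizer.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    for word in ["you", "apple", "pikachu"]:
        print(word, "->", tok.tokenize(word))
    # Common words typically come back as a single token, while rarer names
    # get broken into multiple subword pieces.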