It can't make outbound network calls though, so this fails:
llm -m gemini-2.0-flash-exp -o code_execution 1 \
'write python code to retrieve https://simonwillison.net/ and use a regex to extract the title, run that code'
Code execution is okay, but soon runs into the problem of missing packages that it can't install.
Practically, sandboxing hasn't been super important for me. Running Claude with MCP-based shell access has been working fine, as long as you instruct it to use a venv, a temporary directory, etc.
Brown Pelican (Pelecanus occidentalis) heads are white in the breeding season. Birds start breeding aged three to five. So technically the statement is correct but I wonder if Gemini didn't get its pelicans and cormorants in a muddle. The mainland European Great Cormorant (Phalacrocorax carbo sinensis) has a head that gets progressively whiter as birds age.
Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction.
But once they do get moving in the right direction, they can achieve things that smaller companies can't. Google has an insane amount of talent in this space, and seems to be getting the right results from that now.
Remains to be seen how well they will be able to productize and market, but it's hard to deny that their LLM models are really, really good.
> Remains to be seen how well they will be able to productize and market
The challenge is trust.
Google is one of the leaders in AI and is home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.
It's hard to justify committing developers and money to a product when there's a good chance you'll just have to pivot again once they get bored. Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.
> they also have an incredibly bad track record of supporting their products
Incredibly bad track record of supporting products that don't grow. I'm not saying this to defend Google, I'm still (perhaps unreasonably) angry because of Reader, it's just that there is a pattern and AI isn't likely to fit that for a long while.
I'm sad for Reader, but it was a somewhat niche product. Inbox I can't forgive. It was insanely good and was killed because it was a threat to Gmail.
My main issue with Google is that internal politics affect users all the time. See the debacle of anything built on top of Android being treated as a second-class citizen.
You can’t trust a company which can’t shield users from its internal politics. It means nothing is aligned correctly for users to be taken seriously.
Yeah, either AI is significant, in which case Google isn't going to kill it. Or AI is a bubble, in which case any of the alternatives one might pick can easily crash and die long before Google end-of-lifes anything.
This isn't some minor consumer play, like a random tablet or Stadia. Anyone who has been paying attention would have noticed that AI has been an important, consistent, long-term strategic interest of Google's for a very long time. They've been killing off the failing/minor products to invest in this.
Yes. Imagine Google banning your entire Google account / Gmail because you violated their gray area AI terms ([1] or [2]). Or, one of your users did via an app you made using an API key and their models.
With that being said, I am extremely bullish on Google AI for a long time. I imagine they land at being the best and cheapest for the foreseeable future.
For me that is a reason for not touching anything from Google when building stuff. I can afford losing my Amazon account, but losing my Google one would be too much. At the least they should be clear in their terms that getting banned from Cloud doesn't mean getting banned from Gmail/Docs/Photos...
That won't help. Their TOS and policies are vague enough that they can terminate all accounts you own (under "Use of multiple accounts for abuse" for instance).
> Google is one of the leaders in AI and is home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.
This is why we've stayed with Anthropic. Every single person I work with on my current project is sore at Google for discontinuing one product or another - and not a single one of them mentioned Reader.
We do run some non-customer facing assets in Google Cloud. But the website and API are on AWS.
> Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.
Eh... I don't know about that. Their tech graveyard isn't as populous as Google's, but it's hardly empty. A few that come to mind: ATL, MFC, Silverlight, UWP.
Besides Silverlight (which was supported all the way until the end of 2021!), you can still not only run but _write new applications_ using all of the listed technologies.
That doesn't constitute support when it comes to development platforms. They've not received any updates in years or decades. What they've done is simply not remove the build capability from the toolchains - that is, they haven't even done the work that would be required to stop supporting them. Compare that to C#, which has evolved rapidly over the same time period.
Only because they operate different businesses. Google is primarily a service provider. They have few software products that are not designed to integrate with their servers. Many of Microsoft's businesses work fundamentally differently. There's nothing Microsoft could do to Windows to disable all MFC applications and only MFC applications, and if there was it would involve more work than simply not doing anything else with MFC.
I can write something with Microsoft tech and expect it with reasonable likelihood to work in 10 years (even their service-based stuff), but can't say the same about anything from Google.
That alone stops me/my org buying stuff from Google.
I'm not contending that Microsoft and Google are equivalent in this regard, I'm saying that Microsoft does have a history of releasing technologies and then letting them stagnate.
They have to not get blindsided by Sora, while at the same time fighting the cloud war against MS/Amazon.
Weirdly Google is THE AI play. If AI is not set to change everything and truly is a hype cycle, then Google stock withstands and grows. If AI is the real deal, then Google still withstands due to how much bigger the pie will get.
It's a huge factor. Nothing makes all your efforts look duller than video generation. It's video, it's literally video. So they have to fight tooth and nail not to fall behind on that.
They already looked slow because they whiffed on releasing an LLM. They'll look even slower if the top efforts beat them in video, because everyone can see it. They won't need benchmarks.
> and Google has been famously bad at getting people aligned and driving in one direction.
To be fair, it's not that they're bad at it -- it's that they generally have an explicit philosophy against it. It's a choice.
Google management doesn't want to "pick winners". It prefers to let multiple products (like messaging apps, famously) compete and let the market decide. According to this way of thinking, you come out ahead in the long run because you increase your chances of having the winning product.
Gemini is a great example of when they do choose to focus on a single strategy, however. Cloud was another great example.
I definitely agree that multiple competing products is a deliberate choice, but it was foolish to pursue it for so long in a space like messaging apps that has network effects.
>> hard to deny that their LLM models are really, really good
The context window of Gemini 1.5 Pro is incredibly large, and it retains the memory of things in the middle of the window well. It is quite a game changer for RAG applications.
Bear in mind that a "1 million token" context window isn't actually that. You're being sold a sparse attention model, which is guaranteed to drop critical context. Google TPUs aren't running inference on a TERABYTE of fp8 query-key inputs, let alone TWO of fp16.
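For a back-of-envelope sense of scale, here's the dense-attention KV-cache math. The layer/head counts below are guesses for a hypothetical frontier-scale model, not Gemini's undisclosed architecture:

```python
# KV cache = 2 (K and V) * layers * KV heads * head dim * seq len * bytes/value
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_val):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Dense multi-head attention at 1M tokens, fp16: 80 layers, 64 heads of dim 128
print(kv_cache_bytes(1_000_000, 80, 64, 128, 2) / 1e12)  # ~2.6 TB

# Aggressive grouped-query attention, fp8: only 8 KV heads
print(kv_cache_bytes(1_000_000, 80, 8, 128, 1) / 1e9)    # ~164 GB
```

Even with heavy GQA and fp8 that's per request, which is why some form of sparse or sliding attention is a reasonable suspicion.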
BERT and Gemma 2B were both among the highest-performing edge models of their time. Google does really well here - in terms of pushing efficiency in the community, they're second to none. They also don't need to rely on inordinate amounts of compute, because Google's differentiating factor is the products they own and how they integrate them. OpenAI is API-minded; Google is laser-focused on the big-picture experience.
For example: those little AI-generated YouTube summaries that have been rolling out are wonderful. They don't require heavyweight LLMs to generate, and can create pretty effective summaries using nothing but a transcript. Not only is it more useful than the other AI "features" I interact with regularly, it doesn't demand AGI or chain-of-thought.
I disagree - another way you could phrase this is that Google is presbyopic. They're very capable of thinking long-term (e.g. Google DeepMind and AI as a whole, cloud, video, Drive/GSuite, etc.), but as a result they struggle to respond to quick market changes. AdSense is the perfect example of Google "going long" on a product and reaping the rewards to monopolistic ends. They can corner a market when they set their sights on it.
I don't think Google (or really any of FAANG) makes "good" products anymore. But I do think there are things to appreciate in each org, and compared to the way Apple and Microsoft are flailing helplessly I think Google has proven themselves in software here.
Yet, Google continues to show it'll deprecate its APIs, services, and functionality to the detriment of your own business. I'm not sure enterprises will trust Google's LLM over the alternatives. Too many have been burned throughout the years, including GCP customers.
I read parent comment "grew 35% last quarter" as (income on 2024-09-30) is 1.35 * (income on 2024-07-01)
The balance sheet shows (income on days from 2024-07-01 through 09-30) is 1.35 * (income on days from 2023-07-01 through 09-30)
These are different because, with heavily handwavey math, the first is growing 35% in a single quarter and the second is growing 35% year-over-year (by comparing like-for-like quarters).
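To make the two readings concrete (revenue figures below are made up for illustration):

```python
q3_2023 = 100.0  # revenue for 2023-07-01..09-30 (hypothetical)
q2_2024 = 130.0  # revenue for 2024-04-01..06-30 (hypothetical)
q3_2024 = 135.0  # revenue for 2024-07-01..09-30 (hypothetical)

qoq = q3_2024 / q2_2024 - 1  # vs the immediately preceding quarter: ~3.8%
yoy = q3_2024 / q3_2023 - 1  # vs the same quarter a year earlier: 35%
print(f"QoQ: {qoq:.1%}, YoY: {yoy:.1%}")
```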
> hard to deny that their LLM models are really, really good
I'm so scarred by how much their first Gemini releases sucked that the thought of trying it again doesn't even cross my mind.
Are you telling us you're buying this press release wholesale, or you've tried the tech they're talking about and love it, or you have some additional knowledge not immediately evident here? Because it's not clear from your comment where you are getting that their LLM models are really good.
Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability.
They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about supporting them long-term. Nice to see another option showing up. For reference, their main repo (not kidding) recommends setting up a Kubernetes cluster and a GCP bucket to submit batch requests.
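The compatibility layer amounts to pointing the openai client at Google's base URL - a minimal sketch, assuming the endpoint documented at launch is still current:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",  # a Gemini key, not an OpenAI one
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
resp = client.chat.completions.create(
    model="gemini-2.0-flash-exp",
    messages=[{"role": "user", "content": "One-sentence hello, please."}],
)
print(resp.choices[0].message.content)
```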
It's interesting that just as the LLM hype appears to be simmering down, DeepMind is making big strides. I'm more excited by this than by any of OpenAI's announcements.
Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
Regarding TPUs, sure, for the stuff that's running in the cloud.
However, their on-device TPUs lag behind the competition, and Google still seems to struggle to move significant parts of Gemini to run on-device as a result.
Of course, Gemini is provided as a subscription service as well so perhaps they’re not incentivized to move things locally.
I am curious if they’ll introduce something like Apple’s private cloud compute.
I don't think they need to win the on-device market.
We need to separate inference and training - the real winners are those who have the training compute. You can always have other companies help with inference.
> I don't think they need to win the on-device market.
The second Apple comes out with strong on-device AI - and it very much looks like they will - Google will have to respond on Android. They can't just sit and pray that e.g. Samsung makes a competitive chip for this purpose.
I think Apple is uniquely disadvantaged in the AI race, to a degree people don't realize.
They have less training data to use, having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on.
They have little to no cloud business, and while they operate a couple of services for their users, they do not have the infrastructure scale to compete with hyperscaler cloud vendors such as Google and Microsoft.
Most of what they would need to spend on training new models would require that they hand over lots of money to the very companies that already have their own models, supercharging their competition.
While there is a chance that Apple might come out with a very sophisticated on-device model, the problem is that they would only be able to compete with other on-device models. The magnitude of compute needed to keep pace with SOTA models is not achievable on a single device. It will take many generations of Apple silicon to compete with the compute of existing datacenters.
Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.
Apple is a very distant competitor in the space of AI, and I see no reason to assume this will change; they are uniquely disadvantaged by several of the choices they made on their way to mobile supremacy.
The only thing they currently have going for them is the development of their own ARM silicon which may give them the ability to compete with Google's TPU chips, but there is far more needed to be competitive here than the ability to avoid the Nvidia tax.
"having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on"
That may not be as big a disadvantage as you think.
Anthropic claim that they did not use any data from their users when they trained Claude 3.5 Sonnet.
Sure, but they certainly acquired data from mass scraping (including of data produced by their users) and/or data brokering, a.k.a. paying someone to do the same.
Yeah, I've never understood the outsized optimism for Apple's AI strategy, especially on HN.
They're a little bit less of a nobody than they used to be, but they're basically a nobody when it comes to frontier research/scaling. And the best model matters way more than on-device, which can always just be distilled later, with some random startup/chipco found to do inference.
Theory: Apple's lifestyle branding is quite important to the identity of many in the community here. I mean, look at the buy-in at launch for Apple Vision Pro by so many people on HN--it made actual Apple communities and publications look like jaded skeptics.
At what point does the on device stuff eat into their market share though? As on device gets better, who will pay for cloud compute? Other than enterprise use.
I’m not saying on device will ever truly compete at quality, but I believe it’ll be good enough that most people don’t care to pay for cloud services.
That makes no sense. Inference costs dwarf training costs pretty quickly if you have a successful product. Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.
> Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.
Stack enough GPUs and any of them can run o1. Building a chip to infer LLMs is much easier than building a training chip.
Just because one cost dwarfs another does not mean that this is where the most marginal value from developing a better chip will be, especially if other people are just doing it for you. If Google gets a good model, inference providers will be begging to be able to run it on their platform, or to just sell Google their chips - and as I said, inference chips are much easier.
Each GPU costs ~$50k. You need at least 8 of them to run mid-sized models. Then you need a server to plug those GPUs into. That's not commodity hardware.
more like ~$16k for 16 3090s. AMD chips can also run these models. The parts are expensive but there is a competitive market in processors that can do LLM inference. Less so in training.
I don’t think the AI market will ever really be a healthy one until inference vastly outnumbers training. What does it say about AI if training is done more than inference?
I agree that the on-device inference market is not important yet.
And yet these new models still haven’t reached feature parity with Google Assistant, which can turn my flashlight on, but with all the power of burning down a rainforest, Gemini still cannot interact with my actual phone.
Poor reception is rapidly becoming a non-issue for most of the developed world. I can’t think of the last time I had poor reception (in America) and wasn’t on an airplane.
As the global human population increasingly urbanizes, it’ll become increasingly easy to blanket it with cell towers. Poor(er) regions of the world will increase reception more slowly, but they’re also more likely to have devices that don’t support on-device models.
Also, Gemini Flash is basically positioned as a free model, (nearly) free API, free in GUI, free in Search Results, Free in a variety of Google products, etc. No one will be paying for it.
Many major cities have significant dead spots for coverage. It’s not just for developing areas.
Flash is free for API use at a low rate limit. Gemini as a whole is not free to Android users (free right now, with subscription costs after a time period for advanced features) and isn't free for Google to run without some monetary incentive. Hence why I originally asked about private cloud compute alternatives from Google.
Your first sentence commits a fallacy: you're attributing the whole cost of the device to a single feature and weighing that against the cost of that single feature in the cloud.
Most people are unlikely to buy the device for the AI features alone. It’s a value add to the device they’d buy anyway.
So you need the paid for option to be significantly better than the free one that comes with the device.
Your second sentence assumes the local model stays dumb. What happens when local ones get better? Again, how much better does the cloud one have to be to compete on cost?
To your last sentence: it assumes data fetching from the cloud, which is valid, but a lot of data is local too. Are people really going to pay for what Google Search gives them for free?
I think it's a more likely assumption that on-device performance will trail off-device models by a significant margin for at least the next few years - of course, if you can magically make it work locally with the same level of performance, that would be better.
Plus a lot of the "agentic" stuff is interaction with the outside world, connectivity is a must regardless.
My point is that you do NOT need the same level of performance. You need an adequate level of performance that the cost to get more performance isn’t worth it to most people.
And my point is that it's way too early to try to optimize for running locally, if performance really stabilizes and comes to a halt (which may likely happen) then it makes more sense to optimize.
Plus once you start with on device features you start limiting your development speed and flexibility.
Yes, but I can't imagine situations where I "have" to run a model at a time when I don't have internet. My life would be more affected by losing the rest of the internet than by being unable to run a small, stupid model locally - at least until hallucination is completely solved, since I need the internet to verify the models.
You’re assuming the model is purely for generation though. Several of the Gemini features are lookup of things across data available to it. A lot of that data can be local to device.
That is currently Apple’s path with Apple Intelligence for example.
No, and they haven't been for at least half a year. They're utterly optimized for by the providers. Nowadays if a model were SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.
I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.
Meanwhile, Google's newest 2.0 Flash model went 0 for 7.
Yeah they've been slow to release end-user facing stuff but it's obvious that they're just grinding away internally.
They've ceded the fast mover advantage, but with a massive installed base of Android devices, a team of experts who basically created the entire field, a huge hardware presence (that THEY own), massive legal expertise, existing content deals, and a suite of vertically integrated services, I feel like the game is theirs to lose at this point.
The only caution is regulation / anti-trust action, but with a Trump administration that seems far less likely.
Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open book true/false and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.
Anyway, I'm glad that this Google release is actually available right away! I pay for Gemini Advanced and I see "Gemini Flash 2.0" as an option in the model selector.
I've been going through Advent of Code this year, and testing each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet, Opus, Gemini Pro 1.5). Gemini has done decent, but is probably the weakest of the bunch. It failed (unexpectedly to me) on Day 10, but when I tried Flash 2.0 it got it! So at least in that one benchmark, the new Flash 2.0 edged out Pro 1.5.
I look forward to seeing how it handles upcoming problems!
I should say: Gemini Flash didn't quite get it out of the box. It actually had a syntax error in the for loop, which caused it to fail to compile, which is an unusual failure mode for these models. Maybe it was a different version of Java or something (I'm also trying to learn Java with AoC this year...). But when I gave Flash 2.0 the compilation error, it did fix it.
For the more Java proficient, can someone explain why it may have provided this code:
for (int[] current = queue.remove(0)) {
which was a compilation error for me? The corrected code it gave me afterwards was just
for (int[] current : queue) {
and with that one change the class ran and gave the right solution.
I use Claude and Gemini a lot for coding, and I've realized there is no single good or best model. Every model has its upsides and downsides. I was trying to get authentication working according to the newer Manifest V3 guidelines for browser extensions, and every model is terrible at it. It's one use case where there is not much information or proper documentation, so every model makes up stuff. But this is my experience and I don't speak for everyone.
Relatedly, I'm starting to think more and more that AI is great for mediocre stuff. If you just need to build the 1000th website, it can do that. Do you want to build a new framework? Then there will probably be fewer useful suggestions. (Still not useless though. I do like it a lot for refactoring while building xrcf.)
EDIT: One reason that led me to think it's better for mediocre stuff was seeing the Sora model generate videos. Yes, it can create semi-novel stuff through combinations of existing stuff, but it can't stick to a coherent "vision" throughout the video. It's not like a movie by a great director like Tarantino, where every detail is right and all details point to the same vision. Instead, Sora is just flailing around. I see the same in software: sometimes the suggestions go toward one style, and the next moment toward another. I guess current AI just has a much shorter effective context. Tarantino has been refining his style for 30 years now, always tuning his model toward his vision. AI, in comparison, seems to just take everything and turn it into one mediocre blob. It's not useless, but that's currently good to keep in mind, I think: you can only use it to generate mediocre stuff.
That's when having a huge context is valuable. Dump all of the new documentation into the model along with your query and the chances of success hugely increase.
This is true for all newish code bases. You need to provide the context it needs to get the problem right. It has been my experience that one or two examples with new functions or new requirements will suffice for a correction.
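Concretely, the "dump the docs in" approach is just stuffing the documentation ahead of the question - a minimal sketch with the google-generativeai package, where the file name and prompt are placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

docs = open("manifest_v3_auth_docs.md").read()  # hypothetical doc dump
resp = model.generate_content(
    f"Documentation:\n\n{docs}\n\n"
    "Using ONLY the documentation above, show how to implement "
    "auth in a Manifest V3 browser extension."
)
print(resp.text)
```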
I can't comment on why the model gave you that code, but I can tell you why it was not correct.
`queue.remove(0)` gives you a single `int[]`, which is what you were assigning to `current` - logically a single element, not an iterable. A `for` header has to be either the basic three-part form (`init; condition; update`) or the enhanced `for (Type var : iterable)` form, so `for (int[] current = queue.remove(0))` isn't valid syntax at all. If you had wanted to iterate over each item in the array, it would need to be:
```
for (int[] current : queue) {
for (int c : current) {
// ...do stuff...
}
}
```
Alternatively, if you wanted to iterate over each element in the queue and treat the int array as a single element, the revised solution is the correct one.
The Gemini 2 models support native audio and image generation, but the latter won't be generally available till January. Really excited for that, as well as 4o's image generation (whenever that comes out). Steerability has lagged behind aesthetics in image generation for a while now, and it'd be great to see a big advance in that.
Also a whole lot of computer vision tasks (via LLMs) could be unlocked with this. Think Inpainting, Style Transfer, Text Editing in the wild, Segmentation, Edge detection etc
Maybe some of these tasks are arguably not aligned with the traditional applications of CV, but Segmentation and Edge detection are definitely computer vision in every definition I've come across - before and after NNs took over.
OT: I’m not entirely sure why, but "agentic" sets my teeth on edge. I don't mind the concept, but the word itself has that hollow, buzzwordy flavor I associate with overblown LinkedIn jargon, particularly as it is not actually in the dictionary...unlike perfectly serviceable entries such as "versatile", "multifaceted" or "autonomous"
To play devil's advocate, the correct use of the word would be when multiple AIs are coordinating and handing off tasks to each other with limited context, such that the handoffs are dynamically decided at runtime by the AI, not by any routine code. I have yet to see a single example where this is required. Most problems can be solved with static workflows and simple rule based code. As such, I do believe that >95% of the usage of the word is marketing nonsense.
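To make that distinction concrete, here's a minimal sketch (the agent names and the `call_llm` stub are hypothetical): in the static version the code fixes the sequence, while in the dynamic version the model's own output picks the next hop.

```python
def call_llm(agent: str, task: str) -> dict:
    # Stand-in for a real model call; a real version would prompt the named
    # agent and parse its routing decision out of the response.
    return {"result": f"{agent} handled: {task}", "next_agent": "done"}

def static_workflow(task: str) -> str:
    # Handoffs fixed at author time: plain rule-based code, no "agency".
    draft = call_llm("researcher", task)
    return call_llm("writer", draft["result"])["result"]

def dynamic_workflow(task: str) -> str:
    # The model names the next agent itself; routing is decided at runtime.
    agent, result = "researcher", task
    while agent != "done":
        out = call_llm(agent, result)
        agent, result = out["next_agent"], out["result"]
    return result
```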
I think this sort of usage is already happening, but perhaps in the internal details or uninteresting parts, such as content moderation. Most good LLM products are in fact using many LLM calls under the hood, and I would expect that results from one are influencing which others get used.
I actually have built such a tool (two AIs, each with different capabilities), but still cringe at calling it agentic. Might just be an instinctive reflex.
I'm personally very glad that the word has adhered itself to a bunch of AI stuff, because people had started talking about "living more agentically" which I found much more aggravating. Now if anyone states that out loud you immediately picture them walking into doors and misunderstanding simple questions, so it will hopefully die out.
That's not necessarily a good thing, because those existing words are overloaded while novel jargon is specific.
We use new words so often that we take it for granted. You've passively picked up dozens of new words over the last 5 or 10 years without questioning them.
No, we need a scientific understanding of autonomous intelligent decision-making. The problem with “agentic AI” is the same old “Artificial Intelligence, Natural Stupidity” problem: we have no clue what “reasoning” or “intelligence” or “autonomous” actually means in animals, and trying to apply these terms to AI without understanding them (or inventing a new term without nailing down the underlying concept) is doomed to fail.
We need something to describe a behavioral element in business processes. Something goes into it, something comes out of it - though in this case nondeterminism is involved and it may not be concrete outputs so much as further actioning.
This is what other replies are missing - I've been following AI closely since GPT 2 and it's not immediately clear what agentic means, so to other people, the term must be even less clear. Using the word autonomous can't be worse than agentic imo.
I like that https://artificialanalysis.ai/leaderboards/models describes both quality and speed (tokens/s and first chunk s). Not sure how accurate it is; anyone know? Speed and variance of it in particular seems difficult to pin down because providers obviously vary it with load to control their costs.
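FWIW the two numbers are easy to sanity-check yourself against any streaming API - a rough sketch, where `stream_tokens` is a placeholder for whatever streaming client you use:

```python
import time

def measure(stream_tokens, prompt):
    start = time.monotonic()
    first = None
    n = 0
    for _tok in stream_tokens(prompt):
        if first is None:
            first = time.monotonic() - start  # "first chunk" latency (s)
        n += 1
    total = time.monotonic() - start
    return first, n / max(total - first, 1e-9)  # tokens/s after first chunk
```

Running that repeatedly at different times of day would also expose the load-dependent variance you mention.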
Leaderboards are not that useful for measuring real-life effectiveness of the models, at least in my day-to-day usage.
I am currently struggling to diagnose an IPv6 misconfiguration in my enormous AWS CloudFormation YAML code. I gave the same input to Claude Opus, Gemini, and ChatGPT (o1 and 4o).
4o was the worst: verbose and a waste of my time.
Claude completely went off on a tangent and began recommending fixes for IPv4, while I specifically asked about IPv6 issues.
o1 made a suggestion which I tried out and it fixed it. It literally found a needle in the haystack. The solution is working well now.
Gemini made a suggestion which almost got it right but it was not a full solution.
I must clarify that diagnosing network issues on AWS VPC is not my expertise, and I use the LLMs to supplement my knowledge.
With the accelerating progress, the "be ready to look again" is becoming a full time job that we need to be able to delegate in some way, and I haven't found anything better than benchmarks, leaderboards and reviews.
FWIW I've found the "coding" category of the leaderboard to be reasonably accurate. Claude was the best, then o1-mini was typically stronger, and now Gemini Exp 1206 is at the top.
I find myself just paying a la carte via the API rather than paying the $20/mo so I can switch between the models.
poe.com has a decent model where you buy credits and spend them talking to any LLM which makes it nice to swap between them even during the same conversation instead of paying for multiple subscriptions.
Though gpt-4o could say "David Mayer" on poe.com but not on chat.openai.com which makes me wonder if they sometimes cheat and sneak in different models.
It's easier for consultants and sales people to sell to enterprise if the terminology is familiar but mysterious.
Bad
1. installed Antivirus software
2. added screen-size CSS rules
3. copied 'Assets' harddrive to DropBox
4. edited homepage to include Bitcoin wallet address link
5. upgraded to ChatGPT Pro
The beauty of LLMs isn't just that these coding objects speak human vernacular; it's that they can be concatenated with human-vernacular prompts, and the result can sensibly serve as an input, command, or output without necessarily causing errors, even for input combinations that weren't preprogrammed.
I have an A.I. textbook with agent terminology that was written in pre-LLM days. Agents are just autonomous-ish code that loops on itself with some extra functionality. LLMs, in their elegance, can more easily self-loop out of the box just by concatenating language prompts sensibly. They are almost agent-ready out of the box by this very elegant quality (the textbook's agentic diagram is just a conceptual self-perpetuation loop), except...
Except they fail at a lot, or get stuck at hiccups. But here is a novel thought: what if an LLM becomes more agentic (i.e., more able to sustain autonomous prompt chains that take actions without a terminal failure) and less copilot-like, not through more complex wrapper code controlling the self-perpetuation, but by training the core LLM itself to function more fluidly in agentic scenarios?
A better agentically performing LLM that isn't mislabeled with a bad buzzword might reveal itself not in its wrapper control code but simply by performing better in a typical agentic loop or environment, with whatever initiating prompt, control wrapper code, or pipeline kicks off its self-perpetuation cycle.
I've been using gemini-exp-1206 and I notice a lot of similarities to the new gemini-2.0-flash-exp: they're not actually that much smarter, but they go out of their way to convince you they are with overly verbose "reasoning" and explanations. The reasoning and explanations aren't necessarily wrong per se, but put them aside and focus on the actual logical steps and conclusions to your prompts, and it's still very much a dumb model.
The models do just fine on "work" but are terrible at "thinking". The verbosity of the explanations (and the sheer amount of praise the models like to give the prompter - I've never had my rear end kissed so much!) should make one wary of any subjective reviews of their performance, as opposed to objective reviews focusing solely on correct/incorrect.
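That suggests scoring only exact answers and ignoring the prose entirely - a tiny sketch of the discipline, with a hypothetical `ask_model` callable:

```python
cases = [
    ("What is 17 * 23?", "391"),
    ("What is the capital of Australia?", "canberra"),
]

def accuracy(ask_model) -> float:
    correct = 0
    for question, expected in cases:
        answer = ask_model(f"{question} Reply with the answer only.")
        correct += answer.strip().lower() == expected  # praise doesn't score
    return correct / len(cases)
```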
Windows Phone was actually great though, and would've eventually been a major player in the space if Microsoft were stubborn enough to stick with it long enough, like they did with the Xbox.
By his own admission, Gates was extremely distracted at the time by the antitrust cases in Europe, and he let the initiative die.
The Web is dead. It's pretty clear future web pages, if we call them that, will be assembled on the fly by AI based on your user profile and declared goals, interests, and requests.
> We're also launching a new feature called Deep Research, which uses advanced reasoning and long context capabilities to act as a research assistant, exploring complex topics and compiling reports on your behalf. It's available in Gemini Advanced today.
Anyone seeing this? I don't have an option in my dropdown.
Anecdotally, using the Gemini app with "Gemini Advanced 2.0 Flash Experimental", the response quality is noticeably improved and it's faster at some basic Python and C# generation.
Based on initial interactions, it's extremely verbose. It seems to be focused on explaining its reasoning, but even after just a few interactions I have seen some surprising hallucinations. For example, to assess current understanding of AI, I mentioned "Why hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded with text that included "Why haven't they released Claude 3.5 Sonnet First? That's an interesting point." There's clearly some reflection/attempted reasoning happening, but it doesn't feel competitive with o1 or the new Claude 3.5 Sonnet that was trained on 3.5 Opus output.
Was this written by an LLM? It's pretty bad copy. Maybe they laid off their copywriting team...?
> "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones"
and
> "We’re getting 2.0 into the hands of developers and trusted testers today. And we’re working quickly to get it into our products, leading with Gemini and Search. Starting today our Gemini 2.0 Flash experimental model will be available to all Gemini users."
That's not really any better, since "all of our products" already includes the subset that has at least 2B users. "I brought all my shoes, including all my red shoes."
Again, that's covered by "all our products". Why do we need to be reminded that Google has a lot of users? Someone oblivious to that isn't going to care about this press release.
Scale and cost are defining considerations of LLMs. By saying they're rolling out to billions of users, they're pointing out they're doing something pretty unique and have confidence in a major competitive advantage. Point billions of devices at other high-performing competitors' offerings, and all of them would fall over.
That phrasing still sucks, I am neither a native speaker nor a wordsmith but I've worked with professional English writers who could make that look and sound infinitely better.
We're definitely going to need better benchmarks for agentic tasks, and not just code reasoning: things that are needlessly painful and that humans go through all the time.
Does anyone have any insights into how Google selects source material for AI overviews? I run an educational site with lots of excellent information, but it seems to have been passed over entirely for AI overviews. With these becoming an increasingly large part of search--and from the sound of it, now more so with Gemini 2.0--this has me a little worried.
Anyone else run into similar issues or have any tips?
I've been using Gemini Flash for free through the API using Cline for VS Code. I switch between Claude and Gemini Flash, using Claude for more complicated tasks. Hope that the 2.0 model comes closer to Claude for coding.
Agreed - tried some sample prompts on our data, and the rough vibe check is that Flash is now as good as the old Pro. If they keep pricing the same, this would be really promising.
Their Mariner tool for controlling the browser sounds scary and exciting. At the moment it's an extension, which means JavaScript. Some web sites block automation that happens this way, so developers resort to tools such as Selenium, which use the Chrome DevTools API to automate the browser. That's better, but can still be distinguished from normal use by very technical details. I wonder if Google, who still own Chrome, will give extensions better APIs for automation that cannot be distinguished from normal use.
Did anyone get to play with the native image generation part? In my experience, Imagen 3 was much better than the competition so I'm curious to hear people's take on this one.
At least when it comes to Go code, I'm pretty impressed by the results so far. It's also pretty good at following directions, which is a problem I have with open source models, and seems to use or handle Claude's XML output very well.
Overall, especially seeing as I haven't paid a dime to use the API yet, I'm pretty impressed.
Does anyone know how to sign up for the speech output waitlist or tester program? I have a decent spend with GCP over the years, if that helps at all. Really want DemoTime videos to use those voices. (I like how truly incredible, best-in-the-world TTS is like a footnote in this larger announcement.)
Anyone else annoyed how the ML/AI community just adopted the word "reasoning" when it seems like it is being used very out of context when looking at what the model actually does?
Static computation is not reasoning (these models are not building up an argument from premises; they are merely finding statistically likely completions). Computational thinking/reasoning would be breaking down a problem into algorithmic steps. The model is doing neither. I wouldn't confuse this with the fact that it can break a problem into steps if you ask it, because again that is just regurgitation. It's not going through that process without your prompt. That is not part of its process for arriving at an answer.
I kinda agree with you but I can also see why it isn't that far from "reasoning" in the sense humans do it.
To wit, if I am doing a high school geometry proof, I come up with a sequence of steps. If the proof is correct, each step follows logically from the one before it.
However, when I go from step 2 to step 3, there are multiple options for step 3 I could have chosen. Is that so different from a "most likely prediction" an LLM makes? I suppose the difference is humans can filter out logically incorrect steps, or prune chains of steps that won't lead to the actual theorem quicker. But an LLM predictor coupled with a verifier doesn't feel that different.
The point is emergent capabilities in LLMs go beyond statistical extrapolation, as they demonstrate reasoning by combining learned patterns.
When asked, “If Alice has 3 apples and gives 2 to Bob, how many does she have left?”, the model doesn’t just retrieve a memorized answer—it infers the logical steps (subtracting 2 from 3) to generate the correct result, showcasing reasoning built on the interplay of its scale and architecture rather than explicit data recall.
Does it help to explore the annoyance using gap analysis? I think of it as heuristics. As with humans, it's the pragmatic "whatever seems to work", where "seems" is determined via training. It's neither reasoning from first principles (System 2) nor just selecting the most likely/prevalent answer (System 1). And chaining heuristics doesn't make it reasoning, either. But where there's evidence that it's working from a model, then it becomes interesting, and begins to comply with classical epistemology wrt "reasoning". Unfortunately, information theory seems to treat any compression as a model, leading to some pretty subtle delusions.
Also no pricing is live yet. OpenAI's audio inputs/outputs are too expensive to really put in production, so hopefully Gemini will be cheaper. (Not to mention, OAI's doesn't follow instructions very well.)
The Multimodal Live API is free while the model/API is in preview. My guess is that they will be pretty aggressive with pricing when it's in GA, given the 1.5 Flash multimodal pricing.
If you're interested in this stuff, here's a full chat app for the new Gemini 2 APIs with text, audio, image, camera video, and screen video. It shows how to use both the WebSocket API and how to route through WebRTC infrastructure.
Small models don't "know" as much so they hallucinate more. They are better suited for generations that are based in a ground truth, like in a RAG setup.
A better comparison might be Flash 2.0 vs 4o-mini. Even then, the models aren't meant to have vast world knowledge, so benchmarking them on that isn't a great indicator of how they would be used in real-world cases.
Yes, it's not an apples to apples comparison. My point is the position it's at on the lmarena leaderboard is misplaced due to the hallucination issues.
I'm quite impressed with the Flash demo's reasoning capabilities. I did the 20-questions game with it, and it found the computer mouse I had in my head. At first it was confused about our roles and said something weird: it thought that it had to guess its own word. Afterwards I had a meta conversation about that weirdness, and it gave impressive insights:
"
Why My "I Guess What I Have in Mind" Statement Doesn't Make Sense
The Problem of Agency: You're spot-on when you say it doesn't make sense for me to state "I guess what I (the computer) have in mind." AI does not have a "mind" in the same way that humans do. My internal state is a complex pattern of numerical data, not conscious thoughts or intentions in the way you have things in mind. My phrasing gave the illusion of having a "thing in mind," but in reality, I don't have this capacity.
Misinterpretation of the Task: My statement "I guess what I have in mind" misinterprets the task. You were clearly signaling that I needed to ask questions to uncover what you were thinking of. My failure to grasp this created a weird scenario where it seemed like I was trying to determine my own data set!"
Unfortunately the 10 RPM quota for this experimental model isn't enough to run an actual agentic experience on.
That's my main issue with Google: there are several models we want to try with our agent, but the quota is limited and we have to jump through hoops to see if we can get it raised.
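Until quotas loosen up, about the best you can do client-side is pace and retry - a sketch, with `call_model` standing in for the real API call:

```python
import random
import time

MIN_INTERVAL = 60 / 10  # 10 RPM -> at least 6 s between requests
_last = 0.0

def rate_limited_call(call_model, prompt, max_retries=5):
    global _last
    for attempt in range(max_retries):
        wait = MIN_INTERVAL - (time.monotonic() - _last)
        if wait > 0:
            time.sleep(wait)
        _last = time.monotonic()
        try:
            return call_model(prompt)
        except Exception:  # narrow to the SDK's rate-limit error in practice
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still rate-limited after retries")
```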
I think they are really overloading the word "agent". I know there isn't a standard definition, but I think Google is stretching the meaning way thinner than what most C-suite-level execs mean when they talk about agents.
I think DeepMind could make progress if they focused on the agent definition of multi-step reasoning + action through a web browser, and delivered a ton of value, instead of lumping in the seldom-used "look at the world through a camera" or "multi-modal robots" thing.
If Google cracked robots, past plays show that the market for those isn't big enough to interest Google. Like VR, you just can't get a billion people interested in robots - so even if they make progress, it won't survive under Google.
The "Look at the world through a camera" thing is a footnote in an Android release.
Agentic computer use _is_ a product a billion people would use, and it's adjacent to the business interests of Google Search.
"Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision."
"With your supervision". Thus avoiding Google being held responsible. That's like Teslas Fake Self Driving, where the user must have their hands on the wheel at all times.
Claude MCP does the same thing; it's the setup that is hard. It will do push, pull, and create-branch automatically from a single prompt. $500 a month for Devin could be worth it if you want it taken care of, plus use of the models for a team, but a single person can set this up themselves.
Yes, LMArena shows Gemini-2.0-Flash-Exp ranking 3rd right now, after Gemini-Exp-1206 and ChatGPT-4o-latest_(2024-11-20), and ahead of o1-preview and o1-mini:
The best thing about Gemini models is the huge context window: you can just throw big documents at them and find stuff really fast, rather than struggling with cut-offs in Perplexity or Claude.
I work with LLMs and MLLMs all day (as part of my work on JoyCaption, an open source VLM). Specifically, I spend a lot of time interacting with multiple models at the same time, so I get the chance to very frequently compare models head-to-head on real tasks.
I'll give Flash 2 a try soon, but I gotta say that Google has been doing a great job catching up with Gemini. Both Gemini 1.5 Pro 002 and Flash 1.5 can trade blows with 4o, and are easily ahead of the vast majority of other major models (Mistral Large, Qwen, Llama, etc). Claude is usually better, but has a major flaw (to be discussed later).
So, here's my current rankings. I base my rankings on my work, not on benchmarks. I think benchmarks are important and they'll get better in time, but most benchmarks for LLMs and MLLMs are quite bad.
1) 4o and its ilk are far and away the best in terms of accuracy, both for textual tasks as well as vision related tasks. Absolutely nothing comes even close to 4o for vision related tasks. The biggest failing of 4o is that it has the worst instruction following of commercial LLMs, and that instruction following gets _even_ worse when an image is involved. A prime example is when I ask 4o to help edit some text, to change certain words, verbiage, etc. No matter how I prompt it, it will often completely re-write the input text to its own style of speaking. It's a really weird failing. It's like their RLHF tuning is hyper focused on keeping it aligned with the "character" of 4o to the point that it injects that character into all its outputs no matter what the user or system instructions state. o1 is a MASSIVE improvement in this regard, and is also really good at inferring things so I don't have to explicitly instruct it on every little detail. I haven't found o1-pro overly useful yet. o1 is basically my daily driver outside of work, even for mundane questions, because it's just better across the board and the speed penalty is negligible. One particular example of o1 being better I encountered yesterday: I had it re-wording an image description, and thought it had introduced a detail that wasn't in the original description. Well, I was wrong and had accidentally skimmed over that detail in the original. It _told_ me I was wrong, and didn't update the description! Freaky, but really incredible. 4o never corrects me when I give it an explicit instruction.
4o is fairly easy to jailbreak. They've been turning the screws for a while, so it isn't as easy as day 1, but even o1-pro can be jailbroken.
2) Gemini 1.5 Pro 002 (specifically 002) is second best in my books. I'd guesstimate it at being about 80% as good as 4o on most tasks, including vision. But it's _significantly_ better at instruction following. Its RLHF is a lot lighter than the ChatGPT models', so it's easier to get these models to fall back to pretraining, which is really helpful for my work specifically. But in general the Gemini models have come a long way. The ability to turn off model censorship is quite nice, though it does still refuse at times. The Flash variation is interesting; oftentimes on par with Pro, with Pro edging it out maybe 30% of the time. I don't frequently use Flash, but it's an impressive model for its size. (Side note: The Gemma models are ... not good. Google's other public models, like so400m and OWLv2, are great, so it's a shame their open LLM forays are falling behind). Google also has the best AI playground.
Jailbreaking Gemini is a piece of cake.
3) Claude is third on my list. It has the _best_ instruction following of all the models, even slightly better than o1. Though it often requires multi-turn to get it to fully follow instructions, which is annoying. Its overall prowess as an LLM is somewhere between 4o and Gemini. Vision is about the same as Gemini, except for knowledge based queries, which Gemini tends to be quite bad at (who is this person? Where is this? What brand of guitar? etc). But Claude's biggest flaw is the insane "safety" training it underwent, which makes it practically useless. I get false triggers _all_ the time from Claude. And that's to say nothing of how unethical their "ethics" system is to begin with. And what's funny is that Claude is an order of magnitude _smarter_ when it's reasoning about its safety training. It's the only real semblance of reason I've seen from LLMs ... all just to deny my requests.
I've put Claude third out of respect for the _technical_ achievements of the product, but I think the developers need to take a long look in the mirror and ask why they think it's okay for _them_ to decide what people with disabilities are and are not allowed to have access to.
4) Llama 3. What a solid model. It's the best open LLM, hands down. Nowhere near the commercial models above, but for a model that's completely free to use locally? That's invaluable. Their vision variation is ... not worth using. But I think it'll get better with time. The 8B variation far outperforms its weight class. 70B is a respectable model, with better instruction following than 4o. The ability to finetune these models to a task with so little data is a huge plus. I've made task specific models with 200-400 examples.
5) Mistral Large (I forget the specific version of their latest release). I love Mistral as the "under-dog". Their models aren't bad, and behave _very_ differently from all other models out there, which I appreciate. But Mistral never puts any effort into polishing their models; they always come out of the oven half-baked. Which means they frequently glitch out, have very inconsistent behavior, etc. Accuracy and quality are hard to assess because of this inconsistency. On its best days it's up near Gemini, which is quite incredible considering the models are also released publicly. So theoretically you could finetune them to your task and get a commercial grade model to run locally. But I rarely see anyone do that with Mistral, I think partly because of their weird license. Overall, I like seeing them in the race and hope they get better, but I wouldn't use it for anything serious.
Mistral is lightly censored, but fairly easy to jailbreak.
6) Qwen 2 (or 2.5 or whatever the current version is these days). It's an okay model. I've heard a lot of praise for it, but in all my uses thus far it's always been really inconsistent, glitchy, and weak. I've used it both locally and through APIs. I guess in _theory_ it's a good model, based on benchmarks. And it's open, which I appreciate. But I've not found any practical use for it. I even tried finetuning with Qwen 2VL 72B, and my tiny 8B JoyCaption model beat it handily.
That's about the sum of it. AFAIK that's all the major commercial and open models (my focus is mainly on MLLMs). OpenAI are still leading the pack in my experience. I'm glad to see good competition coming from Google finally. I hope Mistral can polish their models and be a real contender.
There are a couple smaller contenders out there like Pixmo/etc from allenai. Allen AI has hands down the _best_ public VQA dataset I've seen, so huge props to them there. Pixmo is ... okayish. I tried Amazon's models a little but didn't see anything useful.
NOTE: I refuse to use Grok models for the obvious reasons, so fucks to be them.
It is interesting to see that they keep focusing on the cheapest model instead of the frontier model. Probably because of their primary (internal?) customers' needs?
It's cheaper and faster to train a small model, which is better for a research team to iterate on, right? If Google decides that a particular small model is really good, why wouldn't they go ahead and release it while they work on scaling up that work to train the larger versions of the model?
I have no knowledge of Google specific cases, but in many teams smaller models are trained upon bigger frontier models through distillation. So the frontier models come first then smaller models later.
Training a "frontier model" without testing the architecture is very risky.
Meta trained the smaller Llama 3 models first, and then trained the 405B model on the same architecture once it had been validated on the smaller ones. Later, they went back and used that 405B model to improve the smaller models for the Llama 3.1 release. Mistral started with a number of small models before scaling up to larger models.
I feel like this is a fairly common pattern.
If Google had a bigger version of Gemini 2.0 ready to go, I feel confident they would have mentioned it, and it would be difficult to distill it down to a small model if it wasn't ready to go.
The problem is that the last generation of the largest models failed to beat smaller models on the benchmarks - see the lack of a new Claude Opus or GPT-5. The problem is probably in the benchmarks, but anyway.
Considering so many of us would like more VRAM than NVIDIA is giving us for home compute, is there any future where these Trillium TPUs become commodity hardware?
Power concerns aside, individual chips in a TPU pod don't actually have a ton of VRAM; they rely on fast interconnects between a lot of chips to aggregate memory, and then on pipeline/tensor parallelism. It doesn't make sense to try to sell the hardware - it's operationally expensive. By keeping it in house, Google only has to support the OS/hardware in their datacenter, and they can and do commercialize through hosted services.
Why do you want the hardware vs just using it in the cloud? If you're training huge models you probably don't also keep all your data on prem, but on GCS or S3 right? It'd be more efficient to use training resources close to your data. I guess inference on huge models? Still isn't just using a hosted API simpler / what everyone is doing now?
Reminder that implied models are not actual models. Models have failed to materialize repeatedly and vanished without further mention. I assume no one is trying to be misleading, but at this point, maybe overly optimistic.
Speed looks good vis-a-vis 4o-mini, and quality looks good so far against my eval set. If it's cheaper than 4o-mini too (which it probably will be?), then OpenAI has a real problem, because switching between them is just a value in a config file.
Is this what it feels like to become one of the gray bearded engineers? This sounds like a bunch of intentionally confusing marketing drivel.
When capitalism has pilfered everything from the pockets of working people so people are constantly stressed over healthcare and groceries, and there's little left to further the pockets of plutocrats, the only marketing that makes sense is to appeal to other companies in order to raid their coffers by tricking their Directors to buy a nonsensical product.
Is that what they mean by "agentic era"? Cause that's what it sounds like to me. Also smells a lot like press-release-driven development, where the point is to put a feather in the cap of whatever poor Google engineer is chasing their next promotion.
> Is that what they mean by "agentic era"? Cause that's what it sounds like to me.
What are you basing your opinion on? I have no idea how well these LLM agents will perform, but it's definitely a thing. OpenAI is working on them, as are Anthropic (Claude) and certainly Google.
Yeah, it's a lot of marketing fluff, but these tools are genuinely useful, and it's no wonder Google is working hard to prevent them from destroying its search-dependent bottom line.
Marketing aside, agents are just LLMs that can reach out of their regular chat bubbles and use tools. Seems like just the next logical evolution
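A minimal sketch of that idea; `call_llm` here is a stand-in for any chat-completion API that can either answer directly or name a tool to call:
```
import json
import subprocess

# One illustrative tool the model is allowed to call.
def run_shell(cmd: str) -> str:
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

TOOLS = {"run_shell": run_shell}

def agent(task: str, call_llm, max_steps: int = 10) -> str:
    # call_llm is assumed to return either {"answer": ...} or
    # {"tool": ..., "args": {...}}, with the choice made by the model.
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(history)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])
        history.append({"role": "tool", "content": json.dumps({"result": result})})
    return "step limit reached without an answer"
```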
Side note on Gemini: I pay for Google Workspace simply to enable e-mail capability for a custom domain.
I never used the web interface to access email until recently. To my surprise, all of the AI shit is enabled by default. So it’s very likely Gemini has been training on private data without my explicit consent.
Of course G words it as “personalizing” the experience for me but it’s such a load of shit. I’m tired of these companies stealing our data and never getting rightly compensated.
Gmail is hosting your email. Being able to change the domain doesn't change that they're hosting it on their terms. I think there are other email providers that have more privacy-focused policies.
Gemini in search is answering so many of my search questions wrong.
If I ask natural language yes/no questions, Gemini sometimes tells me outright lies with confidence.
It also presents information as authoritative - locations, science facts, corporate ownership, geography - even when it's pure hallucination.
Right at the top of Google search.
edit:
I can't find the most obnoxious offending queries, but here was one I performed today: "how many islands does georgia have?".
Compare that with "how many islands does georgia have? Skidaway Island".
This is an extremely mild case, but I've seen some wildly wrong results, where Google has claimed companies were founded in the wrong states, that towns were located in the wrong states, etc.
At first, this was true, but now it has gotten pretty good. The times it gets things wrong are often not the model's fault, just Google Search's fault.
"GVP stands for Good Pharmacovigilance Practice, which is a set of guidelines for monitoring the safety of drugs. SVP stands for Senior Vice President, which is a role in a company that focuses on a specific area of operations."
Seems like a lot of pharma regulation in my telecom company.
I am sure Google has the resources to compete in this space. What I'm less sure about is whether Google can monetize AI in a way that doesn't cannibalize their advertising income.
Who the hell wants an AI that has the personality of a car salesman?
No mention of Perplexity yet in the comments but it's obvious to me that they're targeting Perplexity Pro directly with their new Deep Research feature (https://blog.google/products/gemini/google-gemini-deep-resea...). I still wonder why Perplexity is worth $7 billion when the 800-pound gorilla is pounding on their door (albeit slowly).
Just tried Deep Research. It's a much, much slower experience than Perplexity at the moment, taking its sweet time (many minutes) to return a result. Maybe it's more extensive, but I use Perplexity a lot for quick information summaries, and this is a very different UX.
Haven't used it enough to evaluate the quality, however.
"Slow Perplexity" was something I was pretty set on building, before dropping it for a different project that got some traction.
Perplexity is a much less versatile product than it has to be, in the chase for speed: you can only chew through so many tokens, do so much CoT, etc. in a given amount of time.
They optimized for virality (it's just as fast as Google but gives me more info!) but I suspect it kills the stickiness for a huge number of users since you end up with some "embarrassing misses": stuff that should have been a slam dunk, goes off the rails due to not enough search, or the wrong context being surfaced from the page, etc. and the user just doesn't see value in it anymore.
I know this isn't really a useful comment, but, I'm still sour about the name they chose. They MUST have known about the Gemini protocol. I'm tempted to think it was intentional, even.
It's like Microsoft creating an AI tool and calling it Peertube. "Hurr durr they couldn't possibly be confused; one is a decentralised video platform and the other is an AI tool hurr durr. And ours is already more popular if you 'bing' it hurr durr."
I released a new llm-gemini plugin with support for the Gemini 2.0 Flash model; here's how to use that in the terminal:
LLM installation: https://llm.datasette.io/en/stable/setup.html
Worth noting that the Gemini models have the ability to write and then execute Python code. I tried that like this:
Here's the result: https://gist.github.com/simonw/0d8225d62e8d87ce843fde471d143...
It can't make outbound network calls though, so this fails:
Amusingly Gemini itself doesn't know that it can't make network calls, so it tries several different approaches before giving up: https://gist.github.com/simonw/2ccfdc68290b5ced24e5e0909563c...
The new model seems very good at vision:
I got back a solid description, see here: https://gist.github.com/simonw/32172b6f8bcf8e55e489f10979f8f...
Published some more detailed notes on my explorations of Gemini 2.0 here: https://simonwillison.net/2024/Dec/11/gemini-2/
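For anyone who prefers llm's Python API over the CLI, a rough sketch; this assumes the llm-gemini plugin is installed and a Gemini key has been configured, and the image filename is made up:
```
import llm

model = llm.get_model("gemini-2.0-flash-exp")

# Plain prompt
print(model.prompt("Tell me a fun fact about pelicans").text())

# Vision: attach an image (filename is illustrative)
response = model.prompt(
    "Describe this photo in detail",
    attachments=[llm.Attachment(path="pelicans.jpg")],
)
print(response.text())
```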
Question: Have you tried using this for video?
Alternately, if I wanted to pipe a bunch of screencaps into it and get one grand response, how would I do that?
e.g. "Does the user perform a thumbs up gesture in any of these stills?"
[edit: also, do you know the vision pricing? I couldn't find it easily]
Previous Gemini models worked really well for video, and this one can even handle streaming video: https://simonwillison.net/2024/Dec/11/gemini-2/#the-streamin...
Wow this is amazing. It just gave me critique on my bodyweight squat form.
But I also found it hard to prompt to tutor in French or Portuguese; the accents were gruesomely bad.
Code execution is okay, but soon runs into the problem of missing packages that it can't install.
Practically, sandboxing hasn't been super important for me. Running claude with mcp based shell access has been working fine for me, as long as you instruct it to use venv, temporary directory, etc.
Can it run ipython? Then you could use ipython magic to pip install things:
https://ipython.readthedocs.io/en/stable/interactive/magics....
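A sketch of what that would look like inside an IPython session; the %pip magic is standard IPython, and whether Gemini's sandbox permits it is exactly the open question:
```
# %pip installs into the environment backing the running kernel.
%pip install requests

import requests  # available after the install, in a normal IPython session
print(requests.get("https://example.com").status_code)
```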
Is there a guide on how to do that?
For building an mcp server? The official docs do a great job:
https://modelcontextprotocol.io/introduction
My own mcp server could be an inspiration, on Mac. It's based on pexpect to enable a REPL session and has some tricks to prevent bad commands.
https://github.com/rusiaaman/wcgw
However, I recommend creating one with your own customised prompts and tools for maximum benefit.
I wrote a program that can do more or less the same thing, if you only care about the LLM running commands to help you do something:
https://github.com/skorokithakis/sysaidmin
> Some pelicans have white on their heads, suggesting that some of them are older birds.
Interesting theory!
Brown Pelican (Pelecanus occidentalis) heads are white in the breeding season. Birds start breeding aged three to five. So technically the statement is correct but I wonder if Gemini didn't get its pelicans and cormorants in a muddle. The mainland European Great Cormorant (Phalacrocorax carbo sinensis) has a head that gets progressively whiter as birds age.
Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction.
But, once they do get moving in the right direction the can achieve things that smaller companies can't. Google has an insane amount of talent in this space, and seems to be getting the right results from that now.
Remains to be seen how well they will be able to productize and market, but hard to deny that their LLM models aren't really, really good though.
> Remains to be seen how well they will be able to productize and market
The challenge is trust.
Google is one of the leaders in AI and are home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.
It's hard to justify committing developers and money to a product when there's a good chance you'll just have to pivot again once they get bored. Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.
> they also have an incredibly bad track record of supporting their products
Incredibly bad track record of supporting products that don't grow. I'm not saying this to defend Google, I'm still (perhaps unreasonably) angry because of Reader, it's just that there is a pattern and AI isn't likely to fit that for a long while.
I’m sad for reader but it was a somewhat niche product. Inbox I can’t forgive. It was insanely good and was killed because it was a threat to Gmail.
My main issue with Google is that internal politic affects users all the time. See the debacle of anything built on top of Android and being treated as a second citizen.
You can’t trust a company which can’t shield users from its internal politics. It means nothing is aligned correctly for users to be taken seriously.
Why would they grow if they don't vocally support them? Launch and hope for the best does not work; it's not the wild west on the Internet any more.
Yeah, either AI is significant, in which case Google isn't going to kill it. Or AI is a bubble, in any of the alternatives one might pick can easily crash and die long before Google ends of life anything.
This isn't some minor consumer play, like a random tablet or Stadia. Anyone who has paying attention would have noticed that AI has been an important, consistent, long term strategic interest of Google's for a very long time. They've been killing off the fail/minor products to invest in this.
not going to miss the opportunity to upvote on the grief of having lost Reader
Yes. Imagine Google banning your entire Google account / Gmail because you violated their gray area AI terms ([1] or [2]). Or, one of your users did via an app you made using an API key and their models.
With that being said, I am extremely bullish on Google AI for a long time. I imagine they land at being the best and cheapest for the foreseeable future.
[1] https://policies.google.com/terms/generative-ai
[2] https://policies.google.com/terms/generative-ai/use-policy
For me that is a reason for not touching anything from Google for building stuff. I can afford lossing my Amazon account, but Google's one would be too much. At least they should be clear in their terms that getting banned at cloud doesn't mean getting banned from Gmail/Docs/Photos...
why not just make a business / project account?
That won't help. Their TOS and policies are vague enough that they can terminate all accounts you own (under "Use of multiple accounts for abuse" for instance).
To be fair, I believe this is reserved for things like fighting fraud.
It has happened a few times that people who had a Google Play app banned found their personal account banned as well.
https://www.xda-developers.com/google-developer-account-ban-...
We do run some non-customer facing assets in Google Cloud. But the website and API are on AWS.
Putting your trust in Google is a fool's errand. I don't know anyone that doesn't have a story.
>Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.
Eh... I don't know about that. Their tech graveyard isn't as populous as Google's, but it's hardly empty. A few that come to mind: ATL, MFC, Silverlight, UWP.
Besides Silverlight (which was supported all the way until the end of 2021!), you can still not only run but _write new applications_ using all of the listed technologies.
That doesn't constitute support when it comes to development platforms. They've not received any updates in years or decades. What Microsoft has done is simply not remove the build capability from the toolchains; that is, not even the work that would be required to stop supporting them. Compare that to C#, which has evolved rapidly over the same time period.
That's different from "killing" the product / technology, which is what Google does.
Only because they operate different businesses. Google is primarily a service provider. They have few software products that are not designed to integrate with their servers. Many of Microsoft's businesses work fundamentally differently. There's nothing Microsoft could do to Windows to disable all MFC applications and only MFC applications, and if there was it would involve more work than simply not doing anything else with MFC.
The business model doesn't matter.
I can write something with Microsoft tech and expect it with reasonable likelihood to work in 10 years (even their service-based stuff), but can't say the same about anything from Google.
That alone stops me/my org buying stuff from Google.
I'm not contending that Microsoft and Google are equivalent in this regard, I'm saying that Microsoft does have a history of releasing technologies and then letting them stagnate.
With many research areas converging to comparable levels, the most critical piece is arguably vertical integration and forgoing the Nvidia tax.
They haven't wielded this advantage as powerfully as possible, but changes here could signal how committed they are to slaying the search cash cow.
Nadella deservedly earned acclaim for transitioning Microsoft from the Windows era to cloud and mobile.
It will be far more impressive if Google can defy the odds and conquer the innovator's dilemma with search.
Regardless, congratulations to Google on an amazing release and pushing the frontiers of innovation.
They have to not get blindsided by Sora, while at the same time fighting the cloud war against MS/Amazon.
Weirdly Google is THE AI play. If AI is not set to change everything and truly is a hype cycle, then Google stock withstands and grows. If AI is the real deal, then Google still withstands due to how much bigger the pie will get.
sora is not a big factor in this
It’s a huge factor. Nothing makes all your efforts look duller than video generation. It’s video, it’s literally video. So they have to fight tooth and nail not to fall behind on that.
They already looked slow because they whiffed on releasing an LLM. You will look even slower if the top efforts beat you in video because everyone can see it. They won’t need benchmarks.
They need an iPod to iPhone like transition. If they can pull it off it will be incredible for the business.
> and Google has been famously bad at getting people aligned and driving in one direction.
To be fair, it's not that they're bad at it -- it's that they generally have an explicit philosophy against it. It's a choice.
Google management doesn't want to "pick winners". It prefers to let multiple products (like messaging apps, famously) compete and let the market decide. According to this way of thinking, you come out ahead in the long run because you increase your chances of having the winning product.
Gemini is a great example of when they do choose to focus on a single strategy, however. Cloud was another great example.
I definitely agree that multiple competing products is a deliberate choice, but it was foolish to pursue it for so long in a space like messaging apps that has network effects.
As a user I always still wish that there were fewer apps with the best features of both. Google's 2(!) apps for AI podcasts being a recent example : https://notebooklm.google.com/ and https://illuminate.google.com/home
Google is not winning on cloud, AWS is winning and MS gaining ground.
Parent didn't claim Google is winning. Only that there is a cohesive push and investment in a single product/platform.
That was 2023; more recently Microsoft is losing ground to Google (in 2024).
Well, compared to github copilot (paid), I think Gemini Free is actually better at writing non-archaic code.
Using Claude 3.5 sonnet ?
Gemini is coming to copilot soon anyway.
>> hard to deny that their LLM models aren't really, really good though.
The context window of Gemini 1.5 pro is incredibly large and it retains the memory of things in the middle of the window well. It is quite a game changer for RAG applications.
It looks like long context degraded from 1.5 to 2.0 according to the 2.0 launch benchmarks.
Bear in mind that a "1 million token" context window isn't actually that. You're being sold a sparse attention model, which is guaranteed to drop critical context. Google TPUs aren't running inference on a TERABYTE of fp8 query-key inputs, let alone TWO of fp16.
Google's marketing wins again, I guess.
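For a sense of the scale being claimed, a back-of-the-envelope KV-cache calculation; every model dimension below is hypothetical, for illustration only:
```
# KV-cache size for dense attention at a 1M-token context.
layers = 60
kv_heads = 16
head_dim = 128
seq_len = 1_000_000
bytes_per_value = 1  # fp8

# 2x for keys and values
kv_cache = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
print(f"{kv_cache / 1e12:.2f} TB per sequence")  # ~0.25 TB with these numbers
```
Hundreds of gigabytes per sequence with modest made-up dimensions, which is why long-context serving leans on sparser attention and aggressive cache tricks.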
So far, for my tests, it has performed terribly compared to ChatGPT and Claude. I hope this version is better.
BERT and Gemma 2B were both some of the highest-performing edge models of their time. Google does really well - in terms of pushing efficiency in the community they're second to none. They also don't need to rely on inordinate amounts of compute because Google's differentiating factor is the products they own and how they integrate it. OpenAI is API-minded, Google is laser-focused on the big-picture experience.
For example; those little AI-generated YouTube summaries that have been rolling out are wonderful. They don't require heavyweight LLMs to generate, and can create pretty effective summaries using nothing but a transcript. It's not only more useful than the other AI "features" I interact with regularly, it doesn't demand AGI or chain-of-thought.
> Google is laser-focused on the big-picture experience.
This doesn't match my experience of any Google product.
I disagree - another way you could phrase this is that Google is presbyopic. They're very capable of thinking long-term (eg. Google Deepmind and AI as a whole, cloud, video, Drive/GSuite, etc.), but as a result they struggle to respond to quick market changes. AdSense is the perfect example of Google "going long" on a product and reaping the rewards to monopolistic ends. They can corner a market when they set their sights on it.
I don't think Google (or really any of FAANG) makes "good" products anymore. But I do think there are things to appreciate in each org, and compared to the way Apple and Microsoft are flailing helplessly I think Google has proven themselves in software here.
Google does software/features relatively well, but they are completely lost when it comes to marketing, shipping, and continuing to support products.
Or how would you describe their handling of Stadia, or their weird obsession about shipping and cancelling about a dozen instant messengers?
Yet, Google continues to show it'll deprecate its APIs, services, and functionality to the detriment of your own business. I'm not sure enterprises will trust Google's LLM over the alternatives. Too many have been burned throughout the years, including GCP customers.
The fact GCP needs to have this page, and these lists are not 100% comprehensive is telling enough. https://cloud.google.com/compute/docs/deprecations https://cloud.google.com/chronicle/docs/deprecations https://developers.google.com/maps/deprecations
Steve Yegge rightfully called this out, and yet no change has been made. https://medium.com/@steve.yegge/dear-google-cloud-your-depre...
GCP grew 35% last quarter , just saying ...
"just saying" things that are false.
Google Cloud grew 35% year over year, when comparing the 3 months ending September 30th 2024 with 2023.
https://abc.xyz/assets/94/93/52071fba4229a93331939f9bc31c/go... page 12
Isn't that the typical interpretation of what the parent comment said? How is it false?
I read parent comment "grew 35% last quarter" as (income on 2024-09-30) is 1.35 * (income on 2024-07-01)
The balance sheet shows (income on days from 2024-07-01 through 09-30) is 1.35 * (income on days from 2023-07-01 through 09-30)
These are different because with heavily handwavey math the first is growing 35% in a single quarter and the second is growing 35% annually (by comparing like-for-like quarters)
> seems to be getting the right results
> hard to deny that their LLM models aren't really, really good though
I'm so scarred by how much their first Gemini releases sucked that the thought of trying it again doesn't even cross my mind.
Are you telling us you're buying this press release wholesale, or you've tried the tech they're talking about and love it, or you have some additional knowledge not immediately evident here? Because it's not clear from your comment where you are getting that their LLM models are really good.
I’ve been using Gemini 1.5 Pro for coding and it’s been great.
Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability.
They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about supporting them long-term. Nice to see another option showing up. For reference, their main repo (not kidding) recommends setting up a Kubernetes cluster and a GCP bucket to submit batch requests.
[1] https://github.com/googleapis/python-genai
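A sketch of the new SDK's basic usage, going by the README in that repo; the model name and prompt are illustrative, and the key is a placeholder:
```
# pip install google-genai
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Explain what an agentic workflow is in one paragraph.",
)
print(response.text)
```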
It's interesting that just as the LLM hype appears to be simmering down, DeepMind is making big strides. I'm more excited by this than any of OpenAI's announcements.
Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
Regarding TPU’s, sure for the stuff that’s running on the cloud.
However their on device TPUs lag behind the competition and Google still seem to struggle to move significant parts of Gemini to run on device as a result.
Of course, Gemini is provided as a subscription service as well so perhaps they’re not incentivized to move things locally.
I am curious if they’ll introduce something like Apple’s private cloud compute.
i don’t think they need to win the on device market.
we need to separate inference and training - the real winners are those who have the training compute. you can always have other companies help with inference
> i don’t think they need to win the on device market.
The second Apple comes out with strong on-device AI - and it very much looks like they will - Google will have to respond on Android. They can't just sit and pray that e.g. Samsung makes a competitive chip for this purpose.
I think Apple is uniquely disadvantaged in the AI race to a point people dont realize. They have less training data to use, having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on. They have little to no cloud business, and while they operate a couple of services for their users, they do not have the infrastructure scale to compete with hyperscaler cloud vendors such as Google and Microsoft. Most of what they would need to spend on training new models would require that they hand over lots of money to the very companies that already have their own models, supercharging their competition.
While there is a chance that Apple might come out with a very sophisticated on-device model, the problem is that they would only be able to compete with other on-device models. The magnitude of compute needed to keep pace with SOTA models is not achievable on a single device. It will take many generations of Apple silicon to compete with the compute of existing datacenters.
Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.
Apple is a very distant competitor in the space of AI, and I see no reason to assume this will change, they are uniquely disadvantaged by several of the choices they made on their way to mobile supremacy. The only thing they currently have going for them is the development of their own ARM silicon which may give them the ability to compete with Google's TPU chips, but there is far more needed to be competitive here than the ability to avoid the Nvidia tax.
"having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on"
That may not be as big a disadvantage as you think.
Anthropic claim that they did not use any data from their users when they trained Claude 3.5 Sonnet.
sure but they certainly acquired data from mass scraping (including of data produced by their users) and/or data brokering aka paying someone to do the same.
yeah i’ve never understood the outsized optimism for apple’s ai strategy, especially on hn.
they’re a little bit less of a nobody than they used to be, but they’re basically a nobody when it comes to frontier research/scaling. and the best model matters way more than on-device which can always just be distilled later and find some random startup/chipco to do inference
Theory: Apple's lifestyle branding is quite important to the identity of many in the community here. I mean, look at the buy-in at launch for Apple Vision Pro by so many people on HN--it made actual Apple communities and publications look like jaded skeptics.
The Android on chip AI is and has been leagues better than what is available on iOS.
If anything, I think the upcoming iOS AI update will bring them to a similar level as android/google.
But given inference-time compute, to give a strong reply reasonably fast you'll need a lot of compute that sits very rarely used.
Economically this fits the cloud much better.
At what point does the on device stuff eat into their market share though? As on device gets better, who will pay for cloud compute? Other than enterprise use.
I’m not saying on device will ever truly compete at quality, but I believe it’ll be good enough that most people don’t care to pay for cloud services.
You're still focused about inference :)
inference basically does not matter, it is a commodity
You’re still focused about training :)
training doesn’t matter if inference costs are high and people don’t pay for them
but inference costs arent high already and there are tons of hardware companies that can do relatively cheap LLM inference
Inference costs per invocation aren’t high. Scale it out to billions of users and it’s a different story.
Training is amortized over each inference, so the cost of inference also needs to include the cost of training to break even unless made up elsewhere
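A toy version of that break-even arithmetic; every number below is invented for illustration:
```
training_cost = 100_000_000         # one-off, dollars
serving_cost_per_1k_requests = 1.0  # marginal inference cost, dollars
requests = 10_000_000_000           # lifetime requests served

# All-in cost per 1k requests = marginal serving cost + amortized training cost.
all_in_per_1k = serving_cost_per_1k_requests + training_cost / (requests / 1000)
print(f"${all_in_per_1k:.2f} per 1k requests all-in")  # $1.00 + $10.00 = $11.00
```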
That makes no sense. Inference costs dwarf training costs pretty quickly if you have a successful product. Afaik there is no commodity hardware that can run state-of-the-art models like chatgpt-o1.
> Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.
Stack enough GPUs and any of them can run o1. Building a chip to infer LLMs is much easier than building a training chip.
Just because one cost dwarfs another does not mean that this is where the most marginal value from developing a better chip will be, especially if other people are just doing it for you. Google gets a good model, inference providers will be begging to be able to run it on their platform, or to just sell google their chips - and as I said, inference chips are much easier.
Each GPU costs ~50k. You need at least 8 of them to run mid-sized models. Then you need a server to plug those GPUs into. That's not commodity hardware.
more like ~$16k for 16 3090s. AMD chips can also run these models. The parts are expensive but there is a competitive market in processors that can do LLM inference. Less so in training.
I don’t think the AI market will ever really be a healthy one until inference vastly outnumbers training. What does it say about AI if training is done more than inference?
I agree that the in-device inference market is not important yet.
done more != where the value is at
inference hardware is a commodity in a way that training is not
If the model weights is not open, you can't run it on device anyways.
The Pixel 9 runs many small proprietary Gemini models on the internal TPU.
And yet these new models still haven’t reached feature parity with Google Assistant, which can turn my flashlight on, but with all the power of burning down a rainforest, Gemini still cannot interact with my actual phone.
I just tried asking my phone to turn on the flashlight using Gemini. It worked. https://9to5google.com/2024/11/07/gemini-utilities-extension...
Ok I tried literally last week on Pixel 7a and it didn’t work. What model do you have? Maybe it requires a phone that can do on-device models?
Works on a Pixel 4A 5G..
Pretty sure that's not doing any fancy on-device models!
That said, there was a popup today saying that assistant is now using Gemini, so I just enabled it to try. Could well have changed in the last week.
I just tried it on my Galaxy Ultra s23 and it worked. I then disconnected internet and it did not work.
Gemini nano weights are leaked and google doesn't care about it being leaked. Google would definitely care if Pro weights are leaked.
Is there any phone in the world that can realistically run pro weights?
The majority of people want better performance; running locally is just a nice-to-have feature.
Latency is a huge factor in performance, and local models often have a huge edge. Especially on mobile devices that could be offline entirely.
They’ll care though when they have to pay for it, or when they’re in an area with poor reception.
Poor reception is rapidly becoming a non-issue for most of the developed world. I can’t think of the last time I had poor reception (in America) and wasn’t on an airplane.
As the global human population increasingly urbanizes, it’ll become increasingly easy to blanket it with cell towers. Poor(er) regions of the world will increase reception more slowly, but they’re also more likely to have devices that don’t support on-device models.
Also, Gemini Flash is basically positioned as a free model, (nearly) free API, free in GUI, free in Search Results, Free in a variety of Google products, etc. No one will be paying for it.
Many major cities have significant dead spots for coverage. It’s not just for developing areas.
Flash is free for API use at a low rate limit. Gemini as a whole is not free to Android users (free right now, with subscription costs beyond a time period for advanced features) and isn't free to Google without some monetary incentive. Hence why I originally asked about private cloud compute alternatives from Google.
I ride a ferry from a city of 50k to a city of 700k in the US, and work in a building with apartments upstairs; basically a concrete cave.
I see poor reception in both areas and only one has WiFi.
They pay to run it locally as well (more expensive hardware)
And sure, poor reception will be an issue, but most people would still absolutely take a helpful remote assistant over a dumb local assistant.
And you don't exactly see people complaining that they can't run Google/YouTube/etc locally.
Your first sentence has a fallacy: you're attributing the cost of the whole device to a single feature, and weighing that against the cost of the feature alone.
Most people are unlikely to buy the device for the AI features alone. It’s a value add to the device they’d buy anyway.
So you need the paid for option to be significantly better than the free one that comes with the device.
Your second sentence assumes the local one is dumb. What happens when local ones get better? Again how much better is the cloud one to compete on cost?
To your last sentence, it assumes data fetching from the cloud. Which is valid but a lot of data is local too. Are people really going to pay for what Google search is giving them for free?
I think it's a more likely assumption that on device performance will trail off device models by a significant margin for at least the next few years - of course if magically you can make it work locally with the same level of performance it would be better.
Plus a lot of the "agentic" stuff is interaction with the outside world, connectivity is a must regardless.
My point is that you do NOT need the same level of performance. You need an adequate level of performance that the cost to get more performance isn’t worth it to most people.
And my point is that it's way too early to try to optimize for running locally, if performance really stabilizes and comes to a halt (which may likely happen) then it makes more sense to optimize.
Plus once you start with on device features you start limiting your development speed and flexibility.
It isn't really hypothetical. Lots of good models run well on a modern Macbook Pro.
You can run a model >100x faster in the cloud compared to on-device with DDR RAM. This would make up for the reception.
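A rough version of that arithmetic: autoregressive decoding is mostly memory-bandwidth-bound, so tokens/s is bounded by bandwidth divided by bytes read per token. All numbers below are illustrative:
```
model_bytes = 8e9            # an 8B-parameter model at 8 bits per weight

ddr = 64e9                   # ~64 GB/s, typical dual-channel DDR5
hbm = 8 * 3.3e12             # 8 cloud accelerators at ~3.3 TB/s HBM each

print(ddr / model_bytes)     # ~8 tokens/s upper bound on device
print(hbm / model_bytes)     # ~3300 tokens/s upper bound sharded across the pod
```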
And you can’t run the cloud model at all if you can’t talk to the cloud.
Yes, but I can't imagine situations where I "have" to run a model when I don't have internet at that time. My life would be more affected by losing the rest of the internet than by losing the ability to run a small stupid model locally. At the very least until hallucination is completely solved, as I need the internet to verify the models.
Hallucination can't be solved because bogus output is categorically the same sort of thing as useful output.
It has no world model. It doesn't know truth any more than it knows bullshit; just a statistical relationship between words.
You’re assuming the model is purely for generation though. Several of the Gemini features are lookup of things across data available to it. A lot of that data can be local to device.
That is currently Apple’s path with Apple Intelligence for example.
Are these benchmarks still meaningful?
No, and they haven't been for at least half a year. Utterly optimized for by the providers. Nowadays if a model would be SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.
I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.
Meanwhile, Google's newest 2.0 Flash model went 0 for 7.
1: https://metro.co.uk/2024/12/11/gchq-christmas-puzzle-2024-re...
Why are you comparing flash vs o1-pro, wouldn't a more fair comparison be flash vs mini?
I just asked o1-mini the first two questions and it got them wrong.
Did it get the 8 right? The linked article provides the wrong answer btw.
Wow! That’s all I need to know about Google’s model.
What is impressive about this new model is that it is the lightweight version (flash).
There will probably be a 2.0 pro (which will be 4o/sonnet class) and maybe an ultra (o1(?)/Opus).
That's a comparison of multiple GPT-4 models working together... against a single GPT-4 mini style model.
If you look at where talent is going, it's Anthropic that is the real competitor to Google, not OpenAI.
Yeah they've been slow to release end-user facing stuff but it's obvious that they're just grinding away internally.
They've ceded the fast mover advantage, but with a massive installed base of Android devices, a team of experts who basically created the entire field, a huge hardware presence (that THEY own), massive legal expertise, existing content deals, and a suite of vertically integrated services, I feel like the game is theirs to lose at this point.
The only caution is regulation / anti-trust action, but with a Trump administration that seems far less likely.
Gemini-2.0-Flash does extremely well on the Hallucination Evaluation Leaderboard, at 1.3% hallucination rate https://github.com/vectara/hallucination-leaderboard
Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open book true/false and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.
This naming is confusing...
Anyway, I'm glad that this Google release is actually available right away! I pay for Gemini Advanced and I see "Gemini Flash 2.0" as an option in the model selector.
I've been going through Advent of Code this year, and testing each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet, Opus, Gemini Pro 1.5). Gemini has done decent, but is probably the weakest of the bunch. It failed (unexpectedly to me) on Day 10, but when I tried Flash 2.0 it got it! So at least in that one benchmark, the new Flash 2.0 edged out Pro 1.5.
I look forward to seeing how it handles upcoming problems!
I should say: Gemini Flash didn't quite get it out of the box. It actually had a syntax error in the for loop, which caused it to fail to compile, which is an unusual failure mode for these models. Maybe it was a different version of Java or something (I'm also trying to learn Java with AoC this year...). But when I gave Flash 2.0 the compilation error, it did fix it.
For the more Java proficient, can someone explain why it may have provided this code:
which was a compilation error for me? The corrected code it gave me afterwards was just one change, and with that the class ran and gave the right solution.
I use Claude and Gemini a lot for coding, and I realized there is no good or best model. Every model has its upsides and downsides. I was trying to get authentication working according to the newer guidelines of Manifest V3 for browser extensions, and every model is terrible. It is one use case where there is not much information or correct documentation, so every model makes up stuff. But this is my experience and I don't speak for everyone.
Relatedly, I'm starting to think more and more that AI is great for mediocre stuff. If you just need to build the 1000th website, it can do that. Do you want to build a new framework? Then there will probably be fewer useful suggestions. (Still not useless though. I do like it a lot for refactoring while building xrcf.)
EDIT: One reason that led me to think it's better for mediocre stuff was seeing the Sora model generate videos. Yes, it can create semi-novel stuff through combinations of existing stuff, but it can't stick to a coherent "vision" throughout the video. It's not like a movie by a great director like Tarantino, where every detail is right and all details point to the same vision. Instead, Sora is just flailing around. I see the same in software. Sometimes the suggestions go towards one style, and the next moment into another. I guess current AI just has a way shorter context, in that sense. Tarantino has been refining his style for 30 years now, always tuning his model towards his vision. AI in comparison seems to just take everything and turn it into one mediocre blob. It's not useless, but it's currently good to keep in mind that you can only use it to generate mediocre stuff.
We got to the point that AI isn't great because it is not like a Tarantino movie. What a time to be alive.
That's when having a huge context is valuable. Dump all of the new documentation into the model along with your query and the chances of success hugely increase.
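A sketch of that workflow, assuming the new google-genai SDK; the file path, model name, and key are all made up for illustration:
```
from pathlib import Path
from google import genai

# Prepend the whole documentation set to the query and let the long
# context window do the work.
docs = Path("manifest_v3_auth_docs.md").read_text()

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[docs, "Using only the documentation above, show how to set up auth."],
)
print(response.text)
```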
This is true for all newish code bases. You need to provide the context it needs to get the problem right. It has been my experience that one or two examples with new functions or new requirements will suffice for a correction.
> I use a Claude and Gemini a lot for coding and I realized there is no good or best model.
True to a point, but is anyone using GPT2 for anything still? Sometimes the better model completely supplants others.
> For the more Java proficient, can someone explain why it may have provided this code:
To me that reads like it was trying to accomplish something like
I can't comment on why the model gave you that code, but I can tell you why it was not correct.
`queue.remove(0)` gives you an `int[]`, which is also what you were assigning to `current`. So logically it's a single element, not an iterable. If you had wanted to iterate over each item in the array, it would need to be:
```
for (int[] current : queue) {
    for (int c : current) {
        // ...do stuff...
    }
}
```
Alternatively, if you wanted to iterate over each element in the queue and treat the int array as a single element, the revised solution is the correct one.
A tangent, but is there a clear best choice amongst those models for AOC type questions?
The Gemini 2 models support native audio and image generation, but the latter won't be generally available till January. Really excited for that, as well as 4o's image generation (whenever that comes out). Steerability has lagged behind aesthetics in image generation for a while now, and it'd be great to see a big advance in that.
Also a whole lot of computer vision tasks (via LLMs) could be unlocked with this. Think Inpainting, Style Transfer, Text Editing in the wild, Segmentation, Edge detection etc
They have a demo: https://www.youtube.com/watch?v=7RqFLp0TqV0
These are not computer vision tasks…
Maybe some of these tasks are arguably not aligned with the traditional applications of CV, but Segmentation and Edge detection are definitely computer vision in every definition I've come across - before and after NNs took over.
What are they, then…?
The first two are tasks which involve making images. They could be called image generation or image editing.
OT: I’m not entirely sure why, but "agentic" sets my teeth on edge. I don't mind the concept, but the word itself has that hollow, buzzwordy flavor I associate with overblown LinkedIn jargon, particularly as it is not actually in the dictionary...unlike perfectly serviceable entries such as "versatile", "multifaceted" or "autonomous"
To play devil's advocate, the correct use of the word would be when multiple AIs are coordinating and handing off tasks to each other with limited context, such that the handoffs are dynamically decided at runtime by the AI, not by any routine code. I have yet to see a single example where this is required. Most problems can be solved with static workflows and simple rule based code. As such, I do believe that >95% of the usage of the word is marketing nonsense.
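For concreteness, a minimal sketch of a runtime-decided handoff, as opposed to a hard-coded pipeline; `call_llm` again stands in for any chat API:
```
# The model itself names the next agent, instead of a fixed workflow.
AGENTS = {
    "researcher": "You gather facts and cite sources.",
    "writer": "You turn notes into prose.",
}

def run(task: str, call_llm, start: str = "researcher", max_hops: int = 5) -> str:
    agent, context = start, task
    for _ in range(max_hops):
        # reply is assumed to contain either a final answer, or a handoff
        # target plus notes, chosen by the model at runtime.
        reply = call_llm(system=AGENTS[agent], user=context)
        if reply.get("handoff") in AGENTS:
            agent, context = reply["handoff"], reply["notes"]
        else:
            return reply["answer"]
    return "hop limit reached"
```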
I think this sort of usage is already happening, but perhaps in the internal details or uninteresting parts, such as content moderation. Most good LLM products are in fact using many LLM calls under the hood, and I would expect that results from one are influencing which others get used.
I actually have built such a tool (two AIs, each with different capabilities), but still cringe at calling it agentic. Might just be an instinctive reflex.
Versatile is far worse. It’s so broad to the point of meaninglessness. My garden rake is fairly versatile.
Agentic to me means that it acts somewhat under its own authority rather than a single call to an LLM. It has a small degree of agency.
I'm personally very glad that the word has adhered itself to a bunch of AI stuff, because people had started talking about "living more agentically" which I found much more aggravating. Now if anyone states that out loud you immediately picture them walking into doors and misunderstanding simple questions, so it will hopefully die out.
Huh, all three words you mentioned as replacements are equally buzzwordy, and I see them a lot in CVs while screening candidates for job interviews.
They agree—they're saying that at least those buzzwords are in the dictionary, not that they'd be a good replacement for "agentic".
Versatile implies it can do more kinds of tasks (than its predecessor or competitor). Agentic implies it requires less human intervention.
I don't think these are necessarily buzzwords if the product really does what they imply.
At least all three of them are actually in the dictionary
That's not necessarily a good thing because they are overloaded while novel jargon is specific.
We use new words so often that we take it for granted. You've passively picked up dozens of new words over the last 5 or 10 years without questioning them.
Need a general term for autonomous intelligent decision making.
No, we need a scientific understanding of autonomous intelligent decision-making. The problem with “agentic AI” is the same old “Artificial Intelligence, Natural Stupidity” problem: we have no clue what “reasoning” or “intelligence” or “autonomous” actually means in animals, and trying to apply these terms to AI without understanding them (or inventing a new term without nailing down the underlying concept) is doomed to fail.
Isn't that just "intelligent"?
We need something to describe a behavioral element in business processes. Something goes into it, something comes out of it - though in this case nondeterminism is involved and it may not be concrete outputs so much as further actioning.
Intelligence is a characteristic.
Volitional, independent, spontaneous, free-willed, sovereign...
Yeah, I hate it when AI companies throw around words like AGI and agentic capabilities. It's nonsense to most people and ambiguous at best.
This is what other replies are missing - I've been following AI closely since GPT 2 and it's not immediately clear what agentic means, so to other people, the term must be even less clear. Using the word autonomous can't be worse than agentic imo.
What's everyone's favorite LLM leaderboard? Gemini 2 seems to be edging out 4o on chatbot arena(https://lmarena.ai/?leaderboard)
Notably, GPT-4o is a "full size" model, whereas Gemini 2 Flash is the small and efficient variant in that family as far as I understand it.
https://aider.chat/docs/leaderboards/
I like that https://artificialanalysis.ai/leaderboards/models describes both quality and speed (tokens/s and first chunk s). Not sure how accurate it is; anyone know? Speed and variance of it in particular seems difficult to pin down because providers obviously vary it with load to control their costs.
Leaderboards are not that useful for measuring the real-life effectiveness of the models, at least in my day-to-day usage.
I am currently struggling to diagnose an ipv6 mis-configuration in my enormous aws cloudformation yaml code. I gave the same input to Claude Opus, Gemini and ChatGPT ( o1 and 4o).
4o was the worst. verbose and waste of my time.
Claude completely went off-tangent and began recommending fixes for ipv4 while I specifically asked for ipv6 issues
o1 made a suggestion which I tried out and it fixed it. It literally found a needle in the haystack. The solution is working well now.
Gemini made a suggestion which almost got it right but it was not a full solution.
I must clarify diagnosing network issues on AWS VPC is not my expertise and I use the LLMs to supplement my knowledge.
Sonnet 3.5 as of today is superior to Opus, curious if sonnet could have solved your problem
https://livebench.ai/#/
AI benchmarks and leaderboards are complete nonsense though.
Find something you like, use it, be ready to look again in a month or two.
With the accelerating progress, the "be ready to look again" is becoming a full time job that we need to be able to delegate in some way, and I haven't found anything better than benchmarks, leaderboards and reviews.
EDIT: Typo
FWIW I've found the 'coding' 'category' of the leaderboard to be reasonably accurate. Claude was the best, o1-mini then was typically stronger, now the Gemini Exp 1206 is at the top.
I find myself just paying a la carte via the API rather than paying the $20/mo so I can switch between the models.
poe.com has a decent model where you buy credits and spend them talking to any LLM which makes it nice to swap between them even during the same conversation instead of paying for multiple subscriptions.
Though gpt-4o could say "David Mayer" on poe.com but not on chat.openai.com which makes me wonder if they sometimes cheat and sneak in different models.
Am I alone in thinking the word “agentic” is dumb as shit?
Most of these things seem to just be a system prompt and a tool that get invoked as part of a pipeline. They’re hardly “agents”.
They’re modules.
It's easier for consultants and sales people to sell to enterprise if the terminology is familiar but mysterious.
Bad
"Good"
Controlling a browser in Project Mariner seems very agentic: https://youtu.be/Fs0t6SdODd8?t=86
>“agentic” is dumb as shit?
It'll create endless consulting opportunities for projects that never go anywhere and add nothing of value unless you value rich consultants.
The beauty of LLMs isn't just that these coding objects speak human vernacular; it's that they can be concatenated with human vernacular prompts, and that output can itself be used sensibly as an input, command, or output without necessarily causing an error, even if a given combination of inputs wasn't preprogrammed.
I have an A.I. textbook with agent terminology that was written in pre-LLM days. Agents are just autonomous-ish code that loops on itself with some extra functionality. LLMs, in their elegance, can self-loop out of the box just by concatenating language prompts, sensibly. They are almost agent-ready out of the box thanks to this very elegant quality (the textbook agent diagram is just a conceptual self-perpetuation loop), except…
Except they fail at a lot, or get stuck at hiccups. But here is a novel thought: what if an LLM becomes more agentic (i.e. more able to sustain autonomous chained prompts that take actions without a terminal failure) and less copilot-like, not through more complex controlling wrapper code, but by training the core LLM itself to function more fluidly in agentic scenarios?
A better agentically performing LLM that isn't mislabeled with a bad buzzword might reveal itself not in its wrapper control code, but through simply performing better in a typical agentic loop or environment, with whatever initiating prompt, control wrapper code, or pipeline kicks off its self-perpetuation cycle.
Definitely not alone. With all the this money at stake, coining dumb terms like this might make you a pretty penny.
It's like a meme that can be milked for monetization.
Gemini, too, for the sole reason that non-native speakers have no clue how to pronounce it.
pronounced: juh-meany .... right?
Also, people at NASA pronounce it two ways, even native speakers of English.
I've been using gemini-exp-1206 and I notice a lot of similarities to the new gemini-2.0-flash-exp: they're not that much actually smarter but they go out of their way to convince you they are with overly verbose "reasoning" and explanations. The reasoning and explanations aren't necessarily wrong per se, but put them aside and focus on the actual logical reasoning steps and conclusions to your prompts and it's still very much a dumb model.
The models do just fine on "work" but are terrible for "thinking". The verbosity of the explanations (and the sheer amount of praise the models like to give the prompter - I've never had my rear end kissed so much!) should lead one to beware any subjective reviews of their performance rather than objective reviews focusing solely on correct/incorrect.
Think of Google as of a tanker ship. It takes a while to change course, but it has great momentum. Sundar just needs to make sure the course is right.
That's almost word for word what people said about Windows Phone when I was at Microsoft.
Windows Phone was actually great though, and would've eventually been a major player in the space if Microsoft were stubborn enough to stick with it long enough, like they did with the Xbox.
By his own admission, Gates was extremely distracted at the time by the antitrust cases in Europe, and he let the initiative die.
But Windows Phone was actually good, like the Zune; it was just late, and it was incredibly popular to hate Microsoft at the time.
Additionally, Microsoft didn't really have any advantage in the smart phone space.
Google is already a product the majority of people on the planet use regularly to answer questions.
That seems like a competitive advantage to me.
Yeah, I liked my windows phone, not sure why they killed it
Was the Windows Phone ever at the frontier tho?
Windows Phone was superior to everything else on the market at the time. But phones are an ecosystem, and MS was a latecomer.
It is a lot easier to switch LLMs than it is to switch smartphone platforms.
And where is the ship headed if they are no longer supporting the open web?
Publishers are being squeezed and going under, or replacing humans with hallucinated genai slop.
It’s like we’re taking the private equity model of extracting value and killing something off, and applying it to the entire web.
I’m not sure where this is headed, but I don’t think Sundar has any strategy here other than playing catch up.
Demis’ goal is pretty transparently positioning himself to take over.
The Web is dead. It’s pretty clear future web pages, if we call them that, will be assembled on-the-fly by AI based on your user profile and declared goals, interests, and requests.
> We're also launching a new feature called Deep Research, which uses advanced reasoning and long context capabilities to act as a research assistant, exploring complex topics and compiling reports on your behalf. It's available in Gemini Advanced today.
Anyone seeing this? I don't have an option in my dropdown.
Rolling out over the next few days, according to Jeff.
Not seeing it yet on web or mobile (in Canada)
Anecdotally, using the Gemini App with "Gemini Advanced 2.0 Flash Experimental", the response quality is ignorantly improved and faster at some basic Python and C# generation.
> ignorantly improved
autocorrect of "significantly improved"?
Gemini 2.0 Flash is available here: https://aistudio.google.com/prompts/new_chat
Based on initial interactions, it's extremely verbose. It seems to be focused on explaining its reasoning, but even after just a few interactions I have seen some surprising hallucinations. For example, to assess its current understanding of AI, I mentioned "Why hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded with text that included "Why haven't they released Claude 3.5 Sonnet First? That's an interesting point." There's clearly some reflection/attempted reasoning happening, but it doesn't feel competitive with o1 or the new Claude 3.5 Sonnet that was trained on 3.5 Opus output.
I'm not gonna lie I like Google's models.
Flash combines speed and cost and is extremely good to build apps on.
People really take that whole benchmarking thing more seriously than necessary.
Was this written by an LLM? It's pretty bad copy. Maybe they laid off their copywriting team...?
> "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones"
and
> "We’re getting 2.0 into the hands of developers and trusted testers today. And we’re working quickly to get it into our products, leading with Gemini and Search. Starting today our Gemini 2.0 Flash experimental model will be available to all Gemini users."
Sorry, what's wrong with these phrases?
> all of our products — including all 7 of them
All the products including all the products?
Why did you specifically ignore the remainder of the sentence?
"...all of our products — including all 7 of them with 2 billion users..."
It tells people that 7 of their products have 2b users.
That's not really any better, since "all of our products" already includes the subset that has at least 2B users. "I brought all my shoes, including all my red shoes."
They're pointing out that seven of their products have more than 2 billion users.
"I brought all my shoes, including the pairs that cost over $10,000" is saying something about what shoes you brought, more than "all of them".
Why are they bragging about something completely unrelated in the middle of a sentence about the impact of a piece of technology?
-Hey, are you done packing?
-Yes, I decided I'll bring all my shoes, including the ones that cost over $10,000.
What, they just couldn't help themselves?
The fact that they're using Gemini with even their most important products shows that they trust it.
Again, that's covered by "all our products". Why do we need to be reminded that Google has a lot of users? Someone oblivious to that isn't going to care about this press release.
Scale and cost are defining considerations of LLMs. By saying they're rolling out to billions of users, they're pointing out they're doing something pretty unique and have confidence in a major competitive advantage. Point billions of devices at other high-performing competitors' offerings, and all of them would fall over.
That phrasing still sucks. I am neither a native speaker nor a wordsmith, but I've worked with professional English writers who could make that look and sound infinitely better.
all of our products, 7 of which have over 2 billion users..
The meme of LLM generated content is that it's verbose and formal, not that it's poorly written.
It's why the quoted text is obviously written by a human.
There's no law that says LLM-generated text has to be bad in a singular way
executive spotted
It reads like a transcribed speech. You can picture this being read from a teleprompter at a conference keynote.
Short sentence fact. And aspirational tagline - pause for some metrics - and more. And. Today. And. And. Today.
We’re definitely going to need better benchmarks for agentic tasks, not just code reasoning: the needlessly painful things that humans go through all the time.
it's insane on lmarena for its size; livebench should have it soon too, I guess
The size isn't stated; it's not necessarily a given that it's as small as 1.5 Flash.
Does anyone have any insights into how Google selects source material for AI overviews? I run an educational site with lots of excellent information, but it seems to have been passed over entirely for AI overviews. With these becoming an increasingly large part of search--and from the sound of it, now more so with Gemini 2.0--this has me a little worried.
Anyone else run into similar issues or have any tips?
Any word on price? I can't find it at https://ai.google.dev/pricing
I've been using Gemini Flash for free through the API using Cline for VS Code. I switch between Claude and Gemini Flash, using Claude for more complicated tasks. Hope that the 2.0 model comes closer to Claude for coding.
Or… just continue using Claude?
Claude is ridiculously expensive and often subject to rate limiting.
I think they try to conserve costs by only using Claude when needed.
Agreed - tried some sample prompts on our data and the rough vibe check is that flash is now as good as the old pro. If they keep pricing the same, this would be really promising.
£18/month
https://gemini.google/advanced/?Btc=web&Atc=owned&ztc=gemini...
then sign in with Google account and you'll see it
Oh, but I only care about API pricing
I think it is free for 1500 requests/day. See the model dropdown on https://aistudio.google.com/prompts/new_chat
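If you'd rather poke at it from code than the dropdown, here's a minimal sketch using the google-generativeai Python SDK (the model name is the one AI Studio shows; free-tier limits are whatever the docs say that day, so treat the 1500/day figure as unconfirmed):

    # pip install google-generativeai
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # key from https://aistudio.google.com

    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    response = model.generate_content("Summarize the Gemini 2.0 announcement in two sentences.")
    print(response.text)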
Their Mariner tool for controlling the browser sounds scary and exciting. At the moment, it's an extension, which means JavaScript. Some websites block automation that happens this way, so developers resort to tools such as Selenium, which use the Chrome DevTools API to automate the browser. That's better, but can still be distinguished from normal use by very technical details. I wonder if Google, who still own Chrome, will give extensions better APIs for automation that cannot be distinguished from normal use.
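One of those "very technical details" is easy to see for yourself. A minimal sketch (assumes chromedriver is installed and on your PATH):

    # pip install selenium
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com")

    # WebDriver-driven sessions expose navigator.webdriver = true, one of the
    # signals anti-bot scripts check; a normal browsing session reports false.
    print(driver.execute_script("return navigator.webdriver"))
    driver.quit()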
Did anyone get to play with the native image generation part? In my experience, Imagen 3 was much better than the competition so I'm curious to hear people's take on this one.
Hrm, when I tried to get it to generate an image it said it was using Imagen 3. Not sure what “native” image generation means then.
At least when it comes to Go code, I'm pretty impressed by the results so far. It's also pretty good at following directions, which is a problem I have with open source models, and seems to use or handle Claude's XML output very well.
Overall, especially seeing as I haven't paid a dime to use the API yet, I'm pretty impressed.
Does anyone know how to sign up for the speech output wait list or tester program? I have a decent spend with GCP over the years, if that helps at all. Really want DemoTime videos to use those voices. (I like how truly incredible, best-in-the-world TTS is just a footnote in this larger announcement.)
Google beat OpenAI at their own game.
I'd be interested to see Gemini 2.0's performance on SWE-Bench.
Anyone else annoyed at how the ML/AI community just adopted the word "reasoning" when it's being used far outside its normal meaning, given what the model actually does?
These models take an instruction, along with any contextual information, and are trained to produce valid output.
That production of output is a form of reasoning via _some_ type of logical processing. No?
Maybe better to say computational reasoning. That’s a mouthful.
Static computation is not reasoning (these models are not building up an argument from premises, they are merely finding statistically likely completions). Computational thinking/reasoning would be breaking a problem down into algorithmic steps. The model is doing neither. I wouldn't be swayed by the fact that it can break a problem into steps if you ask it, because again that is just regurgitation. It's not going through that process without your prompt; that is not part of its process to arrive at an answer.
I kinda agree with you but I can also see why it isn't that far from "reasoning" in the sense humans do it.
To wit, if I am doing a high school geometry proof, I come up with a sequence of steps. If the proof is correct, each step follows logically from the one before it.
However, when I go from step 2 to step 3, there are multiple options for step 3 I could have chosen. Is it so different from a "most-likely prediction" an LLM makes? I suppose the difference is humans can filter out logically incorrect steps, or prune chains of steps that won't lead to the actual theorem quicker. But an LLM predictor coupled with a verifier doesn't feel that different from it.
The point is emergent capabilities in LLMs go beyond statistical extrapolation, as they demonstrate reasoning by combining learned patterns.
When asked, “If Alice has 3 apples and gives 2 to Bob, how many does she have left?”, the model doesn’t just retrieve a memorized answer—it infers the logical steps (subtracting 2 from 3) to generate the correct result, showcasing reasoning built on the interplay of its scale and architecture rather than explicit data recall.
Does it help to explore the annoyance using gap analysis? I think of it as heuristics. As with humans, it's the pragmatic "whatever seems to work", where "seems" is determined via training. It's neither reasoning from first principles (system 2) nor just selecting the most likely/prevalent answer (system 1). And chaining heuristics doesn't make it reasoning, either. But where there's evidence that it's working from a model, then it becomes interesting, and begins to comply with classical epistemology wrt "reasoning". Unfortunately, information theory seems to treat any compression as a model, leading to some pretty subtle delusions.
These kinds of simplifications continue to make me an expert in LLM applications.
So... it's a trade secret how it actually works...
Gemini multimodal live docs here: https://cloud.google.com/vertex-ai/generative-ai/docs/model-...
A little thin...
Also no pricing is live yet. OpenAI's audio inputs/outputs are too expensive to really put in production, so hopefully Gemini will be cheaper. (Not to mention, OAI's doesn't follow instructions very well.)
The Multimodal Live API is free while the model/API is in preview. My guess is that they will be pretty aggressive with pricing when it's in GA, given the 1.5 Flash multimodal pricing.
If you're interested in this stuff, here's a full chat app for the new Gemini 2 API's with text, audio, image, camera video and screen video. This shows how to use both the WebSocket API and to route through WebRTC infrastructure.
https://github.com/pipecat-ai/gemini-multimodal-live-demo
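For a feel of the wire protocol under that demo, here's a rough sketch of the raw WebSocket side in Python. Caveat: the endpoint and message field names below are my reading of the early docs and demo code, not anything official, so double-check them before relying on this:

    # pip install websockets
    import asyncio, json
    import websockets

    URI = ("wss://generativelanguage.googleapis.com/ws/"
           "google.ai.generativelanguage.v1alpha.GenerativeService."
           "BidiGenerateContent?key=YOUR_API_KEY")

    async def main():
        async with websockets.connect(URI) as ws:
            # First message configures the session (assumed shape).
            await ws.send(json.dumps({"setup": {"model": "models/gemini-2.0-flash-exp"}}))
            await ws.recv()  # setup acknowledgement
            await ws.send(json.dumps({"client_content": {
                "turns": [{"role": "user", "parts": [{"text": "Hello!"}]}],
                "turn_complete": True}}))
            print(await ws.recv())

    asyncio.run(main())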
Thanks, this is great!
Tested out Gemini 2 Flash. I had such high hopes that a better base model would help, but it still hallucinates like crazy compared to GPT-4o.
Small models don't "know" as much so they hallucinate more. They are better suited for generations that are based in a ground truth, like in a RAG setup.
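To make the "grounded in a RAG setup" point concrete, a toy sketch (the retriever and model objects here are hypothetical stand-ins, not a real API):

    def answer(question, search_index, model):
        # `search_index.top_k` is a hypothetical retriever: it returns the
        # k most relevant ground-truth snippets for the question.
        snippets = search_index.top_k(question, k=3)
        prompt = ("Answer using ONLY the context below. If the answer is not "
                  "in the context, say you don't know.\n\n"
                  "Context:\n" + "\n\n".join(snippets) + "\n\nQ: " + question)
        # A small model does fine here because the facts come from the
        # snippets, not from its own (limited) world knowledge.
        return model.generate_content(prompt).text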
A better comparison might be Flash 2.0 vs 4o-mini. Even then, the models aren't meant to have vast world knowledge, so benchmarking them on that isn't a great indicator of how they would be used in real-world cases.
Yes, it's not an apples to apples comparison. My point is the position it's at on the lmarena leaderboard is misplaced due to the hallucination issues.
Published the day after one of the authors, Demis Hassabis, received his Nobel prize in Stockholm.
I guess this means we'll have an openai release soon
I'm quite impressed with the flash demo's reasoning capabilities. I played the 20 questions game with it, and it found the computer mouse I had in my head. At first it was confused about our roles and said something weird; it thought that it had to guess its own word. Afterwards I had a meta conversation about that weirdness and it gave impressive insights:
" Why My "I Guess What I Have in Mind" Statement Doesn't Make Sense
Unfortunately the 10rpm quota for this experimental model isn't enough to run an actual Agentic experience on.
That's my main issue with Google: there are several models we want to try with our agent, but quota is limited and we have to jump through hoops to see if we can get it raised.
I think they are really overloading that word "Agent". I know there isn't a standard definition, but I think Google is stretching the meaning way thinner than most C-suite execs do when they talk about agents.
I think DeepMind could make progress if they focused on the agent definition of multi-step reasoning + action through a web browser, and deliver a ton of value, outside of lumping in the seldom used "Look at the world through a camera" or "Multi modal Robots" thing.
If Google cracked robots, past plays show that the market for those isn't big enough to interest Google. Like VR, you just can't get a billion people interested in robots, so even if they make progress, it won't survive under Google.
The "Look at the world through a camera" thing is a footnote in an Android release.
Agentic computer use _is_ a product a billion people would use, and it's adjacent to the business interests of Google Search.
"Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision."
"With your supervision". Thus avoiding Google being held responsible. That's like Teslas Fake Self Driving, where the user must have their hands on the wheel at all times.
Honestly this post makes Google sound like the new IBM. Very corporate.
"Hear from our CEO first, and then our other CEO in charge of this domain and CTO will tell you the actual news."
I haven't seen other tech companies write like that.
Jules looks like it's going after Devin
Claude MCP does the same thing; it's the setup that is hard. It will do push, pull, and create-branch automatically from a single prompt. $500 a month for Devin could be worth it if you want it taken care of, plus use of the models for a team, but a single person can set it up themselves.
"gemini for video games" - here we go again with the AI does the interesting stuff for you rather than the boring stuff
Is it on AI studio already?
Yes it is. Including the live features. It is pretty impressive. Basically voice mode with a live video feed as well.
Just played with it and it's great! A good 2.0 release I think.
Is this the gemini-exp model on LMArena?
Both are available on aistudio so I don't think so.
In my own testing "exp 1206" is significantly better than Gemini 2.
Feels like haiku 3.5 vs sonnet 3.5 kind of thing.
It looks like a slightly upgraded gemini-exp-1121. 1206 is something else.
Yes, LMArena shows Gemini-2.0-Flash-Exp ranking 3rd right now, after Gemini-Exp-1206 and ChatGPT-4o-latest_(2024-11-20), and ahead of o1-preview and o1-mini:
https://lmarena.ai/?leaderboard
There's also the "gremlin" model (not reachable directly) and it seems to be pretty smart.. maybe that's the deep research mode?
EDIT: probably not deep research.. is it Google testing their equivalent of o1? who knows..
The best thing about Gemini models is the huge context windows: you can just throw big documents at them and find stuff real fast, rather than struggling with cut-offs in Perplexity or Claude.
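Assuming the 1.5-era File API carries over to 2.0 (and the file name here is just an example), the whole flow is roughly:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Upload once, then ask questions against the entire document --
    # no chunking or retrieval needed if it fits in the context window.
    doc = genai.upload_file("big_report.pdf")
    model = genai.GenerativeModel("gemini-2.0-flash-exp")
    resp = model.generate_content([doc, "Where does this discuss Q3 revenue?"])
    print(resp.text)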
I work with LLMs and MLLMs all day (as part of my work on JoyCaption, an open source VLM). Specifically, I spend a lot of time interacting with multiple models at the same time, so I get the chance to very frequently compare models head-to-head on real tasks.
I'll give Flash 2 a try soon, but I gotta say that Google has been doing a great job catching up with Gemini. Both Gemini 1.5 Pro 002 and Flash 1.5 can trade blows with 4o, and are easily ahead of the vast majority of other major models (Mistral Large, Qwen, Llama, etc). Claude is usually better, but has a major flaw (to be discussed later).
So, here's my current rankings. I base my rankings on my work, not on benchmarks. I think benchmarks are important and they'll get better in time, but most benchmarks for LLMs and MLLMs are quite bad.
1) 4o and its ilk are far and away the best in terms of accuracy, both for textual tasks as well as vision related tasks. Absolutely nothing comes even close to 4o for vision related tasks. The biggest failing of 4o is that it has the worst instruction following of commercial LLMs, and that instruction following gets _even_ worse when an image is involved. A prime example is when I ask 4o to help edit some text, to change certain words, verbiage, etc. No matter how I prompt it, it will often completely re-write the input text to its own style of speaking. It's a really weird failing. It's like their RLHF tuning is hyper focused on keeping it aligned with the "character" of 4o to the point that it injects that character into all its outputs no matter what the user or system instructions state. o1 is a MASSIVE improvement in this regard, and is also really good at inferring things so I don't have to explicitly instruct it on every little detail. I haven't found o1-pro overly useful yet. o1 is basically my daily driver outside of work, even for mundane questions, because it's just better across the board and the speed penalty is negligible. One particular example of o1 being better I encountered yesterday: I had it re-wording an image description, and thought it had introduced a detail that wasn't in the original description. Well, I was wrong and had accidentally skimmed over that detail in the original. It _told_ me I was wrong, and didn't update the description! Freaky, but really incredible. 4o never corrects me when I give it an explicit instruction.
4o is fairly easy to jailbreak. They've been turning the screws for awhile so it isn't as easy as day 1, but even o1-pro can be jailbroken.
2) Gemini 1.5 Pro 002 (specifically 002) is second best in my books. I'd guesstimate it at being about 80% as good as 4o on most tasks, including vision. But it's _significantly_ better at instruction following. Its RLHF is a lot lighter than ChatGPT models', so it's easier to get these models to fall back to pretraining, which is really helpful for my work specifically. But in general the Gemini models have come a long way. The ability to turn off model censorship is quite nice, though it does still refuse at times. The Flash variation is interesting; oftentimes on par with Pro, with Pro edging it out maybe 30% of the time. I don't frequently use Flash, but it's an impressive model for its size. (Side note: The Gemma models are ... not good. Google's other public models, like so400m and OWLv2, are great, so it's a shame their open-LLM forays are falling behind). Google also has the best AI playground.
Jailbreaking Gemini is a piece of cake.
3) Claude is third on my list. It has the _best_ instruction following of all the models, even slightly better than o1. Though it often requires multi-turn to get it to fully follow instructions, which is annoying. Its overall prowess as an LLM is somewhere between 4o and Gemini. Vision is about the same as Gemini, except for knowledge-based queries, which Gemini tends to be quite bad at (who is this person? Where is this? What brand of guitar? etc). But Claude's biggest flaw is the insane "safety" training it underwent, which makes it practically useless. I get false triggers _all_ the time from Claude. And that's to say nothing of how unethical their "ethics" system is to begin with. And what's funny is that Claude is an order of magnitude _smarter_ when it's reasoning about its safety training. It's the only real semblance of reason I've seen from LLMs ... all just to deny my requests.
I've put Claude third out of respect for the _technical_ achievements of the product, but I think the developers need to take a long look in the mirror and ask why they think it's okay for _them_ to decide what people with disabilities are and are not allowed to have access to.
4) Llama 3. What a solid model. It's the best open LLM, hands down. Nowhere near the commercial models above, but for a model that's completely free to use locally? That's invaluable. Their vision variation is ... not worth using. But I think it'll get better with time. The 8B variation far outperforms its weight class. 70B is a respectable model, with better instruction following than 4o. The ability to finetune these models to a task with so little data is a huge plus. I've made task specific models with 200-400 examples.
5) Mistral Large (I forget the specific version of their latest release). I love Mistral as the "under-dog". Their models aren't bad, and behave _very_ differently from all other models out there, which I appreciate. But Mistral never puts any effort into polishing their models; they always come out of the oven half-baked. Which means they frequently glitch out, have very inconsistent behavior, etc. Accuracy and quality are hard to assess because of this inconsistency. On its best days it's up near Gemini, which is quite incredible considering the models are also released publicly. So theoretically you could finetune them to your task and get a commercial-grade model to run locally. But I rarely see anyone do that with Mistral, I think partly because of their weird license. Overall, I like seeing them in the race and hope they get better, but I wouldn't use it for anything serious.
Mistral is lightly censored, but fairly easy to jailbreak.
6) Qwen 2 (or 2.5 or whatever the current version is these days). It's an okay model. I've heard a lot of praise for it, but in all my uses thus far it's always been really inconsistent, glitchy, and weak. I've used it both locally and through APIs. I guess in _theory_ it's a good model, based on benchmarks. And it's open, which I appreciate. But I've not found any practical use for it. I even tried finetuning with Qwen 2VL 72B, and my tiny 8B JoyCaption model beat it handily.
That's about the sum of it. AFAIK that's all the major commercial and open models (my focus is mainly on MLLMs). OpenAI are still leading the pack in my experience. I'm glad to see good competition coming from Google finally. I hope Mistral can polish their models and be a real contender.
There are a couple smaller contenders out there like Pixmo/etc from allenai. Allen AI has hands down the _best_ public VQA dataset I've seen, so huge props to them there. Pixmo is ... okayish. I tried Amazon's models a little but didn't see anything useful.
NOTE: I refuse to use Grok models for the obvious reasons, so fucks to be them.
It is interesting to see that they keep focusing on the cheapest model instead of the frontier model. Probably because of their primary (internal?) customers' needs?
It's cheaper and faster to train a small model, which is better for a research team to iterate on, right? If Google decides that a particular small model is really good, why wouldn't they go ahead and release it while they work on scaling up that work to train the larger versions of the model?
I have no knowledge of Google-specific cases, but in many teams smaller models are trained upon bigger frontier models through distillation. So the frontier models come first, then the smaller models later.
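For reference, that distillation step usually boils down to something like this (PyTorch, with hypothetical teacher/student logits; this is the classic soft-label recipe, not anything published about Gemini specifically):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, T=2.0):
        # Soften both distributions with temperature T and push the student
        # toward the teacher. Teacher logits are assumed precomputed/detached,
        # so gradients flow only into the student.
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T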
Training a "frontier model" without testing the architecture is very risky.
Meta trained the smaller Llama 3 models first, and then trained the 405B model on the same architecture once it had been validated on the smaller ones. Later, they went back and used that 405B model to improve the smaller models for the Llama 3.1 release. Mistral started with a number of small models before scaling up to larger models.
I feel like this is a fairly common pattern.
If Google had a bigger version of Gemini 2.0 ready to go, I feel confident they would have mentioned it, and it would be difficult to distill it down to a small model if it wasn't ready to go.
The problem is that the last generation of the largest models failed to beat smaller models on the benchmarks; see the lack of a new Claude Opus or GPT-5. The problem is probably in the benchmarks, but anyway.
Is it better than GPT4o? Does it have an API?
API is accessible via Vertex AI on Google Cloud in preview. I think it's also available in the consumer Gemini Chat.
https://ai.google.dev/gemini-api/docs/models/gemini-v2
> What can you tell me about this sculpture?
> It's located in London.
Mind blowing.
Instead of throwing up tables of benchmarks just let me try to do stuff and see if it's useful.
Can cloudflare turnstile (and others) detect these agents as bots?
Considering so many of us would like more vRAM than NVIDIA is giving us for home compute, is there any future where these Trillium TPUs become commodity hardware?
Power concerns aside, individual chips in a TPU pod don't actually have a ton of vRAM; they rely on fast interconnects between a lot of chips to aggregate vRAM and then rely on pipeline / tensor parallelism. It doesn't make sense to try to sell the hardware -- it's operationally expensive. By keeping it in house Google only has to support the OS/hardware in their datacenter and they can and do commercialize through hosted services.
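Rough arithmetic for why a single chip doesn't get you there (the numbers below are illustrative assumptions, not actual Trillium specs):

    # Illustrative numbers only -- not real Trillium specs.
    params_b = 70                 # a 70B-parameter model
    bytes_per_param = 2           # bf16 weights
    weights_gb = params_b * bytes_per_param          # 140 GB of weights alone

    hbm_per_chip_gb = 32          # assumed per-chip HBM
    chips = -(-weights_gb // hbm_per_chip_gb)        # ceil division -> 5
    print(chips)  # weights alone span several chips, before KV cache,
                  # activations, or optimizer state -- hence the fast
                  # interconnect plus tensor/pipeline parallelism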
Why do you want the hardware vs just using it in the cloud? If you're training huge models you probably don't also keep all your data on prem, but on GCS or S3 right? It'd be more efficient to use training resources close to your data. I guess inference on huge models? Still isn't just using a hosted API simpler / what everyone is doing now?
"So many of us" is probably in the thousands; it would need to be three orders of magnitude higher before Google could even think of it.
We are moving through eras faster than years these days.
Reminder that implied models are not actual models. Models have failed to materialize repeatedly and vanished without further mention. I assume no one is trying to be misleading but, at this point, maybe overly optimistic.
Speed looks good vis-a-vis 4o-mini, and quality looks good so far against my eval set. If it's cheaper than 4o-mini too (which, it probably will be?) then OpenAI have a real problem, because switching between them is a value in a config file.
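To illustrate how thin that switching layer is, a hypothetical sketch (the client wrappers aren't real SDK calls, just the shape of the idea):

    # The only thing that changes between vendors is one config entry.
    MODELS = {
        "openai": "gpt-4o-mini",
        "google": "gemini-2.0-flash-exp",
    }

    def complete(clients, provider, prompt):
        # `clients` holds thin, hypothetical wrappers over each vendor's SDK;
        # the calling code never needs to know which one it got.
        return clients[provider].generate(MODELS[provider], prompt)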
Chat now: https://app.chathub.gg/chat/cloud-gemini-2.0-flash
Is this what it feels like to become one of the gray bearded engineers? This sounds like a bunch of intentionally confusing marketing drivel.
When capitalism has pilfered everything from the pockets of working people so people are constantly stressed over healthcare and groceries, and there's little left to further fill the pockets of plutocrats, the only marketing that makes sense is to appeal to other companies in order to raid their coffers by tricking their directors into buying a nonsensical product.
Is that what they mean by "agentic era"? Cause that's what it sounds like to me. Also smells a lot like press-release-driven development, where the point is to put a feather in the cap of whatever poor Google engineer is chasing their next promotion.
> Is that what they mean by "agentic era"? Cause that's what it sounds like to me.
What are you basing your opinion on? I have no idea how well these LLM agents will perform, but it's definitely a thing: OpenAI is working on them, as are Anthropic and certainly Google.
Yeah, it's a lot of marketing fluff, but these tools are genuinely useful, and it's no wonder Google is working hard to keep them from destroying its search-dependent bottom line.
Marketing aside, agents are just LLMs that can reach out of their regular chat bubbles and use tools. Seems like just the next logical evolution
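A bare-bones version of that loop, to show there's no magic (the model API and tool registry here are hypothetical stand-ins, not any vendor's actual SDK):

    import json

    TOOLS = {"get_weather": lambda city: f"14C and raining in {city}"}  # toy tool

    def run_agent(model, user_msg):
        messages = [{"role": "user", "content": user_msg}]
        while True:
            reply = model.chat(messages)  # hypothetical chat API
            if reply.tool_call is None:
                return reply.text         # plain answer: we're done
            # Model asked for a tool: run it, append the result, loop again.
            args = json.loads(reply.tool_call.arguments)
            messages.append({"role": "tool",
                             "content": TOOLS[reply.tool_call.name](**args)})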
Side note on Gemini: I pay for Google Workspace simply to enable e-mail capability for a custom domain.
I never used the web interface to access email until recently. To my surprise, all of the AI shit is enabled by default. So it’s very likely Gemini has been training on private data without my explicit consent.
Of course G words it as "personalizing" the experience for me, but it's such a load of shit. I'm tired of these companies stealing our data while we never get rightly compensated.
Gmail is hosting your email. Being able to change the domain doesn't change that they're hosting it on their terms. I think there are other email providers that have more privacy-focused policies.
the demos are amazing
I need to rewire my brain for the power of these tools
this plus the quantum stuff...Google is on a win streak
>2,000 words of bs
>General availability will follow in January, along with more model sizes.
>Benchmarks against their own models which always underperformed
>No pricing visible anywhere
Completely inept leadership at play.
Gemini in search is answering so many of my search questions wrong.
If I ask natural language yes/no questions, Gemini sometimes tells me outright lies with confidence.
It also presents information as authoritative - locations, science facts, corporate ownership, geography - even when it's pure hallucination.
Right at the top of Google search.
edit:
I can't find the most obnoxious offending queries, but here was one I performed today: "how many islands does georgia have?".
Compare that with "how many islands does georgia have? Skidaway Island".
This is an extremely mild case, but I've seen some wildly wrong results, where Google has claimed companies were founded in the wrong states, that towns were located in the wrong states, etc.
Doesn't match my experience. It also feels like it's getting better over time.
At first, this was true, but now it has gotten pretty good. The times it gets things wrong are often not the model's fault, just Google Search's fault.
Gemini 1.5 is indeed a lot of hit-and-miss. Also, the politically-correct and medical-info filtering is limiting its usefulness a lot, IMHO.
I also find it's not yet really as context-aware as ChatGPT-4o. Even just asking a follow-up question confuses Gemini 1.5.
Hope Gemini 2.0 will improve that!
This has happened to me zero times. :shrug:
can you provide some example queries that Gemini in search gets wrong?
I've found these results quite useful
Can these guys lead for once? They are always responding to what OpenAI is doing.
“Hey google turn on kitchen lights”
“Sure, playing don’t fear the reaper on bathroom speaker”
Ok
Just searched for GVP vs SVP and got:
"GVP stands for Good Pharmacovigilance Practice, which is a set of guidelines for monitoring the safety of drugs. SVP stands for Senior Vice President, which is a role in a company that focuses on a specific area of operations."
Seems there's a lot of pharma regulation in my telecom company.
Their offering is just so... bad. Even the new model. All the data in the world, yet they trail behind.
They have all of these extensions that they use to prop up the results in the web UI.
I was asking for a list of related YouTube videos - the UI returns them.
Ask the API the same prompt, it returns a bunch of made up YouTube titles and descriptions.
How could I ever rely on this product?
I am sure Google has the resources to compete in this space. What I'm less sure about is whether Google can monetize AI in a way that doesn't cannibalize their advertising income.
Who the hell wants an AI that has the personality of a car salesman?
No mention of Perplexity yet in the comments but it's obvious to me that they're targeting Perplexity Pro directly with their new Deep Research feature (https://blog.google/products/gemini/google-gemini-deep-resea...). I still wonder why Perplexity is worth $7 billion when the 800-pound gorilla is pounding on their door (albeit slowly).
Just tried Deep Research. It's a much, much slower experience than Perplexity at the moment, taking many minutes to return a result. Maybe it's more extensive, but I use Perplexity for quick information summaries a lot, and this is a very different UX.
Haven't used it enough to evaluate the quality, however.
Before dropping it for a different project that got some traction, "Slow Perplexity" was something I was pretty set on building.
Perplexity is a much less versatile product than it could be, in the chase for speed: you can only chew through so many tokens, do so much CoT, etc. in a given amount of time.
They optimized for virality (it's just as fast as Google but gives me more info!) but I suspect it kills the stickiness for a huge number of users, since you end up with some "embarrassing misses": stuff that should have been a slam dunk goes off the rails due to not enough search, or the wrong context being surfaced from the page, etc., and the user just doesn't see value in it anymore.
I know this isn't really a useful comment, but, I'm still sour about the name they chose. They MUST have known about the Gemini protocol. I'm tempted to think it was intentional, even.
It's like Microsoft creating an AI tool and calling it Peertube. "Hurr durr they couldn't possibly be confused; one is a decentralised video platform and the other is an AI tool hurr durr. And ours is already more popular if you 'bing' it hurr durr."
> It's like Microsoft creating an AI tool and calling it Peertube.
How is it like that? Gemini is a much more common word than Peertube. https://en.wikipedia.org/wiki/Gemini