Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely outside the training data.
Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14min, 20min, and 1h14min respectively.
Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned on, it's wild that a frontier model can now solve in minutes what would take me days.
davidpolberger
This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
dwringer
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start extrapolating about society and what this means for others. Meanwhile... I'm still wondering how they're still getting this problem wrong.
edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
Workaccount2
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]... please count the legs), which so far every other model has failed agonizingly, even continuing to fail when I tell them they are failing, and they tend to fight back at me.
Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was... well endowed. The 5th leg is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
AbstractH24
Can someone ELI5 what the difference between AI Studio, Antigravity, and Colab is?
falcor84
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
tylervigen
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
Just generated a bunch of 3D CAD models using Gemini 3.0 to see how it compares in spatial understanding, and it's heaps better than anything currently out there - not only in intelligence but also in speed.
Will run extended benchmarks later, let me know if you want to see actual data.
syspec
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works, but that I would never want to have to interact with.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard.
coffeecoders
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren't necessarily those with the best models, but those who already control the surfaces where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
stevesimmons
A nice Easter egg in the Gemini 3 docs [1]:
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
Good at vibe coding 10:30 - step change where it's actually useful
AGI still 5-10 years. Needs reasoning, memory, world models.
Is it a bubble? - Partly 22:00
What's fun to do with Gemini to show the relatives? Suggested taking a selfie with the app and having it edit. 24:00 (I tried and said make me younger. Worked pretty well.)
Also interesting - apparently they are doing an agent to go through your email inbox and propose replies automatically 4:00. I could see that getting some use.
__jl__
API pricing is up to $2/M for input and $12/M for output
For comparison:
Gemini 2.5 Pro was $1.25/M for input and $10/M for output
Gemini 1.5 Pro was $1.25/M for input and $5/M for output
siva7
I have my own private benchmarks for reasoning capabilities on complex problems and I test them against SOTA models regularly (professional cases from law and medicine).
Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions).
So I ran these benchmarks against Gemini 3 Pro and I'm not impressed. The reasoning is far more nuanced than in their older model, but it still makes mistakes the other two SOTA competitors don't make. For example, in a law benchmark it forgets that certain principles don't apply in the country from the provided case. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the case's assumed cultural context. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
crawshaw
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
bnchrch
I've been so happy to see Google wake up.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
aliljet
Why Gemini 3 isn't at the front of the pack on SWE-bench is really what I was hoping to understand here, especially for a blog post targeted at software developers...
ogig
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
mparis
I've been playing with the Gemini CLI with the gemini-3-pro preview. First impressions are that it's still not really ready for prime time in existing complex code bases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this, as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason... But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
golfer
Supposedly this is the model card. Very impressive results.
Okay, Gemini 3.0 Pro has officially surpassed Claude 4.5 (and GPT-5.1) as the top-ranked model on my private evals (multimodal reasoning with images/audio files, solving complex Caesar/transposition ciphers, etc.).
Claude 4.5 solved them as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach were a lot more elegant. Just my $0.02.
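(For anyone wondering what the Caesar half of an eval like that involves: brute force is trivial to write, which is what makes an elegant model solution notable. A generic sketch, not the commenter's actual harness:)

    # Generic Caesar brute force: try all 26 shifts, score candidates by
    # how many common English letters they contain, keep the best.
    def shift(text: str, k: int) -> str:
        out = []
        for ch in text:
            if ch.isalpha():
                base = ord("a") if ch.islower() else ord("A")
                out.append(chr((ord(ch) - base + k) % 26 + base))
            else:
                out.append(ch)
        return "".join(out)

    def crack(ciphertext: str) -> str:
        common = set("etaoinshrdlu")
        return max(
            (shift(ciphertext, k) for k in range(26)),
            key=lambda cand: sum(c.lower() in common for c in cand),
        )

    print(crack("Wkh txlfn eurzq ira"))  # expected: "The quick brown fox"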
mccoyb
I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.
Using Anthropic's or OpenAI's models is incredibly straightforward: pay us per month, here's the button you press, great.
Where do I go for this for these Google models?
bityard
> Whether you’re an experienced developer or a vibe coder
I absolutely LOVE that Google themselves drew a sharp distinction here.
svantana
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort.
I just wish Gemini could write well-formatted code. I do like the solutions it comes up with, and I know I can use a linter/formatter, but it would just be nice if, when I opened Gemini (CLI) and asked it to write a feature, it didn't mix up the indentation so badly... somehow Codex and Claude both get this right without any trouble...
yomismoaqui
From initial testing on my personal benchmark, it works better than Gemini 2.5 Pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems to forget rules less than Gemini 2.5 Pro with the thinking budget set to max. It's not perfect, but it helps a lot to test little changes to the game, rewind to a previous turn while changing a card on the fly, etc...
mark_l_watson
I had a fantastic ‘first result’ with Gemini 3 but a few people on social media I respect didn’t. Key takeaway is to do your own testing with your use cases. I feel like I am now officially biased re: LLM infrastructure: I am retired, doing personal research and writing, and I decided months ago to drop OpenAI and Anthropic infrastructure and just use Google to get stuff done - except I still budget about two hours a week to experiment with local models and Chinese models’ APIs.
mpeg
Well, it just found a bug in one shot that Gemini 2.5 and GPT-5 failed to find in relatively long sessions. Claude 4.5 had found it, but not in one shot.
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
primaprashant
Created a summary of the comments in this thread with gemini-3-pro and gpt-5.1, about 15 hours after it was posted, when it had 814 comments, using this script [1]:
The summary from GPT-5.1 is significantly longer and more verbose than Gemini 3 Pro's (13,129 output tokens vs 3,776). The Gemini 3 summary seems more readable; however, the GPT-5.1 one has interesting insights missed by Gemini.
Last time I did this comparison, at the time of the GPT-5 release [2], the summary from Gemini 2.5 Pro was way better and more readable than the GPT-5 one. This time the readability of the Gemini 3 summary still seems great, while GPT-5.1 feels a bit improved but not quite there yet.
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
The jump over GPT-5.1 on ARC-AGI-2 (31.1% vs 17.6%) is astonishing. If that's true and representative (a big if), it lends credence to this being the first of the very consistent agentically-inclined models, because it's able to follow a deep tree of reasoning to solve problems accurately. I've been building agents for a while, and thus far I've had to add many explicit instructions and hardcoded functions to guide the agents through simple tasks to achieve 85-90% consistency.
wohoef
Curious to see it in action. Gemini 2.5 has already been very impressive as a study buddy for courses like set theory, information theory, and automata.
Although I’m always a bit skeptical of these benchmarks. Seems quite unlikely that all of the questions remain out of their training data.
bilekas
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated by Google pushing their AI into everything they can, much like M$. You can't even search Google without getting an AI response. Surely you can't claim those numbers are legit.
srameshc
I think I am in an AI fatigue phase. I am past all the hype around models, tools and agents, and back to a problem-and-solution approach: sometimes code gen with AI, sometimes thinking and asking for a piece of code. But I'm not offloading everything to AI, buying all the BS, and waiting for it to do magic with my codebase.
rubymamis
I gave it the task of recreating StackView.qml to feel more native on iOS, and it failed - like all other models...
Prompt:
Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:
1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge; if mirrored, it should detect from the right edge. It's important that swiping will be responsive - that is, that the previous item will be seen under the current item when swiping - the same way it's handled in iOS applications. You should also add to the API the option for the swipe to be detected not just from the edge, but from anywhere on the item, with the same behavior. If swiping is released with x% of the current item out of view, then we should animate and move to the previous item. If it's a small percentage, we should animate the current page back to its place as if nothing happened.
2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.
mil22
It's available to be selected, but the quota does not seem to have been enabled just yet.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
nickandbro
What we have all been waiting for:
"Create me a SVG of a pelican riding on a bicycle"
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many topics are heavily censored. Obviously, a huge company like Google is under much heavier regulatory pressure than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
dudeinhawaii
Gemini has been so far behind agentically that it's comical. I'll be giving it a shot, but it has a herculean task ahead of it. It has to be not only "good enough" but a "quantum leap forward".
That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.
The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.
Excited to see Gemini making a leap here.
King-Aaron
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
nighwatch
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising.
As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that secret, but more likely they just had humans generate an enormous number of examples and then built on those synthetically.
No benchmark is safe, when this much money is on the line.
jpkw
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, beyond how a typical test does (i.e., giving any answer at all beats saying "I don't know", since a guess at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. A model that gives me an accurate response 90% of the time and an inaccurate one 10% of the time is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
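(To make that concrete, here's a toy scoring rule; the -10 penalty for a confidently wrong answer is an arbitrary assumption for illustration:)

    # Toy benchmark scoring: +1 for correct, 0 for "I don't know",
    # and a penalty for confidently wrong answers. Values are arbitrary.
    def expected_score(p_correct: float, p_idk: float,
                       wrong_penalty: float = -10.0) -> float:
        p_wrong = 1.0 - p_correct - p_idk
        return p_correct + wrong_penalty * p_wrong

    # Model A answers everything, right 90% of the time: 0.9 - 1.0 = -0.1
    print(expected_score(p_correct=0.9, p_idk=0.0))
    # Model B is right 10% of the time, abstains otherwise: +0.1
    print(expected_score(p_correct=0.1, p_idk=0.9))
    # Once wrongness costs enough, the "mostly abstains" model wins.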
qustrolabe
Out of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models.
icapybara
Anyone know how Gemini CLI with this model compares to Codex and Claude Code?
misja111
I asked Gemini to solve today's Countle puzzle (https://www.countle.org/).
It got stuck while iterating randomly trying to find a solution.
While I'm writing this it has been trying already for 5 minutes and the web page has become unresponsive.
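(For context, Countle-style puzzles are trivially brute-forceable, which makes watching a model iterate randomly extra painful. A rough sketch of the exhaustive search, with made-up numbers rather than today's actual puzzle:)

    # Exhaustive Countdown/Countle search: repeatedly combine two numbers
    # with +, -, *, / (keeping results positive and division exact).
    from itertools import combinations

    def solve(numbers, target):
        stack = [[(n, str(n)) for n in numbers]]
        while stack:
            state = stack.pop()
            for (a, ea), (b, eb) in combinations(state, 2):
                rest = list(state)
                rest.remove((a, ea))
                rest.remove((b, eb))
                cands = [(a + b, f"({ea}+{eb})"), (a * b, f"({ea}*{eb})")]
                if a > b:
                    cands.append((a - b, f"({ea}-{eb})"))
                elif b > a:
                    cands.append((b - a, f"({eb}-{ea})"))
                if b and a % b == 0:
                    cands.append((a // b, f"({ea}/{eb})"))
                if a and b % a == 0:
                    cands.append((b // a, f"({eb}/{ea})"))
                for value, expr in cands:
                    if value == target:
                        return expr
                    stack.append(rest + [(value, expr)])
        return None

    print(solve([25, 7, 3, 2], 182))  # e.g. "(((25+3)-2)*7)"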
I also asked it for the best play in backgammon when the opponent rolls 6-1 (playing 13/7 8/7) and you roll 5-1.
It starts off alright by mentioning a good move (13/8 6/5), but then continues to hallucinate several alternative but illegal moves.
I'm not too impressed.
creddit
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far; I'd live without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
energy123
With the $20/m subscription, do we get it on "Low" or "High" thinking level?
Tested it on a bug that Claude and ChatGPT Pro struggled with; it nailed the diagnosis but only solved it partially (it was about matching data using a bipartite graph).
Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues.
For frontend development, I think it's obvious that it's more powerful than Claude Code; at least in my tests, the UIs it produces are just better.
For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.
taf2
Earlier in the day I had asked GPT-5.1 High to refactor my Apex Visualforce page into a Lightning component, and it really didn't do much. Gemini 3 Pro crushed this task... very promising.
jacky2wong
What I loved about this release was that it was hyped up by a Polymarket leak with insider trading, NOT by nonsensical "feel the AGI" hype. Great model that's pushed the frontier of spatial reasoning by a long shot.
eknkc
Looks like it is already available in VS Code Copilot. Just tried a prompt that was not returning anything good with Sonnet 4.5. (I did not spend much time, but the prompt was already there on the chat screen, so I switched the model and sent it again.)
Gemini 3 worked much better and I actually committed the changes that it created. I don't mean its revolutionary or anything but it provided a nice summary of my request and created a decent simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.
mrinterweb
Hit the Gemini 3 quota on the second prompt in Antigravity even though I'm a Pro user. I highly doubt I hit a context window limit with my prompt. Hopefully it's just first-day-of-near-general-availability jitters.
ponyous
Can't wait to test it out. I've been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed:
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
GodelNumbering
And of course they hiked the API prices
Standard context (≤ 200K tokens)
Input: $2.00 vs $1.25 (Gemini 3 Pro input is 60% more expensive than 2.5)
Output: $12.00 vs $10.00 (Gemini 3 Pro output is 20% more expensive than 2.5)
Long context (> 200K tokens)
Input: $4.00 vs $2.50 (same +60%)
Output: $18.00 vs $15.00 (same +20%)
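For scale, here is the delta for a single hypothetical 150K-in / 8K-out call at the standard tier (the token mix is illustrative only; prices are from the figures above):

    # Cost comparison for one illustrative call (prices per 1M tokens,
    # standard-context tier).
    def cost(tokens_in, tokens_out, price_in, price_out):
        return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

    old = cost(150_000, 8_000, 1.25, 10.00)  # Gemini 2.5 Pro: $0.2675
    new = cost(150_000, 8_000, 2.00, 12.00)  # Gemini 3 Pro:   $0.3960
    print(f"+{new / old - 1:.0%}")           # ~+48% for this mix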
syedshahmir7214
Across the last few releases from all of these companies, I have not observed much improvement in the models' responses. The claims and launches are a little overhyped.
BugsJustFindMe
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate Python files. I asked it for a website with a frontend and a Python backend, and it only gave me a frontend. I asked again for the Python backend and it just gives repeated server errors trying to write the Python files. Pretty shit experience.
aerhardt
Combining structured outputs with search is the API feature I was looking for. Honestly crazy that it wasn’t there to start with - I have a project that is mostly Gemini API but I’ve had to mix in GPT-5 just for this feature.
I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.
As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
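For anyone looking for it, the combination looks roughly like this with the google-genai Python SDK. Treat this as a sketch: the preview model name and the exact pairing of config fields are assumptions to verify against the current docs:

    # Sketch: Google Search grounding plus a structured-output schema in
    # one call. Model name and field pairing are assumptions; check docs.
    from google import genai
    from google.genai import types
    from pydantic import BaseModel

    class Answer(BaseModel):
        claim: str
        confidence: float

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    resp = client.models.generate_content(
        model="gemini-3-pro-preview",
        contents="Summarize today's top AI news as a single claim.",
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
            response_mime_type="application/json",
            response_schema=Answer,
        ),
    )
    print(resp.text)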
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s.
The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles.
Amazingly this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt.
The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem.
A reasonably decent programmer would solve the easiest problems in a very short amount of time.
kanodiaayush
I don't really understand the amount of ongoing negativity in the comments. This is not the first time a product has been near-copied, and the experience for me is far superior to code in a terminal. It comes with improvements, even if imperfect, and I'm excited for those! I've long wanted the ability to comment on code diffs instead of just writing things back in chat. And I'm excited for the quality of Gemini 3.0 Pro, although I'm running into rate limits. I can already tell it's something I'm going to try out a lot!
iib
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the Pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
CephalopodMD
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowdsource those benchmarks, keep them completely secret, and create a new public benchmark from people's private AGI tests. All they should release for a given model is the final average score.
pclark
I just want Gemini to access ALL my Google Calendars, not just the primary one. If they supported this I would be all in on Gemini. Does no one else want this?
gertrunde
"AI Overviews now have 2 billion users every month."
"Users"? Or people that get presented with it and ignore it?
aliljet
When will this be available in the CLI?
alach11
This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.
auggierose
> Gemini 3 is the best vibe coding and agentic coding model we’ve ever built
Google goes full Apple...
deanc
Antigravity seems to be a bit overwhelmed. Unable to set up an account at the moment.
realty_geek
I would like to try controlling my browser with this model. Any ideas how to do this?
Ideally I would like something like OpenAI's Atlas or Perplexity's Comet, but powered by Gemini 3.
briga
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Retr0id
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
Do regular users know how to disable AI Overviews, if they don't love them?
clusterhacks
I wish I could just pay for the model and self-host on local/rented hardware. I'm incredibly suspicious of companies totally trying to capture us with these tools.
mikeortman
It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.
It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
zurfer
It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D
oezi
They probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you simply return to a tab that has already finished generating).
nilsingwersen
Feels great to see something marked confidential
RobinL
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
I_am_tiberius
I still need a Google account to use it, and it always asks me for phone verification, which I don't want to give to Google. That prevents me from using Gemini. I would even pay for it.
> AI overviews now have 2 billion users every month
More like 2 billion hostages
sunaookami
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file, I can't log in with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account", with useless links to completely different products (Code Assist).
Antigravity uses Open VSX and can't be configured differently, even though it says so right there (the setting is missing). The Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
energy123
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
thedelanyo
Reading the introductory passage, all I can say now is: AI is here to stay.
agentifysh
My only complaint is I wish the SWE and agentic coding scores had been better, to justify the 1-2x premium.
GPT-5.1 honestly looks very comfortable given its available usage limits and pricing,
although GPT-5.1 used from the ChatGPT website seems to be better for some reason.
Sonnet 4.5 agentic coding is still holding up well, which matches my own experience.
I guess my reaction to Gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month.
zen_boy
Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?
What I'd prefer over benchmarks is the answer to a simple question:
What useful thing can it demonstrably do that its predecessors couldn't?
oceanplexian
Suspicious that none of the benchmarks include Chinese models, even when they scored higher than the models being compared?
raffkede
Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.
CjHuber
Interesting that they added an option to select your own API key right in AI Studio's input field.
I sincerely hope the times of generous free AI Studio usage are not over.
bilsbie
Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?
visioninmyblood
Really exciting results on paper, but it would be truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the user data used to get them. I hope the training data was obtained with consent from its creators.
AstroBen
First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5
taikahessu
Boring. Tried to explore sexuality-related topics, but Alphabet is stuck in some Christian Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models, and it seems you have to be extra careful with wording with Gemini. Once you have the right wording, like "explore my own sexuality", and avoid certain words, you can get it going pretty interestingly.
gigatexal
How does it do in coding tasks? I’ve been absolutely spoiled by Claude sonnet 4.5 thinking.
ilaksh
Okay, since Gemini 3 is in AI Mode now, I switched from free Perplexity back to Google as my search default.
pflenker
> Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.
Come on, you can’t be serious.
slackerIII
What's the easiest way to set up automatic code review for PRs for my team on GitHub using this model?
eterm
> It seems there's a date conflict. The prompt claims it's 2025, but my internal clock says otherwise.
> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.
Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?
serjester
It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.
DrNosferatu
Anyone have any idea if/when it's coming to paid Perplexity?
NullCascade
I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.
Can the mathematicians here try ask it to invent new novel math related to [Insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
bluecalm
I asked it (Gemini 3 Thinking) about the difference between the Plus and Pro plans.
First it thought I was asking for a comparison between Gemini and ChatGPT, as it claimed there is no "Plus" plan for Gemini. After I insisted I am on this very plan right now, it apologized and told me it does in fact exist. Then it told me the difference is that I get access to newer models with the Pro subscription.
That is despite Google's own plan comparison page showing I get access to Gemini 3 on both plans.
It also told me that on Plus I am most likely using the "Flash" model. There is no "Flash" model in the dropdown to choose from; there is only "Fast" and "Thinking".
It then told me "Fast" is just Flash renamed, and that it likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5; it only mentions version 3 for both the Plus and Pro plans. Of course, from the dropdown menu it's impossible to see which model it is really using.
How can a normal person understand their products when their own super-advanced thinking/reasoning model, which took months to train on the world's most advanced hardware, can't?
It's amazing to me they don't see it as an epic failure in communication and marketing.
Dquiroga
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
maczwei
entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
jordanpg
What is Gemini 3 under the hood? Is it still just a basic LLM based on transformers? Or are there all kinds of other ML technologies bolted on now? I feel like I've lost the plot.
fosterfriends
Gemini 3 and 3 Pro are a good bit cheaper than Sonnet 4.5 as well. Big fan.
nprateem
OMG they've obviously had a major breakthrough because now it can reply to questions with actual answers instead of shit blog posts.
hekkle
GOOGLE: "We have a new product".
REALITY: It's just 3 existing products rolled into one, one of which isn't even a Google product.
- Microsoft's VS Code
- Gemini
- Chrome Browser
keepamovin
I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT-5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro was clearly smarter than everything else for a long time, but now 3 seems not even as good as that compared to the latest releases (5.1 etc). What is going on?
iamA_Austin
It started with OpenAI, and Google took the competition damn seriously.
pgroves
I was hoping Bash would go away or get replaced at some point. It's starting to look like it's going to be another 20 years of Bash but with AI doodads.
guluarte
It is live in the API:
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
beezlewax
Can't wait til Gemini 4 is out!
smarx007
Is it coming to Google Jules?
samuelknight
"Gemini 3 Pro Preview" is in Vertex
vivzkestrel
Has anyone managed to use any of the AI models to build a complete 3D FPS game using WebGL or OpenGL?
chiragsrvstv
Waiting for Google to nuke this as well, just like 2.5 Pro.
lofaszvanitt
Oh that corpulent fella with glasses who talks in the video. Look how good mannered he is, he can't hurt anyone. But Google still takes away all your data and you will be forced out of your job.
catigula
The problem with experiencing LLM releases nowadays is that it is no longer trivial to understand the differences in their vast intelligences, so it takes a while to really get a handle on what's even going on.
lofaszvanitt
A tad bit better, but it still has the same issues unpacking and understanding complex prompts. I have a test of mine and it now performs a bit better, but it still has zero understanding of what is happening and why. Gemini is the best of the best models out there, but with complex problems it just goes down the drain :(.
dankobgd
every day, new game changer
hubraumhugo
No gemini-3-flash yet, right? Any ETA mentioned on that? 2.5 Flash has been amazing in terms of cost/value ratio.
t_minus_40
Is there even a puzzle or math problem Gemini 3 can't solve?
petesergeant
Still insists the G7 photo[0] is doctored, and comes up with wilder and wilder "evidence" to support that claim, before getting increasingly aggressive.
Trained models should be able to use formal tools (for instance a logical solver, a computer?).
Good. That said, I wonder if those models are still LLMs.
BoorishBears
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
testfrequency
I continue to not use Gemini, as I can't opt out of having my data trained on while also keeping chat history.
Yes, I know the Workspaces workaround, but that’s silly.
m3kw9
If it ain't a quantum leap, new models are just "OS updates".
Der_Einzige
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be gated to a maximum of 0.2, and there's still a hidden top_k of 64 that you can't turn off.
I love Google AI Studio, but I also hate it for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions in a tool that is ostensibly aimed at prosumers...
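(For what it's worth, min_p itself is only a few lines; the gate is purely API-side. A generic sketch of the sampler, independent of any Google endpoint:)

    # Generic min_p sampling: keep tokens whose probability is at least
    # min_p times the top token's probability, renormalize, sample.
    import numpy as np

    def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        # Zero out everything below the scaled threshold, then renormalize.
        probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    print(min_p_sample([2.0, 1.5, 0.3, -1.0], min_p=0.2))

Because the cutoff scales with the top token's probability, the kept set stays sane even at very high temperatures, which is what makes "infinite temperature" workable.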
XCSme
How's the pelican?
skerit
Not the preview crap again.
Haven't they tested it enough?
When will it be available in Gemini-CLI?
thrownaway561
Yea, great... when will I be able to have it dial a number on my Google Pixel? Seriously, Gemini absolutely sucks on Pixel since it can't interact with the phone itself, so it can't dial numbers.
pk-protect-ai
It is pointless to ask an LLM to draw an ASCII unicorn these days. Gemini 3 draws one of these (depending on the prompt):
The most devastating news out of this announcement is that Vending-Bench 2 came out and it has significantly fewer clanker[0] meltdowns than the first one. I mean, seriously? Not even one run where the model tried to stock goods that hadn't arrived yet, only for it to eventually try and fail to shut down the business, and then e-mail the FBI about the $2 daily fee being deducted from the bot?
[0] Fake racial slur for a robot, LLM chatbot, or other automated system
jennyholzer
boooooooooooooo
gabriel666smith
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place on the TvP Benchmark with a score of 71.8% - behind Horizon Alpha, but ahead of any actually-released model checkpoints.
This blows Sonnet 4.5 (64.8%), GPT-5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
otikik
… agentic …
Meh, not interested already
casey2
The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.
jennyholzer
"AI" benchmarks are and have consistently been lies and misinformation. Gemini is dead in the water.
denysvitali
Finally!
rvz
I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail is being read by Gemini and put into the training set for future models. Oh dear. And Google is already being sued over using Gemini to analyze users' data, which potentially includes Gmail by default.
It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.
Also probably over for OpenAI
mihau
@simonw wen pelican
poemxo
It's amazing to see Google take the lead while OpenAI worsens their product every release.
alksdjf89243
Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.
WXLCKNO
Valve could learn from Google here
informal007
It seems that Google didn't prepare the Gemini 3 release well and leaked a lot of content, including the model card earlier today and Gemini 3 itself on aistudio.google.com.
kachapopopow
It's joever for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and GPT-5.1 and Sonnet 4.5 (thinking) do not come close.
The token efficiency and context are also mind-blowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. GPT-5.1 feels particularly slow and weak in real-world applications larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory allows them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
irthomasthomas
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini 3 did not mention him once. Tried it ten times and got zero mentions of Yudkowsky, despite him being a central figure in the story.
https://xcancel.com/xundecidability/status/19908286970881311...
Static Pelican is boring. First attempt:
Generate SVG animation of following:
1 - There is High fantasy mage tower with a top window a dome
2 - Green goblin come in front of tower with a torch
3 - Grumpy old mage with beard appear in a tower window in high purple hat
4 - Mage sends fireball that burns goblin and all screen is covered in fire.
Camera view must be from behind of goblin back so we basically look at tower in front of us:
https://codepen.io/Runway/pen/WbwOXRO
Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Can someone ELI5 what the difference between AI Studio, Antigravity, and Colab is?
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
Just generated a bunch of 3D CAD models using Gemini 3.0 to see how it compares in spatial understanding, and it's heaps better than anything currently out there - not only in intelligence but also in speed.
Will run extended benchmarks later, let me know if you want to see actual data.
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works, but that I would never want to have to interact with.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren't necessarily those with the best models, but those who already control the surface where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
A nice Easter egg in the Gemini 3 docs [1]:
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...

I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
Hassabis interview on Gemini 3, with Hard Fork (nyt podcast), also Josh Woodward https://youtu.be/rq-2i1blAlU?t=428 Some points -
Good at vibe coding 10:30 - step change where it's actually useful
AGI still 5-10 years. Needs reasoning, memory, world models.
Is it a bubble? - Partly 22:00
What's fun to do with Gemini to show the relatives? Suggested taking a selfie with the app and having it edit. 24:00 (I tried and said make me younger. Worked pretty well.)
Also interesting - apparently they are doing an agent to go through your email inbox and propose replies automatically 4:00. I could see that getting some use.
API pricing is up to $2/M for input and $12/M for output
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine). Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions). So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is way more nuanced than their older model, but it still makes mistakes the other two SOTA competitors don't, like forgetting in a law benchmark that certain principles don't apply in the country the case comes from. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the cultural context assumed by the case. All in, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one was for complex problems).
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive, but whenever I put the hours into evaluating it myself, it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
DeepMind page: https://deepmind.google/models/gemini/
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
I've been so happy to see Google wake up.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been a great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
Understanding precisely why Gemini 3 isn't at the front of the pack on SWE-bench is what I was really hoping to get here. Especially for a blog post targeted at software developers...
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript, and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
I've been playing with the Gemini CLI w/ the gemini-3-pro preview. First impressions are that it's still not really ready for prime time within existing complex code bases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this, as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason... But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Supposedly this is the model card. Very impressive results.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
Okay, Gemini 3.0 Pro has officially surpassed Claude 4.5 (and GPT-5.1) as the top ranked model based on my private evals (multimodal reasoning w/ images/audio files and solving complex Caesar/transposition ciphers, etc.).
Claude 4.5 solved it as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach was a lot more elegant. Just my $0.02.
I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.
Using Anthropic's or OpenAI's models is incredibly straightforward -- pay us per month, here's the button you press, great.
Where do I go for this for these Google models?
> Whether you’re an experienced developer or a vibe coder
I absolutely LOVE that Google themselves drew a sharp distinction here.
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort.
[1] https://lmarena.ai/leaderboard/text
I just wish Gemini could write well-formatted code. I do like the solutions it comes up with, and I know I can use a linter/formatter tool, but it would just be nice if, when I opened Gemini (CLI) and asked it to write a feature, it didn't mix up the indenting so badly... somehow Codex and Claude both get this without any trouble...
From initial testing on my personal benchmark, it works better than Gemini 2.5 Pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...
I had a fantastic ‘first result’ with Gemini 3 but a few people on social media I respect didn’t. Key takeaway is to do your own testing with your use cases. I feel like I am now officially biased re: LLM infrastructure: I am retired, doing personal research and writing, and I decided months ago to drop OpenAI and Anthropic infrastructure and just use Google to get stuff done - except I still budget about two hours a week to experiment with local models and Chinese models’ APIs.
Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it, but not in one shot.
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
Created a summary of comments from this thread about 15 hours after it had been posted and had 814 comments with gemini-3-pro and gpt-5.1 using this script [1]:
- gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...
- gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...
The summary from GPT 5.1 is significantly longer and more verbose than Gemini 3 Pro's (13,129 output tokens vs 3,776). The Gemini 3 summary seems more readable; however, the GPT 5.1 one has interesting insights missed by Gemini.
Last time I did this comparison, at the time of the GPT 5 release [2], the summary from Gemini 2.5 Pro was way better and more readable than the GPT 5 one. This time the readability of the Gemini 3 summary still seems great, while GPT 5.1 feels a bit improved but not quite there yet.
[1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...
[2]: https://news.ycombinator.com/item?id=44835029
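For anyone who wants to run the same experiment, the shape of such a script is simple. Here is a rough sketch of what mine would look like (the Algolia HN endpoint is real, but the thread id, prompt, and use of the google-genai and openai SDKs are my own assumptions, not the linked gist):

    # Hedged sketch: pull an HN thread and summarize it with two models.
    import requests
    from google import genai       # pip install google-genai
    from openai import OpenAI      # pip install openai

    def fetch_thread_text(item_id: int) -> str:
        """Flatten a whole HN thread into plain text via Algolia's API."""
        item = requests.get(f"https://hn.algolia.com/api/v1/items/{item_id}").json()
        parts = []
        def walk(node):
            if node.get("text"):
                parts.append(node["text"])
            for child in node.get("children") or []:
                walk(child)
        walk(item)
        return "\n\n".join(parts)

    prompt = ("Summarize the key themes and insights in these Hacker News "
              "comments:\n\n" + fetch_thread_text(1234567))  # placeholder id

    g = genai.Client().models.generate_content(
        model="gemini-3-pro-preview", contents=prompt)
    o = OpenAI().chat.completions.create(
        model="gpt-5.1", messages=[{"role": "user", "content": prompt}])
    print(g.text, o.choices[0].message.content, sep="\n\n---\n\n")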
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
How long does it typically take after this to become available on https://gemini.google.com/app ?
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png
A jump from ChatGPT 5.1's 17.6% to 31.1% on ARC-AGI-2 is astonishing. If that's true and representative (a big if), it lends credence to this being the first of the very consistent, agentically-inclined models, because it's able to follow a deep tree of reasoning to solve problems accurately. I've been building agents for a while and thus far have had to add many explicit instructions and hardcoded functions to guide the agents through completing simple tasks to achieve 85-90% consistency.
Curious to see it in action. Gemini 2.5 has already been very impressive as a study buddy for courses like set theory, information theory, and automata. Although I’m always a bit skeptical of these benchmarks. Seems quite unlikely that all of the questions remain out of their training data.
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated by Google literally pushing their AI into everything they can, much like M$. You can't even search Google without getting an AI response. Surely you can't claim those numbers are legit.
I think I am in the AI fatigue phase. I am past all the hype with models, tools, and agents and back to a problem-and-solution approach: sometimes code gen with AI, sometimes thinking and asking for a piece of code. But not offloading everything to AI, buying all the BS, and waiting for it to do magic with my codebase.
I gave it the task of recreating StackView.qml to feel more native on iOS, and it failed - like all other models...
Prompt:
Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:
1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge, if mirrored it should detect from the right edge. It's important that swiping will be responsive - that is, that the previous item will be seen under the current item when swiping - the same way it's being handled on iOS applications. You should also add to the api the option for the swipe to be detected not just from the edge, but from anywhere on the item, with the same behavior. If swiping is released from x% of current item not in view anymore than we should animate and move to the previous item. If it's a small percentage we should animate the current page to get back to its place as nothing happened.
2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.
It's available to be selected, but the quota does not seem to have been enabled just yet.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
What we have all been waiting for:
"Create me a SVG of a pelican riding on a bicycle"
https://www.svgviewer.dev/s/FfhmhTK1
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many topics are heavily censored. Obviously, a huge company like Google is under much heavier regulatory pressure than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
Gemini has been so far behind agentically it's comical. I'll be giving it a shot but it has a herculean task ahead of itself. It has to not only be "good enough" but a "quantum leap forward".
That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.
The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.
Excited to see Gemini making a leap here.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising. As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
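To put the hike in per-request terms, a quick back-of-the-envelope sketch using the rates above (the 30k-in / 2k-out request shape is an assumed example, not anything from the announcement):

    # Cost per request at the quoted per-1M-token rates.
    def cost(tok_in: int, tok_out: int, rate_in: float, rate_out: float) -> float:
        return tok_in / 1e6 * rate_in + tok_out / 1e6 * rate_out

    tok_in, tok_out = 30_000, 2_000           # hypothetical request shape
    old = cost(tok_in, tok_out, 1.25, 10.00)  # Gemini 2.5 Pro
    new = cost(tok_in, tok_out, 2.00, 12.00)  # Gemini 3 Pro, <=200k context
    print(f"2.5 Pro: ${old:.4f}, 3 Pro: ${new:.4f} (+{new / old - 1:.0%})")
    # -> 2.5 Pro: $0.0575, 3 Pro: $0.0840 (+46%)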
Make a pelican riding a bicycle in 3d: https://gemini.google.com/share/def18e3daa39
Amazing and hilarious
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that secret itself, but more likely they just had humans generate an enormous number of examples and then synthetically built on that.
No benchmark is safe, when this much money is on the line.
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, other than the way a typical test does (i.e., giving any answer at all beats saying "I don't know", since a guess at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
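As far as I can tell the headline benchmarks don't do this, but the scoring rule is easy to write down. A minimal sketch with made-up weights (abstaining is free, a wrong answer costs ten times a correct one; nothing here comes from an existing benchmark):

    # Hypothetical abstention-aware scoring; the weights are arbitrary.
    def score(answers: list[tuple[str, str]], wrong_penalty: float = 10.0) -> float:
        total = 0.0
        for got, want in answers:
            if got.strip().lower() == "i don't know":
                continue                 # abstaining is neutral
            total += 1.0 if got == want else -wrong_penalty
        return total / len(answers)

    # A model that is right 10% of the time and abstains otherwise scores
    # 0.10, beating one that is right 90% of the time but confidently
    # wrong the other 10%: 0.9 - 0.1 * 10 = -0.10.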
Of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models.
Anyone know how Gemini CLI with this model compares to Codex and Claude Code?
I asked Gemini to solve today's Countle puzzle (https://www.countle.org/). It got stuck while iterating randomly trying to find a solution. While I'm writing this it has been trying already for 5 minutes and the web page has become unresponsive.
I also asked it for the best play in backgammon when the opponent rolls 6-1 (playing 13/7 8/7) and you roll 5-1. It starts alright, mentioning a good move (13/8 6/5), but then hallucinates several alternative but illegal moves. I'm not too impressed.
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far, in which case I will live without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
With the $20/m subscription, do we get it on "Low" or "High" thinking level?
Wow so the polymarket insider bet was true then..
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
Tested it on a bug that Claude and ChatGPT Pro struggled with; it nailed it, but only solved it partially (it was about matching data using a bipartite graph). Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues. For frontend development, I think it's obvious that it's more powerful than Claude Code; at least in my tests, the UIs it produces are just better. For backend development, it's good, but I noticed that in Java specifically it often outputs code that doesn't compile on the first try, unlike Claude.
I had asked earlier in the day for GPT-5.1 high to refactor my Apex Visualforce page into a Lightning component, and it really didn't do much here - Gemini 3 Pro crushed this task... very promising.
What I loved about this release was that it was hyped up by a Polymarket leak with insider trading - NOT by nonsensical "feel the AGI" hype. Great model that's pushed the frontier of spatial reasoning by a long shot.
Looks like it is already available in VS Code Copilot. Just tried a prompt that was not returning anything good on Sonnet 4.5. (I did not spend much time on it, but the prompt was already there on the chat screen, so I switched the model and sent it again.)
Gemini 3 worked much better, and I actually committed the changes it created. I don't mean it's revolutionary or anything, but it provided a nice summary of my request and created a decent, simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.
Hit the Gemini 3 quota on the second prompt in Antigravity even though I'm a Pro user. I highly doubt I hit a context window based on my prompt. Hopefully it is just first-day-of-near-general-availability jitters.
Can't wait to test it out. I've been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed:
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
And of course they hiked the API prices
Standard Context(≤ 200K tokens)
Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)
Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)
Long Context(> 200K tokens)
Input $4.00 vs $2.50 (same +60%)
Output $18.00 vs $15.00 (same +20%)
From the last few releases of these models, across all companies, I have not observed much improvement in responses. The claims and launches are a little overhyped.
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate Python files. I asked it for a website with a frontend and a Python backend, and it only gave a frontend. I asked again for the Python backend and it just gives repeated server errors trying to write the Python files. Pretty shit experience.
Combining structured outputs with search is the API feature I was looking for. Honestly crazy that it wasn’t there to start with - I have a project that is mostly Gemini API but I’ve had to mix in GPT-5 just for this feature.
I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.
As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
I would love to see how Gemini 3 can solve this particular problem. https://lig-membres.imag.fr/benyelloul/uherbert/index.html
It used to be an algorithmic game for a Microsoft student competition that ran in the mid-to-late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
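To illustrate why the rules "fit into a prompt": an interpreter for this kind of recursive movement language is only a page of code. A toy sketch follows; the s/l/r commands and the single-letter procedure syntax are my loose recollection of Herbert's language, so treat the details as assumptions.

    # Toy interpreter for a Herbert-like recursive movement language.
    # Assumed syntax: 's' = step forward, 'l'/'r' = turn, and lowercase
    # letters name procedures ({"a": "slsra"}) that may call themselves.
    def run(program: dict[str, str], max_steps: int = 100) -> list[tuple[int, int]]:
        x, y, dx, dy = 0, 0, 0, -1      # start at origin, facing "up"
        path = [(x, y)]

        def exec_body(body: str, budget: int) -> int:
            nonlocal x, y, dx, dy
            for c in body:
                if budget <= 0:
                    return 0
                if c == 's':             # step in the current heading
                    x, y = x + dx, y + dy
                    path.append((x, y))
                    budget -= 1
                elif c == 'l':           # rotate heading 90 degrees left
                    dx, dy = dy, -dx
                elif c == 'r':           # rotate heading 90 degrees right
                    dx, dy = -dy, dx
                elif c in program:       # recursive procedure call
                    budget = exec_body(program[c], budget)
            return budget

        exec_body(program["main"], max_steps)
        return path

    # Recursion stands in for loops: "a" draws an endless staircase.
    print(run({"main": "a", "a": "slsra"}, max_steps=8))

The step budget plays the role of the game's move limit; a model has to reason about the recursion unrolling against that budget, which is plausibly why they all fail.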
I don't really understand the amount of ongoing negativity in the comments. This is not the first time a product has been near-copied, and the experience for me is far superior to coding in a terminal. It comes with improvements, even if imperfect, and I'm excited for those! I've long wanted the ability to comment on code diffs instead of just writing things back down in chat. And I'm excited for the quality of Gemini 3.0 Pro, although I'm running into rate limits. I can already tell it's something I'm going to try out a lot!
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they still struggle on their own with less-represented programming languages.
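For what it's worth, the property being proved is easy to state outside Lean. A quick Python property check of the same claim (textbook quicksort, with Counter standing in for the multiset):

    # "Quicksort preserves multiplicity": the output is a permutation of
    # the input, i.e. the multiset of elements is unchanged.
    from collections import Counter
    import random

    def quicksort(xs: list[int]) -> list[int]:
        if not xs:
            return []
        pivot, rest = xs[0], xs[1:]
        return (quicksort([x for x in rest if x < pivot]) + [pivot]
                + quicksort([x for x in rest if x >= pivot]))

    for _ in range(1_000):
        xs = [random.randrange(10) for _ in range(random.randrange(25))]
        assert Counter(quicksort(xs)) == Counter(xs)
    print("multiplicity preserved on 1000 random lists")

Of course the Lean version is the hard part: the proof has to hold for every list, not a thousand random ones.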
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.
I just want Gemini to access ALL my Google Calendars, not just the primary one. If they supported this I would be all in on Gemini. Does no one else want this?
"AI Overviews now have 2 billion users every month."
"Users"? Or people that get presented with it and ignore it?
When will this be available in the CLI?
This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.
> Gemini 3 is the best vibe coding and agentic coding model we’ve ever built
Google goes full Apple...
Antigravity seems to be a bit overwhelmed. Unable to set up an account at the moment.
I would like to try controlling my browser with this model. Any ideas on how to do this? Ideally I would like something like OpenAI's Atlas or Perplexity's Comet, but powered by Gemini 3.
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
Do regular users know how to disable AI Overviews, if they don't love them?
I wish I could just pay for the model and self-host on local/rented hardware. I'm incredibly suspicious of companies totally trying to capture us with these tools.
It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.
It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D.
Probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g., it animates new text being generated even when you simply return to a tab that has already finished generating).
Feels great to see something confidential.
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
I still need a Google account to use it, and it always asks me for phone verification, which I don't want to give to Google. That prevents me from using Gemini. I would even pay for it.
https://www.youtube.com/watch?v=cUbGVH1r_1U
A side-by-side comparison of Gemini with other models.
I just googled latest LLM models and this page appears at the top. It looks like Gemini Pro 3 can score 102% in high school math tests.
Here it makes a text based video editor that works:
https://youtu.be/MPjOQIQO8eQ?si=wcrCSLYx3LjeYDfi&t=797
> AI overviews now have 2 billion users every month
More like 2 billion hostages
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
Reading the introductory passage, all I can say now is: AI is here to stay.
My only complaint is I wish the SWE and agentic coding results were better, to justify the 1-2x premium.
GPT-5.1 honestly looks very comfortable given its available usage limits and pricing,
although GPT-5.1 used from the ChatGPT website seems to be better for some reason.
Sonnet 4.5 agentic coding is still holding up well, which confirms my own experiences.
I guess my reaction to Gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month.
Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?
Somebody "two-shotted" Mario Bros NES in HTML:
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
What I'd prefer over benchmarks is the answer to a simple question:
What useful thing can it demonstrably do that its predecessors couldn't?
Suspicious that none of the benchmarks include Chinese models, even when they scored higher on those benchmarks than the models being compared?
Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.
Interesting that they added an option to select your own API key right in AI Studio's input field. I sincerely hope the times of generous free AI Studio usage are not over.
Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?
Really exciting results on paper, but it would be truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data taken from users. I hope the training data was obtained with consent from its creators.
First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5
Boring. Tried to explore sexuality-related topics, but Alphabet is stuck in some Christian Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.
How does it do in coding tasks? I’ve been absolutely spoiled by Claude sonnet 4.5 thinking.
Okay, since Gemini 3 is in AI Mode now, I switched from free Perplexity back to Google as my default search.
> Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.
Come on, you can’t be serious.
What's the easiest way to set up automatic code review for PRs for my team on GitHub using this model?
> It seems there's a date conflict. The prompt claims it's 2025, but my internal clock says otherwise.
> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.
Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?
It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.
Anyone have any idea if/when it's coming to paid Perplexity?
I'm not a mathematician, but I think we underestimate how useful pure mathematics can be for telling whether we are approaching AGI.
Can the mathematicians here try asking it to invent novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
I asked it (Gemini 3 thinking) about the difference between the Plus and Pro plans. First it thought I was asking for a comparison between Gemini and ChatGPT, as it claimed there is no "Plus" plan for Gemini. After I insisted I am on this very plan right now, it apologized and told me it does in fact exist. Then it told me the difference is that I get access to newer models with the Pro subscription. That is despite Google's own plan comparison page showing I get access to Gemini 3 on both plans.
It also told me that on Plus I am most likely using the "Flash" model. There is no "Flash" model in the dropdown to choose from; there is only "Fast" and "Thinking". It then told me "Fast" is just Flash renamed and likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5; it only mentions version 3 for both the Plus and Pro plans. Of course, from the dropdown menu it's impossible to see which model it is really using.
How can a normal person understand their products when their own super advanced thinking/reasoning model, trained over months on the world's most advanced hardware, can't?
It's amazing to me that they don't see this as an epic failure in communication and marketing.
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
What is Gemini 3 under the hood? Is it still just a basic LLM based on transformers? Or are there all kinds of other ML technologies bolted on now? I feel like I've lost the plot.
Gemini 3 and 3 Pro are a good bit cheaper than Sonnet 4.5 as well. Big fan.
OMG they've obviously had a major breakthrough because now it can reply to questions with actual answers instead of shit blog posts.
GOOGLE: "We have a new product".
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- Microsoft Code
- Gemini
- Chrome Browser
I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT-5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro was clearly smarter than everything else for a long time, but now 3 seems not even as good as that compared to the latest releases (5.1 etc). What is going on?
It started with OpenAI, and Google took the competition damn seriously.
I was hoping Bash would go away or get replaced at some point. It's starting to look like it's going to be another 20 years of Bash but with AI doodads.
It is live in the API:
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
Can't wait til Gemini 4 is out!
Is it coming to Google Jules?
"Gemini 3 Pro Preview" is in Vertex
Has anyone managed to use any of the AI models to build a complete 3D FPS game using WebGL or OpenGL?
Waiting for Google to nuke this as well, just like 2.5 Pro.
Oh, that corpulent fella with glasses who talks in the video. Look how well-mannered he is; he can't hurt anyone. But Google still takes away all your data, and you will be forced out of your job.
The problem with experiencing LLM releases nowadays is that it is no longer trivial to understand the differences in their vast intelligences, so it takes a while to really get a handle on what's even going on.
A tad better, but it still has the same issues with unpacking and understanding complex prompts. I have a test of mine, and it now performs a bit better, but it still has zero understanding of what is happening and why. Gemini is the best of the best out there, but with complex problems it just goes down the drain :(.
every day, new game changer
No gemini-3-flash yet, right? Any ETA on that mentioned? 2.5-flash has been amazing in terms of cost/value ratio.
Is there even a puzzle or math problem Gemini 3 can't solve?
Still insists the G7 photo[0] is doctored, and comes up with wilder and wilder "evidence" to support that claim, before getting increasingly aggressive.
0: https://en.wikipedia.org/wiki/51st_G7_summit#/media/File:Pri...
It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh
Trained models should be able to use formal tools (for instance a logical solver, a computer?).
Good. That said, I wonder if those models are still LLMs.
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
I continue to not use Gemini as I can’t have my data not trained but also have chat history at the same time.
Yes, I know the Workspaces workaround, but that’s silly.
If it ain't a quantum leap, new models are just "OS updates".
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top-N sigma or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be capped at a maximum of 2.0, and there's still the hidden top_k of 64 that you can't turn off.
I love Google AI Studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions in a tool that is ostensibly aimed at prosumers...
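For context on what's being asked for: min-p keeps only the tokens whose probability is at least some fraction of the top token's probability. It is a few lines to implement when you own the logits; a sketch below, following my understanding of the min-p paper, and not anything Google exposes:

    import numpy as np

    def min_p_filter(logits: np.ndarray, p_base: float = 0.1) -> np.ndarray:
        """Keep tokens with prob >= p_base * max prob; renormalize."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        keep = probs >= p_base * probs.max()
        trimmed = np.where(keep, probs, 0.0)
        return trimmed / trimmed.sum()

    rng = np.random.default_rng(0)
    dist = min_p_filter(rng.normal(size=50))
    next_token = rng.choice(len(dist), p=dist)  # sample the next token id

The appeal is that the truncation adapts to the shape of the distribution, which is why it tolerates much higher temperatures than fixed top-k or top-p cutoffs.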
How's the pelican?
Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?
yea great.... when will I be able to have it dial a number on my google pixel? Seriously... Gemini absolutely sucks on pixel since it can't interact with the phone itself so it can't dial numbers.
It is pointless to ask an LLM to draw an ASCII unicorn these days. Gemini 3 draws one of these (depending on the prompt):
https://www.ascii-art.de/ascii/uvw/unicorn.txt
However, it is amazing how far spatial comprehension has improved in multimodal models.
I'm not sure the below would be properly displayed on HN; you'll probably need to cut and paste it into a text editor.
Prompt: Draw me an ASCII world map with tags or markings for the areas and special places.
Temperature: 1.85
Top-P 0.98
Answer: Edit (replaced with URL) https://justpaste.it/kpow3
Wild
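If you want to reproduce settings like the temperature and top-p above outside AI Studio, the same knobs are exposed through the API. A minimal sketch with the google-genai SDK, where the exact client and parameter names are my assumption of the current SDK shape:

    # Hedged sketch: same prompt and sampling settings via the API.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    resp = client.models.generate_content(
        model="gemini-3-pro-preview",
        contents="Draw me an ASCII world map with tags or markings "
                 "for the areas and special places.",
        config=types.GenerateContentConfig(temperature=1.85, top_p=0.98),
    )
    print(resp.text)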
The most devastating news out of this announcement is that Vending-Bench 2 came out, and it has significantly fewer clanker[0] meltdowns than the first one. I mean, seriously? Not even one run where the model tried to stock goods that hadn't arrived yet, only for it to eventually try and fail to shut down the business, and then e-mail the FBI about the $2 daily fee being deducted from the bot?
[0] Fake racial slur for a robot, LLM chatbot, or other automated system
boooooooooooooo
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
… agentic …
Meh, not interested already
The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.
"AI" benchmarks are and have consistently been lies and misinformation. Gemini is dead in the water.
Finally!
I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail is being read by Gemini and put into the training set for future models. Oh dear. And Google is already being sued over using Gemini to analyze users' data [1], which potentially includes Gmail by default.
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...
It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.
Also probably over for OpenAI
@simonw wen pelican
It's amazing to see Google take the lead while OpenAI worsens their product every release.
Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.
Valve could learn from Google here
It seems Google didn't prepare the Gemini 3 release well and leaked a lot of content, including the model card earlier today and Gemini 3 itself on aistudio.google.com.
It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and GPT-5.1 and Sonnet 4.5 (thinking) do not come close.
The token efficiency and context are also mindblowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes; GPT-5.1 feels particularly slow and weak in real-world applications larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and their proprietary TPU hardware, which in theory allows them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story. https://xcancel.com/xundecidability/status/19908286970881311...
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...