EDIT: formatting, hopefully a bit more mobile friendly
show comments
mynti
It is interesting that the Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding
show comments
Taek
One benchmark I would really like to see: instruction adherence.
For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.
The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.
If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.
I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
show comments
transcriptase
There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
show comments
embedding-shape
Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com` which obviously fails...
Anyone happen to know why? Is this website by any change sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
show comments
lxdlam
What does the "Google Antigravity" mean? The link is http://antigravity.google/docs, seemingly a new product but now routing to the Google main page.
show comments
meetpateltech
it was accidentally pushed a little early, and now it has been taken down.
It says it's been trained from scratch. I wonder if it will have the same undescribable magic that makes me spend an hour every day with 2.5. I really love the results I can get with 2.5 pro. Google eventually limiting aistudio will be a sad day.
Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.
The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.
show comments
laborcontract
It's hilarious that the release of Gemini 3 is getting eclipsed by this cloudflare outage.
show comments
denysvitali
Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.
Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.
> TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs.
That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.
show comments
fraboniface
> Developments to the model architecture contribute to the significantly improved performance from previous model families.
I wonder how significant this is. DeepMind was always more research-oriented that OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).
mohsen1
This model is not a modification or a fine-tune of a prior model
Is that common to mention that? Feels like they built something from scratch
show comments
Topfi
Additional context from AI Studio including pricing:
Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities
What's wild here is that among every single score they've absolutely killed, somehow, Anthropic and Claude Sonnet 4.5 have won a single victory in the fight: SWE Bench Verified and only by a singular point.
I already enjoy Gemini 2.5 pro for planning and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so happy to see Google take the crown here.
show comments
ks2048
Why is this linking to a random site? Here is a link hosted by Google:
They scored a 31.1% on ARC AGI 2 which puts them in first place.
Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.
> Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs)
NVDA is down 3.26%
show comments
msp26
Is flash/flash lite releasing alongside pro? Those two tiers have been incredible for the price since 2.0, absolute workhorses. Can't wait for 3.0.
oalessandr
Trying to open this link from Italy leads to a CSAM warning
show comments
robert-zaremba
The strategic move to use TPU rather than Nvidia is paying well for Google. They are able to better utilize their existing large infrastructure, but also specialize the processes and pipelines for their own framework that they use to create and train models.
I think a specialized hardware for training models is the next big wave in China.
__jl__
API pricing is up to $2/M for input and $12/M for output
For comparison:
Gemini 2.5 Pro was $1.25/M for input and $10/M for output
Gemini 1.5 Pro was $1.25/M for input and $5/M for output
charcircuit
>TPUs are
specifically designed to handle the massive computations involved in training LLMs and can speed up
training considerably compared to CPUs
These model cards tell me nothing. I want to know the exact data a model was trained on. Otherwise, how can I safely use it for generating texts that I show to children? Etc.etc.
show comments
butlike
It's over. I just don't care anymore. I don't care what a pro model card is. I don't care what a humanity's last exam is. I don't care if the response makes me feel good about the prompt I made. I don't care if it's sentient. I don't care if it's secretly sentient. I don't care if it's just a machine. I don't care if the gov't has appropriated a secret model. I don't care if this is the precursor to AGI, ASI, AGGI, AGGSISGIGIG....I just. Don't. care.
And I really don't think I'm alone in this.
bretpiatt
Page 5, "The knowledge cutoff date for Gemini 3 Pro was January 2025."
Still taking nearly a year to train and run post training safety and stability tuning.
With 10x the infrastructure they could iterate much faster, I don't see AI infrastructure as a bubble, it is still a bottleneck on pace of innovation at today's active deployment level.
show comments
nilayj
Curious to see the API pricing. SOTA performance across tasks at a price cheaper than GPT 5 / Claude would make mostly everyone switch to Gemini.
good benchmark stats except for coding where it looks similar to other SOTA models
aurareturn
Benchmark suggests it is a resounding win for Gemini 3 Pro as the top model.
wiz21c
SWE-Bench is disappointing not because it is lower than Claude, but because improving on all other domains of knowledge didn't help. So does this mean that this is actually a MoE model in the sense that one expert doesn't talk to the other ?
fcanesin
Great stuff, now if could please do gemini-2.5-pro-code that would be great
rvz
> The training dataset also includes: publicly available datasets that are readily downloadable; data
obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data
collected from users of Google products and services to train AI models, along with user interactions
with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific
policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or
generates in the course of its business operations, or directly from its workforce; and AI-generated
synthetic data.
Well don't complain when you are using Gmail and your emails are being trained to develop Gemini.
show comments
Barry-Perkins
Excited to see the Gemini 3 Pro Model Card! Looking forward to exploring its features and capabilities.
DeathArrow
I hope cheaper Chinese open weights models as good as Gemini will come soon. Gemini, Claude, GPT are kind of expensive if you use AI a lot.
827a
What is Google Antigravity?
Traubenfuchs
So does google actually have a claude console alternative currently?
show comments
danielcampos93
mums the word on Flash?
catigula
I know this is a little controversial but the lack of performance on SWE-bench is hugely disappointing I think economically. These models don’t have any viable path to profitability if they can’t take engineering jobs.
show comments
omidsa1
TL;DR: expected results, not underwhelming.So far scaling laws hold.
margorczynski
If these numbers are true then OpenAI is probably done, Anthropic too.
Still, it's hard to see an effective monetization method for this tech and it clearly is eating Google's main pie which is search.
show comments
jll29
Hopefully this model does not generate fake news...
Benchmarks from page 4 of the model card:
n/s = not supportedEDIT: formatting, hopefully a bit more mobile friendly
It is interesting that the Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding
One benchmark I would really like to see: instruction adherence.
For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.
The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.
If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.
I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.
Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com` which obviously fails...
Anyone happen to know why? Is this website by any change sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
What does the "Google Antigravity" mean? The link is http://antigravity.google/docs, seemingly a new product but now routing to the Google main page.
it was accidentally pushed a little early, and now it has been taken down.
here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...
It says it's been trained from scratch. I wonder if it will have the same undescribable magic that makes me spend an hour every day with 2.5. I really love the results I can get with 2.5 pro. Google eventually limiting aistudio will be a sad day.
Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.
I saw this on Reddit earlier today. Over there the source of this file was given as: https://web.archive.org/web/20251118111103/https://storage.g...
The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.
It's hilarious that the release of Gemini 3 is getting eclipsed by this cloudflare outage.
Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.
Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.
Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)
Interesting to see on page 2 the reference to ML pathways [1]. Looks like a multi layer mixture of experts. Is this common ?
[1] https://blog.google/technology/ai/introducing-pathways-next-...
> TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs.
That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.
> Developments to the model architecture contribute to the significantly improved performance from previous model families.
I wonder how significant this is. DeepMind was always more research-oriented that OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).
Additional context from AI Studio including pricing:
Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities
<=200K tokens • Input: $2,00 / Output: $12,00
> 200K tokens • Input: $4,00 / Output: $18,00
Knowledge cut off: Jan. 2025
Gemini 3 Deep Think gets 45.1% on ARG-AGI-2
Gemini 3 Pro gets 31.1% on ARG-AGI-2
https://arcprize.org/leaderboard
What's wild here is that among every single score they've absolutely killed, somehow, Anthropic and Claude Sonnet 4.5 have won a single victory in the fight: SWE Bench Verified and only by a singular point.
I already enjoy Gemini 2.5 pro for planning and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so happy to see Google take the crown here.
Why is this linking to a random site? Here is a link hosted by Google:
https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
They scored a 31.1% on ARC AGI 2 which puts them in first place.
Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.
Archive link: https://web.archive.org/web/20251118111103/https://storage.g...
For the veracity of the link itself: https://storage.googleapis.com/deepmind-media/* has been used by DeepMind itself (e.g. "View tech report" in https://deepmind.google/models/gemini/) so it is a genuine leak.
> Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs)
NVDA is down 3.26%
Is flash/flash lite releasing alongside pro? Those two tiers have been incredible for the price since 2.0, absolute workhorses. Can't wait for 3.0.
Trying to open this link from Italy leads to a CSAM warning
The strategic move to use TPU rather than Nvidia is paying well for Google. They are able to better utilize their existing large infrastructure, but also specialize the processes and pipelines for their own framework that they use to create and train models.
I think a specialized hardware for training models is the next big wave in China.
API pricing is up to $2/M for input and $12/M for output
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output Gemini 1.5 Pro was $1.25/M for input and $5/M for output
>TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs
Who is training LLMs with CPUs?
Update: it is available at https://aistudio.google.com now!
These model cards tell me nothing. I want to know the exact data a model was trained on. Otherwise, how can I safely use it for generating texts that I show to children? Etc.etc.
It's over. I just don't care anymore. I don't care what a pro model card is. I don't care what a humanity's last exam is. I don't care if the response makes me feel good about the prompt I made. I don't care if it's sentient. I don't care if it's secretly sentient. I don't care if it's just a machine. I don't care if the gov't has appropriated a secret model. I don't care if this is the precursor to AGI, ASI, AGGI, AGGSISGIGIG....I just. Don't. care.
And I really don't think I'm alone in this.
Page 5, "The knowledge cutoff date for Gemini 3 Pro was January 2025."
Still taking nearly a year to train and run post training safety and stability tuning.
With 10x the infrastructure they could iterate much faster, I don't see AI infrastructure as a bubble, it is still a bottleneck on pace of innovation at today's active deployment level.
Curious to see the API pricing. SOTA performance across tasks at a price cheaper than GPT 5 / Claude would make mostly everyone switch to Gemini.
gone now;
wayback machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
good benchmark stats except for coding where it looks similar to other SOTA models
Benchmark suggests it is a resounding win for Gemini 3 Pro as the top model.
SWE-Bench is disappointing not because it is lower than Claude, but because improving on all other domains of knowledge didn't help. So does this mean that this is actually a MoE model in the sense that one expert doesn't talk to the other ?
Great stuff, now if could please do gemini-2.5-pro-code that would be great
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
Well don't complain when you are using Gmail and your emails are being trained to develop Gemini.
Excited to see the Gemini 3 Pro Model Card! Looking forward to exploring its features and capabilities.
I hope cheaper Chinese open weights models as good as Gemini will come soon. Gemini, Claude, GPT are kind of expensive if you use AI a lot.
What is Google Antigravity?
So does google actually have a claude console alternative currently?
mums the word on Flash?
I know this is a little controversial but the lack of performance on SWE-bench is hugely disappointing I think economically. These models don’t have any viable path to profitability if they can’t take engineering jobs.
TL;DR: expected results, not underwhelming.So far scaling laws hold.
If these numbers are true then OpenAI is probably done, Anthropic too. Still, it's hard to see an effective monetization method for this tech and it clearly is eating Google's main pie which is search.
Hopefully this model does not generate fake news...
https://www.google.com/search?q=gemini+u.s.+senator+rape+all...