Gemini 3 Pro Model Card [pdf]

scrlk

Benchmarks from page 4 of the model card:

    | Benchmark             | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |-----------------------|-----------|---------|------------|-----------|
    | Humanity's Last Exam  | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2             | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond          | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025             |           |         |            |           |
    |   (no tools)          | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |   (code execution)    | 100%      | -       | 100%       | -         |
    | MathArena Apex        | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro              | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro        | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning     | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5      | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU            | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro     | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0    | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified    | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | t2-bench              | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2       | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified     | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                  | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA           | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle)    |           |         |            |           |
    |   (128k avg)          | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |   (1M pointwise)      | 26.3%     | 16.4%   | n/s        | n/s       |

n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly

show comments

mynti

It is interesting that the Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding

show comments

Taek

One benchmark I would really like to see: instruction adherence.

For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.

The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.

If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.

I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.

show comments

transcriptase

There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.

show comments

embedding-shape

Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com` which obviously fails...

Anyone happen to know why? Is this website by any change sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?

show comments

lxdlam

What does the "Google Antigravity" mean? The link is http://antigravity.google/docs, seemingly a new product but now routing to the Google main page.

show comments

meetpateltech

it was accidentally pushed a little early, and now it has been taken down.

here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...

patates

It says it's been trained from scratch. I wonder if it will have the same undescribable magic that makes me spend an hour every day with 2.5. I really love the results I can get with 2.5 pro. Google eventually limiting aistudio will be a sad day.

Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.

show comments

bemmu

I saw this on Reddit earlier today. Over there the source of this file was given as: https://web.archive.org/web/20251118111103/https://storage.g...

The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.

show comments

laborcontract

It's hilarious that the release of Gemini 3 is getting eclipsed by this cloudflare outage.

show comments

denysvitali

Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.

Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.

Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)

show comments

Bobaso

Interesting to see on page 2 the reference to ML pathways [1]. Looks like a multi layer mixture of experts. Is this common ?

[1] https://blog.google/technology/ai/introducing-pathways-next-...

show comments

ethmarks

> TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs.

That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.

show comments

fraboniface

> Developments to the model architecture contribute to the significantly improved performance from previous model families.

I wonder how significant this is. DeepMind was always more research-oriented that OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).

mohsen1

     This model is not a modification or a fine-tune of a prior model

Is that common to mention that? Feels like they built something from scratch

show comments

Topfi

Additional context from AI Studio including pricing:

Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities

<=200K tokens • Input: $2,00 / Output: $12,00

> 200K tokens • Input: $4,00 / Output: $18,00

Knowledge cut off: Jan. 2025

show comments

gardnr

Gemini 3 Deep Think gets 45.1% on ARG-AGI-2

Gemini 3 Pro gets 31.1% on ARG-AGI-2

https://arcprize.org/leaderboard

aliljet

What's wild here is that among every single score they've absolutely killed, somehow, Anthropic and Claude Sonnet 4.5 have won a single victory in the fight: SWE Bench Verified and only by a singular point.

I already enjoy Gemini 2.5 pro for planning and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait and switch on pricing and usage, so happy to see Google take the crown here.

show comments

ks2048

Why is this linking to a random site? Here is a link hosted by Google:

https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

TheAceOfHearts

They scored a 31.1% on ARC AGI 2 which puts them in first place.

Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.

show comments

Palmik

Archive link: https://web.archive.org/web/20251118111103/https://storage.g...

lifthrasiir

For the veracity of the link itself: https://storage.googleapis.com/deepmind-media/* has been used by DeepMind itself (e.g. "View tech report" in https://deepmind.google/models/gemini/) so it is a genuine leak.

koakuma-chan

> Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs)

NVDA is down 3.26%

show comments

msp26

Is flash/flash lite releasing alongside pro? Those two tiers have been incredible for the price since 2.0, absolute workhorses. Can't wait for 3.0.

oalessandr

Trying to open this link from Italy leads to a CSAM warning

show comments

robert-zaremba

The strategic move to use TPU rather than Nvidia is paying well for Google. They are able to better utilize their existing large infrastructure, but also specialize the processes and pipelines for their own framework that they use to create and train models.

I think a specialized hardware for training models is the next big wave in China.

__jl__

API pricing is up to $2/M for input and $12/M for output

For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output Gemini 1.5 Pro was $1.25/M for input and $5/M for output

charcircuit

>TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs

Who is training LLMs with CPUs?

eric15342335

Update: it is available at https://aistudio.google.com now!

amelius

These model cards tell me nothing. I want to know the exact data a model was trained on. Otherwise, how can I safely use it for generating texts that I show to children? Etc.etc.

show comments

butlike

It's over. I just don't care anymore. I don't care what a pro model card is. I don't care what a humanity's last exam is. I don't care if the response makes me feel good about the prompt I made. I don't care if it's sentient. I don't care if it's secretly sentient. I don't care if it's just a machine. I don't care if the gov't has appropriated a secret model. I don't care if this is the precursor to AGI, ASI, AGGI, AGGSISGIGIG....I just. Don't. care.

And I really don't think I'm alone in this.

bretpiatt

Page 5, "The knowledge cutoff date for Gemini 3 Pro was January 2025."

Still taking nearly a year to train and run post training safety and stability tuning.

With 10x the infrastructure they could iterate much faster, I don't see AI infrastructure as a bubble, it is still a bottleneck on pace of innovation at today's active deployment level.

show comments

nilayj

Curious to see the API pricing. SOTA performance across tasks at a price cheaper than GPT 5 / Claude would make mostly everyone switch to Gemini.

show comments

surrTurr

gone now;

wayback machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...

surrTurr

good benchmark stats except for coding where it looks similar to other SOTA models

aurareturn

Benchmark suggests it is a resounding win for Gemini 3 Pro as the top model.

wiz21c

SWE-Bench is disappointing not because it is lower than Claude, but because improving on all other domains of knowledge didn't help. So does this mean that this is actually a MoE model in the sense that one expert doesn't talk to the other ?

fcanesin

Great stuff, now if could please do gemini-2.5-pro-code that would be great

rvz

> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.

Well don't complain when you are using Gmail and your emails are being trained to develop Gemini.

show comments

Barry-Perkins

Excited to see the Gemini 3 Pro Model Card! Looking forward to exploring its features and capabilities.

DeathArrow

I hope cheaper Chinese open weights models as good as Gemini will come soon. Gemini, Claude, GPT are kind of expensive if you use AI a lot.

827a

What is Google Antigravity?

Traubenfuchs

So does google actually have a claude console alternative currently?

show comments

danielcampos93

mums the word on Flash?

catigula

I know this is a little controversial but the lack of performance on SWE-bench is hugely disappointing I think economically. These models don’t have any viable path to profitability if they can’t take engineering jobs.

show comments

omidsa1

TL;DR: expected results, not underwhelming.So far scaling laws hold.

margorczynski

If these numbers are true then OpenAI is probably done, Anthropic too. Still, it's hard to see an effective monetization method for this tech and it clearly is eating Google's main pie which is search.

show comments

jll29

Hopefully this model does not generate fake news...

https://www.google.com/search?q=gemini+u.s.+senator+rape+all...