State of the Union's Open Source AI: How American Open-Weights Models Compare Globally

Published on July 3, 2026
Andrew Dugan

Senior AI Technical Content Creator II

Introduction

The United States produces some of the world’s most widely used open-weights models, spanning hyperscaler releases like Meta’s Llama and Google’s Gemma, hardware-tuned models from NVIDIA, small-but-capable models from Microsoft’s Phi team, and the fully transparent OLMo family from the nonprofit Allen Institute for AI. Together they range from the most openly documented models on earth to some of the most commercially restricted, and from 14-billion-parameter models that run on a laptop to 550-billion-parameter reasoning systems. The result is a uniquely diverse ecosystem with no single design philosophy holding it together.

In honor of the 250th anniversary of the United States, this article reviews the state of the United States’ open-weights large language models. We look at the top-performing options, analyze how open-weights LLMs from the US differ in architecture, and speculate on what kinds of improvements we might see in future open-source models. The goal is not to crown a single winner, but to map who builds open models in the US, how they build them, and where the ecosystem might be headed next.

Key Takeaways

  • American open-weights AI spans the full range from one of the most open model families in the world (Ai2’s OLMo) to the highest-scoring open model in the US (NVIDIA’s Nemotron 3 Ultra 550B), with strong self-hostable options like Gemma 4 31B in between.

  • American models are unique for their architectural diversity and are mostly distributed through incumbent platforms. Most of them lack techniques that are common abroad, such as Multi-head Latent Attention (MLA), auxiliary-loss-free Mixture-of-Experts (MoE) load balancing, and reasoning-focused pretraining. By contrast, Chinese labs converge on shared architectures, and labs outside the US and China, such as Mistral, compete on sovereignty and permissive licensing.

  • US open releases have historically trailed the labs’ internal proprietary frontier, but newer entrants are closing that gap. NVIDIA’s hybrid Mamba-MoE Nemotron 3 line now tops US open benchmarks on both quality and throughput.

The Best US Open-Weights Models by the Numbers

The tables below list verified benchmark results for the leading American open-weights models as of mid-2026, drawn from official model cards, technical reports, and the independent Artificial Analysis leaderboard.

Reasoning models and general-purpose models are listed in separate tables. A reasoning model with test-time chain-of-thought will generally beat a non-reasoning model on math and science benchmarks, so comparing the two directly would be unfair. Benchmark versions and conditions also vary: LiveCodeBench versions differ (v3 vs. v6), American Invitational Mathematics Examination (AIME) editions differ by year (2024/2025/2026), and most scores are vendor-reported from self-run evals. GPQA Diamond and LiveCodeBench are high-variance, so treat differences of a point or two as noise.
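
One way to apply the "point or two" rule when reading the tables is to treat any gap inside a fixed margin as a tie. The sketch below is only an illustration: the 2.0-point margin and the `compare` helper are our own assumptions, not a published methodology, and the example scores are the MMLU-Pro and GPQA Diamond figures listed for Nemotron 3 Ultra 550B and Gemma 4 31B.

```python
NOISE_MARGIN = 2.0  # assumption: "a point or two", taken at its upper end


def compare(score_a: float, score_b: float, margin: float = NOISE_MARGIN) -> str:
    """Return 'a', 'b', or 'tie' depending on whether the gap exceeds the margin."""
    gap = score_a - score_b
    if abs(gap) <= margin:
        return "tie"
    return "a" if gap > 0 else "b"


# Scores copied from the reasoning table: Nemotron 3 Ultra 550B vs. Gemma 4 31B.
print(compare(86.8, 85.2))  # MMLU-Pro: gap of 1.6
print(compare(87.0, 84.3))  # GPQA Diamond: gap of 2.7
```

Under this margin the MMLU-Pro gap reads as a tie, while the GPQA Diamond gap counts as a real lead for the first model. A stricter or looser margin would change that reading, which is the point of the caveat.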

Reasoning-Capable Open Models

Scores are reasoning-mode (“thinking on”) where the model supports a toggle. All figures are from official model cards or technical reports.

| Model (Lab) | Params (total/active) | MMLU-Pro | GPQA Diamond | LiveCodeBench | AIME 2025 | Context | License |
|---|---|---|---|---|---|---|---|
| Nemotron 3 Ultra 550B (NVIDIA) | 550B / 55B MoE | 86.8 | 87.0 | 89.0 (v6) | — † | 1M | OpenMDW-1.1 (weights + data) |
| Gemma 4 31B (Google) | 31B dense | 85.2 | 84.3 | 80.0 (v6) | 89.2 ('26) | 256K | Apache 2.0 |
| Nemotron 3 Super 120B (NVIDIA) | 120B / 12B MoE | 83.7 | 79.2 | 81.2 | 90.2 | 1M | NVIDIA Open Model |
| OLMo 3 32B Think (Ai2) | 32B dense | — ‡ | 58.1 | 83.5 (v3) | 72.5 | 65K | Apache 2.0 (full stack) |
| Llama-Nemotron Ultra 253B (NVIDIA) | 253B dense |  | 76.0 | 66.3 | 72.5 | 128K | NVIDIA Open Model |
| Hermes 4 405B (Nous Research) | 406B dense | 80.5 | 70.5 | 61.3 (v6) | 78.1 | ~40K | Llama 3.1 |
| Nemotron 3 Nano 30B (NVIDIA) | 30B / ~3B MoE | 78.1 | 72.5 | 67.6 | 87.7 | 1M | NVIDIA Open Model |
| Phi-4-reasoning-plus (Microsoft) | 14B dense | 76.0 | 68.9 | 53.1 | 78.0 | 32K | MIT |

† The Nemotron 3 Ultra model card reports GPQA, MMLU-Pro, LiveCodeBench, SWE-Bench Verified (70.7), and RULER-1M (94.7) but not a standalone AIME 2025 score; on tool-augmented olympiad math (IMOAnswerBench) it scores 92.3.

‡ OLMo 3 32B Think reports standard MMLU 85.4 and MATH 96.1 rather than MMLU-Pro. It is the strongest fully open (weights + code + data) reasoning model.

General-Purpose (Non-Reasoning) Open Models

| Model (Lab) | Params (total/active) | MMLU-Pro | GPQA Diamond | LiveCodeBench | Throughput | Context | License |
|---|---|---|---|---|---|---|---|
| Llama 4 Maverick (Meta) | 400B/17B | 80.5 | 69.8 | 43.4 | ~108 tok/s | 1M | Llama 4 Community |
| Llama 4 Scout (Meta) | 109B/17B | 74.3 | 57.2 | 32.8 | ~95 tok/s | 10M | Llama 4 Community |
| Phi-4 (Microsoft) | 14B dense | 70.4 | 56.1 |  |  | 16K | MIT |
| Gemma 3 27B (Google) | 27B dense | 67.5 | 42.4 | 29.7 |  | 128K | Gemma |
| DBRX Instruct (Databricks) | 132B/36B | — ‡ |  |  | ~150 tok/s | 32K | Databricks Open |

‡ DBRX (March 2024) predates MMLU-Pro/GPQA becoming standard; it reports MMLU 73.7, HumanEval 70.1, and GSM8K 66.9. It is included as a size and speed reference point, not a current-quality contender.

Summarizing the Numbers

  • Best overall (mid-2026): NVIDIA’s Nemotron 3 Ultra 550B. It tops MMLU-Pro (86.8), GPQA Diamond (87.0), and LiveCodeBench (89.0) among American open-weights models. It is also one of the most open flagships, released under OpenMDW-1.1 with training data and post-training recipes. The trade-off is serving cost: with 550B total parameters (55B active), it needs an 8×H100-class node to serve.
  • Best you can self-host: Google’s Gemma 4 31B. It leads on AIME, runs on a single high-end GPU, and ships under Apache 2.0. For most builders, it is the strongest practical American open model.
  • Best efficiency: NVIDIA’s Nemotron 3 Nano 30B. With only ~3B active parameters (MoE), it posts MMLU-Pro 78.1 and GPQA 72.5. It outscores dense models many times its active size.
  • Best small model: Microsoft’s Phi-4-reasoning-plus (14B). Its GPQA Diamond and AIME numbers are competitive with models 5–40x its size.
  • Best fully open: Ai2’s OLMo 3 32B Think. Class-leading MATH (96.1) and strong coding scores. Not the top scorer, but a fully open release with weights, training code, full data, and hundreds of intermediate checkpoints.
  • Fastest / longest context: Meta’s Llama 4. No reasoning variant and mid-tier scores, but Scout’s 10M-token context is the largest of any open model, and both variants are tuned for high-throughput serving at scale.
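
The hardware claims in this list come down to simple arithmetic on weight storage. The sketch below is a rough estimate under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead), assumes 80 GB per GPU, and uses helper names (`weight_memory_gb`, `fits`) that are ours for illustration. A Mixture-of-Experts model has to keep all of its total parameters in memory, even though only the active ones are used per token.

```python
# Bytes per parameter for common weight formats (weights only; KV cache,
# activations, and runtime overhead are deliberately ignored).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(total_params_billions: float, fmt: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * BYTES_PER_PARAM[fmt]


def fits(total_params_billions: float, fmt: str, gpus: int,
         gb_per_gpu: float = 80.0) -> bool:
    """True if the weights alone fit in the aggregate GPU memory."""
    return weight_memory_gb(total_params_billions, fmt) <= gpus * gb_per_gpu


# Parameter counts from the tables; GPU counts from the bullets above.
for name, params, gpus in [("550B MoE", 550, 8), ("31B dense", 31, 1)]:
    for fmt in BYTES_PER_PARAM:
        gb = weight_memory_gb(params, fmt)
        print(f"{name} {fmt}: {gb:.0f} GB, fits on {gpus}x80GB: {fits(params, fmt, gpus)}")
```

By this estimate, a 550B-parameter model fits on eight 80 GB GPUs at 8-bit or 4-bit precision but not at 16-bit, while a 31B dense model fits on one such GPU even at 16-bit. Real deployments need extra headroom for the KV cache, so treat these as lower bounds.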

What Defines American Open-Weights AI

American open-weights AI is defined by a collection of divergent bets made by large technology incumbents, a chip vendor, and a few research nonprofits. The defining trait is architectural diversity without consensus. Grouped Query Attention (GQA) is the most common building block, but beyond it American labs pursue very different designs in parallel: Mamba-2 state space models, Meta’s interleaved Rotary Position Embedding (iRoPE) attention, NVIDIA’s LatentMoE, and various layer-wise scaling schemes, with no shared design philosophy connecting them. NVIDIA is the clearest example of a lab pushing its own direction. It co-evolves model and silicon, pretraining Nemotron in the NVIDIA FP4 (NVFP4) 4-bit format and building hardware-aware hybrid Mamba-2 architectures around its own GPUs.

How these models acquire their capabilities also differs from the Chinese approach. American labs have historically treated reasoning as a post-training problem, grafting it on through supervised fine-tuning and reinforcement learning rather than embedding it in pretraining. And because the largest players are platform companies, distribution is a structural advantage no independent lab has matched yet. Meta’s multi-billion-user footprint gives Llama real-world reach far beyond its benchmark standing.

Openness is where the ecosystem has the most contradictions. In past years, the pattern seemed to be “open after proprietary,” with open weights trailing a lab’s internal frontier product by a generation. NVIDIA’s Nemotron 3 line has recently upended that, shipping open weights (and, for the Ultra model, training data) that top the US benchmark tables outright. The Allen Institute’s OLMo family goes further still, releasing weights, training code, full training data, and intermediate checkpoints together, making it one of the most completely open releases anywhere. Yet the same country produces the most restricted licenses too, from Llama’s monthly-active-user cap to DBRX’s no-compete clause. The one throughline is efficiency at small scale. Microsoft’s 14-billion-parameter Phi-4-reasoning matches models many times its size, and Apple’s OpenELM has advanced layer-wise efficiency research.

What Defines Chinese Open-Weights AI

Where American labs diverge, leading Chinese labs have converged. Multi-head Latent Attention (MLA), first introduced by DeepSeek, has since been adopted by Moonshot’s Kimi K2 and, as of GLM-5, Zhipu’s GLM line (earlier GLM versions used GQA). Several of these models also share a fine-grained Mixture-of-Experts design along with DeepSeek’s auxiliary-loss-free load balancing.
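
To make auxiliary-loss-free load balancing concrete, here is a minimal toy sketch of the idea as we understand it: each expert carries a bias that is added to its routing score only when picking the top-k experts, and that bias is nudged down for overloaded experts and up for underloaded ones after every batch. The gate weights still come from the raw scores. Everything here (function names, the skewed random scores, the step size) is an illustrative assumption, not DeepSeek's implementation.

```python
import random


def route(scores, bias, k):
    """Select top-k experts by (score + bias); gate weights use raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    gates = [scores[i] / total for i in chosen]
    return chosen, gates


def update_bias(bias, load, step):
    """Nudge overloaded experts down and underloaded experts up by a fixed step."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(n_experts=8, k=2, tokens=256, batches=100, step=0.01, seed=0):
    """Return the busiest expert's token count for each batch."""
    rng = random.Random(seed)
    skew = [0.5, 0.3] + [0.0] * (n_experts - 2)  # experts 0 and 1 start out "popular"
    bias = [0.0] * n_experts
    max_load_per_batch = []
    for _batch in range(batches):
        load = [0] * n_experts
        for _token in range(tokens):
            # Toy affinities: strictly positive random scores plus a fixed skew.
            scores = [0.01 + rng.random() + s for s in skew]
            chosen, _gates = route(scores, bias, k)
            for i in chosen:
                load[i] += 1
        max_load_per_batch.append(max(load))
        bias = update_bias(bias, load, step)
    return max_load_per_batch


history = simulate()
print("busiest expert, first batch:", history[0])
print("busiest expert, last batch: ", history[-1])
```

With perfectly even routing, each of the 8 experts would see 64 of the 512 token-to-expert assignments per batch. The busiest expert starts well above that and drifts toward it as the biases adapt, without any balancing term being added to the training loss.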

Their training philosophy is also distinctive. Chinese labs increasingly treat reasoning as a pretraining target, dedicating whole “stage 2” pretraining phases to elevated math, code, and STEM data instead of relying on post-training alone. Some also build self-sustaining synthetic data loops. Alibaba, for instance, used specialized Qwen2.5-Math and Qwen2.5-Coder models to generate synthetic training data for Qwen3, reducing dependence on proprietary API teachers. And Chinese models appear to do it cheaply. DeepSeek reports that V3 was trained on 14.8 trillion tokens for roughly $5.6 million, which, if it holds up, would make it one of the most cost-efficient frontier training runs on record.
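
The headline cost figure can be reproduced from the numbers in the DeepSeek-V3 technical report, which lists GPU-hours per training stage and assumes a rental price of $2 per H800 GPU-hour. The short calculation below only restates that arithmetic. It covers the final training run, not prior research, ablations, staff, or hardware purchases, so it is a lower bound on what the model cost to develop.

```python
# H800 GPU-hours per stage, as reported in the DeepSeek-V3 technical report.
GPU_HOURS = {
    "pretraining": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0  # the report's assumed rental price
TRAINING_TOKENS = 14.8e12      # 14.8 trillion pretraining tokens

total_gpu_hours = sum(GPU_HOURS.values())
total_cost_usd = total_gpu_hours * RENTAL_USD_PER_GPU_HOUR
usd_per_billion_tokens = total_cost_usd / (TRAINING_TOKENS / 1e9)

print(f"total GPU-hours: {total_gpu_hours:,}")
print(f"estimated cost:  ${total_cost_usd:,.0f}")
print(f"cost per billion tokens: ${usd_per_billion_tokens:,.0f}")
```

The total comes to 2.788 million GPU-hours and about $5.58 million, which is where the "roughly $5.6 million" figure comes from.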

Qwen leads global Hugging Face downloads, and DeepSeek leads open-weight reasoning leaderboards, trailing only the strongest closed models. The limit is transparency. These labs publish strong results and detailed arXiv papers, but they release weights only. None of the major frontier Chinese labs publish training code, training data, or intermediate checkpoints.

What Defines European and Other Global Models

Outside the US and China, open-weights work seems to be driven as much by sovereignty and language coverage as by capability. France’s Mistral is the most frontier-competitive player, and its largest models now ship under Apache 2.0, a rarity at that scale. For most European efforts, though, the motivation is reducing dependence on American and Chinese platforms rather than beating them outright. That goal is supported by public money, most notably EuroHPC’s AI Factories, which give small and medium-sized enterprises (SMEs) and nonprofits free GPU access. In one striking case, a Latvian translation company trained a 30-billion-parameter model using entirely subsidized compute, something with no US equivalent.

The rest of the field tends to fill gaps the giants ignore. Multilingual coverage is a recurring theme. OpenEuroLLM spans the EU’s 24 official languages, Singapore’s SEA-LION covers Southeast Asian languages, and India’s Sarvam handles 22 Indian languages. Licensing approaches vary widely: Canada’s Cohere releases its Command A models for research under a non-commercial (CC-BY-NC) license, requiring a separate agreement for commercial use. Several one-time contenders have simply retreated. Germany’s Aleph Alpha left the frontier race for enterprise sovereignty software, and the UAE’s Falcon pulled back to far smaller models.

What Comes Next for American Open-Source AI

Memory-efficient attention, whether DeepSeek-style MLA or the hybrid Mamba-Transformer designs NVIDIA is now shipping, is on track to become standard because it cuts inference cost without sacrificing quality. Reasoning is likely to move earlier in the pipeline, treated as a pretraining objective rather than a post-training patch, and a single checkpoint will increasingly serve both a “thinking” and a “fast” mode instead of a lab shipping two separate models. More of the full stack, including training data, code, and even training-cost disclosures, could ship alongside weights.
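
The inference-cost argument for memory-efficient attention is mostly about the key-value (KV) cache, which grows with every token in the context. The sketch below compares per-token cache size for standard multi-head attention, GQA, and an MLA-style compressed latent. The layer count, head counts, and latent sizes are illustrative assumptions loosely modeled on DeepSeek-V3's published configuration. They are not measurements of any model in the tables above, and the function names are ours.

```python
def kv_bytes_per_token_gqa(n_layers: int, n_kv_heads: int, head_dim: int,
                           bytes_per_value: int = 2) -> int:
    """Cache one key and one value vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_bytes_per_token_mla(n_layers: int, latent_dim: int, rope_dim: int,
                           bytes_per_value: int = 2) -> int:
    """Cache one compressed latent plus one shared RoPE key per layer."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_value


# Illustrative shapes (assumptions): 61 layers, 128 query heads of dimension 128.
LAYERS, HEADS, HEAD_DIM = 61, 128, 128

mha = kv_bytes_per_token_gqa(LAYERS, n_kv_heads=HEADS, head_dim=HEAD_DIM)  # MHA = one KV head per query head
gqa = kv_bytes_per_token_gqa(LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)
mla = kv_bytes_per_token_mla(LAYERS, latent_dim=512, rope_dim=64)

for name, size in [("MHA", mha), ("GQA (8 KV heads)", gqa), ("MLA-style latent", mla)]:
    print(f"{name}: {size / 1024:.0f} KiB per token, {size * 128_000 / 1e9:.1f} GB at 128K context")
```

Under these assumptions, the MLA-style cache is several times smaller than GQA and dozens of times smaller than full multi-head attention at 16-bit precision. The hybrid Mamba-Transformer designs pursue the same goal differently: state space layers keep a fixed-size state instead of a cache that grows with context length.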

The larger opportunity is organizational rather than technical. Almost every major American open model is a side output of a company whose real product is something else, which means openness is always secondary to a proprietary roadmap. The Allen Institute’s OLMo shows how much a genuinely open-source-first organization can accomplish, but it’s practically alone. There is room in the US for more organizations whose primary mission is the open release itself, especially in the underserved 30-to-70-billion-parameter range where no fully open, architecturally modern, data-released American model yet exists.

Finally, not every opportunity is a bigger model. Some of the most valuable open-source work will be small, hyper-specific, compact models tuned for a single domain, plus the routers, model-selection architectures, and specialized verifiers for speculative decoding. These systems reward openness, because they need to be inspected, fine-tuned, and freely composed, and they play directly to America’s demonstrated strength in small-model efficiency. A future US open-source ecosystem may compete less on owning the single biggest model and more on offering a rich toolkit of small, interoperable, purpose-built ones.

Common Questions

Is American open-source AI behind China?

Not on openness, and not uniformly on capability. The US produces both one of the most open model families in the world (Ai2’s OLMo) and some of the most widely used open models (Llama, Gemma); however, Qwen (China) now leads global downloads and derivatives. Where American open models lag is top-line benchmark performance. The highest-scoring open-weights models are currently Chinese.

What is the most open American model?

Ai2’s OLMo family (OLMo 2 and the newer OLMo 3). It releases weights, training code, the full training dataset, and hundreds of intermediate checkpoints together under a permissive Apache 2.0 license. This is among the most complete open releases anywhere. The majority of “open” models release weights only.

Mistral has offices in the U.S. Is Mistral an American open source project?

No. Mistral is a French company headquartered in Paris. Its US presence is a sales and operational office. Model development is controlled by the French parent. It is the strongest near-frontier open-weights lab outside the US and China.

Why don’t American labs use MLA?

A mix of timing, infrastructure lock-in, hardware privilege, and mission. Meta’s architecture predates MLA’s maturity. NVIDIA chose a competing approach. Google and Microsoft face serving-stack constraints. Underlying all of it, only Ai2 treats open weights as its primary mission, so most US labs adopt architecture on product timelines.

Has NVIDIA’s Mamba-2 bet paid off?

Increasingly, yes. The throughput advantage is well established, and with the Nemotron 3 line the hybrid Mamba-2 design now reaches near-frontier quality. Nemotron 3 Ultra tops the US open-weights benchmarks. What it has not shown is a quality advantage over the best pure-attention models. The top Chinese models still lead the composite leaderboards, and NVIDIA still keeps a small fraction of full attention layers rather than going pure state-space.

Conclusion

Two hundred and fifty years in, the state of American open-source AI is one of genuine leadership paired with self-imposed lag. Many companies and much of the talent in the U.S. are focused on leading the world in closed models, leaving open-weights work with less focus and attention. The US hosts some of the most open models in the world and some of the most widely used ones, but its strongest labs increasingly reserve their frontier work for closed products. Meta’s 2026 pivot to the proprietary Muse Spark is the clearest example.

The most interesting opportunities are structural rather than incremental. An open-source-first organization, building MLA-first at 30B+ scale, with reasoning in pretraining and the full training stack released, would close most of the gaps identified here at once. The techniques are already public. What is missing is an American lab whose primary mission is to use them.

About the author

Andrew Dugan

Andrew is an NLP Scientist with 8 years of experience designing and deploying enterprise AI applications and language processing systems.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.