How 2025 ends with an open-source surge, sovereign models, and a hard look at AI’s physical footprint.
In the first days of December 2025, the generative-AI race feels less like a sprint toward a single “god model” and more like a branching river delta.
At Amazon’s re:Invent conference in Las Vegas, the company unveiled Nova 2, a second generation of frontier models, alongside Nova Forge, a tool that lets customers inject their own data during multiple stages of training—including the base-model pretraining phase usually reserved for elite labs.¹ It is a quiet but radical proposal: instead of renting a sealed model by the token, enterprises can begin to shape frontier-scale systems in their own image.
Across the Atlantic, Paris-based Mistral AI opened December by releasing Mistral 3, a family of dense small models and a new mixture-of-experts flagship, Mistral Large 3, with 41 billion “active” parameters, 675 billion total parameters, and a 256,000-token context window—released under Apache 2.0 and accompanied by an NVIDIA partnership promising deployment from cloud to edge.² A day later, Reuters reported HSBC will self-host Mistral’s models in a multi-year deal to automate financial analysis, translation, and risk workflows—a European open-weights lab powering one of the world’s biggest banks.³
In San Francisco, Anthropic marked December with a different kind of move: acquiring the developer-tool startup Bun to strengthen Claude Code, its AI coding assistant that has already reached a $1 billion annualised revenue run rate, while digesting a recent commitment of up to $15 billion from Microsoft and Nvidia and a valuation north of $180 billion.⁴ The acquisition folds an opinionated runtime, package manager, and bundler into the Claude ecosystem—another signal that the frontier race is no longer just about model weights but the plumbing around them.
Meanwhile, the models themselves continue to advance. In mid-November OpenAI released GPT-5.1, an upgrade framed as making GPT-5 “smarter, more conversational” and easier to customise, extending the line of thinking-enabled GPT-5 models that became the default earlier this year.⁵ Anthropic’s Claude Opus 4.5 arrived later in the month, billed as the world’s strongest model for coding, agents, and “computer use,” with improved research and spreadsheet skills.⁶ Google, meanwhile, shipped Gemini 3, a new flagship deployed across its apps, with deeper multimodal reasoning and a “Deep Think” mode for step-by-step problem-solving.⁷ Elon Musk’s xAI has rolled out Grok 4.1 and Grok 4.1 Fast, pairing a two-million-token context window with an “Agent Tools API” wired directly into the X platform, web search, and remote code execution.⁸ And from China, DeepSeek’s new V3.2 and V3.2-Speciale models are explicitly branded as “reasoning-first” systems built for agents and tool use.⁹
Yet some of the most consequential news this month may be coming from outside the big-tech duopoly altogether. On December 1, Ukraine’s Ministry of Digital Transformation announced plans for a national large language model trained on Google’s Gemma framework, initially using Google infrastructure but ultimately to be hosted on Ukrainian systems, tuned to local languages and war-time institutional data.¹⁰ It is an ambitious experiment in digital sovereignty: a country at war, building its own LLM so it is not wholly dependent on foreign proprietary systems or adversarial platforms.
The story of December 2025, then, is not a single breakthrough model but a pattern: frontier-scale capability diffusing outward—into open-weight mixture-of-experts (MoE) systems, sovereign projects, and deeply embedded enterprise tools—just as regulators, researchers, and civil-society groups begin to grapple in earnest with what these systems are doing to infrastructure, labour, and law.
I. The Frontier Models: Beyond the One-Model Myth
By early December, the top of the LLM leaderboard is crowded rather than singular. Shakudo’s latest “Top 9 LLMs” survey, updated for November, already framed the landscape as a portfolio of leading systems—OpenAI’s GPT-5, DeepSeek and Qwen from China, Meta’s Llama 4, Anthropic’s Claude, Mistral, Google’s Gemini, and enterprise-focused models like Cohere’s Command A—each with distinct strengths, licensing choices, and deployment patterns.¹¹ December’s news mostly reinforces that pluralism.
OpenAI’s GPT-5.1 extends the GPT-5 line with improved conversational abilities and easier on-the-fly customisation of ChatGPT, including more flexible “personas” and lighter-weight domain tuning.⁵ This is an incremental release rather than a generational leap, but it consolidates GPT-5’s role as OpenAI’s unified workhorse across chat, coding, and multimodal tasks. The company has also been pushing open-weight siblings—GPT-oss-120B and GPT-oss-20B—under Apache 2.0, giving developers a sanctioned path to local deployment without abandoning the GPT ecosystem.¹¹
Anthropic’s Claude Opus 4.5, released November 24, sharpened the company’s focus on agents and “computer use”—spanning code editing, working across office documents, and orchestrating multi-step workflows.⁶ The Bun acquisition in December can be read as a bet that the next competitive edge will lie in tight coupling between models and the runtimes they manipulate, especially in code-heavy environments.⁴
Google’s Gemini 3 is pitched as its “most intelligent” model yet, with better reasoning, multimodal understanding, and integration across its products—the Gemini app, AI Studio, and Vertex AI—plus a “Deep Think” mode rolling out to Google AI Ultra subscribers.⁷ This continues a move toward “thinking modes” similar to those in GPT-5 and DeepSeek’s hybrid architectures, where the model chooses between fast, cheap responses and slower, chain-of-thought-style reasoning depending on the task.
Mistral’s Mistral 3 stands out for three reasons. First, it doubles down on small dense models (3B, 8B, 14B) optimised for on-premise and edge deployment. Second, its sparse Mistral Large 3 extends context and reasoning while staying licence-permissive. Third, by releasing the suite under Apache 2.0 and announcing a distribution partnership with NVIDIA, the company positions itself as the flagship of open-weight frontier models: high performance, but not locked behind an API.²
On the edges of this frontier, we see specialised plays. xAI’s Grok 4.1 family pushes hard into agentic use cases: tool-calling, real-time ingest of the firehose that is X, and a 2M-token context tuned for long-horizon retrieval and planning.⁸ DeepSeek’s V3.2-Speciale explicitly markets itself as “pushing the boundaries of reasoning,” with an API-only release aimed at builders who want an inexpensive, reasoning-first alternative to closed US models.⁹
Seen together, December 2025’s models tell us this: the race at the top has become as much about orchestration, context length, and tool use as about raw static benchmarks. The centre of gravity is shifting from “which model scores highest on MMLU (Massive Multitask Language Understanding)?” to “which ecosystem can run complex, long-running tasks reliably, cheaply, and in ways organisations can control?”
II. Compute, Money, and the Infrastructure Squeeze
If 2020’s GPT-3 era was about shock at training costs, 2025 is about the grind of inference economics. A widely read technical essay this month notes that a GPT-3-scale training run that cost roughly $4.6 million in 2020 can now be done for around $450,000, thanks to hardware improvements and optimisation.¹² Training, in other words, has become relatively cheap. What is expensive is running these models, continuously, for hundreds of millions of users and billions of API calls.
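To make that shift concrete, a back-of-envelope calculation helps. The sketch below is purely illustrative: the request volume, token counts, and per-token serving price are assumed placeholder values, not figures from the essay cited above.

```python
# Back-of-envelope comparison of a one-off training run against ongoing
# inference costs. Every number below is an illustrative assumption,
# not a published figure.

TRAINING_COST_USD = 450_000          # assumed cost of a GPT-3-scale run today

REQUESTS_PER_DAY = 100_000_000       # assumption: 100M requests per day
TOKENS_PER_REQUEST = 1_000           # assumption: prompt + completion tokens
COST_PER_MILLION_TOKENS_USD = 1.00   # assumption: blended serving cost

daily_tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST
daily_inference_cost = daily_tokens / 1_000_000 * COST_PER_MILLION_TOKENS_USD
annual_inference_cost = daily_inference_cost * 365

print(f"Daily inference cost:  ${daily_inference_cost:,.0f}")
print(f"Annual inference cost: ${annual_inference_cost:,.0f}")
print(f"A full training run costs as much as "
      f"{TRAINING_COST_USD / daily_inference_cost:.1f} days of serving")
```

Even under assumptions this rough, the serving bill overtakes the cost of a modern training run within days, which is why so much of the month’s infrastructure news reads as an exercise in inference economics.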
That pressure explains much of December’s activity. Mistral’s partnership with NVIDIA is as much an infrastructure play as a model story: the companies promise a standard way to deploy Mistral 3 models on NVIDIA hardware “from cloud to edge,” with the MoE architecture and large context designed to maximise throughput per watt.² Amazon’s Nova 2 announcement explicitly leans on its in-house Trainium chips and Nova Forge’s ability to let customers build highly specialised models that don’t waste capacity on irrelevant capabilities.¹
Anthropic’s $1 billion revenue run rate for Claude Code hints at another financial reality: narrow use-case alignment matters.⁴ A coding assistant that can replace or accelerate high-value developer hours supports far higher per-token or per-seat pricing than a general chat assistant; it can justify the infrastructure bill. That logic is mirrored in HSBC’s move to self-host Mistral models: by bringing inference on-premise, the bank hopes to cut latency, improve control, and avoid mark-ups from hyperscale AI APIs.³
These economics are driving structural choices. December’s technical commentary for builders frames three ecosystems: small language models (SLMs) for cost-sensitive, narrow tasks; large language models (LLMs) for broad reasoning; and multimodal language models (MLMs) for perception-heavy workloads.¹² The message is clear: “big LLM or nothing” is now a bad architectural decision.
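In practice, that layered approach usually takes the form of a thin routing layer sitting in front of several models. The sketch below is a minimal illustration of the idea under assumed names: `call_model`, the model identifiers, and the keyword heuristics are hypothetical placeholders, not any vendor’s API.

```python
# Minimal sketch of an "SLM-first" routing layer for a layered model stack.
# The model names, call_model(), and the routing heuristics are hypothetical
# placeholders, not a real vendor API.

def call_model(model_name: str, prompt: str) -> str:
    """Stand-in for whatever inference client an organisation actually uses."""
    raise NotImplementedError("wire this up to your own serving stack")

def route(prompt: str, needs_vision: bool = False) -> str:
    """Send each request to the cheapest tier that can plausibly handle it."""
    if needs_vision:
        # Perception-heavy work goes to a multimodal model.
        return call_model("multimodal-model", prompt)
    if len(prompt.split()) < 200 and "step by step" not in prompt.lower():
        # Short, narrow tasks (classification, extraction, templated replies)
        # are usually well served by a small model on cheap hardware.
        return call_model("small-model-8b", prompt)
    # Anything long or reasoning-heavy falls through to the large model.
    return call_model("large-frontier-model", prompt)
```

Real routers tend to rely on a small classifier or confidence score rather than keyword checks, but the economic logic is the same: send the bulk of traffic to the cheapest tier that can do the job.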
That shift also matters for energy and water. An explainer in The Conversation this month on small versus large models reminds readers that LLMs “usually run in the cloud… with high operational costs,” while SLMs can run on cheap hardware, deliver millisecond latency, and fit in constrained environments—from farm-advice platforms to satellites.¹³ In other words, the climate and power implications of AI are no longer an abstract externality; they are part of the deployment trade-off.
III. Governance and Safety: From Grand Principles to Concrete Guidance
On the policy front, December brings nothing as dramatic as the EU AI Act’s general-purpose AI obligations taking effect earlier in the year, but it does bring meaningful moves toward operational guidance.
In Australia, the federal Labor government released its long-awaited AI strategy with a clear decision not to pursue standalone AI legislation, arguing that existing laws—on discrimination, privacy, online safety—can be adapted for AI. Instead, the roadmap emphasises economic benefits and unlocking public and private data for innovation. Yet the same document warns of “vast water and power resources being sucked up by datacentres,” AI-facilitated gendered abuse, and unresolved copyright questions for artists and writers whose work is “hoovered up” by large language models.¹⁵ The message is ambivalent: push ahead, but worry about the plumbing and the people.
In the United States, the Cybersecurity and Infrastructure Security Agency (CISA) and partners in Australia and beyond released joint guidance on securely integrating AI in operational technology (OT) environments—industrial control systems, energy grids, transport, and manufacturing.¹⁶ While not LLM-specific, the guidance acknowledges that generative models are being embedded into monitoring, incident response, and decision-support systems in critical infrastructure, and urges rigorous threat modelling, supply-chain scrutiny, and fallback modes if AI components misbehave.
At the micro level, even small organisations are starting to respond. Seattle street-paper Real Change this week announced a “robust generative AI policy,” outlining when AI tools may be used in reporting, editing, and illustration, how outputs must be disclosed, and a baseline commitment that core investigative work remain human-led.¹⁷ These may seem like modest moves, but they point to the new normal: organisations of all sizes are having to decide not just whether to use LLMs, but how to do so without undermining trust or labour conditions.
The legal profession is wrestling with broader questions. A December 1 analysis by law firm Greenberg Traurig asks bluntly, “Have large language models hit a wall?”, arguing that further capability gains may run into diminishing returns without new architectures or training regimes, and flagging regulatory, liability, and IP uncertainties that could constrain deployment in high-risk sectors.¹⁸ It is less a definitive verdict than a sign that the legal establishment now sees LLMs not as toys but as systemic risk factors.
IV. Open vs Closed, West vs China, Centre vs Periphery
The December landscape also sharpens geopolitical and philosophical divides.
Open vs closed: Mistral 3’s Apache-2.0 release, Meta’s earlier Llama 4 family, and OpenAI’s GPT-oss models demonstrate a maturing open-weight ecosystem at every scale—from compact 3B SLMs to mixture-of-experts giants.² ¹¹ ¹⁹ The licensing choices here matter: Apache 2.0 and MIT licences permit commercial use and local deployment with few restrictions, allowing banks, health systems, and national security agencies to run models inside their own walls. The trade-off is that these models may lag slightly behind the very latest proprietary releases on some benchmarks, and shift more responsibility for safety and alignment onto the deployer.
West vs China: DeepSeek’s trajectory remains a bellwether for China’s LLM ambitions. Its V3 series combines a “thinking mode” for complex reasoning with a faster “non-thinking” mode for bulk tasks, and uses a mixture-of-experts architecture to keep inference efficient at frontier scale.¹¹ ⁹ DeepSeek V3.2-Speciale’s branding as a “reasoning-first” agent model—released API-only but supported by a technical paper on Hugging Face—signals a desire to compete directly with US labs on the agentic frontier, not merely as an inexpensive alternative.⁹
At the same time, Ukraine’s Gemma-based national LLM illustrates how smaller states are trying to avoid becoming mere customers of either bloc. Kyiv’s model will support Ukrainian, Russian, Crimean Tatar, and other minority languages, trained on data from more than 90 public institutions, with advisory committees overseeing cultural and technical standards.¹⁰ The explicit goal is to ensure that the country’s future AI infrastructure cannot be switched off or skewed by decisions in San Francisco, Beijing, or even Brussels.
Centre vs periphery: A December explainer on small vs large language models contrasts cloud-hosted LLMs with SLMs that can run on-device or on modest servers, enabling AI-powered agricultural advice platforms in India and other low-resource settings.¹³ This is a quiet rebalancing of power. If generative AI can only be consumed via US-based cloud APIs, the global periphery remains dependent on foreign platforms. If SLMs and open-weight models can be deployed locally, they become tools for digital self-determination.
V. Work, Culture, and Everyday Life
In workplaces, LLMs are no longer pilots or experiments; they are infrastructure. HSBC’s partnership with Mistral aims to “supercharge” generative AI across the bank, from document analysis to client communication, with the models self-hosted and integrated into internal systems.³ This move echoes similar roll-outs by other financial institutions and big-law firms over the past year, but is notable for picking an open-weight European lab over US incumbents.
Anthropic’s Claude Code—now at a billion-dollar run rate—offers a glimpse of what a mature, niche LLM product looks like: integrated into developer toolchains, trusted for routine tasks, and increasingly part of the cognitive scaffolding of software teams.⁴ When a runtime company like Bun is acquired not by a cloud provider but by an AI lab, it signals that the next productivity leap may come from deeply fusing models with the environments in which they act.
Culturally, the month brings a flurry of interpretive work. MIT Press and Penguin have released Stephan Raaijmakers’ book Large Language Models, a 300-page survey that traces the history, capabilities, and societal impact of LLMs, aimed at both computer-science and linguistics audiences.¹³ A special issue of the Journal of Manufacturing Systems focuses on LLMs in smart manufacturing, from natural-language interfaces for factory control to code generation for robotic systems.¹² These are signs that the field is moving from breathless headlines into more durable academic and pedagogical treatment.
At the same time, public-facing pieces are trying to demystify the toolset. Mirage News, picking up a Conversation article, explains the difference between small language models—“specialised tools in a toolbox,” with millions of parameters—and large models—“an entire workshop,” with billions or trillions.¹³ This kind of language is less about technical precision than about helping citizens, teachers, and small businesses choose the right level of AI for their needs.
Within newsrooms, unions, and community organisations, policies like Real Change’s generative-AI guidelines suggest a nascent consensus: AI can assist with transcription, summarisation, and visual drafts, but investigative judgment, political framing, and final editorial control must remain human.¹⁷ That line will be tested repeatedly in the years ahead.
VI. Research, Reflection, and the Question of Limits
Several December publications speak to the frontier of how LLMs think—or appear to.
In npj Artificial Intelligence, Baoxue Li and Chunhui Zhao report that structured self-reflection—having models critique and refine their own draft answers—can markedly improve performance on academic tasks.¹⁴ Self-reflection has been used informally in prompt-engineering and RLHF pipelines for some time; this paper offers a more rigorous demonstration that, at least in constrained settings, forcing models to “think again” yields more accurate and stable answers.
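As a rough picture of what such a loop looks like in code, the sketch below drafts an answer, asks the same model to critique it, and then revises. The `generate` function and the prompts are assumptions for illustration only; they do not reproduce Li and Zhao’s actual protocol.

```python
# Minimal sketch of a draft -> critique -> revise self-reflection loop.
# generate() is a placeholder for any LLM completion call; the prompts are
# illustrative and do not reproduce the published method.

def generate(prompt: str) -> str:
    """Stand-in for a call to whichever model is being evaluated."""
    raise NotImplementedError("plug in your preferred LLM client here")

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    """Draft an answer, then critique and revise it a fixed number of times."""
    draft = generate(f"Answer the question as precisely as you can:\n{question}")
    for _ in range(rounds):
        critique = generate(
            "List factual errors, gaps, or unsupported claims in this answer.\n"
            f"Question: {question}\nAnswer: {draft}"
        )
        draft = generate(
            "Rewrite the answer so that every issue in the critique is fixed.\n"
            f"Question: {question}\nAnswer: {draft}\nCritique: {critique}"
        )
    return draft
```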
The Journal of Manufacturing Systems special issue, updated December 1, surveys applications of LLMs in production lines and supply chains, including retrieval-augmented assistants for technicians and code-generating agents for programmable logic controllers.¹² While many of these deployments are still experimental, they highlight a trend: LLMs moving from office chat windows into the guts of physical systems.
Stepping back, Greenberg Traurig’s “Have large language models hit a wall?” article articulates a growing unease.¹⁸ It notes that benchmark gains are flattening, hallucinations remain a problem, and scaling alone may not deliver the kind of robust, verifiable reasoning needed for medicine, law, or critical infrastructure. The piece also points to upcoming regulatory requirements for transparency, watermarking, and incident reporting that could impose friction on frontier labs.
Meanwhile, technical essays aimed at builders argue that we are not at a wall so much as a fork. One path pushes toward ever-larger, more capable general models; another explores smaller, more specialised systems; a third blends the two with orchestrated swarms and tool-calling.¹² December’s model releases, with their focus on agents, context windows, and modular deployment, suggest that the industry is already walking all three paths at once.
VII. Environmental and Material Realities
The environmental footprint of LLMs is no longer buried in footnotes. The Australian government’s AI roadmap explicitly calls out the “vast water and power resources” being consumed by datacentres, and the need to ensure that AI-driven growth does not overwhelm already stressed grids and watersheds.¹⁵ This is not an isolated concern: regulators in the EU and several US states are pushing for transparency on datacentre energy and water use, and for siting decisions that account for local climate risk.
Analyses like JIN’s “Technical Builder’s Guide to 2025” warn practitioners that LLMs’ operating costs—and by implication their energy use—scale with inference load, not just model size.¹² A single mis-architected application that calls an overpowered model millions of times per day can easily waste orders of magnitude more energy than a well-designed SLM-first system. Mirage News’ discussion of SLMs as “LED bulbs” and LLMs as “stadium lights” offers a useful metaphor: both have their place, but you don’t flood-light a living room.¹³
As Mistral, Meta, OpenAI, and others tout MoE architectures and quantised deployments, there is an emerging recognition that efficiency is now a competitive differentiator—not just a green talking point. Apache-licensed, hardware-friendly models like Mistral 3 and GPT-oss will be judged not only on accuracy, but on how cheaply and cleanly they can be run at scale.² ¹¹
VIII. What December 2025 Signals
With most of December still ahead, it is too early to declare the month’s final verdict. But the first days offer several clear signals.
1. The frontier is plural and increasingly open.
There is no longer a single dominant model: GPT-5.1, Claude Opus 4.5, Gemini 3, Grok 4.1, DeepSeek V3.2, Mistral 3, and Llama 4 each occupy different niches.² ⁵ ⁶ ⁷ ⁸ ⁹ ¹¹ ¹⁹ The open-weight side of that ledger is growing stronger, with frontier-scale Apache and MIT-licensed models now routine rather than exceptional.
2. Infrastructure and economics are shaping architecture.
Inference costs, energy use, and latency are forcing developers to adopt layered stacks of SLMs, LLMs, and multimodal models rather than one-size-fits-all solutions.¹² ¹³ Enterprise deals like HSBC–Mistral and Amazon’s Nova Forge push AI closer to where the data and compute already live.¹ ³
3. Sovereign and sectoral models are on the rise.
Ukraine’s Gemma-based LLM is likely a harbinger of dozens of national and sectoral models to come, trained on local languages and institutional data.¹⁰ Manufacturing, finance, and media are all moving from generic assistants to domain-tuned systems governed by bespoke policies.¹² ¹⁷ ¹⁸
4. Governance is shifting from principles to plumbing.
The questions now are less “Should we regulate AI?” and more “Which existing laws apply?” and “How do we integrate these systems safely into critical infrastructure?” Australian policymakers, CISA, and even small newsrooms are all wrestling with that wiring.¹⁵ ¹⁶ ¹⁷
5. The debate about limits is starting in earnest.
Legal analyses of whether LLMs have “hit a wall,” empirical work on self-reflection and academic tasks, and multi-disciplinary books and conferences suggest the field is entering a more reflective phase.¹¹ ¹³ ¹⁴ ¹⁸ The question is not just how big these models can get, but what forms of intelligence—and what kinds of society—we actually want them to support.
Over the next few months, expect these patterns to deepen. Open-weight frontier models will proliferate. Sovereign and sectoral LLMs will multiply. Regulators will issue more concrete guidance on datacentres, safety audits, and content provenance. And for builders and citizens alike, the challenge will be to navigate a world where large language models are no longer novelties but invisible infrastructure—as taken for granted, and as contested, as the power grid itself.
Research for this article assisted by ChatGPT 5.1
Endnotes
1. Amazon Web Services and Kevin Roose, “Amazon Has New Frontier AI Models—and a Way for Customers to Build Their Own,” Wired, December 2025.
2. Mistral AI, “Introducing Mistral 3,” December 1, 2025; NVIDIA, “NVIDIA Partners With Mistral AI to Accelerate New Family of Open Models,” December 2025; Financial Times, “AI Chatbot Race Enters Crunch Phase,” FT News Briefing, December 3, 2025.
3. Elizabeth Howcroft, “HSBC Taps French Start-Up Mistral to Supercharge Generative-AI Rollout,” Reuters, December 1, 2025.
4. Anna Tong, “Anthropic Acquires Developer Tool Startup Bun to Scale AI Coding,” Reuters, December 2, 2025; Anthropic, “Newsroom,” accessed December 4, 2025.
5. OpenAI, “GPT-5.1: A Smarter, More Conversational ChatGPT,” November 12, 2025; OpenAI, “Introducing GPT-5,” August 7, 2025; Shakudo, “Top 9 Large Language Models as of November 2025,” October 5, 2025.
6. Anthropic, “Introducing Claude Opus 4.5,” November 24, 2025.
7. Google, “Gemini 3: Introducing the Latest Gemini AI Model from Google,” November 18, 2025; Google, “Gemini Models,” accessed December 4, 2025.
8. xAI, “Grok 4.1,” November 17, 2025; xAI, “Grok 4.1 Fast and Agent Tools API,” November 19, 2025; Economic Times, “Grok 4.1 Update: xAI Surpasses ChatGPT & Gemini on Key AI Benchmarks,” November 18, 2025.
9. DeepSeek, “DeepSeek-V3.2 Release,” December 1, 2025; Sebastian Raschka, “A Technical Tour of the DeepSeek Models from V3 to V3.2,” Raschka’s AI Newsletter, December 2025.
10. Reuters, “Ukraine Developing Independent AI System with Google Open Technology, Ministry Says,” December 1, 2025.
11. Shakudo, “Top 9 Large Language Models as of November 2025,” October 5, 2025.
12. JIN, “Small, Large, and Multimodal Language Models: A Technical Builder’s Guide to 2025,” Medium, December 2025.
13. Lin Tian and Marian-Andrei Rizoiu, “What Are Small Language Models and How Do They Differ from Large Ones?,” The Conversation (republished by Mirage News), December 2, 2025.
14. Baoxue Li and Chunhui Zhao, “Self-Reflection Enhances Large Language Models Towards Substantial Academic Response,” npj Artificial Intelligence 1 (2025), Article 42.
15. Josh Butler, “‘Enable Workers’ Talents’: No Need for AI Legislation in Australia, Labor Says,” The Guardian, December 1, 2025.
16. Cybersecurity and Infrastructure Security Agency (CISA), “CISA, Australia, and Partners Author Joint Guidance on Securely Integrating Artificial Intelligence in Operational Technology,” Alert, December 3, 2025; CISA, “Home Page,” accessed December 4, 2025.
17. Real Change Newsroom, “Real Change Introduces Robust Generative AI Policy,” Real Change, December 3, 2025.
18. Greenberg Traurig LLP, “Have Large Language Models Hit a Wall?,” December 1, 2025.
19. Meta and Reuters, “Meta Releases New AI Model Llama 4,” April 5, 2025; “Llama (Language Model),” Wikipedia, last modified November 2025.
