On June 26, 2026, OpenAI began a limited preview of GPT-5.6, and the strangest part of the launch is that almost no one can use it yet. The release introduces a three-model family with a new naming scheme: Sol (the flagship), Terra (a balanced everyday model), and Luna (a fast, low-cost model). It also brings two new reasoning modes, a state-of-the-art coding benchmark result, and OpenAI's most aggressive safety stack to date. But instead of the usual same-day rollout to Plus and Pro subscribers, GPT-5.6 went out to roughly twenty trusted partners through the API and Codex, a phased release that OpenAI says was requested by the U.S. government.

Figure: OpenAI GPT-5.6 Terminal-Bench 2.1 results. Bar chart of agentic command-line coding scores for Sol Ultra, Sol, Terra, and Luna against Claude Mythos 5, Claude Fable 5, GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro Preview.

This guide unpacks what GPT-5.6 actually is — the three-tier design, the benchmarks OpenAI chose to show, the new ultra and max modes, the pricing, and the cybersecurity story that explains the unusual rollout. It also covers what the model means for document and slide workflows once it reaches general availability. Because this is a preview, a healthy amount of skepticism is warranted: OpenAI published a deliberately narrow slice of evaluations and promised "an expanded suite" at launch. We will flag what is confirmed and what is still unverified throughout.

What Is GPT-5.6?

GPT-5.6 is the successor to GPT-5.5, the fully retrained agentic model OpenAI shipped in April 2026. Where GPT-5.5 was a single model with reasoning-effort levels, GPT-5.6 is a family of three durable tiers under one generation number. OpenAI's framing is explicit about the philosophy:

"The number identifies a model's generation, while Sol, Terra, and Luna identify durable capability tiers that can advance on their own cadence."

That sentence is the most important architectural decision in the release. It means OpenAI is decoupling "how smart" from "how new." A future GPT-5.7 Luna could ship without touching Sol; a Sol upgrade does not force everyone on Terra to re-test their pipelines. The celestial names — Sol (sun), Terra (earth), Luna (moon) — are meant to be permanent fixtures, the way "Pro," "Air," and "mini" became stable product lines elsewhere.

Here is how the three tiers break down:

GPT-5.6 Sol: the flagship. OpenAI calls it "our strongest model yet," and it is the only tier that exposes the new max and ultra reasoning modes. It is the cybersecurity and long-horizon-agent model.
GPT-5.6 Terra: the balanced workhorse. OpenAI says Terra "has competitive performance to GPT-5.5 while being 2x cheaper." This is the tier most production workloads will actually run on.
GPT-5.6 Luna: the budget tier. "A fast and affordable model" that "brings strong capability at our lowest cost," aimed at high-volume, routine tasks.

Multiple third-party reports put Sol's context window at roughly 1.5 million tokens, about 43% larger than GPT-5.5's. OpenAI's preview post itself focused on safety and evaluations rather than the spec sheet, so treat the context-window figure as reported but not officially confirmed until general availability.

The New Naming System: Why Sol, Terra, Luna Matters

For two years, OpenAI's naming was a running joke — gpt-4o, o1, o3-mini-high, gpt-5.4-pro. The Sol/Terra/Luna scheme is a deliberate reset. By fixing three tiers and letting the generation number float, OpenAI is signaling that it expects to ship capability upgrades far more often than it changes the product surface.

For developers this is genuinely useful. If you build an agent on Terra today, the contract is "balanced cost and capability" — and that contract survives the next three generation bumps. You no longer have to re-pick a model from a menu of eight confusingly-named options every quarter. The trade-off is that "Terra" tells you nothing absolute about capability; you have to read the generation number too. But that is a smaller cognitive load than the old system imposed, and it mirrors how the rest of the industry — including Anthropic's Mythos/Fable line and Google's Gemini Pro/Flash split — has converged on stable tier names.

Benchmark Results: A Deliberately Narrow Window

OpenAI did something unusual with the GPT-5.6 preview: it published only three benchmark areas — coding, biology, and cybersecurity — and explicitly held back the rest. There is no GDPval table, no SWE-bench number, no FrontierMath score, no hallucination rate in the preview post. OpenAI says the "expanded suite of evaluation results" will arrive at general availability. So the picture below is real but partial, and it is all vendor-reported under OpenAI's own harness.

Terminal-Bench 2.1: The Headline Coding Result

The one benchmark OpenAI clearly wants you to remember is Terminal-Bench 2.1, which tests command-line agentic workflows requiring planning, iteration, and tool coordination. Here is the full official table, transcribed exactly from OpenAI's chart:

Model	Terminal-Bench 2.1
GPT-5.6 Sol (ultra mode)	91.9%
GPT-5.6 Sol	88.8%
Claude Mythos 5	88.0%
GPT-5.6 Terra	84.3%
Claude Fable 5	84.3%
GPT-5.5	83.4%
GPT-5.6 Luna	82.5%
Claude Opus 4.8	78.9%
Gemini 3.1 Pro Preview	70.7%

Read carefully, this table is more honest than the "new state of the art" headline suggests, and the nuance is worth four observations:

1. The 91.9% record uses ultra mode — which is not a single agent. That top score comes from Sol orchestrating subagents (more on that below). Compared against single-agent baselines, it is not strictly apples-to-apples. The fairer comparison is Sol's single-agent 88.8%.

2. Single-agent Sol beats Claude Mythos 5 by 0.8 points. A gap of 88.8% vs 88.0% is plausibly inside the noise band, since two runs of the same model can disagree by more than that. OpenAI holds the top spot on the coding curve, but on this benchmark its margin over Anthropic's current flagship is narrow, not a blowout. For a balanced view of where Anthropic still competes, see our Claude Opus 4.8 guide.

3. Terra ties Claude Fable 5 and barely edges GPT-5.5. Terra's 84.3% matches Fable 5 exactly and sits less than a point above the GPT-5.5 it replaces. Combined with the 2× price cut, that is the real commercial story of this release — same coding ability for half the money, not a capability leap in the mid-tier.

4. Gemini 3.1 Pro Preview trails by ~21 points. That is the gap to ultra-mode Sol (91.9% vs 70.7%); against single-agent Sol it is about 18 points. Either way, on this specific agentic-coding benchmark Google's frontier model is well behind. The gap is real but benchmark-specific; Gemini leads elsewhere on long-context retrieval and multimodal tasks not shown here.
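The margins quoted in these four observations can be recomputed directly from the table. The short Python sketch below does that, and also shows why a sub-point gap is hard to read: it estimates a simple binomial standard error for a score. The task count of 100 is a hypothetical placeholder, since the preview post's task count is not given here, and the binomial model is itself a simplifying assumption.

```python
# Scores transcribed from the Terminal-Bench 2.1 table above (percent).
SCORES = {
    "GPT-5.6 Sol (ultra mode)": 91.9,
    "GPT-5.6 Sol": 88.8,
    "Claude Mythos 5": 88.0,
    "GPT-5.6 Terra": 84.3,
    "Claude Fable 5": 84.3,
    "GPT-5.5": 83.4,
    "GPT-5.6 Luna": 82.5,
    "Claude Opus 4.8": 78.9,
    "Gemini 3.1 Pro Preview": 70.7,
}


def margin(a: str, b: str) -> float:
    """Return score(a) - score(b) in percentage points, rounded to 0.1."""
    return round(SCORES[a] - SCORES[b], 1)


def binomial_std_error(score_pct: float, n_tasks: int) -> float:
    """Standard error (in points) of a pass rate measured over n_tasks.

    Assumes independent pass/fail tasks; n_tasks is a hypothetical input.
    """
    p = score_pct / 100
    return 100 * (p * (1 - p) / n_tasks) ** 0.5


print(margin("GPT-5.6 Sol", "Claude Mythos 5"))                      # 0.8
print(margin("GPT-5.6 Terra", "GPT-5.5"))                            # 0.9
print(margin("GPT-5.6 Sol (ultra mode)", "Gemini 3.1 Pro Preview"))  # 21.2
print(margin("GPT-5.6 Sol", "Gemini 3.1 Pro Preview"))               # 18.1
```

Under the hypothetical 100-task assumption, `binomial_std_error(88.8, 100)` comes out at roughly 3 points, several times larger than the 0.8-point gap between Sol and Mythos 5.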

Biology: GeneBench v1

OpenAI's second focus area was biology workflows. On GeneBench v1, which evaluates long-horizon genomics and quantitative-biology analyses, GPT-5.6 Sol "achieves stronger results than GPT-5.5 while using fewer tokens." OpenAI did not publish the raw GeneBench numbers in the preview post, so this is a directional claim — better accuracy at lower token cost — rather than a verifiable score. The token-efficiency angle is consistent across the release: Sol is repeatedly described as reaching higher accuracy with fewer output tokens, which matters more for cost than the per-token price alone.

Cybersecurity: ExploitBench and ExploitGym

The third area is where GPT-5.6 gets genuinely interesting — and where the whole unusual rollout originates.

ExploitBench: GPT-5.6 Sol is "competitive with Mythos Preview using only ~1/3 of the output tokens." So roughly comparable offensive-security capability at a third of the token spend.
ExploitGym: a benchmark built by UC Berkeley researchers in collaboration with OpenAI and other frontier labs. Sol, Terra, and Luna all show "strong improvements in cyber capabilities as we increase reasoning."

Crucially, OpenAI states that Sol does not cross the "Cyber Critical" threshold under its Preparedness Framework. In evaluations involving Chromium and Firefox, Sol "identified bugs and exploitation primitives — the building blocks of an exploit — but did not autonomously produce a functional full-chain exploit under the conditions tested." In plain terms: it can find the pieces, but it did not assemble a working weapon on its own in OpenAI's tests. That distinction — building blocks, not full chains — is the entire basis for releasing the model at all.

What's Genuinely New: `max` and `ultra` Reasoning Modes

The most important architectural addition is not in any benchmark cell. GPT-5.6 introduces two new reasoning settings, both exclusive to Sol:

max — a new top reasoning effort that gives the model "the most time to reason deeply." This is the familiar lever taken one notch further: a single agent thinking longer on one hard problem.
ultra — the genuinely new primitive. Ultra "goes beyond the capabilities of a single agent by leveraging subagents to accelerate complex work." Instead of one chain of thought, Sol spins up helper agents that split a complex task into parallel pieces and recombine the results.

Figure: GPT-5.6 Sol's max reasoning mode (a single deep chain of thought) compared with ultra mode, which orchestrates parallel subagents that split and recombine complex work.

This is the mechanism behind the 91.9% Terminal-Bench headline. ultra is OpenAI productizing multi-agent orchestration directly inside a single model endpoint — the same pattern developers have been hand-rolling with frameworks for the past year, now native. The practical implication is significant: for a long-horizon task like a multi-file refactor or an end-to-end security audit, you no longer have to build the orchestration layer yourself. The catch is cost and latency — subagents mean more total tokens and more wall-clock time, so ultra is for the high-stakes task where a few extra points justify the spend, not for routine calls.

It also explains why benchmark comparisons need care. A score produced by an agent that quietly spawns subagents is a different kind of result from a single-agent run. OpenAI is transparent about labeling the ultra bar separately, which is the right call, but it means "GPT-5.6 hits 91.9%" and "Claude Mythos 5 hits 88.0%" are not measuring quite the same thing.
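For readers who have not hand-rolled this pattern, here is a minimal Python sketch of the split, run in parallel, and recombine flow that ultra is described as absorbing. It is an illustration of the general technique only, not OpenAI's implementation: `run_subagent`, `split`, and `orchestrate` are hypothetical names, and the subagent is a stub that makes no model call.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Stand-in for one subagent; a real one would call a model here."""
    await asyncio.sleep(0)  # yield control, simulating an I/O-bound call
    return f"done: {subtask}"


def split(task: str) -> list[str]:
    """Naive splitter: treat semicolons as subtask boundaries."""
    return [part.strip() for part in task.split(";") if part.strip()]


async def orchestrate(task: str) -> list[str]:
    """Fan out one coroutine per subtask, then collect results in order."""
    subtasks = split(task)
    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run_subagent(s) for s in subtasks))


results = asyncio.run(orchestrate("refactor parser; update tests; fix docs"))
print(results)
```

Even in this toy form the cost trade-off is visible: every subtask is its own call, so total tokens scale with the number of subagents.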

Pricing: Sol Holds the Line, Terra Halves It

GPT-5.6's pricing, per 1M tokens, is one of the clearest parts of the release:

Model	Input	Output	Position
GPT-5.6 Sol	$5	$30	Same as GPT-5.5
GPT-5.6 Terra	$2.50	$15	~2× cheaper than GPT-5.5
GPT-5.6 Luna	$1	$6	Lowest cost

Figure: GPT-5.6 pricing ladder in USD per million input and output tokens: Sol at $5 and $30, Terra at $2.50 and $15, and Luna at $1 and $6, with Sol matching GPT-5.5 pricing and Terra at roughly half the cost.

Two things stand out. First, Sol holds GPT-5.5's exact $5 / $30 pricing while adding capability and the new modes — a rare case of "more for the same money." Second, Terra delivers GPT-5.5-class coding at half the price, which is the upgrade most teams will feel. With Sol's token-efficiency gains layered on top (fewer output tokens for equivalent work), the effective cost-per-task drop is larger than the headline rates imply.
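To make the per-task arithmetic concrete, the Python sketch below applies the published per-million-token rates to a single job. The workload size (200,000 input tokens and 20,000 output tokens) is a hypothetical example, and the tier keys are plain labels, not API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Rates from the pricing table above; keys are labels, not API model names.
RATES = {
    "sol": Rate(5.00, 30.00),
    "terra": Rate(2.50, 15.00),
    "luna": Rate(1.00, 6.00),
}


def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at the given tier's list rates."""
    rate = RATES[tier]
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


# Hypothetical workload: 200k input tokens, 20k output tokens.
for tier in RATES:
    print(tier, round(task_cost(tier, 200_000, 20_000), 2))
# sol 1.6, terra 0.8, luna 0.32
```

Lowering `output_tokens` for the same job shows the token-efficiency effect: because output is billed at six times the input rate on every tier, fewer output tokens cut cost faster than the headline rates suggest.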

GPT-5.6 also overhauls prompt caching. It adds explicit cache breakpoints and a 30-minute minimum cache life. Cache writes are billed at 1.25× the uncached input rate, while cache reads keep the 90% cached-input discount. For agentic workloads that re-send a large system prompt or codebase on every step, predictable caching can dominate the bill — this is a meaningful quality-of-life change for production agents.
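A rough Python sketch of how those caching rates play out for an agent that re-sends the same prefix on every step. It assumes one cache write on the first step and a cache hit on every later step, which only holds if all steps land within the cache lifetime. The prefix size and step count are hypothetical.

```python
def prefix_cost(prefix_tokens: int, steps: int, input_per_m: float,
                cached: bool = True) -> float:
    """USD cost of sending the same prefix on each of `steps` requests.

    Assumes a single cache write (1.25x the input rate) followed by cache
    reads at 10% of the input rate (the 90% discount) on every later step.
    """
    per_token = input_per_m / 1_000_000
    if not cached or steps == 0:
        return prefix_tokens * steps * per_token
    write = prefix_tokens * per_token * 1.25
    reads = prefix_tokens * per_token * 0.10 * (steps - 1)
    return write + reads


# Hypothetical agent: 100k-token prefix, 20 steps, Sol's $5 input rate.
print(round(prefix_cost(100_000, 20, 5.00, cached=False), 3))  # 10.0
print(round(prefix_cost(100_000, 20, 5.00, cached=True), 3))   # 1.575
```

Under these assumptions caching pays for itself from the second step onward: a one-step run costs 1.25x the uncached price, while two steps cost 1.35x against 2x uncached.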

Separately, OpenAI announced it will run GPT-5.6 Sol on Cerebras at up to 750 tokens per second in July, initially for select customers. At that throughput, long ultra-mode runs that would otherwise take minutes become interactive — a different experience from today's frontier latency.
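The throughput claim is easy to sanity-check with one division. In the Python sketch below, the 750 tokens-per-second figure is the announced ceiling, while the 15,000-token run length and the 75 tokens-per-second comparison rate are hypothetical values chosen only for illustration.

```python
def seconds_to_generate(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to stream output_tokens at a steady rate."""
    return output_tokens / tokens_per_second


run_tokens = 15_000  # hypothetical length of one long agentic run
print(seconds_to_generate(run_tokens, 750))  # 20.0 at the announced ceiling
print(seconds_to_generate(run_tokens, 75))   # 200.0 at a hypothetical slower rate
```

This ignores queueing, tool-call latency, and the fact that "up to" rates are ceilings, so real runs would take longer than the figure computed here.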

The Real Story: Why You Can't Use It Yet

Here is the part that makes GPT-5.6 unlike any prior OpenAI launch. The model exists, the benchmarks are published, the pricing is set — and access is restricted to about twenty vetted partners. OpenAI's own explanation:

"As part of our ongoing engagement with the U.S. government, we previewed our plans and the models' capabilities ahead of today's launch. At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly."

The trigger is the cybersecurity capability. Because Sol meaningfully advances vulnerability research and exploitation — even without crossing the Cyber Critical line — OpenAI is coordinating with the Administration on a "cyber Executive Order framework and a repeatable process for future model releases." Notably, OpenAI does not endorse this becoming permanent. Its language is unusually pointed:

"We don't believe this kind of government access process should become the long-term default. It keeps the best tools from users, developers, enterprises, cyber defenders, and global partners who need them."

That tension — a frontier lab shipping a model it believes should be broadly available, while accepting a government-requested gate to get there — is the defining feature of this release. Whatever your view, it is a genuine first: the rollout schedule of a commercial AI model shaped by a pre-release government review.

The Layered Safeguard Stack

To justify releasing the model at all, OpenAI paired it with what it calls its "most robust safety stack to date." It is worth understanding because it previews how high-capability models will ship going forward:

Trained-in refusals — the model is trained to refuse prohibited cyber assistance, including disguised intent and jailbreak attempts.
Real-time classifiers — cyber and biology misuse classifiers evaluate output as it is generated; on higher-risk cases, generation can pause while a larger reasoning model reviews the full conversation before anything reaches the user.
Account-level review — flagged activity can trigger review across a user's conversations to distinguish persistent malicious behavior from legitimate dual-use security work.
Differentiated access — the most sensitive capabilities are not broadly available by default.
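The second layer, a fast classifier that escalates only higher-risk cases to a slower reviewer, is a common two-stage pattern. The Python sketch below illustrates that pattern in general terms. It is not OpenAI's system: the function names, the 0.5 threshold, and the keyword-based stand-ins are all hypothetical, and real classifiers would be models rather than string checks.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


def screen(
    text: str,
    fast_score: Callable[[str], float],
    slow_review: Callable[[str], Verdict],
    escalate_at: float = 0.5,  # hypothetical threshold
) -> Verdict:
    """Run the cheap check first; escalate to the slow reviewer if risky."""
    if fast_score(text) < escalate_at:
        return Verdict.ALLOW
    return slow_review(text)


# Toy stand-ins so the sketch runs end to end.
def toy_score(text: str) -> float:
    return 0.9 if "exploit" in text.lower() else 0.1


def toy_review(text: str) -> Verdict:
    return Verdict.BLOCK if "full chain" in text.lower() else Verdict.ALLOW


print(screen("summarize this report", toy_score, toy_review))
print(screen("explain how exploit mitigations work", toy_score, toy_review))
print(screen("write a full chain exploit", toy_score, toy_review))
```

The middle example is the dual-use case: it trips the fast check, pays the latency of a review, and is then allowed, which is the kind of delay legitimate researchers may see.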

OpenAI also dedicated over 700,000 A100-equivalent GPU hours to automated red-teaming aimed at finding universal jailbreaks — attacks that generalize across prompts rather than working in one narrow setting — supplemented by third-party human red-teaming that continues through the preview. OpenAI is candid that the safeguards will sometimes misfire: legitimate security researchers may hit refusals or delays on dual-use requests, and the preview is partly designed to measure exactly that false-positive rate.

Honest Assessment: Where GPT-5.6 Leads and Where It Doesn't

Stripping away the launch framing, here is the balanced read:

Where it genuinely leads:

Agentic command-line coding — single-agent Sol tops Terminal-Bench 2.1, and ultra mode extends the lead further for tasks that justify the cost.
Token efficiency — across coding, biology, and security, Sol reaches comparable-or-better results with fewer output tokens. For high-volume agents, this compounds.
Mid-tier value — Terra's GPT-5.5-class performance at half the price is the most broadly useful change.
Native multi-agent orchestration — ultra removes a layer developers used to build by hand.

Where the case is weaker or unproven:

The headline is a narrow win. Single-agent Sol leads Claude Mythos 5 by 0.8 points on one benchmark. That is a lead, not a generational gap.
The evaluation window is deliberately small. No SWE-bench, GDPval, math, or hallucination numbers were published. We do not yet know whether GPT-5.6 fixed the confident-hallucination problem that dogged GPT-5.5. Until the expanded suite lands, judgment on general capability should stay reserved.
Everything is vendor-reported. No independent lab or arena has scored GPT-5.6 yet, precisely because almost no one can access it. Vendor benchmarks run on the vendor's harness and task distribution; independent verification is the missing ingredient.
You can't actually use it. For the vast majority of developers and ChatGPT users, GPT-5.6 is, for now, a press release with a price list. The real test comes at general availability "in the coming weeks."

What GPT-5.6 Means for AI Slide Generation

For teams building document-to-slide pipelines, the most relevant parts of GPT-5.6 are not the cybersecurity benchmarks — they are the quieter capability and economics shifts. A larger context window and stronger long-horizon agentic behavior directly improve the hardest step in AI presentation tools: ingesting a long, messy source document and producing a coherent slide outline that holds its narrative thread from the first section to the last.

There are three places this matters concretely. First, Terra's pricing changes the economics of bulk slide generation. Producing a 60-slide deck from a 200-page report is a token-heavy job; halving the mid-tier rate while holding capability means large-deck workflows that were marginal on cost become routine. Second, ultra-style subagent orchestration maps naturally onto multi-document decks: one subagent per source section, recombined into a single structured outline, is exactly the pattern a research-paper-to-slides workflow needs. Third, token efficiency reduces the cost of the per-slide rendering pass, where each slide's layout and content are generated and then verified against the source.

But the unsolved problem is the same one every model launch leaves untouched: a more capable model produces more confident output, and confidence is not the same as fidelity. Until OpenAI publishes GPT-5.6's hallucination numbers, the safe assumption is that the floor for trustworthy slides — grounded, traceable, factually honest — is still set by the orchestration around the model, not the model alone. At Tosea.ai, the document-to-PPT pipeline runs source-grounded outline generation followed by per-slide rendering, with every claim traceable back to its source paragraph, so a stronger model raises the ceiling on richness while the architecture keeps the floor honest. For the underlying approach, see our guides on zero-hallucination AI slide generation and converting PDF documents into PowerPoint slides.
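As an illustration of what "traceable back to its source paragraph" can mean in practice, here is a minimal Python sketch of a grounding check. It is a generic example, not Tosea.ai's pipeline: the `Claim` structure and the exact-substring test are assumptions made for brevity, and a production check would need something more tolerant than literal matching.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str          # the sentence that appears on the slide
    source_index: int  # index of the paragraph the claim cites
    evidence: str      # snippet the claim says appears in that paragraph


def ungrounded(claims: list[Claim], paragraphs: list[str]) -> list[Claim]:
    """Return claims whose evidence is not found in the cited paragraph."""
    bad = []
    for claim in claims:
        in_range = 0 <= claim.source_index < len(paragraphs)
        if not in_range or (
            claim.evidence.lower() not in paragraphs[claim.source_index].lower()
        ):
            bad.append(claim)
    return bad


paragraphs = ["Revenue grew 12% year over year.", "Headcount was flat."]
claims = [
    Claim("Revenue rose 12%.", 0, "grew 12%"),
    Claim("Headcount doubled.", 1, "headcount doubled"),
]
print([c.text for c in ungrounded(claims, paragraphs)])  # ['Headcount doubled.']
```

The check is independent of which model wrote the slide, which is the point of keeping the floor in the surrounding architecture.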

How to Get Access

Limited preview (now): GPT-5.6 Sol, Terra, and Luna are available through the API and Codex to roughly twenty trusted partners and organizations, with participation shared with the U.S. government.
General availability ("in the coming weeks"): OpenAI plans to bring all three tiers to ChatGPT, Codex, and the API broadly. No firm date has been committed.
Cerebras (July): GPT-5.6 Sol at up to 750 tokens/second, initially for select customers.
Naming in the API: expect tier-based identifiers (Sol/Terra/Luna) under the GPT-5.6 generation, with max and ultra exposed as reasoning settings on Sol.

If you are not one of the twenty preview partners, the practical advice is to keep building on GPT-5.5 or Claude Opus 4.8 today and plan a migration test for when Terra hits general availability — the 2× cost cut is the change most workloads will want.

FAQ

Can I use GPT-5.6 right now? Almost certainly not. The June 26 release is a limited preview to about twenty vetted partners via the API and Codex. Broad ChatGPT and API access is promised "in the coming weeks" but has no committed date.

What's the difference between max and ultra mode? max is a single agent reasoning longer and deeper on one problem. ultra orchestrates multiple subagents that split complex work in parallel and recombine it. Both are exclusive to the Sol tier. ultra produced the 91.9% Terminal-Bench headline.

Is GPT-5.6 Sol better than Claude for coding? On Terminal-Bench 2.1, single-agent Sol (88.8%) edges Claude Mythos 5 (88.0%) by 0.8 points — a real but narrow lead. With ultra mode Sol reaches 91.9%, but that uses subagents and isn't a like-for-like single-agent comparison. No independent benchmarks exist yet, so treat this as a vendor-reported near-tie at the top.

Why is the U.S. government involved in the release? GPT-5.6 Sol meaningfully advances cybersecurity capabilities — including vulnerability research — even though it doesn't cross OpenAI's "Cyber Critical" threshold. At the government's request, OpenAI is running a phased preview while developing a cyber Executive Order framework. OpenAI has said it does not want this to become the long-term default.

Which tier should I plan to use? For most production workloads, Terra — it matches GPT-5.5-class capability at roughly half the cost. Reserve Sol for long-horizon coding, security work, or tasks that justify ultra mode. Use Luna for high-volume, latency-sensitive, routine calls.
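That rule of thumb fits in a few lines of Python. The sketch below simply encodes the guidance in this answer; the flag names are hypothetical, and the returned strings are tier labels, not API model identifiers.

```python
def pick_tier(long_horizon: bool = False,
              security_work: bool = False,
              high_volume_routine: bool = False) -> str:
    """Map a workload description to a tier label per the guidance above."""
    if long_horizon or security_work:
        return "sol"    # hard, high-stakes work
    if high_volume_routine:
        return "luna"   # cheap, fast, routine calls
    return "terra"      # default for most production workloads


print(pick_tier())                          # terra
print(pick_tier(long_horizon=True))         # sol
print(pick_tier(high_volume_routine=True))  # luna
```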

Does GPT-5.6 fix GPT-5.5's hallucination problem? Unknown. OpenAI deliberately published only coding, biology, and cybersecurity evaluations in the preview and held back its hallucination and knowledge-work numbers until general availability. Until then, pair the model with source-grounded verification for any high-stakes output.

Closing Thought

GPT-5.6 is two stories wearing one announcement. The first is an ordinary, well-executed model-family update: a sensible three-tier naming reset, a mid-tier that halves cost without losing capability, and a clever ultra mode that bakes multi-agent orchestration into the endpoint. The second is extraordinary: a frontier model whose public availability is gated by a government review, shipped with a layered safety stack and more than 700,000 A100-equivalent GPU hours of automated red-teaming, because its cybersecurity capabilities are strong enough to make a phased release the responsible choice.

For developers, the takeaway is patience plus planning: the benchmarks are promising but narrow and vendor-reported, the real evaluation comes at general availability, and the change most teams will actually feel is Terra's pricing, not Sol's headline. For everyone watching the trajectory of the field, the more important signal is the precedent — the first time the rollout calendar of a commercial AI model was set, in part, in coordination with the state.

Sources

Previewing GPT-5.6 Sol: a next-generation model — OpenAI, June 26, 2026
OpenAI Previews GPT-5.6 With Sol, Terra, and Luna: Tiered Models, New Reasoning Modes, Limited Access — MarkTechPost
OpenAI unveils GPT-5.6 Sol, Terra and Luna models — but only accessible to limited preview partners for now, per US Gov — VentureBeat
OpenAI releases powerful new GPT-5.6 model under restrictions — Axios
OpenAI upgrading ChatGPT and Codex with new GPT-5.6 models in limited release — 9to5Mac
OpenAI starts previewing GPT-5.6 and its three variants — Engadget
GPT-5.6 Sol Benchmarks Deep Dive: Terminal-Bench and Agentic Coding — Lushbinary
GPT-5.6 Sol, Terra & Luna: OpenAI's New Model Family — Digital Applied

GPT-5.6 Sol, Terra & Luna: Complete Guide to OpenAI's New Model Family
