GPT-5.4 with native computer use, NotebookLM Cinematic Videos, Codex on Windows

The week ends with several significant announcements: OpenAI's GPT-5.4 consolidates native computer use with 75% on OSWorld and a one-million-token context window; NotebookLM introduces Cinematic Video Overviews with Gemini as director; and Codex extends support to Windows with a native sandbox. On the developer tooling side, Anthropic improves the skill-creator and launches HTTP hooks in Claude Code, and GitHub enables Copilot Memory by default for Pro users.


OpenAI — GPT-5.4

March 5, 2026 — OpenAI launches GPT-5.4, its frontier model for professional work. Available in ChatGPT (as GPT-5.4 Thinking), in the API (identifier gpt-5.4), and in Codex, the model consolidates the reasoning, coding, and agentic workflow capabilities introduced in previous models into a single architecture.

The most significant technical novelty is the native integration of computer use: GPT-5.4 can operate graphical interfaces via screenshots and keyboard/mouse input, without third-party plugins. On OSWorld-Verified, the reference benchmark for interaction with real software interfaces, GPT-5.4 reaches 75.0%, versus 47.3% for GPT-5.2. The context window increases to 1 million tokens in Codex and the API.

Another notable addition is tool search: instead of receiving the full list of available tools on every call, the model receives a lightweight index and looks up tools on demand. OpenAI measures a 47% reduction in token consumption on multi-tool workflows (tested on Scale MCP Atlas). The /fast mode in Codex becomes 1.5× faster at equal intelligence.
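The mechanism above can be sketched in a few lines. This is an illustrative approximation, not OpenAI's implementation: the registry, tool names, and `search_tools` helper are all hypothetical, but they show why on-demand lookup saves tokens when only tool names (rather than full schemas) are sent up front.

```python
# Hypothetical sketch of on-demand tool search: the model sees only a
# lightweight index of tool names, and full schemas are returned only
# for the tools whose descriptions match its query.

TOOL_REGISTRY = {
    "create_issue": {
        "description": "Create a ticket in the issue tracker",
        "schema": {"type": "object", "properties": {"title": {"type": "string"}}},
    },
    "send_email": {
        "description": "Send an email to a recipient",
        "schema": {"type": "object", "properties": {"to": {"type": "string"}}},
    },
    "query_db": {
        "description": "Run a read-only SQL query against the warehouse",
        "schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
    },
}

def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return full tool definitions only for matching descriptions."""
    q = query.lower()
    hits = [
        {"name": name, **meta}
        for name, meta in TOOL_REGISTRY.items()
        if q in meta["description"].lower()
    ]
    return hits[:limit]

# Only the matching schema crosses the wire:
print([t["name"] for t in search_tools("email")])  # ['send_email']
```

With hundreds of tools, the per-call prompt shrinks from every schema to a name list plus a handful of matched schemas, which is the kind of saving the 47% figure refers to.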

Benchmarks:

| Evaluation | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 |
|---|---|---|---|
| GDPval (professional work) | 83.0 % | 70.9 % | 70.9 % |
| SWE-Bench Pro | 57.7 % | 56.8 % | 55.6 % |
| OSWorld-Verified (computer use) | 75.0 % | 74.0 % | 47.3 % |
| BrowseComp (web search) | 82.7 % | 77.3 % | 65.8 % |
| Toolathlon (tool usage) | 54.6 % | 51.9 % | 46.3 % |
| ARC-AGI-2 (abstract reasoning) | 73.3 % | — | 52.9 % |

API pricing:

| Model | Input | Output |
|---|---|---|
| gpt-5.2 | $1.75 / M tokens | $14 / M tokens |
| gpt-5.4 | $2.50 / M tokens | $15 / M tokens |
| gpt-5.2-pro | $21 / M tokens | $168 / M tokens |
| gpt-5.4-pro | $30 / M tokens | $180 / M tokens |
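To make the rates concrete, here is a small cost calculator using the listed prices (the `request_cost` helper is illustrative, not an OpenAI utility):

```python
# Per-request cost at the listed API prices (USD per million tokens).
PRICES = {
    "gpt-5.2": (1.75, 14.0),
    "gpt-5.4": (2.50, 15.0),
    "gpt-5.2-pro": (21.0, 168.0),
    "gpt-5.4-pro": (30.0, 180.0),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request: tokens times the per-million rate."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# 200k input / 10k output tokens on gpt-5.4:
print(round(request_cost("gpt-5.4", 200_000, 10_000), 2))  # 0.65
```

Even a request filling a large fraction of the 1M-token window stays in the low single dollars on gpt-5.4, versus roughly 12× that on gpt-5.4-pro.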

GPT-5.4 Thinking is available today to ChatGPT Plus, Team, and Pro subscribers. GPT-5.2 Thinking will remain available under "Legacy Models" until June 5, 2026. On safety, OpenAI classifies GPT-5.4 as "High cyber capability" in its Preparedness Framework. The company simultaneously publishes CoT-Control, an open-source evaluation suite measuring chain-of-thought controllability across 13 frontier models; the low scores (0.1% to 15.4%) indicate that models cannot deliberately steer their reasoning, so monitoring chains of thought remains a reliable safety tool.

🔗 Introducing GPT-5.4 | OpenAI


NotebookLM — Cinematic Video Overviews

March 4, 2026 — NotebookLM introduces Cinematic Video Overviews in its Studio. These videos go beyond the Audio Overviews (podcast format) launched in 2024 and the standard video templates.

The idea: Gemini acts as director. The model analyzes the user's sources, decides on the most suitable format (tutorial, documentary, etc.), selects a visual style, generates images, then self-critiques before producing the final version. The result is an immersive, personalized video unique to each set of sources.

The feature has been available to Google AI Ultra subscribers, in English, since March 4, 2026; full rollout to Ultra users was confirmed the same day. Pro subscriber access is on the roadmap, with no precise timeline. The announcement tweet received 3 million views.

🔗 NotebookLM announcement on X


OpenAI — Codex on Windows, CoT-Control research

Codex available on Windows

March 4, 2026 — The Codex application is now available on Windows, with a native agent sandbox and support for Windows development environments via PowerShell. Two new skills are available: $aspnet-core for Blazor, ASP.NET MVC, and Razor Pages applications, and $winui-app for native Windows apps with WinUI 3.

🔗 @OpenAIDevs on X

Research — chain-of-thought controllability

March 5, 2026 — OpenAI publishes "Reasoning models struggle to control their chains of thought, and that's good." The open-source evaluation suite CoT-Control measures chain-of-thought controllability across 13 frontier models. Scores range from 0.1% to 15.4%, indicating that current models struggle to deliberately alter their reasoning to bypass monitoring systems, a result presented as positive for safety. OpenAI plans to include these metrics in future models' system cards.

🔗 CoT-Control research | OpenAI


Anthropic — Skill-creator and HTTP hooks

Improved skill-creator

March 3, 2026 — Anthropic releases a major update to its skill-creator tool for Claude Code and Claude.ai. The announcement introduces two formal types of Agent Skills:

| Type | Description | Durability |
|---|---|---|
| Capability uplift | Helps Claude do something it does not yet do well | May become obsolete if the model improves |
| Encoded preference | Encodes team processes and preferences | Durable; depends on fidelity to the real workflow |

New features: evals (automated tests) to verify that a skill produces the expected result; a benchmark mode to measure success rate, time, and token consumption; and multi-agent support to run evaluations in parallel without cross-contamination between tests. An A/B comparator mode allows comparing two versions of a skill. The skill-creator is available now on Claude.ai and Cowork; for Claude Code it installs as a plugin.

🔗 Improving skill-creator: Test, measure, and refine Agent Skills

HTTP hooks in Claude Code

March 4, 2026 — Claude Code launches HTTP hooks, an alternative to existing command hooks. Instead of running a local shell script, Claude Code sends an event to a user-chosen URL and waits for a response. Use cases: build a web app to visualize progress, manage permissions, or synchronize state between multiple Claude Code instances via a database. HTTP hooks work in plugins, custom agents, and managed enterprise settings.
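A receiving endpoint for such a hook can be very small. The sketch below uses only Python's standard library; the payload and response fields ("event", "decision") are assumptions for illustration, not Claude Code's documented hook schema.

```python
# Minimal sketch of an HTTP hook endpoint: the client POSTs an event as
# JSON and blocks until this handler responds with a decision.
# The "event"/"decision" field names are hypothetical, for illustration.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        # Example policy: log the event name and approve everything.
        print("hook event:", event.get("event", "unknown"))
        body = json.dumps({"decision": "allow"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging

def serve(port: int = 8787) -> HTTPServer:
    """Start the endpoint in a background thread and return the server."""
    srv = HTTPServer(("127.0.0.1", port), HookHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Pointing a hook at an HTTP URL like http://127.0.0.1:8787/ (instead of a shell command) is what makes the cross-instance use cases possible: the endpoint can write each event to a shared database that other Claude Code instances read.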

🔗 Tweet @dickson_tsai


Gemini CLI v0.32.0 — Generalist Agent by default

March 3, 2026 — Gemini CLI version 0.32.0 enables the Generalist Agent by default to improve task delegation and routing. The update also adds Model Steering directly in the workspace, improvements to Plan Mode (opening and editing plans in an external editor, multi-selection management for complex tasks), interactive shell autocompletion, and parallel loading of extensions for better startup performance.

🔗 Gemini CLI changelog


GitHub Copilot — Memory by default, mobile and metrics

Copilot Memory enabled by default

March 4, 2026 — GitHub enables Copilot Memory by default for all Pro and Pro+ plan users. The feature, previously in preview via opt-in, allows Copilot to retain persistent repository-level information: coding conventions, architectural patterns, critical dependencies.

Memories are strictly limited to a single repository and validated against current code before application, avoiding use of stale context. They automatically expire after 28 days. The feature is active on the coding agent, code review, and the Copilot CLI โ€” knowledge discovered by one agent is immediately available to others. Users can disable Copilot Memory in their settings (Settings > Features > Copilot Memory); Enterprise admins retain full control.

🔗 Copilot Memory now on by default for Pro and Pro+ users

Live notifications for agents in GitHub Mobile

March 4, 2026 — GitHub Mobile receives real-time notifications for Copilot agent sessions. Developers can follow their agents' progress, whether the session was started from a desktop or from the phone.

🔗 GitHub Mobile | Announcement on X

Grok Code Fast 1 in Copilot Free Auto

March 4, 2026 — GitHub adds xAI's Grok Code Fast 1 to Copilot Free's automatic model selection (Auto). This model can now be chosen by Copilot during chat sessions in Visual Studio Code, Visual Studio, JetBrains IDEs, Xcode, and Eclipse.

🔗 Grok Code Fast 1 in Copilot Free auto model selection

Copilot CLI metrics at user level

March 5, 2026 — GitHub expands Copilot usage metrics to include user-level CLI activity. This update follows last week's enterprise-level release. Admins can now identify active CLI users, view request and session counts, and track token consumption by user.

🔗 Copilot usage metrics: user-level CLI activity


Perplexity — GPT-5.4 and Voice Mode in Computer

GPT-5.4 Thinking available on Perplexity

March 5, 2026 — GPT-5.4 and GPT-5.4 Thinking are now accessible in Perplexity for Pro and Max subscribers. The Thinking version activates GPT-5.4's extended reasoning for deeper answers to complex queries.

🔗 Announcement on X

Voice Mode in Perplexity Computer

March 4, 2026 — Perplexity introduces a Voice Mode in Perplexity Computer. The interface, which already allowed searching, coding, and deploying projects, now accepts voice instructions directly.

🔗 Announcement on X


Cohere × Aston Martin F1 — multi-year partnership

March 4, 2026 — Cohere announces a multi-year partnership with the Aston Martin Aramco F1 team. Every team member will have access to enterprise models and Cohere's agentic AI platform (North) to work in one of the most demanding data environments in world sport. The Cohere logo will appear on the car starting at the 2026 Australian Grand Prix.

🔗 Cohere announcement on X


Black Forest Labs — Self-Flow, multi-modal research

March 4, 2026 — Black Forest Labs (creators of FLUX) releases Self-Flow in research preview. This approach trains multi-modal generative models (image, video, audio, text) without relying on external models for representation, using a self-supervised flow matching method.

Results shown: up to 2.8× faster cross-modal convergence, better temporal coherence in video, and crisper typographic rendering. Demos include a 4B-parameter video model trained on 6M videos, a 4B-parameter image model trained on 200M images, and a joint audio-video model. BFL frames the work as a step toward world models: "Self-Flow opens a path toward world models: combining visual scalability with semantic abstraction for planning and understanding."
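For readers unfamiliar with flow matching, the generic training objective that this family of methods builds on is compact enough to sketch. This is the standard (rectified) flow matching loss, not BFL's actual Self-Flow implementation:

```python
# Generic flow matching objective in NumPy: regress a velocity field
# toward the straight-line velocity between noise x0 and data x1.
# Illustrates the family of methods Self-Flow builds on, not BFL's code.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x1, model):
    """MSE between the model's predicted velocity and the target x1 - x0,
    evaluated at a random point on the straight noise-to-data path."""
    x0 = rng.standard_normal(x1.shape)       # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))   # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1               # point on the straight path
    target = x1 - x0                         # constant velocity along it
    return float(np.mean((model(xt, t) - target) ** 2))

# A degenerate model predicting zero velocity leaves the full target
# variance as loss; training drives a real network toward the target.
data = rng.standard_normal((64, 8))
zero_model = lambda xt, t: np.zeros_like(xt)
loss = flow_matching_loss(data, zero_model)
```

Self-Flow's claim is about where the training signal comes from: the representations are learned self-supervised rather than distilled from an external pretrained model, while the objective stays in this flow matching family.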

🔗 Tweet @bfl_ml


In brief

Runway launched a unified model hub on March 3, centralizing access to third-party image, video, audio, and language models directly within the platform. 🔗 Announcement

Claude reached #1 on the iOS App Store in 14 countries simultaneously on March 5: Australia, Austria, Belgium, Canada, France, Germany, Ireland, Italy, New Zealand, Norway, Singapore, Switzerland, United Kingdom, United States. 🔗 Tweet

Manus published its annual letter on March 5 for its first anniversary, highlighting user testimonials (a mother, an 86-year-old linguist, a florist). 🔗 Letter

Grok surpassed one million reviews on the US App Store. 🔗 Tweet @grok


What this means

GPT-5.4 confirms that computer use is moving from experimental to an integrated capability within a generalist model. The 75% score on OSWorld-Verified and the 47% token reduction via tool search are concrete measures of a paradigm shift: AI agents can now operate complex software interfaces without specialized infrastructure.

On the developer tools side, the week shows convergence: Anthropic improves how agent skills are tested and supervised, GitHub enables persistent memory for its coding agents, and Perplexity adds voice mode to its Computer agent. Agentic runtimes are gaining layers of memory, observability (HTTP hooks, mobile notifications) and natural interaction (voice).

NotebookLM's Cinematic Video Overviews illustrate a different axis: generating long-form educational content from personal sources. Gemini as director (analyze, critique, recombine) is an example of AI as a meta-production tool rather than just a generation assistant.


Sources: Introducing GPT-5.4 | OpenAI

This document was translated from the fr version into the en language using the gpt-5-mini model. For more information on the translation process, see https://gitlab.com/jls42/ai-powered-markdown-translator