GLM-5 open-source, Sabotage Risk Report ASL-4, OpenAI launches agentic primitives

Z.ai launches GLM-5, its new flagship open-source model with 744 billion parameters under the MIT license, which rises to the top rank of open-source models on coding and agentic tasks. Anthropic publishes an ASL-4 sabotage risk report for Opus 4.6, OpenAI enriches its API with agentic primitives, and Kimi reveals a system of 100 parallel sub-agents. On the ecosystem side, Runway raises $315 million and ElevenLabs launches an expressive mode for its voice agents.

Z.ai launches GLM-5: 744B parameters, open-source under MIT license

February 11 — Z.ai (Zhipu AI) launches GLM-5, its new frontier model designed for complex systems engineering and long-duration agentic tasks. Compared to GLM-4.5, the model grows from 355B parameters (32B active) to 744B parameters (40B active), with pre-training data increasing from 23T to 28.5T tokens.

GLM-5 integrates DeepSeek Sparse Attention (DSA) to reduce deployment costs while preserving long-context capability, and introduces “slime”, an asynchronous reinforcement learning infrastructure that improves post-training throughput.

Benchmark	GLM-5	GLM-4.7	Kimi K2.5	Claude Opus 4.5	Gemini 3 Pro
SWE-bench Verified	77.8%	73.8%	76.8%	80.9%	76.2%
HLE (text)	30.5	24.8	31.5	28.4	37.2
HLE w/ Tools	50.4	42.8	51.8	43.4	45.8
Terminal-Bench 2.0	56.2	41.0	50.8	59.3	54.2
Vending Bench 2	$4,432	$2,377	$1,198	$4,967	$5,478

GLM-5 positions itself as the best open-source model on reasoning, coding, and agentic tasks, bridging the gap with proprietary frontier models. On Vending Bench 2, a benchmark that simulates managing a vending machine over a year, GLM-5 finishes with a balance of $4,432, approaching Claude Opus 4.5 ($ 4,967).

Beyond code, GLM-5 can directly generate .docx, .pdf, and .xlsx files — proposals, financial reports, spreadsheets — delivered turnkey. Z.ai deploys an Agent mode with built-in skills for document creation, supporting multi-turn collaboration.

The model weights are published on Hugging Face under the MIT license. GLM-5 is compatible with Claude Code and OpenClaw, and available on OpenRouter. Deployment is progressive, starting with Coding Plan Max subscribers.

🔗 GLM-5 Technical Blog 🔗 Announcement on X

Anthropic publishes first ASL-4 sabotage risk report

February 11 — Anthropic publishes a sabotage risk report for Claude Opus 4.6, in anticipation of the ASL-4 (AI Safety Level 4) safety threshold for autonomous AI R&D.

Upon the release of Claude Opus 4.5, Anthropic committed to writing sabotage risk reports for every new frontier model. Rather than navigating vague thresholds, the company chose to proactively respect the higher ASL-4 safety standard.

Element	Detail
Model evaluated	Claude Opus 4.6
Safety threshold	ASL-4 (AI Safety Level 4)
Domain	Autonomous AI R&D
Format	Public PDF report
Precedent	Commitment made during Opus 4.5 launch

This is a significant step in AI safety transparency: Anthropic is one of the first labs to publish such a sabotage report for a model in production.

When we released Claude Opus 4.5, we knew future models would be close to our AI Safety Level 4 threshold for autonomous AI R&D. We therefore committed to writing sabotage risk reports for future frontier models. Today we’re delivering on that commitment for Claude Opus 4.6. — @AnthropicAI on X

🔗 Anthropic Thread

OpenAI: new agentic primitives in the Responses API

February 10 — OpenAI introduces three new primitives in the Responses API for long-duration agentic work.

Server-side compaction

Allows multi-hour agent sessions without hitting context limits. Compaction is managed server-side. Triple Whale, an early access tester, reports having achieved 150 tool calls and 5 million tokens in a single session without loss of precision.

Containers with networking

Containers hosted by OpenAI can now access the internet in a controlled manner. Administrators define a whitelist of domains in the dashboard, requests must explicitly define a network_policy, and domain secrets can be injected without exposing raw values to the model.

Skills in the API

Native support for the Agent Skills standard with a first pre-built skill (spreadsheets). Skills are reusable and versioned bundles that can be mounted in hosted shell environments, and models decide at runtime whether to invoke them.

Primitive	Description	Status
Server-side compaction	Multi-hour sessions without context limits	Available
Containers with networking	Controlled internet access for hosted containers	Available
Skills in the API	Reusable bundles (first skill: spreadsheets)	Available

🔗 OpenAIDevs Thread

Kimi Agent Swarm: orchestration of 100 sub-agents

February 10 — Kimi (Moonshot AI) unveils Agent Swarm, a multi-agent coordination capability allowing the parallelization of complex tasks with up to 100 specialized sub-agents.

The system can execute more than 1,500 tool calls and achieves a speed 4.5x higher than sequential executions. Use cases cover simultaneous multi-file generation (Word, Excel, PDFs), parallel content analysis, and creative generation in multiple styles in parallel. Agent Swarm resolves a structural limit of LLMs: the degradation of reasoning during long tasks that fill the context.

🔗 Kimi Announcement

OpenAI Harness Engineering: zero lines of manual code with Codex

February 11 — OpenAI publishes feedback on building an internal software product with zero lines of code written manually. For 5 months, a team of 3 to 7 engineers used exclusively Codex to generate all code.

Metric	Value
Lines of code generated	~1 million
Pull requests	~1,500
PRs per engineer per day	3.5 on average
Internal users	Several hundred
Estimated time	1/10th of the time needed by hand
Codex sessions	Up to 6+ hours

The “Harness Engineering” approach redefines the role of the engineer: designing environments, specifying intent, and building feedback loops for agents, rather than writing code. Documentation structured in the repo serves as a guide (AGENTS.md as table of contents), the architecture is rigid with linters and structural tests generated by Codex, and recurring tasks scan for deviations and open refactoring PRs automatically.

🔗 Harness Engineering Blog

Runway raises $315 million in Series E

February 10 — Runway announces a $315 million Series E fundraising, bringing its valuation to$ 5.3 billion. The round is led by General Atlantic, with participation from NVIDIA, Adobe Ventures, AMD Ventures, Fidelity, AllianceBernstein, and others.

Detail	Value
Amount	$315M
Series	E
Valuation	$5.3B (vs$ 3.3B in Series D)
Lead investor	General Atlantic
Total raised since 2018	$860M

Funds will be used to pre-train the next generation of “world models” — models capable of simulating the physical world — and deploy them in new products and industries. This announcement comes after the launch of Gen-4.5, Runway’s latest video generation model.

🔗 Official Announcement 🔗 Runway Post on X

Cowork available on Windows

February 10 — Claude Cowork, the desktop application for multi-step tasks, is now available on Windows in research preview with full feature parity compared to macOS.

Feature	Description
File Access	Reading and writing local files
Plugins	Support for Cowork plugins
MCP Connectors	Integration with MCP servers
Folder Instructions	Claude.md style — natural language instructions per project

Cowork on Windows is available for all paid Claude plans via claude.com/cowork.

🔗 Cowork Windows Announcement

Free features on the Claude free plan

February 11 — Anthropic expands features accessible on the free Claude plan. File creation, connectors, skills, and compaction are now available without a subscription. Compaction allows Claude to automatically summarize previous context so that long conversations can continue without restarting.

🔗 Free plan Announcement

Claude Code Plan Mode in Slack

February 11 — The Claude Code integration in Slack receives Plan Mode. When giving Claude a code task in Slack, it can now elaborate a plan before executing, allowing validation of the approach before implementation.

Feature	Description
Plan Mode	Plan elaboration before execution
Automatic detection	Intelligent routing between code and chat
PR Creation	”Create PR” button directly from Slack
Prerequisites	Pro, Max, Team or Enterprise Plan + connected GitHub

🔗 Boris Cherny Thread

ElevenLabs launches Expressive Mode for its voice agents

February 10 — ElevenLabs unveils Expressive Mode for ElevenAgents, an evolution that makes its AI voice agents capable of adapting their tone, emotion, and emphasis in real-time.

The mode relies on Eleven v3 Conversational, a voice synthesis model optimized for real-time dialogue, coupled with a new turn-taking system that reduces interruptions. The price remains at $0.08 per minute. In parallel, ElevenLabs restructures its platform into three product families: ElevenAgents (voice agents), ElevenCreative (creative tools), and ElevenAPI (developer platform).

🔗 Expressive Mode Blog

Kimi K2.5 integrated on Qoder

February 9 — Qoder (AI platform for developers) deploys Kimi K2.5 as the flagship model of its marketplace, with a SWE-bench Verified score of 76.8% and an advantageous rate (0.3x credit in Efficient tier). The recommended workflow: use heavy models for design and architecture, then K2.5 for implementation.

🔗 Qoder Announcement

What this means

Open-source continues to progress rapidly towards frontier models. Z.ai’s GLM-5 narrows the gap with Claude Opus 4.5 and GPT-5.2 on coding and agentic task benchmarks, while being available under the MIT license. The publication of the ASL-4 sabotage report by Anthropic establishes a precedent for safety transparency that other labs will likely be compelled to follow.

On the developer side, OpenAI’s agentic primitives (server-side compaction, network containers, API skills) and the “Harness Engineering” approach outline a future where autonomous agents manage multi-hour sessions. Kimi Agent Swarm pushes this logic even further with the orchestration of hundreds of sub-agents in parallel.