Anthropic hits hard with Claude Sonnet 4.6, a model that rivals Opus on many tasks at a Sonnet price. Meanwhile, Qwen releases Qwen3.5, the first open-weight model of its new series, with 397 billion parameters, and Google integrates Lyria 3, its music generation model, directly into Gemini.
Claude Sonnet 4.6: Opus performance at Sonnet price
February 17: Anthropic launches Claude Sonnet 4.6, described as the most capable Sonnet to date. The model is a comprehensive upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It ships with a 1 million token context window in beta.
The positioning is clear: performance that previously required an Opus model is now available at the Sonnet rate of $3/$15 per million input/output tokens (unchanged from Sonnet 4.5). Sonnet 4.6 becomes the default model on the Free and Pro plans in claude.ai and Claude Cowork.
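To make the pricing concrete, here is a minimal sketch of the per-request arithmetic. It assumes rates of $3 per million input tokens and $15 per million output tokens; the function name and the token counts are illustrative, and the sketch ignores caching, batch discounts, and any long-context surcharge.

```python
from decimal import Decimal

# Assumed rates in USD per million tokens (input, output).
INPUT_RATE = Decimal("3")
OUTPUT_RATE = Decimal("15")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of one request at the assumed rates."""
    return (INPUT_RATE * input_tokens + OUTPUT_RATE * output_tokens) / MILLION


# Hypothetical request: 20,000 tokens of context in, 2,000 tokens out.
cost = request_cost(20_000, 2_000)
print(f"${cost:.4f}")  # $0.0900
```

Output tokens dominate quickly: in this example 2,000 output tokens cost half as much as 20,000 input tokens.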
Benchmarks and user feedback
In Claude Code, testers preferred Sonnet 4.6 to Sonnet 4.5 about 70% of the time, reporting that it reads context more carefully before modifying code and consolidates shared logic instead of duplicating it. Even more notable: users preferred Sonnet 4.6 to Opus 4.5 (the frontier model of November 2025) 59% of the time, citing less over-engineering, less "laziness," and better instruction following.
| Benchmark | Result |
|---|---|
| SWE-bench Verified | 80.2% (with prompt modification) |
| OSWorld (computer use) | Major progress over 16 months |
| OfficeQA | Equals Opus 4.6 |
| Vending-Bench Arena | Emerging investment/pivot strategy |
Computer use progresses significantly. Sonnet 4.6 also resists prompt injection better than Sonnet 4.5, reaching a level comparable to Opus 4.6.
Associated product updates
The announcement comes with several general availability releases on the Claude API: code execution, memory, programmatic tool calling, tool search, and tool use examples. The web search and web fetch tools now support dynamic filtering: Claude automatically writes and executes code to filter search results, keeping only relevant content in context.
Link: Improved web search with dynamic filtering
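The idea behind dynamic filtering can be illustrated with a small sketch. This is not Anthropic's implementation (in the actual feature Claude writes the filtering code itself); the `SearchResult` type, the `filter_results` helper, and the keyword-matching rule are all hypothetical stand-ins that show how raw results can be cut down before they reach the context window.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    text: str


def filter_results(results, keywords, max_chars=500):
    """Keep only sentences that mention a keyword, capped per result."""
    wanted = [k.lower() for k in keywords]
    kept = []
    for result in results:
        sentences = [s.strip() for s in result.text.split(".") if s.strip()]
        relevant = [s for s in sentences if any(k in s.lower() for k in wanted)]
        if relevant:
            snippet = ". ".join(relevant)[:max_chars]
            kept.append(SearchResult(result.url, snippet))
    return kept


# Toy search results: one partly relevant page and one irrelevant page.
results = [
    SearchResult(
        "https://example.com/a",
        "Sonnet pricing is unchanged. The office has a new cafe.",
    ),
    SearchResult("https://example.com/b", "Weather was mild this week."),
]
filtered = filter_results(results, ["pricing"])
print(filtered)
```

Only the one relevant sentence from the first page survives; the second page is dropped entirely, which is the context saving the feature aims for.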
For Claude in Excel users, the add-in now supports MCP connectors (S&P Global, LSEG, Daloopa, PitchBook, Moody's, FactSet), available on Pro, Max, Team, and Enterprise plans.
Anthropic measures AI agent autonomy in real conditions
February 18: Anthropic publishes a study analyzing millions of human-agent interactions across Claude Code and the public API, with one goal: to understand how humans handle agent autonomy in practice.
Key results
| Metric | Value |
|---|---|
| Maximum autonomous duration (99.9th percentile) | ~45 minutes (doubled in 3 months) |
| Auto-approve rate (experienced users) | 40%+ (vs. 20% for new users) |
| Share of software engineering in API traffic | ~50% |
| Actions with guardrails | 80% |
| Actions with human in the loop | 73% |
| Irreversible actions | 0.8% |
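Because the headline duration figure is a 99.9th percentile, it describes the extreme tail rather than a typical session. The sketch below shows one common way to compute such a value, the nearest-rank method; the study's own method is not stated here, and the session durations are toy data, not figures from the study.

```python
import math
from fractions import Fraction


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Fraction avoids float rounding when p is a decimal such as 99.9.
    rank = math.ceil(Fraction(str(p)) / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Toy data: 998 sessions of 2 minutes and 2 sessions of 45 minutes.
durations_min = [2] * 998 + [45] * 2
print(percentile_nearest_rank(durations_min, 50))    # 2
print(percentile_nearest_rank(durations_min, 99.9))  # 45
```

In this toy sample the median session lasts 2 minutes while the 99.9th percentile is 45, which is why a tail statistic can double without the typical session changing much.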
A counter-intuitive finding: experienced users increase both the auto-approve rate AND the interruption rate. They move from action-by-action supervision to active monitoring with targeted intervention. Moreover, Claude stops to ask for clarifications more often than humans interrupt it, particularly on complex tasks.
The study concludes that there is a significant gap between capability and usage: the autonomy that models are capable of managing largely exceeds what they are granted in practice, a phenomenon researchers call "undeployed autonomy surplus."
Link: Full study
Anthropic: Rwanda and Infosys partnerships
February 17: Alongside the Sonnet 4.6 launch, Anthropic signs a memorandum of understanding with the government of Rwanda to deploy Claude in healthcare, education, and public administration. The partnership, led with the Ministry of ICT and Innovation, includes training civil servants and deploying an AI learning companion in eight African countries.
Anthropic also announces a collaboration with Infosys to build AI agents intended for telecommunications and other regulated industries.
Link: Rwanda Partnership
Qwen3.5-397B-A17B: first open-weight of the 3.5 series
February 16: Alibaba Qwen releases Qwen3.5-397B-A17B, the first open-weight model of the Qwen3.5 series. It is a significant advance, with a hybrid architecture combining linear attention and sparse Mixture-of-Experts (MoE).
| Feature | Details |
|---|---|
| Total parameters | 397B (hybrid MoE architecture) |
| Architecture | Hybrid linear attention + sparse MoE |
| Throughput | 8.6x to 19.0x higher than Qwen3-Max |
| Languages | 201 languages and dialects |
| License | Apache 2.0 |
| Training | Large-scale reinforcement learning |
| Specialty | Natively multimodal, real-world agent tasks |
The model is available immediately on Hugging Face, ModelScope, Alibaba Cloud Model Studio, and via Qwen Code. With 201 languages supported and an Apache 2.0 license, it is one of the most ambitious open-weight models of the moment in terms of linguistic coverage and inference throughput.
Link: Tweet @Alibaba_Qwen
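Sparse MoE is what lets a model with 397B total parameters run only a fraction of them per token (by Qwen's naming convention, the A17B suffix indicates roughly 17B activated parameters). The sketch below shows generic top-k expert routing in plain Python. It is not Qwen's router: the expert functions, the router scores, and `k=2` are made-up values for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Four toy "experts"; real experts are feed-forward networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token
gates = route_top_k(logits, k=2)
print(sorted(gates))                    # [1, 3]
print(moe_layer(3.0, experts, logits))  # 7.5
```

Experts 0 and 2 are never evaluated for this token, so their parameters cost nothing at inference time; that is the source of the gap between total and active parameter counts.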
Google Lyria 3: music generation arrives in Gemini
February 18: Google and DeepMind present Lyria 3, an AI music generation model integrated directly into the Gemini application. Users can create 30-second music tracks from text prompts, photos, or videos, with custom lyrics generation.
| Feature | Details |
|---|---|
| Inputs | Text, images, videos |
| Output | 30-second audio tracks |
| Customization | Varied musical styles, generated lyrics |
| Availability | Beta in Gemini (users 18 and older) |
Lyria 3 demonstrates notable flexibility in instrument and genre combinations, allowing creations ranging from jingles to lo-fi compositions. The global rollout is gradual.
Link: Tweet @GoogleAI
OpenAI EVMbench: security benchmark for smart contracts
February 18: OpenAI and Paradigm launch EVMbench, a benchmark evaluating the ability of AI agents to detect, fix, and exploit vulnerabilities in Ethereum smart contracts. The benchmark relies on 120 curated vulnerabilities from 40 audits (mainly Code4rena competitions).
| Mode | Description | GPT-5.3-Codex | GPT-5 (6 months earlier) |
|---|---|---|---|
| Exploit | Execute fund-draining attacks | 72.2% | 31.9% |
| Detect | Audit and detect vulnerabilities | Below full coverage | - |
| Patch | Fix while preserving functionality | Below full coverage | - |
An interesting finding: AI agents perform better at exploitation, where the objective is explicit, than at detection and patching, where they often give up after the first vulnerability found. OpenAI reaffirms its commitment of $10M in API credits for defensive cybersecurity.
GLM-5 Technical Report: Z.ai documents its model
February 18: Z.ai publishes the full GLM-5 technical report, detailing the architectural innovations of the model launched on February 11 (744B parameters, 40B active, MIT License).
The report documents three key innovations: Dynamic Sparse Attention (DSA) to reduce training and inference costs, an asynchronous RL infrastructure that decouples generation from training, and RL algorithms for agents that support complex, long-horizon interactions. The report is available on arXiv.
Links: Tweet @Zai_org · arXiv
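Decoupling generation from training means rollouts are produced and consumed by separate loops that communicate through a buffer, so neither waits for the other to finish a full pass. The following is a generic producer/consumer sketch of that pattern, not Z.ai's infrastructure; the function names, the queue size, and the counter standing in for a gradient update are all assumptions for illustration.

```python
import queue
import threading


def generator(policy_version, out_queue, num_rollouts):
    """Produce rollouts tagged with the policy version that generated them."""
    for step in range(num_rollouts):
        out_queue.put({"step": step, "policy_version": policy_version["v"]})
    out_queue.put(None)  # sentinel: no more rollouts


def trainer(policy_version, in_queue, batch_size, log):
    """Consume rollouts in batches; each full batch bumps the policy version."""
    batch = []
    while True:
        item = in_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            policy_version["v"] += 1  # stand-in for a gradient update
            log.append(len(batch))
            batch = []


policy_version = {"v": 0}
rollouts = queue.Queue(maxsize=8)  # bounded buffer between the two loops
updates = []
threads = [
    threading.Thread(target=generator, args=(policy_version, rollouts, 12)),
    threading.Thread(target=trainer, args=(policy_version, rollouts, 4, updates)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(policy_version["v"], updates)  # 3 [4, 4, 4]
```

One consequence of this design is that a rollout may have been generated by a slightly older policy than the one being updated, which is why each rollout carries its `policy_version` tag.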
Cohere Labs Tiny Aya: ultra-compact multilingual AI
February 17: Cohere Labs presents Tiny Aya, a family of small language models supporting 70+ languages with only 3.35 billion parameters. The goal: to make multilingual AI accessible everywhere, including on phones and offline.
Tiny Aya targets three audiences: researchers working in non-English languages, developers building for digitally underserved communities, and embedded applications requiring reliable translation without cloud dependency. The model includes an offline translation capability, improving privacy and reducing latency.
Link: Tweet @cohere
Runway Gen-4.5 available via API + Claude Code Skill
February 17: Runway opens access to Gen-4.5 via its API, allowing developers to integrate image, video, and audio generation directly into their projects. The announcement is accompanied by a dedicated Claude Code Skill, available on GitHub, which lets developers generate Runway multimedia content without leaving the development environment.
Links: Tweet @runwayml · GitHub Skills
Manus Agents: personal agent with long-term memory
February 16: Manus launches Manus Agents, a capability that gives each user a personal agent directly in chat conversations. The agent combines long-term memory (style, tone, and retained preferences), full creation capabilities (videos, slides, sites, images), and direct integrations with Gmail, Calendar, and Notion.
Link: Tweet @ManusAI
ElevenAgents for Support
February 17: ElevenLabs launches ElevenAgents for Support, conversational AI agents for customer support. Operating across voice and digital channels in more than 70 languages, these agents build on the ElevenLabs agent platform and its 4M+ production deployments.
Link: ElevenLabs Agents
NotebookLM x Zillow: real estate notebook
February 18: In partnership with Zillow, NotebookLM launches a free Featured Notebook for home buyers, centralizing expert advice on financial preparation, market assessment, and buying procedures.
Link: Tweet @NotebookLM
What this means
This week illustrates two major trends. The first is the democratization of frontier performance: Sonnet 4.6 brings Opus capabilities at a rate 5 times lower, while Qwen3.5 makes a 397B parameter model available under Apache 2.0. The second is the expansion of AI agents into new areas: the Anthropic study shows that the longest autonomous sessions have doubled in three months, and players like Manus, ElevenLabs, and Runway are building specialized agents (personal chat, customer support, multimedia creation).
The arrival of music generation in Gemini with Lyria 3 and the EVMbench benchmark for blockchain security also show that generative AI and security AI continue to take shape as distinct fields.
Sources
- Introducing Claude Sonnet 4.6 (Anthropic)
- Measuring AI agent autonomy in practice (Anthropic)
- Anthropic + Rwanda MOU
- Qwen3.5-397B-A17B (@Alibaba_Qwen)
- Lyria 3 (@GoogleAI)
- EVMbench (OpenAI)
- GLM-5 Technical Report (@Zai_org)
- Tiny Aya (@cohere)
- Runway Gen-4.5 API (@runwayml)
- Manus Agents (@ManusAI)
- ElevenAgents for Support (ElevenLabs)
- NotebookLM x Zillow (@NotebookLM)
- Improved web search with dynamic filtering (Claude Blog)
- Claude API improvements (@claudeai)