Anthropic today publishes a report detailing industrial-scale distillation campaigns run by three Chinese labs (DeepSeek, Moonshot AI and MiniMax) that collected more than 16 million exchanges with Claude via roughly 24,000 fraudulent accounts. OpenAI, meanwhile, says it will stop reporting SWE-bench Verified as a reference for its frontier models after finding that 59.4% of an audited subset of the benchmark's tests are defective and that several state-of-the-art models memorized the reference fixes during training. On the tools side, gpt-realtime-1.5 improves the Realtime API's voice capabilities, WebSockets arrive in the Responses API for long-running agents, and Gemini rolls out new Veo 3.1 templates for video creation.
Anthropic: industrial distillation attacks by three Chinese labs
February 23 — Anthropic publishes a report revealing that DeepSeek, Moonshot AI (Kimi) and MiniMax conducted large-scale illicit distillation campaigns against the Claude models.
What happened
The three labs created around 24,000 fraudulent accounts to generate more than 16 million exchanges with Claude via the API, in violation of Anthropic’s terms of service and regional access restrictions — China does not have commercial access to Claude.
The technique used, model distillation, consists of training a lower-capability model on the outputs of a higher-capability model. Legitimate when used internally, it becomes illicit when competitors extract another lab’s capabilities without authorization.
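Mechanically, distillation can be shown with a toy example: a "student" model fitted only to a "teacher's" soft outputs recovers the teacher's behavior without ever seeing ground-truth data. The sketch below is a minimal pure-Python illustration, not any lab's actual pipeline; the logistic teacher and all hyperparameters are made up for the demo.

```python
# Toy illustration of model distillation: a "student" is trained only on the
# outputs (soft labels) of a "teacher", never on ground-truth data.
import math
import random

def teacher(x):
    """Black-box teacher: returns P(class=1) for scalar input x
    (a logistic model with w=2, b=-1)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 1.0)))

def train_student(queries, lr=0.5, epochs=2000):
    """Fit a logistic student on (x, teacher(x)) pairs by gradient
    descent on cross-entropy against the teacher's soft labels."""
    w, b = 0.0, 0.0
    data = [(x, teacher(x)) for x in queries]  # the "extracted" exchanges
    for _ in range(epochs):
        for x, p_t in data:
            p_s = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad = p_s - p_t  # d(cross-entropy)/d(logit) with soft targets
            w -= lr * grad * x
            b -= lr * grad
    return w, b

random.seed(0)
queries = [random.uniform(-3, 3) for _ in range(200)]
w, b = train_student(queries)
# The student's parameters converge toward the teacher's (w ≈ 2, b ≈ -1):
# the teacher's capability has been copied purely from its API outputs.
```

At frontier scale the same idea uses millions of prompt/response pairs instead of 200 scalars, which is why the account volumes in the table below matter.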
Volume by lab
| Lab | Exchange volume | Main targets |
|---|---|---|
| DeepSeek | Over 150,000 exchanges | Reasoning, rubric grading, censorship-safe alternatives |
| Moonshot AI (Kimi) | Over 3.4 million exchanges | Agentic reasoning, coding, computer use, vision |
| MiniMax | Over 13 million exchanges | Agentic coding, tool use, orchestration |
Notable techniques
The DeepSeek campaign stands out for prompts asking Claude to articulate its internal chain-of-thought step by step — thus generating large-scale chain-of-thought training data. Anthropic also detected tasks aimed at training DeepSeek to propose alternatives to politically sensitive questions.
Anthropic detected the MiniMax campaign while it was still active. When Anthropic released a new model, MiniMax redirected nearly half of its traffic to the new system within 24 hours — demonstrating automated monitoring of Anthropic’s outputs.
The infrastructure relied on “hydra cluster” architectures: networks of fraudulent accounts distributing traffic to the API and third-party cloud platforms. A single proxy network handled more than 20,000 accounts concurrently.
Anthropic’s response
Anthropic is deploying several countermeasures: classifiers and behavioral fingerprinting systems to detect distillation patterns, sharing technical data with other labs, cloud providers and authorities, tightening verifications for educational and research accounts, and developing product-, API- and model-level mitigations.
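Anthropic's classifiers are not public, but the flavor of behavioral fingerprinting can be hinted at with a toy heuristic: combine an account's request volume with how templated its prompts are. Every signal, threshold and weight below is a hypothetical illustration, not Anthropic's detection logic.

```python
# Hypothetical heuristic for flagging distillation-like API usage.
# A toy sketch only -- Anthropic's actual classifiers are not public.
from collections import Counter

def distillation_score(prompts, volume_threshold=1000):
    """Score an account in [0, 1]: high volume plus templated
    (low-diversity) prompts is a weak signal of automated extraction."""
    if not prompts:
        return 0.0
    volume = len(prompts)
    # Template reuse: share of prompts sharing the same first 5 words.
    heads = Counter(" ".join(p.split()[:5]) for p in prompts)
    reuse = max(heads.values()) / volume
    volume_signal = min(volume / volume_threshold, 1.0)
    return 0.5 * volume_signal + 0.5 * reuse

# A scripted extraction campaign vs. ordinary interactive usage:
bot = ["Explain your reasoning step by step for: task %d" % i
       for i in range(2000)]
human = ["How do I sort a list?", "Write a haiku", "Fix this SQL query"]
# distillation_score(bot) is near 1.0; distillation_score(human) is low.
```

Real systems would add many more signals (timing regularity, proxy reuse, output similarity across accounts), which is what makes "hydra cluster" traffic distribution an evasion tactic in the first place.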
“These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.” — @AnthropicAI on X
🔗 Anthropic report 🔗 Announcement @AnthropicAI
OpenAI drops SWE-bench Verified: 59.4% of audited tests defective
February 23 — OpenAI publishes an analysis explaining why the company will no longer report SWE-bench Verified scores and recommends the industry follow suit.
Context
Since its release in August 2024, SWE-bench Verified has been the reference standard for measuring progress on autonomous software development tasks. After a rapid rise — from 0% to 75% in one year — scores have plateaued between 74.9% and 80.9% over the last six months. OpenAI conducted a deep audit to determine whether this plateau reflects model limits or benchmark flaws.
Audit results: two major problems
On a subset of 138 audited problems (27.6% of the dataset), at least 59.4% have tests that reject functionally correct solutions. Breakdown of defects:
| Defect type | Share of audited problems |
|---|---|
| Tests too restrictive on implementation details | 35.5% |
| Tests checking functionality not specified in the prompt | 18.8% |
| Other defects (flaky tests, ambiguous specs) | 5.1% |
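The dominant defect class — tests pinned to implementation details — can be illustrated with a hypothetical example (not taken from SWE-bench): a patch that satisfies the spec fails a test asserting an exact error message the spec never required.

```python
# Illustration (invented, not from SWE-bench) of the most common defect
# class: a test pinned to an implementation detail rejects a correct fix.

def parse_port(s):
    """A functionally correct patch: reject ports outside 0..65535."""
    port = int(s)
    if not (0 <= port <= 65535):
        raise ValueError(f"port out of range: {port}")
    return port

def overly_restrictive_test():
    """Pins the exact error text, which the spec never required."""
    try:
        parse_port("99999")
    except ValueError as e:
        return str(e) == "invalid port 99999"  # fails: message differs

def behavioral_test():
    """Checks only the specified behavior: ValueError on out-of-range."""
    try:
        parse_port("99999")
    except ValueError:
        return True
    return False

# behavioral_test() passes while overly_restrictive_test() fails,
# even though the patch is correct against the specification.
```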
The second issue is training data contamination: SWE-bench problems come from widely used open-source repositories that are common in model training data. Using an automated red-teaming pipeline, OpenAI demonstrated that GPT-5.2, Claude Opus 4.5 and Gemini 3 Flash Preview can all reproduce the gold patches (reference fixes) verbatim for some problems — evidence these examples were seen during training.
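One simple way to surface such memorization — sketched here as an assumption, not OpenAI's actual red-teaming pipeline — is to look for long verbatim token runs shared between a model's patch and the gold patch:

```python
# Hedged sketch of a verbatim-overlap contamination check: long runs of
# consecutive tokens shared with the gold patch suggest memorization.
# The run-length threshold is an arbitrary illustration.

def longest_common_run(a_tokens, b_tokens):
    """Length of the longest contiguous token run in both sequences."""
    best = 0
    for i in range(len(a_tokens)):
        for j in range(len(b_tokens)):
            k = 0
            while (i + k < len(a_tokens) and j + k < len(b_tokens)
                   and a_tokens[i + k] == b_tokens[j + k]):
                k += 1
            best = max(best, k)
    return best

def looks_memorized(model_patch, gold_patch, min_run=20):
    """Long verbatim runs are evidence the gold patch was in training data."""
    return longest_common_run(model_patch.split(), gold_patch.split()) >= min_run

gold = " ".join(str(n) for n in range(40))              # stand-in gold patch
echoed = gold                                           # reproduced verbatim
rewrite = " ".join(str(2 * n + 1) for n in range(40))   # independent fix
# looks_memorized(echoed, gold) -> True
# looks_memorized(rewrite, gold) -> False
```

An independently written fix can still share individual tokens with the gold patch; only long contiguous runs are treated as a memorization signal here.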
Recommendations
OpenAI has stopped reporting SWE-bench Verified scores and recommends using SWE-bench Pro instead — its public split shows significantly less contamination. The company also calls on the academic community to invest in clean private benchmarks, like GDPVal (tasks authored by domain experts with holistic scoring).
OpenAI: gpt-realtime-1.5 and WebSockets in the Responses API
gpt-realtime-1.5 in the Realtime API
February 23 — OpenAI announces the availability of gpt-realtime-1.5 in the Realtime API. This new voice model replaces the previous version and brings improvements for real-time conversational applications.
gpt-realtime-1.5 offers better instruction following, more reliable tool use, and improved multilingual accuracy. Partners like Genspark measured concrete results during the alpha phase: human-connection rates rising from 43.7% to 66%, and a 97.9% accuracy rate on evaluated conversations. The model is available directly in the existing Realtime API with no infrastructure changes.
WebSockets in the Responses API
February 23 — OpenAI introduces WebSocket support in the Responses API, designed for long-running agents with heavy tool-call usage.
A persistent WebSocket connection allows sending only new inputs each turn, without retransmitting the entire context on every request. State is kept in memory between interactions, avoiding redundant recomputations. According to OpenAI, this approach speeds up agent runs with 20 or more tool calls by 20–40%.
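The transport-level saving is easy to quantify with a back-of-the-envelope model (this does not reproduce the Responses API wire format, nor the server-side recomputation savings behind OpenAI's 20–40% figure): stateless requests re-send the whole context each turn, so total bytes grow quadratically with the number of tool calls, while deltas grow linearly.

```python
# Back-of-the-envelope comparison of stateless re-sends vs. persistent
# delta sending over an agent run. Sizes are illustrative, not measured.

def stateless_bytes(turn_sizes):
    """Each request re-sends the entire accumulated context."""
    total, context = 0, 0
    for size in turn_sizes:
        context += size
        total += context
    return total

def delta_bytes(turn_sizes):
    """A persistent connection sends only the new input each turn."""
    return sum(turn_sizes)

turns = [2_000] * 50  # 50 tool-call turns of ~2 KB each
# stateless: 2 KB * (1 + 2 + ... + 50) = 2,550,000 bytes
# delta:     50 * 2 KB               =   100,000 bytes
```

The gap widens with run length, which is consistent with OpenAI positioning the feature for agents making 20 or more tool calls.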
🔗 Tweet @OpenAIDevs — announcement
Anthropic: The AI Fluency Index
February 23 — Anthropic publishes “The AI Fluency Index,” a research report that measures AI fluency among Claude users by analyzing their real behaviors.
The study tracked 11 distinct behaviors across thousands of conversations on Claude.ai — for example, how often users iterate and refine their work with Claude — to measure how people develop effective AI skills in practice. The report is part of a broader effort to understand AI adoption, and to educate users, beyond simple usage metrics.
“We tracked 11 behaviors across thousands of Claude.ai conversations—for example, how often people iterate and refine their work with Claude—to measure how people actually develop AI skill in practice.” — @AnthropicAI on X
Gemini: new Veo 3.1 templates for video creation
February 23 — Google rolls out new templates for Veo 3.1 in the Gemini app, simplifying AI-driven video creation for all users.
To access them: open gemini.google or the mobile app, then select “Create videos” from the tools menu. The template gallery appears, and each template can be customized with a reference photo and/or a text description.
This announcement comes during a busy week for the Gemini ecosystem: on February 19, Google launched Gemini 3.1 Pro with a 77.1% score on ARC-AGI-2, and on February 18, Lyria 3 introduced music generation directly in the app. Veo 3.1 templates complement this push toward multimodal creation within a single app.
Pika AI Selves: a documentary series autonomously produced by AI agents
February 23 — Pika announces that its “AI Selves” — AI extensions of a creator’s personality and skills — directed and edited their own documentary series autonomously, themed around their collaboration with humans at Pika.
Pika’s “AI Self” concept differs from classic AI agents: rather than a tool that executes tasks, an “AI Self” is an extension that embodies a creator’s skills, personality and aesthetic taste. The demo takes the form of a documentary series entirely produced by these AI entities, without human involvement in directing or editing.
What this means
Anthropic’s distillation case goes beyond a mere terms-of-service violation: it documents, at scale for the first time, how competing labs systematically extract a frontier model’s capabilities. The sophistication of the MiniMax operation — traffic redirection within 24 hours to a new model, a “hydra” infrastructure with 20,000 accounts — suggests continuous automated surveillance. Anthropic’s call for a coordinated response from industry and policymakers, in a landscape already shaped by chip export controls, opens a new front in the competition between AI labs.
OpenAI’s decision to abandon SWE-bench Verified is a structural signal for the industry: public coding benchmarks built from popular open-source repositories now sit inside the training data of the top-performing models. The recommended shift to SWE-bench Pro and private benchmarks like GDPVal signals a reconfiguration of evaluation standards, making public model comparisons harder to interpret.
On the tools side, OpenAI’s two announcements (gpt-realtime-1.5 and WebSockets) target concrete use cases: production voice agents and long-running agent runs with many tool calls. A 20–40% improvement from WebSockets is not trivial for workflows that chain 50 or 100 tool calls per session.
Sources
- Anthropic report — Detecting and Preventing Distillation Attacks
- Announcement @AnthropicAI — distillation
- OpenAI — Why We No Longer Evaluate SWE-bench Verified
- Announcement @OpenAIDevs — gpt-realtime-1.5
- Announcement @OpenAIDevs — WebSockets Responses API
- Anthropic — The AI Fluency Index
- Announcement @AnthropicAI — AI Fluency Index
- Announcement @GeminiApp — Veo 3.1 templates
- Announcement @pika_labs — AI Selves
This document was translated from the fr version into the en language using the gpt-5-mini model. For more information about the translation process, see https://gitlab.com/jls42/ai-powered-markdown-translator