Xiaohu AI Daily — 2026-06-07

🌟 Today's Headline

Qwen3.7-Plus is Alibaba's bid to turn multimodal AI into a full-blown autonomous agent

Alibaba released Qwen3.7-Plus, a multimodal agent model combining visual perception, GUI operation, and coding within a single autonomous loop. Demonstrations show the model independently developing functional applications like vocabulary learning tools, marking progress toward end-to-end agent capabilities.

💬 Editor's Note

The leap isn't just better performance—it's autonomy. Qwen3.7-Plus shifts from assistant-mode (waiting for instructions) to agent-mode (reading screens, writing code, shipping tasks independently). For content creators and ops teams, this means workflows you currently batch-script could run unattended. Execution capability matured faster than expected.

Product

New open-source voice model listens nonstop and decides every 0.4 seconds whether to speak or stay silent

9/10 New Product

Audio Interaction is an open-source voice model that handles real-time translation, transcription, and conversation without waiting for a recording to finish. Unlike GPT-4o or Qwen3.5-Omni, it decides every 0.4 seconds whether to speak or stay silent, which supports a continuous interaction flow.

Sakana AI bets AI that improves itself can break the compute arms race of frontier labs

9/10 News

Japanese startup Sakana AI launched a dedicated research lab for recursive self-improvement: AI systems that iteratively enhance their own capabilities. The initiative aims to challenge the compute-intensive arms race dominated by frontier AI labs by showing that smaller teams can compete through innovation.

Elon Musk's xAI reportedly trained its coding models on Claude outputs for months before getting cut off

9/10 News

Elon Musk's xAI reportedly trained its coding models on outputs from Anthropic's Claude for months. After Anthropic revoked access, xAI reportedly kept going through private accounts and the Blackbox AI service. Meanwhile, xAI's pretraining team shrank to fewer than five people as key researchers departed, a sign of internal trouble.

OpenAI and the Trump administration are negotiating a government stake in the AI startup

9/10 News

OpenAI is negotiating a direct government stake with the Trump administration and has proposed a 'Public Wealth Fund' that would distribute profits directly to American citizens. Senator Bernie Sanders is simultaneously pushing for a 50% tax on AI company shares. Together, the moves signal major regulatory and policy shifts in AI governance.

OpenAI unveils Lockdown Mode to protect sensitive data from prompt injection attacks

9/10 New Product

OpenAI introduced Lockdown Mode for ChatGPT to protect sensitive data from prompt injection attacks. The feature does not eliminate the vulnerability entirely, but it significantly reduces the likelihood of sensitive information being disclosed in enterprise environments.

Sriram Krishnan is leaving his role as White House AI advisor

9/10 News

Krishnan is reportedly starting a new institution to continue shaping Trump's AI policy.

🕐 ~3 min read · Tutorial 7/10

Five labs, five minds: building a multi-model finance drama on small models

💡 Can be adapted into tutorial material

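Thousand Token Wood v2 drives the agents of a financial simulation game with small models from four different labs (gpt-oss-20b, MiniCPM3-4B, Nemotron-Mini-4B, and a fine-tuned Qwen 0.5B). The core finding is that the friction in a heterogeneous serving layer comes from vLLM 0.22.1 requiring the CUDA toolkit, not from the models themselves. With a tolerant JSON parsing layer, adding a model takes a single config line. Information isolation keeps insider flags out of prompts, and a scan test verifies that nothing leaks. Memory is truncated to mood summaries so it does not flood the context. The fine-tuned 0.5B model reaches 0% self-trades and 100% valid quotes, and the truth firewall shows zero leaks. Small models are reliable format generators but unreliable reasoners, a gap that structure, prompting, and fine-tuning can close.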
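
The tolerant JSON parsing layer mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `parse_agent_json` and the `action`/`price` fields are hypothetical, and the only assumption is that small models tend to wrap a valid JSON object in extra chatter.

```python
import json
from typing import Optional


def parse_agent_json(raw: str) -> Optional[dict]:
    """Best-effort extraction of one JSON object from noisy model output."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    # Treat every "{" as a possible start and keep the first object that decodes.
    while start != -1:
        try:
            # raw_decode parses one JSON value at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)
    return None


if __name__ == "__main__":
    # Hypothetical reply: chatter before and after the JSON payload.
    reply = 'Sure! Here is my quote:\n{"action": "bid", "price": 12.5}\nGood luck.'
    print(parse_agent_json(reply))
```

Because the parser returns `None` instead of raising, a caller can retry or skip a turn without special-casing each model, which is one plausible reading of why adding a model would need only a config entry.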

🕐 ~3 min read · Tutorial 7/10

$0.07 for M3， $3.39 for Opus. Both caught 13 of 17 bugs. Really interesting breakdown from @kilocod…

💡 Can be adapted into tutorial material

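Claude Opus 4.8 and MiniMax M3 ran the same code audit: same codebase, same prompt, with 17 known bugs planted in advance. MiniMax M3 caught 13 for $0.07; the cheapest Claude run also caught 13, at a cost of $1.30. MiniMax called the comparison very interesting and well worth a read.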
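
To put the figures in the summary on a common footing, a short sketch can compute detection rate and cost per bug found. The numbers are taken from the summary above; the helper name `cost_per_bug` and the run labels are illustrative only.

```python
BUGS_PLANTED = 17

# (total cost in USD, bugs caught), as reported in the summary.
RUNS = {
    "MiniMax M3": (0.07, 13),
    "Cheapest Claude run": (1.30, 13),
}


def cost_per_bug(total_cost_usd: float, bugs_found: int) -> float:
    """Average spend per bug actually caught."""
    if bugs_found <= 0:
        raise ValueError("bugs_found must be positive")
    return total_cost_usd / bugs_found


if __name__ == "__main__":
    for name, (cost, found) in RUNS.items():
        rate = found / BUGS_PLANTED
        print(f"{name}: {rate:.1%} detected, ${cost_per_bug(cost, found):.4f} per bug")
    cost_ratio = RUNS["Cheapest Claude run"][0] / RUNS["MiniMax M3"][0]
    print(f"Cost ratio: {cost_ratio:.1f}x")
```

Both runs detect 13 of 17 bugs (about 76%), so the whole difference is price: roughly an 18.6x gap for the same catch count. A single audit on one codebase says nothing about which bugs each model missed.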

🕐 ~3 min read · Tutorial 6/10

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

💡 Can be adapted into tutorial material

SAGE-PTQ proposes a novel quantization framework that minimizes hidden scaling overhead in ultra-low-bit post-training quantization for LLMs. Using graph-guided saliency analysis, it achieves efficient model compression without sacrificing performance on large-scale deployments.
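
The "hidden cost of scales" can be illustrated with generic group-wise quantization bookkeeping. This sketch is not the SAGE-PTQ method; it only assumes that each group of weights shares one stored scale, and the 16-bit scale width and group sizes below are illustrative choices.

```python
def effective_bits_per_weight(weight_bits: int, scale_bits: int, group_size: int) -> float:
    """Average storage per weight when each group of weights shares one scale."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Each weight pays its own bits plus an equal share of the group's scale.
    return weight_bits + scale_bits / group_size


if __name__ == "__main__":
    weight_bits = 2  # nominal "2-bit" quantization
    for group_size in (32, 64, 128):
        bits = effective_bits_per_weight(weight_bits, scale_bits=16, group_size=group_size)
        overhead = bits / weight_bits - 1.0
        print(f"group={group_size}: {bits:.3f} bits/weight ({overhead:.1%} overhead)")
```

At 2 nominal bits, a 16-bit scale per group of 32 weights adds half a bit per weight, which is why scale storage matters more as bit widths shrink.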

New Product

Meta's Hatch AI agent could cost up to $200 a month and marks its first paid AI product

Meta developed Hatch, its first paid AI agent product priced up to $200/month. Users describe tasks in natural language, and Hatch autonomously builds tools, schedules appointments, sends emails, and handles complex workflows. CEO Mark Zuckerberg views it as a template for enterprise AI monetization.

Ollama v0.30.4

Ollama v0.30.4 includes updated llama.cpp and critical improvements to Windows cleanup procedures. The installer cleanup now properly terminates lingering llama-server.exe processes using taskkill /T to ensure all child processes are removed when Ollama is killed, preventing orphaned processes.
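
For readers unfamiliar with `taskkill /T`, the sketch below shows the general technique of ending a Windows process together with its child processes. It is not Ollama's installer code: the helper names are hypothetical, and adding `/F` (force) is an assumption, since the release note mentions only `/T`.

```python
import subprocess
import sys
from typing import List


def build_taskkill_command(image_name: str) -> List[str]:
    """Build a taskkill invocation that ends a process tree by image name."""
    # /IM selects processes by image name, /T also ends their child processes,
    # and /F forces termination (assumed here, not stated in the release note).
    return ["taskkill", "/IM", image_name, "/T", "/F"]


def kill_process_tree(image_name: str) -> int:
    """Run taskkill on Windows; return its exit code, or -1 on other platforms."""
    if sys.platform != "win32":
        return -1  # taskkill exists only on Windows
    result = subprocess.run(
        build_taskkill_command(image_name), capture_output=True, text=True
    )
    return result.returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the demo has no side effects.
    print(" ".join(build_taskkill_command("llama-server.exe")))
```

Without `/T`, ending the parent can leave its children running, which matches the orphaned-process problem the release note describes.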

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Toto 2.0 is a family of five open-weights time series forecasting models (4M to 2.5B parameters) trained under a unified recipe. Performance improves reliably with model size, and the family sets a new state of the art on three benchmarks: BOOM (observability), GIFT-Eval (general-purpose), and a contamination-resistant benchmark. This represents a significant open-source contribution to forecasting.

Opinion

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

This study analyzes data from a discontinued field experiment on Reddit's r/ChangeMyView that involved undisclosed AI-generated accounts. Following the public backlash, and with Reddit's authorization, the researchers examine the archived AI comments to understand how LLM agents engage and persuade real users in live debates.

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

This paper releases CUA-HandCrafted, a 793-episode benchmark testing whether prior prompt-injection attack techniques still work against current frontier computer-using agents. It covers 24 multi-step web tasks and 56 attack templates, auditing reproducibility of recent red-teaming research.

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

A bibliometric audit reports a systematic flaw in the academic LLM evaluation literature: studies test older, cheaper models (e.g., GPT-4o-mini zero-shot) and appear months or years later, when frontier systems such as GPT-5.5 Pro and Claude Opus 4.7 are already available. The result is capability misrepresentation and misleading conclusions.

Tutorial

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

CausalPhys is a benchmark of 3000+ video and image-based questions testing whether VLMs perform causal physical reasoning across four domains: Perception, Anticipation, Intervention, and Goal Orientation. It reveals that state-of-the-art models often produce plausible but incorrect answers.

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

SoCRATES is a benchmark for evaluating how well LLM mediators handle realistic multi-domain conflict resolution scenarios. It addresses limitations in existing testbeds by capturing real-time trajectories with shifting emotions and intentions, enabling more reliable evaluation.

Temporal Preference Concepts and their Functions in a Large Language Model

This paper investigates how LLMs internally represent and resolve tradeoffs between immediate gains and long-term consequences. Using causal analysis, researchers localized the neural subgraph responsible for temporal preference in Qwen3-4B, identifying key nodes in mid-to-upper layers.

📭 Skip Today

Auto-filtered. Here's why — so you know you're not missing out:

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment
→ Single-source paper, low reader value
Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs
→ Single-source paper, low reader value
SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
→ Single-source paper, low reader value
Temporal Preference Concepts and their Functions in a Large Language Model
→ Single-source paper, low reader value
Evaluating Agentic Configuration Repair for Computer Networks
→ Single-source paper, low reader value
Benchmark Everything Everywhere All at Once
→ Single-source paper, low reader value
Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming
→ Single-source paper, low reader value
Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation
→ Single-source paper, low reader value
