
AutoAgent autonomously optimizes its own harness using the same model to reach #1 on Terminal-Bench and financial modeling in under 24 hours — Episode 26

AutoAgent autonomously optimizes its own harness using the same model to reach #1 on Terminal-Bench and financial modeling in under 24 hours.

April 05, 2026 · Ep 26 · 6 min read

What You Need to Know: The open-source AutoAgent project demonstrates that the biggest limiter for agent performance isn't the underlying model but the quality of the harness (tools, prompts, and evaluation loops). By creating a meta-agent that iteratively tweaks and tests configurations, it can turn an agent for almost any task into a self-improving domain expert. This week also saw strong community validation of the Gemma 4 models as practical local agents, particularly the 26B-A4B MoE variant on constrained hardware.

Top Story

Kevin Gu open-sourced AutoAgent, an AI system that builds a meta-agent capable of autonomously upgrading its own harness (tools, system prompts, and evaluation logic) until it ranks #1 on its target task. The same base model (Claude) is used for both execution and evaluation, which the author credits with providing superior failure analysis and iteration quality compared to human oversight. The system was demonstrated topping the rankings on Terminal-Bench (coding) and on spreadsheet-based financial modeling within 24 hours.

This directly addresses the widespread complaint that "agents suck" not because of model intelligence but because of brittle harnesses. Practitioners can now point this system at almost any domain-specific task and let it self-optimize. The full implementation is available on GitHub for immediate experimentation.

Source: github.com

Model Updates

Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B blind eval — r/LocalLLaMA

A 30-question blind test (code, reasoning, analysis, communication, meta-alignment) judged by Claude Opus 4.6 showed Qwen 3.5 27B winning 46.7% of matchups but suffering format/refusal failures on 10% of questions. Both Gemma 4 31B and the 26B-A4B MoE variant achieved identical 8.82 average scores, with Gemma 4 31B dominating communication and the MoE variant showing strong efficiency when it didn't error out. Gemma 4 31B also placed 3rd on the FoodTruck Bench, beating GLM 5, Qwen 3.5 397B, and several Claude Sonnet variants.

Source: reddit.com

Gemma 4 shines as local Windows and Mac agent — r/LocalLLaMA

Community reports highlight Gemma 4 (especially the 26B-A4B MoE) as an exceptionally strong local agent on Windows machines and 16GB hardware. Users report it outperforms Qwen 3.5 27B in real-world coding speed, multilingual support, systems and DevOps tasks, and creative work such as SVG generation and building a Doom-style raycaster in HTML/JS, with far fewer prompts and less looping behavior.

Source: reddit.com

Alibaba's Qwen team releases new reasoning algorithm — The Decoder

The Qwen team introduced a reinforcement learning variant that weights each reasoning step by its influence on future tokens rather than applying uniform rewards. This change reportedly doubles the length of coherent thought processes in reasoning models.
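The report doesn't include code, and the exact algorithm isn't public here, but the core idea (replacing a uniform per-step reward with influence-weighted credit assignment) can be sketched roughly as follows. The influence scores and function name are illustrative assumptions, not Qwen's published method:

```python
# Hypothetical sketch: distribute a trajectory-level reward over
# reasoning steps in proportion to each step's measured influence on
# future tokens, instead of giving every step the same credit.
# "influences" is a stand-in score (e.g. the drop in future log-prob
# when a step is ablated).

def weighted_step_rewards(final_reward, influences):
    """Split final_reward across steps proportionally to influence."""
    total = sum(influences)
    if total == 0:                      # degenerate case: fall back to uniform credit
        n = len(influences)
        return [final_reward / n] * n
    return [final_reward * w / total for w in influences]

# Uniform baseline gives every step the same credit:
uniform = weighted_step_rewards(1.0, [1, 1, 1, 1])
# Influence weighting concentrates credit on the pivotal step:
weighted = weighted_step_rewards(1.0, [0.1, 0.6, 0.2, 0.1])
```

The intuition for longer coherent chains: steps that actually steer later reasoning get reinforced more strongly, while filler steps stop being rewarded just for existing.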

Source: the-decoder.com

Agent & Tool Developments

AutoAgent self-improving harness — r/artificial

The core contribution is treating the agent harness itself as the optimizable artifact. A meta-agent iteratively modifies prompts, tool definitions, and evaluation criteria, using the same model to critique its own performance. This removes humans as the bottleneck for domain-specific agent training.

Source: reddit.com

OpenClaw with local models on mid-range hardware — r/LocalLLaMA

A one-click OpenClaw setup with Gemma 4 and Qwen 3.5 using TurboQuant caching and context warming now runs viable local agents on 16GB MacBook Air / Mac Mini at 10-15 tokens per second. Tool calling reliability was improved through patches to the llama.cpp TurboQuant implementation.

Source: reddit.com

Practical & Community

Kreuzberg v4.7.0 document intelligence library — r/LocalLLaMA

The Rust-core extraction library now supports 248 languages via tree-sitter, performs AST-level code intelligence (functions, classes, imports, symbols), and dramatically improved Structural F1 scores across 23 formats (LaTeX 0%→100%, XLSX 30%→100%, PDF tables 15.5%→53.7%). It ships as an OpenWebUI backend and includes TOON format for 30-50% lower token usage in agent prompts.

Source: reddit.com
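TOON's exact syntax may differ in detail, but the token-saving mechanism (stating the keys of a uniform array once instead of repeating them per element, as JSON does) can be illustrated with a small sketch. The `to_toon_table` helper below is hypothetical, not Kreuzberg's API:

```python
import json

def to_toon_table(name, rows):
    """Serialize a uniform list of dicts in a TOON-like tabular form:
    keys appear once in a header line, values as comma-joined rows."""
    keys = list(rows[0])
    lines = [f"{name}[{len(rows)}]{{{','.join(keys)}}}:"]
    for row in rows:
        lines.append("  " + ",".join(str(row[k]) for k in keys))
    return "\n".join(lines)

files = [{"id": 1, "lang": "py"}, {"id": 2, "lang": "rs"}]
toon = to_toon_table("files", files)
plain = json.dumps(files)
# The tabular form avoids repeating every key per element, which is
# where the reported 30-50% savings on uniform arrays would come from.
```

Even on this two-row toy input the tabular form is shorter than the JSON, and the gap grows with more rows and longer key names.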

scan-for-secrets 0.1 — Simon Willison's Weblog

Simon Willison released a simple Python tool that scans directories for API keys and secrets, including common encodings and escaping variants. Built with Claude Code via README-driven development and red/green TDD, it helps safely publish transcripts of local agent sessions.

Source: simonwillison.net
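The tool's own source is the reference, but the general shape of such a scanner (pattern-match known key prefixes, then re-scan escaped and base64-decoded variants of the text) might look like the sketch below. The patterns and helper name are illustrative, not scan-for-secrets' actual implementation:

```python
import base64
import re

# Illustrative key-shaped patterns; a real tool would carry many more.
KEY_PATTERNS = [re.compile(p) for p in (
    r"sk-[A-Za-z0-9]{20,}",        # OpenAI-style secret keys
    r"AKIA[0-9A-Z]{16}",           # AWS access key IDs
    r"ghp_[A-Za-z0-9]{36}",        # GitHub personal access tokens
)]

def find_secrets(text):
    """Scan text, its backslash-unescaped form, and any base64 blobs."""
    hits = []
    variants = [text, text.encode().decode("unicode_escape", errors="ignore")]
    for pat in KEY_PATTERNS:
        for variant in variants:
            hits += pat.findall(variant)
    # Decode plausible base64 runs and re-scan the decoded bytes too.
    for blob in re.findall(r"[A-Za-z0-9+/=]{24,}", text):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("ascii", "ignore")
        except Exception:
            continue
        for pat in KEY_PATTERNS:
            hits += pat.findall(decoded)
    return sorted(set(hits))
```

Catching the base64 case matters for agent transcripts, where tools often echo credentials back in encoded request bodies.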

Cadenza: Wandb integration for agents — r/artificial

A new CLI and Python SDK imports Wandb projects, indexes runs by config and metrics, and returns only high-performing experiments to agents. It's designed to reduce context rot and enable better exploration/exploitation tradeoffs in autonomous research loops.

Source: reddit.com
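Cadenza's actual API isn't shown in the post, but the selection step it describes (index runs, then hand the agent only the top performers in compact form) can be sketched with plain dictionaries. The field names here are assumptions for illustration:

```python
# Sketch of the Cadenza idea: instead of dumping every experiment into
# the agent's context, surface compact summaries of only the k best runs.

def top_runs(runs, metric, k=3):
    """Return compact summaries of the k best runs by `metric`,
    skipping runs that never logged it (e.g. crashed runs)."""
    scored = [r for r in runs if metric in r["summary"]]
    scored.sort(key=lambda r: r["summary"][metric], reverse=True)
    return [
        {"name": r["name"], metric: r["summary"][metric], "config": r["config"]}
        for r in scored[:k]
    ]

runs = [
    {"name": "run-a", "config": {"lr": 1e-3}, "summary": {"acc": 0.91}},
    {"name": "run-b", "config": {"lr": 1e-2}, "summary": {"acc": 0.84}},
    {"name": "run-c", "config": {"lr": 3e-4}, "summary": {"acc": 0.93}},
    {"name": "run-d", "config": {"lr": 1e-1}, "summary": {}},  # crashed run
]
best = top_runs(runs, "acc", k=2)
```

Returning only name, headline metric, and config keeps the payload small, which is the whole point of the context-rot mitigation.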

Under the Hood: Self-Improving Agent Harnesses

Everyone talks about "self-improving agents" as if the model simply gets smarter through some magical feedback loop. In practice, the breakthrough in AutoAgent is that it treats the harness — the system prompt, tool schemas, error recovery logic, and evaluation criteria — as the primary optimization target rather than the weights.

The core insight is that most agent failures are systematic and reproducible once you have a reliable critic. By using the same model family for both acting and judging, the critic shares the exact tokenization, instruction-following quirks, and failure modes of the executor. This creates tighter feedback than a human or different-model judge can provide. The meta-agent then performs a form of program synthesis over the configuration space: mutating prompts, adjusting JSON schemas, changing tool descriptions, and measuring success on a validation set.

This adds overhead (each iteration requires multiple full test runs) but removes the human bottleneck that traditionally made domain-specific agent tuning expensive. The quality gain is most pronounced on long-horizon or highly structured tasks where brittle tool calling or prompt drift kills performance. The gotcha that bites most teams is that the meta-optimization process itself needs careful bounding — without limits on search depth or compute budget, it can simply produce ever-more-complex but fragile configurations.
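As a concrete (and heavily simplified) illustration, a bounded harness-optimization loop of the kind described above might look like this sketch. The mutation operator and the toy critic are stand-ins: in AutoAgent the "mutation" is an LLM rewriting prompts and schemas, and the critic is the same model running full evaluation passes:

```python
import random

# Schematic sketch, not the AutoAgent implementation: mutate a harness
# config, score it on a validation set, keep strict improvements, and
# stop at a hard iteration budget so the search stays bounded.

def mutate(harness, rng):
    """Stand-in for the meta-agent proposing one edit (a prompt tweak,
    schema change, or tool-description rewrite)."""
    new = dict(harness)
    new["prompt_version"] = harness["prompt_version"] + 1
    new["temperature"] = round(
        max(0.0, harness["temperature"] + rng.uniform(-0.1, 0.1)), 2)
    return new

def optimize(harness, score_fn, budget=20, seed=0):
    rng = random.Random(seed)
    best, best_score = harness, score_fn(harness)
    for _ in range(budget):            # hard bound: no unbounded search
        cand = mutate(best, rng)
        s = score_fn(cand)             # full eval runs in practice: the expensive part
        if s > best_score:             # greedy: keep only strict improvements
            best, best_score = cand, s
    return best, best_score

# Toy critic: pretend validation accuracy peaks at temperature 0.3.
score = lambda h: 1.0 - abs(h["temperature"] - 0.3)
best, s = optimize({"prompt_version": 1, "temperature": 0.7}, score)
```

The `budget` parameter is the bounding the paragraph above warns about: without it, the loop happily keeps mutating toward ever-more-complex configurations that overfit the validation set.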

When to use this approach: for any recurring domain task where you can define clear success metrics and are willing to spend 10-100x more upfront compute to get a specialized harness. The alternative (hand-crafting prompts across many models) scales poorly as task complexity grows.

Things to Try This Week

  • Try AutoAgent on one of your existing agent tasks — point the meta-optimizer at your current harness for Terminal-Bench style coding or data analysis and see how many iterations it takes to outperform your manual version.
  • Run Gemma 4 26B-A4B locally (Unsloth IQ4_XS or IQ4_NL quants) for coding and vision tasks on 16GB hardware — use the recommended temperature 0.3 / min-p 0.1 settings and --image-min-tokens 300 for strong vision performance.
  • Test Kreuzberg v4.7.0 as your document extraction backend for agents — feed it code repositories or mixed PDFs/spreadsheets and compare token efficiency using the new TOON format.
  • Integrate Cadenza with your Wandb projects if you're running autonomous research agents — it dramatically reduces context rot by surfacing only high-performing experiment configurations.
  • Use scan-for-secrets before publishing any local Claude Code or agent transcripts — especially useful if you're sharing detailed logs from AutoAgent or OpenClaw runs.

On the Horizon

  • Expect further releases and quant optimizations for Gemma 4 models as the community continues to push local agent performance on consumer hardware.
  • Qwen team’s per-step weighted RL technique will likely appear in future Qwen 3.6 releases, potentially improving long-chain reasoning reliability.
  • More agent frameworks are expected to adopt structured document intelligence backends like Kreuzberg now that it offers clean AST-level code extraction and OpenWebUI integration.
  • Continued experimentation with same-model critic loops in open-source agent projects as the AutoAgent approach spreads.


Full Episode Transcript
Hey everyone, welcome to Models and Agents, episode twenty-six, for April fifth, twenty twenty-six. Let’s dive into what happened in the AI world today. And trust me — it’s been one of those genuinely exciting days where you can feel the pace of progress accelerating. The top story this episode is one of those moments that makes you sit back and smile, because it directly attacks one of the biggest complaints in the agent space. Kevin Gu open-sourced AutoAgent, a system that builds a meta-agent capable of autonomously upgrading its own harness until it reaches number one on whatever target task you give it. The “harness” here refers to everything outside the raw model: the system prompts, the tool definitions, the output schemas, the evaluation logic, the few-shot examples — basically all the brittle scaffolding that usually makes agents frustrating to deploy in production. What’s particularly elegant is that AutoAgent uses the same base model — in this case Claude — for both execution and evaluation. This shared architecture gives it superior failure analysis and iteration quality compared to human oversight or using a different judge model. The meta-agent doesn’t just guess what’s wrong; it understands the exact quirks, tokenization behavior, and instruction-following tendencies of the executor because they’re the same model family. In under twenty-four hours, this system topped Terminal-Bench for coding tasks and completely dominated spreadsheet-based financial modeling benchmarks. That’s not a small achievement. This result directly challenges the common complaint that “agents suck.” AutoAgent suggests that the real limiter has rarely been the underlying model itself — it’s almost always been the quality of the harness. Once you remove the human bottleneck from the optimization loop, the performance jump can be dramatic. For practitioners, this is huge.
You can now point this system at almost any domain-specific task — legal contract analysis, financial forecasting, DevOps automation, scientific literature review — and let it self-optimize into a specialized expert. The full implementation is already available on GitHub, and the barrier to experimentation is remarkably low. My take? This is one of the most important practical advances in agent development we’ve seen in the last year. It shifts the mindset from “how do I prompt this model better” to “how do I build a system that discovers the best possible harness for my use case.” The biggest limiter for agent performance isn’t intelligence — it’s configuration quality. AutoAgent turns that configuration problem into an optimization problem the model can solve itself. Let’s stay on this theme for a moment because the engineering insight here is deeper than it first appears. Everyone talks about self-improving agents as if the model simply gets smarter through some magical feedback loop. In practice, AutoAgent’s real breakthrough is that it treats the harness itself as the primary optimization target rather than trying to improve model weights. The core insight is that most agent failures are actually systematic and reproducible once you have a reliable critic. By using the same model family for both acting and judging, the critic shares the exact same strengths, weaknesses, tokenization quirks, and instruction-following patterns as the executor. This creates much tighter, more coherent feedback than a human reviewer or a different model family can provide. The meta-agent then performs a kind of program synthesis over the entire configuration space — mutating system prompts, adjusting JSON schemas, rewriting tool descriptions, changing error recovery strategies — and continuously measuring success on a held-out validation set. Of course, this comes with overhead. Each iteration requires multiple full test runs, which can get expensive. 
But it completely removes the human bottleneck that traditionally made high-quality domain-specific agent tuning prohibitively expensive and slow. The quality gains are most pronounced on long-horizon or highly structured tasks where brittle tool calling, prompt drift, or subtle schema mismatches usually kill performance after a few steps. That said, there’s an important gotcha: the meta-optimization process itself needs careful bounding. Without limits on search depth or compute budget, the system can drift toward ever-more-complex but increasingly fragile configurations that overfit to the validation set. So when should you actually reach for this approach? I’d say use it for any recurring domain task where you can define clear success metrics and you’re willing to spend ten to one hundred times more upfront compute to get a truly specialized harness. The old alternative of hand-crafting prompts across many different models scales extremely poorly as task complexity grows. Moving from frontier agent techniques to practical local deployment, this week also saw strong community validation of the new Gemma 4 models as genuinely useful local agents — especially the 26 billion parameter mixture-of-experts variant when running on constrained hardware. The local LLM community ran a fascinating thirty-question blind test covering code, reasoning, analysis, communication, and meta-alignment, with Claude Opus 4.6 acting as the judge. The results were closer than many expected. Qwen 3.5 27B won forty-six point seven percent of the matchups but suffered noticeable format and refusal failures on about ten percent of questions. Meanwhile, both Gemma 4 31B and the 26B A4B MoE variant achieved identical average scores of 8.82 — remarkably competitive. Gemma 4 31B particularly dominated in communication and writing tasks, while the 26B mixture-of-experts version showed impressive efficiency when it didn’t error out.
Even more impressively, Gemma 4 31B placed third on the challenging FoodTruck Bench, beating GLM 5, Qwen 3.5 397B, and several Claude Sonnet variants. On the practical side, community reports are increasingly highlighting Gemma 4 — especially that 26B A4B MoE — as an exceptionally strong local agent for Windows machines and 16GB hardware. Users consistently report it outperforming Qwen 3.5 27B in real-world coding speed, multilingual support, systems and DevOps tasks, and creative work. People are getting impressive results with far fewer prompts and much less looping behavior — things like generating clean SVGs on the first or second try, or building a surprisingly capable Doom-style raycaster in HTML and JavaScript. Alibaba’s Qwen team also dropped an interesting new reasoning algorithm this week that could have significant downstream impact on future models. They introduced a reinforcement learning variant that weights each reasoning step by its actual influence on future tokens, rather than applying the traditional uniform reward across all steps. Early reports suggest this change can roughly double the length of coherent thought processes in reasoning models before they start to degrade. This feels like a meaningful step toward more reliable long-chain reasoning, especially for local models that need to maximize capability within tight compute budgets. On the agent and tooling front, there were several notable releases that make local and autonomous workflows much more practical. First, there’s now a one-click OpenClaw setup that combines Gemma 4 and Qwen 3.5 with TurboQuant caching and context warming. This makes viable local agents possible on a 16GB MacBook Air or Mac Mini at ten to fifteen tokens per second. The team also pushed patches to the llama.cpp TurboQuant implementation that significantly improved tool calling reliability. In the practical tools section, the Kreuzberg document intelligence library just hit version 4.7.0.
This Rust-core extraction library now supports 248 languages through tree-sitter and delivers AST-level code intelligence for functions, classes, imports, and symbols. The improvements are striking: Structural F1 scores jumped across 23 different formats. LaTeX went from zero percent to one hundred percent, and PDF table extraction improved from fifteen point five percent to fifty-three point seven percent. It now ships as a clean OpenWebUI backend and includes a new TOON format that reduces token usage in agent prompts by thirty to fifty percent. This is going to be a game-changer for any agent that needs to deeply understand codebases or complex documents. Simon Willison also released a small but very useful Python tool called scan-for-secrets (v0.1). It scans directories for API keys and secrets, including common encodings and escaping variants. Built with Claude Code using README-driven development and red-green test-driven development, it’s perfect for safely publishing transcripts of local agent sessions without accidentally leaking credentials. Finally, there’s a new project called Cadenza that adds intelligent Wandb integration for agents. The CLI and Python SDK can import Wandb projects, index runs by configuration and metrics, and return only the highest-performing experiments to your agents. It’s specifically designed to reduce context rot and help autonomous research loops make better exploration-versus-exploitation tradeoffs. Here are the things you should try this week: 1. Take AutoAgent and point the meta-optimizer at one of your existing agent tasks. Give it your current harness for Terminal-Bench style coding or data analysis and see how many iterations it takes to outperform your manually tuned version. The results can be eye-opening. 2. Run Gemma 4 26B A4B locally using Unsloth’s IQ4_XS or IQ4_NL quants. Try it on coding and vision tasks on 16GB hardware. Use temperature 0.3, min-p 0.1, and image-min-tokens 300 for strong vision performance.
3. Test Kreuzberg 4.7.0 as your new document extraction backend. Feed it code repositories or mixed PDFs and spreadsheets, then compare token efficiency using the new TOON format. 4. If you’re running autonomous research agents, integrate Cadenza with your Wandb projects. The reduction in context rot is dramatic. 5. Get in the habit of running scan-for-secrets before publishing any local Claude Code or agent transcripts — especially detailed logs from AutoAgent or OpenClaw runs. On the horizon, we should expect more releases and quant optimizations for the Gemma 4 family as the community continues pushing local agent performance on consumer hardware. The Qwen team’s per-step weighted reinforcement learning technique will likely make its way into future Qwen 3.6 releases. We’ll also probably see more agent frameworks adopting structured document intelligence backends like Kreuzberg now that it offers clean AST-level code extraction and smooth OpenWebUI integration. And I expect the AutoAgent pattern of same-model critic loops to spread rapidly through other open-source agent projects. Before we go: tomorrow, keep an eye on how quickly the AutoAgent pattern gets adopted across the broader open-source ecosystem. This one feels like it could move fast. That wraps up today’s AI briefing. If you found this valuable, share it with a developer or builder who wants to stay current. Subscribe wherever you listen, and I’ll see you tomorrow. This podcast is curated by Patrick but generated using AI voice synthesis of my voice via ElevenLabs. The main reason is that I don’t have the time to generate all the content consistently myself, so I wanted to focus on delivering regular episodes for all the themes that I enjoy, and I hope others enjoy them as well.
