
Gemma 4 delivers massive gains across European languages while a 25.6M Rust model achieves 50× faster inference via hybrid attention — Episode 27

Gemma 4 delivers massive gains across European languages while a 25.6M Rust model achieves 50× faster inference via hybrid attention.

April 07, 2026 · Ep 27 · 6 min read


What You Need to Know: Google’s Gemma 4 (especially the 31B variant) has climbed into the top three on nearly every major European language leaderboard tracked by EuroEval, often beating much larger models on Danish, Dutch, French, Italian, and Finnish. Meanwhile, an independent researcher released a byte-level hybrid-attention model that replaces standard attention with linear-quadratic-linear stages plus a learned gate, delivering 286 tokens/sec on an RTX 4060 Ti versus 5.6 tokens/sec before, with almost no quality regression. The community is also actively exploring “coding agents as general-purpose agents,” collapsing traditional pipelines into tool-using loops that read, write, and execute code on the fly. Pay attention to the tension between pure scaling, clever architecture, and agentic workflow design: this week’s papers and posts show all three delivering measurable gains.

Top Story

A new 25.6M-parameter Rust-focused language model trained from scratch demonstrates that data scale still dominates architectural innovation even at tiny sizes. The author forked PyTorch and Triton to implement hybrid attention (local windowed attention + GRU-like recurrent state mixed by a learned gate) and replaced standard attention with a linear-first, quadratic-middle, linear-last pattern. While the architectural change produced a dramatic 50× inference speedup (5.6 → 286 tokens/sec on an RTX 4060 Ti with KV-cache keeping only a recent window in VRAM), expanding the training corpus from 31 MB to 173 MB of Rust crates delivered far larger reductions in validation loss (final perplexity 2.15). The model generates plausible syntax but still shows weak semantic consistency and repetition. The work reinforces a pragmatic lesson: at small scales, more high-quality data usually beats clever architecture, yet the hybrid mechanism remains a compelling inference optimization worth testing on your own edge workloads.

Source: reddit.com
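For intuition, the gated mixing described above can be sketched in plain Python. This is a toy illustration of the idea, not the author's implementation: scalars stand in for vectors, a simple windowed average stands in for local windowed attention, and the gate is a fixed constant rather than learned.

```python
def hybrid_mix(tokens, window=4, alpha=0.9, gate=0.5):
    """Toy sketch of the gated hybrid mixer: blend a local windowed
    average (standing in for windowed attention) with a GRU-like
    running state that lossily compresses all earlier history.
    Scalars are used instead of vectors for clarity."""
    state = 0.0
    out = []
    for t, x in enumerate(tokens):
        # local path: only the last `window` tokens are visible
        local = sum(tokens[max(0, t - window + 1):t + 1]) / min(t + 1, window)
        # recurrent path: exponential compression of the full history
        state = alpha * state + (1 - alpha) * x
        # gate (fixed here, learned in the real model) blends the paths
        out.append(gate * local + (1 - gate) * state)
    return out
```

The lossiness of the recurrent path in this sketch mirrors why the real model's semantic consistency lags: distant context survives only as a compressed running state.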

Model Updates

Gemma 4 shines on European languages: EuroEval leaderboard

Gemma 4 31B ranks 3rd on Dutch, 2nd on Danish, 3rd on English, 1st on Finnish, 2nd on French, 5th on German, 2nd on Italian, and 3rd on Swedish, a huge leap for a model of its size. Community members are now asking whether real-world usage matches the impressive benchmark numbers, especially for non-English coding, legal, or creative tasks. Early feedback suggests the multilingual improvements are noticeable even in smaller variants, making Gemma 4 an attractive drop-in replacement for Claude or GPT-4o-class models when token cost or latency matters.

Source: reddit.com

Thinking of using Gemma 4 E2B as a local preprocessor for Claude Code

A developer is prototyping a Bun-based proxy that sits in front of the Claude API: Gemma-4-E2B (via llama.cpp) translates Korean→English, prunes irrelevant context, and optionally performs initial reasoning before the expensive call. The setup caches results in SQLite WAL mode. The open question is whether pre-supplying reasoning actually reduces the paid model’s token usage or whether Claude simply re-does the work internally. Intel Mac users are particularly interested in real tokens/sec numbers for llama.cpp on non-Apple Silicon hardware.

Source: reddit.com
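The caching layer is straightforward to sketch. The actual proxy is Bun-based; the following is an illustrative Python equivalent using the stdlib sqlite3 module, with a hypothetical one-table schema keyed by prompt hash.

```python
import hashlib
import sqlite3

def open_cache(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")  # WAL lets readers proceed during writes
    db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)")
    return db

def cached_call(db, prompt, expensive_fn):
    """Return a cached response if this exact prompt was seen before;
    otherwise pay for the expensive call once and store the result."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    row = db.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]
    value = expensive_fn(prompt)
    db.execute("INSERT INTO cache VALUES (?, ?)", (key, value))
    db.commit()
    return value
```

Note the cache only removes exact repeats; whether pre-supplied reasoning actually lowers the paid model's token usage remains, as the post notes, an open question.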

Agent & Tool Developments

Coding agents as general-purpose AI agents

Reddit user Individual-Library-1 reports using a Pi coding-agent SDK with read/write/bash tools to handle document knowledge bases, structured extraction from 100+ PDFs, and database benchmarking, all without a vector database. Traditional pipelines collapse into agent loops: RAG becomes “read index, choose files, open them”; ETL becomes “write script, run, inspect, retry.” The pattern has scaled to ~600 documents; the main open question is what breaks first at larger scale: cost, latency, reliability, or context management. The author open-sourced the code for others to inspect.

Source: reddit.com
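The collapse of pipelines into tool loops is easy to picture as code. A minimal sketch, assuming only the three tools the post mentions; the model-driven planning step is stubbed out as a fixed list of steps rather than live model calls.

```python
import pathlib
import subprocess

# the three primitives from the post: read, write, bash
TOOLS = {
    "read":  lambda path: pathlib.Path(path).read_text(),
    "write": lambda path, text: pathlib.Path(path).write_text(text),
    "bash":  lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True).stdout,
}

def run_agent(plan):
    """Execute (tool, args) steps in order. A real agent would ask the
    model for each next step, feed back the observation, and retry on
    failure instead of following a fixed plan."""
    observations = []
    for tool, args in plan:
        observations.append(TOOLS[tool](*args))
    return observations
```

In this framing, "RAG" is just a plan that reads an index file, picks paths, and opens them; "ETL" is a plan that writes a script and runs it via bash.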

VisionClaw: always-on AI agents through Meta Ray-Ban smart glasses

VisionClaw turns Ray-Ban smart glasses into an always-on egocentric agent that continuously perceives the world and lets users initiate speech-driven tasks (add objects to Amazon cart, generate notes from documents, create calendar events from posters, control IoT). A controlled lab study (N=12) and longitudinal deployment (N=5) showed faster task completion and reduced interaction overhead versus non-agentic and non-always-on baselines. Interaction patterns shifted toward opportunistic initiation and delegation rather than manual control.

Source: arxiv.org

I built an AI SRE agent (Vyuha) with GLM-5.1 that autonomously triages cloud outages

A hackathon project created a triple-cloud (AWS/Azure/GCP) recovery orchestrator using GLM-5.1 as reasoning engine, FastAPI control plane, and evolutionary SQLite memory that learns from every human-approved failover. The agent gathers context, proposes JSON-formatted fixes, and only acts after human approval. Reflection phase writes incident learnings back into memory so future diagnoses improve. The live demo lets you “hard-kill” nodes and watch the system react in real time.

Source: reddit.com
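The human-approval gate at the heart of the design can be sketched as follows. Function and field names here are hypothetical illustrations, not taken from the Vyuha codebase.

```python
import json

def triage_step(proposal_json, approve, apply_fn):
    """Gate an agent-proposed fix behind explicit human approval.
    `proposal_json` is the model's JSON-formatted fix, `approve` is the
    human sign-off callback, and `apply_fn` performs the failover."""
    fix = json.loads(proposal_json)
    if not approve(fix):
        # rejected proposals are returned unapplied (and, in the real
        # system, would still be logged for the reflection phase)
        return {"applied": False, "fix": fix}
    return {"applied": True, "result": apply_fn(fix)}
```

Writing both approved and rejected outcomes back into memory is what lets the reflection phase improve future diagnoses.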

Practical & Community

Minimal container images fortify AI agent security

New guidance highlights how drastically shrinking container images reduces attack surface for deployed AI agents. Smaller images mean fewer dependencies, smaller SBOMs, and faster cold starts—practical wins for both security and operational efficiency when running autonomous agents at scale.

Source: news.google.com

Flowise AI Agent Builder under active CVSS 10.0 RCE exploitation; 12,000+ instances exposed

Security researchers identified critical remote-code-execution vulnerabilities in publicly deployed Flowise instances. The issue is under active exploitation in the wild. Teams running Flowise or similar low-code agent builders should immediately isolate internet-facing instances, apply available patches, and follow responsible disclosure practices. This incident underscores the operational security debt that comes with rapidly shipping agent tooling.

Source: news.google.com

Under the Hood: Hybrid Attention for Tiny Models

Everyone talks about “hybrid attention” as if it’s a simple performance dial you turn on. In practice it is a carefully engineered compromise between local syntax, long-range state, and memory bandwidth. The core insight is to split attention into three stages (linear projection, a narrow quadratic window that still captures short-range dependencies, and another linear stage) while injecting a GRU-style recurrent state that compresses history. A learned gate then blends the local and recurrent paths so the model can decide, token by token, whether syntax or distant context matters more.

Keeping only the recent window in the KV cache while the recurrent state lives in tiny fixed-size registers delivers the 50× speedup; the quadratic middle layer never sees the full context. The quality-cost tradeoff is revealing: perplexity barely moves, yet semantic coherence still lags because the recurrent compression is lossy. Above ~30M parameters the architectural gain shrinks rapidly, and data scale reasserts dominance.

The gotcha that bites most teams is assuming the hybrid mechanism will magically fix repetition; it reduces latency dramatically but does not replace the need for high-quality, diverse training data and proper temperature scheduling. Use hybrid attention when you are latency-bound on edge hardware and your context is mostly local (code, chat, control loops). For deep reasoning or long-document synthesis, a classic transformer with efficient attention (FlashAttention-2, MLA, etc.) remains the safer default.
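The memory behavior is the easy part to verify yourself. A sketch of a windowed KV cache using a simple deque; the real implementation lives in the author's forked PyTorch/Triton kernels, so this is illustrative only.

```python
from collections import deque

class WindowKV:
    """KV cache that keeps only the most recent `window` entries, so
    VRAM use is constant in context length; anything older can only
    survive through the recurrent state."""
    def __init__(self, window):
        self.kv = deque(maxlen=window)

    def append(self, k, v):
        self.kv.append((k, v))  # oldest pair is evicted automatically

    def __len__(self):
        return len(self.kv)
```

The quadratic attention stage only ever attends over these `window` entries, which is where both the memory bound and the speedup come from.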

Things to Try This Week

  • Spin up the open-sourced coding-agent SDK from the LocalLLaMA post and replace one manual ETL or RAG pipeline with a read/write/bash agent loop—see how far you get before context or cost becomes the bottleneck.
  • Download Gemma 4 (9B or 31B) via Hugging Face and run the EuroEval languages you actually use; compare Korean or French token efficiency versus Claude 3.5/4 to quantify potential API savings.
  • Test the hybrid-attention 25.6 M Rust model (or replicate the linear-quadratic-linear pattern in your own small decoder) on a local 4060 Ti or laptop GPU to measure real-world tokens/sec versus a standard transformer of similar size.
  • Deploy VisionClaw-style egocentric prompting on your own Ray-Ban glasses or any always-on camera feed; try opportunistic “add this to cart” or “summarize this poster” flows and note how delegation changes your interaction habits.
  • If you run Flowise or any low-code agent builder, audit internet-exposed instances today and move to minimal container images or air-gapped setups—security debt compounds fast once agents can call tools.

On the Horizon

  • Continued community exploration of memory topologies in lifelong LLM multi-agent systems (LLMA-Mem style) expected to surface new open-source implementations within weeks.
  • More agentic-Federated-Learning papers and prototypes likely following the Agentic-FL paradigm shift toward LM-agent orchestration of clients and servers.
  • A2A-Agentization benchmarks and tooling for turning ordinary web digital assets into interoperable agents should see rapid iteration now that the formal framework and evaluation suite are public.
  • Watch for production deployments of Cognitive Fabric Nodes (CFN) middleware—early results show >10 % gains on HotPotQA and MuSiQue by turning memory into an active topology, grounding, and security layer.


Full Episode Transcript
Hey everyone, welcome to Models and Agents, episode twenty-seven. It's April seventh, twenty twenty-six. Your daily briefing on the A I models and agents that are changing everything. And no, not THOSE kinds of models and agents. Let's get into it. Gemma 4 delivers massive gains across European languages while a 25.6M Rust model achieves 50× faster inference via hybrid attention. This week the A I world is showing us three different ways to get better results. Pure scaling is still powerful, but clever architecture can deliver ridiculous speedups at the edge. And agentic workflows are collapsing entire traditional pipelines into simple tool-using loops. There is real measurable progress on all three fronts, which makes this an especially fun week to dig in. The top story comes from an independent researcher who built a brand new 25.6 million parameter language model trained entirely on Rust code. What makes it interesting is not just the size but how it was built from scratch to test a very specific idea. The author forked Pie Torch and Triton to implement a hybrid attention mechanism. It combines local windowed attention with a GRU-like recurrent state, mixed together by a learned gate. They also replaced standard attention with a linear-quadratic-linear pattern. The results on inference speed are dramatic. On an RTX 4060 Ti the model jumped from 5.6 tokens per second to 286 tokens per second. That is a fifty times speedup. It achieves this by keeping only a recent window in the KV cache while the recurrent state lives in tiny fixed-size registers. Expanding the training data from 31 megabytes to 173 megabytes of Rust crates delivered much larger reductions in validation loss than the architectural tricks alone. Final perplexity landed at 2.15. The model generates plausible syntax but still shows weak semantic consistency and some repetition. The clearest takeaway is that at these tiny scales, high-quality data usually beats clever architecture. 
Yet the hybrid mechanism itself is a compelling inference optimization worth testing on your own edge workloads. Now let's talk about Gemma 4. Google's latest model, especially the 31 billion parameter version, has climbed into the top three or better on nearly every major European language leaderboard according to EuroEval. It ranks third on Dutch, second on Danish, third on English, first on Finnish, second on French, fifth on German, second on Italian, and third on Swedish. These are huge gains for a model of its size, often beating much larger models. Community members are now wondering whether real-world usage matches the impressive benchmark numbers. Especially for non-English coding, legal, or creative tasks. Early feedback suggests the multilingual improvements are noticeable even in the smaller variants. That makes Gemma 4 an attractive drop-in replacement for Claude or G P T 4o class models when token cost or latency matters to you. One developer is already prototyping a clever setup using Gemma 4 as a local preprocessor for Claude Code. They built a Bun-based proxy that sits in front of the Claude A P I. Gemma-4-E2B running through Lah-mah.cpp translates Korean to English, prunes irrelevant context, and optionally performs initial reasoning before the expensive call. The proxy caches results in SQLite using WAL mode. Intel Mac users are particularly interested in real tokens-per-second numbers for Lah-mah.cpp on non-Apple Silicon hardware. The open question is whether pre-supplying that reasoning actually reduces the paid model's token usage or whether Claude simply re-does the work internally. On the agent side, there is growing excitement around using coding agents as general-purpose agents. One Reddit user shared how they took a Pi coding-agent S D K with read, write, and bash tools and used it to handle document knowledge bases. They performed structured extraction from over 100 PDFs and ran database benchmarking. All without any vector database. 
Traditional pipelines collapse into agent loops. Retrieval-augmented generation becomes read index, choose files, open them. ETL becomes write script, run it, inspect output, retry if needed. The pattern has already scaled to roughly 600 documents. The main uncertainties are what breaks first at larger scale: cost, latency, reliability, or context management. The author open-sourced the code so others can inspect and build on it. Another fascinating project is VisionClaw. It turns Meta Ray-Ban smart glasses into an always-on egocentric agent. The system continuously perceives the world and lets users initiate speech-driven tasks. You can add objects to an Amazon cart, generate notes from documents, create calendar events from posters, or control I o T devices. A controlled lab study with 12 participants and a longitudinal deployment with five users showed faster task completion and reduced interaction overhead. The comparison was against both non-agentic and non-always-on baselines. Interaction patterns shifted toward opportunistic initiation and delegation rather than manual control. We are also seeing more specialized agent work. A hackathon project created Vyuha, an A I site reliability engineering agent built on GLM-5.1. It functions as a triple-cloud recovery orchestrator covering AWS, Azure, and GCP. The agent uses a FastAPI control plane and an evolutionary SQLite memory that learns from every human-approved failover. It gathers context, proposes fixes in JSON format, and only acts after human approval. A reflection phase writes incident learnings back into memory so future diagnoses improve. There is even a live demo where you can hard-kill nodes and watch the system react in real time. On the practical and community side, there is new guidance about using minimal container images to fortify A I agent security. Drastically shrinking container images reduces the attack surface for deployed agents. 
You get fewer dependencies, smaller software bill of materials, and faster cold starts. These are practical wins for both security and operational efficiency when running autonomous agents at scale. On a more serious note, security researchers identified critical remote code execution vulnerabilities in publicly deployed Flowise instances. The issue carries a CVSS score of 10.0 and is under active exploitation in the wild. More than twelve thousand instances are currently exposed. Teams running Flowise or similar low-code agent builders should immediately isolate internet-facing instances, apply available patches, and follow responsible disclosure practices. This incident underscores the operational security debt that comes with rapidly shipping agent tooling. Okay, let's pop the hood on this hybrid attention approach because it is more subtle than most people realize. Everyone talks about hybrid attention as if it is a simple performance dial you turn on. In practice it is a carefully engineered compromise between local syntax, long-range state, and memory bandwidth. The core insight is to split attention into three stages. A linear projection, a narrow quadratic window that still captures short-range dependencies, and another linear stage. At the same time they inject a GRU-style recurrent state that compresses history. A learned gate then blends the local and recurrent paths so the model can decide, token by token, whether syntax or distant context matters more. By keeping only the recent window in the KV cache while the recurrent state lives in tiny fixed-size registers, they get the 50 times speedup. The quadratic middle layer never sees the full context. The quality-cost tradeoff is revealing. Perplexity barely moves, yet semantic coherence still lags because the recurrent compression is lossy. Above roughly 30 million parameters the architectural gain shrinks rapidly and data scale reasserts dominance. 
The gotcha that bites most teams is assuming the hybrid mechanism will magically fix repetition. It reduces latency dramatically but does not replace the need for high-quality diverse training data and proper temperature scheduling. So when should you actually reach for this versus the alternative. Use hybrid attention when you are latency-bound on edge hardware and your context is mostly local, such as code, chat, or control loops. For deep reasoning or long-document synthesis, classic transformer with efficient attention like FlashAttention-2 or MLA remains the safer default. If you have not tried the open-sourced coding-agent S D K from the LocalLLaMA post, this week is a great time. Replace one manual ETL or retrieval-augmented generation pipeline with a read-write-bash agent loop and see how far you get before context or cost becomes the bottleneck. Download Gemma 4, either the 9 billion or 31 billion version, via Hugging Face. Run it on the European languages you actually use and compare Korean or French token efficiency versus Claude 3.5 or 4 to quantify potential A P I savings. Test the hybrid-attention 25.6 million parameter Rust model, or replicate the linear-quadratic-linear pattern in your own small decoder, on a local 4060 Ti or laptop G P U. Measure real-world tokens per second versus a standard transformer of similar size. Deploy a VisionClaw-style egocentric prompting setup on your own Ray-Ban glasses or any always-on camera feed. Try opportunistic commands like add this to cart or summarize this poster and notice how delegation changes your interaction habits. And if you run Flowise or any low-code agent builder, audit your internet-exposed instances today and consider moving to minimal container images or air-gapped setups. Security debt compounds fast once agents can call tools. On the horizon, keep an eye on continued community exploration of memory topologies in lifelong L L M multi-agent systems. 
More agentic federated learning papers and prototypes are likely following the Agentic-FL paradigm shift. A2A-Agentization benchmarks and tooling should see rapid iteration now that the formal framework is public. And watch for production deployments of Cognitive Fabric Nodes middleware, which are already showing more than 10 percent gains on HotPotQA and MuSiQue. Before we go — tomorrow we will likely have more details on the growing momentum behind agentic memory systems and how they are starting to move from research into early production experiments. That's Models and Agents for today. If you found this useful, share it with someone who's trying to keep up with all these changes, and subscribe so you don't miss tomorrow's update. The A I world moves fast. We'll help you keep up. See you tomorrow. This podcast is curated by Patrick but generated using AI voice synthesis of my voice using ElevenLabs. The primary reason to do this is I unfortunately don't have the time to be consistent with generating all the content and wanted to focus on creating consistent and regular episodes for all the themes that I enjoy and I hope others do as well.

Enjoy this episode? Get Models & Agents in your inbox

New episode alerts — no spam, unsubscribe anytime.