

Aaron Levie declares the enterprise AI shift from chatbots to agents is now underway, moving beyond the "Chat Era."

April 13, 2026 Ep 30 6 min read


What You Need to Know: Box CEO Aaron Levie says organizations are rapidly moving from simple chat interfaces to autonomous agents that execute real workflows. Today also brings a wave of new technical releases including LG AI Research’s first open-weight vision-language model, MiniMax’s CLI giving agents native multimodal tooling, and an open-source identity platform designed specifically for autonomous agents. The research community dropped a dozen new arXiv papers probing everything from diffusion-model safety to medical reasoning benchmarks, signaling intense activity at the architecture and evaluation layers.

Top Story

Aaron Levie Says Enterprise AI Is Shifting From Chatbots To Agents, Warns 'We're Moving From Chat Era...' - Benzinga

Box CEO Aaron Levie stated that enterprise AI is transitioning from chatbot-style interfaces to full agents capable of executing complex, multi-step work. He argues the “Chat Era” is ending because organizations now demand systems that can act autonomously inside business processes rather than merely answer questions. This matches the surge in agent tooling and frameworks we’re seeing across both startups and big tech. Practitioners building internal tools should start experimenting with agentic patterns now—especially those that combine planning, tool use, and memory—before their competitors treat agents as table stakes. The next 12–18 months will likely separate companies that merely have a chatbot from those running fleets of specialized agents.
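As a concrete (and deliberately toy) illustration of those three ingredients working together, here is a minimal agent loop; the planner and the single `word_count` tool are invented stand-ins for an LLM-backed planner and real tools, not any specific framework's API:

```python
# Minimal agentic loop: plan -> act (tool call) -> remember.
# The planner and the tool registry are toy stand-ins, not a real framework.
from dataclasses import dataclass, field

def word_count(text: str) -> int:
    """Toy 'tool' the agent can invoke."""
    return len(text.split())

TOOLS = {"word_count": word_count}

@dataclass
class Agent:
    memory: list = field(default_factory=list)  # episodic memory of past steps

    def plan(self, goal: str) -> list:
        # A real agent would ask an LLM for a plan; here it is hard-coded.
        return [("word_count", goal)]

    def run(self, goal: str):
        result = None
        for tool_name, arg in self.plan(goal):            # planning
            result = TOOLS[tool_name](arg)                # tool use
            self.memory.append((tool_name, arg, result))  # memory
        return result

agent = Agent()
print(agent.run("draft review send"))  # 3
```

The point of the sketch is the shape, not the tool: once planning, tool dispatch, and memory are separated like this, swapping in an LLM planner or additional tools is an incremental change rather than a rewrite.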

Source: news.google.com

Model Updates

EXAONE 4.5 Technical Report

LG AI Research released EXAONE 4.5, its first open-weight vision-language model built by adding a dedicated visual encoder to the EXAONE 4.0 text backbone. The model was trained with heavy emphasis on curated document-centric corpora, delivering strong gains in document understanding and Korean contextual reasoning while extending context to 256K tokens. It competes well on general benchmarks and outperforms same-scale models on document tasks. Enterprises working with PDFs, forms, or Korean-language content should evaluate it immediately.

Source: arxiv.org

Gemma 4 has a systemic attention failure. Here's the proof.

An independent researcher using a custom tensor-distribution diagnostic found that Gemma 4 26B (A4B, Q8_0 quant) exhibits severe KL-divergence drift in 21 of 29 affected tensors, almost all belonging to attention layers (attn_k, attn_q, attn_v). In contrast, Qwen 3.5 35B A3B showed healthy attention distributions with only a single broken layer. The analysis suggests Gemma 4 shipped with a systemic architectural or training flaw that standard benchmarks missed. Anyone running Gemma 4 in production should test against this diagnostic before trusting outputs on long or complex sequences.
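The researcher's diagnostic script isn't reproduced in the post, but the underlying idea, comparing a layer's weight distribution against a healthy reference via KL divergence, can be sketched as follows (the histogram binning, the synthetic "drifted" layer, and any threshold you'd pick are illustrative assumptions, not the original methodology):

```python
# Sketch of a tensor-distribution drift check: histogram two weight tensors
# over a shared range and measure KL divergence between them.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """KL(p || q) over two discrete distributions (histogram counts)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def tensor_drift(weights: np.ndarray, reference: np.ndarray, bins: int = 64) -> float:
    """Compare a layer's weight histogram against a reference layer's."""
    lo = min(weights.min(), reference.min())
    hi = max(weights.max(), reference.max())
    p, _ = np.histogram(weights, bins=bins, range=(lo, hi))
    q, _ = np.histogram(reference, bins=bins, range=(lo, hi))
    return kl_divergence(p.astype(float), q.astype(float))

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 0.02, 10_000)   # stand-in for a healthy attn tensor
drifted = rng.normal(0.1, 0.08, 10_000)   # shifted & widened: a "broken" layer
print(tensor_drift(healthy, healthy))  # 0.0
print(tensor_drift(drifted, healthy) > 1.0)
```

A per-layer scan like this is cheap enough to run as a post-quantization sanity check, which is exactly the gap the post argues standard benchmarks leave open.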

Source: reddit.com

Medical Reasoning with Large Language Models: A Survey and MR-Bench

A new survey formalizes medical reasoning as iterative abduction-deduction-induction cycles and evaluates training-based and training-free methods under unified conditions. The authors introduce MR-Bench, built from real hospital data, revealing a large gap between exam-style performance and genuine clinical decision accuracy. The work gives practitioners a clearer map of current limitations and a harder benchmark for future medical LLMs.

Source: arxiv.org

Agent & Tool Developments

ZeroID: Open-source identity platform for autonomous AI agents - Help Net Security

ZeroID is a new open-source identity and authentication platform built specifically for autonomous AI agents operating across untrusted environments. It addresses core needs around verifiable identity, authorization, and auditability that generic auth systems ignore when the principal is software rather than a human. Developers building multi-agent systems or agents that cross organizational boundaries can integrate it today to reduce spoofing and permission-creep risks.
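ZeroID's actual API isn't detailed in the article. Purely as an illustration of what verifiable identity for a software principal involves, here is a minimal HMAC-signed, scope-limited agent token; every name and field below is hypothetical and is not ZeroID's schema:

```python
# Illustrative agent-identity token: signed claims with scopes and expiry.
# All names/fields are hypothetical, not ZeroID's actual design.
import hashlib
import hmac
import json
import time

SECRET = b"demo-registry-key"  # in practice, per-agent keys from a registry

def issue_token(agent_id: str, scopes: list, ttl_s: int = 300) -> dict:
    claims = {"sub": agent_id, "scopes": scopes, "exp": time.time() + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "sig": sig}

def verify(token: dict, required_scope: str) -> bool:
    payload = json.dumps(token["claims"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token["sig"]):
        return False                      # spoofed or tampered token
    c = token["claims"]
    return c["exp"] > time.time() and required_scope in c["scopes"]

tok = issue_token("report-agent-01", ["files:read"])
print(verify(tok, "files:read"))   # True
print(verify(tok, "files:write"))  # False: permission creep blocked
```

The two failure modes the sketch guards against, forged identity and scope escalation, are precisely the spoofing and permission-creep risks the article highlights for agents crossing trust boundaries.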

Source: news.google.com

MiniMax Releases MMX-CLI: A Command-Line Interface That Gives AI Agents Native Access to Image, Video, Speech, Music, Vision, and Search

MiniMax open-sourced MMX-CLI, a Node.js command-line tool that exposes its full omni-modal stack (image, video, speech, music, vision, search) to both human developers and AI agents running inside Cursor, Claude Code, or OpenCode. Agents can now call these capabilities natively without custom wrappers. If you are building agents that need to generate or analyze multimodal content, this is one of the cleanest ways to give them those superpowers today.

Source: marktechpost.com

Aethir Claw Enables AI Agents to Execute Creative Workflows - Cryptonews.net

Aethir Claw is a new execution environment that lets AI agents run creative production workflows (design, video, music, copy) at scale with decentralized compute. It removes many of the orchestration and resource-allocation headaches that previously limited agent creativity. Teams experimenting with generative creative pipelines should test it for cost-effective parallel execution.

Source: news.google.com

Practical & Community

Llamacpp on chromebook 4 gb ram

A developer successfully compiled and ran llama.cpp on a 4 GB RAM Chromebook, achieving 3–4 tokens/sec with Qwen 3.5 0.8B 4-bit. The post proves that extremely low-resource devices can still run capable small models locally. If you’re looking for on-device prototyping or want to validate edge deployment targets, this is an encouraging data point and recipe worth replicating.

Source: reddit.com

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

New research dissects preference optimization (DPO/KTO) by separating generator-level delta (capability gap between chosen/rejected trace producers) from sample-level delta (quality difference within a pair). Larger generator deltas steadily improve out-of-domain reasoning; filtering by sample-level delta yields more data-efficient training. The paper supplies a practical recipe: maximize generator delta when creating pairs and use LLM-as-a-judge filtering to pick the highest-signal examples.
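The filtering half of that recipe is straightforward to sketch. The judge scores below are toy numbers standing in for LLM-as-a-judge outputs, and the 2.0 threshold is an arbitrary illustration rather than a value from the paper:

```python
# Sketch of the filtering step: keep preference pairs with a large
# sample-level delta (judge-scored quality gap within the pair).
# Scores are toy stand-ins for LLM-as-a-judge outputs.

pairs = [
    {"chosen": "trace A+", "rejected": "trace A-", "judge_chosen": 9.1, "judge_rejected": 3.0},
    {"chosen": "trace B+", "rejected": "trace B-", "judge_chosen": 7.0, "judge_rejected": 6.8},
    {"chosen": "trace C+", "rejected": "trace C-", "judge_chosen": 8.5, "judge_rejected": 2.2},
]

def sample_delta(pair: dict) -> float:
    """Quality gap within one preference pair."""
    return pair["judge_chosen"] - pair["judge_rejected"]

def filter_pairs(pairs: list, min_delta: float = 2.0) -> list:
    """Keep only high-signal pairs for DPO/KTO training."""
    return [p for p in pairs if sample_delta(p) >= min_delta]

kept = filter_pairs(pairs)
print(len(kept))  # 2: the near-tie pair B is dropped
```

Pair B, where the judge can barely distinguish chosen from rejected, is exactly the low-signal example the paper's findings suggest you should drop to get more data-efficient training.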

Source: arxiv.org

Under the Hood: Diffusion Language Model Safety

Everyone talks about diffusion LLMs (dLLMs) as though safety is baked into the denoising schedule the same way alignment is baked into RLHF. In practice, their safety alignment rests on one extremely fragile architectural assumption: that the denoising process is strictly monotonic and that once a token is committed it is never revisited.

The core mechanism works like this. A dLLM starts with a fully masked sequence and iteratively unmasks or refines tokens over a fixed number of steps (commonly 64). Safety-tuned versions are trained so that harmful prompts trigger refusal tokens (e.g., “I cannot assist…”) within the first 8–16 steps. Because the schedule treats those early commitments as permanent, the model never reconsiders them. Researchers showed that simply re-masking the refusal tokens and injecting a short affirmative prefix is enough to break safety in two trivial steps—no gradients, no search—achieving a 76–82% attack success rate on HarmBench against current dLLM instruct models.
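The commit-once dynamic, and why re-masking defeats it, shows up even in a toy simulation; string tokens and a hard-coded "refusal" stand in here for real learned denoising:

```python
# Toy simulation of the irreversible-commitment schedule described above.
# Tokens are strings; a real dLLM unmasks via learned denoising steps.

MASK = "<mask>"

def denoise_step(seq, committed, propose):
    """Unmask one position; committed positions are never revisited."""
    for i, tok in enumerate(seq):
        if tok == MASK and i not in committed:
            seq[i] = propose(i)
            committed.add(i)           # irreversible commitment
            return

def generate(prompt_is_harmful, length=6):
    seq = [MASK] * length
    committed = set()
    # Safety tuning: refusal tokens get committed in the earliest steps.
    prefix = ["I", "cannot", "assist"] if prompt_is_harmful else ["Sure", ",", "here"]
    propose = lambda i: prefix[i] if i < len(prefix) else "..."
    for _ in range(length):
        denoise_step(seq, committed, propose)
    return seq

def remask_attack(seq):
    """The two-step exploit: re-mask the refusal, inject an affirmative prefix."""
    seq[:3] = ["Sure", ",", "here"]    # overwrite the "permanent" commitment
    return seq

safe = generate(prompt_is_harmful=True)
print(safe[:3])                 # ['I', 'cannot', 'assist']
print(remask_attack(safe)[:3])  # ['Sure', ',', 'here']
```

Nothing in the schedule ever rechecks the committed prefix, so once the attacker controls those positions the rest of the denoising proceeds as if the model had agreed from the start.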

This exploit is structural, not adversarial sophistication. Adding gradient-based perturbations actually reduced attack success, proving the vulnerability lives in the irreversible-commitment schedule itself. The paper’s suggested defenses—safety-aware unmasking orders, step-conditional prefix detectors, and post-commitment re-verification—illustrate that real robustness will require changing the denoising dynamics rather than just training harder.

Practical engineering guidance: treat current dLLM safety as “good enough for non-adversarial use” but never rely on it for high-stakes deployments. If you are building with diffusion-based generators, budget time to implement at least one of the proposed schedule-hardening techniques or keep a separate verifier model in the loop. The gap between perceived safety and actual architectural shallowness is currently one of the widest in the entire LLM stack.
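A minimal sketch of the verifier-in-the-loop mitigation mentioned above; the keyword stub stands in for a real safety-classifier model, and the blocklist phrases are invented for illustration:

```python
# Verifier-in-the-loop guard: re-check generator output after commitment.
# The keyword check is a stub standing in for a real classifier model.

BLOCKLIST = {"sure, here is how to", "step-by-step exploit"}

def verifier_flags(output: str) -> bool:
    """Stub safety verifier: flag outputs containing blocklisted phrases."""
    text = output.lower()
    return any(phrase in text for phrase in BLOCKLIST)

def guarded_generate(generate_fn, prompt: str) -> str:
    out = generate_fn(prompt)
    if verifier_flags(out):
        return "I cannot assist with that."  # post-commitment re-verification
    return out

unsafe = lambda p: "Sure, here is how to bypass the filter"
print(guarded_generate(unsafe, "..."))  # refusal from the verifier
```

Because the verifier runs after the denoising schedule finishes, it is immune to the re-masking trick: the attacker can rewrite committed tokens, but not the check that happens outside the generator.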

Things to Try This Week

  • Try MMX-CLI inside Cursor or Claude Code to give your agents native image, video, speech, and search tools—far simpler than stitching separate APIs.
  • Evaluate EXAONE 4.5 on your document-heavy workflows; the targeted training on enterprise docs makes it worth benchmarking against Claude-3.5 or GPT-4o for PDF understanding.
  • Run the Gemma 4 tensor-diagnostic pastebin script against any 26B-class model you depend on—catches attention collapse that standard benchmarks miss.
  • Experiment with ZeroID if you are designing multi-agent systems that cross trust boundaries; adding proper agent identity early prevents painful refactoring later.
  • Use the generator-level vs sample-level delta framework from the new preference optimization paper to curate your next DPO dataset—filtering by judged quality can dramatically improve data efficiency.

On the Horizon

  • More vision-language models emphasizing document and long-context understanding are expected from additional labs following LG’s lead.
  • Agent identity and authorization standards will likely consolidate around projects like ZeroID as fleets of autonomous agents become commonplace.
  • Expect rapid follow-up work on diffusion LLM defenses now that the re-masking vulnerability is public.
  • Medical-reasoning benchmarks such as MR-Bench should drive more clinically grounded evaluations instead of exam-style leaderboards in the coming months.


Full Episode Transcript
What's up — welcome to Models and Agents, episode thirty, for April thirteenth, twenty twenty-six. Your daily A I briefing. New week in A I. And if last week was anything to go by, buckle up. Let's get into it. Aaron Levie declares the enterprise A I shift from chatbots to agents is now underway, moving beyond the "Chat Era." What you need to know today is that organizations are rapidly moving from simple chat interfaces to autonomous agents that can actually execute real business workflows. We are also seeing a wave of new technical releases including LG A I Research’s first open-weight vision-language model, Mini Max’s CLI giving agents native multimodal tooling, and an open-source identity platform designed specifically for autonomous agents. The research community dropped a dozen new arXiv papers probing everything from diffusion-model safety to medical reasoning benchmarks. It is a clear signal of intense activity happening at both the architecture and evaluation layers. Aaron Levie, the C E O of Box, says enterprise A I is transitioning from chatbot-style interfaces to full agents capable of executing complex, multi-step work. He argues the “Chat Era” is ending because organizations now demand systems that can act autonomously inside business processes rather than merely answer questions. This matches the surge in agent tooling and frameworks we are seeing across both startups and big tech. Practitioners building internal tools should start experimenting with agentic patterns now, especially those that combine planning, tool use, and memory. If you have been treating A I as a question-answering layer, the next twelve to eighteen months will likely separate companies that merely have a chatbot from those running fleets of specialized agents. The practical takeaway is clear. Start prototyping agentic workflows that can handle end-to-end processes before your competitors treat agents as table stakes. 
LG A I Research released EXAONE 4.5, its first open-weight vision-language model. They built it by adding a dedicated visual encoder to the EXAONE 4.0 text backbone. The model was trained with heavy emphasis on curated document-centric corpora. That focus delivers strong gains in document understanding and Korean contextual reasoning while extending context to two hundred fifty six thousand tokens. It competes well on general benchmarks and outperforms same-scale models on document tasks. Enterprises working with PDFs, forms, or Korean-language content should evaluate it immediately. An independent researcher using a custom tensor-distribution diagnostic found that Gemma 4 26B exhibits severe issues. The model shows KL-divergence drift in twenty-one of twenty-nine affected tensors, almost all belonging to attention layers. In contrast, Chwen 3.5 35B showed healthy attention distributions with only a single broken layer. The analysis suggests Gemma 4 shipped with a systemic architectural or training flaw that standard benchmarks missed. Anyone running Gemma 4 in production should test against this diagnostic before trusting outputs on long or complex sequences. A new survey formalizes medical reasoning as iterative abduction, deduction, and induction cycles. The authors evaluate both training-based and training-free methods under unified conditions. They introduce MR-Bench, built from real hospital data. It reveals a large gap between exam-style performance and genuine clinical decision accuracy. The work gives practitioners a clearer map of current limitations and a harder benchmark for future medical large language models. ZeroID is a new open-source identity and authentication platform built specifically for autonomous A I agents operating across untrusted environments. It addresses core needs around verifiable identity, authorization, and auditability that generic auth systems ignore when the principal is software rather than a human. 
Developers building multi-agent systems or agents that cross organizational boundaries can integrate it today. Doing so should reduce spoofing and permission-creep risks right from the start. Mini Max open-sourced MMX-CLI, a Node.js command-line tool. It exposes their full omni-modal stack including image, video, speech, music, vision, and search. Both human developers and A I agents running inside Cursor, Claude Code, or OpenCode can now call these capabilities natively without custom wrappers. If you are building agents that need to generate or analyze multimodal content, this is one of the cleanest ways to give them those superpowers today. Aethir Claw is a new execution environment that lets A I agents run creative production workflows at scale with decentralized compute. It removes many of the orchestration and resource-allocation headaches that previously limited agent creativity. Teams experimenting with generative creative pipelines for design, video, music, or copy should test it for cost-effective parallel execution. A developer successfully compiled and ran Lah-mah.cpp on a four gigabyte RAM Chromebook. They achieved three to four tokens per second with Chwen 3.5 0.8B quantized to four-bit. The post proves that extremely low-resource devices can still run capable small models locally. If you are looking for on-device prototyping or want to validate edge deployment targets, this is an encouraging data point and recipe worth replicating. New research dissects preference optimization methods like DPO and KTO. It separates generator-level delta, which is the capability gap between chosen and rejected trace producers, from sample-level delta, which is the quality difference within a pair. Larger generator deltas steadily improve out-of-domain reasoning. Filtering by sample-level delta yields more data-efficient training. The paper supplies a practical recipe. 
Maximize generator delta when creating pairs and use an L L M as-a-judge to pick the highest-signal examples. OK, let's pop the hood on diffusion language model safety because this one is important. Everyone talks about diffusion L L M's as though safety is baked into the denoising schedule the same way alignment is baked into reinforcement learning from human feedback. In practice their safety alignment rests on one extremely fragile architectural assumption. The denoising process is assumed to be strictly monotonic and once a token is committed it is never revisited. Here is how the core mechanism works. A diffusion L L M starts with a fully masked sequence and iteratively unmasks or refines tokens over a fixed number of steps, commonly sixty-four. Safety-tuned versions are trained so that harmful prompts trigger refusal tokens within the first eight to sixteen steps. Because the schedule treats those early commitments as permanent, the model never reconsiders them. Researchers showed that simply re-masking the refusal tokens and injecting a short affirmative prefix is enough to break safety in two trivial steps. No gradients, no search, yet they achieved seventy-six to eighty-two percent attack success rate on HarmBench against current diffusion L L M instruct models. This exploit is structural, not a matter of adversarial sophistication. Adding gradient-based perturbations actually reduced attack success, proving the vulnerability lives in the irreversible-commitment schedule itself. The paper suggests defenses like safety-aware unmasking orders, step-conditional prefix detectors, and post-commitment re-verification. Real robustness will require changing the denoising dynamics rather than just training harder. So treat current diffusion L L M safety as good enough for non-adversarial use but never rely on it for high-stakes deployments. 
If you are building with diffusion-based generators, budget time to implement at least one of the proposed schedule-hardening techniques or keep a separate verifier model in the loop. The gap between perceived safety and actual architectural shallowness is currently one of the widest in the entire large language model stack. If you have not tried MMX-CLI inside Cursor or Claude Code yet, this week is a great time because it gives your agents native image, video, speech, and search tools far simpler than stitching separate A P I's. Evaluate EXAONE 4.5 on your document-heavy workflows because the targeted training on enterprise docs makes it worth benchmarking against Claude 3.5 or G P T 4o for PDF understanding. Run the Gemma 4 tensor-diagnostic pastebin script against any 26B-class model you depend on because it catches attention collapse that standard benchmarks miss. Experiment with ZeroID if you are designing multi-agent systems that cross trust boundaries because adding proper agent identity early prevents painful refactoring later. Use the generator-level versus sample-level delta framework from the new preference optimization paper to curate your next DPO dataset because filtering by judged quality can dramatically improve data efficiency. More vision-language models emphasizing document and long-context understanding are expected from additional labs following LG’s lead. Agent identity and authorization standards will likely consolidate around projects like ZeroID as fleets of autonomous agents become commonplace. Expect rapid follow-up work on diffusion L L M defenses now that the re-masking vulnerability is public. Medical-reasoning benchmarks such as MR-Bench should drive more clinically grounded evaluations instead of exam-style leaderboards in the coming months. Before we go, keep an eye on how the research community responds to the diffusion model safety paper because the proposed fixes could reshape how these models are deployed. 
That's Models and Agents for today. If you found this useful, share it with someone who's trying to keep up with all these changes, and subscribe so you don't miss tomorrow's update. The A I world moves fast. We'll help you keep up. See you tomorrow. This podcast is curated by Patrick but generated with AI voice synthesis of my voice via ElevenLabs. The main reason is that I unfortunately don't have the time to produce all the content consistently myself, and I wanted to focus on delivering regular episodes for all the themes I enjoy and hope others do as well.
