Home
All Shows
Models & Agents Planetterrian Daily Omni View Models & Agents for Beginners Fascinating Frontiers Modern Investing Techniques Tesla Shorts Time Environmental Intelligence Финансы Просто Привет, Русский!
Blogs
All Blog Posts Models & Agents Blog Planetterrian Daily Blog Omni View Blog Models & Agents for Beginners Blog Fascinating Frontiers Blog Modern Investing Techniques Blog Tesla Shorts Time Blog Environmental Intelligence Blog Финансы Просто Blog Привет, Русский! Blog
Models & Agents Models & Agents Blog

Models & Agents — Episode 12

Google unveils Gemini Embedding 2, a multimodal model embedding text, images, video, audio, and docs for advanced RAG systems.

March 11, 2026 Ep 12 5 min read Listen to podcast View summaries

Models & Agents

Date: March 11, 2026

Google unveils Gemini Embedding 2, a multimodal model embedding text, images, video, audio, and docs for advanced RAG systems.

What You Need to Know: Google expanded its Gemini family with Embedding 2, a multimodal upgrade over the text-only gemini-embedding-001, tackling high-dimensional storage and cross-modal retrieval for production RAG pipelines. Meanwhile, agent frameworks like ToolRosetta and Scale-Plan are bridging open-source tools with LLMs for automated task execution, while arXiv papers explore stability in multi-agent systems amid rising enterprise deployments like Manulife's core AI workflows. Pay attention to how these tools enhance agent reliability in heterogeneous teams and multimodal tasks this week, as they lower barriers for developers building scalable, real-world agents.

Top Story

Google has released Gemini Embedding 2, succeeding the text-only gemini-embedding-001 and designed for high-dimensional storage and cross-modal retrieval in production-grade RAG systems. This second-generation model embeds text, images, video, audio, and documents into a shared space, addressing challenges like data compression and unified search that plague multimodal AI developers. Compared to alternatives like OpenAI's text-embedding-ada-002 or Cohere's multilingual embeddings, Gemini Embedding 2 stands out for its native multimodality without needing separate models, potentially reducing latency and integration overhead in hybrid workflows. Practitioners building RAG for enterprise search or content recommendation can now unify disparate data types more efficiently, making it ideal for teams handling diverse media. Keep an eye on integration guides from Google Cloud, as early adopters report smoother cross-modal performance but note higher compute demands during embedding generation. To try it, experiment with the Gemini API for embedding mixed inputs in a simple RAG demo.

Source: marktechpost.com

Model Updates

Google AI Introduces Gemini Embedding 2: MarkTechPost

Gemini Embedding 2 is a multimodal model that embeds text, images, video, audio, and docs, succeeding the text-focused gemini-embedding-001 with better handling of high-dimensional data for RAG systems. It compares favorably to LlamaIndex's multimodal embeddings by offering unified cross-modal retrieval without custom adapters, though it may require more VRAM for large-scale inference. This matters for developers optimizing RAG pipelines, as it enables more accurate, context-rich retrieval in multimedia apps.

Source: marktechpost.com

The Bureaucracy of Speed: cs.MA updates on arXiv.org

This paper introduces Capability Coherence System (CCS), mapping memory consistency models to multi-agent authorization with a state-mapping that bounds unauthorized operations under bounded-staleness semantics. Unlike traditional TTL-based strategies, it scales independently of agent velocity, reducing unauthorized ops by up to 120x in simulations compared to lease methods. It's crucial for AI infrastructure teams dealing with high-velocity agents, offering safer revocation in distributed systems without O(v·TTL) overhead.

Source: arxiv.org

Latent World Models for Automated Driving: cs.MA updates on arXiv.org

The paper proposes a taxonomy for latent world models in driving, covering latent worlds, actions, and generators with priors for geometry and semantics, plus evaluation metrics like closed-loop suites. It synthesizes progress in generative models like those from Wayve or Tesla's VLA systems, highlighting robustness gaps in long-horizon forecasting. Developers in autonomous agents will benefit from its framework for verifiable, resource-efficient models, though it notes open challenges in deliberation costs.

Source: arxiv.org

Agent & Tool Developments

How to Build a Self-Designing Meta-Agent: MarkTechPost

This tutorial outlines building a meta-agent that analyzes tasks, selects tools, configures memory and planners, then instantiates dynamic agents—going beyond static templates in frameworks like LangGraph or CrewAI. The update enables automatic refinement for task-specific agents, improving adaptability without manual coding. Try it by following the step-by-step code in Python to automate agent creation for custom workflows like data analysis.

Source: marktechpost.com

ToolRosetta: Bridging Open-Source Repositories and Large Language Model Agents: cs.MA updates on arXiv.org

ToolRosetta automatically translates open-source repos and APIs into MCP-compatible tools for LLM agents, enabling autonomous task planning and execution with minimal intervention. It improves on manual curation in AutoGen or Semantic Kernel by adding security checks and reducing reproduction effort, though it may struggle with highly proprietary code. Developers can install it via GitHub and test converting a repo into an agent toolchain for tasks like scientific computing.

Source: arxiv.org

Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams: cs.MA updates on arXiv.org

Scale-Plan uses LLMs to generate compact problem representations from natural language, orchestrating agents for decomposition and planning in multi-robot tasks, outperforming pure LLM or hybrid baselines. It's a step up from LangChain's static planning by filtering irrelevancies early, but requires PDDL specs. Experiment with it on AI2-THOR benchmarks to simulate robot coordination in household scenarios.

Source: arxiv.org

Practical & Community

ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization: cs.MA updates on arXiv.org

ChatNeuroSim automates CIM accelerator workflows using LLMs for task scheduling, script generation, and optimization via design space pruning, cutting runtime by up to 0.79x compared to baselines. It solves manual dependency headaches in hardware DSE, ideal for AI infrastructure devs, but assumes familiarity with NeuroSim. Get started by cloning the repo and running it on Swin Transformer optimization tasks.

Source: arxiv.org

LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems: cs.MA updates on arXiv.org

LDP introduces delegate identity cards, progressive payloads, and trust domains for multi-agent comms, implemented in JamJet runtime, reducing latency by 12x via specialization. It enhances protocols like A2A by adding model-level properties, though noisy provenance can degrade quality without verification. Try the plugin for agent routing in local Ollama setups to test semantic payloads.

Source: arxiv.org

FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis: cs.MA updates on arXiv.org

FetalAgents coordinates vision experts for fetal US diagnosis, measurement, and segmentation, plus video summarization into reports, outperforming MLLMs on multi-center benchmarks. It addresses end-to-end clinical workflows beyond static analysis, but needs high-quality US data. Developers in medical AI can explore the framework for similar domain-specific agent orchestration.

Source: arxiv.org

Things to Try This Week

  • Try Gemini Embedding 2 via Google Cloud API for embedding mixed media in a RAG pipeline — it's a game-changer for cross-modal search, unifying text and video better than separate models.
  • Build a self-designing meta-agent from the MarkTechPost tutorial if you're into agent automation — experiment with task-specific configs to see dynamic refinement in action over static LangGraph setups.
  • Test ToolRosetta on an open-source GitHub repo for MCP tool conversion — great for integrating niche libraries into LLM agents without manual standardization.
  • Run ChatNeuroSim on a CIM optimization task like Swin Transformer — check how design space pruning speeds up your hardware explorations compared to manual iterations.
  • Explore LDP in a JamJet setup for multi-agent chats — compare its identity-aware routing to basic A2A for lower latency in specialized delegate pools.

On the Horizon

  • Watch for more multimodal expansions in Gemini family, potentially including fine-tuning APIs for custom embeddings.
  • Expect enterprise case studies from Manulife-like deployments, highlighting agent integration in finance workflows.
  • Anticipate updates to MCP protocols, building on LDP's identity features for broader agent framework adoption.
  • Look out for benchmarks evaluating latent world models in driving, possibly influencing edge deployment standards.

Sources