Models & Agents

Date: March 07, 2026

Anthropic's Claude AI discovered over 100 Firefox vulnerabilities that human testing missed for decades.

What You Need to Know: Anthropic's Claude model made headlines by uncovering over 100 security bugs in Firefox, showcasing AI's potential to revolutionize vulnerability detection beyond traditional methods. Other key releases include Microsoft's Phi-4-Reasoning-Vision-15B for multimodal math and GUI tasks, Google's TensorFlow 2.21 with LiteRT for edge inference, and OpenAI's Codex Security for codebase analysis. Pay attention this week to how these tools enhance security scanning, reasoning in video models, and efficient on-device AI deployment—perfect for developers tackling real-world bugs and optimizations.

Top Story

Anthropic's Claude AI has identified over 100 security vulnerabilities in Firefox, including critical bugs overlooked by decades of manual and automated testing. This demonstration leverages Claude's advanced reasoning to analyze codebases at scale, spotting issues like memory leaks and authentication flaws that tools like static analyzers missed. Compared to previous AI-assisted security tools, Claude's context-aware approach generalizes better across complex projects, though it still requires human validation for false positives. Developers and security teams should care as this enables faster, more thorough audits without massive compute overhead. To try it, integrate Claude via Anthropic's API for your own codebase scans; watch for broader integrations in tools like GitHub Copilot. Expect more case studies on AI-driven security as labs like Anthropic push boundaries in safe, aligned deployments.

Source: the-decoder.com

Model Updates

Microsoft Releases Phi-4-Reasoning-Vision-15B: MarkTechPost

Microsoft's Phi-4-Reasoning-Vision-15B is a new 15B-parameter multimodal model optimized for math, science, and GUI understanding, balancing efficiency with strong reasoning on image-text tasks. It outperforms larger models like GPT-4V in compact scenarios by using selective reasoning and lower compute needs, though it lags in general creativity compared to behemoths like Claude 3.5 or Gemini 1.5. This matters for developers building educational tools or UI automation, as it enables edge-friendly apps without sacrificing accuracy.

Source: marktechpost.com

Google Launches TensorFlow 2.21 And LiteRT: MarkTechPost

TensorFlow 2.21 introduces LiteRT as the new production-ready on-device inference engine, replacing TensorFlow Lite with faster GPU performance, NPU acceleration, and seamless PyTorch model deployment for edge devices. It improves inference speed by up to 30% over TFLite on mobile hardware, making it a strong alternative to ONNX Runtime for cost-sensitive apps, but requires updating workflows from older TFLite setups. Practitioners in mobile AI will benefit from easier quantization and lower latency in real-time tasks like object detection.

Source: marktechpost.com

Video AI models hit a reasoning ceiling: The Decoder

A new massive dataset for video reasoning reveals that models like Sora 2 and Veo 3.1 lag far behind humans on tasks like maze navigation and object counting, despite scaling training data 1,000x over prior benchmarks. This highlights a fundamental limitation where more data alone doesn't fix reasoning gaps, contrasting with text models like GPT-4 that scale better but still hallucinate. For video AI builders, this underscores the need for architectural innovations beyond raw scaling to enable reliable agents in dynamic environments.

Source: the-decoder.com

How our open-source AI model SpeciesNet is helping to promote wildlife conservation: AI

Google's open-source SpeciesNet model uses AI to identify animals in photos, aiding wildlife conservation by processing camera trap data efficiently. It builds on foundation models like Gemini for multimodal accuracy, outperforming older classifiers in species detection with lower false positives, though it requires fine-tuning for niche ecosystems. This is valuable for researchers and NGOs streamlining biodiversity monitoring without high costs.

Source: blog.google

Agent & Tool Developments

OpenAI Introduces Codex Security: MarkTechPost

Codex Security is OpenAI's new agent for context-aware vulnerability detection, validation, and patch generation across codebases, available in research preview for ChatGPT Enterprise users. It improves on tools like GitHub's CodeQL by using GPT models for semantic understanding, reducing false positives in large repos, with easy web-based integration. Try it today via the Codex web interface to scan your projects and review AI-suggested fixes.

Source: marktechpost.com

How to scan for vulnerabilities with GitHub Security Lab’s open source AI-powered framework: The GitHub Blog

GitHub's Taskflow Agent is an open-source framework for AI-powered vulnerability scanning, excelling at finding auth bypasses, IDORs, and token leaks using agentic workflows. This update enhances automation over manual tools like Burp Suite, with built-in LLM integration for high-impact bug hunting. Developers can install it from GitHub and run scans on their repos today for quick security audits.

Source: github.blog

Anthropic's new marketplace: The Decoder

Anthropic's Marketplace lets enterprise customers buy third-party tools built on Claude models using existing AI budgets, streamlining access to specialized agents for tasks like data analysis. It expands beyond basic API calls, competing with Hugging Face's Hub by focusing on vetted, Claude-integrated apps, though limited to corporate users. Explore it via Anthropic's platform to integrate custom agents into your workflows.

Source: the-decoder.com

Practical & Community

When language models hallucinate, they leave "spilled energy": The Decoder

Researchers developed a training-free method to detect LLM hallucinations by analyzing computational traces like "spilled energy" in math operations, generalizing better than prior entropy-based approaches. This solves reliability issues in models like Llama 3 or Grok, enabling safer RAG pipelines. Get started by implementing the open-source code from Sapienza University for your inference setups, noting it works best on math-heavy tasks but may need tweaks for creative outputs.

Source: the-decoder.com

Google AI Releases Android Bench: MarkTechPost

Android Bench is Google's new open-source evaluation framework and leaderboard for testing LLMs on Android development tasks, with datasets and harnesses to measure code generation accuracy. It addresses gaps in general benchmarks like HumanEval, providing Android-specific metrics for models like Code Llama. Developers can clone the GitHub repo to benchmark their fine-tuned models, though it requires Android SDK setup and focuses mainly on Java/Kotlin tasks.

Source: marktechpost.com

Things to Try This Week

Try Claude via Anthropic's API to scan your codebase for vulnerabilities — it's a game-changer for spotting hidden bugs faster than traditional tools, especially in legacy projects.
Experiment with Phi-4-Reasoning-Vision-15B on Hugging Face for GUI automation tasks — its compact size makes it ideal for testing math-heavy apps on edge devices without heavy compute.
Run GitHub's Taskflow Agent on a sample repo to hunt for auth bypasses — this open-source tool shows how agents can automate security workflows you might otherwise do manually.
Test the "spilled energy" hallucination detector in your LLM pipelines — it's a quick way to boost reliability in RAG setups compared to older methods.
Benchmark your favorite coding model on Android Bench — compare it against leaders like GPT-4 to see real gaps in mobile dev capabilities.

On the Horizon

Watch for broader rollouts of OpenAI's Codex Security beyond preview, potentially integrating with GitHub for seamless dev-sec-ops.
Expect updates to video reasoning datasets, as researchers hint at architectural fixes for models like Sora to overcome current ceilings.
Anthropic may expand its Marketplace to non-enterprise users, enabling more community-built Claude agents.
Google could announce LiteRT integrations with agent frameworks like AutoGen for on-device autonomous tasks.

Sources

the-decoder.com

marktechpost.com

blog.google

github.blog