
Qwen3.6-35B-A3B brings sparse MoE vision-language capabilities with only 3B active parameters and strong agentic coding performance — Episode 32

Qwen3.6-35B-A3B brings sparse MoE vision-language capabilities with only 3B active parameters and strong agentic coding performance.

April 17, 2026 · Ep 32 · 7 min read


What You Need to Know: The Qwen team open-sourced a highly efficient 35B-parameter sparse MoE vision-language model that activates just 3B parameters per token while delivering competitive coding and multimodal reasoning. Community discussions highlight real-world differences between the latest Qwen3.6 and Gemma-4 releases, particularly in coding strength, non-English handling, and tool-use behavior. Several new arXiv papers explore everything from on-device personalized input methods to hierarchical RAG for cyber threat intelligence and long-term memory benchmarks in gamified environments.

Top Story

Qwen Team Open-Sources Qwen3.6-35B-A3B: A Sparse MoE Vision-Language Model with 3B Active Parameters and Agentic Coding Capabilities

The Qwen team released Qwen3.6-35B-A3B, a sparse Mixture-of-Experts vision-language model with 35B total parameters but only 3B active per forward pass. It combines efficient multimodal understanding with explicit agentic coding capabilities, targeting developers who need strong reasoning and tool-use in a memory-friendly package.

Compared to dense 27-31B models, the sparse architecture delivers superior coding performance while keeping inference costs closer to much smaller models. Early community testing shows it excels at Python data analysis, web app tasks, and image recognition, though non-English performance still lags behind its English and Chinese results.
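Sparse MoE routing is what makes the 3B-active figure possible: a router scores every expert for each token, but only the top-k experts actually run. The expert count and logits below are illustrative, not Qwen3.6's published configuration; this is a minimal sketch of the mechanism.

```python
import math

def topk_route(logits, k=2):
    """Pick the top-k experts for one token and softmax-normalize their gates.

    Only the selected experts' weights are loaded and executed, which is why a
    35B-total model can cost roughly as much per token as a 3B dense one.
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

# Toy router output for one token over four experts (real layers use many more):
gates = topk_route([0.1, 2.0, -1.0, 1.5], k=2)
print(gates)  # experts 1 and 3 carry all the gate mass; the rest stay idle
```

The gate weights re-normalize over only the chosen experts, so each token's output is a small weighted mixture rather than a pass through all 35B parameters.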

Practically, this means you can now run capable vision-language agents locally or on modest GPUs that previously required much larger dense models. Builders working on coding assistants, document understanding, or multimodal agents should test it immediately.

Watch for follow-up quantized versions and integration examples in popular inference engines. The release strengthens the open-source ecosystem’s ability to compete with closed multimodal models on agentic tasks.

Source: marktechpost.com

Model Updates

Ternary Bonsai: Top intelligence at 1.58 bits — r/LocalLLaMA

PrismML released the Ternary Bonsai family (8B, 4B, 1.7B) using {-1, 0, +1} ternary weights, achieving roughly 9× smaller memory footprint than FP16 models while outperforming peers in their size class on standard benchmarks. The models are available in FP16 safetensors for immediate Hugging Face compatibility, with MLX 2-bit packed formats already supported.
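PrismML's exact quantization recipe is not detailed in the post, but the absmean-style ternarization used by other 1.58-bit work gives a feel for how {-1, 0, +1} weights arise; a minimal sketch, with the memory arithmetic behind the roughly 9× claim in the comments.

```python
def ternarize(weights):
    """Quantize a weight list to {-1, 0, +1} with one per-tensor scale.

    This uses the absmean threshold popularized by BitNet-style 1.58-bit work;
    Ternary Bonsai's actual scheme may differ, so treat this as illustrative.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

q, scale = ternarize([0.42, -0.07, 1.3, -0.9, 0.02])
print(q, round(scale, 3))

# Memory math behind the ~9x figure: FP16 stores 16 bits per weight, while a
# ternary weight needs log2(3) ~= 1.58 bits, i.e. ~10x fewer in theory and
# roughly 9x in practice once packing overhead and scales are included.
```

At inference time the dequantized weight is just `q * scale`, which is also why ternary matmuls can be implemented with additions and sign flips instead of multiplications.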

Community members are particularly excited about potential future 20-40B ternary models that could challenge today’s leading dense and MoE releases. These represent a practical step forward for edge and memory-constrained deployments where every bit matters.

Source: reddit.com

SAGE Celer 2.6 Technical Card — arXiv

SAGEA introduced Celer 2.6 in 5B, 10B, and 27B sizes with architectural modifications, Inverse Reasoning training to reduce cascading errors, and a native end-to-end vision encoder. The models emphasize strong mathematics, coding, and general intelligence results alongside dedicated optimization for South Asian languages using a custom Devanagari tokenizer.

This release is notable for balancing English reasoning capability with competitive performance in Nepali and Hindi without the typical multilingual degradation. The native multimodal design avoids common adapter pitfalls seen in other vision-language efforts.

Source: arxiv.org

Benchmarking Linguistic Adaptation in Comparable-Sized LLMs — arXiv

Researchers systematically compared Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali using a 10,000-sample dataset under zero-shot and QLoRA fine-tuned conditions. After fine-tuning with rsLoRA (r=32), all models improved dramatically, with Qwen3-8B leading most structural metrics while Llama-3.1-8B showed the largest gains from its weak zero-shot baseline.

The work provides the first rigorous baseline for this under-resourced informal digital language and confirms substantial headroom remains for low-resource adaptation pipelines.
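The rsLoRA variant the paper fine-tunes with is essentially a one-line change to the LoRA scaling factor: alpha/sqrt(r) instead of alpha/r, which keeps the low-rank update from shrinking as rank grows. A minimal sketch; the alpha value is an arbitrary example, not the paper's setting.

```python
import math

def lora_scale(alpha, r, rank_stabilized=False):
    """Classic LoRA multiplies the low-rank update BA by alpha/r; rsLoRA uses
    alpha/sqrt(r) so the update magnitude stays stable at higher ranks."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

# At r=32 (as used in the paper), rsLoRA keeps a much larger effective scale:
alpha = 16  # illustrative value only
print(lora_scale(alpha, 32))                         # classic: 16/32 = 0.5
print(lora_scale(alpha, 32, rank_stabilized=True))   # rsLoRA: 16/sqrt(32)
```

This matters for the r=32 runs above: with classic scaling, doubling the rank halves the update magnitude, which can mute the benefit of the extra capacity that low-resource adaptation needs.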

Source: arxiv.org

Agent & Tool Developments

Looking for help from people who built multi Agents systems [P] — r/MachineLearning

A practitioner who encountered production issues with multi-agent systems built a basic “chaos monkey” framework for agents and is seeking collaborators to mature it into a proper benchmarking and reliability tool. The project aims to stress-test agent robustness before customer deployment.

If you have shipped multi-agent systems at scale, this is a direct opportunity to contribute domain expertise and help establish better evaluation practices for production agent reliability.

Source: reddit.com

Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text — arXiv

H-TechniqueRAG introduces a two-stage hierarchical retrieval system that first identifies MITRE ATT&CK tactics then narrows to specific techniques, cutting the candidate search space by 77.5%. It adds tactic-aware reranking and hierarchy-constrained context organization, delivering a 3.8% F1 improvement over prior TechniqueRAG while reducing inference latency by 62.4% and LLM API calls by 60%.

Security teams working on automated CTI mapping should evaluate this approach for both accuracy and significant cost savings.
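The two-stage idea can be sketched in a few lines: rank tactics first, then score techniques only under the winning tactics, so the expensive technique-level matching never sees most of the candidate space. The mini index and lexical scorer below are toy stand-ins for the paper's learned retrievers, not its actual data or code.

```python
# Hypothetical mini ATT&CK index: tactic -> candidate techniques.
INDEX = {
    "TA0001-initial-access": ["T1566-phishing", "T1190-exploit-public-app"],
    "TA0002-execution": ["T1059-command-scripting", "T1204-user-execution"],
    "TA0005-defense-evasion": ["T1027-obfuscation", "T1070-indicator-removal"],
}

def score(query, doc):
    """Toy word-overlap score standing in for a learned retriever/reranker."""
    q = set(query.lower().split())
    d = set(doc.lower().replace("-", " ").split())
    return len(q & d)

def hierarchical_retrieve(query, top_tactics=1):
    # Stage 1: rank tactics and keep only the winners.
    tactics = sorted(INDEX, key=lambda t: score(query, t), reverse=True)[:top_tactics]
    # Stage 2: score techniques only under those tactics, shrinking the
    # candidate space before the final (most expensive) comparison.
    candidates = [tech for t in tactics for tech in INDEX[t]]
    return max(candidates, key=lambda tech: score(query, tech))

print(hierarchical_retrieve("attacker sent a phishing email for initial access"))
```

With top_tactics=1 here, stage 2 sees two candidates instead of six; the paper's 77.5% search-space reduction comes from the same pruning applied to the full ATT&CK hierarchy.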

Source: arxiv.org

Practical & Community

My thought on Qwen and Gemma — r/LocalLLaMA

A detailed practitioner comparison of Qwen3.6 and Gemma-4 models (27/31B dense and 35/26B MoE) for text review, grammar checking in social sciences, light Python data analysis, and JavaScript/TypeScript web work. Qwen leads on coding and STEM reasoning plus image recognition; Gemma offers better non-English flexibility and more “fuzzy” creative thinking suited to humanities tasks.

Both show political/cultural biases (though improved in recent versions) and occasional hallucinations that the author finds useful for staying mentally engaged. The post is a practical field guide for choosing between the two families by use case.

Source: reddit.com

Which computer should I buy: Mac or custom-built 5090? [D] — r/MachineLearning

A machine learning engineer whose workload is 70% fine-tuning large pretrained models and 30% training from scratch (mostly image/video with occasional LLMs) seeks advice on choosing between an M5-series Mac with MLX or a custom RTX 5090 build. The discussion weighs VRAM needs against Apple’s improving MLX ecosystem for on-device and fine-tuning work.

Useful real-world cost and workflow considerations for practitioners balancing budget, VRAM, and framework maturity.

Source: reddit.com

MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios — arXiv

MemGround introduces a three-tier hierarchical benchmark (Surface State, Temporal Associative, and Reasoning-Based Memory) using rich interactive gamified environments instead of static retrieval tests. It adds multi-dimensional metrics including Memory Fragments Unlocked and Exploration Trajectory Diagrams that reveal current SOTA LLMs and memory agents still struggle with dynamic tracking and long-term evidence reasoning.

Teams building long-context agents or memory-augmented systems now have a more realistic evaluation suite to measure progress beyond simple QA.

Source: arxiv.org

Under the Hood: Post-Transformer Adapters for Logit Correction

Everyone talks about “uncensoring” or “fixing” aligned models as if you simply need better prompts or bigger models. In practice, the suppression of factual log-probabilities on sensitive topics often lives in the final layers even when the knowledge is clearly present in hidden states.

The elegant engineering fix is a tiny post-transformer adapter (roughly 786K parameters, 0.02% of a 4B–14B base) that reads frozen hidden states and learns a corrective mapping. Because it operates before the final LM head, it can restore proper probability rankings without retraining the entire model. Gated (SwiGLU) and simple linear-bottleneck versions perform comparably, suggesting the architecture choice here is less critical than the intervention point.

Applying the adapter at every token position during generation destroys coherence; restricting it to the current prediction position (last-token only) yields fluent, less-censored text. A logit-space adapter after token projection fails entirely, confirming that hidden-state intervention is the correct abstraction level.
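A minimal sketch of that intervention point: a linear-bottleneck adapter that residually corrects only the final position's hidden state before the frozen LM head. Dimensions and weights here are illustrative (the real adapter is trained and totals roughly 786K parameters); this only shows where in the pipeline the correction lands.

```python
import random

random.seed(0)
D, B = 8, 2  # hidden size and bottleneck width (illustrative, not the paper's)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Down- and up-projections of a linear-bottleneck adapter. Randomly initialized
# here; in practice these few weights are the ONLY trained parameters, while
# the transformer and LM head stay frozen.
W_down = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(B)]
W_up = [[random.gauss(0, 0.1) for _ in range(B)] for _ in range(D)]

def adapt(hidden):
    """Residual correction: h' = h + W_up @ relu(W_down @ h)."""
    z = [max(0.0, x) for x in matvec(W_down, hidden)]
    return [h + d for h, d in zip(hidden, matvec(W_up, z))]

# Key detail from the writeup: correct ONLY the last position's hidden state,
# then hand it to the frozen LM head. Adapting every position ruins coherence,
# and patching logits after the head fails entirely.
hidden_states = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
hidden_states[-1] = adapt(hidden_states[-1])
```

The residual form also explains why the adapter is cheap to validate: with zero-initialized up-projection it starts as an identity, so any behavior change is attributable to the learned correction.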

Training uses anchored examples to prevent regression on general knowledge, delivering perfect memorization of training facts and 11–39% generalization to held-out items across scales. The quality gain is most pronounced on smaller models; above ~70B the hidden-state misalignment appears to shrink and the adapter’s relative value drops.

Practical decision framework: Use post-transformer adapters when you need surgical behavior correction on a frozen base model and can afford a few hundred thousand extra parameters. They are dramatically cheaper than full fine-tuning or preference optimization for targeted logit repair. The gotcha that still bites teams is applying the adapter at the wrong position in the generation loop—always validate last-position-only behavior first.

Things to Try This Week

  • Try Qwen3.6-35B-A3B on a coding or multimodal agent task — its 3B active parameters and agentic training make it surprisingly capable on modest hardware.
  • Load one of the Ternary Bonsai models (especially the 8B) in MLX or Hugging Face to test how 1.58-bit weights feel for your specific workload.
  • Experiment with H-TechniqueRAG code for mapping CTI reports to MITRE ATT&CK — the latency and API-cost reductions are immediately measurable.
  • Run the MemGround benchmark scenarios against your current long-context or memory-augmented agent to see where dynamic tracking actually breaks.
  • Compare Qwen3-8B vs Llama-3.1-8B on Romanized Nepali or another low-resource script after a quick QLoRA pass — the adaptation headroom differences are striking.

On the Horizon

  • Expected follow-up releases of larger ternary models (20-40B scale) from PrismML that could reshape local LLM viability.
  • Further optimizations and backend support for Ternary Bonsai beyond current MLX 2-bit packing.
  • Community chaos-monkey frameworks for multi-agent reliability testing likely to mature in the coming weeks.
  • Continued rapid iteration on hierarchical and stateful RAG techniques as papers like Stateful Evidence-Driven RAG and H-TechniqueRAG spark implementations.


Full Episode Transcript
Welcome back to Models and Agents, episode thirty-two. It's April seventeenth, twenty twenty-six. The AI world never sleeps. Here's what you need to know today. Qwen3.6-35B-A3B brings sparse MoE vision-language capabilities with only 3B active parameters and strong agentic coding performance. The Qwen team just open-sourced a model that changes the math on what you can run locally for multimodal agent work. This is the top story today because it delivers serious coding and vision performance while keeping inference costs remarkably low. The model has 35 billion total parameters but activates just 3 billion per forward pass thanks to a sparse Mixture-of-Experts architecture. That design gives it better coding performance than dense models in the 27 to 31 billion parameter range while using memory and compute much closer to smaller models. Early community testing shows it shines at Python data analysis, building web apps, and image recognition tasks. It does have a noticeable gap in non-English performance compared to its English and Chinese results, so keep that in mind for multilingual work. For developers, this means you can now run capable vision-language agents on modest GPUs that previously demanded much larger dense models. If you have been building coding assistants, document understanding tools, or multimodal agents, this release is worth testing immediately. The open-source nature plus expected quantized versions should make integration into popular inference engines straightforward in the coming days. This release really strengthens the open-source ecosystem's ability to compete with closed multimodal models on practical agentic tasks. Now on the model update side, the community is buzzing about Ternary Bonsai from PrismML. They released a family of models at 8 billion, 4 billion, and 1.7 billion parameters using ternary weights of negative one, zero, and positive one.
These models achieve roughly nine times smaller memory footprint than FP16 equivalents while outperforming peers in their size class on standard benchmarks. The models come in FP16 safetensors for immediate Hugging Face compatibility, and MLX 2-bit packed formats are already supported. People are especially excited about the roadmap toward 20 to 40 billion parameter ternary models that could challenge today's leading dense and MoE releases. For anyone doing edge or memory-constrained deployments, this feels like a practical step forward where every bit truly matters. Another notable release is SAGE Celer 2.6, introduced in 5 billion, 10 billion, and 27 billion sizes. The team added architectural modifications, Inverse Reasoning training to reduce cascading errors, and a native end-to-end vision encoder. These models emphasize strong mathematics, coding, and general intelligence alongside dedicated optimization for South Asian languages using a custom Devanagari tokenizer. What stands out is how they balance English reasoning capability with competitive performance in Nepali and Hindi without the usual multilingual degradation. The native multimodal design avoids the common adapter pitfalls that plague many other vision-language efforts. Researchers also published a systematic comparison of linguistic adaptation across comparable-sized large language models. They tested Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali using a ten thousand-sample dataset under both zero-shot and QLoRA fine-tuned conditions. After fine-tuning with rsLoRA at rank 32, all models improved dramatically, with Qwen3-8B leading most structural metrics. Llama-3.1-8B showed the largest gains from its relatively weak zero-shot baseline.
This work gives the first rigorous baseline for this under-resourced informal digital language and confirms there is still substantial headroom for low-resource adaptation pipelines. Shifting to agent and tool developments, a practitioner on the Machine Learning subreddit is looking for collaborators on a chaos monkey framework for multi-agent systems. After hitting production issues with multi-agent setups, they built a basic stress-testing tool and want to mature it into a proper benchmarking and reliability suite. If you have shipped multi-agent systems at scale, this is a direct opportunity to contribute domain expertise and help establish better evaluation practices for production agent reliability. On the research side, the H-TechniqueRAG paper introduces a two-stage hierarchical retrieval system for cyber threat intelligence. It first identifies MITRE ATT&CK tactics, then narrows to specific techniques, cutting the candidate search space by 77.5 percent. The approach adds tactic-aware reranking and hierarchy-constrained context organization. The result is a 3.8 percent F1 improvement over prior TechniqueRAG while reducing inference latency by 62.4 percent and LLM API calls by 60 percent. Security teams working on automated cyber threat intelligence mapping should evaluate this for both accuracy gains and significant cost savings. In the practical and community space, one LocalLLaMA contributor posted a detailed head-to-head of Qwen3.6 versus Gemma-4 models. The comparison covers text review, grammar checking in social sciences, light Python data analysis, and JavaScript/TypeScript web work. Qwen leads on coding, STEM reasoning, and image recognition, while Gemma offers better non-English flexibility and more fuzzy creative thinking suited to humanities tasks.
Both models still show some political and cultural biases, though improved from earlier versions, and the author even finds the occasional hallucination useful for staying mentally engaged. The post serves as a practical field guide for choosing between the two families based on your specific use case. Elsewhere, a machine learning engineer asked the community whether to buy an M5-series Mac with MLX or build a custom RTX 5090 machine. Their workload breaks down to 70 percent fine-tuning large pretrained models and 30 percent training from scratch, mostly image and video with occasional large language models. The discussion delivers useful real-world insights on VRAM needs versus Apple's improving MLX ecosystem for on-device and fine-tuning work. It is a timely look at the budget, memory, and framework maturity tradeoffs many practitioners face right now. Researchers also released MemGround, a long-term memory evaluation kit for large language models in gamified scenarios. It uses a three-tier hierarchical benchmark covering Surface State, Temporal Associative, and Reasoning-Based Memory inside rich interactive environments instead of static retrieval tests. The suite adds multi-dimensional metrics like Memory Fragments Unlocked and Exploration Trajectory Diagrams. Current state-of-the-art models and memory agents still struggle with dynamic tracking and long-term evidence reasoning according to these tests. Teams building long-context agents or memory-augmented systems now have a much more realistic evaluation suite to measure real progress. OK, let's pop the hood on these post-transformer adapters for logit correction that the community has been discussing. Everyone talks about uncensoring or fixing aligned models as if you simply need better prompts or bigger models. In practice, the suppression of factual log-probabilities on sensitive topics often lives in the final layers even when the knowledge is clearly present in hidden states. 
The elegant engineering fix is a tiny post-transformer adapter of roughly seven hundred eighty-six thousand parameters, which is only 0.02 percent of a 4 billion to 14 billion parameter base model. It reads frozen hidden states and learns a corrective mapping before the final language model head. This restores proper probability rankings without retraining the entire model. Both gated SwiGLU and simple linear-bottleneck versions perform comparably, so the exact architecture seems less critical than the intervention point itself. Applying the adapter at every token position during generation destroys coherence. Restricting it to the current prediction position, meaning last-token only, yields fluent and less-censored text. A logit-space adapter after token projection fails entirely, confirming that hidden-state intervention is the correct level of abstraction. Training uses anchored examples to prevent regression on general knowledge. The adapter delivers perfect memorization of training facts and 11 to 39 percent generalization to held-out items across different scales. The quality gain is most pronounced on smaller models. Above roughly 70 billion parameters the hidden-state misalignment appears to shrink and the adapter's relative value drops. So when should you actually reach for this versus the alternatives? Use post-transformer adapters when you need surgical behavior correction on a frozen base model and can afford a few hundred thousand extra parameters. They are dramatically cheaper than full fine-tuning or preference optimization for targeted logit repair. The gotcha that still bites teams is applying the adapter at the wrong position in the generation loop. Always validate last-position-only behavior first. If you have not tried Qwen3.6-35B-A3B on a coding or multimodal agent task yet, this week is the perfect time. Its 3 billion active parameters and agentic training make it surprisingly capable on modest hardware.
Load one of the Ternary Bonsai models, especially the 8 billion version, in MLX or Hugging Face to test how 1.58-bit weights feel for your specific workload. Experiment with the H-TechniqueRAG code for mapping cyber threat intelligence reports to MITRE ATT&CK. The latency and API cost reductions are immediately measurable in practice. Run the MemGround benchmark scenarios against your current long-context or memory-augmented agent to see where dynamic tracking actually breaks. Compare Qwen3-8B versus Llama-3.1-8B on Romanized Nepali or another low-resource script after a quick QLoRA pass. The adaptation headroom differences are striking and instructive. On the horizon, keep an eye on expected follow-up releases of larger ternary models in the 20 to 40 billion scale from PrismML. They could reshape local LLM viability in a big way. Before we go: tomorrow, keep an eye on further community momentum around chaos-monkey frameworks for multi-agent reliability testing. That's Models and Agents for today. If you found this useful, share it with someone who's trying to keep up with all these changes, and subscribe so you don't miss tomorrow's update. The AI world moves fast. We'll help you keep up. See you tomorrow. This podcast is curated by Patrick but generated using AI voice synthesis of my voice via ElevenLabs. The primary reason is that I unfortunately don't have the time to produce all the content consistently myself, so I wanted to focus on delivering regular episodes for all the themes that I enjoy and hope others do as well.
