Zhipu GLM-4.6V

Cherry Studio users can now, through the built-in CherryIN service, experience it for free Zhipu GLM-4.6V— a flagship vision model released by Z.ai (Zhipu AI) in December 2025, featuring a MoE architecture, 128K native multimodal context, and native multimodal tool calling. It is the preferred choice for image-text understanding and multimodal Agent scenarios.


🚀 What is GLM-4.6V?

GLM-4.6V is the latest-generation vision-language model in Z.ai's GLM-V series, natively supporting unified text + image modeling and further extending context and tool-calling capabilities based on GLM-4.5V.

  • Architecture: Mixture-of-Experts (MoE)

  • Total parameters: 106B

  • Activated parameters: about 12B

  • Context length: 128K tokens

  • Open-source license: MIT

  • Release date: December 8–9, 2025

  • Vision encoder: supports multi-resolution images (up to 4K)

The series also includes GLM-4.6V-Flash (9B), designed for local and low-latency scenarios, free for commercial use.


📚 Continuing the multimodal training system of the GLM-V series

GLM-4.6V follows the technical path of GLM-4.1V-Thinking / GLM-4.5V and has been further enhanced in vision and Agent capabilities:

  1. Native multimodal modeling: joint training of text and images, supporting mixed text-image input

  2. Context expansion: training context expanded to 128K tokens, able to process about 150 pages of dense documents, 200 pages of slides, or 1 hour of video in a single run

  3. Native multimodal tool calling: tools can directly receive and return images, handling multimodal outputs via URL based on the extended MCP protocol

  4. Reinforcement learning enhancements: continues the scalable RL pipeline of the GLM-V series


⚙️ Native multimodal, built for real-world scenarios

GLM-4.6V's multimodal capabilities cover both everyday and professional scenarios:

  • Rich-text content understanding: long documents, multi-page text, and mixed text-image layouts

  • Visual web search: web search and understanding combined with visual input

  • Frontend recreation: generate frontend code from design mockups or UI screenshots

  • Long-context multimodal document analysis: full PDF / slide deck / video-level input

  • Chart and table parsing: structured information extraction


💡 Native multimodal tool calling and Agent capabilities

One of GLM-4.6V's core upgrades is the "visual perception → executable action" closed loop: tool calling natively supports images as both input and output, enabling multimodal Agents to be deployed in real business scenarios.

Scenario
Recommended usage
Example

Simple image-text Q&A

Direct conversation

"What is in this image?"

Moderately complex tasks

Enable tool calling

Read the chart and then retrieve the data

Complex multimodal Agent

Multiple tools + MCP

Screenshot → understand → call API → generate report


🌟 Efficient MoE, openly available

  • ⚡ MoE sparse activation: 106B total parameters, only about 12B activated

  • 💰 Through CherryIN in Cherry Studiofree to use

  • 🖥️ The weights, inference code, and MCP tools have been open-sourced on GitHub and Hugging Face under the MIT license


🧠 Focus on practical capabilities: multimodal assistant

In actual use, GLM-4.6V is suitable for the following scenarios:

  • Document assistant: read and summarize long documents, scans, and slide decks

  • Data analysis: identify and interpret charts and dashboard screenshots

  • Frontend and design: generate or modify frontend code based on UI screenshots

  • Visual search: web search and information integration combined with images

  • Multimodal Agent: complete complex tasks using tools such as browsers, code execution, and retrieval


🧭 How to use it in Cherry Studio?

  1. Open Cherry Studio and go to Settings → Model Services.

  2. Find CherryIN the service provider and enable it.

  3. In the model list, select Zhipu GLM-4.6V.

  4. Return to the chat interface and switch the top model selector to GLM-4.6V, then you can upload images directly in the conversation for image-text interaction.

💡 Tip: The free model quota provided by CherryIN is covered by the Cherry Studio official team, making it suitable for everyday trials and evaluations; for production environments, it is recommended to use the official Z.ai (Zhipu) API.


📘 Try Zhipu GLM-4.6V now and unlock native multimodal and visual Agent capabilities!

Last updated

Was this helpful?