Zhipu GLM-4.6V
Cherry Studio users can now, through the built-in CherryIN service, experience it for free Zhipu GLM-4.6V— a flagship vision model released by Z.ai (Zhipu AI) in December 2025, featuring a MoE architecture, 128K native multimodal context, and native multimodal tool calling. It is the preferred choice for image-text understanding and multimodal Agent scenarios.
🚀 What is GLM-4.6V?
GLM-4.6V is the latest-generation vision-language model in Z.ai's GLM-V series, natively supporting unified text + image modeling and further extending context and tool-calling capabilities based on GLM-4.5V.
Architecture: Mixture-of-Experts (MoE)
Total parameters: 106B
Activated parameters: about 12B
Context length: 128K tokens
Open-source license: MIT
Release date: December 8–9, 2025
Vision encoder: supports multi-resolution images (up to 4K)
The series also includes GLM-4.6V-Flash (9B), designed for local and low-latency scenarios, free for commercial use.

📚 Continuing the multimodal training system of the GLM-V series
GLM-4.6V follows the technical path of GLM-4.1V-Thinking / GLM-4.5V and has been further enhanced in vision and Agent capabilities:
Native multimodal modeling: joint training of text and images, supporting mixed text-image input
Context expansion: training context expanded to 128K tokens, able to process about 150 pages of dense documents, 200 pages of slides, or 1 hour of video in a single run
Native multimodal tool calling: tools can directly receive and return images, handling multimodal outputs via URL based on the extended MCP protocol
Reinforcement learning enhancements: continues the scalable RL pipeline of the GLM-V series

⚙️ Native multimodal, built for real-world scenarios
GLM-4.6V's multimodal capabilities cover both everyday and professional scenarios:
✅ Rich-text content understanding: long documents, multi-page text, and mixed text-image layouts
✅ Visual web search: web search and understanding combined with visual input
✅ Frontend recreation: generate frontend code from design mockups or UI screenshots
✅ Long-context multimodal document analysis: full PDF / slide deck / video-level input
✅ Chart and table parsing: structured information extraction
💡 Native multimodal tool calling and Agent capabilities
One of GLM-4.6V's core upgrades is the "visual perception → executable action" closed loop: tool calling natively supports images as both input and output, enabling multimodal Agents to be deployed in real business scenarios.
Simple image-text Q&A
Direct conversation
"What is in this image?"
Moderately complex tasks
Enable tool calling
Read the chart and then retrieve the data
Complex multimodal Agent
Multiple tools + MCP
Screenshot → understand → call API → generate report
🌟 Efficient MoE, openly available
⚡ MoE sparse activation: 106B total parameters, only about 12B activated
💰 Through CherryIN in Cherry Studiofree to use
🖥️ The weights, inference code, and MCP tools have been open-sourced on GitHub and Hugging Face under the MIT license
🧠 Focus on practical capabilities: multimodal assistant
In actual use, GLM-4.6V is suitable for the following scenarios:
Document assistant: read and summarize long documents, scans, and slide decks
Data analysis: identify and interpret charts and dashboard screenshots
Frontend and design: generate or modify frontend code based on UI screenshots
Visual search: web search and information integration combined with images
Multimodal Agent: complete complex tasks using tools such as browsers, code execution, and retrieval
🧭 How to use it in Cherry Studio?
Open Cherry Studio and go to Settings → Model Services.
Find CherryIN the service provider and enable it.
In the model list, select Zhipu GLM-4.6V.
Return to the chat interface and switch the top model selector to GLM-4.6V, then you can upload images directly in the conversation for image-text interaction.
💡 Tip: The free model quota provided by CherryIN is covered by the Cherry Studio official team, making it suitable for everyday trials and evaluations; for production environments, it is recommended to use the official Z.ai (Zhipu) API.
📘 Try Zhipu GLM-4.6V now and unlock native multimodal and visual Agent capabilities!
Last updated
Was this helpful?