# Zhipu GLM-4.6V

Cherry Studio users can now, through the built-in **CherryIN** service, experience it for free **Zhipu GLM-4.6V**— a flagship vision model released by Z.ai (Zhipu AI) in December 2025, featuring a MoE architecture, 128K native multimodal context, and native multimodal tool calling. It is the preferred choice for image-text understanding and multimodal Agent scenarios.

***

## 🚀 What is GLM-4.6V?

GLM-4.6V is the latest-generation vision-language model in Z.ai's GLM-V series, natively supporting unified text + image modeling and further extending context and tool-calling capabilities based on GLM-4.5V.

* Architecture: Mixture-of-Experts (MoE)
* Total parameters: 106B
* Activated parameters: about 12B
* Context length: 128K tokens
* Open-source license: MIT
* Release date: December 8–9, 2025
* Vision encoder: supports multi-resolution images (up to 4K)

The series also includes **GLM-4.6V-Flash (9B)**, designed for local and low-latency scenarios, free for commercial use.

<figure><img src="/files/4443a1d3ef592dd1d2eb3e722a2b4484c7a3e1c4" alt=""><figcaption></figcaption></figure>

***

## 📚 Continuing the multimodal training system of the GLM-V series

GLM-4.6V follows the technical path of GLM-4.1V-Thinking / GLM-4.5V and has been further enhanced in vision and Agent capabilities:

1. **Native multimodal modeling**: joint training of text and images, supporting mixed text-image input
2. **Context expansion**: training context expanded to 128K tokens, able to process about 150 pages of dense documents, 200 pages of slides, or 1 hour of video in a single run
3. **Native multimodal tool calling**: tools can directly receive and return images, handling multimodal outputs via URL based on the extended MCP protocol
4. **Reinforcement learning enhancements**: continues the scalable RL pipeline of the GLM-V series

<figure><img src="/files/a65c7b6797bb19f1558abd5e684c889250f9475e" alt=""><figcaption></figcaption></figure>

***

## ⚙️ Native multimodal, built for real-world scenarios

GLM-4.6V's multimodal capabilities cover both everyday and professional scenarios:

* ✅ **Rich-text content understanding**: long documents, multi-page text, and mixed text-image layouts
* ✅ **Visual web search**: web search and understanding combined with visual input
* ✅ **Frontend recreation**: generate frontend code from design mockups or UI screenshots
* ✅ **Long-context multimodal document analysis**: full PDF / slide deck / video-level input
* ✅ **Chart and table parsing**: structured information extraction

***

## 💡 Native multimodal tool calling and Agent capabilities

One of GLM-4.6V's core upgrades is the **"visual perception → executable action"** closed loop: tool calling natively supports images as both input and output, enabling multimodal Agents to be deployed in real business scenarios.

| Scenario                 | Recommended usage    | Example                                              |
| ------------------------ | -------------------- | ---------------------------------------------------- |
| Simple image-text Q\&A   | Direct conversation  | "What is in this image?"                             |
| Moderately complex tasks | Enable tool calling  | Read the chart and then retrieve the data            |
| Complex multimodal Agent | Multiple tools + MCP | Screenshot → understand → call API → generate report |

***

## 🌟 Efficient MoE, openly available

* ⚡ MoE sparse activation: 106B total parameters, only about 12B activated
* 💰 Through CherryIN in Cherry Studio**free to use**
* 🖥️ The weights, inference code, and MCP tools have been open-sourced on GitHub and Hugging Face under the MIT license

***

## 🧠 Focus on practical capabilities: multimodal assistant

In actual use, GLM-4.6V is suitable for the following scenarios:

* **Document assistant**: read and summarize long documents, scans, and slide decks
* **Data analysis**: identify and interpret charts and dashboard screenshots
* **Frontend and design**: generate or modify frontend code based on UI screenshots
* **Visual search**: web search and information integration combined with images
* **Multimodal Agent**: complete complex tasks using tools such as browsers, code execution, and retrieval

***

## 🧭 How to use it in Cherry Studio?

1. Open Cherry Studio and go to **Settings → Model Services**.
2. Find **CherryIN** the service provider and enable it.
3. In the model list, select **Zhipu GLM-4.6V**.
4. Return to the chat interface and switch the top model selector to **GLM-4.6V**, then you can upload images directly in the conversation for image-text interaction.

> 💡 Tip: The free model quota provided by CherryIN is covered by the Cherry Studio official team, making it suitable for everyday trials and evaluations; for production environments, it is recommended to use the official Z.ai (Zhipu) API.

***

📘 **Try Zhipu GLM-4.6V now and unlock native multimodal and visual Agent capabilities!**


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.cherry-ai.com/docs/en-us/pre-basic/providers/cherryai/mian-fei-ti-yan-zhi-pu-glm-4.6v-shi-jue-qi-jian-duo-mo-tai-moe.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
