Knowledge Popularization

What are tokens?

Tokens are the basic units AI models use to process text; you can think of them as the model's smallest "thinking" units. They are not exactly the same as the characters or words we are used to; rather, they are the model's own way of segmenting text.

1. Chinese tokenization

  • A single Chinese character is usually encoded as 1–2 tokens

  • For example:"你好" ≈ 2–4 tokens

2. English tokenization

  • Common words are usually 1 token

  • Longer or uncommon words are broken into multiple tokens

  • For example:

    • "hello" = 1 token

    • "indescribable" = 4 tokens

3. Special characters

  • Spaces, punctuation, etc. also consume tokens

  • A newline is usually 1 token

Different providers use different tokenizers, and even different models from the same provider may use different tokenizers; the figures above are only meant to illustrate the concept of a token.
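
The exact counts depend on the tokenizer, but you can check them yourself. Below is a minimal sketch using OpenAI's open-source tiktoken library (an illustrative choice; other providers ship their own tokenizers and will give different counts):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by many GPT-3.5/GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "indescribable", "你好", "Hello, world!\n"]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} tokens: {ids}")
```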


What is a tokenizer?

A tokenizer is the tool AI models use to convert text into tokens. It determines how input text is split into the smallest units the model can understand.

Why are tokenizers different across models?

1. Different training data

  • Different corpora lead to different optimization directions

  • Differences in multilingual support

  • Specialized optimizations for specific domains (medical, legal, etc.)

2. Different tokenization algorithms

  • BPE (Byte Pair Encoding) - OpenAI GPT series (see the toy sketch after this list)

  • WordPiece - Google BERT

  • SentencePiece - suitable for multilingual scenarios

3. Different optimization goals

  • Some focus on compression efficiency

  • Some focus on semantic preservation

  • Some focus on processing speed
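
To make the BPE algorithm mentioned above concrete, here is a toy sketch of BPE training (a simplified illustration, not any provider's production tokenizer): it repeatedly merges the most frequent adjacent pair of symbols in the corpus until a budget of merges is used up.

```python
from collections import Counter

def byte_pair_encoding(corpus, num_merges):
    # Represent each word as a sequence of symbols (initially single characters).
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge every occurrence of the most frequent pair into one symbol.
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return merges, words

merges, segmented = byte_pair_encoding(["low", "lower", "lowest", "newest"], num_merges=4)
print(merges)     # learned merge rules, e.g. ('l', 'o'), ('lo', 'w'), ...
print(segmented)  # each word segmented into the learned subword units
```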

Practical impact

The same text may have different token counts in different models:
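
For instance, this sketch (again assuming the tiktoken library) tokenizes one sentence with two different OpenAI encodings; tokenizers from other providers would diverge further:

```python
import tiktoken

text = "Tokenization is not the same across models."

for name in ["cl100k_base", "o200k_base"]:  # two real OpenAI encodings
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```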


What is an Embedding Model?

Basic concept: An embedding model is a model that converts high-dimensional discrete data (text, images, etc.) into low-dimensional continuous vectors. This transformation lets machines understand and process complex data more effectively. Imagine compressing a complex puzzle into a single coordinate point that still preserves the puzzle's key features. In the large-model ecosystem, an embedding model acts as a "translator," converting human-readable information into numeric form that AI can compute with.

How it works: Take natural language processing as an example: an embedding model maps words to specific positions in a vector space, where semantically similar words naturally cluster together. For example (a runnable sketch follows this list):

  • The vectors for "king" and "queen" will be very close

  • Pet-related words like "cat" and "dog" will also be close to each other

  • Words that are semantically unrelated like "car" and "bread" will be far apart
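
A minimal sketch of this clustering effect, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both illustrative choices, not the only option):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

words = ["king", "queen", "cat", "dog", "car", "bread"]
vecs = model.encode(words, normalize_embeddings=True)  # unit-length vectors

def cos(a, b):
    # Cosine similarity; for normalized vectors this is just the dot product.
    return float(np.dot(a, b))

print(cos(vecs[0], vecs[1]))  # "king" vs "queen" -> relatively high
print(cos(vecs[2], vecs[3]))  # "cat"  vs "dog"   -> relatively high
print(cos(vecs[4], vecs[5]))  # "car"  vs "bread" -> noticeably lower
```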

Main application scenarios:

  • Text analysis: document classification, sentiment analysis

  • Recommendation systems: personalized content recommendation

  • Image processing: similar image retrieval

  • Search engines: semantic search optimization

Core advantages:

  1. Dimensionality reduction: simplifying complex data into easily processed vector forms

  2. Semantic retention: preserving the key semantic information from the original data

  3. Computational efficiency: significantly improving training and inference efficiency for machine learning models

Technical value: Embedding models are foundational components of modern AI systems, providing high-quality data representations for machine learning tasks and serving as key technology driving advances in natural language processing, computer vision, and other fields.


How embedding models work in knowledge retrieval

Basic workflow (a minimal end-to-end sketch follows these steps):

  1. Knowledge base preprocessing stage

  • Split documents into appropriately sized chunks

  • Use an embedding model to convert each chunk into a vector

  • Store the vectors and the original text in a vector database

  2. Query processing stage

  • Convert the user's question into a vector

  • Retrieve similar content from the vector store

  • Provide the retrieved relevant content as context to the LLM
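
Here is a minimal end-to-end sketch of this workflow. It uses an in-memory NumPy array in place of a real vector database, and sentence-transformers as the embedding model; both are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# 1. Knowledge-base preprocessing: chunk, embed, store.
chunks = [
    "Tokens are the basic units models use to process text.",
    "An embedding model maps text to vectors in a semantic space.",
    "MCP is an open protocol for providing context to LLMs.",
]
index = model.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim) matrix

# 2. Query processing: embed the question, retrieve the nearest chunks.
question = "How does text get converted into vectors?"
q = model.encode([question], normalize_embeddings=True)[0]
scores = index @ q                  # cosine similarity (vectors are normalized)
top = np.argsort(scores)[::-1][:2]  # indices of the 2 most similar chunks

context = "\n".join(chunks[i] for i in top)
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # in a real system, this augmented prompt is sent to the LLM
```

In production, the NumPy array would be replaced by a proper vector database, but the retrieve-then-prompt flow stays the same.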


What is MCP (Model Context Protocol)?

MCP is an open protocol designed to provide context information to large language models (LLMs) in a standardized way.

  • Analogy for understanding: You can think of MCP as a "USB drive" for the AI field. We know a USB drive can store various files and be used directly when plugged into a computer. Similarly, various context-providing "plugins" can be "plugged" into an MCP Server, and an LLM can request these plugins from the MCP Server as needed to obtain richer contextual information and enhance its capabilities.

  • Comparison with Function Tools: Traditional function tools can also give LLMs access to external functionality, but MCP is a higher-level abstraction. Function tools target specific tasks, whereas MCP provides a general, modular mechanism for obtaining context.

Core advantages of MCP

  1. Standardization: MCP provides a unified interface and data format, enabling different LLMs and context providers to collaborate seamlessly.

  2. Modularity: MCP allows developers to decompose context information into independent modules (plugins) for easier management and reuse.

  3. Flexibility: LLMs can dynamically choose the context plugins they need based on their own requirements, achieving smarter, more personalized interactions.

  4. Scalability: MCP's design supports adding more types of context plugins in the future, providing unlimited possibilities for expanding LLM capabilities.
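
To make this concrete, here is a minimal sketch of an MCP server built with the official MCP Python SDK (assuming `pip install mcp`; the `word_count` tool and `config://version` resource are hypothetical examples):

```python
from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

# Create a named MCP server that clients (LLM hosts) can connect to.
mcp = FastMCP("demo-server")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text (hypothetical example tool)."""
    return len(text.split())

@mcp.resource("config://version")
def version() -> str:
    """Expose a piece of static context as an MCP resource."""
    return "1.0.0"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```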

