Knowledge Popularization
What are tokens?
Tokens are the basic units AI models use to process text; you can think of them as the model's smallest units of "thinking". They are not exactly the same as the characters or words we recognize, but rather the model's own special way of segmenting text.
1. Chinese tokenization
A single Chinese character is usually encoded as 1–2 tokens
For example:
"你好"≈ 2–4 tokens
2. English tokenization
Common words are usually 1 token
Longer or uncommon words are broken into multiple tokens
For example:
"hello"= 1 token"indescribable"= 4 tokens
3. Special characters
Spaces, punctuation, etc. also consume tokens
A newline is usually 1 token
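These counts vary by tokenizer, but you can check them yourself. Below is a minimal sketch using OpenAI's open-source tiktoken library (the exact numbers depend on which encoding you load, so treat the figures above as rough guides):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["你好", "hello", "indescribable", "\n"]:
    ids = enc.encode(text)
    print(repr(text), "->", len(ids), "token(s):", ids)
```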
What is a Tokenizer?
A tokenizer is the tool AI models use to convert text into tokens. It determines how input text is split into the smallest units the model can understand.
Why are tokenizers different across models?
1. Different training data
Different corpora lead to different optimization directions
Differences in multilingual support
Specialized optimizations for specific domains (medical, legal, etc.)
2. Different tokenization algorithms
BPE (Byte Pair Encoding) - used by the OpenAI GPT series (a toy sketch of BPE follows this list)
WordPiece - used by Google's BERT
SentencePiece - language-agnostic, well suited to multilingual scenarios
3. Different optimization goals
Some focus on compression efficiency
Some focus on semantic preservation
Some focus on processing speed
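To make the algorithmic differences concrete, here is the toy BPE sketch referenced above: the core training loop repeatedly merges the most frequent adjacent symbol pair. This is a deliberate simplification (the corpus is invented, and real tokenizers operate on bytes, pre-split text with regexes, and train on enormous corpora):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of the pair into one symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with corpus frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"step {step}: merge {best} -> {''.join(best)}")
```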
Practical impact
The same text can produce different token counts in different models, as the comparison below illustrates:
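For instance, comparing two of tiktoken's built-in encodings on the same string (gpt2, used by older GPT-2/3 models, and cl100k_base, used by newer ones) yields different counts; the exact numbers are encoding-specific:

```python
import tiktoken

text = "Tokenization differs across models: 你好, indescribable!"

for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```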
What is an Embedding Model?
Basic concept: An embedding model converts high-dimensional, discrete data (text, images, etc.) into low-dimensional, continuous vectors. This transformation helps machines better understand and process complex data. Imagine compressing a complex puzzle into a single coordinate point that still preserves the puzzle's key features. In the large model ecosystem, the embedding model acts as a "translator", converting human-understandable information into numeric forms that AI can compute.
How it works: Taking natural language processing as an example, an embedding model maps words to specific positions in a vector space. In this space, semantically similar words naturally cluster together. For example:
The vectors for "king" and "queen" will be very close
Pet-related words like "cat" and "dog" will also be close to each other
Words that are semantically unrelated like "car" and "bread" will be far apart
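Here is a minimal sketch of that clustering effect, assuming the open-source sentence-transformers library and its small all-MiniLM-L6-v2 model (any embedding model would work, and the exact similarity scores will differ):

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["king", "queen", "cat", "dog", "car", "bread"]
vectors = model.encode(words)  # one dense vector per word

# Cosine similarity: values closer to 1.0 mean more semantically similar
print("king~queen:", util.cos_sim(vectors[0], vectors[1]).item())
print("cat~dog:   ", util.cos_sim(vectors[2], vectors[3]).item())
print("car~bread: ", util.cos_sim(vectors[4], vectors[5]).item())
```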
Main application scenarios:
Text analysis: document classification, sentiment analysis
Recommendation systems: personalized content recommendation
Image processing: similar image retrieval
Search engines: semantic search optimization
Core advantages:
Dimensionality reduction: simplifying complex data into easily processed vector forms
Semantic retention: preserving the key semantic information from the original data
Computational efficiency: significantly improving training and inference efficiency for machine learning models
Technical value: Embedding models are foundational components of modern AI systems, providing high-quality data representations for machine learning tasks and serving as key technology driving advances in natural language processing, computer vision, and other fields.
How embedding models work in knowledge retrieval
Basic workflow:
Knowledge base preprocessing stage
Split documents into appropriately sized chunks
Use an embedding model to convert each chunk into a vector
Store the vectors and the original text in a vector database
Query processing stage
Convert the user's question into a vector
Retrieve similar content from the vector store
Provide the retrieved relevant content as context to the LLM
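A minimal sketch of both stages together, again assuming sentence-transformers for the embedding step and a plain in-memory list standing in for the vector database (a production system would use a dedicated store such as FAISS, Milvus, or pgvector):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Preprocessing stage: chunk, embed, store ---
chunks = [
    "Tokens are the basic units AI models use to process text.",
    "An embedding model maps text to dense vectors.",
    "MCP is an open protocol for providing context to LLMs.",
]
chunk_vectors = model.encode(chunks)  # stored alongside the original text

# --- Query stage: embed the question, retrieve the most similar chunk ---
question = "How is text converted into vectors?"
query_vector = model.encode(question)

scores = util.cos_sim(query_vector, chunk_vectors)[0]
context = chunks[int(scores.argmax())]
print("Retrieved context:", context)
# This context would then be prepended to the prompt sent to the LLM.
```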
What is MCP (Model Context Protocol)?
MCP is an open protocol designed to provide context information to large language models (LLMs) in a standardized way.
Analogy for understanding: You can think of MCP as a "USB drive" for the AI field. Just as a USB drive can store various files and be used directly once plugged into a computer, various context-providing "plugins" can be plugged into an MCP Server, and an LLM can request them from the server as needed to obtain richer contextual information and extend its capabilities.
Comparison with Function Tools: Traditional function tools can also provide external functionality to LLMs, but MCP is a higher-level abstraction: function tools are task-specific, whereas MCP provides a more general, modular mechanism for obtaining context.
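As a concrete illustration, here is a minimal sketch of an MCP server exposing one tool and one context resource, assuming the official MCP Python SDK's FastMCP helper; the tool and resource themselves are hypothetical examples:

```python
from mcp.server.fastmcp import FastMCP  # pip install mcp

# Create a named MCP server
mcp = FastMCP("demo-server")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text (hypothetical example tool)."""
    return len(text.split())

@mcp.resource("greeting://{name}")
def greeting(name: str) -> str:
    """A context resource an LLM client can request on demand."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```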
Core advantages of MCP
Standardization: MCP provides a unified interface and data format, enabling different LLMs and context providers to collaborate seamlessly.
Modularity: MCP allows developers to decompose context information into independent modules (plugins) for easier management and reuse.
Flexibility: LLMs can dynamically choose the context plugins they need based on their own requirements, achieving smarter, more personalized interactions.
Scalability: MCP's design supports adding more types of context plugins in the future, providing unlimited possibilities for expanding LLM capabilities.