General Knowledge
What Are Tokens?
Tokens are the basic units that AI models use to process text, and can be understood as the smallest units that the model "thinks" with. They are not exactly the same as the characters or words we understand, but rather a special way the model itself divides text.
1. Tokenization of Chinese
One Chinese character is usually encoded as 1-2 tokens
For example:
"你好"
≈ 2-4 tokens
2. Tokenization of English
Common words are usually 1 token
Longer or less common words will be broken down into multiple tokens
For example:
"hello" = 1 token
"indescribable" = 4 tokens
3. Special characters
Spaces and punctuation also consume tokens
A line break is usually 1 token
Tokenizers differ across service providers, and even across different models from the same provider. This information is only for clarifying the concept of a token.
What's a Tokenizer?
A tokenizer is a tool that AI models use to convert text into tokens. It determines how to split input text into the smallest units that the model can understand.
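As a toy illustration of that conversion, here is a minimal sketch. The vocabulary and `tokenize` function are invented for this example; real tokenizers use learned subword vocabularies with tens of thousands of entries.

```python
# A toy tokenizer: maps each known word to an integer ID.
# Real tokenizers (BPE, WordPiece, ...) are far more sophisticated;
# this only illustrates the text -> token IDs conversion.
vocab = {"hello": 0, "world": 1, "!": 2, "<unk>": 3}

def tokenize(text):
    """Split on whitespace and look each piece up in the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("Hello world !"))  # [0, 1, 2]
```

Words missing from the vocabulary map to the `<unk>` (unknown) ID, which is exactly the problem subword tokenizers like BPE were designed to avoid.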
Why Do Different Models Have Different Tokenizers?
1. Different Training Data
Different corpora lead to different optimization directions
Varying degrees of multilingual support
Specialized optimization for specific domains (medical, legal, etc.)
2. Different Tokenization Algorithms
BPE (Byte Pair Encoding) - OpenAI GPT series
WordPiece - Google BERT
SentencePiece - Suitable for multilingual scenarios
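To make the BPE idea concrete, here is a minimal sketch of its core loop: start from single characters and repeatedly merge the most frequent adjacent pair. The helper names are our own, and real BPE trainers work on bytes, learn a full merge table, and handle much larger corpora.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters, as BPE does.
tokens = list("low lower lowest")
for _ in range(3):  # three merge steps; "lo" and then "low" get merged early
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

Because the shared prefix "low" appears in every word, BPE quickly promotes it to a single token, which is why frequent words end up as one token while rare words stay split into pieces.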
3. Different Optimization Goals
Some prioritize compression efficiency
Some prioritize semantic preservation
Some prioritize processing speed
Practical Effect
The same text may have different token counts in different models.
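As a rough illustration, two simplified schemes (character-level vs. word-level, both invented here and much cruder than any real tokenizer) already give very different counts for the same string:

```python
import re

def char_tokens(text):
    """Character-level: every character is one token."""
    return list(text)

def word_tokens(text):
    """Word-level: words and punctuation marks are separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello, world!"
print(len(char_tokens(text)))  # 13
print(len(word_tokens(text)))  # 4
```

Real tokenizers sit between these extremes, which is why the same prompt can cost a different number of tokens on different models.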
What's an Embedding Model?
Basic concept: Embedding models are a technology that transforms high-dimensional discrete data (text, images, etc.) into low-dimensional continuous vectors. This transformation allows machines to better understand and process complex data. Imagine it as simplifying a complex jigsaw puzzle into a simple coordinate point, but this point still retains the key features of the puzzle. In the large model ecosystem, it acts as a "translator," converting human-understandable information into a numerical form that AI can compute.
Working principle: Taking natural language processing as an example, embedding models can map words to specific locations in a vector space. In this space, words with similar semantics will automatically cluster together. For example:
The vectors for "king" and "queen" will be very close
Pet-related words like "cat" and "dog" will cluster near each other
Semantically unrelated words like "car" and "bread" will be farther away
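This closeness is typically measured with cosine similarity. The 3-dimensional vectors below are hand-made for illustration only; a real embedding model produces vectors with hundreds or thousands of dimensions learned from data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT from a real embedding model;
# dimensions loosely read as [royalty, animal, vehicle].
vectors = {
    "king":  [0.90, 0.10, 0.00],
    "queen": [0.85, 0.15, 0.00],
    "cat":   [0.10, 0.90, 0.00],
    "car":   [0.00, 0.10, 0.95],
}

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["car"]))    # close to 0
```

The same calculation underlies semantic search: whichever stored vector has the highest cosine similarity to the query vector is treated as the most relevant match.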
Main Application Scenarios:
Text analytics: Document categorization, sentiment analysis
Recommender system: Personalized content recommendation
Image processing: Similar image retrieval
Search engine: Semantic search optimization
Core Advantages
Dimensionality reduction effect: Simplifies complex data into easily processed vector forms
Semantic preservation: Retaining key semantic information from the original data
Computational efficiency: Significantly improves the training and inference efficiency of machine learning models.
Technical Value: Embedding models are a fundamental component of modern AI systems, providing high-quality data representations for machine learning tasks; they are a key technology driving progress in natural language processing, computer vision, and other fields.
How Embedding Models Work in Knowledge Retrieval
Basic workflow:
Knowledge Base Preprocessing Phase
Split the document into chunks of appropriate size
Convert each chunk into a vector using an embedding model
Store the vectors and original text in a vector database
Query Processing Phase
Convert user questions into vectors
Retrieve similar content in the vector database
Provide the retrieved relevant content as context to the LLM
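The two phases above can be sketched end to end. Everything here is a toy: `embed` is a character-frequency stand-in for a real embedding model, and a plain Python list stands in for a vector database.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a normalized character-frequency vector over a-z.
    A real system would call an embedding model's API here."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values()) or 1
    return [counts.get(chr(ord('a') + i), 0) / total for i in range(26)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Phase 1 (preprocessing): chunk documents and store (vector, text) pairs.
chunks = [
    "Tokens are the basic units AI models use to process text.",
    "Embedding models map text to vectors in a semantic space.",
    "A vector database stores embeddings for fast similarity search.",
]
index = [(embed(c), c) for c in chunks]

# Phase 2 (query): embed the question, rank chunks by similarity,
# and hand the best match to the LLM as context.
question = "What is an embedding model?"
q_vec = embed(question)
best = max(index, key=lambda pair: cosine(q_vec, pair[0]))
print(best[1])
```

In production the list comprehension over `index` is replaced by an approximate nearest-neighbor search in a vector database, which keeps retrieval fast even with millions of chunks.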