# Knowledge Primer

## What are tokens?

Tokens are the basic units AI models use to process text; you can think of them as the smallest units of a model's "thinking." They do not correspond exactly to the characters or words we read, but are the model's own way of segmenting text.

#### 1. Chinese tokenization

* One Chinese character is usually encoded as 1-2 tokens
* For example: `"你好"` ("hello") ≈ 2-4 tokens

#### 2. English tokenization

* Common words are usually 1 token
* Longer or less common words are split into multiple tokens
* For example:
  * `"hello"` = 1 token
  * `"indescribable"` ≈ 4 tokens (split into subword pieces)

#### 3. Special characters

* Spaces, punctuation, etc. also take up tokens
* A line break is usually 1 token

{% hint style="info" %}
Tokenizers from different providers are different, and even tokenizers for different models from the same provider may vary. This knowledge is only intended to clarify the concept of tokens.
{% endhint %}
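As a toy illustration of how subword tokenization works, the sketch below uses a greedy longest-match strategy over an invented vocabulary (real models learn vocabularies of tens of thousands of entries, so actual splits will differ):

```python
# Toy greedy longest-match subword tokenizer with a hypothetical vocabulary,
# illustrating why common words stay whole while rare words get split.
VOCAB = {"hello", "in", "desc", "rib", "able", "world", ",", "!", " "}

def tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # unknown character: emit as-is
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("hello"))          # the common word stays one token
print(tokenize("indescribable"))  # the rare word splits into subwords
```

With this vocabulary, `"hello"` stays whole while `"indescribable"` breaks into `in / desc / rib / able` — the same effect that makes rare words cost more tokens in real models.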

***

## What is a Tokenizer?

A Tokenizer is a tool that converts text into tokens for an AI model. It determines how input text is split into the smallest units the model can understand.

### Why are tokenizers different for different models?

#### 1. Different training data

* Different corpora lead to different optimization directions
* Differences in the level of multilingual support
* Specialized optimization for specific domains (medical, legal, etc.)

#### 2. Different tokenization algorithms

* BPE (Byte Pair Encoding) - OpenAI GPT series
* WordPiece - Google BERT
* SentencePiece - suitable for multilingual scenarios

#### 3. Different optimization goals

* Some focus on compression efficiency
* Some focus on preserving semantics
* Some focus on processing speed
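To make the BPE idea concrete, here is a minimal sketch of a single BPE training step — count adjacent symbol pairs, then merge the most frequent pair everywhere (real tokenizers repeat this for thousands of merges over huge corpora; the tiny corpus here is invented):

```python
from collections import Counter

# One step of Byte Pair Encoding (BPE) training:
# find the most frequent adjacent symbol pair and merge it in every word.
def bpe_merge_step(words: list[list[str]]) -> list[list[str]]:
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return words
    best = max(pairs, key=pairs.get)              # most frequent pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                out.append(w[i] + w[i + 1])       # merge the winning pair
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
print(bpe_merge_step(corpus))   # the frequent pair "l"+"o" merges into "lo"
```

Iterating this step builds up a vocabulary of progressively longer subwords, which is why frequent character sequences end up as single tokens.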

### Practical impact

The same text can yield different token counts in different models (the figures below are illustrative):

```
Input: "Hello, world!"
GPT-3: 4 tokens
BERT: 3 tokens
Claude: 3 tokens
```

***

## What is an Embedding Model?

**Basic concept:** An embedding model converts high-dimensional, discrete data (text, images, etc.) into low-dimensional, continuous vectors. This transformation lets machines understand and process complex data more effectively. Imagine reducing a complex puzzle to a single coordinate point that still preserves the puzzle's key features. In the large-model ecosystem, the embedding model acts as a "translator," converting human-readable information into a numerical form that AI can compute with.

**How it works:** Using natural language processing as an example, an embedding model can map words to specific positions in vector space. In this space, words with similar meanings naturally cluster together. For example:

* The vectors for "king" and "queen" will be very close
* Pet words such as "cat" and "dog" will also cluster together
* Semantically unrelated words, such as "car" and "bread," will be farther apart
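The "closeness" above is usually measured with cosine similarity. The sketch below uses hand-picked 2-D vectors (not output from a real model, which would use hundreds or thousands of dimensions) purely to show how the measurement works:

```python
import math

# Toy 2-D "embeddings" (hand-picked, NOT from a real model) showing how
# cosine similarity can mirror semantic similarity.
embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.85, 0.82],
    "cat":   [0.1, 0.9],
    "dog":   [0.15, 0.85],
    "car":   [0.9, 0.1],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine(embeddings["cat"], embeddings["car"]))     # noticeably lower
```

Related words point in nearly the same direction (similarity near 1.0), while unrelated words point elsewhere.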

**Main application scenarios:**

* Text analysis: document classification, sentiment analysis
* Recommendation systems: personalized content recommendation
* Image processing: similar image retrieval
* Search engines: semantic search optimization

**Core advantages:**

1. Dimensionality reduction: simplifies complex data into easy-to-process vector form
2. Semantic preservation: retains key semantic information in the original data
3. Computational efficiency: significantly improves the training and inference efficiency of machine learning models

**Technical value:** Embedding models are fundamental components of modern AI systems. They provide high-quality data representations for machine learning tasks and are a key technology driving the development of natural language processing, computer vision, and other fields.

***

## How Embedding Models Work in Knowledge Retrieval

**Basic workflow:**

1. **Knowledge base preprocessing stage**

* Split documents into appropriately sized chunks (text blocks)
* Use an embedding model to convert each chunk into a vector
* Store the vectors and original text in a vector database

2. **Query processing stage**

* Convert the user's question into a vector
* Retrieve similar content from the vector database
* Provide the retrieved relevant content to the LLM as context
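The two stages above can be sketched end-to-end. The `embed` function below is a stand-in (a normalized bag-of-characters vector) — a real system would call an embedding model and store vectors in a dedicated vector database rather than a Python list:

```python
import math

# Stand-in embedding: a normalized bag-of-characters vector.
# Real systems replace this with an embedding model API call.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# 1. Preprocessing: chunk documents, embed each chunk, store (vector, text).
chunks = ["Tokens are units of text", "Embeddings map text to vectors",
          "MCP standardizes model context"]
index = [(embed(c), c) for c in chunks]

# 2. Query: embed the question, rank chunks by similarity, return top-k.
def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    scored = sorted(index, key=lambda p: -sum(a * b for a, b in zip(p[0], q)))
    return [text for _, text in scored[:k]]

print(retrieve("what is a token?"))  # the most relevant chunk becomes LLM context
```

The retrieved chunks are then prepended to the user's question as context for the LLM — the core loop of retrieval-augmented generation.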

***

## **What is MCP (Model Context Protocol)?**

MCP is an open protocol designed to provide contextual information to large language models (LLMs) in a standardized way.

* **Analogy:** You can think of MCP as the "USB flash drive" of the AI field. We know that a USB flash drive can store all kinds of files and be used directly after plugging it into a computer. Similarly, various "plugins" that provide context can be plugged into an MCP Server. The LLM can request these plugins from the MCP Server as needed, thereby obtaining richer contextual information and enhancing its own capabilities.
* **Comparison with function tools:** Traditional function tools can also give LLMs external capabilities, but MCP is a higher-level abstraction: function tools tend to be task-specific, while MCP provides a more general, modular mechanism for supplying context.
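For a concrete flavor of the "standardized interface," MCP messages use JSON-RPC 2.0. The sketch below shows what a client request to list a server's tools and the server's reply might look like — the `tools/list` method name follows the MCP specification, but the `get_weather` tool itself is hypothetical:

```python
import json

# Sketch of an MCP exchange over JSON-RPC 2.0. The "tools/list" method name
# follows the MCP specification; the example tool is hypothetical.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_weather",  # hypothetical context plugin
                "description": "Fetch current weather for a city",
                "inputSchema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ]
    },
}

print(json.dumps(request))
```

Because every server answers the same methods with the same message shapes, any MCP-aware client can discover and invoke any server's tools without custom integration code.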

### **Core advantages of MCP**

1. **Standardization:** MCP provides a unified interface and data format, allowing different LLMs and context providers to work together seamlessly.
2. **Modularity:** MCP allows developers to break contextual information into independent modules (plugins), making management and reuse easier.
3. **Flexibility:** LLMs can dynamically choose the context plugins they need based on their own requirements, enabling smarter and more personalized interactions.
4. **Scalability:** MCP's design supports adding more types of context plugins in the future, offering unlimited possibilities for expanding LLM capabilities.

***
