Voice Features
Note: this feature has been shelved because the developer who contributed it stopped maintaining the pull request.
Cherry Studio Voice Features User Guide
I. Overview of Voice Features
Cherry Studio provides three voice features: TTS (text-to-speech), ASR (automatic speech recognition), and voice calls. Together they let you talk to the AI by voice instead of typing and reading.
TTS (text-to-speech): converts AI response text into spoken output
ASR (automatic speech recognition): converts your speech into text input
Voice calls: combines TTS and ASR to deliver a voice conversation experience similar to ChatGPT
II. TTS (Text-to-Speech) Features
Supported service types
Cherry Studio supports four types of TTS services:
OpenAI: uses OpenAI's TTS API, requires an API key
Browser TTS: uses the browser's built-in speech synthesis feature, free and requires no configuration
SiliconFlow: uses SiliconFlow's TTS service, requires an API key
Free Online TTS: uses a free online TTS service, no API key required
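For the OpenAI service type, the sketch below shows roughly what a request to OpenAI's speech endpoint looks like. It assumes the API address points at the API root (for example `https://api.openai.com/v1`) and that the text is sent to `/audio/speech` with the `model`, `input`, and `voice` fields of OpenAI's public API. The interface and function names are hypothetical and are not taken from Cherry Studio's code.

```typescript
// Hypothetical settings shape: the values you enter on the TTS settings page.
interface OpenAiTtsSettings {
  apiKey: string;
  apiAddress: string; // assumed to be the API root, e.g. "https://api.openai.com/v1"
  model: string;      // e.g. "tts-1"
  voice: string;      // e.g. "alloy"
}

interface SpeechRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) the HTTP request for one piece of text.
function buildSpeechRequest(settings: OpenAiTtsSettings, text: string): SpeechRequest {
  // Drop trailing slashes so the path is joined with exactly one "/".
  const base = settings.apiAddress.replace(/\/+$/, "");
  return {
    url: base + "/audio/speech",
    method: "POST",
    headers: {
      Authorization: "Bearer " + settings.apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: settings.model, input: text, voice: settings.voice }),
  };
}
```

The returned object can be passed to any HTTP client; the response body is the audio data. A wrong API key or API address typically surfaces here as an HTTP error.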
Setup method
Go to the Settings page and select the "Voice Features" tab
In the "TTS" sub-tab:
Enable the TTS feature (turn on the switch)
Select the TTS service type
Configure the corresponding parameters according to the selected service type:
OpenAI: enter API key, API address, and select voice and model
Browser TTS: select a voice
SiliconFlow: enter API key, API address, select voice, model, response format, and speech rate
Free Online TTS: select voice and output format
Configure TTS filtering options (optional):
Filter reasoning process
Filter Markdown markup
Filter code blocks
Set whether to display the TTS progress bar
Click the "Test TTS" button to test whether the configuration is correct
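The filtering options in the steps above (reasoning process, Markdown markup, code blocks) can be thought of as a clean-up pass that runs on a response before it is sent to the TTS service. The following is a minimal sketch of such a pass, not Cherry Studio's implementation. The option names are hypothetical, and it assumes the reasoning process is wrapped in `<think>...</think>` tags, which may not match every model.

```typescript
// Hypothetical option names mirroring the three filter switches.
interface TtsFilterOptions {
  filterReasoning: boolean;
  filterMarkdown: boolean;
  filterCodeBlocks: boolean;
}

function filterTextForTts(text: string, options: TtsFilterOptions): string {
  let result = text;

  if (options.filterReasoning) {
    // Assumption: reasoning is delimited by <think>...</think> tags.
    result = result.replace(/<think>[\s\S]*?<\/think>/g, "");
  }

  if (options.filterCodeBlocks) {
    // Remove fenced code blocks (three backticks ... three backticks).
    const fence = "`".repeat(3);
    result = result.replace(new RegExp(fence + "[\\s\\S]*?" + fence, "g"), "");
  }

  if (options.filterMarkdown) {
    result = result
      .replace(/^#{1,6}[ \t]+/gm, "")           // heading markers
      .replace(/^[ \t]*[-*+][ \t]+/gm, "")      // bullet markers
      .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")  // links: keep only the link text
      .replace(/(\*\*|\*|`)/g, "");             // emphasis and inline-code markers
  }

  // Collapse the blank lines left behind by removed blocks.
  return result.replace(/\n{3,}/g, "\n\n").trim();
}
```

Stripping markup this way keeps the synthesized speech from reading out symbols or code, which is why the guide describes filtering as making playback smoother.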
How to use
After enabling TTS, AI responses will automatically be converted into spoken output
In the chat interface, a TTS play button will be shown below each AI response
Click the play button to play or pause the audio
If the TTS progress bar is enabled, playback progress will be shown below the text
Long text will automatically be synthesized in segments and played continuously
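Segmented synthesis of long text can be pictured as splitting the response at sentence boundaries and packing sentences into chunks of limited size, which are then synthesized and played one after another. The sketch below illustrates only the splitting step. The function name, the `maxLength` parameter, and the set of sentence-ending punctuation are assumptions for illustration; the guide does not state how Cherry Studio chooses its segments.

```typescript
// Splits text into segments of at most maxLength characters where possible,
// breaking only after sentence-ending punctuation (Latin and CJK).
// A single sentence longer than maxLength is kept whole as its own segment.
function splitIntoSegments(text: string, maxLength: number): string[] {
  // Each match is one sentence plus its trailing punctuation and whitespace.
  const sentences = text.match(/[^.!?。！？]+[.!?。！？]*\s*/g) || [];
  const segments: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    // Start a new segment when adding this sentence would exceed the limit.
    if (current.length > 0 && current.length + sentence.length > maxLength) {
      segments.push(current.trim());
      current = "";
    }
    current += sentence;
  }

  if (current.trim().length > 0) {
    segments.push(current.trim());
  }
  return segments;
}
```

Playing the segments in order, and starting synthesis of the next segment while the current one plays, is one way to get continuous playback without waiting for the whole text to be synthesized.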
III. ASR (Automatic Speech Recognition) Features
Supported service types
Cherry Studio supports three types of ASR services:
OpenAI: uses OpenAI's Whisper model, requires an API key
Browser: uses the browser's built-in speech recognition feature, free and requires no configuration
Local server: connects to a local WebSocket server for speech recognition
Setup method
Go to the Settings page and select the "Voice Features" tab
In the "ASR" sub-tab:
Enable the ASR feature (turn on the switch)
Select the ASR service type
Configure the corresponding parameters according to the selected service type:
OpenAI: enter API key, API address, and select model
Browser: no additional configuration required
Local server: you can choose whether to automatically start the ASR server when the app launches
Select the speech recognition language (default is Chinese)
Click the "Test ASR" button to test whether the configuration is correct
How to use
After enabling ASR, a speech recognition button will appear next to the input box
Click the speech recognition button to start recording
After speaking, the speech will be converted to text and entered into the input box
Click the button again to stop recording
Speech recognition works in accumulation mode, so multiple sentences spoken in a row are recognized continuously and added to the text
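Accumulation mode can be illustrated with a small helper that keeps finished sentences and shows the sentence currently being spoken after them. This is a sketch under assumptions: it presumes the recognizer reports interim and final results (as the browser's Web Speech API does), and the class and method names are hypothetical.

```typescript
// Accumulates recognized speech across multiple sentences.
class TranscriptAccumulator {
  private finalized: string[] = [];
  private interim = "";

  // Call once per recognition result. Interim results replace each other;
  // a final result is appended to the accumulated text.
  addResult(text: string, isFinal: boolean): void {
    const trimmed = text.trim();
    if (isFinal) {
      if (trimmed.length > 0) {
        this.finalized.push(trimmed);
      }
      this.interim = "";
    } else {
      this.interim = trimmed;
    }
  }

  // Text to show in the input box: finished sentences plus the current interim result.
  // Parts are joined with a space, which suits English; languages written
  // without spaces, such as Chinese, would join without a separator.
  getText(): string {
    return this.finalized
      .concat(this.interim)
      .filter((part) => part.length > 0)
      .join(" ");
  }

  // Clear everything, for example after the message is sent.
  reset(): void {
    this.finalized = [];
    this.interim = "";
  }
}
```

Without accumulation, each new sentence would overwrite the previous one in the input box; keeping the finalized sentences separately is what lets several sentences build up into one message.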
IV. Voice Call Features
Features
Combines TTS and ASR to deliver a voice conversation experience similar to ChatGPT
Uses a draggable floating window interface
Supports press-and-hold-to-speak mode
Supports custom hotkeys
Supports window collapsing
You can choose a dedicated voice call model
Supports custom prompt text
Setup method
Go to the Settings page and select the "Voice Features" tab
In the "Call Features" sub-tab:
Enable the voice call feature (turn on the switch)
Click the "Select Model" button to choose the AI model for voice calls
Customize the voice call prompt in the prompt text box (optional)
Click the "Save" button to save the prompt, or click the "Reset" button to restore the default prompt
How to use
In the chat interface, click the voice call button (phone icon) to the right of the input box
The voice call window will open and play a welcome voice
Press and hold the "Press and Hold to Speak" button to start recording (or use the configured hotkey)
Release the button to stop recording and send the audio to the AI for processing
The AI generates a response, which is played back through TTS
Use the control buttons in the window:
Mute/Unmute button: controls TTS output
Pause/Resume button: pauses or resumes the conversation
Settings button: configure hotkeys
Collapse button: collapses the window, leaving only the press-and-hold-to-speak row
Click the close button to end the call
Hotkey settings
In the voice call window, click the Settings button
In the pop-up settings panel, click the Hotkey button
Press the key you want to set (such as the Space key, Shift key, etc.)
Click the "Save" button to save the settings
During a call, hold down the configured hotkey to start recording, and release it to stop recording and send
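The hold-to-record behavior of the hotkey amounts to a two-state machine: key down starts recording, key up stops it and sends. The sketch below is illustrative only. The class name and callbacks are hypothetical, and it assumes key names in the style of the browser's `KeyboardEvent.key` (for example `" "` for Space and `"Shift"`).

```typescript
// Hypothetical callbacks: what to do when recording starts and when it ends.
interface RecordingCallbacks {
  onStart: () => void;
  onStopAndSend: () => void;
}

class PushToTalkController {
  private hotkey: string;
  private callbacks: RecordingCallbacks;
  private recording = false;

  constructor(hotkey: string, callbacks: RecordingCallbacks) {
    this.hotkey = hotkey;
    this.callbacks = callbacks;
  }

  // Held keys fire repeated key-down events, so ignore them while already recording.
  keyDown(key: string): void {
    if (key !== this.hotkey || this.recording) return;
    this.recording = true;
    this.callbacks.onStart();
  }

  // Releasing the hotkey ends the recording and sends it.
  keyUp(key: string): void {
    if (key !== this.hotkey || !this.recording) return;
    this.recording = false;
    this.callbacks.onStopAndSend();
  }

  isRecording(): boolean {
    return this.recording;
  }
}
```

In a browser or Electron window, `keyDown` and `keyUp` would be called from `keydown` and `keyup` event listeners with `event.key`. Ignoring repeated key-down events matters because the operating system auto-repeats a held key, which would otherwise restart the recording many times per second.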
V. Common Issues and Solutions
TTS-related issues
Issue: TTS cannot play sound. Solution: Check whether TTS is enabled, and make sure the correct service type is selected and the necessary parameters are configured.
Issue: TTS playback quality is poor. Solution: Try switching to a different TTS service type or voice.
Issue: An error message is shown during TTS playback. Solution: Check whether the API key is correct and whether the network connection is normal.
ASR-related issues
Issue: ASR cannot recognize speech. Solution: Check whether ASR is enabled, and make sure the correct service type is selected and the necessary parameters are configured.
Issue: ASR recognition accuracy is low. Solution: Try switching to a different ASR service type, or adjust the microphone position and volume.
Issue: ASR server connection failed. Solution: Check whether the local server is running properly, or try restarting the app.
Voice call-related issues
Issue: The voice call window cannot be opened. Solution: Check whether the voice call feature is enabled, and ensure that TTS and ASR are configured correctly.
Issue: Press-and-hold-to-speak does not respond. Solution: Check whether microphone permission has been granted, or try restarting the voice call.
Issue: AI responses have no voice output. Solution: Check whether TTS is enabled and make sure it is not muted.
VI. Advanced Settings and Custom Options
TTS advanced settings
Filtering options: you can choose to filter the reasoning process, Markdown markup, and code blocks to make TTS playback smoother
Progress bar display: you can choose whether to display the TTS progress bar
Custom voices and models: you can add custom voice and model options
ASR advanced settings
Auto-start server: you can set whether to automatically start the ASR server when the app launches
Language selection: you can choose different speech recognition languages
Voice call advanced settings
Custom prompt: you can customize the voice call prompt to guide how AI responds in voice call mode
Dedicated model selection: you can choose a dedicated AI model for voice calls, separate from the model used in the current conversation
Hotkey customization: you can set custom hotkeys to control recording
VII. Usage Tips
Choose the right TTS service:
For high-quality voice output, OpenAI or SiliconFlow is recommended
If you do not want to configure an API key, use Browser TTS or Free Online TTS
Choose the right ASR service:
For high recognition accuracy, OpenAI is recommended
If you do not want to configure an API key, use the browser's built-in speech recognition
Optimize the voice call experience:
Using headphones can prevent TTS output from being captured again by ASR
Using it in a quiet environment can improve recognition accuracy
Using custom prompts can make AI responses more suitable for voice playback
Adjust settings according to your needs:
If you mainly type your messages and only want responses read aloud, you can enable only the TTS feature
If you mainly want to dictate your messages, you can enable only the ASR feature
If you need a complete voice conversation experience, enable the voice call feature
We hope this user guide helps you make full use of Cherry Studio's voice features and enjoy a more natural and convenient AI interaction experience!