phone-arrow-up-rightVoice Features

circle-exclamation

Cherry Studio Voice Features User Guide

I. Overview of Voice Features

Cherry Studio provides three major voice feature modules: TTS (text-to-speech), ASR (automatic speech recognition), and voice calls. These features allow you to communicate naturally with AI through voice, improving the user experience.

  • TTS (text-to-speech): converts AI response text into spoken output

  • ASR (automatic speech recognition): converts your speech into text input

  • Voice calls: combines TTS and ASR to deliver a voice conversation experience similar to ChatGPT

II. TTS (Text-to-Speech) Features

  1. Supported service types

Cherry Studio supports four types of TTS services:

  • OpenAI: uses OpenAI's TTS API, requires an API key

  • Browser TTS: uses the browser's built-in speech synthesis feature, free and requires no configuration

  • SiliconFlow: uses SiliconFlow's TTS service, requires an API key

  • Free Online TTS: uses a free online TTS service, no API key required

  1. Setup method

  2. Go to the Settings page and select the "Voice Features" tab

  3. In the "TTS" sub-tab:

    • Enable the TTS feature (turn on the switch)

    • Select the TTS service type

    • Configure the corresponding parameters according to the selected service type:

      • OpenAI: enter API key, API address, and select voice and model

      • Browser TTS: select a voice

      • SiliconFlow: enter API key, API address, select voice, model, response format, and speech rate

      • Free Online TTS: select voice and output format

  4. Configure TTS filtering options (optional):

    • Filter reasoning process

    • Filter Markdown markup

    • Filter code blocks

  5. Set whether to display the TTS progress bar

  6. Click the "Test TTS" button to test whether the configuration is correct

  7. How to use

  • After enabling TTS, AI responses will automatically be converted into spoken output

  • In the chat interface, a TTS play button will be shown below each AI response

  • Click the play button to play/pause the voice

  • If the TTS progress bar is enabled, playback progress will be shown below the text

  • Long text will automatically be synthesized in segments and played continuously

III. ASR (Automatic Speech Recognition) Features

  1. Supported service types

Cherry Studio supports three types of ASR services:

  • OpenAI: uses OpenAI's Whisper model, requires an API key

  • Browser: uses the browser's built-in speech recognition feature, free and requires no configuration

  • Local server: connects to a local WebSocket server for speech recognition

  1. Setup method

  2. Go to the Settings page and select the "Voice Features" tab

  3. In the "ASR" sub-tab:

    • Enable the ASR feature (turn on the switch)

    • Select the ASR service type

    • Configure the corresponding parameters according to the selected service type:

      • OpenAI: enter API key, API address, and select model

      • Browser: no additional configuration required

      • Local server: you can choose whether to automatically start the ASR server when the app launches

    • Select the speech recognition language (default is Chinese)

  4. Click the "Test ASR" button to test whether the configuration is correct

  5. How to use

  • After enabling ASR, a speech recognition button will appear next to the input box

  • Click the speech recognition button to start recording

  • After speaking, the speech will be converted to text and entered into the input box

  • Click the button again to stop recording

  • Speech recognition supports continuous recognition of multiple sentences using accumulation mode

IV. Voice Call Features

  1. Features

  • Combines TTS and ASR to deliver a voice conversation experience similar to ChatGPT

  • Uses a draggable floating window interface

  • Supports press-and-hold-to-speak mode

  • Supports custom hotkeys

  • Supports window collapsing

  • You can choose a dedicated voice call model

  • Supports custom prompt text

  1. Setup method

  2. Go to the Settings page and select the "Voice Features" tab

  3. In the "Call Features" sub-tab:

    • Enable the voice call feature (turn on the switch)

    • Click the "Select Model" button to choose the AI model for voice calls

    • Customize the voice call prompt in the prompt text box (optional)

    • Click the "Save" button to save the prompt, or click the "Reset" button to restore the default prompt

  4. How to use

  5. In the chat interface, click the voice call button (phone icon) to the right of the input box

  6. The voice call window will open and play a welcome voice

  7. Press and hold the "Press and Hold to Speak" button to start recording (or use the configured hotkey)

  8. Release the button to stop recording and send it to AI for processing

  9. AI generates a response and plays it through TTS

  10. Use the control buttons in the window:

    • Mute/Unmute button: controls TTS output

    • Pause/Resume button: pauses or resumes the conversation

    • Settings button: configure hotkeys

    • Collapse button: collapses the window, leaving only the press-and-hold-to-speak row

  11. Click the close button to end the call

  12. Hotkey settings

  13. In the voice call window, click the Settings button

  14. In the pop-up settings panel, click the Hotkey button

  15. Press the key you want to set (such as the Space key, Shift key, etc.)

  16. Click the "Save" button to save the settings

  17. When using it, hold down the configured hotkey to start recording, and release it to end recording and send

V. Common Issues and Solutions

  1. TTS-related issues

  • Issue: TTS cannot play sound. Solution: Check whether TTS is enabled, and make sure the correct service type is selected and the necessary parameters are configured.

  • Issue: TTS playback quality is poor. Solution: Try switching to a different TTS service type or voice.

  • Issue: An error message is shown during TTS playback. Solution: Check whether the API key is correct and whether the network connection is normal.

  1. ASR-related issues

  • Issue: ASR cannot recognize speech. Solution: Check whether ASR is enabled, and make sure the correct service type is selected and the necessary parameters are configured.

  • Issue: ASR recognition accuracy is low. Solution: Try switching to a different ASR service type, or adjust the microphone position and volume.

  • Issue: ASR server connection failed. Solution: Check whether the local server is running properly, or try restarting the app.

  1. Voice call-related issues

  • Issue: The voice call window cannot be opened. Solution: Check whether the voice call feature is enabled, and ensure that TTS and ASR are configured correctly.

  • Issue: Press-and-hold-to-speak does not respond. Solution: Check whether microphone permission has been granted, or try restarting the voice call.

  • Issue: AI responses have no voice output. Solution: Check whether TTS is enabled and make sure it is not muted.

VI. Advanced Settings and Custom Options

  1. TTS advanced settings

  • Filtering options: you can choose to filter the reasoning process, Markdown markup, and code blocks to make TTS playback smoother

  • Progress bar display: you can choose whether to display the TTS progress bar

  • Custom voices and models: you can add custom voice and model options

  1. ASR advanced settings

  • Auto-start server: you can set whether to automatically start the ASR server when the app launches

  • Language selection: you can choose different speech recognition languages

  1. Voice call advanced settings

  • Custom prompt: you can customize the voice call prompt to guide how AI responds in voice call mode

  • Dedicated model selection: you can choose a dedicated AI model for voice calls, separate from the model used in the current conversation

  • Hotkey customization: you can set custom hotkeys to control recording

VII. Usage Tips

  1. Choose the right TTS service:

    • If you are pursuing high-quality voice, OpenAI or SiliconFlow is recommended

    • If you do not want to configure an API, you can use Browser TTS or Free Online TTS

  2. Choose the right ASR service:

    • If you are pursuing high accuracy, OpenAI is recommended

    • If you do not want to configure an API, you can use the browser's built-in speech recognition

  3. Optimize the voice call experience:

    • Using headphones can prevent TTS output from being captured again by ASR

    • Using it in a quiet environment can improve recognition accuracy

    • Using custom prompts can make AI responses more suitable for voice playback

  4. Adjust settings according to your needs:

    • If you mainly use text communication, you can enable only the TTS feature

    • If you mainly use voice input, you can enable only the ASR feature

    • If you need a complete voice conversation experience, enable the voice call feature

We hope this user guide helps you make full use of Cherry Studio's voice features and enjoy a more natural and convenient AI interaction experience!

Last updated

Was this helpful?