sherpa-onnx

Project Url: k2-fsa/sherpa-onnx
Introduction: Speech-to-text, text-to-speech, speaker diarization, speech enhancement, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, x86_64 servers, and a websocket server/client, with APIs for 12 programming languages.

Supported functions

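| Speech recognition | Speech synthesis |
|--------------------|------------------|
| ✔️ | ✔️ |

| Speaker identification | Speaker diarization | Speaker verification |
|------------------------|---------------------|----------------------|
| ✔️ | ✔️ | ✔️ |

| Spoken Language identification | Audio tagging | Voice activity detection |
|--------------------------------|---------------|--------------------------|
| ✔️ | ✔️ | ✔️ |

| Keyword spotting | Add punctuation | Speech enhancement |
|------------------|-----------------|--------------------|
| ✔️ | ✔️ | ✔️ |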

Supported platforms

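| Architecture | Android | iOS | Windows | macOS | linux | HarmonyOS |
|--------------|---------|-----|---------|-------|-------|-----------|
| x64          | ✔️      |     | ✔️      | ✔️    | ✔️    | ✔️        |
| x86          | ✔️      |     | ✔️      |       |       |           |
| arm64        | ✔️      | ✔️  | ✔️      | ✔️    | ✔️    | ✔️        |
| arm32        | ✔️      |     |         |       | ✔️    | ✔️        |
| riscv64      |         |     |         |       | ✔️    |           |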

Supported programming languages

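| 1. C++ | 2. C | 3. Python | 4. JavaScript |
|--------|------|-----------|---------------|
| ✔️ | ✔️ | ✔️ | ✔️ |

| 5. Java | 6. C# | 7. Kotlin | 8. Swift |
|---------|-------|-----------|----------|
| ✔️ | ✔️ | ✔️ | ✔️ |

| 9. Go | 10. Dart | 11. Rust | 12. Pascal |
|-------|----------|----------|------------|
| ✔️ | ✔️ | ✔️ | ✔️ |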

For Rust support, please see sherpa-rs

It also supports WebAssembly.

Introduction

This repository supports running the following functions locally:

  • Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
  • Text-to-speech (i.e., TTS)
  • Speaker diarization
  • Speaker identification
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • VAD (e.g., silero-vad)
  • Keyword spotting

on the platforms and operating systems listed under "Supported platforms" above, with the following APIs:

  • C++, C, Python, Go, C#
  • Java, Kotlin, JavaScript
  • Swift, Rust
  • Dart, Object Pascal

You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.

| Description | URL |
|-------------|-----|
| Speaker diarization | Click me |
| Speech recognition | Click me |
| Speech recognition with Whisper | Click me |
| Speech synthesis | Click me |
| Generate subtitles | Click me |
| Audio tagging | Click me |
| Spoken language identification with Whisper | Click me |

We also have spaces built using WebAssembly. They are listed below:

| Description | Huggingface space | ModelScope space |
|-------------|-------------------|------------------|
| Voice activity detection with silero-vad | Click me | Address |
| Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address |
| Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address |
| Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address |
| Real-time speech recognition (English) | Click me | Address |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address |
| VAD + speech recognition (English) with Whisper tiny.en | Click me | Address |
| VAD + speech recognition (English) with Moonshine tiny | Click me | Address |
| VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address |
| VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address |
| VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address |
| VAD + speech recognition (Chinese, multiple dialects) with a TeleSpeech-ASR CTC model | Click me | Address |
| VAD + speech recognition (English + Chinese, plus multiple Chinese dialects) with Paraformer-large | Click me | Address |
| VAD + speech recognition (English + Chinese, plus multiple Chinese dialects) with Paraformer-small | Click me | Address |
| Speech synthesis (English) | Click me | Address |
| Speech synthesis (German) | Click me | Address |
| Speaker diarization | Click me | Address |

You can find pre-built Android APKs for this repository in the following table:

| Description | URL | Users in China |
|-------------|-----|----------------|
| Speaker diarization | Address | Click here |
| Streaming speech recognition | Address | Click here |
| Text-to-speech | Address | Click here |
| Voice activity detection (VAD) | Address | Click here |
| VAD + non-streaming speech recognition | Address | Click here |
| Two-pass speech recognition | Address | Click here |
| Audio tagging | Address | Click here |
| Audio tagging (WearOS) | Address | Click here |
| Speaker identification | Address | Click here |
| Spoken language identification | Address | Click here |
| Keyword spotting | Address | Click here |

#### Real-time speech recognition

| Description | URL | Users in China |
|-------------|-----|----------------|
| Streaming speech recognition | Address | Click here |

#### Text-to-speech

| Description | URL | Users in China |
|-------------|-----|----------------|
| Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here |
| Linux (x64) | Address | Click here |
| macOS (x64) | Address | Click here |
| macOS (arm64) | Address | Click here |
| Windows (x64) | Address | Click here |

> Note: You need to build from source for iOS.

#### Generating subtitles

| Description | URL | Users in China |
|-------------|-----|----------------|
| Generate subtitles | Address | Click here |

| Description | URL |
|-------------|-----|
| Speech recognition (speech to text, ASR) | Address |
| Text-to-speech (TTS) | Address |
| VAD | Address |
| Keyword spotting | Address |
| Audio tagging | Address |
| Speaker identification (Speaker ID) | Address |
| Spoken language identification (Language ID) | See multi-lingual Whisper ASR models from Speech recognition |
| Punctuation | Address |
| Speaker segmentation | Address |
| Speech enhancement | Address |

Some pre-trained ASR models (Streaming)

Please see

  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/index.html
  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-paraformer/index.html
  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-ctc/index.html

for more models. The following table lists only SOME of them.

| Name | Supported Languages | Description |
|------|---------------------|-------------|
| sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16 | Chinese, English | See also |
| sherpa-onnx-streaming-zipformer-zh-14M-2023-02-23 | Chinese | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-en-20M-2023-02-17 | English | Suitable for Cortex A7 CPU. See also |
| sherpa-onnx-streaming-zipformer-korean-2024-06-16 | Korean | See also |
| sherpa-onnx-streaming-zipformer-fr-2023-04-14 | French | See also |
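As a rough illustration of how a streaming model such as those in the table is typically driven, the sketch below feeds audio in short chunks and reads a partial result after each one. The method names (`create_stream`, `accept_waveform`, `is_ready`, `decode_stream`, `get_result`, `input_finished`) are assumptions based on the sherpa-onnx Python examples and should be checked against the version you install. `StandInRecognizer` and `StandInStream` are hypothetical placeholders, not part of sherpa-onnx; they are included only so the sketch runs without a model.

```python
from typing import Iterator, List, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most chunk_size samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def decode_streaming(recognizer, samples, sample_rate: int,
                     chunk_seconds: float = 0.1) -> List[str]:
    """Feed audio chunk by chunk and collect the result after each chunk.

    The recognizer and stream method names are assumed, see the note above.
    """
    stream = recognizer.create_stream()
    partials = []
    chunk_size = max(1, int(sample_rate * chunk_seconds))
    for chunk in iter_chunks(samples, chunk_size):
        stream.accept_waveform(sample_rate, chunk)
        # Decode for as long as the stream has enough buffered audio.
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        partials.append(recognizer.get_result(stream))
    # Signal end of input, then flush whatever is still buffered.
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    partials.append(recognizer.get_result(stream))
    return partials


class StandInStream:
    """Hypothetical stream that only counts the samples it receives."""

    def __init__(self):
        self.pending = 0   # samples received but not yet "decoded"
        self.decoded = 0   # samples already "decoded"
        self.finished = False

    def accept_waveform(self, sample_rate, waveform):
        self.pending += len(waveform)

    def input_finished(self):
        self.finished = True


class StandInRecognizer:
    """Hypothetical recognizer whose 'text' is a running sample count."""

    def create_stream(self):
        return StandInStream()

    def is_ready(self, stream):
        return stream.pending > 0

    def decode_stream(self, stream):
        stream.decoded += stream.pending
        stream.pending = 0

    def get_result(self, stream):
        return f"{stream.decoded} samples decoded"


# One quarter second of silence at 16 kHz, fed in 0.1 s chunks.
for partial in decode_streaming(StandInRecognizer(), [0.0] * 4000, 16000):
    print(partial)
```

With a real model, the stand-in would be replaced by a recognizer built from the downloaded model files; the loop itself stays the same, which is what makes partial results available while audio is still arriving.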

Some pre-trained ASR models (Non-Streaming)

Please see

  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-transducer/index.html
  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-paraformer/index.html
  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/offline-ctc/index.html
  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/telespeech/index.html
  • https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/index.html

for more models. The following table lists only SOME of them.

| Name | Supported Languages | Description |
|------|---------------------|-------------|
| Whisper tiny.en | English | See also |
| Moonshine tiny | English | See also |
| sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese, Cantonese, English, Korean, Japanese | Supports multiple Chinese dialects. See also |
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese, English | Also supports multiple Chinese dialects. See also |
| sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 | Japanese | See also |
| sherpa-onnx-nemo-transducer-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-nemo-ctc-giga-am-russian-2024-10-24 | Russian | See also |
| sherpa-onnx-zipformer-ru-2024-09-18 | Russian | See also |
| sherpa-onnx-zipformer-korean-2024-06-24 | Korean | See also |
| sherpa-onnx-zipformer-thai-2024-06-20 | Thai | See also |
| sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04 | Chinese | Supports multiple dialects. See also |
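For contrast with streaming recognition, a non-streaming model sees the whole recording at once. The sketch below reads a 16-bit mono WAV file with the Python standard library and hands all samples to a recognizer in a single call. The recognizer and stream calls (`create_stream`, `accept_waveform`, `decode_stream`, `stream.result.text`) are assumptions based on the sherpa-onnx Python examples and should be verified against the installed version. `StandInRecognizer` is a hypothetical placeholder, not part of sherpa-onnx, so that the sketch runs without a model.

```python
import array
import sys
import wave
from types import SimpleNamespace


def read_wave_mono16(path):
    """Return (samples, sample_rate); samples are floats in [-1, 1)."""
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        sample_rate = f.getframerate()
        frames = f.readframes(f.getnframes())
    pcm = array.array("h")
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    return [s / 32768.0 for s in pcm], sample_rate


def decode_file(recognizer, path):
    """Decode a whole file in one pass (method names are assumed)."""
    samples, sample_rate = read_wave_mono16(path)
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    return stream.result.text


class StandInRecognizer:
    """Hypothetical recognizer whose 'text' is the audio duration."""

    def create_stream(self):
        stream = SimpleNamespace(result=SimpleNamespace(text=""), seconds=0.0)

        def accept_waveform(sample_rate, samples):
            stream.seconds = len(samples) / sample_rate

        stream.accept_waveform = accept_waveform
        return stream

    def decode_stream(self, stream):
        stream.result.text = f"{stream.seconds:.2f} s of audio"
```

Calling `decode_file(StandInRecognizer(), "some.wav")` (the file name is a placeholder) returns the stand-in's duration string; with a real non-streaming recognizer in its place, the same call would return the transcript.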

How to reach us

Please see https://k2-fsa.github.io/sherpa/social-groups.html for the Next-gen Kaldi WeChat and QQ discussion groups.

Projects using sherpa-onnx

Open-LLM-VTuber

Talk to any LLM with hands-free voice interaction, voice interruption, and a Live2D talking face, running locally across platforms.

See also https://github.com/t41372/Open-LLM-VTuber/pull/50

voiceapi

Streaming ASR and TTS based on FastAPI. It shows how to use the ASR and TTS Python APIs with FastAPI.

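TMSpeech, a slacking-off tool for Tencent Meeting (腾讯会议摸鱼工具)

It uses streaming ASR in C# with a graphical user interface.

Video demo in Chinese: [Open source] Real-time captioning software for Windows (a must-have for online classes and meetings)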

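lol 互动助手 (LoL interaction assistant)

It uses the JavaScript API of sherpa-onnx along with Electron.

Video demo in Chinese: It blew up! Xuanshen teaches you to use a typing cheat! A League of Legends tool that truly affects your win rate! The last piece of the League of Legends puzzle! Communicate barrier-free with everyone in the game!

Sherpa-ONNX speech recognition server

A Node.js-based server that provides a RESTful API for speech recognition.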

QSmartAssistant

A modular, low-resource conversational robot/smart speaker whose entire pipeline can run offline.

It uses Qt. Both ASR and TTS are used.

Flutter-EasySpeechRecognition

It extends ./flutter-examples/streaming_asr by downloading models inside the app to reduce the size of the app.

Note: [Team B] Sherpa AI backend also uses sherpa-onnx in a Flutter app.

sherpa-onnx-unity

sherpa-onnx in Unity. See also #1695, #1892, and #1859

xiaozhi-esp32-server

A backend service for xiaozhi-esp32 that helps you quickly build an ESP32 device control server.

See also

KaithemAutomation

Pure Python, GUI-focused home automation/consumer grade SCADA.

It uses TTS from sherpa-onnx. See also ✨ Speak command that uses the new globally configured TTS model.
