sherpa-onnx

Introduction: Speech-to-text, text-to-speech, speaker diarization, speech enhancement, and VAD using next-gen Kaldi with onnxruntime, with no Internet connection required. Supports embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, x86_64 servers, and a websocket server/client, with APIs for 11 programming languages.

Supported functions

Speech recognition	Speech synthesis
✔️	✔️

Speaker identification	Speaker diarization	Speaker verification
✔️	✔️	✔️

Spoken Language identification	Audio tagging	Voice activity detection
✔️	✔️	✔️

Keyword spotting	Add punctuation	Speech enhancement
✔️	✔️	✔️

Supported platforms

Architecture	Android	iOS	Windows	macOS	Linux	HarmonyOS
x64	✔️		✔️	✔️	✔️	✔️
x86	✔️		✔️
arm64	✔️	✔️	✔️	✔️	✔️	✔️
arm32	✔️				✔️	✔️
riscv64					✔️
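
As a rough illustration of how the platform table above can be read, the sketch below transcribes it into a lookup structure. The names `SUPPORT`, `is_supported`, and `architectures_for` are hypothetical helpers introduced for this example only; they are not part of sherpa-onnx, and the data simply mirrors the check marks in the table.

```python
# Support matrix transcribed from the table above (illustrative only;
# not an API provided by sherpa-onnx).
SUPPORT = {
    "x64": {"Android", "Windows", "macOS", "Linux", "HarmonyOS"},
    "x86": {"Android", "Windows"},
    "arm64": {"Android", "iOS", "Windows", "macOS", "Linux", "HarmonyOS"},
    "arm32": {"Android", "Linux", "HarmonyOS"},
    "riscv64": {"Linux"},
}


def is_supported(arch, platform):
    """Return True if the table marks this architecture/platform pair."""
    return platform in SUPPORT.get(arch, set())


def architectures_for(platform):
    """Return the architectures the table marks for a platform, sorted by name."""
    return sorted(arch for arch, platforms in SUPPORT.items() if platform in platforms)
```

For instance, `architectures_for("iOS")` returns only `arm64`, which matches the single check mark in the iOS column.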

Supported programming languages

1. C++	2. C	3. Python	4. JavaScript
✔️	✔️	✔️	✔️

5. Java	6. C#	7. Kotlin	8. Swift
✔️	✔️	✔️	✔️

9. Go	10. Dart	11. Rust	12. Pascal
✔️	✔️	✔️	✔️

For Rust support, please see sherpa-rs.

It also supports WebAssembly.

Introduction

This repository supports running the following functions locally:

Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
Text-to-speech (i.e., TTS)
Speaker diarization
Speaker identification
Speaker verification
Spoken language identification
Audio tagging
VAD (e.g., silero-vad)
Keyword spotting

on the following platforms and operating systems:

x86, x86_64, 32-bit ARM, 64-bit ARM (arm64, aarch64), RISC-V (riscv64), RK NPU
Linux, macOS, Windows, openKylin
Android, WearOS
iOS
HarmonyOS
NodeJS
WebAssembly
NVIDIA Jetson Orin NX (supports running on both CPU and GPU)
NVIDIA Jetson Nano B01 (supports running on both CPU and GPU)
Raspberry Pi
RV1126
LicheePi4A
VisionFive 2
Horizon Sunrise X3 Pi (旭日 X3 派)
AXera-Pi (爱芯派)
etc.

with the following APIs:

C++, C, Python, Go, C#
Java, Kotlin, JavaScript
Swift, Rust
Dart, Object Pascal

Links for Huggingface Spaces

You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.

| Description | URL |
|---|---|
| Speaker diarization | Click me |
| Speech recognition | Click me |
| Speech recognition with Whisper | Click me |
| Speech synthesis | Click me |
| Generate subtitles | Click me |
| Audio tagging | Click me |
| Spoken language identification with Whisper | Click me |

We also have spaces built using WebAssembly. They are listed below:

| Description | Huggingface space | ModelScope space |
|---|---|---|
| Voice activity detection with silero-vad | Click me | Address |
| Real-time speech recognition (Chinese + English) with Zipformer | Click me | Address |
| Real-time speech recognition (Chinese + English) with Paraformer | Click me | Address |
| Real-time speech recognition (Chinese + English + Cantonese) with Paraformer-large | Click me | Address |
| Real-time speech recognition (English) | Click me | Address |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with SenseVoice | Click me | Address |
| VAD + speech recognition (English) with Whisper tiny.en | Click me | Address |
| VAD + speech recognition (English) with Moonshine tiny | Click me | Address |
| VAD + speech recognition (English) with Zipformer trained with GigaSpeech | Click me | Address |
| VAD + speech recognition (Chinese) with Zipformer trained with WenetSpeech | Click me | Address |
| VAD + speech recognition (Japanese) with Zipformer trained with ReazonSpeech | Click me | Address |
| VAD + speech recognition (Thai) with Zipformer trained with GigaSpeech2 | Click me | Address |
| VAD + speech recognition (Chinese, multiple dialects) with a TeleSpeech-ASR CTC model | Click me | Address |
| VAD + speech recognition (English + Chinese, and multiple Chinese dialects) with Paraformer-large | Click me | Address |
| VAD + speech recognition (English + Chinese, and multiple Chinese dialects) with Paraformer-small | Click me | Address |
| VAD + speech recognition (multiple languages and multiple Chinese dialects) with Dolphin-base | Click me | Address |
| Speech synthesis (English) | Click me | Address |
| Speech synthesis (German) | Click me | Address |
| Speaker diarization | Click me | Address |

Links for pre-built Android APKs

You can find pre-built Android APKs for this repository in the following table:

| Description | URL | For users in China |
|---|---|---|
| Speaker diarization | Address | Click here |
| Streaming speech recognition | Address | Click here |
| Text-to-speech | Address | Click here |
| Voice activity detection (VAD) | Address | Click here |
| VAD + non-streaming speech recognition | Address | Click here |
| Two-pass speech recognition | Address | Click here |
| Audio tagging | Address | Click here |
| Audio tagging (WearOS) | Address | Click here |
| Speaker identification | Address | Click here |
| Spoken language identification | Address | Click here |
| Keyword spotting | Address | Click here |

Links for pre-built Flutter APPs

#### Real-time speech recognition

| Description | URL | For users in China |
|---|---|---|
| Streaming speech recognition | Address | Click here |

#### Text-to-speech

| Description | URL | For users in China |
|---|---|---|
| Android (arm64-v8a, armeabi-v7a, x86_64) | Address | Click here |
| Linux (x64) | Address | Click here |
| macOS (x64) | Address | Click here |
| macOS (arm64) | Address | Click here |
| Windows (x64) | Address | Click here |

> Note: You need to build from source for iOS.

Links for pre-built Lazarus APPs

#### Generating subtitles

| Description | URL | For users in China |
|---|---|---|
| Generate subtitles | Address | Click here |

Links for pre-trained models

| Description | URL |
|---|---|
| Speech recognition (speech to text, ASR) | Address |
| Text-to-speech (TTS) | Address |
| VAD | Address |
| Keyword spotting | Address |
| Audio tagging | Address |
| Speaker identification (Speaker ID) | Address |
| Spoken language identification (Language ID) | See multi-lingual Whisper ASR models from Speech recognition |
| Punctuation | Address |
| Speaker segmentation | Address |
| Speech enhancement | Address |