Voice Mode

> 📖 Translated from the official Hermes Agent documentation. Last updated: 2026-04-16


Hermes Agent supports full voice interaction across CLI and messaging platforms. Talk to the agent using your microphone, hear spoken replies, and have live voice conversations in Discord voice channels.

If you want a practical setup walkthrough with recommended configurations and real usage patterns, see Use Voice Mode with Hermes.

Prerequisites

Before using voice features, make sure you have:

  1. Hermes Agent installed: pip install hermes-agent (see Installation)
  2. An LLM provider configured — run hermes model or set your preferred provider credentials in ~/.hermes/.env
  3. A working base setup — run hermes to verify the agent responds to text before enabling voice


Overview

| Feature | Platform | Description |
| --- | --- | --- |
| Interactive Voice | CLI | Press Ctrl+B to record; the agent auto-detects silence and responds |
| Auto Voice Reply | Telegram, Discord | Agent sends spoken audio alongside text responses |
| Voice Channel | Discord | Bot joins a VC, listens to users speaking, speaks replies back |

Requirements

Python Packages

```shell
# CLI voice mode (microphone + audio playback)
pip install "hermes-agent[voice]"

# Discord + Telegram messaging (includes discord.py[voice] for VC support)
pip install "hermes-agent[messaging]"

# Premium TTS (ElevenLabs)
pip install "hermes-agent[tts-premium]"

# Local TTS (NeuTTS, optional)
python -m pip install -U neutts[all]

# Everything at once
pip install "hermes-agent[all]"
```

| Extra | Packages | Required For |
| --- | --- | --- |
| voice | sounddevice, numpy | CLI voice mode |
| messaging | discord.py[voice], python-telegram-bot, aiohttp | Discord & Telegram bots |
| tts-premium | elevenlabs | ElevenLabs TTS provider |

Optional local TTS provider: install neutts separately with python -m pip install -U neutts[all]. On first use it downloads the model automatically.


System Dependencies

```shell
# macOS
brew install portaudio ffmpeg opus
brew install espeak-ng   # for NeuTTS

# Ubuntu/Debian
sudo apt install portaudio19-dev ffmpeg libopus0
sudo apt install espeak-ng   # for NeuTTS
```

| Dependency | Purpose | Required For |
| --- | --- | --- |
| PortAudio | Microphone input and audio playback | CLI voice mode |
| ffmpeg | Audio format conversion (MP3 → Opus, PCM → WAV) | All platforms |
| Opus | Discord voice codec | Discord voice channels |
| espeak-ng | Phonemizer backend | Local NeuTTS provider |

API Keys

Add to ~/.hermes/.env:

```shell
# Speech-to-Text — local provider needs NO key at all
# pip install faster-whisper          # Free, runs locally, recommended
GROQ_API_KEY=your-key                 # Groq Whisper — fast, free tier (cloud)
VOICE_TOOLS_OPENAI_KEY=your-key       # OpenAI Whisper — paid (cloud)

# Text-to-Speech (optional — Edge TTS and NeuTTS work without any key)
ELEVENLABS_API_KEY=***           # ElevenLabs — premium quality
# VOICE_TOOLS_OPENAI_KEY above also enables OpenAI TTS
```



CLI Voice Mode

Quick Start

Start the CLI and enable voice mode:

```shell
hermes                # Start the interactive CLI
```

Then use these commands inside the CLI:

```
/voice          Toggle voice mode on/off
/voice on       Enable voice mode
/voice off      Disable voice mode
/voice tts      Toggle TTS output
/voice status   Show current state
```

How It Works

  1. Start the CLI with hermes and enable voice mode with /voice on
  2. Press Ctrl+B — a beep plays (880Hz), recording starts
  3. Speak — a live audio level bar shows your input: ● [▁▂▃▅▇▇▅▂] ❯
  4. Stop speaking — after 3 seconds of silence, recording auto-stops
  5. Two beeps play (660Hz) confirming the recording ended
  6. Audio is transcribed via Whisper and sent to the agent
  7. If TTS is enabled, the agent's reply is spoken aloud
  8. Recording automatically restarts — speak again without pressing any key

This loop continues until you press Ctrl+B during recording (exits continuous mode) or 3 consecutive recordings detect no speech.
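The continuous loop above can be sketched as follows. This is a minimal illustration of the control flow, not Hermes' actual internals; all function names here are hypothetical and injected so the exit condition (3 empty recordings) is easy to see:

```python
# Minimal sketch of the continuous voice loop described above.
def voice_loop(record, transcribe, ask_agent, speak,
               tts_enabled=True, max_empty=3):
    empty = 0
    while empty < max_empty:
        audio = record()              # Ctrl+B beep, auto-stop on silence
        text = transcribe(audio)      # Whisper STT
        if not text:
            empty += 1                # no speech detected this round
            continue
        empty = 0
        reply = ask_agent(text)       # full agent pipeline
        if tts_enabled:
            speak(reply)              # spoken aloud, then the loop restarts
```

With stub callables, two spoken turns followed by three silent recordings produce exactly two agent replies before the loop exits.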


Silence Detection

A two-stage algorithm detects when you've finished speaking:

  1. Speech confirmation — waits for audio above the RMS threshold (200) for at least 0.3s, tolerating brief dips between syllables
  2. End detection — once speech is confirmed, triggers after 3.0 seconds of continuous silence

If no speech is detected at all for 15 seconds, recording stops automatically.

Both silence_threshold and silence_duration are configurable in config.yaml.
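A minimal sketch of the two-stage detector, using the defaults above (RMS threshold 200, 0.3s speech confirmation, 3.0s end silence, 15s no-speech timeout). Frame handling and names are illustrative, and stage 1 is simplified — the real detector also tolerates brief dips between syllables:

```python
def should_stop(frames_rms, frame_dur=0.1, silence_threshold=200,
                min_speech=0.3, silence_duration=3.0, no_speech_timeout=15.0):
    """Return the frame index at which recording stops, or None."""
    need_speech = round(min_speech / frame_dur)         # 3 frames
    need_silence = round(silence_duration / frame_dur)  # 30 frames
    give_up = round(no_speech_timeout / frame_dur)      # 150 frames
    speech_run = silence_run = 0
    confirmed = False
    for i, rms in enumerate(frames_rms):
        loud = rms > silence_threshold
        if not confirmed:
            # Stage 1: wait for sustained speech above the threshold
            speech_run = speech_run + 1 if loud else 0
            if speech_run >= need_speech:
                confirmed = True
            elif i + 1 >= give_up:
                return i               # never heard speech: give up
        else:
            # Stage 2: wait for sustained silence after confirmed speech
            silence_run = 0 if loud else silence_run + 1
            if silence_run >= need_silence:
                return i
    return None
```

Half a second of speech followed by quiet stops the recording 3 seconds into the silence; pure silence runs out at the 15-second timeout instead.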

Streaming TTS

When TTS is enabled, the agent speaks its reply sentence-by-sentence as it generates text — you don't wait for the full response:

  1. Buffers text deltas into complete sentences (min 20 chars)
  2. Strips markdown formatting and <think> blocks
  3. Generates and plays audio per sentence in real-time
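The sentence-buffering step can be sketched like this. The 20-character minimum matches the text above; the boundary regex and function shape are assumptions for illustration:

```python
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # boundary after ., !, or ?
MIN_CHARS = 20                               # minimum sentence length, per above

def buffer_sentences(deltas, min_chars=MIN_CHARS):
    """Yield speakable sentences as streaming text deltas arrive."""
    buf = ""
    for delta in deltas:
        buf += delta
        while True:
            cut = next((m for m in SENTENCE_END.finditer(buf)
                        if m.start() >= min_chars), None)
            if cut is None:
                break                   # no long-enough sentence yet
            yield buf[: cut.start()]    # hand this sentence to TTS now
            buf = buf[cut.end():]
    if buf.strip():
        yield buf.strip()               # flush whatever remains at the end
```

Short fragments like "Hi." stay buffered until a boundary past the minimum length arrives, so TTS never receives stub sentences.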

Hallucination Filter

Whisper sometimes generates phantom text from silence or background noise ("Thank you for watching", "Subscribe", etc.). The agent filters these out using a set of 26 known hallucination phrases across multiple languages, plus a regex pattern that catches repetitive variations.
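A sketch of the filter logic. The phrase list below is a small illustrative subset (Hermes ships 26 phrases across multiple languages), and the repetition check stands in for the actual regex:

```python
import re

# Illustrative subset of known Whisper hallucination phrases.
HALLUCINATIONS = {
    "thank you for watching",
    "thanks for watching",
    "please subscribe",
    "subscribe to my channel",
}

def _is_repetitive(text, min_words=6):
    """True for degenerate repeats like 'thank you thank you thank you'."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < min_words:
        return False
    for size in (1, 2, 3):
        chunks = [tuple(words[i:i + size])
                  for i in range(0, len(words) - size + 1, size)]
        if len(chunks) >= 3 and len(set(chunks)) == 1:
            return True
    return False

def is_hallucination(text):
    normalized = re.sub(r"[^\w\s]", "", text).strip().lower()
    return normalized in HALLUCINATIONS or _is_repetitive(text)
```

Matching transcripts are discarded instead of being sent to the agent.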


Gateway Voice Reply (Telegram & Discord)

If you haven't set up your messaging bots yet, see the platform-specific Telegram and Discord setup guides.

Start the gateway to connect to your messaging platforms:

```shell
hermes gateway        # Start the gateway (connects to configured platforms)
hermes gateway setup  # Interactive setup wizard for first-time configuration
```

Discord: Channels vs DMs

The bot supports two interaction modes on Discord:

| Mode | How to Talk | Mention Required | Setup |
| --- | --- | --- | --- |
| Direct Message (DM) | Open the bot's profile → "Message" | No | Works immediately |
| Server Channel | Type in a text channel where the bot is present | Yes (@botname) | Bot must be invited to the server |

DM (recommended for personal use): Just open a DM with the bot and type — no @mention needed. Voice replies and all commands work the same as in channels.

Server channels: The bot only responds when you @mention it (e.g. @hermesbyt4 hello). Make sure you select the bot user from the mention popup, not the role with the same name.


Commands

These work in both Telegram and Discord (DMs and text channels):

```
/voice          Toggle voice mode on/off
/voice on       Voice replies only when you send a voice message
/voice tts      Voice replies for ALL messages
/voice off      Disable voice replies
/voice status   Show current setting
```

Modes

| Mode | Command | Behavior |
| --- | --- | --- |
| off | /voice off | Text only (default) |
| voice_only | /voice on | Speaks reply only when you send a voice message |
| all | /voice tts | Speaks reply to every message |

Voice mode setting is persisted across gateway restarts.

Platform Delivery

| Platform | Format | Notes |
| --- | --- | --- |
| Telegram | Voice bubble (Opus/OGG) | Plays inline in chat. ffmpeg converts MP3 → Opus if needed |
| Discord | Native voice bubble (Opus/OGG) | Plays inline like a user voice message. Falls back to a file attachment if the voice bubble API fails |

Discord Voice Channels

The most immersive voice feature: the bot joins a Discord voice channel, listens to users speaking, transcribes their speech, processes through the agent, and speaks the reply back in the voice channel.

Setup

1. Discord Bot Permissions

If you already have a Discord bot set up for text (see Discord Setup Guide), you need to add voice permissions.

Go to the Discord Developer Portal → your application → Installation → Default Install Settings → Guild Install:

Add these permissions to the existing text permissions:

| Permission | Purpose | Required |
| --- | --- | --- |
| Connect | Join voice channels | Yes |
| Speak | Play TTS audio in voice channels | Yes |
| Use Voice Activity | Detect when users are speaking | Recommended |

Updated Permissions Integer:

| Level | Integer | What's Included |
| --- | --- | --- |
| Text only | 274878286912 | View Channels, Send Messages, Read History, Embeds, Attachments, Threads, Reactions |
| Text + Voice | 274881432640 | All of the above + Connect, Speak |

Re-invite the bot with the updated permissions URL:

https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&scope=bot+applications.commands&permissions=274881432640

Replace YOUR_APP_ID with your Application ID from the Developer Portal.


2. Privileged Gateway Intents

In the Developer Portal → your application → Bot → Privileged Gateway Intents, enable all three:

| Intent | Purpose |
| --- | --- |
| Presence Intent | Detect user online/offline status |
| Server Members Intent | Map voice SSRC identifiers to Discord user IDs |
| Message Content Intent | Read text message content in channels |

All three are required for full voice channel functionality. Server Members Intent is especially critical — without it, the bot cannot identify who is speaking in the voice channel.

3. Opus Codec

The Opus codec library must be installed on the machine running the gateway:

```shell
# macOS (Homebrew)
brew install opus

# Ubuntu/Debian
sudo apt install libopus0
```

The bot auto-loads the codec from:

  • macOS: /opt/homebrew/lib/libopus.dylib
  • Linux: libopus.so.0

4. Environment Variables

```shell
# ~/.hermes/.env

# Discord bot (already configured for text)
DISCORD_BOT_TOKEN=your-bot-token
DISCORD_ALLOWED_USERS=your-user-id

# STT — local provider needs no key (pip install faster-whisper)
# GROQ_API_KEY=your-key            # Alternative: cloud-based, fast, free tier

# TTS — optional. Edge TTS and NeuTTS need no key.
# ELEVENLABS_API_KEY=***      # Premium quality
# VOICE_TOOLS_OPENAI_KEY=***  # OpenAI TTS / Whisper
```

Start the Gateway

```shell
hermes gateway        # Start with existing configuration
```

The bot should come online in Discord within a few seconds.

Commands

Use these in the Discord text channel where the bot is present:

```
/voice join      Bot joins your current voice channel
/voice channel   Alias for /voice join
/voice leave     Bot disconnects from voice channel
/voice status    Show voice mode and connected channel
```


How It Works

When the bot joins a voice channel, it:

  1. Listens to each user's audio stream independently
  2. Detects silence — 1.5s of silence after at least 0.5s of speech triggers processing
  3. Transcribes the audio via Whisper STT (local, Groq, or OpenAI)
  4. Processes through the full agent pipeline (session, tools, memory)
  5. Speaks the reply back in the voice channel via TTS
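Per-user listening with the thresholds above (0.5s of speech, then 1.5s of silence) can be sketched as follows. The 20ms frame size is typical for Discord Opus audio; class and field names are illustrative, not Hermes internals:

```python
class UserStream:
    """Independent speech/silence tracker for one voice-channel user."""
    # 20ms frames: 25 frames = 0.5s of speech, 75 frames = 1.5s of silence
    MIN_SPEECH_FRAMES = 25
    END_SILENCE_FRAMES = 75

    def __init__(self):
        self.speech = 0
        self.silence = 0
        self.frames = []

    def feed(self, frame, is_speech):
        """Buffer one frame; return the full utterance once it completes."""
        self.frames.append(frame)
        if is_speech:
            self.speech += 1
            self.silence = 0
        else:
            self.silence += 1
        if (self.speech >= self.MIN_SPEECH_FRAMES
                and self.silence >= self.END_SILENCE_FRAMES):
            utterance, self.frames = self.frames, []
            self.speech = self.silence = 0
            return utterance            # ready for STT -> agent -> TTS
        return None

streams = {}   # one independent tracker per Discord user ID

def on_voice_frame(user_id, frame, is_speech):
    return streams.setdefault(user_id, UserStream()).feed(frame, is_speech)
```

Each user gets their own tracker, which is why the Server Members Intent (mapping SSRC to user ID) matters.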

Text Channel Integration

When the bot is in a voice channel:

  • Transcripts appear in the text channel: [Voice] @user: what you said
  • Agent responses are sent as text in the channel AND spoken in the VC
  • The text channel is the one where /voice join was issued

Echo Prevention

The bot automatically pauses its audio listener while playing TTS replies, preventing it from hearing and re-processing its own output.
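One way to sketch this guard — the context-manager shape and class names are assumptions about structure, not Hermes' actual API:

```python
from contextlib import contextmanager

class Listener:
    """Toy stand-in for the per-channel audio listener."""
    def __init__(self):
        self.paused = False
        self.heard = []

    def feed(self, frame):
        if not self.paused:          # frames are dropped while paused
            self.heard.append(frame)

    @contextmanager
    def muted(self):
        self.paused = True
        try:
            yield
        finally:
            self.paused = False      # resume listening even on error

def play_reply(listener, audio, play):
    with listener.muted():           # bot cannot hear its own TTS output
        play(audio)
```

Wrapping playback this way guarantees the listener resumes even if TTS playback raises.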

Access Control

Only users listed in DISCORD_ALLOWED_USERS can interact via voice. Other users' audio is silently ignored.

```shell
# ~/.hermes/.env
DISCORD_ALLOWED_USERS=284102345871466496
```
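A sketch of the allow-list check, assuming the variable holds a comma-separated list of numeric IDs (the parsing details here are illustrative):

```python
import os

def allowed_ids(env=None):
    """Parse DISCORD_ALLOWED_USERS into a set of ID strings."""
    raw = (env or os.environ).get("DISCORD_ALLOWED_USERS", "")
    return {part.strip() for part in raw.split(",") if part.strip()}

def is_allowed(user_id, env=None):
    """Audio from anyone outside the allow-list is silently dropped."""
    return str(user_id) in allowed_ids(env)
```

Comparing as strings sidesteps int-vs-string mismatches between Discord's user IDs and the env file.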

Configuration Reference

config.yaml

```yaml
# Voice recording (CLI)
voice:
  record_key: "ctrl+b"             # Key to start/stop recording
  max_recording_seconds: 120       # Maximum recording length
  auto_tts: false                  # Auto-enable TTS when voice mode starts
  silence_threshold: 200           # RMS level (0-32767) below which counts as silence
  silence_duration: 3.0            # Seconds of silence before auto-stop

# Speech-to-Text
stt:
  provider: "local"                # "local" (free) | "groq" | "openai"
  local:
    model: "base"                  # tiny, base, small, medium, large-v3
  # model: "whisper-1"             # Legacy: used when provider is not set

# Text-to-Speech
tts:
  provider: "edge"                 # "edge" (free) | "elevenlabs" | "openai" | "neutts" | "minimax"
  edge:
    voice: "en-US-AriaNeural"      # 322 voices, 74 languages
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"    # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"                 # alloy, echo, fable, onyx, nova, shimmer
    base_url: "https://api.openai.com/v1"  # optional: override for self-hosted or OpenAI-compatible endpoints
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
```

Environment Variables

```shell
# Speech-to-Text providers (local needs no key)
# pip install faster-whisper         # Free local STT — no API key needed
GROQ_API_KEY=...                     # Groq Whisper (fast, free tier)
VOICE_TOOLS_OPENAI_KEY=...           # OpenAI Whisper (paid)

# STT advanced overrides (optional)
STT_GROQ_MODEL=whisper-large-v3-turbo            # Override default Groq STT model
STT_OPENAI_MODEL=whisper-1                       # Override default OpenAI STT model
GROQ_BASE_URL=https://api.groq.com/openai/v1     # Custom Groq endpoint
STT_OPENAI_BASE_URL=https://api.openai.com/v1    # Custom OpenAI STT endpoint

# Text-to-Speech providers (Edge TTS and NeuTTS need no key)
ELEVENLABS_API_KEY=***               # ElevenLabs (premium quality)
# VOICE_TOOLS_OPENAI_KEY above also enables OpenAI TTS

# Discord voice channel
DISCORD_BOT_TOKEN=...
DISCORD_ALLOWED_USERS=...
```

STT Provider Comparison

| Provider | Model | Speed | Quality | Cost | API Key |
| --- | --- | --- | --- | --- | --- |
| Local | base | Fast (depends on CPU/GPU) | Good | Free | No |
| Local | small | Medium | Better | Free | No |
| Local | large-v3 | Slow | Best | Free | No |
| Groq | whisper-large-v3-turbo | Very fast (~0.5s) | Good | Free tier | Yes |
| Groq | whisper-large-v3 | Fast (~1s) | Better | Free tier | Yes |
| OpenAI | whisper-1 | Fast (~1s) | Good | Paid | Yes |
| OpenAI | gpt-4o-transcribe | Medium (~2s) | Best | Paid | Yes |

Provider priority (automatic fallback): local > groq > openai
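The fallback order can be sketched as follows; the availability check here is illustrative (the real one inspects installed packages and API keys):

```python
STT_PRIORITY = ["local", "groq", "openai"]

def pick_stt_provider(available):
    """Return the first usable STT provider in priority order."""
    for name in STT_PRIORITY:
        if available.get(name):
            return name
    raise RuntimeError("no STT provider available: install faster-whisper "
                       "or set GROQ_API_KEY / VOICE_TOOLS_OPENAI_KEY")
```

So the free local provider always wins when faster-whisper is installed, and cloud providers are only used as fallbacks.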

TTS Provider Comparison

| Provider | Quality | Cost | Latency | Key Required |
| --- | --- | --- | --- | --- |
| Edge TTS | Good | Free | ~1s | No |
| ElevenLabs | Excellent | Paid | ~2s | Yes |
| OpenAI TTS | Good | Paid | ~1.5s | Yes |
| NeuTTS | Good | Free | Depends on CPU/GPU | No |

NeuTTS uses the tts.neutts config block above.


Troubleshooting

"No audio device found" (CLI)

PortAudio is not installed:

```shell
brew install portaudio            # macOS
sudo apt install portaudio19-dev  # Ubuntu
```

Bot doesn't respond in Discord server channels

The bot requires an @mention by default in server channels. Make sure you:

  1. Type @ and select the bot user (with the #discriminator), not the role with the same name
  2. Or use DMs instead — no mention needed
  3. Or set DISCORD_REQUIRE_MENTION=false in ~/.hermes/.env

Bot joins VC but doesn't hear me

  • Check your Discord user ID is in DISCORD_ALLOWED_USERS
  • Make sure you're not muted in Discord
  • The bot needs a SPEAKING event from Discord before it can map your audio — start speaking within a few seconds of joining

Bot hears me but doesn't respond

  • Verify STT is available: install faster-whisper (no key needed) or set GROQ_API_KEY / VOICE_TOOLS_OPENAI_KEY
  • Check the LLM model is configured and accessible
  • Review gateway logs: tail -f ~/.hermes/logs/gateway.log

Bot responds in text but not in voice channel

  • TTS provider may be failing — check API key and quota
  • Edge TTS (free, no key) is the default fallback
  • Check logs for TTS errors

Whisper returns garbage text

The hallucination filter catches most cases automatically. If you're still getting phantom transcripts:

  • Use a quieter environment
  • Adjust silence_threshold in config (higher = less sensitive)
  • Try a different STT model
