NonVerbalSpeech-38K: A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding

Anonymous submission

GitHub Repo | Hugging Face

Abstract

Non-verbal vocalizations (NVs), such as laughter and sighs, are vital for conveying emotion and intention in human speech, yet most existing speech systems neglect them, which severely compromises communicative richness and emotional intelligence. Existing methods for acquiring NV data are either costly and unscalable (relying on manual annotation or recording) or unnatural (relying on rule-based synthesis). To address these limitations, we propose a highly scalable automatic annotation framework that labels non-verbal phenomena in natural speech; it is low-cost and easily extendable, and the resulting data are inherently diverse and natural. The framework uses a unified detection model to accurately identify NVs in natural speech and integrates them with transcripts via a temporal-semantic alignment (TSA) method. Using this framework, we created and released NonVerbalSpeech-38K, a diverse, real-world dataset of 38,718 samples across 10 NV categories collected from in-the-wild media. Experimental results demonstrate that our dataset provides superior controllability for NV generation and achieves comparable performance for NV understanding.

This page is for research demonstration purposes only.

Pipeline Overview

Figure 1. An overview of our NonVerbalSpeech-38K pipeline. It consists of three main stages: (1) Data Preparation and Preprocessing, (2) Non-Verbal Segment Detection, and (3) Integration of Non-Verbal Tags into Speech Content. Among these, Non-Verbal Segment Detection is the core component of the pipeline. Colors other than black in the waveform indicate non-verbal vocalizations.

Figure 2. (a) The task definition, (b) the proposed frame-level detection model, and (c) the temporal-semantic alignment (TSA) method used in the pipeline.
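
A frame-level detector such as the one in Figure 2(b) emits one label per frame, and those labels have to be grouped into timed segments before any tag can be placed in a transcript. The following is a minimal sketch of that grouping step, not the authors' implementation. The function name `frames_to_segments`, the 20 ms hop, the `speech` background label, and the minimum-duration filter are all assumptions made for illustration.

```python
from itertools import groupby


def frames_to_segments(frame_labels, hop_s=0.02, background="speech", min_dur_s=0.1):
    """Group consecutive identical frame labels into (label, start, end) segments.

    hop_s, background and min_dur_s are illustrative assumptions,
    not values taken from the paper.
    """
    segments = []
    idx = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        n = len(list(run))
        start, end = idx * hop_s, (idx + n) * hop_s
        idx += n
        # Keep only non-background runs that last long enough.
        if label != background and n * hop_s >= min_dur_s:
            segments.append((label, round(start, 3), round(end, 3)))
    return segments


if __name__ == "__main__":
    # Hypothetical per-frame predictions: 10 speech frames, 15 laughing frames, ...
    frames = ["speech"] * 10 + ["laughing"] * 15 + ["speech"] * 5 + ["sigh"] * 2
    print(frames_to_segments(frames))  # [('laughing', 0.2, 0.5)]
```

The two-frame `sigh` run is dropped by the duration filter here; a real system would likely tune or replace that filter per category.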

Non-Verbal Speech Generation (ZH)

Note: Dia appears to lack support for Chinese (ZH), resulting in unintelligible audio. Furthermore, the pretrained F5-TTS model does not support non-verbal expression control.

Text Prompt | Dia | CosyVoice2 | F5-TTS | F5-TTS + CapSpeech | F5-TTS + NonVerbalTTS | F5-TTS + SMIIP-NV | F5-TTS + MNV-17 | F5-TTS + SynParaSpeech | F5-TTS + NVSpeech | F5-TTS + NonVerbalSpeech-38K (TBO) (Ours) | F5-TTS + NonVerbalSpeech-38K (TSA) (Ours)
[sniff] 那封泛黄的家书,读着读着眼泪就下来了。
看着地平线,[breath] 我觉得一切很美好。
所有国际包裹都需要 [throatclearing] 填写海关申报单。
看到孩子的奖状,[sniff] 我激动得有点鼻酸。
这盆花养了半年,[sigh] 今天突然蔫了。
[throatclearing] 会议改到B会议室了。
[sigh] 面试表现很好,却被告知岗位临时取消
你把窗户关上吧,[coughing] 风太大嗓子受不了。
[laughing] 妈妈用防晒喷雾喷墙说除灰
他递来伞时,[breath] 我轻声道谢。
您的订阅包含 [throatclearing] 三个月免费高级服务。
快喝点热水吧,[coughing]别硬撑着了。
[laughing] 同桌把钢笔当成筷子用,墨水沾了满脸。
同事在群里发了表情包,[laughing] 太有梗了。
装修的灰尘太多了,[coughing] 我得戴个口罩。

Non-Verbal Speech Generation (EN)

Text Prompt | Dia | CosyVoice2 | F5-TTS | F5-TTS + CapSpeech | F5-TTS + NonVerbalTTS | F5-TTS + SMIIP-NV | F5-TTS + MNV-17 | F5-TTS + SynParaSpeech | F5-TTS + NVSpeech | F5-TTS + NonVerbalSpeech-38K (TBO) (Ours) | F5-TTS + NonVerbalSpeech-38K (TSA) (Ours)
[laughing] The toddler 'read' the newspaper upside down with great seriousness.
Spent an hour cooking, [sigh] the dish is too salty.
[throatclearing] I think we can now move to the next speaker.
[breath] The meditation app guided my breathing rhythm.
Athletes joked about their silly warm-up moves, [laughing] before the game started.
I tried to dye Easter eggs [laughing] and ended up with blue fingers for a week.
The doctor will see you now [throatclearing] in examination room three.
[coughing] I think I need some warm tea to feel better.
I sat on the bench, [breath] watching people walk by.
[throatclearing] Fire exits are located at both ends of the hallway.
While playing with the kids at the park [coughing] I started coughing and had to sit down.
My internet dropped [sigh] during the video call.
[sniff] The subway platform carried ozone, pretzels, and the quiet determination of Monday.
While painting the high ceiling, [breath] she climbed down the ladder to breathe normally again.
[sniff] The scent of pine trees filled the mountain air.

Non-Verbal Speech Understanding (ZH)

Model | Text
GT (SMIIP-NV|MNV-17) | 爸爸喝多了,开始给我们讲他年轻时的“英雄事迹”,[laughing],好多都吹牛的。
Qwen2-Audio | 爸爸喝多了开始给我们讲他年轻时的英雄事迹哈哈哈好多都是吹牛的。
Qwen2-Audio + NVSpeech | 爸爸喝多了开始给我们讲他年轻时的英雄事迹,[laughing]好多都是吹牛的。
Qwen2-Audio + NonVerbalSpeech-38K (TBO) (Ours) | 爸爸喝多了,开始给我们讲他年轻时的英雄事迹, [laughing] 哈哈哈好多都是吹牛的。
Qwen2-Audio + NonVerbalSpeech-38K (TSA) (Ours) | 爸爸喝多了,开始给我们讲他年轻时的英雄事迹。[laughing]好多都是吹牛的。
Whisper-Large-V3 | 爸爸喝多了开始给我们讲他年轻时的英雄事迹哈哈哈哈好多都是吹牛的
Whisper-Large-V3 + NVSpeech | 爸爸喝多了,开始给我们讲他年轻时的英雄事迹,[laughing],好多都是吹牛的。
Whisper-Large-V3 + NonVerbalSpeech-38K (TBO) (Ours) | 爸爸喝多了,开始给我们讲他年轻时的英雄事迹,[laughing]好多都是吹牛的。
Whisper-Large-V3 + NonVerbalSpeech-38K (TSA) (Ours) | 爸爸喝多了,开始给我们讲他年轻时的英雄事迹,[laughing]好多都是吹牛的。
GT (SMIIP-NV|MNV-17) | 我见到一位画家,她能用食物绘画[coughing],太惊艳了。
Qwen2-Audio | 哦,我见到一位画家,他能用食物绘画,太惊艳了。
Qwen2-Audio + NVSpeech | 我见到一位画家,他能用实物绘画,[coughing]太惊艳了。
Qwen2-Audio + NonVerbalSpeech-38K (TBO) (Ours) | 我见到一位画家,他能用食物绘画, [coughing] 太惊艳了。
Qwen2-Audio + NonVerbalSpeech-38K (TSA) (Ours) | 我见到一位画家,他能用食物绘画,[coughing]太惊艳了。
Whisper-Large-V3 | 我见到一位画家他能用食物绘画太惊艳了
Whisper-Large-V3 + NVSpeech | 我见到一位画家,他能用食物绘画,[laughing]太惊验了。
Whisper-Large-V3 + NonVerbalSpeech-38K (TBO) (Ours) | 我见到一位画家,他能用食物绘画, [coughing] 太惊艳了。
Whisper-Large-V3 + NonVerbalSpeech-38K (TSA) (Ours) | 我见到一位画家,他能用食物绘画,[coughing]太惊艳了。
GT (SMIIP-NV|MNV-17) | 我好不容易把今天的活儿都干完了 [sigh],结果老板又发来一个新需求,这简直 离谱。
Qwen2-Audio | 我好不容易把今天的活都干完了,哎,结果老板又发来一个新需求,这简直离谱。
Qwen2-Audio + NVSpeech | 我好不容易把今天的活都干完了,[sigh],结果老板又发来一个新需求,这简直离谱。
Qwen2-Audio + NonVerbalSpeech-38K (TBO) (Ours) | 我好不容易把今天的活儿都干完了,哎,结果老板又发来一个新需求,这简直离谱。
Qwen2-Audio + NonVerbalSpeech-38K (TSA) (Ours) | 我好不容易把今天的活儿都干完了,哎,结果老板又发来一个新需求,这简直离谱。
Whisper-Large-V3 | 我好不容易把今天的活都干完了哎结果老板又发来一个新需求这简直离谱
Whisper-Large-V3 + NVSpeech | 我好不容易把今天的活都干完了,[sigh],结果老板又发来一个新需求,这简直离谱。
Whisper-Large-V3 + NonVerbalSpeech-38K (TBO) (Ours) | 我好不容易把今天的活儿都干完了,[sigh] 结果老板又发来一个新需求,这简直离谱。
Whisper-Large-V3 + NonVerbalSpeech-38K (TSA) (Ours) | 我好不容易把今天的活儿都干完了,[sigh]结果老板又发来一个新需求,这简直离谱。

Non-Verbal Speech Understanding (EN)

Model | Text
GT (NonVerbalTTS) | [laughing] Yeah, I feel comfortable around people.
Qwen2-Audio | yeah i feel comfortable around people.
Qwen2-Audio + CapSpeech | [laughing] i feel comfortable around people
Qwen2-Audio + NonVerbalSpeech-38K (TBO) (Ours) | [laughing] Yeah, I feel comfortable around people.
Qwen2-Audio + NonVerbalSpeech-38K (TSA) (Ours) | [laughing] Yeah, I feel comfortable around people.
Whisper-Large-V3 | Yeah, I feel comfortable around people.
Whisper-Large-V3 + CapSpeech | [laughing] yeah i feel comfortable around people
Whisper-Large-V3 + NonVerbalSpeech-38K (TBO) (Ours) | [laughing] Yeah, I feel comfortable around people.
Whisper-Large-V3 + NonVerbalSpeech-38K (TSA) (Ours) | [laughing] Yeah, I feel comfortable around people.
GT (NonVerbalTTS) | as much in more and other films, but she's very much [coughing]
Qwen2-Audio | as much and more in other films but she's very much.
Qwen2-Audio + CapSpeech | not as much in and more in other films but she's very much [coughing]
Qwen2-Audio + NonVerbalSpeech-38K (TBO) (Ours) | as much and more in other films but she's very much [coughing]
Qwen2-Audio + NonVerbalSpeech-38K (TSA) (Ours) | as much in and more in other films, but she's very much... [coughing]
Whisper-Large-V3 | as much in more and other films, but she's very much
Whisper-Large-V3 + CapSpeech | not as much and more in other films but she's very much [coughing]
Whisper-Large-V3 + NonVerbalSpeech-38K (TBO) (Ours) | Not as much and more in other films, but she's very much... [coughing]
Whisper-Large-V3 + NonVerbalSpeech-38K (TSA) (Ours) | as much and more in other films, but she's very much... [coughing]
GT (NonVerbalTTS) | Everton play well in the second half, but I think [breath] we we win.
Qwen2-Audio | Evertin play well in the second half, but I think we win.
Qwen2-Audio + CapSpeech | so everton play well in the second half but i think ah we we win [coughing]
Qwen2-Audio + NonVerbalSpeech-38K (TBO) (Ours) | Averton played well in the second half, but I think we [breath] win
Qwen2-Audio + NonVerbalSpeech-38K (TSA) (Ours) | Everton played well in the second half, but I think we [breath] win.
Whisper-Large-V3 | Everton play well in the second half, but I think we we win.
Whisper-Large-V3 + CapSpeech | averton played well in the second half but i think we we win [breath]
Whisper-Large-V3 + NonVerbalSpeech-38K (TBO) (Ours) | Everton played well in the second half, but I think we win. [breath]
Whisper-Large-V3 + NonVerbalSpeech-38K (TSA) (Ours) | Everton play well in the second half, but I think we [breath] we win.
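
One simple way to read the transcripts above is to separate each output into its sequence of non-verbal tags and its plain text, then compare both against the ground truth. The sketch below does this with regular expressions. It is an illustrative check only and is not the evaluation metric used in the paper; the helper names `nv_tags`, `strip_tags`, and `same_tag_sequence` are hypothetical.

```python
import re

# Tags on this page are lowercase words in square brackets, e.g. [laughing].
TAG_RE = re.compile(r"\[([a-z]+)\]")


def nv_tags(text):
    """Return the non-verbal tags in the order they appear."""
    return TAG_RE.findall(text)


def strip_tags(text):
    """Remove [tag] and <B>/</B> markers and collapse whitespace."""
    cleaned = re.sub(r"\[[a-z]+\]|</?B>", " ", text)
    return " ".join(cleaned.split())


def same_tag_sequence(reference, hypothesis):
    """True if both strings contain the same tags in the same order."""
    return nv_tags(reference) == nv_tags(hypothesis)


if __name__ == "__main__":
    gt = "[laughing] Yeah, I feel comfortable around people."
    base = "yeah i feel comfortable around people."
    tuned = "[laughing] Yeah, I feel comfortable around people."
    print(same_tag_sequence(gt, base))   # False
    print(same_tag_sequence(gt, tuned))  # True
    print(strip_tags(gt))                # Yeah, I feel comfortable around people.
```

A sequence-level match like this ignores where in the sentence a tag is placed, so it cannot distinguish the tag positions that differ between some of the rows above.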

NonVerbalSpeech-38K Dataset Samples

More information about our NonVerbalSpeech-38K dataset is available on Hugging Face.

Note:

  • The <B> and </B> tags enclose spoken words that overlap in time with a non-verbal expression.

Audio | Non-Verbal Segments Detected
Timestamp-Based Ordering (TBO): [snore]呃,乌托马乌托马他要跟我们去吗?你要跟我们去吗?
Temporal-Semantic Alignment (TSA) : [snore]嗯,呃,吴托玛吴托玛,他要跟我们去吗?你要跟我们去吗?
Timestamp-Based Ordering (TBO): 到门口,这蹲着摘耳朵,再次偷听,就听屋里的。 [snore]
Temporal-Semantic Alignment (TSA) : 到门口这蹲着摘耳朵,再次偷听,就听屋里头。[snore]
Timestamp-Based Ordering (TBO): don't thank me yet you must hurry saria sunrise will be here soon [throatclearing]
Temporal-Semantic Alignment (TSA) : Don't thank me yet. You must hurry, Saria. Sunrise will be here soon. [throatclearing]
Timestamp-Based Ordering (TBO): 我是不是随性过头了?[throatclearing]第一个环节是回忆过去。
Temporal-Semantic Alignment (TSA) : 我是不是随性过头了?[throatclearing]第一个环节是回忆过去。
Timestamp-Based Ordering (TBO): 他骗了我好几十万呢,我的棺材本全都搭进去了,然后然后他人就不见了,到警也找不到。[crying]
Temporal-Semantic Alignment (TSA) : 他骗了我好几十万呢,我的棺材本全都搭进去了,然后然后他人就不见了,报警也找不到。[crying]
Timestamp-Based Ordering (TBO): [crying]不要,王林说好的,我俩一块走的,我不要一个人走。
Temporal-Semantic Alignment (TSA) : [crying]不要,王林说好的,我俩一块走的,我不要一个人走。
Timestamp-Based Ordering (TBO): 啊,这都画的啥呀,[breath]不是你是小学生上语文课吗?又是卡通小人,又是图句号的。哎呀,行,男主名字,后面还写了个全剧最帅。
Temporal-Semantic Alignment (TSA) : 啊,这都画的啥呀?[breath]哇,你是小学生上语文课吗?又是卡通小人,又是图句号的。哎呀,行,男主名字后面还写了个全剧最帅。
Timestamp-Based Ordering (TBO): [breath]小孩子的把戏水做的枪也想杀人。
Temporal-Semantic Alignment (TSA) : [breath]小孩子的把戏水做的枪也想杀人。
Timestamp-Based Ordering (TBO): 李渊看着秦琼心说,这是我的金殿哪,他在这儿就指着我两个儿子,让我两个儿子心服口服。哎呀。李渊心说, 大唐[sniff]<B>江山</B>要紧,我也不能再包庇我的两个儿子了。哎呀,是世民,你也给我跪下。
Temporal-Semantic Alignment (TSA) : 李渊看着秦琼心说,这是我的金殿哪,他在这儿就指着我两个儿子,让我两个儿子心服口服。[sniff]哎呀。李渊心说,大唐江山要紧,我也不能再包庇我的两个儿子了,二儿是民,你也给我跪下。
Timestamp-Based Ordering (TBO): [sniff]说什么呢?我是说韩长老青通药理,韩家又一直盯着我不放,有没有可能是冲着我身上的某种东西来的。
Temporal-Semantic Alignment (TSA) : [sniff]说说什么呢?我是说韩长老精通药理,韩家又一直盯着我不放,有没有可能是冲着我身上的某种东西来的。
Timestamp-Based Ordering (TBO): [laughing]我们来助早齿大人一臂之力。
Temporal-Semantic Alignment (TSA) : [laughing]我们来助早齿大人一臂之力嗯。
Timestamp-Based Ordering (TBO): 女娲说完一笑,便直接转身离开了后院。而蚩尤则是愣住了片刻,随即[laughing]<B>仰天</B>大笑。 女娲,你果然厉害呀。
Temporal-Semantic Alignment (TSA) : 女娲说完一笑,便直接转身离开了后院。而蚩尤则是愣住了片刻,随即仰天大笑。[laughing]女娲,你果然厉害呀。
Timestamp-Based Ordering (TBO): 先生,您好歹吃点东西吧,再难受,也不能不吃东西。[coughing]
Temporal-Semantic Alignment (TSA) : 先生,您好歹吃点东西吧,再难受,也不能不吃东西。[coughing]
Timestamp-Based Ordering (TBO): 你去嗯嗯大人,我该注意些什么?[coughing]
Temporal-Semantic Alignment (TSA) : 你去嗯嗯大人,我该注意些什么?[coughing]
Timestamp-Based Ordering (TBO): Heya, kids. The vet reported your dog is a biter, so I'm supposed to wait until animal control gets here to take him away. [gasp]
Temporal-Semantic Alignment (TSA) : Hey, kids, the vet reported your dog is a biter, so I'm supposed to wait until animal control gets here to take him away. [gasp]<B> What? </B>
Timestamp-Based Ordering (TBO): 这是怎么[gasp]<B>回事?</B>
Temporal-Semantic Alignment (TSA) : 这是怎么回事[gasp]啊,你了。
Timestamp-Based Ordering (TBO): [yawn]好久没有熬夜了,好困啊。
Temporal-Semantic Alignment (TSA) : [yawn]啊,嗯好久没有熬夜了,好困啊。
Timestamp-Based Ordering (TBO): [yawn]妖怪,你才是妖怪呢。
Temporal-Semantic Alignment (TSA) : [yawn]啊,妖怪,你才是妖怪呢。
Timestamp-Based Ordering (TBO): 杨总,我知道数的都是雷政委的名字,没关系,以我的身份没有权利,拥有什么研究成果的,让你受委屈了。没事的,能在[sigh]<B>红暗</B>基地做研究,能看到这么丰富的资料。
Temporal-Semantic Alignment (TSA) : 杨总,我知道数的都是雷政委的名字,没关系,以我的身份没有权利,拥有什么研究成果的,[sigh]让你受委屈了。没事的,能在红暗基地做研究,能看到这么丰富的资料。
Timestamp-Based Ordering (TBO): [sigh]<B> They say </B> time flies when you're having fun, so...
Temporal-Semantic Alignment (TSA) : [sigh] They say time flies when you're having fun, so...
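
This page does not spell out how Timestamp-Based Ordering places tags. One plausible reading, sketched below under stated assumptions, is that each detected segment is inserted into a word-timestamped transcript purely by time, and that any words overlapping the segment are wrapped in <B> and </B> as described in the note above. The types `Word` and `NVSegment`, the function `tbo_insert`, and all timestamps are hypothetical and not taken from the dataset; the sketch also assumes segments do not overlap one another.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class NVSegment:
    label: str
    start: float  # seconds
    end: float


def tbo_insert(words, segments):
    """Place [label] tags among words by timestamp only (illustrative sketch)."""
    before = {}         # word index -> markers emitted before that word
    closes = Counter()  # word index -> number of </B> emitted after that word
    for seg in sorted(segments, key=lambda s: s.start):
        # Words whose time span intersects the segment.
        hit = [i for i, w in enumerate(words)
               if w.start < seg.end and w.end > seg.start]
        if hit:
            before.setdefault(hit[0], []).append(f"[{seg.label}]<B>")
            closes[hit[-1]] += 1
        else:
            # No overlap: insert after every word that ended before the segment.
            idx = sum(1 for w in words if w.end <= seg.start)
            before.setdefault(idx, []).append(f"[{seg.label}]")
    parts = []
    for i, w in enumerate(words):
        parts.extend(before.get(i, []))
        parts.append(w.text)
        parts.extend(["</B>"] * closes[i])
    parts.extend(before.get(len(words), []))  # tags after the last word
    return " ".join(parts)


if __name__ == "__main__":
    # Invented timestamps, chosen so the sigh overlaps the first two words.
    words = [Word("They", 0.5, 0.7), Word("say", 0.7, 1.0),
             Word("time", 1.2, 1.5), Word("flies", 1.5, 1.9)]
    print(tbo_insert(words, [NVSegment("sigh", 0.0, 0.9)]))
    # [sigh]<B> They say </B> time flies
```

Ordering by time alone takes no account of sentence meaning, which may be why several TBO and TSA rows above place the same tag at different positions.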