We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech
(TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that
rely on intermediate acoustic representations such as mel-spectrograms, LongCat-AudioDiT
operates directly within the waveform latent space. This approach
effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring
only a waveform variational autoencoder (Wav-VAE) and a diffusion Transformer (DiT) backbone. Furthermore,
we introduce two critical improvements to the inference process: first, we identify and
rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free
guidance (CFG) with adaptive projection guidance (APG) to elevate generation quality. Experimental
results demonstrate that, without relying on complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning
performance on the Seed benchmark while maintaining competitive intelligibility. Specifically,
our largest variant, LongCat-AudioDiT-3.5B (trained on 1M hours of Chinese and English speech), outperforms the previous SOTA model
Seed-DiT, improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from
0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies, we reveal the counterintuitive
finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance.
Architecture Overview
LongCat-AudioDiT consists of only two streamlined components: a Wav-VAE and a Diffusion Transformer (DiT).
By generating waveform latents directly, it avoids the compounding errors from predicting and converting
intermediate representations (e.g., mel-spectrograms) into waveforms, completely bypassing auxiliary vocoders.
Figure 1: Overview of LongCat-AudioDiT. Our architecture generates continuous waveform latents directly,
thereby avoiding the compounding errors that arise when predicting and subsequently converting
intermediate representations (e.g., mel-spectrograms) into waveforms.
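To make the data flow concrete, here is a minimal sketch of the two-component pipeline in PyTorch style. All module names and interfaces (wav_vae.encode/decode, dit_sampler) are illustrative assumptions, not the released API:

```python
import torch

@torch.no_grad()
def synthesize(wav_vae, dit_sampler, text_tokens, prompt_wave):
    # Encode the prompt waveform into the continuous latent space.
    prompt_latent = wav_vae.encode(prompt_wave)
    # The DiT samples target latents directly in this waveform latent
    # space, conditioned on the text and the prompt latents.
    target_latent = dit_sampler(text_tokens, prompt_latent)
    # A single VAE decode returns audio: no mel-spectrogram, no vocoder.
    return wav_vae.decode(target_latent)
```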
Main Results
Objective evaluation on the Seed benchmark. LongCat-AudioDiT establishes new SOTA speaker similarity (SIM)
on Seed-ZH and Seed-Hard, while securing competitive intelligibility — all with an end-to-end architecture and a single training stage.
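For reference, SIM on this benchmark is typically computed as the cosine similarity between speaker-verification embeddings of the reference prompt and the synthesized audio. The sketch below assumes a generic speaker encoder; embed_model is a placeholder (e.g., a WavLM-based verifier), not a specific released checkpoint:

```python
import torch.nn.functional as F

def speaker_similarity(embed_model, prompt_wave, generated_wave):
    # Embed both clips with the same speaker-verification encoder.
    e_ref = embed_model(prompt_wave)
    e_gen = embed_model(generated_wave)
    # SIM is the cosine similarity of the two embeddings (higher is better).
    return F.cosine_similarity(e_ref, e_gen, dim=-1)
```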
Zero-Shot Voice Cloning Samples
12 random samples from the Seed benchmark. Each row shows the reference prompt audio used for zero-shot voice cloning,
the target text, and the audio synthesized by LongCat-AudioDiT-3.5B.
5. These drawings are usually outline drawings that are quite expressionless and somber in appearance.
6. To wish one well. To wish one ill.
7. A current list of directors is available from the Canadian Governor in Council here.
8. Other researchers repeated the experiment with greater accuracy.
9. 大家非得盘算来盘算去省着点吃粮食。 (Everyone had no choice but to plan carefully and eat their grain sparingly.)
10. 拨打幺幺零谎报险情或者警情等方式,搞恶作剧寻求刺激的。 (Those who dial 110 to falsely report emergencies or police incidents, playing pranks for thrills.)
11. 开云集团支付了几千万欧元违约金,给霞飞诺。 (Kering paid tens of millions of euros in penalties to Safilo.)
12. 兔宝宝经常去泉眼井边照影子,在这个镜子前做各种动作,很开心。 (Baby Rabbit often went to the spring well to look at its reflection, happily striking all kinds of poses in front of this mirror.)
Emotional Speech Synthesis
By operating in the waveform latent space, LongCat-AudioDiT preserves fine-grained acoustic details
including emotional prosody. Below are samples where the model clones diverse emotional styles
(calm, gentle, confident, angry) from the prompt audio.
Ablation Studies
We conduct comprehensive ablation studies to validate three core research questions:
(RQ1) Wav-VAE vs. Mel-VAE, (RQ2) CFG vs. APG, and (RQ3) the impact of training-inference mismatch correction.
RQ1: Waveform Latent (Wav-VAE) vs. Mel-Spectrogram Latent (Mel-VAE)
The Wav-VAE model consistently and significantly outperforms the Mel-VAE baseline across all metrics,
especially in speaker similarity (SIM). Fine-grained acoustic details essential for voice cloning
are easily lost during the cascading conversions (latent → mel → waveform) inherent to the Mel-VAE pipeline.
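For contrast, the baseline's extra hop looks roughly like this (module names are illustrative); each conversion after sampling is lossy, which is where fine-grained speaker detail gets dropped:

```python
def mel_vae_pipeline(mel_vae, vocoder, sampled_latent):
    # Baseline: two lossy hops after diffusion sampling.
    mel = mel_vae.decode(sampled_latent)   # latent -> mel-spectrogram
    return vocoder(mel)                    # mel -> waveform (separate model)

def wav_vae_pipeline(wav_vae, sampled_latent):
    # Ours: one hop straight to audio, with no vocoder to lose detail in.
    return wav_vae.decode(sampled_latent)  # latent -> waveform
```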
Each row pairs the target text with two audio clips: one from the Mel Latent baseline and one from our Waveform Latent model.
1. 报警人说,绑匪威胁不准报案,否则将撕票。 (The caller said the kidnappers had threatened them not to go to the police, or they would kill the hostage.)
2. 要求赔偿粉尘污染,对他们造成的精神伤害。 (Demanding compensation for the emotional distress the dust pollution caused them.)
3. 在今年二月份,做客央视遇见大咖节目时。 (In February this year, while appearing as a guest on CCTV's Meet the Big Shots program.)
4. In some species, females are also capable of stridulation.
5. As he stated: 'I have lost two sisters and you offer me twenty servants.'
6. The rest of the money can be based on playing time.
RQ2: Classifier-Free Guidance (CFG) vs. Adaptive Projection Guidance (APG)
A large CFG scale induces oversaturation artifacts in diffusion-based TTS. APG decomposes the guidance
residual into parallel and orthogonal components and selectively dampens the parallel term that causes oversaturation,
yielding superior UTMOS and DNSMOS scores while maintaining comparable intelligibility and speaker similarity.
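The sketch below illustrates this projection step. The tensor layout (batch, frames, channels) and parameter names are assumptions for illustration; only the decomposition itself follows the description above:

```python
import torch

def apg_guidance(v_cond, v_uncond, scale=2.0, eta=0.5):
    # Guidance residual, exactly as in classifier-free guidance.
    diff = v_cond - v_uncond
    # Unit vector along the conditional prediction, one per batch item.
    flat = v_cond.flatten(1)
    unit = (flat / flat.norm(dim=1, keepdim=True).clamp_min(1e-8)).view_as(v_cond)
    # Split the residual into parallel / orthogonal components.
    coeff = (diff * unit).flatten(1).sum(dim=1).view(-1, *([1] * (v_cond.dim() - 1)))
    parallel = coeff * unit
    orthogonal = diff - parallel
    # Dampen only the parallel term, which drives oversaturation.
    # With eta = 1 this reduces to plain CFG.
    return v_cond + (scale - 1.0) * (eta * parallel + orthogonal)
```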
Each row pairs the target text with two audio clips: one generated with CFG and one with APG.
1. 一九九五年,他与西雅图交响乐的总监史瓦兹一起指挥。 (In 1995, he conducted together with Schwarz, the director of the Seattle Symphony.)
2. 雪孩子紧紧把它攥在手里,好像攥着自己的生命一样。 (The Snow Child clutched it tightly in his hand, as if clutching his very life.)
3. 你们俩一个爬到另一个背上,让我摸摸你们的脸吧。 (One of you climb onto the other's back, and let me touch your faces.)
4. Three weeks later, he was feeling a lot better.
5. Mrs. Travis, when I leave my kids in kindergarten, I expect you to supervise them.
RQ3: Training-Inference Mismatch Correction
During training, the prompt latent always lies on its ground-truth noising trajectory, but during
inference the model's velocity predictions for the prompt region are unconstrained, causing the
prompt latent to drift from that trajectory. We fix this mismatch by overwriting the prompt latent
with its ground-truth value at every Euler step, and we also drop the noisy prompt when computing the
unconditional velocity to prevent acoustic information leakage.
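A minimal sketch of the corrected sampler follows. The dit call signature, the rectified-flow convention x_t = (1 - t) * noise + t * data, and the use of plain CFG for the guided update are all illustrative assumptions:

```python
import torch

@torch.no_grad()
def euler_sample(dit, text, null_text, prompt_latent, target_len,
                 steps=32, scale=2.0):
    B, P, D = prompt_latent.shape
    noise = torch.randn(B, P + target_len, D, device=prompt_latent.device)
    x = noise.clone()
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        # Fix 1: pin the prompt region to its ground-truth trajectory at
        # every step instead of letting predicted velocities move it.
        x[:, :P] = (1.0 - t0) * noise[:, :P] + t0 * prompt_latent
        v_cond = dit(x, text, t0)
        # Fix 2: the unconditional branch never sees the noisy prompt, so
        # no acoustic information leaks through the null condition.
        v_uncond = dit(x[:, P:], null_text, t0)
        # Guide only the target region (plain CFG shown for brevity; the
        # APG update from RQ2 slots in at this same point).
        v = v_cond.clone()
        v[:, P:] = v_uncond + scale * (v_cond[:, P:] - v_uncond)
        x = x + (t1 - t0) * v
    return x[:, P:]  # newly generated target latents only
```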
Each row pairs the target text with two audio clips: one sampled with the mismatch present and one with the mismatch corrected.
1. 只有当科技为本地社群创造价值的时候,才真正有意义。 (Technology is truly meaningful only when it creates value for local communities.)
2. 并采取实际行动,去培训快递员这方面的法律意识。 (And take concrete action to train couriers' legal awareness in this area.)
3. 该男子爬起后竟再次返回宾馆,从四楼跳下。 (After getting up, the man actually returned to the hotel and jumped from the fourth floor.)
4. He must be disguised to avoid encounters with thieves.
5. The stained glass offered a hypnotic atmosphere.
6. The wooden shrine is generously proportioned for the three images it houses.
Community
Scan the QR codes below to join the LongCat WeChat group or follow the official WeChat account.