We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech
(TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that
rely on intermediate acoustic representations such as mel-spectrograms, LongCat-AudioDiT
operates directly within the waveform latent space. This approach
effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring
only a waveform variational autoencoder (Wav-VAE) and a diffusion Transformer (DiT) backbone. Furthermore,
we introduce two critical improvements to the inference process: first, we identify and
rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free
guidance (CFG) with adaptive projection guidance (APG) to elevate generation quality. Experimental
results demonstrate that, without relying on complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning
performance on the Seed benchmark while maintaining competitive intelligibility. Specifically,
our largest variant, LongCat-AudioDiT-3.5B (trained on 1M hours of Chinese and English speech), outperforms the previous SOTA model
Seed-DiT, improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from
0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies, we reveal the counterintuitive
finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance.
Architecture Overview
LongCat-AudioDiT consists of only two streamlined components: a Wav-VAE and a Diffusion Transformer (DiT).
By generating waveform latents directly, it avoids the compounding errors from predicting and converting
intermediate representations (e.g., mel-spectrograms) into waveforms, completely bypassing auxiliary vocoders.
Figure 1: Overview of LongCat-AudioDiT. Our architecture generates continuous waveform latents directly,
thereby avoiding the compounding errors that arise when predicting and subsequently converting
intermediate representations (e.g., mel-spectrograms) into waveforms.
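To make the data flow concrete, here is a minimal sketch of the two-component pipeline in PyTorch style. All module names and interfaces (wav_vae.encode/decode, dit_sampler) are illustrative assumptions, not the released API:

```python
import torch

@torch.no_grad()
def synthesize(wav_vae, dit_sampler, text_tokens, prompt_wave):
    # Encode the prompt waveform into the continuous latent space.
    prompt_latent = wav_vae.encode(prompt_wave)
    # The DiT samples target latents directly in this waveform latent
    # space, conditioned on the text and the prompt latents.
    target_latent = dit_sampler(text_tokens, prompt_latent)
    # A single VAE decode returns audio: no mel-spectrogram, no vocoder.
    return wav_vae.decode(target_latent)
```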
Main Results
Objective evaluation on the Seed benchmark. LongCat-AudioDiT establishes new SOTA speaker similarity (SIM)
on Seed-ZH and Seed-Hard, while securing competitive intelligibility — all with an end-to-end architecture and a single training stage.
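For reference, SIM on this benchmark is typically computed as the cosine similarity between speaker-verification embeddings of the reference prompt and the synthesized audio. The sketch below assumes a generic speaker encoder; embed_model is a placeholder (e.g., a WavLM-based verifier), not a specific released checkpoint:

```python
import torch.nn.functional as F

def speaker_similarity(embed_model, prompt_wave, generated_wave):
    # Embed both clips with the same speaker-verification encoder.
    e_ref = embed_model(prompt_wave)
    e_gen = embed_model(generated_wave)
    # SIM is the cosine similarity of the two embeddings (higher is better).
    return F.cosine_similarity(e_ref, e_gen, dim=-1)
```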
Zero-Shot Voice Cloning Samples
12 random samples from the Seed benchmark. Each row shows the reference prompt audio used for zero-shot voice cloning,
the target text, and the audio synthesized by LongCat-AudioDiT-3.5B.
5. These drawings are usually outline drawings that are quite expressionless and somber in appearance.
6. To wish one well. To wish one ill.
7. A current list of directors is available from the Canadian Governor in Council here.
8. Other researchers repeated the experiment with greater accuracy.
9. 大家非得盘算来盘算去省着点吃粮食。 (Everyone had no choice but to plan carefully and eat their grain sparingly.)
10. 拨打幺幺零谎报险情或者警情等方式,搞恶作剧寻求刺激的。 (Those who dial 110 to falsely report emergencies or police incidents, playing pranks for thrills.)
11. 开云集团支付了几千万欧元违约金,给霞飞诺。 (Kering paid tens of millions of euros in penalties to Safilo.)
12. 兔宝宝经常去泉眼井边照影子,在这个镜子前做各种动作,很开心。 (Baby Rabbit often went to the spring well to look at its reflection, happily striking all kinds of poses in front of this mirror.)
Emotional Speech Synthesis
By operating in the waveform latent space, LongCat-AudioDiT preserves fine-grained acoustic details
including emotional prosody. Below are samples where the model clones diverse emotional styles
(calm, gentle, confident, angry) from the prompt audio.
Ablation Studies
We conduct comprehensive ablation studies to validate three core research questions:
(RQ1) Wav-VAE vs. Mel-VAE, (RQ2) CFG vs. APG, and (RQ3) the impact of training-inference mismatch correction.
RQ1: Waveform Latent (Wav-VAE) vs. Mel-Spectrogram Latent (Mel-VAE)
The Wav-VAE model consistently and significantly outperforms the Mel-VAE baseline across all metrics,
especially in speaker similarity (SIM). Fine-grained acoustic details essential for voice cloning
are easily lost during the cascading conversions (latent → mel → waveform) inherent to the Mel-VAE pipeline.
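For contrast, the baseline's extra hop looks roughly like this (module names are illustrative); each conversion after sampling is lossy, which is where fine-grained speaker detail gets dropped:

```python
def mel_vae_pipeline(mel_vae, vocoder, sampled_latent):
    # Baseline: two lossy hops after diffusion sampling.
    mel = mel_vae.decode(sampled_latent)   # latent -> mel-spectrogram
    return vocoder(mel)                    # mel -> waveform (separate model)

def wav_vae_pipeline(wav_vae, sampled_latent):
    # Ours: one hop straight to audio, with no vocoder to lose detail in.
    return wav_vae.decode(sampled_latent)  # latent -> waveform
```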
Each row pairs the target text with two audio clips: one from the Mel Latent baseline and one from our Waveform Latent model.
1. 报警人说,绑匪威胁不准报案,否则将撕票。 (The caller said the kidnappers had threatened them not to go to the police, or they would kill the hostage.)
2. 要求赔偿粉尘污染,对他们造成的精神伤害。 (Demanding compensation for the emotional distress the dust pollution caused them.)
3. 在今年二月份,做客央视遇见大咖节目时。 (In February this year, while appearing as a guest on CCTV's Meet the Big Shots program.)
4. In some species, females are also capable of stridulation.
5. As he stated: 'I have lost two sisters and you offer me twenty servants.'
6. The rest of the money can be based on playing time.
RQ2: Classifier-Free Guidance (CFG) vs. Adaptive Projection Guidance (APG)
A large CFG scale induces oversaturation artifacts in diffusion-based TTS. APG decomposes the guidance
residual into parallel and orthogonal components and selectively dampens the parallel term that causes oversaturation,
yielding superior UTMOS and DNSMOS scores while maintaining comparable intelligibility and speaker similarity.
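The sketch below illustrates this projection step. The tensor layout (batch, frames, channels) and parameter names are assumptions for illustration; only the decomposition itself follows the description above:

```python
import torch

def apg_guidance(v_cond, v_uncond, scale=2.0, eta=0.5):
    # Guidance residual, exactly as in classifier-free guidance.
    diff = v_cond - v_uncond
    # Unit vector along the conditional prediction, one per batch item.
    flat = v_cond.flatten(1)
    unit = (flat / flat.norm(dim=1, keepdim=True).clamp_min(1e-8)).view_as(v_cond)
    # Split the residual into parallel / orthogonal components.
    coeff = (diff * unit).flatten(1).sum(dim=1).view(-1, *([1] * (v_cond.dim() - 1)))
    parallel = coeff * unit
    orthogonal = diff - parallel
    # Dampen only the parallel term, which drives oversaturation.
    # With eta = 1 this reduces to plain CFG.
    return v_cond + (scale - 1.0) * (eta * parallel + orthogonal)
```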
Each row pairs the target text with two audio clips: one generated with CFG and one with APG.
1. 一九九五年,他与西雅图交响乐的总监史瓦兹一起指挥。 (In 1995, he conducted together with Schwarz, the director of the Seattle Symphony.)
2. 雪孩子紧紧把它攥在手里,好像攥着自己的生命一样。 (The Snow Child clutched it tightly in his hand, as if clutching his very life.)
3. 你们俩一个爬到另一个背上,让我摸摸你们的脸吧。 (One of you climb onto the other's back, and let me touch your faces.)
4. Three weeks later, he was feeling a lot better.
5. Mrs. Travis, when I leave my kids in kindergarten, I expect you to supervise them.
RQ3: Training-Inference Mismatch Correction
During training, the prompt latent always lies on its ground-truth noising trajectory, but during
inference the model's velocity predictions for the prompt region are unconstrained, causing the
prompt latent to drift from that trajectory. We fix this mismatch by overwriting the prompt latent
with its ground-truth value at every Euler step, and we also drop the noisy prompt when computing the
unconditional velocity to prevent acoustic information leakage.
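A minimal sketch of the corrected sampler follows. The dit call signature, the rectified-flow convention x_t = (1 - t) * noise + t * data, and the use of plain CFG for the guided update are all illustrative assumptions:

```python
import torch

@torch.no_grad()
def euler_sample(dit, text, null_text, prompt_latent, target_len,
                 steps=32, scale=2.0):
    B, P, D = prompt_latent.shape
    noise = torch.randn(B, P + target_len, D, device=prompt_latent.device)
    x = noise.clone()
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        # Fix 1: pin the prompt region to its ground-truth trajectory at
        # every step instead of letting predicted velocities move it.
        x[:, :P] = (1.0 - t0) * noise[:, :P] + t0 * prompt_latent
        v_cond = dit(x, text, t0)
        # Fix 2: the unconditional branch never sees the noisy prompt, so
        # no acoustic information leaks through the null condition.
        v_uncond = dit(x[:, P:], null_text, t0)
        # Guide only the target region (plain CFG shown for brevity; the
        # APG update from RQ2 slots in at this same point).
        v = v_cond.clone()
        v[:, P:] = v_uncond + scale * (v_cond[:, P:] - v_uncond)
        x = x + (t1 - t0) * v
    return x[:, P:]  # newly generated target latents only
```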
Each row pairs the target text with two audio clips: one sampled with the mismatch present and one with the mismatch corrected.
1. 只有当科技为本地社群创造价值的时候,才真正有意义。 (Technology is truly meaningful only when it creates value for local communities.)
2. 并采取实际行动,去培训快递员这方面的法律意识。 (And take concrete action to train couriers' legal awareness in this area.)
3. 该男子爬起后竟再次返回宾馆,从四楼跳下。 (After getting up, the man actually returned to the hotel and jumped from the fourth floor.)
4. He must be disguised to avoid encounters with thieves.
5. The stained glass offered a hypnotic atmosphere.
6. The wooden shrine is generously proportioned for the three images it houses.
Community
Scan the QR codes below to join the LongCat WeChat group or follow the official WeChat account.