BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec

[Paper] [Code]

Detai Xin¹, Xu Tan², Shinnosuke Takamichi³, Hiroshi Saruwatari¹

¹The University of Tokyo, ²Microsoft, ³Keio University

Abstract: We present BigCodec, a low-bitrate neural speech codec. While recent neural speech codecs have shown impressive progress, their performance deteriorates significantly at low bitrates (around 1 kbps). Although a low bitrate inherently restricts performance, other factors, such as model capacity, also hinder further improvements. To address this problem, we scale the model up to 159M parameters, more than 10 times larger than popular codecs, which have about 10M parameters. In addition, we integrate sequential models into the traditional convolutional architecture to better capture temporal dependencies, and adopt low-dimensional vector quantization to ensure high code utilization. Comprehensive objective and subjective evaluations show that BigCodec, at a bitrate of 1.04 kbps, significantly outperforms several existing low-bitrate codecs. Furthermore, BigCodec achieves objective performance comparable to that of popular codecs operating at 4-6 times higher bitrates, and even delivers better subjective perceptual quality than the ground truth.
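As a back-of-the-envelope check of the 1.04 kbps figure, the sketch below assumes a single 8192-entry codebook at an 80 Hz frame rate (16 kHz audio with 200x downsampling); these values are our reading of the paper's configuration, so please refer to the paper for the exact numbers.

```python
# Bitrate sanity check (assumed configuration: 16 kHz audio, 200x downsampling,
# a single VQ codebook with 8192 entries; see the paper for exact values).
import math

sample_rate = 16_000        # Hz
downsample_factor = 200     # total encoder downsampling
codebook_size = 8192        # entries in the single codebook

frame_rate = sample_rate / downsample_factor   # 80 code indices per second
bits_per_frame = math.log2(codebook_size)      # 13 bits per index
print(f"{frame_rate * bits_per_frame / 1000:.2f} kbps")  # -> 1.04 kbps
```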

This page is for research demonstration purposes only.

Overview

Fig. 1: Architecture of the VQ-VAE generator of BigCodec. Please refer to the paper for the definitions of the symbols.
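The paper attributes the high code utilization to low-dimensional vector quantization, where encoder features are projected into a low-dimensional space before the nearest-neighbor codebook lookup. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the module name and all dimensions (e.g. code_dim=8) are illustrative assumptions.

```python
# Minimal sketch of low-dimensional vector quantization (illustrative only;
# names and dimensions are assumptions, not the official implementation).
import torch
import torch.nn as nn

class LowDimVQ(nn.Module):
    def __init__(self, hidden_dim=1024, code_dim=8, codebook_size=8192):
        super().__init__()
        self.down = nn.Linear(hidden_dim, code_dim)  # project to low-dim space
        self.up = nn.Linear(code_dim, hidden_dim)    # project back after lookup
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, x):  # x: (batch, frames, hidden_dim)
        z = self.down(x)   # (batch, frames, code_dim)
        # Nearest-neighbor search against the codebook in the low-dim space.
        dist = torch.cdist(z.reshape(-1, z.size(-1)), self.codebook.weight)
        idx = dist.argmin(dim=-1).view(z.shape[:-1])  # (batch, frames) code indices
        q = self.codebook(idx)                        # quantized low-dim vectors
        q = z + (q - z).detach()                      # straight-through estimator
        return self.up(q), idx
```

Quantizing in a low-dimensional space makes every codebook entry easier to reach from the encoder output, which is one common explanation for improved codebook utilization.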

Main results

TABLE 1: Main results of BigCodec on the LibriSpeech test set with 2620 utterances. Bold indicates the best score with p < 1e-3 compared to previous low-bitrate codecs. For MUSHRA, both the mean scores and the 95% confidence intervals are reported. The inter-rater agreement of the MUSHRA test, measured by ICC(2), is 0.73.
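For readers who want to reproduce the reported statistics, a 95% confidence interval on a mean MUSHRA score can be computed as in the generic sketch below; the ratings array is made up for illustration, and this is not the paper's analysis script.

```python
# Mean and 95% confidence interval for MUSHRA ratings (illustrative data only).
import numpy as np
from scipy import stats

ratings = np.array([78, 85, 90, 72, 88, 81, 95, 69, 84, 77])  # fake listener scores

mean = ratings.mean()
sem = stats.sem(ratings)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(ratings) - 1, loc=mean, scale=sem)
print(f"MUSHRA: {mean:.1f} (95% CI [{ci_low:.1f}, {ci_high:.1f}])")
```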

Audio samples

We show 10 audio samples of speech reconstructed by BigCodec and other baselines.
All samples are randomly selected from the LibriSpeech test set.
Note that these samples were also used in the MUSHRA test, so you can rate them yourself and compare with the MUSHRA results.

We indicate the bitrate (bps) of each codec.

GT BigCodec (1.04k) DAC-4k TF-Codec (1.5k) EnCodec-6k LLM-Codec (0.74k) EnCodec-1.5k DAC-1k

Unseen languages

TABLE 2: Multilingual evaluation results of BigCodec on the MLS test set with 700 utterances from 7 out-of-distribution (OOD) languages. Note that BigCodec is the only codec that is trained on a monolingual (English) corpus. Bold indicates the best score with p < 1e-3 compared to previous low-bitrate codecs.

Audio samples

For each unseen language, we show one sample generated by BigCodec and other baselines.
All samples are randomly selected from the Multilingual LibriSpeech (MLS) test set.
We indicate the bitrate (bps) of each codec.

You may find that the objective performance of BigCodec on unseen languages is not as good as on English. However, note that (1) BigCodec is the only codec in the table that is trained on English only, and (2) consistent with the significantly higher MUSHRA score of BigCodec in the main results (Table 1), we find that the perceptual quality of BigCodec is also better than that of the other codecs on unseen languages, even though its objective performance is lower than that of EnCodec-6k or DAC-4k. Since this observation is not formally verified in the paper, we encourage you to listen to the samples yourself.

Language GT BigCodec (1.04k) DAC-4k TF-Codec (1.5k) EnCodec-6k LLM-Codec (0.74k) EnCodec-1.5k DAC-1k
Polish
Portuguese
Italian
Spanish
French
Dutch
German

Ablation studies

TABLE 3: Results of ablation studies on the LibriSpeech test set with 2620 utterances. Bold indicates the best scores with p < 1e-3.
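The p < 1e-3 threshold suggests a per-utterance paired test between systems; the sketch below shows one common choice (a paired t-test over per-utterance scores) on synthetic data, and is our assumption rather than the paper's exact procedure.

```python
# Paired significance test between two codecs' per-utterance scores
# (synthetic data; the paper may use a different test).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores_a = rng.normal(3.8, 0.3, size=2620)              # codec A, one score per utterance
scores_b = scores_a - rng.normal(0.1, 0.05, size=2620)  # codec B, slightly worse

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)   # paired t-test
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, significant at 1e-3: {p_value < 1e-3}")
```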

Audio samples

We show 10 audio samples of speech reconstructed by BigCodec and BigCodec-base.
All samples are randomly selected from the LibriSpeech test set.
We indicate the model size (without the discriminator) of each model.
Note that the results here cover only part of the ablation studies in the paper.

GT BigCodec (159M) BigCodec-base (17M)