The rapid growth of large language models (LLMs) has highlighted the importance of discrete speech tokenization for injecting speech into these models. However, discretization discards information, degrading overall performance. To address this issue, we introduce RepCodec, a novel speech representation codec that improves the quality of discrete speech tokens for semantic speech tokenization.

Unlike traditional audio codecs that reconstruct raw audio, RepCodec learns a vector quantization codebook by reconstructing speech representations from speech encoders such as HuBERT or data2vec. Together, the speech encoder, codec encoder, and vector quantization codebook form a pipeline that converts speech waveforms into semantic tokens. Through extensive experiments, we demonstrate that RepCodec surpasses the widely used k-means clustering approach in both speech understanding and generation. This advantage holds across different speech encoders and languages, demonstrating the robustness of RepCodec.
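The final quantization step of such a pipeline can be sketched as follows: each frame-level representation is mapped to the index of its nearest codebook vector, and that sequence of indices is the token stream. This is a minimal illustration with random stand-ins, not the actual RepCodec implementation; the shapes (50 frames, 64-dimensional representations, a 128-entry codebook) are arbitrary assumptions, and in practice the representations come from a speech encoder and the codebook is learned via the reconstruction objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 50 frames of 64-dim encoder representations
# and a 128-entry learned codebook (both random here for illustration).
reps = rng.standard_normal((50, 64))
codebook = rng.standard_normal((128, 64))

def quantize(representations, codebook):
    """Map each frame to the index of its nearest codebook vector (L2 distance)."""
    # Broadcast to (T, K, D), then sum squared differences over D -> (T, K).
    d2 = ((representations[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # (T,) token indices

tokens = quantize(reps, codebook)
print(tokens.shape)  # one discrete token per frame
```

The resulting index sequence is what downstream language models consume in place of the continuous representations.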

We believe that RepCodec can greatly facilitate research on large language models in the field of speech processing.