Recent advances in self-supervised representation learning, sequence modeling, and audio synthesis have significantly improved the performance of conditional audio generation. A common approach is to compress the audio signal into a discrete or continuous representation and apply a generative model on top of it. Various studies have explored techniques such as applying vector-quantized variational autoencoders (VQ-VAEs) directly to raw waveforms and training conditional diffusion-based generative models on learned continuous representations.
To address the limitations of existing approaches, researchers from Meta's FAIR team introduced MAGNET, an acronym for Masked Audio Generation using Non-autoregressive Transformers. MAGNET is a novel masked generative sequence modeling technique that operates on a multi-stream representation of the audio signal.
Unlike autoregressive models, MAGNET operates in a non-autoregressive manner, significantly reducing inference time and latency. During training, MAGNET samples a masking rate from a masking scheduler, then masks and predicts spans of input tokens conditioned on the unmasked ones. During inference, it gradually builds the output audio sequence over several decoding steps. The researchers also introduce a novel rescoring method that leverages an external pre-trained model to improve generation quality.
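To make the decoding loop concrete, here is a minimal sketch of iterative masked decoding with a cosine masking schedule, the general mechanism this family of models builds on. It is not the authors' implementation: the `model`, `cond`, and `mask_id` names are hypothetical placeholders, it re-masks individual tokens rather than spans for simplicity, and it omits the rescoring step.

```python
# Minimal sketch of iterative masked decoding (hypothetical names, not the
# paper's code). All positions start masked; each step fills every masked
# position in parallel, then re-masks the least confident predictions
# according to a cosine schedule, so fewer tokens remain masked each step.
import math
import torch

def cosine_mask_ratio(step: int, total_steps: int) -> float:
    # Fraction of positions still masked after decoding step `step`.
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

@torch.no_grad()
def masked_decode(model, cond, seq_len: int, mask_id: int, steps: int = 10):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)                  # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens == mask_id
        tokens = torch.where(masked, pred, tokens)    # fill all masked slots
        n_mask = int(cosine_mask_ratio(step, steps) * seq_len)
        if n_mask == 0:
            break
        # Already-committed tokens stay fixed; re-mask the least confident
        # of the freshly predicted ones for refinement in the next step.
        conf = conf.masked_fill(~masked, float("inf"))
        remask = conf.topk(n_mask, largest=False).indices
        tokens[0, remask[0]] = mask_id
    return tokens
```

Because every step predicts all masked positions at once, the number of model calls is a small constant (here 10) rather than one per token, which is where the latency savings over autoregressive decoding come from.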
They also explore a hybrid version of MAGNET that combines autoregressive and non-autoregressive models. In this hybrid approach, the beginning of the token sequence is generated autoregressively while the rest of the sequence is decoded in parallel. Previous studies have proposed similar non-autoregressive modeling techniques for machine translation and image generation tasks; MAGNET is unique in applying them to audio generation, which exploits the full frequency spectrum of the signal.
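A hedged sketch of that hybrid scheme, reusing the `cosine_mask_ratio` helper and the assumptions from the previous sketch (the `ar_model`, `nar_model`, and `bos_id` names are placeholders, not the authors' API):

```python
# Hypothetical hybrid decoding: an autoregressive model commits a short
# prefix token by token, then a non-autoregressive model fills the rest
# in parallel with iterative re-masking, keeping the prefix frozen.
@torch.no_grad()
def hybrid_decode(ar_model, nar_model, cond, seq_len: int, mask_id: int,
                  bos_id: int, prefix_len: int = 64, steps: int = 10):
    # 1) Autoregressive prefix: slow, one token per model call, but it
    #    anchors the overall structure of the sequence.
    prefix = [bos_id]
    for _ in range(prefix_len):
        ctx = torch.tensor([prefix], dtype=torch.long)
        prefix.append(int(ar_model(ctx, cond)[0, -1].argmax()))
    # 2) Non-autoregressive suffix: start fully masked and refine in a
    #    constant number of parallel steps.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    tokens[0, :prefix_len] = torch.tensor(prefix[1:])
    for step in range(steps):
        conf, pred = nar_model(tokens, cond).softmax(-1).max(-1)
        masked = tokens == mask_id
        tokens = torch.where(masked, pred, tokens)
        n_mask = int(cosine_mask_ratio(step, steps) * (seq_len - prefix_len))
        if n_mask == 0:
            break
        conf = conf.masked_fill(~masked, float("inf"))  # prefix never re-masked
        remask = conf.topk(n_mask, largest=False).indices
        tokens[0, remask[0]] = mask_id
    return tokens
```

The design trade-off is visible in the structure: the prefix costs `prefix_len` sequential model calls, while the entire suffix costs only `steps` parallel calls, so quality-sensitive early context is paid for once and the bulk of the sequence stays cheap.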
The researchers evaluate MAGNET on text-to-music and text-to-audio generation tasks, report objective metrics, and conduct human studies. The results show that MAGNET achieves quality comparable to the autoregressive baseline while significantly reducing latency. Furthermore, they analyze the trade-offs between autoregressive and non-autoregressive models and provide insights into their performance characteristics. Their contributions include the introduction of MAGNET as a new non-autoregressive model for audio generation, the use of an external pre-trained model for rescoring, and the exploration of a hybrid approach that combines autoregressive and non-autoregressive modeling.
Furthermore, their study contributes to the exploration of non-autoregressive modeling techniques in audio generation and provides insights into their effectiveness and applicability in real-world scenarios. By significantly reducing latency without sacrificing generation quality, MAGNET expands the possibilities for interactive applications such as music generation and editing in digital audio workstations (DAWs).
Moreover, the proposed rescoring method improves the overall quality of the generated audio, further strengthening the practicality of the approach. Through rigorous evaluation and analysis, the researchers provide a comprehensive understanding of the trade-offs between autoregressive and non-autoregressive models, paving the way for future advances in efficient, high-quality audio generation systems.
Check out the paper and GitHub. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He holds a master's degree in Physics from the Indian Institute of Technology, Kharagpur. He believes that understanding things at the fundamental level leads to new discoveries and advances in technology, and he is passionate about using tools such as mathematical models, ML models, and AI to understand nature at its core.