Hybrid Transformers for Music Source Separation

Author(s):

Maviya Mahagami, All India Shri Shivaji Memorial Society Polytechnic; Vivek Deshmukh, All India Shri Shivaji Memorial Society Polytechnic; Himanshu Lohokane, All India Shri Shivaji Memorial Society Polytechnic; Rashi Kacchwah, All India Shri Shivaji Memorial Society Polytechnic; Prof. Krushna Jagtap, All India Shri Shivaji Memorial Society Polytechnic

Keywords:

Music Source Separation, Transformers

Abstract

A central question in Music Source Separation (MSS) is whether long-range contextual information is useful, or whether local acoustic features are sufficient. In several other domains, attention-based Transformers [1] have proven able to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net based on Hybrid Demucs [2], in which the innermost layers are replaced by a cross-domain Transformer Encoder that uses self-attention within each domain and cross-attention across domains. Although it performs poorly when trained only on MUSDB [3], we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of Signal-to-Distortion Ratio (SDR) when given 800 additional training songs. Using sparse attention kernels to extend its receptive field, together with per-source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, reaching 9.20 dB of SDR.
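To give a feel for what the reported SDR figures measure, the following is a minimal sketch of a simplified, whole-signal SDR: the ratio of reference energy to error energy, expressed in decibels. It is an illustration only and is not claimed to be the evaluation procedure behind the numbers in the abstract; MUSDB results are commonly computed with dedicated evaluation tooling that may window and aggregate the signal differently. The function name `sdr_db`, the `eps` stabilizer, and the toy sine signal are all assumptions made for this example.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Hypothetical helper for illustration; `eps` only guards against
    division by zero and log of zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy data (assumed): one second of a 440 Hz sine at 44.1 kHz as the
# "true" source, and an estimate whose amplitude is 10% too low.
reference = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
estimate = [0.9 * r for r in reference]

# The error is 0.1 * reference, so the energy ratio is 1 / 0.01 = 100,
# i.e. about 20 dB.
print(f"SDR: {sdr_db(reference, estimate):.2f} dB")
```

Under this simplified definition, and holding the reference energy fixed, a gain of 0.45 dB corresponds to an error-energy ratio of 10^(-0.045), roughly 0.90, that is, about 10% less residual error energy.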

Other Details

Paper ID: IJSRDV11I120056
Published in: Volume: 11, Issue: 12
Publication Date: 01/03/2024
Page(s): 67-70
