Unsupervised Multi-channel Speech Dereverberation via Diffusion

2025/02 - 2025/06, accepted by WASPAA 2025

We consider the problem of multi-channel single-speaker blind dereverberation, where multi-channel mixtures are used to recover the clean anechoic speech. To solve this problem, we propose USD-DPS, Unsupervised Speech Dereverberation via Diffusion Posterior Sampling. USD-DPS uses an unconditional clean speech diffusion model as a strong prior to solve the problem by posterior sampling. At each diffusion sampling step, we estimate all microphone channels’ room impulse responses (RIRs), which are further used to enforce a multi-channel mixture consistency constraint for diffusion guidance. For multi-channel RIR estimation, we estimate reference-channel RIR by optimizing RIR parameters of a sub-band RIR signal model, with the Adam optimizer. We estimate non-reference channels’ RIRs analytically using forward convolutive prediction (FCP). We found that this combination provides a good balance between sampling efficiency and RIR prior modeling, which shows superior performance among unsupervised dereverberation approaches. An audio demo page is provided in this URL.

USD-DPS Algorithm Diagram.

Introduction

Why and what is dereverberation
  • In real rooms, microphone arrays capture multi-channel reverberant mixtures which degrade performance in automatic speech recognition and hearing aids \(y=\left\{y_c \mid y_c=h_c * x_1+n_c, 1 \leq c \leq C\right\}\)
  • Dereverberation aims to invert this process and recover the clean anechoic speech from mixtures.
    Strengths of our USD-DPS method
  • Unsupervised: no paired clean/reverb data needed
  • Powerful priori: Leverages a pre-trained clean-speech diffusion model
  • Hybrid RIR estimation: efficient analytic FCP for non-ref channels + optimized ref-channel RIR, leveraging multi-channel information with a small demand of computing

Backgrounds

Score-Based Diffusion

Score-based models learn $\nabla_x \log p(x)$ via a DNN, and sample by solving the probabilistic flow ODE: \(d x_1^\tau = -\,\sigma(\tau)\,\nabla_{x_1^\tau}\log p\!\left(x_1^\tau\right)\, d\tau\)

Diffusion Posterior Sampling (DPS)

DPS solves inverse problems by sampling from the posterior $p!\left(x_1 \mid y\right)$ using: \(d x_1^\tau = -\,\sigma(\tau)\,\nabla_{x_1^\tau}\log p\!\left(x_1^\tau \mid y\right)\, d\tau\)

With Bayes’ rule and Tweedie’s formula, the posterior score can be decomposed/approximated as: \(\nabla_{x_1^\tau}\log p\!\left(y \mid x_1^\tau\right) = \nabla_{x_1^\tau}\log p\!\left(y \mid x_1^\tau, h\right) \simeq \nabla_{x_1^\tau}\log p\!\left(y \mid \hat{x}_1^0, h\right)\)

BUDDy and RIR Model

Each frequency band undergoes independent 1D convolution: \(Y_{m,k}=H_{m,k} * X_{m,k}=\sum_{n=0}^{N_h-1} H_{n,k}\, X_{m-n,k}\)

RIR is parameterized as $\psi=\bigl{\Phi,\,(w_b,\alpha_b)_{b=1}^B\bigr}$

  • Magnitude response $A \in \mathbb{R}^{N_h \times K}: \quad A’_{n,b}=w_b\,e^{-\alpha_b n}, A=\exp!\bigl(\operatorname{lerp}(\log A’)\bigr)$
  • Phase $\Phi \in \mathbb{R}^{N_h \times K}$

RIR model is optimized by: \(\hat{\psi} =\underset{\psi}{\operatorname{argmin}} \;\Bigl\| S_{\mathrm{comp}}(y_1)-S_{\mathrm{comp}}\!\bigl(\mathcal{A}_\psi(\hat{x}_1^0)\bigr) \Bigr\|_2^2 + R(\psi)\)

Forward Convolutive Prediction (FCP)

For each non-reference channel, estimate a forward convolutive filter: \(\begin{aligned} \hat{H}^c &=\mathrm{FCP}\!\left(Y^c,\hat{X}\right) =\underset{H^c}{\operatorname{argmin}} \sum_{m,k} \frac{1}{\hat{\lambda}_{m,k}^c} \left| Y_{m,k}^c-\sum_{n=0}^{N_h'-1} H_{n,k}^c\,\hat{X}_{m-n,k} \right|^2 \\ \hat{\lambda}_{m,k}^c &=\frac{1}{C}\sum_{c=1}^C \left|Y_{m,k}^c\right|^2 +\epsilon \cdot \max_{m,k}\left[ \frac{1}{C}\sum_{c=1}^C \left|Y_{m,k}^c\right|^2 \right] \end{aligned}\)

Proposed Method: USD-DPS

We adopt a hybrid likelihood score estimation for reference and non-reference channels. At each diffusion step, we guide the diffusion model by optimizing RIR/FCP models.

Reference Channel: RIR Model

During each diffusion step, a parametric RIR model $\hat{\psi}_1$ for the reference channel is optimized by Adam. The reconstruction loss is: \(\mathcal{L}_{\text{ref}} =\left\| S_{\text{comp}}(y_1) - S_{\text{comp}}\!\bigl(A_{\hat{\psi}_1}(\hat{x}_1^0)\bigr)\right\|_2^2\)

Non-reference Channels: FCP Models

We estimate FCP models $\widehat{H}^{2:c}$ for each pair of estimated clean speech and non-reference channel. The non-reference reconstruction loss is: \(\mathcal{L}_{\text{non-ref}}=\sum_{c=2}^{C}\left\|S_{\text{comp}}(y_c)-S_{\text{comp}}\!\bigl(A_{\hat{H}^c}(\hat{x}_1^0)\bigr)\right\|_2^2\)

Likelihood Score Calculation

We combine the two losses with a tunable coefficient $\lambda’$ and take the derivative w.r.t. $x_1^\tau$ to obtain the likelihood score: \(\nabla_{x_1^\tau}\log p\!\left(y \mid x_1^\tau\right)\simeq\zeta(\tau)\,\nabla_{x_1^\tau}\!\Biggl(\left\|S_{\text{comp}}(y_1)-S_{\text{comp}}\!\bigl(A_{\hat{\psi}_1}(\hat{x}_1^0)\bigr)\right\|_2^2 + \lambda'\sum_{c=2}^{C}\left\|S_{\text{comp}}(y_c)-S_{\text{comp}}\!\bigl(A_{\hat{H}^c}(\hat{x}_1^0)\bigr)\right\|_2^2\Biggr)\)

Experiment Results (SI-SDR, PESQ, ESTOI)

USD-DPS Experiment Results.

References