Bilingual Dual-Head Deep Model for Parkinson's Disease Detection from Speech


Moreno La Quatra1, Juan Rafael Orozco-Arroyave2, Sabato Marco Siniscalchi3,4
1Kore University of Enna, Italy 2University of Antioquia, Colombia 3UniPA, Italy 4NTNU, Norway
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025

Motivation

Parkinson's disease detection from speech offers a non-invasive diagnostic approach.

  • PD affects speech production across languages
  • Non-invasive speech-based detection is promising
  • But models struggle with cross-language generalization
  • Performance drops dramatically when testing on unseen languages

Cross-Language Challenge

Monolingual models learn language-specific rather than disease-specific features
Single-Language Training
  • High in-language accuracy (83-90%)
  • Poor cross-language performance (24-65%)
  • Features tied to linguistic context
Naive Multi-Language
  • Moderate performance (75-78%)
  • Confusion: language vs disease markers
  • Suboptimal feature extraction


Proposed Approach

A novel model architecture specialized for cross-language Parkinson's disease detection

Key: Task-specific processing paths that capture disease markers independent of language

Dual-Head Architecture

Specialized Processing Routes

  • One head for DDK (diadochokinetic) tasks
  • One head for continuous speech
  • Shared backbone for common features

Why This Works

  • Different speech tasks reveal unique PD markers
  • Task-specific heads focus on relevant patterns
  • Backbone learns language-invariant features
[Figure: Dual-head architecture]
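Below is a minimal PyTorch sketch of the routing idea: a shared backbone feeds two task-specific heads, and each utterance is classified by the head matching its speech task. Class names, dimensions, and the simple mean pooling are illustrative assumptions, not the authors' implementation (the talk describes attention pooling, sketched later).

```python
import torch
import torch.nn as nn

class DualHeadPDModel(nn.Module):
    """Sketch: shared backbone with one classification head per speech task."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 768):
        super().__init__()
        self.backbone = backbone  # shared encoder, e.g. an SSL speech model
        # One head per speech task (HC vs PD logits).
        self.ddk_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))
        self.speech_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, x: torch.Tensor, is_ddk: torch.Tensor) -> torch.Tensor:
        # Backbone output assumed (batch, frames, feat_dim); pool to utterance level.
        feats = self.backbone(x).mean(dim=1)
        # Route each utterance through the head matching its task.
        return torch.where(is_ddk.unsqueeze(-1), self.ddk_head(feats), self.speech_head(feats))
```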

Adaptive Layers for Cross-Language Learning

Designed to reduce language-specific variations and focus on disease-specific features.

  1. Normalize input features to zero mean and unit variance:

     x̂ = (x - μ) / σ

  2. Apply language-specific modulation:

     y = γ(e) ⊙ x̂ + β(e)

  3. Key parameters:

    • γ: language-specific scaling factor
    • β: language-specific shift parameter
    • e: learnable language embedding
[Figure: Adaptive layers]
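A minimal sketch of such an adaptive layer, assuming FiLM-style modulation conditioned on a learnable language embedding; layer names and dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LanguageAdaptiveLayer(nn.Module):
    """Sketch: normalize features, then scale/shift them per language."""
    def __init__(self, feat_dim: int, num_languages: int = 2, emb_dim: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim, elementwise_affine=False)  # zero mean, unit variance
        self.lang_emb = nn.Embedding(num_languages, emb_dim)          # learnable embedding e
        self.to_gamma = nn.Linear(emb_dim, feat_dim)                  # scaling γ(e)
        self.to_beta = nn.Linear(emb_dim, feat_dim)                   # shift β(e)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim); lang_id: (batch,) integer language index
        e = self.lang_emb(lang_id)
        gamma = self.to_gamma(e).unsqueeze(1)  # broadcast over frames
        beta = self.to_beta(e).unsqueeze(1)
        return gamma * self.norm(x) + beta
```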

Bottleneck Layers

  • Compress and then expand the feature space (see the sketch below)
  • Help the model focus on key speech characteristics
  • Filter out irrelevant information
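A minimal sketch of such a bottleneck block, interpreting the compress-expand-sigmoid pattern as feature gating; kernel sizes and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckGate(nn.Module):
    """Sketch: compress, expand, then sigmoid-gate the input features."""
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.compress = nn.Conv1d(feat_dim, hidden_dim, kernel_size=1)  # squeeze
        self.expand = nn.Conv1d(hidden_dim, feat_dim, kernel_size=1)    # restore
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim); Conv1d expects (batch, channels, frames)
        h = x.transpose(1, 2)
        gate = torch.sigmoid(self.expand(self.act(self.compress(h))))
        return (h * gate).transpose(1, 2)  # selectively retain important features
```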

Contrastive Learning

  • Enhance discriminative capabilities
  • Separate HC/PD across languages
  • Target informative examples


Data & Experimental Setup

EWA-DB

Slovak

  • 863 HC, 95 PD patients
  • 5 spontaneous recordings/speaker
  • DDK task ("pataka" repetitions)
  • 70% train, 10% validation, 20% test

PC-GITA

Spanish

  • 140 speakers (70 HC, 70 PD)
  • s-PC-GITA: clean recordings (train/val)
  • e-PC-GITA: real-world conditions (test)
  • Read text, monologue, sentences, DDK
Preprocessing: Voice activity detection, speech dereverberation, and speech denoising - standardized across datasets

In-Language Performance

| Training Setup       | Test Dataset | Accuracy | F1 Score | Sensitivity | Specificity |
|----------------------|--------------|----------|----------|-------------|-------------|
| In-Language          | EWA-DB       | 83.59    | 70.28    | 69.57       | 85.50       |
| In-Language          | PC-GITA      | 90.00    | 90.00    | 91.67       | 88.33       |
| Cross-Language       | EWA-DB       | 23.96    | 23.91    | 89.13       | 15.09       |
| Cross-Language       | PC-GITA      | 65.00    | 63.54    | 45.00       | 85.00       |
| Bilingual (Baseline) | EWA-DB       | 75.78    | 60.74    | 57.97       | 78.21       |
| Bilingual (Baseline) | PC-GITA      | 78.33    | 78.33    | 80.00       | 76.67       |
| Bilingual (Proposed) | EWA-DB       | 84.72    | 69.03    | 56.52       | 88.56       |
| Bilingual (Proposed) | PC-GITA      | 90.83    | 90.83    | 93.33       | 88.33       |
Analysis: Single-language models achieve strong performance when trained and tested on the same language.

Cross-Language Generalization

| Training Setup       | Test Dataset | Accuracy | F1 Score | Sensitivity | Specificity |
|----------------------|--------------|----------|----------|-------------|-------------|
| In-Language          | EWA-DB       | 83.59    | 70.28    | 69.57       | 85.50       |
| In-Language          | PC-GITA      | 90.00    | 90.00    | 91.67       | 88.33       |
| Cross-Language       | EWA-DB       | 23.96    | 23.91    | 89.13       | 15.09       |
| Cross-Language       | PC-GITA      | 65.00    | 63.54    | 45.00       | 85.00       |
| Bilingual (Baseline) | EWA-DB       | 75.78    | 60.74    | 57.97       | 78.21       |
| Bilingual (Baseline) | PC-GITA      | 78.33    | 78.33    | 80.00       | 76.67       |
| Bilingual (Proposed) | EWA-DB       | 84.72    | 69.03    | 56.52       | 88.56       |
| Bilingual (Proposed) | PC-GITA      | 90.83    | 90.83    | 93.33       | 88.33       |
Analysis: Significant performance drop in cross-language scenarios. Monolingual models learn language-specific rather than disease-specific features.

Bilingual Model Performance

| Training Setup       | Test Dataset | Accuracy | F1 Score | Sensitivity | Specificity |
|----------------------|--------------|----------|----------|-------------|-------------|
| In-Language          | EWA-DB       | 83.59    | 70.28    | 69.57       | 85.50       |
| In-Language          | PC-GITA      | 90.00    | 90.00    | 91.67       | 88.33       |
| Cross-Language       | EWA-DB       | 23.96    | 23.91    | 89.13       | 15.09       |
| Cross-Language       | PC-GITA      | 65.00    | 63.54    | 45.00       | 85.00       |
| Bilingual (Baseline) | EWA-DB       | 75.78    | 60.74    | 57.97       | 78.21       |
| Bilingual (Baseline) | PC-GITA      | 78.33    | 78.33    | 80.00       | 76.67       |
| Bilingual (Proposed) | EWA-DB       | 84.72    | 69.03    | 56.52       | 88.56       |
| Bilingual (Proposed) | PC-GITA      | 90.83    | 90.83    | 93.33       | 88.33       |
Analysis: Our dual-head architecture significantly outperforms the baseline bilingual approach, achieving results comparable to or better than in-language models.

Feature Space: Traditional Models

t-SNE of model embedding space without dual-head architecture.
Clear separation by language (EWA-DB vs PC-GITA), not by disease state

Feature Space: Dual-Head Model

t-SNE of model embedding space with dual-head architecture.
Clearer separation by disease state (Healthy vs Parkinson's), with a smoother transition between languages.
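For reference, a minimal sketch of how such a projection can be produced with scikit-learn; the variables (`embeddings`, `language`, `diagnosis`) and t-SNE settings are assumptions, not the plotting code used for the figures.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_space(embeddings, language, diagnosis):
    """Project utterance embeddings (N x D) to 2-D and mark them by labels."""
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    markers = {"EWA-DB": "o", "PC-GITA": "^"}     # shape encodes language
    colors = {"HC": "tab:blue", "PD": "tab:red"}  # color encodes diagnosis
    for lang, diag, (x, y) in zip(language, diagnosis, points):
        plt.scatter(x, y, marker=markers[lang], color=colors[diag], s=10)
    plt.title("t-SNE of utterance embeddings")
    plt.show()
```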

Ablation Study

  • Dual-head architecture is the most critical component (removing it causes a significant drop in PC-GITA performance)
  • Adaptive layers important for both datasets
  • Bottleneck layers more critical for PC-GITA than EWA-DB

Conclusions

  • Novel dual-head architecture for cross-language PD detection
  • Improved generalization across Slovak and Spanish
  • Task-specific processing branches capture disease markers more effectively
  • Potential for broader application across diverse linguistic populations
Future Work:
  • Extend to additional languages
  • Enhance interpretability for medical expert validation
  • Expand application to other speech disorders

Thank You!

Moreno La Quatra
Kore University of Enna, Italy
Juan Rafael Orozco-Arroyave
University of Antioquia, Colombia
Sabato Marco Siniscalchi
UniPA, Italy & NTNU, Norway

Hello everyone. [pause] I'm Moreno La Quatra from Kore University of Enna. Today I'll present our work on a **Bilingual Dual-Head Deep Model for Parkinson's Disease Detection from Speech**. [pause] This is joint work with Juan Rafael Orozco-Arroyave from the University of Antioquia in Colombia and Sabato Marco Siniscalchi from the University of Palermo in Italy.

Let me just introduce the motivation behind our work. [pause] Parkinson's disease affects speech production across languages, making speech-based detection a promising non-invasive approach for early diagnosis. [pause] However, there's a significant challenge: models struggle with **cross-language generalization**. [pause] When tested on languages not seen during training, their performance decreases significantly.

The main issue is that monolingual models learn a mix of **language-specific features** for the task rather than extracting **disease-specific patterns**. [pause] Anticipating the results a little, models trained on a single language achieve high in-language accuracy (between 83% and 90%) but perform poorly when tested on other languages, with accuracy dropping to as low as 24%. [pause] Even standard, naive multi-language approaches only achieve moderate performance, around 75-78%, likely because they cannot efficiently separate language markers from disease markers. [pause] In the end, this leads to suboptimal feature extraction and poor generalization across languages.

To address these challenges, we developed a **dedicated architecture** for cross-language Parkinson's disease detection. [pause] As shown in the image, our approach uses **separate processing branches for different speech tasks**. This helps capture disease markers that are disentangled from language markers. [pause] Our backbone extracts features using **self-supervised learning models**, in our case WavLM, combined with wavelet transforms. This provides robust speech representations that remain effective across different languages. [pause]

The proposed model uses two specialized processing routes. [pause] The first head analyzes DDK tasks, where patients repeat syllables like "pataka", which help assess motor control in speech. [pause] The second head instead processes continuous speech, such as reading tasks or spontaneous speech. [pause] Both routes share a common backbone that learns features common to both tasks. [pause] Each classifier includes an **attention pooling layer** that summarizes the input sequence by weighting frames based on their importance, followed by fully connected layers with ReLU activation. [pause] **Why is this actually effective?** [pause] Well, different speech tasks reveal distinct Parkinson's disease markers. The proposed architecture allows the model to focus on relevant patterns for each speech type, while the shared backbone is guided to learn features that are useful for both tasks. [pause]
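A minimal sketch of the attention pooling step just described, assuming a single learned scoring vector per head; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch: weight frames by learned importance and sum them."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # per-frame importance score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        weights = torch.softmax(self.score(x), dim=1)  # normalize over frames
        return (weights * x).sum(dim=1)                # weighted utterance vector
```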

In our architecture, we implemented adaptive layers specifically to reduce language-specific variations and emphasize disease features. [pause] But what do those adaptive layers actually do? Well, first, we **normalize the features** to zero mean and unit variance, and then we apply **language-specific modulation** using language-specific embeddings that are learned during training. [pause] These embeddings, combined with two parameters, gamma and beta, allow us to scale and shift the normalized features. [pause] This approach helps preserve disease markers while trying to reduce language-specific variations that are not relevant to diagnosis.

As I said previously, our model integrates **self-supervised features from WavLM base** and **wavelet transforms** at the frame level. For each audio, we extract SSL features and apply wavelet decomposition at the same granularity. [pause] Both streams pass through dedicated adaptive layers and are then concatenated to create mixed features. [pause] At this point we introduce CNN-based **bottleneck layers** that compress and expand the feature space. [pause] A first convolutional layer compresses the features to a lower dimension, and a second one expands them back. At the output we use a sigmoid to selectively retain important features. [pause] This forms the input to the dual-head architecture that we discussed earlier. [pause] Finally, in our overall implementation we also use **contrastive learning** to enhance the discriminative capabilities of the model. [pause] We use a loss function that separates healthy controls from Parkinson's patients across languages, with custom mining to identify the most informative examples. [pause]
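A minimal sketch of a contrastive objective with simple hard-pair mining over a batch; the margin and the mining rule are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb: torch.Tensor, labels: torch.Tensor, margin: float = 0.5):
    """Sketch: pull same-diagnosis pairs together, push HC/PD pairs apart."""
    emb = F.normalize(emb, dim=-1)                      # emb: (batch, dim)
    dist = torch.cdist(emb, emb)                        # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-diagnosis mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    # Mine the hardest pairs: farthest positive, closest negative per anchor.
    pos = dist.masked_fill(~same | eye, float("-inf")).amax(dim=1)
    neg = dist.masked_fill(same, float("inf")).amin(dim=1)
    # Anchors without a valid pair contribute zero after clamping.
    return (pos.clamp(min=0) + (margin - neg).clamp(min=0)).mean()
```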

For our experiments, we used two datasets in two different languages. [pause] The EWA-DB dataset in Slovak, with 863 healthy speakers and 95 speakers with Parkinson's disease. These are divided into 70% for training, 10% for validation, and 20% for testing, with no speaker overlap, of course. [pause] For Spanish we use the **PC-GITA** dataset, with 140 speakers, divided into s-PC-GITA, which contains clean recordings and is used for training and validation, and the e-PC-GITA split, which contains recordings in real-world conditions and is used for testing. [pause] In both datasets, we have a mix of speech tasks like reading text, monologue, sentences, and of course DDK tasks. [pause] All data was processed with the same pipeline, including voice activity detection, speech dereverberation, and speech denoising. This ensures a fair setup across datasets. [pause]

This table shows the performance of different models. Highlighted here is the portion showing the models trained and tested on the **same** language. [pause] As you can see, **in-language models** achieve strong performance, with accuracy of 83.6% for EWA-DB and 90% for PC-GITA. [pause] This gives us a strong baseline against which to compare the cross-language and bilingual models: very high performance when both training and testing on the same language. [pause]

Now, let's see what happens when we test models on a different language from their training data. [pause] The **cross-language results** show a significant drop in performance. [pause] When trained on Spanish and tested on Slovak, accuracy drops to just 24%. [pause] The reverse direction performs better at 65%, but is still significantly worse than in-language performance. [pause] This confirms our hypothesis: monolingual models primarily learn disease features that are language-bound, so they are not able to generalize to other languages. [pause]

Here instead, we see the results for the **bilingual models**. [pause] The **baseline bilingual model** is a standard model trained on a mix of both datasets. It does improve over the cross-language models, reaching moderate performance with 75.8% accuracy on EWA-DB and 78.3% on PC-GITA. These are decent results, but still far from the in-language performance. [pause] In contrast, our **proposed dual-head architecture** significantly outperforms the bilingual baseline, reaching 84.7% accuracy on EWA-DB and 90.8% on PC-GITA. [pause] This validates our architectural choices: the model effectively generalizes across languages, achieving results that are comparable to or better than in-language models. [pause]

To better understand why traditional models struggle with cross-language generalization, we visualized the embedding space of a model without our dual-head architecture. [pause] This model is trained on the mixed dataset but lacks the dual-head architecture. [pause] Notice how the data points clearly separate by language, EWA-DB versus PC-GITA, Slovak versus Spanish, [pause] rather than showing a clear separation by disease state. [pause] This indicates that naively combining datasets without task-specific processing branches leads to models that primarily learn to distinguish between languages. [pause]

In contrast, the t-SNE plot of the model **with** our dual-head architecture shows a different behavior. [pause] In this case, the data points belonging to different languages are more mixed, which is what we expect from a model that is not focusing on language-specific features. [pause] Instead, the data points are more clearly separated by disease state, with a smoother transition between languages. [pause] Of course, the t-SNE plot gives us only a visual representation in two dimensions, but it is a good indicator that our model is effectively learning disease-specific features that are disentangled from language markers. [pause]

To understand which components of our model are most critical for performance, we also conducted an ablation study. [pause] As you can see, the **dual-head architecture** is the most critical component. [pause] Removing this part of the architecture leads to a significant drop in performance, especially on the PC-GITA dataset. [pause] Wavelet features and contrastive learning also contribute to the overall performance on both datasets, but with less impact. Adaptive layers are actually quite relevant for both datasets, while we found bottleneck layers to be more critical for PC-GITA than for EWA-DB. [pause] This ablation study confirms that giving the model the ability to focus on each task separately is a key factor in its performance. [pause]

To wrap up, our novel **dual-head architecture** significantly improves generalization across Slovak and Spanish for detecting Parkinson's disease, achieving results comparable to or better than in-language models. [pause] The task-specific processing branches allow the model to capture disease markers more effectively, with potential for broader application across diverse linguistic populations. [pause] As for future work, we want to extend our model to additional languages, to understand whether this approach actually scales to different linguistic populations. [pause] Also, given that we are working with medical data, we want to improve the interpretability of our model for medical expert validation. [pause] Finally, we also aim to expand the application of our model to other speech disorders, such as Alzheimer's or Huntington's disease. [pause]

This concludes my presentation. [pause] Thank you for your attention! [pause] If you have any questions or want to know more about our work, feel free to reach out to us. [pause] You can also visit our GitHub repository for more details on the project. [pause] Thank you again! [pause]