Semi Supervised Audio Classification (2024)


Improving Speech Recognition for African American English With Audio Classification

Sep 16, 2023

Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha (+4 more)

Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
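
The selection step can be made concrete with a minimal sketch: filter untranscribed utterances where a trained audio classifier's P(AAE) output and coarse geographic metadata agree. The field names, region codes, and 0.8 threshold below are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of the paper's data-selection idea: combine a dialect
# classifier's score with coarse geographic metadata to pick utterances
# for semi-supervised fine-tuning. Field names, region codes, and the
# 0.8 threshold are illustrative assumptions, not values from the paper.

AAE_PREVALENT_REGIONS = {"region_a", "region_b"}  # placeholder region codes

def select_utterances(corpus, classifier, threshold=0.8):
    """Return untranscribed utterances likely to be AAE."""
    selected = []
    for utt in corpus:
        p_aae = classifier(utt["audio"])          # P(AAE) from the audio classifier
        in_region = utt["region"] in AAE_PREVALENT_REGIONS
        if p_aae >= threshold and in_region:      # require both signals to agree
            selected.append(utt)
    return selected

# Toy usage with a stand-in classifier.
corpus = [
    {"audio": "utt1.wav", "region": "region_a"},
    {"audio": "utt2.wav", "region": "region_c"},
]
fake_classifier = lambda audio: 0.9               # stand-in for the trained model
print(len(select_utterances(corpus, fake_classifier)))  # -> 1
```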

Modular Neural Network Approaches for Surgical Image Recognition

Jul 17, 2023

Nosseiba Ben Salem, Younes Bennani, Joseph Karkazan, Abir Barbara, Charles Dacheux, Thomas Gregory

Deep learning-based applications have seen a lot of success in recent years. Text, audio, image, and video have all been explored with great success using deep learning approaches. The use of convolutional neural networks (CNN) in computer vision, in particular, has yielded reliable results. Achieving these results requires large amounts of data, but suitable datasets are not always accessible, and annotating data can be difficult and time-consuming. Self-training is a semi-supervised approach that has been shown to alleviate this problem and achieve state-of-the-art performance; theoretical analysis even suggests that it can generalize better than a standard classifier. Another challenge for neural networks is the increasing complexity of modern problems, which demands high computational and storage costs. One way to mitigate this issue is modular learning, a strategy inspired by human cognition. The principle of the approach is to decompose a complex problem into simpler sub-tasks, which offers several advantages, including faster learning, better generalization, and improved interpretability. In the first part of this paper, we introduce and evaluate different modular learning architectures for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification. Our experiments show that modular learning improves performance compared to non-modular systems. Moreover, we found that the weighted modular variant, which weights each module's output by the probabilities from the gating module, achieved almost perfect classification. In the second part, we present our approach to data labeling and segmentation with self-training applied to shoulder arthroscopy images.
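
The weighted modular fusion lends itself to a short sketch: expert modules each emit class scores for their sub-tasks, and the gating module's softmax output weights their predictions. The two-expert linear setup below is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

# Hedged sketch of "weighted modular" fusion: each expert module produces
# class probabilities, and the gating module's softmax output weights the
# experts' predictions. Shapes and the two-expert setup are illustrative
# assumptions, not the paper's exact architecture.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_modular_predict(x, experts, gating):
    """Combine expert outputs weighted by gating probabilities."""
    gate_probs = softmax(gating(x))                            # (n_experts,)
    expert_probs = np.stack([softmax(e(x)) for e in experts])  # (n_experts, n_classes)
    return gate_probs @ expert_probs                           # weighted mixture

# Toy usage with random linear "modules".
rng = np.random.default_rng(0)
W_gate = rng.normal(size=(4, 2))                               # 4 features -> 2 experts
W_e1, W_e2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))  # -> 3 classes
x = rng.normal(size=4)
probs = weighted_modular_predict(
    x,
    experts=[lambda v: v @ W_e1, lambda v: v @ W_e2],
    gating=lambda v: v @ W_gate,
)
print(probs.sum())  # ~1.0
```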

A Semi-Supervised Algorithm for Improving the Consistency of Crowdsourced Datasets: The COVID-19 Case Study on Respiratory Disorder Classification

Sep 09, 2022

Lara Orlandic, Tomas Teijeiro, David Atienza

Cough audio signal classification is a potentially useful tool in screening for respiratory disorders, such as COVID-19. Since it is dangerous to collect data from patients with such contagious diseases, many research teams have turned to crowdsourcing to quickly gather cough sound data, as it was done to generate the COUGHVID dataset. The COUGHVID dataset enlisted expert physicians to diagnose the underlying diseases present in a limited number of uploaded recordings. However, this approach suffers from potential mislabeling of the coughs, as well as notable disagreement between experts. In this work, we use a semi-supervised learning (SSL) approach to improve the labeling consistency of the COUGHVID dataset and the robustness of COVID-19 versus healthy cough sound classification. First, we leverage existing SSL expert knowledge aggregation techniques to overcome the labeling inconsistencies and sparsity in the dataset. Next, our SSL approach is used to identify a subsample of re-labeled COUGHVID audio samples that can be used to train or augment future cough classification models. The consistency of the re-labeled data is demonstrated in that it exhibits a high degree of class separability, 3x higher than that of the user-labeled data, despite the expert label inconsistency present in the original dataset. Furthermore, the spectral differences in the user-labeled audio segments are amplified in the re-labeled data, resulting in significantly different power spectral densities between healthy and COVID-19 coughs, which demonstrates both the increased consistency of the new dataset and its explainability from an acoustic perspective. Finally, we demonstrate how the re-labeled dataset can be used to train a cough classifier. This SSL approach can be used to combine the medical knowledge of several experts to improve the database consistency for any diagnostic classification task.
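
A minimal sketch of the re-labeling loop, under simplifying assumptions: fit a model on the expert-agreed subset, pseudo-label the remaining crowdsourced audio, and keep only high-confidence predictions. The nearest-centroid classifier and the 0.9 threshold below are stand-ins, not the authors' pipeline.

```python
import numpy as np

# Hedged sketch of SSL relabeling: fit a classifier on samples where expert
# labels exist, then pseudo-label the rest and keep only high-confidence
# predictions. The nearest-centroid model and 0.9 threshold are simplifying
# assumptions, not the paper's pipeline.

def relabel(features, expert_labels, threshold=0.9):
    labeled = expert_labels >= 0                      # -1 marks unlabeled samples
    centroids = np.stack([features[labeled & (expert_labels == c)].mean(axis=0)
                          for c in (0, 1)])
    # Distance-based confidence: the closer centroid wins.
    d = np.linalg.norm(features[:, None, :] - centroids[None], axis=-1)
    conf = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    pseudo = conf.argmax(axis=1)
    keep = (~labeled) & (conf.max(axis=1) >= threshold)
    return pseudo, keep

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])
y = np.full(100, -1); y[:10] = 0; y[50:60] = 1        # sparse expert labels
pseudo, keep = relabel(X, y)
print(keep.sum(), "samples re-labeled with high confidence")
```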

Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

Oct 27, 2022

Fernando López, Jordi Luque

High-quality data labeling for specific domains is costly and time-consuming. In this work, we propose a self-supervised domain adaptation method based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR model trained on out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively over a corpus of broadcast TV, repeatedly reducing the quantity of text to be aligned or expanding the alignment window until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance, computed from the paths of the CTC alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and are then ready for further ASR adaptation. The results obtained on both the Spanish RTVE2022 and CommonVoice databases underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignment, domain adaptation, and semi-supervised training of end-to-end ASR.

* 5 pages, 4 figures, IberSPEECH2022
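
The confidence-based anchoring can be illustrated with a small dynamic program: align a target symbol sequence to frame-wise log-posteriors and score the utterance by the mean log-probability along the best monotonic path. The blank-free Viterbi and the acceptance threshold below are simplifications, not the paper's exact algorithm.

```python
import numpy as np

# Hedged sketch of CTC-style alignment scoring: given frame-wise character
# posteriors from a seed model, score an utterance by the mean log-probability
# along its best monotonic path, and only advance the temporal anchor when the
# score clears a threshold. This is a standard forced-alignment Viterbi over
# the target sequence (no blank symbol, for brevity); requires T >= N.

def alignment_score(log_posteriors, target_ids):
    """Best monotonic alignment of target_ids to frames; returns mean log-prob."""
    T, _ = log_posteriors.shape
    N = len(target_ids)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = log_posteriors[0, target_ids[0]]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]                         # emit same symbol again
            move = dp[t - 1, n - 1] if n > 0 else -np.inf  # advance to next symbol
            dp[t, n] = max(stay, move) + log_posteriors[t, target_ids[n]]
    return dp[T - 1, N - 1] / T

rng = np.random.default_rng(2)
logp = np.log(rng.dirichlet(np.ones(30), size=100))     # 100 frames, 30 symbols
score = alignment_score(logp, target_ids=[3, 7, 7, 1])
accept = score > -4.0                                   # illustrative threshold
print(round(float(score), 3), accept)
```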

Semi-Supervised Audio Classification with Partially Labeled Data

Nov 24, 2021

Siddharth Gururani, Alexander Lerch

Audio classification has seen great progress with the increasing availability of large-scale datasets. These large datasets, however, are often only partially labeled as collecting full annotations is a tedious and expensive process. This paper presents two semi-supervised methods capable of learning with missing labels and evaluates them on two publicly available, partially labeled datasets. The first method relies on label enhancement by a two-stage teacher-student learning process, while the second method utilizes the mean teacher semi-supervised learning algorithm. Our results demonstrate the impact of improperly handling missing labels and compare the benefits of using different strategies leveraging data with few labels. Methods capable of learning with partially labeled data have the potential to improve models for audio classification by utilizing even larger amounts of data without the need for complete annotations.

* To be presented at IEEE ISM 2021
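
For reference, a minimal sketch of the mean teacher algorithm the second method builds on: the teacher's weights track the student's by exponential moving average (EMA), and a consistency loss matches their predictions on examples whose labels are missing. The linear models and alpha = 0.99 below are illustrative choices.

```python
import numpy as np

# Hedged sketch of mean teacher: teacher weights are an EMA of student
# weights, and a consistency loss pulls student predictions toward teacher
# predictions on unlabeled (or label-missing) examples. Linear models and
# alpha = 0.99 are illustrative simplifications.

def ema_update(teacher_w, student_w, alpha=0.99):
    return alpha * teacher_w + (1 - alpha) * student_w

def consistency_loss(student_w, teacher_w, x):
    s = x @ student_w                     # student logits
    t = x @ teacher_w                     # teacher logits (no gradient in practice)
    return float(np.mean((s - t) ** 2))   # MSE consistency, as in mean teacher

rng = np.random.default_rng(3)
student = rng.normal(size=(16, 5))
teacher = student.copy()
for step in range(100):
    student += 0.01 * rng.normal(size=student.shape)  # stand-in for a gradient step
    teacher = ema_update(teacher, student)            # teacher lags the student
x_unlabeled = rng.normal(size=(8, 16))
print(consistency_loss(student, teacher, x_unlabeled))
```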

Cross-domain Semi-Supervised Audio Event Classification Using Contrastive Regularization

Sep 29, 2021

Donmoon Lee, Kyogu Lee

In this study, we propose a novel semi-supervised training method that uses unlabeled data whose class distribution is completely different from that of the target data, or data without a target label. To this end, we introduce a contrastive regularization term that is designed to be target-task-oriented and is trained simultaneously with the main objective. In addition, we propose a simple audio-mixing-based augmentation strategy performed within batch samples. Experimental results validate that the proposed method improves performance and, in particular, offers advantages in training stability and generalization.

* 5 pages, 3 figures, and 2 tables. Accepted paper at IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021
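
The batch-level audio mixing can be sketched in a few lines: each waveform is mixed with another sample drawn from the same batch at a random gain. The gain range below is an assumption, and the paper's exact mixing scheme may differ.

```python
import numpy as np

# Hedged sketch of batch-level audio mixing: each waveform is mixed with a
# randomly chosen batch-mate (possibly itself, in this simple version) at a
# random gain. The uniform gain range is an assumption, not the paper's
# exact scheme.

def batch_audio_mix(batch, rng, low=0.1, high=0.5):
    """Mix each waveform with a randomly chosen partner from the batch."""
    idx = rng.permutation(len(batch))                # partner for each sample
    gain = rng.uniform(low, high, size=(len(batch), 1))
    return batch + gain * batch[idx]

rng = np.random.default_rng(4)
waveforms = rng.normal(size=(8, 16000))              # batch of 1 s clips at 16 kHz
mixed = batch_audio_mix(waveforms, rng)
print(mixed.shape)                                   # (8, 16000)
```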

SpliceOut: A Simple and Efficient Audio Augmentation Method

Oct 13, 2021

Arjit Jain, Pranay Reddy Samala, Deepak Mittal, Preethi Jyothi, Maneesh Singh

Time masking has become a de facto augmentation technique for speech and audio tasks, including automatic speech recognition (ASR) and audio classification, most notably as a part of SpecAugment. In this work, we propose SpliceOut, a simple modification to time masking which makes it computationally more efficient. SpliceOut performs comparably to (and sometimes outperforms) SpecAugment on a wide variety of speech and audio tasks, including ASR for seven different languages using varying amounts of training data, as well as on speech translation, sound and music classification, thus establishing itself as a broadly applicable audio augmentation method. SpliceOut also provides additional gains when used in conjunction with other augmentation techniques. Apart from the fully-supervised setting, we also demonstrate that SpliceOut can complement unsupervised representation learning with performance gains in the semi-supervised and self-supervised settings.
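
A minimal sketch of the SpliceOut idea: rather than zeroing time bands as SpecAugment's time masking does, randomly chosen time intervals are removed entirely and the surviving frames are concatenated, yielding a shorter signal. The interval count and widths below are illustrative parameters.

```python
import numpy as np

# Hedged sketch of SpliceOut: randomly chosen time intervals are cut out
# entirely and the remaining frames are concatenated, instead of being
# zero-masked. Interval counts and widths are illustrative parameters.

def splice_out(spec, n_intervals=2, max_width=20, rng=None):
    """Remove n_intervals random time spans from a (time, freq) spectrogram."""
    rng = rng or np.random.default_rng()
    keep = np.ones(spec.shape[0], dtype=bool)
    for _ in range(n_intervals):
        width = rng.integers(1, max_width + 1)
        start = rng.integers(0, max(1, spec.shape[0] - width))
        keep[start:start + width] = False      # mark span for removal
    return spec[keep]                          # concatenate surviving frames

rng = np.random.default_rng(5)
spec = rng.normal(size=(400, 80))              # 400 frames, 80 mel bins
out = splice_out(spec, rng=rng)
print(spec.shape, "->", out.shape)             # time axis shrinks
```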

Phone-to-audio alignment without text: A Semi-supervised Approach

Oct 08, 2021

Jian Zhu, Cong Zhang, David Jurgens

The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on forced-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that, even when transcriptions are not available, both proposed methods produce results closely matching those of existing forced alignment tools. Our work presents a neural pipeline for fully automated phone-to-audio alignment. Code and pretrained models are available at https://github.com/lingjzhu/charsiu.

* Submitted to ICASSP 2022
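
The text-independent segmentation side can be illustrated simply: given per-frame phone posteriors from a frame classification head, take the argmax of each frame and merge runs of identical labels into (phone, start, end) segments. The frame rate and label set below are assumptions; the actual Wav2Vec2-FC pipeline involves more machinery.

```python
import numpy as np

# Hedged sketch of text-independent segmentation from a frame classifier:
# argmax the per-frame phone posteriors, then merge runs of identical labels
# into (phone, start_s, end_s) segments. The 20 ms frame duration and the
# 40-phone label set are illustrative assumptions.

def frames_to_segments(posteriors, frame_dur=0.02):
    labels = posteriors.argmax(axis=1)
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        # Close a segment at the sequence end or when the label changes.
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((int(labels[start]), start * frame_dur, t * frame_dur))
            start = t
    return segments

rng = np.random.default_rng(6)
post = rng.dirichlet(np.ones(40), size=50)      # 50 frames, 40 phone classes
print(frames_to_segments(post)[:3])             # [(phone_id, start_s, end_s), ...]
```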

Peer Collaborative Learning for Polyphonic Sound Event Detection

Oct 07, 2021

Hayato Endo, Hiromitsu Nishizaki

This paper shows that a semi-supervised learning method called peer collaborative learning (PCL) can be applied to the polyphonic sound event detection (PSED) task, one of the tasks in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. Many deep learning models have been studied to determine what kinds of sound events occur where, and for how long, in a given audio clip. The PCL variant used in this paper combines ensemble-based knowledge distillation into sub-networks with student-teacher knowledge distillation, which enables training a robust PSED model from a small amount of strongly labeled data, weakly labeled data, and a large amount of unlabeled data. We evaluated the proposed PCL model on the DCASE 2019 Task 4 datasets and achieved an F1-score improvement of about 10% over the baseline model.

* Submitted to ICASSP 2022
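
The ensemble-distillation ingredient can be sketched as follows: the averaged predictions of several sub-networks serve as a soft target for each sub-network via a KL term, alongside the usual supervised loss on labeled clips. The linear sub-networks below are illustrative, not the paper's architecture.

```python
import numpy as np

# Hedged sketch of ensemble-based distillation into sub-networks: the mean
# of the sub-networks' predictions acts as a soft target, and each
# sub-network is pulled toward it with a KL term. Linear "sub-networks" and
# the exact KL formulation are illustrative simplifications.

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(sub_logits):
    """KL(ensemble || each sub-network), averaged over sub-networks."""
    probs = softmax(np.stack(sub_logits))        # (n_subs, batch, classes)
    ensemble = probs.mean(axis=0)                # soft target
    kl = (ensemble[None] * (np.log(ensemble[None] + 1e-8)
                            - np.log(probs + 1e-8))).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(7)
logits = [rng.normal(size=(4, 10)) for _ in range(3)]  # 3 sub-networks
print(distillation_loss(logits))
```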

Semi Supervised Learning For Few-shot Audio Classification By Episodic Triplet Mining

Feb 16, 2021

Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

Few-shot learning aims to generalize to unseen classes that appear during testing but are unavailable during training. Prototypical networks incorporate few-shot metric learning by constructing a class prototype as the mean vector of the embedded support points within a class. The performance of prototypical networks degrades drastically in extreme few-shot scenarios (like one-shot), mainly because within-cluster variation is disregarded when constructing prototypes. In this paper, we propose to replace the typical prototypical loss function with an Episodic Triplet Mining (ETM) technique. Conventional triplet selection leads to overfitting because all possible combinations are used during training; we incorporate episodic training that mines semi-hard positive and semi-hard negative triplets to overcome this overfitting. We also propose an adaptation that makes use of unlabeled training samples for better modeling. Experiments on two different audio processing tasks, speaker recognition and audio event detection, show improved performance and hence the efficacy of ETM over the prototypical loss function and other meta-learning frameworks. We further show improved performance when unlabeled training samples are used.

* 5 pages
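
A minimal sketch of semi-hard triplet mining within an episode: for each anchor, positives come from its own class, and a negative qualifies as semi-hard when it lies farther than the positive but within the margin. The embedding sizes and margin below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of semi-hard triplet mining within one episode: for anchor a,
# positive p, negative n, keep triplets with d(a,p) < d(a,n) < d(a,p) + margin.
# Embedding dimensions and the margin value are illustrative assumptions.

def semi_hard_triplets(emb, labels, margin=0.5):
    triplets = []
    dist = np.linalg.norm(emb[:, None] - emb[None], axis=-1)  # pairwise distances
    for a in range(len(emb)):
        pos = np.flatnonzero((labels == labels[a]) & (np.arange(len(emb)) != a))
        neg = np.flatnonzero(labels != labels[a])
        for p in pos:
            # Semi-hard condition: farther than the positive, within the margin.
            mask = (dist[a, neg] > dist[a, p]) & (dist[a, neg] < dist[a, p] + margin)
            for n in neg[mask]:
                triplets.append((a, int(p), int(n)))
    return triplets

rng = np.random.default_rng(8)
emb = rng.normal(size=(12, 16))                 # one episode's embeddings
labels = np.repeat(np.arange(4), 3)             # 4 classes, 3 samples each
print(len(semi_hard_triplets(emb, labels)), "semi-hard triplets mined")
```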
