6 months ago

Da-Rong Liu, Kuan-Yu Chen, Hung-Yi Lee, Lin-shan Lee. Completely Unsupervised Phoneme Recognition by Adversarially Learning Mapping Relationships from Audio Embeddings

This work proposes an approach for mapping audio signals to phoneme sequences without any phoneme-labeled audio data.
Proposed Framework

This framework consists of three parts:

1. Audio to Vector
Divides each audio utterance into segments and obtains an audio embedding for each segment.

2. K-means clustering
Clusters all audio embeddings with K-means and assigns each embedding a cluster index.

3. Mapping relationship GAN
The generator produces predicted phoneme vector sequences from the cluster index sequences. The discriminator is trained to distinguish these predicted sequences from real phoneme vector sequences, which are collected from text sentences and the lexicon.
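
To make the data flow through the three parts concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The 2-D "embeddings" are made-up toy values standing in for real segment embeddings, `kmeans` uses a deterministic farthest-point initialization chosen only to keep the example reproducible, and the lookup-table generator with `num_phonemes = 4` is an assumption for illustration (the notes above only say that the generator maps cluster index sequences to phoneme vector sequences). GAN training itself is omitted.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Lloyd's k-means with farthest-point initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Part 1 (stand-in): each utterance is already a list of segment embeddings.
utterances = [
    [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (10.0, 0.2)],
    [(5.1, 4.9), (9.9, 0.0), (0.2, 0.1)],
]

# Part 2: cluster all embeddings together, then regroup indices per utterance.
k = 3
flat = [emb for utt in utterances for emb in utt]
centroids, assign = kmeans(flat, k)
index_seqs, pos = [], 0
for utt in utterances:
    index_seqs.append(assign[pos:pos + len(utt)])
    pos += len(utt)

# Part 3 (generator only, untrained): a lookup table from cluster index to a
# distribution over phonemes. num_phonemes is a hypothetical toy value.
num_phonemes = 4
rng = random.Random(0)
table = [[rng.gauss(0.0, 1.0) for _ in range(num_phonemes)] for _ in range(k)]
predicted = [[softmax(table[i]) for i in seq] for seq in index_seqs]

print(index_seqs)  # cluster index sequence for each utterance
```

In this sketch, `predicted` is what a discriminator would compare against real phoneme vector sequences. Adversarial training would then adjust `table` so that the two become hard to tell apart.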

Figure: Mapping relationship GAN
