XAI 610 (Fall 2024)

Overview

Before 1980s, Automatic Speech Recognition (ASR) traditionally relied on pattern matching approaches such as Dynamic Time Warping (DTW). From the 1970s, the trend slowly shifted towards statistical approaches based on systems consisting of the Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), pronunciation model (PM), and decoding algorithms since the introduction of HMM in the 1960s. However, with the introduction of the Attention-based Encoder-Decoder (AED) ap- proach in the mid-2010s, the field rapidly advanced towards all neural network- based approaches, such as encoder models with a Connectionist Temporal Classi- fication (CTC) Loss, encoder-decoder models with Cross-Entropy (CE) loss, trans- ducer models with transducer loss, and so on. This course focuses on contemporary model structures, training methods, and their loss functions introduced since 2016. It also covers commonly used inference ap- proaches such as beam search decoding and shallow fusion with Language Models (LMs), among others. Additionally, we will study advanced topics such as the appli- cation of self-supervised training. The course will also cover latest research topics, such as generative speech models.

The course is divided into the following five modules:

Review of neural networks and connectionist temporal classification
Advanced inference
Transducer
Self-supervised learning techniques in speech recognition
Generative speech model Please refer to the Schedule section for more details

Learning Objectives

Basics: Understand the basic machine learning theories used in speech recognition
Latest speech recognition models and related theories: Understand the structures of contemporary all neural network speech recognition models
Techniques for performance improvement: Learn how to improve speech recognition accuracy using additional techniques such as data augmentation, beam search decoding, and shallow fusioning with a Language Model (LM).
Implementation skill: : Gain skills to implement, improve, and modify speech recognition codes.

Grading Scheme

In this class, grades are evaluated through absolute assessment.

40% Midterm Exam
50% Final Exam
10% Course Participation There may be quizzes during the class, but those will not be explicitly included in grading.

Prerequisite

Students are expected to have knowledge in the following fields:

Linear Algebra
Basic Neural Network Theory
Calculus
Python Programming
Some Basic Knowledge in Speech Recognition

Projects and Assignments

There are not projects and assignments in this course. Evaluation will be made based on exam grading and class participation.

Class Schedule

Week	Title	Course Materials and Relavant Papers	Quiz Paper
Week1 (09/02 - 09/08)	Course Overview	Lecture 0	-
	MODULE 1: REVIEW OF NEURAL NETWORKS AND CONNECTIONIST TEMPORAL CLASSIFICATION
Week2 (09/09 - 09/15)	Neural Network Basics	Lecture 1	-
Week3 (09/16 - 09/21)	Connectionist Temporal Classification-I	Lecture 2	-
Week4 (09/23 - 09/29)	Connectionist Temporal Classification-II	Lecture 3	Quiz 1

Chanwoo Kim