Overview

Before 1980s, Automatic Speech Recognition (ASR) traditionally relied on pattern matching approaches such as Dynamic Time Warping (DTW). From the 1970s, the trend slowly shifted towards statistical approaches based on systems consisting of the Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), pronunciation model (PM), and decoding algorithms since the introduction of HMM in the 1960s. However, with the introduction of the Attention-based Encoder-Decoder (AED) ap- proach in the mid-2010s, the field rapidly advanced towards all neural network- based approaches, such as encoder models with a Connectionist Temporal Classi- fication (CTC) Loss, encoder-decoder models with Cross-Entropy (CE) loss, trans- ducer models with transducer loss, and so on. This course focuses on contemporary model structures, training methods, and their loss functions introduced since 2016. It also covers commonly used inference ap- proaches such as beam search decoding and shallow fusion with Language Models (LMs), among others. Additionally, we will study advanced topics such as the appli- cation of self-supervised training. The course will also cover latest research topics, such as generative speech models.

The course is divided into the following five modules:

  1. Review of neural networks and connectionist temporal classification
  2. Advanced inference
  3. Transducer
  4. Self-supervised learning techniques in speech recognition
  5. Generative speech model Please refer to the Schedule section for more details

Learning Objectives

  • Basics: Understand the basic machine learning theories used in speech recognition
  • Latest speech recognition models and related theories: Understand the structures of contemporary all neural network speech recognition models
  • Techniques for performance improvement: Learn how to improve speech recognition accuracy using additional techniques such as data augmentation, beam search decoding, and shallow fusioning with a Language Model (LM).
  • Implementation skill: : Gain skills to implement, improve, and modify speech recognition codes.

Grading Scheme

In this class, grades are evaluated through absolute assessment.

  • 40% Midterm Exam
  • 50% Final Exam
  • 10% Course Participation There may be quizzes during the class, but those will not be explicitly included in grading.

Prerequisite

Students are expected to have knowledge in the following fields:

  • Linear Algebra
  • Basic Neural Network Theory
  • Calculus
  • Python Programming
  • Some Basic Knowledge in Speech Recognition

Projects and Assignments

There are not projects and assignments in this course. Evaluation will be made based on exam grading and class participation.

Class Schedule

Week Title Course Materials and Relavant Papers Quiz Paper
Week1 (09/02 - 09/08) Course Overview Lecture 0 -
  MODULE 1: REVIEW OF NEURAL NETWORKS AND CONNECTIONIST TEMPORAL CLASSIFICATION    
Week2 (09/09 - 09/15) Neural Network Basics Lecture 1 -
Week3 (09/16 - 09/21) Connectionist Temporal Classification-I Lecture 2 -
Week4 (09/23 - 09/29) Connectionist Temporal Classification-II Lecture 3 Quiz 1