CSIDM-200806: Adaptive Non-native Acoustic Modelling and Automatic Corrective Feedback Generation for Spoken Language Learning

Principal Investigators:

BACKGROUND / INTRODUCTION / OBJECTIVES

Computer-aided spoken language learning aims to build an automated system that assists humans in learning to speak a foreign language. This is typically conducted in a supervised mode, where the learner is given sample sentences to speak and an acoustic model is used to assess the Goodness of Pronunciation (GOP).

Current state-of-the-art systems use Hidden Markov Models (HMMs) to model the acoustic properties of the speech signal. Typically, a speaker-independent (SI) acoustic model is deployed, and adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) or Maximum a Posteriori (MAP) adaptation may be used to obtain a better model for a particular user. However, in addition to learning speaker-dependent features (e.g. speaker characteristics and vocal tract properties), these data-driven techniques also adapt to non-speaker-dependent features (e.g. systematic pronunciation errors). This is undesirable for language learning, because the adapted model would then score the learner's systematic errors as acceptable. Therefore, it is important to design an adaptation scheme that is able to isolate speaker-dependent features from pronunciation errors.
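To illustrate why unconstrained adaptation absorbs pronunciation errors, the following is a minimal sketch of MAP re-estimation of a single Gaussian mean, using the standard update in which the prior (SI) mean is interpolated with the occupancy-weighted sum of the adaptation frames. The function name `map_adapt_mean`, the prior weight `tau` and the toy data are illustrative assumptions; in practice the occupancies would come from aligning the learner's speech against the HMM. This is plain MAP, not the constrained scheme the project proposes.

```python
from typing import List, Sequence


def map_adapt_mean(
    prior_mean: Sequence[float],
    frames: Sequence[Sequence[float]],
    occupancies: Sequence[float],
    tau: float = 10.0,
) -> List[float]:
    """MAP re-estimation of one Gaussian mean (illustrative sketch).

    mu_map = (tau * mu_0 + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
    where gamma_t is the occupancy of this Gaussian at frame t and tau is a
    hypothetical prior weight controlling how far the mean may move.
    """
    if len(frames) != len(occupancies):
        raise ValueError("need one occupancy per frame")
    dim = len(prior_mean)
    occ_total = sum(occupancies)
    # Occupancy-weighted sum of the adaptation frames, per dimension.
    weighted_sum = [0.0] * dim
    for x, gamma in zip(frames, occupancies):
        for d in range(dim):
            weighted_sum[d] += gamma * x[d]
    # Interpolate between the prior mean and the learner's data.
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occ_total) for d in range(dim)]


if __name__ == "__main__":
    # Ten frames of toy learner data, each fully assigned to this Gaussian.
    print(map_adapt_mean([0.0, 0.0], [[1.0, 2.0]] * 10, [1.0] * 10, tau=10.0))  # [0.5, 1.0]
```

With `tau` equal to the total occupancy, the adapted mean moves halfway toward the learner's data, whether the shift comes from vocal tract differences or from a systematic mispronunciation. MAP itself cannot tell the two apart, which is the gap a constrained adaptation scheme would need to close.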

The GOP provides simple feedback on how well the learner has learned the new language. It is usually given by the average of the confidence scores produced by the acoustic model at the sentence, word or phone level. However, the GOP does not provide hints or pointers that help learners improve their pronunciation. Therefore, it is important for a computer-aided spoken language learning system to provide corrective feedback to enhance the learning experience.
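The averaging described above can be sketched as follows, assuming the acoustic model has already produced one confidence score per phone (taken here to lie in [0, 1]) and that phones are grouped by word. The input layout and the function name `gop_scores` are hypothetical; the sketch only shows how phone scores are pooled into word-level and sentence-level GOP, not how the per-phone confidences themselves are computed.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical input layout: (word, [per-phone confidence scores]) pairs.
Utterance = List[Tuple[str, List[float]]]


def gop_scores(utterance: Utterance) -> Tuple[List[Tuple[str, float]], float]:
    """Pool phone-level confidences into word-level and sentence-level GOP.

    Word GOP is the mean of that word's phone scores; sentence GOP is the
    mean over every phone in the utterance.
    """
    word_gop = [(word, mean(phones)) for word, phones in utterance]
    all_phones = [score for _, phones in utterance for score in phones]
    return word_gop, mean(all_phones)


if __name__ == "__main__":
    words, sentence = gop_scores([("hello", [1.0, 0.5, 0.75, 0.25]), ("world", [0.25, 0.25])])
    print(words, sentence)  # [('hello', 0.625), ('world', 0.25)] 0.5
```

Averaging over all phones, rather than over the word scores, gives longer words more weight in the sentence score; either convention is possible. The output also shows the limitation noted above: a low score for "world" says that the word was weak, but not which sound was wrong or how to fix it.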

SCOPE OF WORK

The proposed project aims to develop adaptive non-native acoustic models and to use them for automatic corrective feedback generation in computer-aided spoken language learning. The main objective of this project is to improve the following aspects of language learning:

  • Adaptive non-native acoustic modelling
  • Pronunciation error detection using native- and IPA-based phone mapping
  • Tone and duration (prosody) modelling
  • Automatic corrective feedback generation for pronunciation and prosody errors
  • Interactive speech synthesis with high naturalness
  • Automatic generation of training speech for language learning by integrating a speech synthesis markup language
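As an illustration of the pronunciation error detection item, one common way to locate mispronunciations is to align the canonical phone sequence of the prompt against the phones recognized from the learner's speech and report substitutions, deletions and insertions. The sketch below does this with a standard edit-distance (Levenshtein) alignment. It assumes both sequences are already expressed in a shared phone set (for example after a native- or IPA-based mapping); the function name `align_phones` and the ARPAbet-style symbols are illustrative, and this is not necessarily the method the project will adopt.

```python
from typing import List, Optional, Tuple

# Each step: (operation, canonical phone or None, recognized phone or None).
Step = Tuple[str, Optional[str], Optional[str]]


def align_phones(canonical: List[str], recognized: List[str]) -> List[Step]:
    """Align expected and recognized phones with minimum edit distance."""
    n, m = len(canonical), len(recognized)
    # cost[i][j]: edit distance between canonical[:i] and recognized[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if canonical[i - 1] == recognized[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # match or substitution
                cost[i - 1][j] + 1,        # deletion (expected phone missing)
                cost[i][j - 1] + 1,        # insertion (extra phone spoken)
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    steps: List[Step] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if canonical[i - 1] == recognized[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                op = "match" if sub == 0 else "substitution"
                steps.append((op, canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            steps.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            steps.append(("insertion", None, recognized[j - 1]))
            j -= 1
    steps.reverse()
    return steps


if __name__ == "__main__":
    # Canonical "think" versus a learner who produced /s/ instead of /th/.
    steps = align_phones(["th", "ih", "ng", "k"], ["s", "ih", "ng", "k"])
    print([s for s in steps if s[0] != "match"])  # [('substitution', 'th', 's')]
```

The non-match steps are the raw material for corrective feedback: a substitution of /s/ for /th/ can be turned into a targeted hint, which a single GOP score cannot provide.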

This project will research a bi-directional learning mechanism in which both the system and the user learn at the same time:

  1. The system will adapt its acoustic model to eliminate the mismatch in speaker characteristics and vocal tract properties (attributes that the user is not required to learn). This project will focus on the design of a constrained adaptation scheme for this purpose.
  2. The user will improve by correcting mispronunciations and prosodic errors. This project will focus on pronunciation and prosody models that provide useful corrective feedback to enhance the learning experience.
  3. The user will improve by practising with synthesized speech generated from arbitrary input text. The project will focus on adapting the current TTS system from a reading style to a conversational style, which is more suitable for language learning.
  4. The system will integrate a markup language to generate high-quality predefined speech for language learning, reducing the manpower needed to prepare it. The project will focus on defining and testing several user tags to improve the expressiveness of the synthesized speech.

In summary, this research project aims to build a computer-aided spoken language learning system that provides corrective feedback to enhance the learning experience. The project will investigate various corrective feedback generation methods to help learners improve their pronunciation.