
...

For example, a classification algorithm (classifier) might learn to predict whether a given email is spam or not, as illustrated below. This is a binary classification task, where the goal is to categorize the input data into two mutually exclusive classes. In this case, the training data is labeled with binary categories, such as "spam" and "not spam," "true" and "false," or "positive" and "negative." These labels guide the model in learning the differences between the two categories, allowing it to make accurate predictions when exposed to new data.
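To make the idea of learning from labeled binary data concrete, here is a minimal sketch in Python. The toy emails, the keyword-count "model", and its scoring rule are all hypothetical simplifications, not a real spam filter:

```python
# Minimal sketch of binary classification on toy data (hypothetical examples,
# not a real spam corpus): each email is labeled "spam" or "not spam", and a
# simple keyword-count model learns which words separate the two classes.
from collections import Counter

train = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("meeting notes for tomorrow", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]

# "Training": count how often each word appears under each label.
counts = {"spam": Counter(), "not spam": Counter()}
for text, label in train:
    counts[label].update(text.split())

def classify(text):
    # Score each class by how often its training words appear in the input.
    scores = {label: sum(c[w] for w in text.split()) for label, c in counts.items()}
    return max(scores, key=scores.get)

print(classify("free prize inside"))       # spam-like words dominate
print(classify("notes from the meeting"))  # benign words dominate
```

Real classifiers replace the keyword counts with learned parameters, but the supervision signal is the same: pairs of inputs and binary labels.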

...

...


What is a deep neural network and how does it work?

A deep neural network (DNN) is a machine learning model that makes decisions in a manner loosely analogous to the human brain, using processes that mimic the way biological neurons work together to identify phenomena, weigh options, and arrive at conclusions.

Every neural network consists of layers of nodes, or artificial neurons: an input layer, one or more hidden layers, and an output layer. Each node connects to nodes in the next layer and has its own associated weights and threshold. If the output of an individual node is above the specified threshold value, that node is activated and sends data to the next layer of the network; otherwise, no data is passed along.
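The weighted-sum-and-threshold behavior of a single node can be sketched in a few lines of Python; the weights, bias, and inputs below are illustrative numbers, not learned values:

```python
# One artificial neuron as described above: a weighted sum of the inputs plus
# a bias, followed by a threshold test that decides whether the node "fires".
def neuron(inputs, weights, bias, threshold=0.0):
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The node passes data to the next layer only if it clears the threshold.
    return activation if activation > threshold else 0.0

out = neuron(inputs=[1.0, 0.5], weights=[0.6, -0.2], bias=0.1)
print(out)  # 0.6*1.0 - 0.2*0.5 + 0.1 = 0.6 -> the node fires
```

Stacking many such nodes into layers, and learning the weights from data, gives the deep networks used below.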

...

There are several key steps in the development of a DNN model:

Defining a model architecture:

Let's define our first deep neural network (DNN): a single-layer, fully connected network with a 3-dimensional input. In a fully connected layer, each input unit is connected to every output unit, ensuring comprehensive interaction between neurons.
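In PyTorch (assumed here as the framework, since the document does not name one), such a single fully connected layer is one line; the output size of 2 is an illustrative choice:

```python
import torch
import torch.nn as nn

# A single fully connected (linear) layer mapping a 3-dimensional input to,
# say, 2 output classes (the output size here is an illustrative choice).
model = nn.Linear(in_features=3, out_features=2)

x = torch.randn(1, 3)          # one example with 3 features
logits = model(x)              # every input unit is connected to every output
print(logits.shape)            # torch.Size([1, 2])
```

Deeper networks stack several such layers with nonlinearities in between, but the single layer is enough to walk through loss and optimization below.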

...

Defining a loss function: Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how "good" the model parameters are). We will use the cross-entropy (CE) loss, which is standard for classification.
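A quick check of how cross-entropy behaves, using made-up logits for a hypothetical 3-class problem:

```python
import torch
import torch.nn as nn

# Cross-entropy loss on made-up logits for a 3-class problem: the loss is
# small when the logit of the true class dominates, and large otherwise.
ce = nn.CrossEntropyLoss()

logits = torch.tensor([[4.0, 0.0, 0.0]])   # model strongly favors class 0
good = ce(logits, torch.tensor([0]))       # true class is 0 -> small loss
bad = ce(logits, torch.tensor([2]))        # true class is 2 -> large loss
print(good.item(), bad.item())
```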

Optimizing the loss function: During training, the goal is to find the “best” values of the model parameters that minimize the loss function based on the training dataset. This process is known as optimization.
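A minimal optimization run, again assuming PyTorch, with a synthetic toy dataset and plain SGD (the data, learning rate, and step count are illustrative):

```python
import torch
import torch.nn as nn

# One optimization run on a toy dataset: gradient descent repeatedly updates
# the parameters of a single-layer model to reduce the cross-entropy loss.
torch.manual_seed(0)
model = nn.Linear(3, 2)
ce = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(8, 3)                  # 8 toy examples, 3 features each
y = (X[:, 0] > 0).long()               # synthetic labels for illustration

start = ce(model(X), y).item()
for _ in range(100):
    opt.zero_grad()                    # clear old gradients
    loss = ce(model(X), y)             # how wrong is the model currently?
    loss.backward()                    # compute gradients of the loss
    opt.step()                         # update parameters downhill
print(start, "->", loss.item())        # loss after training vs. at the start
```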

...

A General Pipeline of Task-Oriented Spoken Dialogue Systems

Spoken dialogue systems (SDSs) are the most prominent component of today's virtual personal assistants, such as Microsoft's Cortana, Apple's Siri, Amazon Alexa, Google Assistant, and Facebook's M. Unlike chitchat, task-oriented SDSs aim to assist users with a specific goal, such as recommending a recipe or booking a hotel.
A classical pipeline architecture of a task-oriented spoken dialogue system includes the following key components:

  • Automatic Speech Recognition (ASR) - Converts spoken language into textual transcript.

  • Natural Language Understanding (NLU) - Interprets and extracts meaning from the transcript.

  • Dialogue Management (DM) - Manages the flow of conversation and determines the system’s response.

  • Natural Language Generation (NLG) - Constructs responses in natural language.

  • Text-to-Speech (TTS) - Converts the generated text into spoken output.
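The five stages above can be sketched as a chain of functions. Everything here is a stub with hypothetical names and hard-coded return values, just to show how data flows through the pipeline:

```python
# Sketch of the classical pipeline with stub components (all function names
# and return values here are hypothetical placeholders, not a real system).
def asr(audio):          # speech -> transcript
    return "i want to cook italian pizza"

def nlu(text):           # transcript -> intent + slots
    return {"intent": "addFilter", "slots": {"ingredienttype": "italian pizza"}}

def dm(frame):           # semantic frame -> system action
    return {"action": "confirm", "slots": frame["slots"]}

def nlg(action):         # system action -> natural-language response
    return "Adding italian pizza to your filters."

def tts(text):           # response text -> audio (left as text in this stub)
    return text

response = tts(nlg(dm(nlu(asr(b"raw-audio-bytes")))))
print(response)
```

In this project, the stubs for ASR and NLU are replaced by real models; DM, NLG, and TTS are out of scope.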

...

In this project, we will focus on building a simple pipeline that integrates ASR followed by an NLU component. We will use an existing ASR model (e.g., Whisper) for inference/prediction only (no training), while enhancing the performance of the NLU model (e.g., BERT) by training it on the conversational data collected in the previous course.

By the end of the project, you will learn how to:

  • Construct a basic dialogue pipeline.

  • Train and improve individual components, specifically the NLU model.

This hands-on approach will provide insight into developing and refining key elements of a dialogue system.

ASR with Whisper

Automatic Speech Recognition (ASR) is a key component in the pipeline architecture, which converts spoken language into text. It enables machines to interpret and transcribe human speech, allowing for seamless interaction between users and applications through voice commands.

Whisper is a widely used general-purpose speech recognition model developed by OpenAI. It is trained on a large dataset of diverse audio and is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.

NLU with BERT

Unlike chitchat, task-oriented dialogue is constrained by a dialogue ontology, which defines all possible intents, slots, and their corresponding candidate values in specific domains. The NLU component maps a user's utterance to a structured semantic representation, which includes the intent behind the utterance and a set of key-value pairs known as slots and values. This mapping enables dialogue systems to understand user needs and respond appropriately. For example, given the transcribed utterance "I want to cook Italian pizza", the NLU model can identify the intent as "addFilter" and the value of the slot "ingredienttype" as "italian pizza".

NLU task → Intent and Slot Classification

The NLU task can be approached as joint learning of intent classification (IC) and slot filling (SF); in general, jointly learning the two tasks is mutually beneficial. Slot labels are typically given in the widely used BIO format, as in the example below: B marks the first token of a slot span, I marks a token inside a span, and O denotes that the token does not belong to any slot.

Utterance:  I  want  to  cook  Italian           pizza
Slot:       O  O     O   O     B-ingredienttype  I-ingredienttype
Intent:     addFilter
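Given a BIO-tagged sequence like the one above, the slot values can be recovered by merging each B- tag with the I- tags that follow it. A small helper (written for this document, not part of any library) makes the decoding explicit:

```python
# Recovering slot values from BIO tags, using the example above: a B- tag
# opens a span, subsequent I- tags extend it, and O closes it.
tokens = ["I", "want", "to", "cook", "Italian", "pizza"]
tags   = ["O", "O", "O", "O", "B-ingredienttype", "I-ingredienttype"]

def bio_to_slots(tokens, tags):
    slots, name, words = {}, None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start of a new slot span
            if name:
                slots[name] = " ".join(words)
            name, words = tag[2:], [tok]
        elif tag.startswith("I-") and name:
            words.append(tok)             # continue the current span
        else:                             # an O tag closes any open span
            if name:
                slots[name] = " ".join(words)
            name, words = None, []
    if name:                              # close a span ending the sentence
        slots[name] = " ".join(words)
    return slots

print(bio_to_slots(tokens, tags))  # {'ingredienttype': 'Italian pizza'}
```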

The NLU architecture includes the following key parts:

  • Tokenization & Embeddings

    Tokenization is the process of breaking down text into smaller units, typically words or phrases, called tokens. This allows machines to process and understand the complexities of human language. Each token is represented by a k-dimensional vector, learned from large amounts of text data, enabling models to capture the meaning and relationships between words.

  • Base Model: Pre-trained BertModel (e.g., bert-base-uncased) for generating contextual embeddings. It includes two main parts: the encoder and the attention mechanism. The encoder processes the input sequence and creates contextual embeddings for each token, while the attention mechanism helps capture dependencies between words, regardless of their position in the sequence.

  • Intent Classifier: A linear layer on top of the [CLS] token output for intent prediction. The output of this layer is typically passed through a softmax, which yields a probability distribution over a predefined set of possible intents.

  • Slot Classifier: A linear layer applied to the token-level embeddings for slot tagging. It assigns a label to each token, indicating whether it represents a particular entity (e.g., a destination, date, etc.). This process is often referred to as token tagging. The output of this linear layer is typically passed through a softmax that predicts a slot label for each token.

  • Joint Learning of the Two Classifiers: During training, the model minimizes a combined loss function, which includes separate losses for intent classification and slot filling. This ensures that the model not only accurately predicts the intent but also extracts the correct slots for each token in the sentence.
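The two classification heads and their combined loss can be sketched in PyTorch. To keep the sketch self-contained (no model download), a random tensor stands in for BERT's 768-dimensional contextual embeddings; in the real architecture this tensor would come from a pre-trained BertModel such as bert-base-uncased, and the label counts below are illustrative:

```python
import torch
import torch.nn as nn

# Joint intent + slot heads as described above. A random tensor simulates
# BERT's contextual embeddings so the sketch runs without loading a model.
torch.manual_seed(0)
hidden_size, seq_len, n_intents, n_slot_labels = 768, 6, 5, 4

hidden = torch.randn(1, seq_len, hidden_size)   # simulated encoder output
intent_head = nn.Linear(hidden_size, n_intents)
slot_head = nn.Linear(hidden_size, n_slot_labels)

intent_logits = intent_head(hidden[:, 0])       # [CLS] position -> intent
slot_logits = slot_head(hidden)                 # every token -> slot label

ce = nn.CrossEntropyLoss()
intent_loss = ce(intent_logits, torch.tensor([2]))         # toy gold intent
slot_loss = ce(slot_logits.view(-1, n_slot_labels),
               torch.tensor([0, 0, 0, 0, 1, 3]))           # toy gold tags
loss = intent_loss + slot_loss                  # combined joint objective
print(intent_logits.shape, slot_logits.shape, loss.item())
```

Minimizing the summed loss trains both heads (and, in the full model, the shared encoder) at once, which is what makes the two tasks mutually beneficial.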

...