...
This hands-on approach will provide insight into developing and refining key elements of a dialogue system.
ASR with Whisper
The ASR (automatic speech recognition) component converts spoken language into text. It enables machines to interpret and transcribe human speech, allowing for seamless interaction between users and applications through voice commands.
Whisper is a widely used general-purpose speech recognition model developed by OpenAI. It is trained on a large dataset of diverse audio and is a multi-tasking model that can perform multilingual speech recognition, speech translation, and language identification.
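As a minimal sketch (assuming the openai-whisper package is installed and that a recording named audio.wav exists; the file name is just a placeholder), transcription can look like this:

```python
import whisper  # the openai-whisper package

# Load one of the pre-trained Whisper checkpoints (e.g., "base");
# larger checkpoints are more accurate but slower.
model = whisper.load_model("base")

# Transcribe a recorded utterance; "audio.wav" is an illustrative file name.
result = model.transcribe("audio.wav")
print(result["text"])
```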
NLU with BERT
Unlike open-domain dialogue (e.g., chitchat), task-oriented dialogue is restricted by a dialogue ontology, which defines all possible intents, slots, and their corresponding candidate values for specific domains. The NLU component maps a user's utterance to a structured semantic representation, which includes the intent behind the utterance and a set of key-value pairs known as slots and values. For example, given the transcribed utterance "Recommend a restaurant at China Town", the NLU model can identify the intent as "inform" and the value of the slot "destination" as "China Town". This mapping enables dialogue systems to understand user needs and respond appropriately.
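As an illustration, such a structured representation can be stored as a simple dictionary (the exact format and key names below are just one possible convention, not one prescribed here):

```python
# Illustrative structured NLU output for "Recommend a restaurant at China Town".
nlu_output = {
    "intent": "inform",
    "slots": {"destination": "China Town"},
}
```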
NLU task → Intent and Slot Classification
The NLU task can be approached as joint learning of intent classification (IC) and slot filling (SF), with the slot labels typically formatted in the widely used BIO format, as shown below. In general, jointly learning intent classification and slot filling is mutually beneficial, since the two tasks share information about the utterance (see https://arxiv.org/abs/2011.00564).
Utterance | Recommend | a | restaurant | at | China | Town |
---|---|---|---|---|---|---|
Slot | O | O | O | O | B-destination | I-destination |
Intent | Inform | | | | | |
Example of SF and IC output for an utterance. Slot labels are in BIO format: B indicates the start of a slot span, I the inside of a span, and O denotes that the word does not belong to any slot.
The NLU architecture includes the following key parts:
- Base Model: a pre-trained BertModel (e.g., bert-base-uncased) that generates contextual embeddings. The encoder processes the input sequence and produces a contextual embedding for each token, and its self-attention mechanism captures dependencies between words regardless of their position in the sequence.
- Intent Classifier: a linear layer on top of the [CLS] token output for intent prediction. Its output is typically passed through a softmax, which yields a probability distribution over a predefined set of possible intents.
- Slot Classifier: a linear layer applied to the token-level embeddings for slot tagging. It assigns a label to each token, indicating whether the token represents a particular entity (e.g., a location, date, or other domain-specific information); this process is often referred to as token tagging. Its output is likewise passed through a softmax that predicts a slot label for each token.
- Joint Learning of the Two Classifiers: during training, the model minimizes a combined loss function that includes separate losses for intent classification and slot filling. This ensures that the model not only accurately predicts the intent but also extracts the correct slot label for each token in the sentence (a sketch of such a model follows below).
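The following is a minimal PyTorch sketch of such a joint model, written against the Hugging Face transformers BertModel API. The class name JointBertNLU and the label counts are hypothetical and would depend on your dialogue ontology; it is a sketch, not a prescribed implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class JointBertNLU(nn.Module):
    """Joint intent classification and slot filling on top of a pre-trained BERT encoder."""

    def __init__(self, num_intents, num_slot_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden_size = self.bert.config.hidden_size
        self.intent_classifier = nn.Linear(hidden_size, num_intents)    # uses the [CLS] embedding
        self.slot_classifier = nn.Linear(hidden_size, num_slot_labels)  # applied to every token

    def forward(self, input_ids, attention_mask, intent_labels=None, slot_labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]                 # [CLS] token representation
        intent_logits = self.intent_classifier(cls_embedding)           # (batch, num_intents)
        slot_logits = self.slot_classifier(outputs.last_hidden_state)   # (batch, seq_len, num_slot_labels)

        loss = None
        if intent_labels is not None and slot_labels is not None:
            loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 masks padding / special tokens
            intent_loss = loss_fn(intent_logits, intent_labels)
            slot_loss = loss_fn(slot_logits.view(-1, slot_logits.size(-1)), slot_labels.view(-1))
            loss = intent_loss + slot_loss                    # combined joint objective
        return loss, intent_logits, slot_logits
```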
Pre-training and fine-tuning BERT
BERT (Bidirectional Encoder Representations from Transformers) is a widely used transformer-based language model designed for various natural language processing tasks, including classification. Its training consists of two procedures:
During pre-training, BERT is trained on a large corpus of English text in a self-supervised manner. This means it is trained on large-scale, raw, unlabeled text without human annotations, using an automatic process to generate input-output pairs from the text.
During fine-tuning, BERT is first initialized with its pre-trained parameters, and then all parameters are fine-tuned using labeled data from downstream tasks, allowing it to adapt to specific applications.
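As a minimal sketch of this fine-tuning step for intent classification, using the off-the-shelf BertForSequenceClassification head (the toy training examples, intent ids, and hyperparameters below are purely illustrative):

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

# Toy labeled data: (utterance, intent id). A real dataset would come from your ontology.
train_data = [
    ("Recommend a restaurant at China Town", 0),  # intent 0 = inform
    ("Thanks, goodbye", 1),                       # intent 1 = goodbye
]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)  # all BERT parameters are updated, not just the new head

model.train()
for epoch in range(3):
    for text, label in train_data:
        batch = tokenizer(text, return_tensors="pt")
        output = model(**batch, labels=torch.tensor([label]))
        output.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```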
...
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
LLMs on Hugging Face
Hugging Face is an AI community and platform that offers an easy-to-use interface for accessing and utilizing pretrained large language models (LLMs) such as BERT, released by various organizations and researchers. Here is a simple example of how to use bert-base-uncased to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
https://huggingface.co/google-bert/bert-base-uncased
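For bert-base-uncased, output.last_hidden_state is a tensor of shape (batch_size, sequence_length, 768) holding one contextual embedding per input token, while output.pooler_output is a single 768-dimensional vector that is often used as a sentence-level representation for classification.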
HTML and Bootstrap
You will be developing a few basic web pages to provide visual support to a user while they are conversing with your agent. We assume you are familiar with basic HTML, the markup language for developing web pages. If not, please check out this tutorial to get started: https://www.w3schools.com/html/default.asp
...