Preliminaries and Quiz Materials

Welcome to this page, which gives an overview of the required background information for this project. You have most likely already seen some of this information in other courses, but some of it will also be new to you. We expect each of you to have a decent understanding of the material listed below, as that is necessary to complete this project successfully. You will need to complete a quiz; the material presented below can be used to prepare for it.


    Git and GitHub

We use GitHub Classroom to provide you with the initial agent code. GitHub is a code hosting platform for version control and collaboration. You need to join the GitHub Classroom and use it for developing and sharing your code, and for storing and updating all the deliverables in this project. To help you do that, we point you to some basic readings and a tutorial on how to use Git. Git is a tool used by many coding teams worldwide to develop code in parallel and keep everyone’s changes in sync. Getting to know Git as part of this course will surely benefit you in the long term.

    Git

    Git is a tool used by developers to manage and track changes in their code or files. Think of it like a magical filing cabinet that remembers every version of a document or project you’ve ever worked on.

    • Imagine you’re writing a book. You make changes, but then decide you liked the way it was two days ago. Git lets you go back and see what it looked like back then.

    • It tracks what changes were made, who made them, and when.

    • It’s designed to help teams work together without accidentally overwriting each other’s work. It merges everyone’s contributions intelligently.

    • While Git works locally on your computer, it also pairs with tools like GitHub to store a backup of your work online.

    Key Features:

    • Version Control: Keeps track of all the changes to a file or project.

    • Branching: You can create “branches” to work on different features or ideas without messing up the main version.

    • Undo Mistakes: If something breaks, you can roll back to a previous version.

    GitHub

    GitHub is like an online home for Git projects. If Git is your magical filing cabinet, GitHub is the cloud storage where you can share that cabinet with others.

    • It’s a website where people can store, share, and back up their Git projects online.

    • It ensures your work is safe even if something happens to your local files (like your computer crashing).

    • It makes it easy for teams to work together because everyone can see the latest version of the project and contribute their own changes.

    • GitHub also adds tools for collaboration like:

      • Issues: A way to track bugs or tasks.

      • Pull Requests: When someone suggests a change, the team can review it and decide whether to include it.

      • Actions: Automate tasks like testing your code every time there’s a change.

    Git Commands

You can do just the basics reading, just the interactive tutorial, or both. The third link provides a more in-depth explanation of each command.

    The absolute basics (reading): https://www.simplilearn.com/tutorials/git-tutorial/git-commands.

    The basics (an interactive tutorial!) -  Learn Git Branching.

    If you want to know more (not required): Everything on Git.

    Git Merging and Conflicts

    https://www.simplilearn.com/tutorials/git-tutorial/merge-conflicts-in-git

    Git Best Practices

    One of the most important takeaways from the link below is that:

    Commits are Supposed to Be Small and Frequent

Whenever you have made a single logical change, you can commit your code. Committing frequently helps you write commit messages that are short yet informative. It also makes the project history more meaningful for anyone reading through your code.

    Best Git Practices to Follow in Teams - GeeksforGeeks



    Machine Learning Basics Relevant to This Course

    Note: We do not expect you to master all aspects of machine learning. Instead, focus on the following fundamental concepts that are directly related to this course.

    What is Machine Learning?

    Machine learning (ML) is a type of Artificial Intelligence (AI) that allows computers to learn and make decisions without being explicitly programmed. It involves feeding data into algorithms that can then identify patterns and make predictions on new data. In short, machines observe a pattern from data and attempt to imitate it in some way that can be either direct or indirect.


    Machine learning can be categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is the most commonly used type of machine learning, and it is also the focus of this course. In supervised learning, the model is trained on labeled data, meaning each input (feature) has a corresponding output (label). The objective is for the model to learn the relationship between the input variables and their associated labels. Once trained, the model can make accurate predictions or inferences on new, unseen data by applying the patterns it has learned from the labeled dataset.

Here are some examples of ML problems (written as input → output):

• Using the pixels of an image to detect the presence or absence of a cat (image pixels → cat or no cat)

    • Using the movies you’ve liked to predict other movies you may enjoy (liked movies → recommended movies)

    • Using someone’s words to predict whether they’re happy or sad (text input → happy or sad)

    • Using a raw audio file to predict a transcript of the audio (audio file → transcribed text)

    Why do we need Machine Learning?

Machine learning can learn from data to solve or predict outcomes for complex problems that cannot easily be handled with traditional programming. It enables better decision making and helps solve complex business problems efficiently. Recent advancements in AI have been propelled by machine learning, particularly its subset, deep learning. Additionally, compared to black-box agents like Dialogflow, developing our own machine-learning models provides greater control, enabling continuous improvement over time.

    How do we train and evaluate a model?

When developing a machine learning model, one of the fundamental steps is to split the data into different subsets: training, testing, and validation datasets. A short code sketch of such a split is given after the list of development stages below.

    • Training dataset: This subset is used to train the model. During training, the model learns patterns, relationships, and features from the data. The algorithm adjusts its parameters based on this dataset to minimize error and improve its predictions.

    • Test dataset: This subset is used to evaluate the performance (e.g., accuracy) of the trained model. After the model has been trained, the test dataset provides an unbiased assessment of how well the model generalizes to new, unseen data. The test data should NOT be involved in any way during training or validation.

• Validation dataset (optional): This subset is used to fine-tune the model’s hyperparameters and provide an unbiased evaluation during the tuning process. Hyperparameters are settings of the model that are not learned during training but are predefined before training starts (e.g., learning rate, number of layers in a neural network). The validation dataset helps ensure the model is tuned for optimal performance before the final evaluation on the test data.

    Therefore, there are several stages during the development of a model:

    • Training means building the model by learning patterns and parameters from the training dataset.

    • Testing involves assessing the model’s performance using the test dataset.

    • Hyperparameter tuning involves adjusting the model's hyperparameters to optimize its performance.

    • Model inference refers to using the trained model to make predictions or draw conclusions from new, unseen data. This step leverages the model’s learned patterns to apply it to real-world situations or new inputs.
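
    As a concrete illustration of these stages, here is a minimal sketch of splitting a labeled dataset into training, validation, and test subsets. It assumes scikit-learn and pandas are available; the file name and column names are hypothetical and only for illustration.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical labeled dataset with feature columns and a "label" column.
    data = pd.read_csv("labeled_data.csv")
    X, y = data.drop(columns=["label"]), data["label"]

    # First split off a held-out test set (here 20% of the data)...
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # ...then split a validation set off the remaining data for hyperparameter tuning.
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

    # The model is trained on (X_train, y_train), tuned on (X_val, y_val),
    # and evaluated once on (X_test, y_test).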

    What is Classification in Machine Learning?

    Classification is a supervised machine learning method where the model aims to predict the correct label or category for a given input. In classification, the model is trained using labeled training data, learns to identify patterns, and is then evaluated on test data to assess its performance. Once trained and evaluated, the model can be used to make predictions on new, unseen data.

    For example, a classification algorithm (classifier) might learn to predict whether a given email is spam or not, as illustrated below. This is a binary classification task, where the goal is to categorize the input data into two mutually exclusive classes. In this case, the training data is labeled with binary categories, such as "spam" and "not spam," "true" and "false," or "positive" and "negative." These labels guide the model in learning the differences between the two categories, allowing it to make accurate predictions when exposed to new data.

[Figure: a classifier labels incoming emails as “spam” or “not spam”]
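
    To make this concrete, here is a minimal sketch of a binary spam classifier, assuming scikit-learn is available; the tiny example dataset is made up purely for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy labeled training data (made up for illustration): 1 = spam, 0 = not spam.
    emails = ["win a free prize now", "meeting at noon tomorrow",
              "cheap pills discount offer", "project deadline next week"]
    labels = [1, 0, 1, 0]

    # Turn each email into a bag-of-words feature vector.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    # Train a simple binary classifier on the labeled data.
    classifier = LogisticRegression()
    classifier.fit(X, labels)

    # Predict the class of a new, unseen email.
    new_email = vectorizer.transform(["claim your free discount"])
    print(classifier.predict(new_email))  # e.g., [1] -> spam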

    What is a deep neural network and how does it work?

A deep neural network (DNN) is a machine learning model that makes decisions in a manner similar to the human brain, using processes that mimic the way biological neurons work together to identify phenomena, weigh options, and arrive at conclusions.

    Every neural network consists of layers of nodes, or artificial neurons—an input layer, one or more hidden layers, and an output layer. Each node connects to others, and has its own associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.

    There are several key concerns during the development of a DNN model:

    Defining a model architecture:

    Let's define our first deep neural network (DNN) with a single-layer, fully connected neural network, and a 3-dimensional input. In a fully connected layer, each input is connected to every output, ensuring comprehensive interaction between neurons.

Defining a loss function: Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how “good” the model parameters are). We will use the cross-entropy (CE) loss, which is the standard choice for classification.

    Optimizing the loss function: During training, the goal is to find the “best” values of the model parameters (weights and bias) that minimize the loss function based on the training dataset. This process is known as optimization.
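
    Here is a minimal PyTorch sketch of these three steps, assuming PyTorch is installed; the layer sizes, number of classes, and batch values are chosen only for illustration.

    import torch
    import torch.nn as nn

    # Architecture: a single fully connected layer mapping a 3-dimensional input
    # to scores for, say, 2 output classes.
    model = nn.Linear(3, 2)

    # Loss function: cross-entropy loss for classification.
    loss_fn = nn.CrossEntropyLoss()

    # Optimizer: stochastic gradient descent adjusts the weights and bias.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # One training step on a (made-up) batch of 4 examples.
    inputs = torch.randn(4, 3)            # 4 examples, 3 features each
    targets = torch.tensor([0, 1, 1, 0])  # their class labels

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()   # compute gradients of the loss w.r.t. the parameters
    optimizer.step()  # update the parameters to reduce the loss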


    A General Pipeline of Task-Oriented Spoken Dialogue Systems

Spoken dialogue systems (SDSs) are the most prominent component of today’s virtual personal assistants, such as Microsoft’s Cortana, Apple’s Siri, Amazon Alexa, Google Assistant, and Facebook’s M. Unlike chitchat systems, task-oriented SDSs aim to assist users with a specific goal, for example, recommending a recipe or booking a hotel.
    A classical pipeline architecture of a task-oriented spoken dialogue system includes the following key components:

    • Automatic Speech Recognition (ASR) - Converts spoken language into textual transcript.

    • Natural Language Understanding (NLU) - Interprets and extracts meaning from the transcript.

    • Dialogue Management (DM) - Manages the flow of conversation and determines the system’s response.

    • Natural Language Generation (NLG) - Constructs responses in natural language.

    • Text to Speech (TTS) - Converts the generated text into spoken output.

In this project, we will focus on building a simple pipeline that integrates ASR followed by an NLU component. We will use an existing ASR model (e.g., Whisper) for inference/prediction only (no training), while improving the performance of the NLU model (e.g., BERT) by training it on conversational data collected in the previous course.

    By the end of the project, you will learn how to:

    • Construct a basic dialogue pipeline.

    • Train and improve individual components, specifically the NLU model.

    This hands-on approach will provide insight into developing and refining key elements of a dialogue system.

ASR with Whisper

    Automatic Speech Recognition (ASR) is a key component in the pipeline architecture, which converts spoken language into text. It enables machines to interpret and transcribe human speech, allowing for seamless interaction between users and applications through voice commands.

    Whisper is a commonly used general-purpose speech recognition model developed by OpenAI. It is trained on a large dataset of diverse audio and is also a multi-tasking model that can perform multi-lingual speech recognition, speech translation, and language identification.
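
    Here is a minimal sketch of using Whisper for inference, assuming the openai-whisper Python package is installed; the audio file name is hypothetical.

    import whisper

    # Load a pre-trained Whisper model (inference only, no training needed).
    model = whisper.load_model("base")

    # Transcribe a (hypothetical) audio file to text.
    result = model.transcribe("audio.wav")
    print(result["text"])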

    NLU with BERT

Unlike chitchat, task-oriented dialogue is constrained by a dialogue ontology, which defines all possible intents, slots, and their corresponding candidate values in specific domains. The NLU component maps a user’s utterance to a structured semantic representation, which includes the intent behind the utterance and a set of key-value pairs known as slots and values. This mapping enables dialogue systems to understand user needs and respond appropriately. For example, given the transcribed utterance “I want to cook Italian pizza”, the NLU model can identify the intent as “addFilter” and the value of the slot “ingredienttype” as “italian pizza”.

    NLU task → Intent and Slot Classification

The NLU task can be approached as joint learning of intent classification (IC) and slot filling (SF), with the slot labels typically formatted in the widely used BIO format, as shown below. In general, joint learning of intent and slot classification is mutually beneficial for both tasks. Here is an example of SF and IC output for an utterance. Slot labels are in BIO format: B indicates the start of a slot span, I the inside of a span, and O denotes that the word does not belong to any slot.

Utterance    I    want    to    cook    Italian             pizza
    Slot         O    O       O     O       B-ingredienttype    I-ingredienttype
    Intent       addFilter

The NLU architecture includes the following key parts; a code sketch of such a joint model follows the list:

    • Tokenization & Embeddings

      Tokenization is the process of breaking down text into smaller units, typically words or phrases, called tokens. This allows machines to process and understand the complexities of human language. Each token is represented by a k-dimensional vector, learned from large amounts of text data, enabling models to capture the meaning and relationships between words.

• Base Model: Pre-trained BertModel (e.g., bert-base-uncased) for generating contextual embeddings. It includes two main parts: the encoder and the attention mechanism. The encoder processes the input sequence and creates contextual embeddings for each token, while the attention mechanism helps capture dependencies between words, regardless of their position in the sequence.

    • Intent Classifier: A linear layer on top of the [CLS] token output for intent prediction. The final output of this layer is typically a softmax function, which predicts the probability distribution over a predefined set of possible intents.

    • Slot Classifier: A linear layer applied to the token-level embeddings for slot tagging. It assigns a label to each token, indicating whether it represents a particular entity (e.g., a destination, date, etc). This process is often referred to as token tagging. The output of this linear layer is typically a softmax layer that predicts slot labels for each token.

    • Joint Learning of the Two Classifiers: During training, the model minimizes a combined loss function, which includes separate losses for intent classification and slot filling. This ensures that the model not only accurately predicts the intent but also extracts the correct slots for each token in the sentence.
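
    Below is a minimal sketch of such a joint model in PyTorch, assuming the transformers library is installed; the class name and layer sizes are illustrative only.

    import torch.nn as nn
    from transformers import BertModel

    class JointBertNLU(nn.Module):
        """Illustrative joint intent + slot classifier on top of BERT."""
        def __init__(self, num_intents, num_slots):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            hidden_size = self.bert.config.hidden_size
            self.intent_classifier = nn.Linear(hidden_size, num_intents)
            self.slot_classifier = nn.Linear(hidden_size, num_slots)

        def forward(self, input_ids, attention_mask):
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            # Intent prediction from the [CLS] token representation.
            intent_logits = self.intent_classifier(outputs.pooler_output)
            # Slot prediction from every token's contextual embedding.
            slot_logits = self.slot_classifier(outputs.last_hidden_state)
            return intent_logits, slot_logits

    During training, the intent and slot cross-entropy losses would simply be added (loss = intent_loss + slot_loss) so that both classifiers are optimized jointly.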

    Pre-training and fine-tuning BERT

BERT, Bidirectional Encoder Representations from Transformers, is a widely used transformer-based language model designed for various natural language processing tasks, including classification. Its training consists of two stages:

    • During pre-training, BERT is trained on a large corpus of English text in a self-supervised manner. This means it is trained on large-scale, raw, unlabeled text without human annotations, using an automatic process to generate input-output pairs from the text.

• During fine-tuning, BERT is first initialized with its pre-trained parameters, and then all parameters are fine-tuned using labeled data from downstream tasks, allowing it to adapt to specific applications (see the sketch after this list).
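
    Here is a minimal sketch of such a fine-tuning loop, continuing the JointBertNLU sketch above. It assumes a PyTorch DataLoader (train_loader) that yields labeled batches; the batch field names and sizes are hypothetical.

    import torch
    import torch.nn as nn

    model = JointBertNLU(num_intents=10, num_slots=20)  # illustrative sizes
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    intent_loss_fn = nn.CrossEntropyLoss()
    slot_loss_fn = nn.CrossEntropyLoss()

    for batch in train_loader:  # train_loader: a DataLoader over labeled utterances
        optimizer.zero_grad()
        intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
        # Combined loss: intent classification + slot filling over all tokens.
        loss = intent_loss_fn(intent_logits, batch["intent_labels"]) + \
               slot_loss_fn(slot_logits.view(-1, slot_logits.size(-1)),
                            batch["slot_labels"].view(-1))
        loss.backward()
        optimizer.step()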


    LLMs on Hugging Face

    Hugging Face is an AI community and platform that offers an easy-to-use interface for accessing and utilizing pretrained large language models (LLMs) like BERT released by various organizations and researchers. Here is a simple example of how to use this model to get the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained("bert-base-uncased")
    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)

    You can easily switch to different models by changing the model name, e.g., bert-large-uncased, which has 340 million parameters. Generally, models with more parameters tend to deliver better performance.


HTML, JavaScript, and Bootstrap

You will be developing a few basic web pages to provide some visual support to a user while they are conversing with your agent. We assume you are familiar with the basics of HTML, the markup language for developing a webpage. If not, please check out this tutorial to get you started: https://www.w3schools.com/html/default.asp.

On top of HTML, we use Bootstrap 4 to facilitate the development of a webpage. The main purpose of this visual support is twofold: (1) to reduce the user’s cognitive load (the amount of information working memory needs to process at any given time) and (2) to indicate the progress made on the task thus far. A user may not be able to remember all the preferences and/or constraints on recipes they have selected so far, and a system that requires them to do so would likely not be experienced as very user-friendly. It is also nice to show a user how much progress has been made in finding a recipe they like. A simple measure of progress for our recipe recommendation agent is to show how many recipes still match the user’s preferences.

     

    Bootstrap is a powerful, open-source front-end framework for web development. Many of the Bootstrap components can be used by a MARBEL agent to create and display webpages in a Chrome browser using the Social Interaction Cloud infrastructure. We first list a few of Bootstrap’s key features:

    1. Responsive Design: Bootstrap's grid system and pre-designed components enable easy creation of responsive websites.

2. HTML, CSS, and JS Components: Offers a wide range of reusable components like buttons and navigation bars.

    3. Customization: Allows for extensive customization.

    4. Community and Documentation: Backed by a strong community and comprehensive documentation.

    5. Mobile-First Approach: Prioritizes mobile devices in design strategies.

    This framework simplifies web development, making it accessible for beginners while still providing a powerful tool for more experienced developers.

To gain an understanding of Bootstrap, this tutorial will be very useful: https://www.w3schools.com/bootstrap4/default.asp. To familiarize yourself with some of the basic components of Bootstrap, take a look at the first few items in the tutorial. We recommend you read at least up to the item on buttons: https://www.w3schools.com/bootstrap4/bootstrap_buttons.asp. The tutorial will also be useful for later reference, for example to look up how you can change colors or use a progress bar.


    Prolog

You will develop your recipe recommendation agent using MARBEL and SWI-Prolog. The MARBEL agent implements a dialog management engine that you will use. You do not need to change this agent, but you are allowed to modify it if you like. The focus will be mostly on using Prolog to provide the agent with the knowledge it needs and to make it smarter by providing it with some logic related to its recipe recommendation task.

Prolog is a rule-based programming language based on symbolic logic. It is commonly used in Artificial Intelligence and Computational Linguistics. To understand Prolog, you should familiarize yourself with its key concepts and structures using the book at https://www.let.rug.nl/bos/lpn//lpnpage.php?pageid=online. This book covers fundamental topics like facts, rules, queries, unification, proof search, recursion, lists, arithmetic, definite clause grammars, and more. It also delves into more advanced topics such as cuts and negation. We briefly summarize some of the core concepts here for your convenience.

    • Logic-Based Programming: Prolog is fundamentally different from procedural languages like C or Python. It is based on formal logic, making it well-suited for tasks that involve rules and constraints, such as solving puzzles or processing natural language.

    • Facts, Rules, and Recursion: The core of Prolog programming involves defining facts and rules. Facts are basic statements about objects and/or their relationships. Rules define relationships between facts using basic logical relations such as conjunction, disjunction, and negation. The fact that rules can be recursive is what gives Prolog its power as a programming language. Recursion can be used, for example, for iterating over frequently used data structures in Prolog such as lists.

    • Lists and Arithmetic: Lists are fundamental data structures in Prolog. Prolog offers a range of built-in predicates for list manipulation. It also provides built-in support for arithmetic operations. Because Prolog’s basic form of computation is based on term matching, which does not support efficiently doing math, care must be taken to use the right operators when handling numbers in Prolog.

• Pattern Matching and Unification: Prolog’s core form of computation consists of pattern matching with the aim of unifying Prolog terms. Unification of two terms is a fundamental operation in Prolog which, if it succeeds, returns substitutions for Prolog variables. When these substitutions are applied to the terms (and the variables are instantiated), the result is two identical terms.

    • Backtracking: Prolog uses backtracking to evaluate the rules in a program to find solutions to problems. If one trace (part of a search tree) fails, Prolog automatically backtracks to find and try alternative options that have not yet been explored to continue searching for a solution.

    • Advanced Features: Prolog provides advanced features like the cut operator. This operator can be used for controlling the backtracking process, mainly to increase the efficiency of Prolog programs.

    • Definite Clause Grammars (DCGs): These are used in Prolog for parsing and generating natural language constructs, making them a powerful tool for language-related applications.

    • Applications: Prolog is widely used in AI for tasks such as expert systems, natural language processing, and theorem proving, owing to its ability to handle complex symbolic information and logical constructs efficiently.
