Preparing the Dataset

There are function(s) to be completed in this section!

Because of how the imports are set up, and because the intent and slot classifier is part of the social-interaction-cloud Python package, any changes you make require reinstalling the package via pip install -e . to take effect. If your changes do not seem to take effect, or relative imports are not working, run pip install . instead.

Why do we need to prepare the dataset?

Machine learning models can only process numerical data or matrices (tensors) as input. Preparing the dataset ensures that raw data is transformed into a numerical format the model can understand and learn from.

Preprocessing Steps

Raw Data Before Preprocessing

The dataset begins as a collection of raw examples in JSON format (see train.json) where each entry includes:

  • A unique ID.

  • The text of the user's input.

  • The intent of the input.

  • A dictionary of slots, mapping slot types to their values.

Example Raw Data:

{ "id": "st041", "text": "I’d like a meal that’s fast to prepare.", "intent": "addFilter", "slots": { "duration": "fast" } }

Steps in Preprocessing

  1. Tokenization:

    • The text is broken into smaller units (tokens) using the BERT tokenizer:

      from transformers import BertTokenizer
      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # uncased checkpoint assumed
      tokens = tokenizer.tokenize("I’d like a meal that’s fast to prepare.")
    • Output:

      ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']
  2. BIO Slot Label Encoding:

    • Slot values are matched with their positions in the tokenized text.

    • BIO-format labels are generated:

      • B-{slot}: Beginning of the slot value.

      • I-{slot}: Inside the slot value.

      • O: Outside any slot.

    Example:

    • Slot Definition: "duration": "fast"

    • Output BIO Tags:

      ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-duration', 'O', 'O', 'O']

  3. Intent Encoding:

    • The intent is mapped to a numerical label using the intent encoder.

    • Output: an integer class index for "addFilter" (the exact value depends on how the encoder was fitted).

  4. Padding and Truncation:

    • Sequences are padded or truncated to a fixed length (max_length):

      • Token IDs are padded with zeros.

      • BIO tags are padded with O.

    • Example:

      • Original Tokens: ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']

      • Padded Tokens: ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.', '[PAD]']

      • BIO Tags are similarly padded to match the length.
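
To make steps 2 through 4 concrete, here is a minimal, self-contained sketch of the same transformations. It is an illustration only, not the package's actual implementation: the intent2id mapping, the max_length of 16, and the whole-token slot matching are assumptions made for this example.

    tokens = ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's',
              'fast', 'to', 'prepare', '.']
    slots = {"duration": "fast"}
    intent2id = {"addFilter": 0, "delFilter": 1}  # assumed mapping
    max_length = 16                               # assumed fixed length

    # Step 2: BIO slot label encoding. Tag every token O, then mark
    # the span of each slot value (matched here by whole tokens).
    bio_tags = ['O'] * len(tokens)
    for slot_type, slot_value in slots.items():
        value_tokens = slot_value.lower().split()
        for i in range(len(tokens) - len(value_tokens) + 1):
            if tokens[i:i + len(value_tokens)] == value_tokens:
                bio_tags[i] = f'B-{slot_type}'
                for j in range(i + 1, i + len(value_tokens)):
                    bio_tags[j] = f'I-{slot_type}'
                break

    # Step 3: intent encoding.
    intent_label = intent2id["addFilter"]

    # Step 4: truncate, build the attention mask, then pad.
    tokens = tokens[:max_length]
    bio_tags = bio_tags[:max_length]
    attention_mask = [1] * len(tokens) + [0] * (max_length - len(tokens))
    bio_tags = bio_tags + ['O'] * (max_length - len(bio_tags))
    tokens = tokens + ['[PAD]'] * (max_length - len(tokens))

    # In the real pipeline, tokens and BIO tags are additionally mapped
    # to integer IDs (e.g., via tokenizer.convert_tokens_to_ids).

Running this on the example sentence places B-duration at the position of 'fast', leaves every other tag O, and appends three [PAD] tokens and three O tags to reach the fixed length.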


Processed Data After Preprocessing

After preprocessing, each example is transformed into a structured format, including:

  1. input_ids: Tokenized input converted to numerical IDs.

  2. attention_mask: Mask indicating which tokens are real (1) and which are padding (0).

  3. intent_label: Encoded intent label.

  4. slot_labels: Encoded BIO-format slot labels.

Example Processed Data:
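
The exact numbers depend on the tokenizer vocabulary and on how the encoders were fitted; all values below are illustrative placeholders:

    {
        "input_ids":      [1045, 1005, 1040, ...],   # token IDs, zero-padded to max_length
        "attention_mask": [1, 1, 1, ..., 0],         # 1 = real token, 0 = padding
        "intent_label":   2,                         # placeholder index for "addFilter"
        "slot_labels":    [0, 0, 0, ..., 3, ..., 0]  # BIO tag IDs; the non-zero entry marks B-duration
    }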


Summary of Preprocessing

  1. Before Processing:

    • Raw data includes plain text, an intent label, and slots with values.

    • Example: "text": "I’d like a meal that’s fast to prepare."

  2. After Processing:

    • Tokenized, encoded, padded, and structured data ready for model input.

    • Example includes token IDs, attention mask, intent label, and BIO slot labels.

This transformation ensures the data is in the correct format for the BERTNLUModel and PyTorch pipeline.

Distribution

Objective

Analyzing the distribution of intents and slots in your dataset helps you understand its structure, identify potential issues, and ensure the model learns effectively across all labels. In this section, you’ll compute and interpret the frequency of intents and slots for both training and testing datasets.


Implementation needed for this section!

How to Analyze Distribution

  1. Write a Distribution Analysis Function

    • See run_train_test.py for the incomplete function.

    • This function should calculate how often each intent and slot appears in the dataset.

    • Think about what fields in the dataset you’ll need:

      • Intent: Directly accessible as example['intent'].

      • Slots: Comes from example['slots'] but might need to be flattened into a single list.

    • Use a counting method to track the frequency of intents and slots.

    Tips:

    • Use tools like collections.Counter for efficient counting (a minimal sketch follows this list).

    • Ensure your function handles edge cases, such as examples without any slots.

  2. Run the Function on Training and Testing Data

    • Call your distribution function for both datasets.

    • Print the results to inspect the frequency of each intent and slot.

    Tips:

    • Compare the distributions of training and test datasets.

    • Look for imbalances or unexpected gaps. For example:

      • Are certain intents or slots underrepresented or missing?

      • Does the test set mirror the training set?

  3. Interpret the Results

    • Once you have the distributions, analyze them to answer key questions:

      • Which intents or slots are the most frequent? The least?

      • Are there any imbalances that might cause the model to focus too heavily on common labels?

      • Are rare intents or slots important for the system’s performance?

    • Reflect on how these observations might affect training.

    Tips:

    • If rare intents or slots are crucial, consider strategies like data augmentation or using weighted loss functions during training.

    • If the test distribution doesn’t match the training distribution, think about how this might affect evaluation.
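
As a starting point, here is a minimal sketch of such a function. The field names follow the raw JSON format shown earlier; the actual signature expected in run_train_test.py may differ, and the second sample example is invented purely for illustration.

    from collections import Counter

    def compute_distribution(dataset):
        """Count how often each intent and each slot type appears."""
        intent_counts = Counter()
        slot_counts = Counter()
        for example in dataset:
            intent_counts[example['intent']] += 1
            # .get() guards against examples without any slots.
            for slot_type in example.get('slots', {}):
                slot_counts[slot_type] += 1
        return intent_counts, slot_counts

    # Quick correctness check on a tiny dataset:
    sample = [
        {"id": "st041", "text": "I’d like a meal that’s fast to prepare.",
         "intent": "addFilter", "slots": {"duration": "fast"}},
        {"id": "st042", "text": "Something vegetarian, please.",
         "intent": "addFilter", "slots": {"diet": "vegetarian"}},
    ]
    intents, slot_types = compute_distribution(sample)
    print(intents)     # Counter({'addFilter': 2})
    print(slot_types)  # Counter({'duration': 1, 'diet': 1})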


Hints for Implementation

  • Think about how to efficiently loop through the dataset:

    • For intents, you can directly extract them from the dataset examples.

    • For slots, remember to handle nested structures since slots are stored as dictionaries.

  • Use your existing knowledge of Python tools to count and organize results:

    • A Counter object can help you group and tally items.

  • Make sure to test your function on small, simple datasets before running it on the full dataset to ensure correctness.



By following these steps and reflecting on the results, you’ll gain a deeper understanding of your dataset and its potential challenges. This analysis is crucial for making informed decisions during model training and evaluation.