Content Comparison


Some functions in this section are left for you to complete.

Preprocessing Steps

Raw Data Before Preprocessing

...

Code Block
{ "id": "st041", "text": "I’d like a meal that’s fast to prepare.", "intent": "addFilter", "slots": { "shortTimeKeyWordduration": "fast" } }

...

Steps in Preprocessing

Tokenization:

The text is broken into smaller units (tokens) using the BERT tokenizer:
Code Block
tokens = tokenizer.tokenize("I’d like a meal that’s fast to prepare.")

Output:

Code Block
['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']

BIO Slot Label Encoding:
- Each slot value is located in the tokenized text to find the token positions it covers.
- BIO-format labels are generated:
  - B-{slot}: Beginning of the slot value.
  - I-{slot}: Inside the slot value.
  - O: Outside any slot.
Example:
- Slot Definition: "shortTimeKeyWordduration": "fast"
- Output BIO Tags:
  Code Block
  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-shortTimeKeyWordduration', 'O', 'O', 'O']
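The matching step can be sketched as follows. This is a minimal illustration, not the function you are asked to complete: the helper name `bio_tags` is hypothetical, it assumes the slot value has already been tokenized the same way as the text, and it only tags the first occurrence of each slot value.

```python
from typing import Dict, List


def bio_tags(tokens: List[str], slot_tokens: Dict[str, List[str]]) -> List[str]:
    """Return one BIO tag per token (hypothetical helper, for illustration)."""
    tags = ["O"] * len(tokens)
    for slot, value in slot_tokens.items():
        n = len(value)
        if n == 0:
            continue
        # Slide over the text and stop at the first exact match of the slot value.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == value:
                tags[start] = f"B-{slot}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{slot}"
                break
    return tags


tokens = ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's',
          'fast', 'to', 'prepare', '.']
# Assumption: the slot value "fast" tokenizes to the single token 'fast'.
tags = bio_tags(tokens, {"shortTimeKeyWordduration": ["fast"]})
print(tags)
```

A slot value that spans several tokens would receive one B- tag followed by I- tags for the remaining tokens.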

Intent Encoding:

The intent is mapped to a numerical label using the intent encoder:
Code Block
intent_label = intent_label_encoder.transform(["addFilter"])[0]

Output:

Code Block
0 # (Example numerical label for "addFilter")
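How the encoder arrives at a number can be illustrated with a dependency-free stand-in. This sketch assumes `intent_label_encoder` behaves like scikit-learn's `LabelEncoder`, which assigns indices to the sorted unique labels; the function name `fit_intent_encoder` and the second intent `removeFilter` are hypothetical and used only for the example.

```python
from typing import Dict, List


def fit_intent_encoder(intents: List[str]) -> Dict[str, int]:
    """Map each distinct intent to an index, in sorted order (illustrative stand-in)."""
    return {intent: idx for idx, intent in enumerate(sorted(set(intents)))}


# "removeFilter" is a made-up intent so the mapping has more than one class.
training_intents = ["removeFilter", "addFilter", "addFilter"]
intent_to_id = fit_intent_encoder(training_intents)

intent_label = intent_to_id["addFilter"]
print(intent_label)
```

The actual number depends on which intents appear in the data the encoder was fitted on.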

Padding and Truncation:
- Sequences are padded or truncated to a fixed length (max_length):
  - Token IDs are padded with zeros.
  - BIO tags are padded with O.
- Example (with max_length = 14, so one padding token is added):
  - Original Tokens (13): ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']
  - Padded Tokens (14): ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.', '[PAD]']
  - BIO Tags are padded with O in the same way so that both sequences have the same length.
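The padding and truncation rule can be sketched in a few lines. The helper name `pad_or_truncate` and the value `max_length = 14` are assumptions chosen to match the example; special tokens such as [CLS] and [SEP] are left out of this sketch.

```python
from typing import List, TypeVar

T = TypeVar("T")


def pad_or_truncate(seq: List[T], max_length: int, pad_value: T) -> List[T]:
    """Cut seq to max_length, or extend it with pad_value up to max_length."""
    clipped = list(seq[:max_length])
    return clipped + [pad_value] * (max_length - len(clipped))


tokens = ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's',
          'fast', 'to', 'prepare', '.']
tags = ['O'] * 9 + ['B-shortTimeKeyWordduration'] + ['O'] * 3

max_length = 14  # assumed value for illustration
padded_tokens = pad_or_truncate(tokens, max_length, '[PAD]')
padded_tags = pad_or_truncate(tags, max_length, 'O')

# Truncation: anything beyond max_length is dropped.
short_tokens = pad_or_truncate(tokens, 5, '[PAD]')
print(padded_tokens)
print(padded_tags)
print(short_tokens)
```

The same helper applies to token IDs by passing 0 as the padding value.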

...

Think about how to efficiently loop through the dataset:
- For intents, you can directly extract them from the dataset examples.
- For slots, remember to handle nested structures since slots are stored as dictionaries.
Use your existing knowledge of Python tools to count and organize results:
- A Counter object can help you group and tally items.
Make sure to test your function on small, simple datasets before running it on the full dataset to ensure correctness.
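As a starting point, here is one possible shape for such a counting function, tried on a tiny dataset first. The function name `count_intents_and_slots` is hypothetical, and every example except st041 is made up for the test; your own solution may be organized differently.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_intents_and_slots(examples: List[Dict]) -> Tuple[Counter, Counter]:
    """Tally how often each intent and each slot name occurs."""
    intent_counts: Counter = Counter()
    slot_counts: Counter = Counter()
    for example in examples:
        intent_counts[example["intent"]] += 1
        # Slots are stored as a dictionary, so count its keys (the slot names).
        slot_counts.update(example.get("slots", {}).keys())
    return intent_counts, slot_counts


toy_dataset = [
    {"id": "st041", "text": "I’d like a meal that’s fast to prepare.",
     "intent": "addFilter", "slots": {"shortTimeKeyWordduration": "fast"}},
    # The two examples below are invented for this small test.
    {"id": "toy1", "text": "Something quick, please.",
     "intent": "addFilter", "slots": {"shortTimeKeyWordduration": "quick"}},
    {"id": "toy2", "text": "Never mind.", "intent": "cancel", "slots": {}},
]

intent_counts, slot_counts = count_intents_and_slots(toy_dataset)
print(intent_counts.most_common())
print(slot_counts.most_common())
```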

...

Note

Reflection Questions

After analyzing the distributions, think about:

Dataset Balance:
- Are the distributions skewed? How might this affect the model?
- Are all intents and slots well-represented, or are some missing?
Test Dataset Representativeness:
- Does the test set reflect the training set’s distribution?
- If not, how might this impact evaluation?
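One way to approach these questions is to put the relative frequencies of the two splits side by side. The sketch below is only an illustration: the function names and the label lists are hypothetical, and it works on any list of intent or slot labels.

```python
from collections import Counter
from typing import Dict, List, Tuple


def relative_frequencies(labels: List[str]) -> Dict[str, float]:
    """Share of each label in the list (empty input gives an empty dict)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def compare_splits(train_labels: List[str],
                   test_labels: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {label: (train share, test share)}; 0.0 marks a missing label."""
    train = relative_frequencies(train_labels)
    test = relative_frequencies(test_labels)
    return {label: (train.get(label, 0.0), test.get(label, 0.0))
            for label in sorted(set(train) | set(test))}


# Made-up label lists for illustration only.
rows = compare_splits(["addFilter", "addFilter", "removeFilter"], ["addFilter"])
for label, (train_share, test_share) in rows.items():
    print(f"{label}: train={train_share:.2f} test={test_share:.2f}")
```

A label with a share of 0.0 in one split is absent from it, and a large gap between the two shares suggests the test set does not mirror the training set for that label.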

...

By following these steps and reflecting on the results, you’ll gain a deeper understanding of your dataset and its potential challenges. This analysis is crucial for making informed decisions during model training and evaluation.

...

Version	Old Version 3	New Version 4
Changes made by	Gardner, I.V. (Bella)	Gardner, I.V. (Bella)
Saved on	Dec 24, 2024	Dec 30, 2024
