There are function(s) to be completed in this section!
Preprocessing Steps
Raw Data Before Preprocessing
...
    {
      "id": "st041",
      "text": "I’d like a meal that’s fast to prepare.",
      "intent": "addFilter",
      "slots": { "shortTimeKeyWordduration": "fast" }
    }
...
Steps in Preprocessing
Tokenization:
The text is broken into smaller units (tokens) using the BERT tokenizer:
    tokens = tokenizer.tokenize("I’d like a meal that’s fast to prepare.")
Output:
    ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']
BIO Slot Label Encoding:
Slot values are matched with their positions in the tokenized text.
BIO-format labels are generated:
B-{slot}: Beginning of the slot value.
I-{slot}: Inside the slot value.
O: Outside any slot.
Example:
Slot Definition:
"shortTimeKeyWordduration": "fast"
Output BIO Tags:
    ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-shortTimeKeyWordduration', 'O', 'O', 'O']
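The matching step can be sketched roughly as follows. This is a minimal illustration, not the course's reference solution: `simple_tokenize` is a hypothetical regex stand-in for the BERT tokenizer (it lowercases and splits off punctuation but does no WordPiece splitting), and `bio_tags` is an assumed helper name. It tags only the first exact match of each slot value.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for the BERT tokenizer: lowercase the text, then
    # split into word runs and single punctuation marks (no WordPiece).
    return re.findall(r"\w+|[^\w\s]", text.lower())

def bio_tags(tokens, slots, tokenize=simple_tokenize):
    # Start with every token outside any slot.
    tags = ["O"] * len(tokens)
    for slot_name, slot_value in slots.items():
        value_tokens = tokenize(slot_value)
        n = len(value_tokens)
        if n == 0:
            continue
        # Tag the first span whose tokens match the slot value exactly.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == value_tokens:
                tags[start] = f"B-{slot_name}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{slot_name}"
                break
    return tags

tokens = simple_tokenize("I’d like a meal that’s fast to prepare.")
tags = bio_tags(tokens, {"shortTimeKeyWordduration": "fast"})
print(tags)
```

With a real subword tokenizer, the slot value should be tokenized by the same tokenizer as the sentence so that the token sequences stay comparable.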
Intent Encoding:
The intent is mapped to a numerical label using the intent encoder:
    intent_label = intent_label_encoder.transform(["addFilter"])[0]
Output:
    0  # example numerical label for "addFilter"
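To make the mapping concrete, here is a small pure-Python sketch of what such an encoder does. It is an assumption that the intent encoder assigns integer labels to the sorted set of intent names; `fit_intent_encoder` and the `"removeFilter"` intent are hypothetical and used only for illustration.

```python
def fit_intent_encoder(intents):
    # Assign one integer per distinct intent, in sorted order (assumed scheme).
    classes = sorted(set(intents))
    return {intent: idx for idx, intent in enumerate(classes)}

# "removeFilter" is a made-up second intent for this example.
intent_to_id = fit_intent_encoder(["removeFilter", "addFilter", "addFilter"])
intent_label = intent_to_id["addFilter"]
print(intent_label)
```

The actual number assigned to each intent depends on which intents appear in the training data, so the label for "addFilter" may differ on the full dataset.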
Padding and Truncation:
Sequences are padded or truncated to a fixed length (max_length):
Token IDs are padded with zeros.
BIO tags are padded with O.
Example:
Original Tokens:
['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']
Padded Tokens:
['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.', '[PAD]']
BIO Tags are similarly padded to match the length.
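A minimal sketch of this step, assuming token IDs and BIO tags are plain Python lists of equal length. The helper name `pad_or_truncate` and the placeholder token IDs are hypothetical; only the padding values (0 for IDs, O for tags) come from the description above.

```python
def pad_or_truncate(token_ids, tags, max_length, pad_id=0, pad_tag="O"):
    # IDs and tags must stay aligned one-to-one.
    if len(token_ids) != len(tags):
        raise ValueError("token_ids and tags must have the same length")
    # Truncate anything beyond max_length.
    token_ids = token_ids[:max_length]
    tags = tags[:max_length]
    # Pad the remainder so both lists reach exactly max_length.
    pad = max_length - len(token_ids)
    return token_ids + [pad_id] * pad, tags + [pad_tag] * pad

# Placeholder IDs for the 13 example tokens (not real vocabulary IDs).
example_ids = list(range(101, 114))
example_tags = ["O"] * 9 + ["B-shortTimeKeyWordduration"] + ["O"] * 3

padded_ids, padded_tags = pad_or_truncate(example_ids, example_tags, max_length=14)
short_ids, short_tags = pad_or_truncate(example_ids, example_tags, max_length=5)
```

With `max_length=14`, one padding entry is appended to each list, which matches the single `[PAD]` in the example above.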
...
Think about how to efficiently loop through the dataset:
For intents, you can directly extract them from the dataset examples.
For slots, remember to handle nested structures since slots are stored as dictionaries.
Use your existing knowledge of Python tools to count and organize results:
A Counter object can help you group and tally items.
Make sure to test your function on small, simple datasets before running it on the full dataset to ensure correctness.
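One possible shape for such a function is sketched below. It assumes each example is a dictionary with an "intent" string and a "slots" dictionary, as in the raw data sample; the function name `count_intents_and_slots` and the second and third toy examples are hypothetical.

```python
from collections import Counter

def count_intents_and_slots(dataset):
    intent_counts = Counter()
    slot_counts = Counter()
    for example in dataset:
        # Intents are stored directly on each example.
        intent_counts[example["intent"]] += 1
        # Slots are a nested dictionary: count each slot name once per example.
        slot_counts.update(example.get("slots", {}).keys())
    return intent_counts, slot_counts

# A tiny dataset for testing; only the first entry comes from the sample above.
toy_dataset = [
    {"id": "st041", "text": "I’d like a meal that’s fast to prepare.",
     "intent": "addFilter", "slots": {"shortTimeKeyWordduration": "fast"}},
    {"id": "toy1", "text": "Something quick, please.",
     "intent": "addFilter", "slots": {"shortTimeKeyWordduration": "quick"}},
    {"id": "toy2", "text": "Hello there.", "intent": "greet", "slots": {}},
]

intent_counts, slot_counts = count_intents_and_slots(toy_dataset)
print(intent_counts.most_common())
print(slot_counts.most_common())
```

Checking the counts by hand on a toy dataset like this makes it easier to trust the numbers reported for the full dataset.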
...
Reflection Questions
After analyzing the distributions, think about:
...
By following these steps and reflecting on the results, you’ll gain a deeper understanding of your dataset and its potential challenges. This analysis is crucial for making informed decisions during model training and evaluation.
...