There are function(s) to be completed in this section!

Preprocessing Steps

Raw Data Before Preprocessing


Code Block
    {
        "id": "st041",
        "text": "I’d like a meal that’s fast to prepare.",
        "intent": "addFilter",
        "slots": {
            "shortTimeKeyWordduration": "fast"
        }
    }


Steps in Preprocessing

  1. Tokenization:

    • The text is broken into smaller units (tokens) using the BERT tokenizer:

      Code Block
      tokens = tokenizer.tokenize("I’d like a meal that’s fast to prepare.")
    • Output:

      Code Block
      ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']
  2. BIO Slot Label Encoding:

    • Slot values are matched with their positions in the tokenized text.

    • BIO-format labels are generated:

      • B-{slot}: Beginning of the slot value.

      • I-{slot}: Inside the slot value.

      • O: Outside any slot.


    • Slot Definition: "shortTimeKeyWordduration": "fast"

    • Output BIO Tags:

      Code Block
      ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-shortTimeKeyWordduration', 'O', 'O', 'O']
  3. Intent Encoding:

    • The intent is mapped to a numerical label using the intent encoder:

      Code Block
      intent_label = intent_label_encoder.transform(["addFilter"])[0]
    • Output:

      Code Block
      0  # (Example numerical label for "addFilter")
  4. Padding and Truncation:

    • Sequences are padded or truncated to a fixed length (max_length):

      • Token IDs are padded with zeros (0 is the ID of the [PAD] token in the standard BERT vocabulary).

      • BIO tags are padded with O.

    • Example (here max_length = 14, so one padding token is appended):

      • Original Tokens: ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']

      • Padded Tokens: ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.', '[PAD]']

      • BIO Tags are similarly padded to match the length.
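
Steps 2 and 4 above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the reference solution for this section: the helper names bio_tags and pad_or_truncate are hypothetical, the slot value is matched by lowercased whitespace splitting rather than by running it through the BERT tokenizer, and only the first match of each slot value is tagged. The value max_length = 14 is chosen only to match the padded example above.

```python
from typing import Dict, List


def bio_tags(tokens: List[str], slots: Dict[str, str]) -> List[str]:
    """Assign a B-/I-/O tag to every token by locating each slot value."""
    tags = ["O"] * len(tokens)
    for slot_name, slot_value in slots.items():
        # Assumption: splitting on whitespace is enough to align the slot
        # value with the tokens. A real pipeline would tokenize the value
        # with the same tokenizer that produced `tokens`.
        value_tokens = slot_value.lower().split()
        n = len(value_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == value_tokens:
                tags[start] = f"B-{slot_name}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{slot_name}"
                break  # tag only the first occurrence
    return tags


def pad_or_truncate(seq: list, max_length: int, pad_value) -> list:
    """Cut the sequence to max_length, or extend it with pad_value."""
    padding = [pad_value] * max(0, max_length - len(seq))
    return seq[:max_length] + padding


tokens = ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's',
          'fast', 'to', 'prepare', '.']
tags = bio_tags(tokens, {"shortTimeKeyWordduration": "fast"})

padded_tokens = pad_or_truncate(tokens, 14, "[PAD]")
padded_tags = pad_or_truncate(tags, 14, "O")
```

Padding the tags with the same helper keeps the token and tag sequences the same length, which is what the model input requires.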


  • Think about how to efficiently loop through the dataset:

    • For intents, you can directly extract them from the dataset examples.

    • For slots, remember to handle nested structures since slots are stored as dictionaries.

  • Use your existing knowledge of Python tools to count and organize results:

    • A Counter object (from Python's collections module) can help you group and tally items.

  • Make sure to test your function on small, simple datasets before running it on the full dataset to ensure correctness.
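
One possible shape for such a counting function is sketched below. It is an illustration, not the expected solution: the function name count_intents_and_slots is hypothetical, it assumes each example is a dictionary with "intent" and "slots" keys as in the raw data shown earlier, and it counts slot names rather than slot values. The toy dataset reuses the st041 example; the other two entries are invented placeholders for testing only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_intents_and_slots(examples: Iterable[Dict]) -> Tuple[Counter, Counter]:
    """Tally how often each intent and each slot name occurs."""
    intent_counts: Counter = Counter()
    slot_counts: Counter = Counter()
    for example in examples:
        # Intents are stored directly on each example.
        intent_counts[example["intent"]] += 1
        # Slots are a nested dict (slot name -> slot value); count the names.
        slot_counts.update(example.get("slots", {}).keys())
    return intent_counts, slot_counts


# Small toy dataset: only the first entry comes from the real data.
toy_examples = [
    {"id": "st041", "text": "I’d like a meal that’s fast to prepare.",
     "intent": "addFilter",
     "slots": {"shortTimeKeyWordduration": "fast"}},
    {"id": "toy1", "text": "placeholder", "intent": "addFilter", "slots": {}},
    {"id": "toy2", "text": "placeholder", "intent": "otherIntent",
     "slots": {"otherSlot": "value", "shortTimeKeyWordduration": "quick"}},
]

intent_counts, slot_counts = count_intents_and_slots(toy_examples)
print(intent_counts.most_common())
print(slot_counts.most_common())
```

Checking the result by hand on a dataset this small makes it easy to confirm the tallies before running the function on the full dataset.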



Reflection Questions

After analyzing the distributions, think about:

  1. Dataset Balance:

    • Are the distributions skewed? How might this affect the model?

    • Are all intents and slots well-represented, or are some missing?

  2. Test Dataset Representativeness:

    • Does the test set reflect the training set’s distribution?

    • If not, how might this impact evaluation?
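
To make the second question concrete, the counts from the training and test sets can be turned into relative frequencies and placed side by side. The sketch below is one simple way to do this, with hypothetical function names and invented counts; the label "otherIntent" is a placeholder, not a label from the dataset.

```python
from collections import Counter
from typing import Dict, Tuple


def relative_frequencies(counts: Counter) -> Dict[str, float]:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def compare_distributions(train_counts: Counter,
                          test_counts: Counter) -> Dict[str, Tuple[float, float]]:
    """Map each label to its (train fraction, test fraction)."""
    train_freq = relative_frequencies(train_counts)
    test_freq = relative_frequencies(test_counts)
    labels = sorted(set(train_freq) | set(test_freq))
    # A label missing from one split gets a fraction of 0.0 there.
    return {label: (train_freq.get(label, 0.0), test_freq.get(label, 0.0))
            for label in labels}


# Invented counts, for illustration only.
train_counts = Counter({"addFilter": 3, "otherIntent": 1})
test_counts = Counter({"addFilter": 1})

comparison = compare_distributions(train_counts, test_counts)
for label, (train_share, test_share) in comparison.items():
    print(f"{label}: train={train_share:.2f} test={test_share:.2f}")
```

A label with a fraction of 0.0 in the test split is never evaluated, so the model's performance on it cannot be measured from that test set.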


By following these steps and reflecting on the results, you’ll gain a deeper understanding of your dataset and its potential challenges. This analysis is crucial for making informed decisions during model training and evaluation.
