
There are function(s) to be completed in this section!

Preprocessing Steps

Raw Data Before Preprocessing

...

Code Block
{
    "id": "st041",
    "text": "I’d like a meal that’s fast to prepare.",
    "intent": "addFilter",
    "slots": {
        "shortTimeKeyWordduration": "fast"
    }
}

...

Steps in Preprocessing

  1. Tokenization:

    • The text is broken into smaller units (tokens) using the BERT tokenizer:

      Code Block
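      # Assumes `tokenizer` is an already-loaded uncased BERT tokenizer (the output below is lowercased).
      tokens = tokenizer.tokenize("I’d like a meal that’s fast to prepare.")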
      
    • Output:

      Code Block
      ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']
      
  2. BIO Slot Label Encoding:

    • Each slot value is located in the tokenized text, so that the token positions it covers are known.

    • BIO-format labels are generated:

      • B-{slot}: Beginning of the slot value.

      • I-{slot}: Inside the slot value.

      • O: Outside any slot.

    Example:

    • Slot Definition: "shortTimeKeyWordduration": "fast"

    • Output BIO Tags:

      Code Block
      ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-shortTimeKeyWordduration', 'O', 'O', 'O']
      
  3. Intent Encoding:

    • The intent is mapped to a numerical label using the intent encoder:

      Code Block
      intent_label = intent_label_encoder.transform(["addFilter"])[0]
      
    • Output:

      Code Block
      0  # (Example numerical label for "addFilter")
      
  4. Padding and Truncation:

    • Sequences are padded or truncated to a fixed length (max_length):

      • Token IDs are padded with zeros (0 is the ID of the [PAD] token in the standard BERT vocabularies).

      • BIO tags are padded with O.

    • Example:

      • Original Tokens: ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.']

      • Padded Tokens: ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's', 'fast', 'to', 'prepare', '.', '[PAD]']

      • BIO tags are padded with O to the same length, so every token position still has exactly one tag.
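Steps 2 and 4 above can be sketched in plain Python. This is a minimal illustration, not the course's reference implementation: the helper names `bio_tags`, `pad_or_truncate`, and `toy_tokenize` are hypothetical, `toy_tokenize` is only a whitespace stand-in for `tokenizer.tokenize`, and only the first occurrence of each slot value is tagged.

```python
from typing import Callable, Dict, List


def bio_tags(tokens: List[str], slots: Dict[str, str],
             tokenize: Callable[[str], List[str]]) -> List[str]:
    """Return one BIO tag per token (hypothetical helper)."""
    tags = ["O"] * len(tokens)
    for slot_name, value in slots.items():
        value_tokens = tokenize(value)
        n = len(value_tokens)
        if n == 0:
            continue
        # Tag the first position where the slot value's tokens occur.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == value_tokens:
                tags[start] = f"B-{slot_name}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{slot_name}"
                break
    return tags


def pad_or_truncate(items: list, max_length: int, pad_value) -> list:
    """Cut the list to max_length, then fill up with pad_value if it is shorter."""
    kept = items[:max_length]
    return kept + [pad_value] * (max_length - len(kept))


def toy_tokenize(text: str) -> List[str]:
    # Assumption: stand-in for tokenizer.tokenize, good enough for simple slot values.
    return text.lower().split()


tokens = ['i', '’', 'd', 'like', 'a', 'meal', 'that', '’', 's',
          'fast', 'to', 'prepare', '.']
slots = {"shortTimeKeyWordduration": "fast"}

tags = bio_tags(tokens, slots, toy_tokenize)
padded_tokens = pad_or_truncate(tokens, 16, "[PAD]")  # max_length=16 is arbitrary here
padded_tags = pad_or_truncate(tags, 16, "O")

print(tags)
print(padded_tokens)
print(padded_tags)
```

In the real pipeline the slot value would be tokenized with the same BERT tokenizer as the sentence, so that multi-word or subword-split values still line up with the token sequence.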

...

  • Think about how to efficiently loop through the dataset:

    • For intents, you can directly extract them from the dataset examples.

    • For slots, remember that each example stores its slots as a dictionary, so you need to iterate over that nested structure (and some examples may have no slots at all).

  • Use your existing knowledge of Python tools to count and organize results:

    • A Counter object (from Python's collections module) can help you group and tally items.

  • Make sure to test your function on small, simple datasets before running it on the full dataset to ensure correctness.
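The hints above could translate into something like the following sketch. It is one possible shape for such a function, not the expected solution: the name `count_distributions` is hypothetical, it counts slot names rather than slot values, and the examples with ids `toy1` and `toy2` (including the `greet` intent) are invented purely for a small test.

```python
from collections import Counter


def count_distributions(examples):
    """Tally how often each intent and each slot name occurs (hypothetical helper)."""
    intent_counts = Counter()
    slot_counts = Counter()
    for example in examples:
        # Intents can be read directly from each example.
        intent_counts[example["intent"]] += 1
        # Slots are a nested dictionary; count its keys (the slot names).
        slot_counts.update(example.get("slots", {}).keys())
    return intent_counts, slot_counts


# Small hand-made dataset: the first entry mirrors the raw example above,
# the other two are invented for testing only.
examples = [
    {"id": "st041", "text": "I’d like a meal that’s fast to prepare.",
     "intent": "addFilter", "slots": {"shortTimeKeyWordduration": "fast"}},
    {"id": "toy1", "text": "Something quick, please.",
     "intent": "addFilter", "slots": {"shortTimeKeyWordduration": "quick"}},
    {"id": "toy2", "text": "Hello!", "intent": "greet", "slots": {}},
]

intent_counts, slot_counts = count_distributions(examples)
print(intent_counts.most_common())
print(slot_counts.most_common())
```

Checking the result by hand on a dataset this small makes it easy to spot mistakes, such as counting slot values instead of slot names, before running on the full data.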

...

Note

Reflection Questions

After analyzing the distributions, think about:

  1. Dataset Balance:

    • Are the distributions skewed? How might this affect the model?

    • Are all intents and slots well-represented, or are some missing?

  2. Test Dataset Representativeness:

    • Does the test set reflect the training set’s distribution?

    • If not, how might this impact evaluation?
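To make the second question concrete, the training and test counts can be turned into relative frequencies and placed side by side. The sketch below assumes the counts are already available as `Counter` objects; the function names and all numbers, as well as the `removeFilter` label, are hypothetical and serve only as illustration.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw counts into shares that sum to 1 (empty input gives an empty dict)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def compare_distributions(train_counts, test_counts):
    """Map each label to (share in training set, share in test set)."""
    train_freq = relative_frequencies(train_counts)
    test_freq = relative_frequencies(test_counts)
    labels = sorted(set(train_freq) | set(test_freq))
    return {label: (train_freq.get(label, 0.0), test_freq.get(label, 0.0))
            for label in labels}


# Invented counts, for illustration only.
train_counts = Counter({"addFilter": 3, "removeFilter": 1})
test_counts = Counter({"addFilter": 2})

comparison = compare_distributions(train_counts, test_counts)
missing_in_test = [label for label, (_, test_share) in comparison.items()
                   if test_share == 0.0]

for label, (train_share, test_share) in comparison.items():
    print(f"{label}: train={train_share:.2f} test={test_share:.2f}")
print("Missing from test set:", missing_in_test)
```

A label that is present in training but absent from the test set cannot be evaluated at all, which is one way an unrepresentative test set distorts the reported metrics.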

...

By following these steps and reflecting on the results, you’ll gain a deeper understanding of your dataset and its potential challenges, such as skewed distributions or a test set that does not reflect the training set. This analysis helps you make informed decisions during model training and evaluation.

...