Model Evaluation

Your evaluation statistics must meet the thresholds described here. If they do not, first go over the instructions again and make sure you have implemented the model and training procedure correctly, then consult our Model Improvement page. To receive full points for your basic intent and slot classifier, you must exceed our defined thresholds. See our Assessment Rubric page for more information.

Objective

evaluate.py contains an evaluate function. The evaluation step shows how well the trained model performs on the test dataset. It reveals the model's strengths and weaknesses by calculating key metrics for the intent classification and slot-filling tasks. These metrics are critical for ensuring the model meets performance expectations.

What Does the Evaluation Show?

Intent Accuracy:
- This metric shows the percentage of correct intent predictions made by the model.
- A higher intent accuracy indicates that the model successfully identifies the overall purpose of user inputs (e.g., addFilter, recipeRequest).
Intent Classification Report:
- Provides a detailed breakdown of precision, recall, and F1-score for each intent.
- Precision: Of all inputs the model predicted as a given intent, the fraction that truly have that intent.
- Recall: Of all inputs that truly have a given intent, the fraction the model identified correctly.
- F1-Score: The harmonic mean of precision and recall, offering a balanced measure of accuracy for each intent.
Slot Classification Report:
- Includes precision, recall, and F1-score for each slot type in BIO format (e.g., B-ingredient, I-ingredient, O).
- Shows how well the model identifies and tags individual tokens within a user input.
- Weighted F1-Score: A single score summarizing the model's performance across all slot labels, weighted by label frequency.
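
The definitions above can be made concrete with a small worked example. The sketch below is not the code in evaluate.py; it is a plain-Python illustration with hypothetical helper names (accuracy, per_label_scores, weighted_f1) and a made-up six-token utterance. Libraries such as scikit-learn offer equivalent functions (for example classification_report and f1_score with average="weighted").

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_label_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted, but wrongly
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]  # times the model chose this label
        support = tp[label] + fn[label]    # times the label truly occurs
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support)
    return scores


def weighted_f1(scores):
    """Average the per-label F1 scores, weighting each by its support."""
    total = sum(row[3] for row in scores.values())
    return sum(row[2] * row[3] for row in scores.values()) / total


# Made-up slot tags for one utterance, one BIO label per token.
true_tags = ["O", "O", "B-ingredient", "I-ingredient", "O", "B-ingredient"]
pred_tags = ["O", "O", "B-ingredient", "O", "O", "B-ingredient"]

slot_scores = per_label_scores(true_tags, pred_tags)
slot_weighted_f1 = weighted_f1(slot_scores)
token_accuracy = accuracy(true_tags, pred_tags)

for label, (p, r, f1, support) in slot_scores.items():
    print(f"{label:14s} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"weighted F1 = {slot_weighted_f1:.3f}, token accuracy = {token_accuracy:.3f}")
```

In this toy case the single I-ingredient token is missed, so that label gets an F1 of 0 while the weighted F1 stays around 0.76. The same accuracy function applies to intents, with one label per utterance instead of one per token.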

Why Is This Important?

Measure Performance:
- The metrics highlight whether the model performs well enough to be deployed for practical use.
- The intent accuracy and slot weighted F1-score provide a quick snapshot of overall model performance.
Identify Weaknesses:
- The detailed classification reports help pinpoint specific intents or slots where the model struggles.
- This information can guide further training, such as focusing on underperforming labels or collecting more data for rare cases.
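
One way to pinpoint struggling labels is to sort the report rows by F1-score. The sketch below assumes the report is available as a dictionary shaped like the one scikit-learn's classification_report returns with output_dict=True; whether evaluate.py exposes its report in this form is an assumption. The function name weakest_labels, the 0.7 cutoff, and all numbers in the example are illustrative, not actual results.

```python
# Summary rows that appear alongside the per-label rows in a
# scikit-learn style report dictionary; they are not labels.
SUMMARY_ROWS = {"accuracy", "micro avg", "macro avg", "weighted avg"}


def weakest_labels(report, max_f1=0.7, min_support=1):
    """Return (label, f1, support) for labels scoring below max_f1, worst first."""
    weak = [
        (label, row["f1-score"], row["support"])
        for label, row in report.items()
        if label not in SUMMARY_ROWS
        and row["support"] >= min_support
        and row["f1-score"] < max_f1
    ]
    return sorted(weak, key=lambda item: item[1])


# Illustrative numbers only.
report = {
    "B-ingredient": {"precision": 0.90, "recall": 0.86, "f1-score": 0.88, "support": 200},
    "I-ingredient": {"precision": 0.60, "recall": 0.46, "f1-score": 0.52, "support": 35},
    "O": {"precision": 0.97, "recall": 0.97, "f1-score": 0.97, "support": 1500},
    "accuracy": 0.95,
    "weighted avg": {"precision": 0.95, "recall": 0.95, "f1-score": 0.95, "support": 1735},
}

weak = weakest_labels(report)
for label, f1, support in weak:
    print(f"{label}: f1={f1:.2f} over {support} examples")
```

Reading the support column alongside the F1-score helps separate labels that are hard for the model from labels that simply have too few examples to learn from.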

How to Interpret the Results

Intent Accuracy:
- A score of 0.85 means the model predicts the correct intent for 85% of the test inputs.
- If this score is low, the model might need better training data or hyperparameter tuning.
Slot Weighted F1-Score:
- A score of 0.80 indicates good overall slot tagging, but because the score is weighted by label frequency, frequent labels such as O can mask poor results on rarer slot types.
- Inspect the classification report to identify specific slot types with low scores.
Combined View:
- Use both metrics together to evaluate if the model performs well on both tasks (intent classification and slot filling).
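
Checking both metrics together can be sketched as a simple pass/fail test. The names EvalResult and failing_metrics are hypothetical, and the threshold values below are placeholders: the actual thresholds are the ones defined on the Assessment Rubric page. Because the thresholds must be exceeded, a score equal to the threshold counts as failing in this sketch.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    intent_accuracy: float
    slot_weighted_f1: float


def failing_metrics(result, intent_threshold, slot_threshold):
    """Return the names of the metrics that do not exceed their threshold."""
    failures = []
    if result.intent_accuracy <= intent_threshold:
        failures.append("intent_accuracy")
    if result.slot_weighted_f1 <= slot_threshold:
        failures.append("slot_weighted_f1")
    return failures


# Placeholder thresholds, NOT the course's actual values.
INTENT_THRESHOLD = 0.80
SLOT_THRESHOLD = 0.85

result = EvalResult(intent_accuracy=0.85, slot_weighted_f1=0.80)
failures = failing_metrics(result, INTENT_THRESHOLD, SLOT_THRESHOLD)

if failures:
    print("Below threshold:", ", ".join(failures))
else:
    print("Both metrics exceed their thresholds.")
```

With these placeholder numbers the model passes on intent accuracy but fails on slot tagging, which is why neither metric should be read in isolation.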

Conclusion

The evaluation process provides a detailed understanding of your model’s performance and areas for improvement. By interpreting the intent accuracy and the slot classification report, you can refine your training process and ensure the model meets the desired benchmarks for real-world use. This step is also crucial for automated grading and for comparing models across submissions.

Reflection Questions:

Understanding Metrics
- What does the intent accuracy score tell you about your model's ability to understand user inputs?
- Why is the weighted F1-score important for evaluating slot classification? How does it provide a balanced view of the model's performance?
Model Strengths and Weaknesses
- Which intents or slots had the highest precision, recall, or F1-score? What does this indicate about the model's strengths?
- Which intents or slots had the lowest scores? How could you address these weaknesses in future training or data collection?
Improving the Model
- If the intent accuracy is lower than expected, what steps would you take to improve it?
- What actions could you take to improve slot tagging for low-performing slot types?
Real-World Application
- Based on the evaluation results, would you feel confident deploying this model for a real-world conversational agent? Why or why not?
- What additional metrics or analyses might you consider before deployment?
Automated Testing
- How could the intent accuracy and slot weighted F1-score thresholds be used to determine whether a model is acceptable for submission?
- Why is it beneficial to have automated testing in place for performance evaluation?
Reflection on the Process
- What did you learn from this evaluation process about your model’s capabilities and limitations?
- How might these insights influence your approach to future iterations of the model?

Done? Proceed with Connecting Your Classifier with WHISPER.

Project Conversational Agents 2025