Welcome to the agent testing section!
Agent testing is a set of checks that you, as developers of the agent, carry out to find out how well the agent and its components operate when exposed to diverse input. You do this continuously while developing your bot, by holding conversations with the agent and analyzing how it behaves. Use your agent a lot (at least 10 different conversations per person) and vary the conversations you have. Also try to trigger each pattern, page, intent, and filter to check that everything works. As you improve the agent and add features, keep re-testing it. We recommend that each team member also tests the work of the other pair programming subteams.
Dialogflow has a useful feature for analyzing the quality of intent detection in your test conversations. Go to the Training page in Dialogflow and filter the conversations at the top left. This video explains how to do this in more detail: Use Dialogflow Analytics & Training to Improve Your Chatbot (2021). Note that because of the agent set-up in the Project MAS course, where Dialogflow is only used for Automatic Speech Recognition and intent detection, the Analytics part does not apply.
To improve other components of the agent (such as patterns, visuals, and Prolog predicates), you can inspect error messages and debug in MARBEL, and try out Prolog predicates in a Prolog interpreter.
Testing is most effective when you do it systematically (for example, by deliberately using different utterances to trigger particular intents). Part of your end report will be about how you did this (see 2023: End Report for more info). Below is a list of things to keep in mind during testing, which should also be addressed somewhere in your end report:
What capabilities of your agent do you identify?
Which of those do you think are most important to test?
What did you test and how?
What should an example conversation look like?
How did your test go? What went well, and what went badly (focus on what went badly and why it went wrong)?
What problems came up from the tests, and how can they be fixed? Which problems were feasible to fix before doing the user test? Which problems were not?
What possible extensions came up from the tests? Which of those were feasible to implement before doing the user test?
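One way to keep the systematic testing described above organized is to log every test utterance together with the intent you meant to trigger and the intent the agent actually detected. The Python sketch below is only an illustration and not part of the course tooling: the names LoggedTurn, summarize, and failures, as well as the example intent names, are hypothetical, and it assumes you copy the detected intent by hand from the Dialogflow Training page or your agent's output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LoggedTurn:
    """One tested utterance (hypothetical record format, filled in by hand)."""
    utterance: str        # what the tester said or typed
    expected_intent: str  # intent the tester meant to trigger
    detected_intent: str  # intent the agent actually reported


def summarize(turns, all_intents):
    """Return per-intent (passed, total) counts and the intents never tested."""
    counts = defaultdict(lambda: [0, 0])
    for turn in turns:
        counts[turn.expected_intent][1] += 1
        if turn.detected_intent == turn.expected_intent:
            counts[turn.expected_intent][0] += 1
    untested = sorted(set(all_intents) - set(counts))
    return {intent: tuple(c) for intent, c in counts.items()}, untested


def failures(turns):
    """Return the turns where the detected intent differs from the expected one."""
    return [t for t in turns if t.detected_intent != t.expected_intent]


if __name__ == "__main__":
    # Intent names below are made up for illustration; use your own agent's intents.
    log = [
        LoggedTurn("hi there", "greeting", "greeting"),
        LoggedTurn("something with pasta", "addFilter", "greeting"),
    ]
    summary, untested = summarize(log, ["greeting", "addFilter", "farewell"])
    print("Per-intent (passed, total):", summary)
    print("Never tested:", untested)
    for turn in failures(log):
        print("Misdetected:", turn)
```

The list of untested intents helps you check that every intent has been triggered at least once, and the misdetected turns give you concrete examples of what went wrong for the end report.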