Welcome to the agent testing section!

Agent testing means performing a number of steps, as the developers of the agent, to evaluate how well your agent and its components operate when exposed to a diverse range of inputs. You should do this continuously while developing your agent, by engaging in conversations with the agent and analyzing its capabilities. Use your agent a lot (aim for a minimum of 10, but preferably more, conversations per team member per week). We also recommend that each team member tries to test the other sections' parts, i.e., the work of the other pair-programming subteams.

Most important is that you vary the conversations you have. Test everything to check that it works: make sure you trigger all possibilities, such as intents, patterns, pages, and filters. The testing is most effective if you do this systematically (for example, by deliberately using different utterances to trigger particular intents). As you improve your agent and add things, you should continuously re-test your conversational agent, both to test its new capabilities and to make sure the ones you implemented before still work.

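A lightweight way to keep this systematic is to track which intents and patterns you have already triggered, so you can quickly see what still needs a test conversation. Below is a minimal sketch of this idea as a few Prolog facts and a helper rule; the intent names are made-up examples, not identifiers from the actual project code.

    % Minimal coverage-tracking sketch; the intent names below are hypothetical
    % examples, not identifiers from the actual project code.
    intent(greeting).
    intent(request_recommendation).
    intent(add_filter).
    intent(farewell).

    % Add a fact here once an intent has been triggered in at least one test conversation.
    tested(greeting).
    tested(add_filter).

    % Query  ?- untested(Intent).  to list intents that still need a test conversation.
    untested(Intent) :-
        intent(Intent),
        \+ tested(Intent).

A simple shared spreadsheet works just as well; the point is to make your test coverage explicit rather than ad hoc.
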
You can use different tools to inspect the performance of your agent. For example, Dialogflow has a useful feature for analyzing the quality of intent recognition in the conversations you have during testing: on the Training page you can filter the conversations you had (top left) and analyze no-match conversations. This video provides a more thorough explanation of how to do this: Use Dialogflow Analytics & Training to Improve Your Chatbot (2021). Note that in the agent set-up used in the Project MAS course (SIC), Dialogflow is only used for automatic speech recognition and intent recognition, so the Analytics part does not apply.

To improve the other components of the agent (such as patterns, visuals, and Prolog predicates), you can inspect error messages and use the debugging perspective in Eclipse for the MARBEL agent. You can also test your Prolog predicates in a Prolog interpreter.

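For the Prolog part, you can load your knowledge base in an interpreter such as SWI-Prolog and query predicates by hand, or write a small test suite with the plunit library that ships with SWI-Prolog. The sketch below shows the latter; the file name recipe_filters.pl and the predicate matches_filter/2 are hypothetical placeholders, so substitute your own knowledge base and predicates.

    % Minimal sketch of a plunit test suite for a Prolog knowledge base.
    % 'recipe_filters.pl' and matches_filter/2 are hypothetical placeholders.
    :- use_module(library(plunit)).
    :- consult('recipe_filters.pl').

    :- begin_tests(recipe_filters).

    % A recipe with the requested property should pass the filter.
    test(vegetarian_recipe_passes) :-
        matches_filter(pasta_pesto, vegetarian).

    % A recipe without the property should not pass the filter.
    test(meat_recipe_fails, [fail]) :-
        matches_filter(beef_stew, vegetarian).

    :- end_tests(recipe_filters).

    % Load this file in SWI-Prolog and run:  ?- run_tests.

Re-running the same queries after every change is a quick way to catch regressions in your knowledge base before you test the full conversation flow with the agent.
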
Part of your final report will be about how you tested your agent (see Final Report for more info). Below is a list of things you should keep in mind during testing, some of which should be included in the testing section of your final report:

  • What capabilities of your agent did you test? How did you go about this?

  • Which of those capabilities were most important to test? Why? How did you set up your tests?

  • What kinds of example conversations did you use for testing?

  • How did you analyze no-match (mismatched) conversations, for example on Dialogflow's Training page?

  • How did your tests go? What went well, and where did you run into problems (focus on these problems and explain why things went wrong)?

  • How did you handle or fix these problems? Which choices did you make to focus on what you could reasonably fix within the time frame of the project? What problems did you ignore?

  • What kind of extensions did you implement to address some of the issues you ran into? How did you make choices to focus on the things you as a team thought most important?