Pipeline Testing

    Welcome to the agent testing page!

    Agent testing is the set of steps that you, as developers of the agent, perform to evaluate how well your agent and its components operate when exposed to a diverse range of inputs. You do this continuously while developing your agent, by engaging in conversations with the agent and analyzing its capabilities. You should use your agent a ton (aim for at least 10, but preferably more, conversations per team member per week). We recommend that each team member tests the work of other team members.

    Most important is that you vary the conversations you have. You should, for example, try to trigger each intent and pattern, and aim to visit each page, to check that it all works. The testing is most effective if you do this systematically (for example, by deliberately using different utterances to trigger particular intents). As you improve your agent and add new features, you should continuously re-test your conversational agent (to test its new capabilities, but also to make sure the ones you implemented before still work).
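
    One way to keep this systematic is to write down, per intent, the utterances you plan to test and tick them off after each testing round. Below is a minimal sketch of such a checklist; the intent names and utterances are only hypothetical examples, not part of the actual agent.

    ```python
    # Minimal sketch of a per-intent test checklist (hypothetical intents and utterances).
    # After each manual testing round, mark which utterances triggered the intended intent.

    test_plan = {
        "recommendRecipe": [
            "can you suggest something vegetarian?",
            "I want a quick pasta dish",
        ],
        "addFilter": [
            "without nuts please",
            "only recipes under 30 minutes",
        ],
    }

    # Record outcomes as {intent: {utterance: True/False/None}} while you test.
    results = {intent: {u: None for u in utterances} for intent, utterances in test_plan.items()}

    def coverage_report(results):
        """Print which utterances are still untested or failed to trigger their intent."""
        for intent, outcomes in results.items():
            untested = [u for u, ok in outcomes.items() if ok is None]
            failed = [u for u, ok in outcomes.items() if ok is False]
            print(f"{intent}: {len(untested)} untested, {len(failed)} failed")

    coverage_report(results)
    ```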

    You can use different tools to inspect the performance of your agent.

    To test the performance of the Intent and Slot Classifier individually, look at Model Evaluation. That page lists some evaluation methods that are useful for these kinds of classifiers. Feel free to investigate any others that you think could be relevant or useful, such as confusion matrices.
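
    For example, once you have the gold intent labels and your classifier's predictions for a set of test utterances, scikit-learn can produce a per-intent precision/recall overview and a confusion matrix. This is only a sketch: the label lists below are placeholders for your own evaluation data.

    ```python
    # Sketch: comparing gold intent labels with predicted labels (placeholder data).
    from sklearn.metrics import classification_report, confusion_matrix

    y_true = ["recommendRecipe", "addFilter", "recommendRecipe", "greeting"]      # gold labels
    y_pred = ["recommendRecipe", "recommendRecipe", "recommendRecipe", "greeting"]  # classifier output

    labels = sorted(set(y_true) | set(y_pred))

    # Per-intent precision, recall, and F1.
    print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

    # Rows are the true intents, columns the predicted intents.
    print(labels)
    print(confusion_matrix(y_true, y_pred, labels=labels))
    ```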

    To improve other components of the agent (such as patterns, visuals, and your recipe recommendation logic), you can inspect the log files of the MARBEL agent. You can also test your Prolog code in a Prolog interpreter such as SWI-Prolog (tip: use consult/1 to load one or more Prolog files into the interpreter).

    Part of your final report will be about how you tested your agent. Below is a list of things you should keep in mind during testing, some of which should be included in your final report:

    • What capabilities of your agent did you test? How did you go about this?

    • Which of those capabilities were most important to test? Why?

    • What kinds of example conversations did you use for testing?

    • How did your tests go? What went well, and where did you run into problems (focus on these problems and explain why things went wrong)?

    • How did you handle or fix these problems? Which choices did you make to focus on what you could reasonably fix within the time frame of the project? What problems did you ignore?

    • What kind of extensions did you implement to address some of the issues you ran into? How did you make choices to focus on the things you as a team thought most important?
