How to use test files for AI model analysis

The test file is a list of critical interactions that the NLP model is expected to classify. The model is built from the knowledge base (intentions and entities). The purpose of this file is to validate the model's accuracy, and more specifically to ensure that the model correctly identifies the intentions behind the chatbot's most critical interactions. Critical questions are interactions related to skills (and content) that the chatbot cannot, under any circumstances, fail to answer.

The recommendation is to collect real interactions from users that fall within the critical topics mentioned above. Tip: use the filters on the Enhancement screen to find these interactions.

This file is important because it lets you validate the changes made to the knowledge base, ensuring that those changes have no negative impact on the model; that is, everything that was recognized correctly before continues to be recognized correctly.

The file must be in .csv format, where the first column contains the questions and the second contains the id of the intention the model is expected to recognize for each question (see the first sketch at the end of this section). You can use the Blip Build AI Model Analysis File tool to build this file easily.

Use

The file is used on the AI Model Analysis screen, where you can create a report with AI model evaluation metrics. Choose the File option and follow the on-screen guidelines. Note that, to generate the report, Blip must send the questions to the model, which may incur costs depending on the provider used.

Results

The metrics presented in the report are:

- Accuracy
- Precision
- Recall
- F1 Score
- Average reliability
- Correctly classified
- Incorrectly classified
- Top False Positives
- Top False Negatives

For a report created with the test file, the metrics should have the following values:

Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Average reliability: variable
Correctly classified: 100%
Incorrectly classified: 0%
Top False Positives: none
Top False Negatives: none

Average reliability is variable because it is the average of the reliability (confidence) score returned by the provider when analyzing each question in the test file.

If any of the other metrics differs from the values above, the model is not answering every question correctly. In that case, check the Top False Positives and Top False Negatives tabs, where you can identify which intention was expected and which was actually recognized.

In addition, a Confusion Matrix is generated, where you can identify points of confusion between intentions. The top row shows the expected intentions, while the left column shows the recognized intentions. For example: 10 questions were expected to be recognized as Curiosities, but only 5 were. There is therefore confusion between the Curiosities intention and the What is it, Basic signs, and How to learn intentions, since one question was recognized as What is it, another as Basic signs, and another 3 as How to learn.

The ideal scenario in the confusion matrix analysis is that only the main diagonal is different from 0 (zero), and that is the scenario you should see when using the test file to generate the report (the second sketch below shows how both the metrics and the matrix can be computed).
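To make the file layout concrete, here is a minimal sketch in Python that writes such a file; the questions, intention ids, and file name are hypothetical, and the Blip Build AI Model Analysis File tool remains the simplest way to produce it.

```python
import csv

# Hypothetical critical questions paired with the intention id the model
# is expected to recognize for each of them (ids are illustrative only).
rows = [
    ("what is a sign language?", "what-is-it"),
    ("show me some basic signs", "basic-signs"),
    ("how can I learn to sign?", "how-to-learn"),
]

# First column: the question; second column: the expected intention id.
with open("test-file.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```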
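And to make the metrics and the confusion matrix concrete, here is a minimal, self-contained sketch using scikit-learn; the intention ids are hypothetical, macro averaging is an assumption (the report's exact averaging may differ), and scikit-learn orients the matrix with expected intentions on the rows, i.e. transposed relative to the report layout described above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical expected vs. recognized intention ids for five test
# questions, mirroring the Curiosities example above on a smaller scale.
expected = ["curiosities", "curiosities", "what-is-it", "basic-signs", "how-to-learn"]
recognized = ["curiosities", "what-is-it", "what-is-it", "basic-signs", "how-to-learn"]

print("Accuracy :", accuracy_score(expected, recognized))
print("Precision:", precision_score(expected, recognized, average="macro", zero_division=0))
print("Recall   :", recall_score(expected, recognized, average="macro", zero_division=0))
print("F1 Score :", f1_score(expected, recognized, average="macro", zero_division=0))

# scikit-learn puts expected intentions on the rows and recognized ones
# on the columns; a perfect test-file run leaves only the main diagonal
# different from zero.
labels = sorted(set(expected) | set(recognized))
print(labels)
print(confusion_matrix(expected, recognized, labels=labels))
```

With a well-maintained test file, every off-diagonal cell should be 0 and all four score metrics should print 1.0, matching the expected values in the table above.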
File update

The test file must contain the critical questions related to skills (and content) that the chatbot cannot, under any circumstances, fail to answer. Therefore, every time the model is trained and published, interactions must be added that test what has changed in the knowledge base (as long as it is something critical). Note that it is not necessary to add exactly the example that was added to an intention; instead, add an interaction that tests the NLP model's ability to understand when something similar is sent to the chatbot.

In addition, the recommendation is that the file be updated (and operated) by the same person who made the changes to the knowledge base (intentions and entities) or, failing that, by someone who is aware of the changes that were made.

Versioning

For version control of the file, it is recommended that each version be named with the day and time of publication of the model to be tested, so that there is a clear relationship between the file version and the corresponding model (see the sketch at the end of this article).

If the recommendations in this document are followed, the person responsible for evolving the knowledge base (and, consequently, the NLP model) will be able to validate the changes made to the base, ensuring that, overall, the model has evolved rather than regressed. It also creates a way of ensuring that the model responds correctly to what the customer expects; if something is not answered, it should be treated as an opportunity to improve the model rather than as a bug.
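As a minimal illustration of the versioning convention above (the exact format is a suggestion, not a Blip requirement), the version name can be derived from the model's publication timestamp:

```python
from datetime import datetime

# Hypothetical publication day and time of the model to be tested.
published_at = datetime(2021, 1, 15, 21, 59)

# Tie the test-file version to that model by embedding the timestamp.
filename = f"test-file_{published_at:%Y-%m-%d_%H-%M}.csv"
print(filename)  # test-file_2021-01-15_21-59.csv
```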