The P&F data science team has developed a new approach to evaluating their chatbot's accuracy: instead of relying on subjective expert opinions alone, they use historical customer questions to test the chatbot's performance. By building a dataset from conversation history, the team was able to retrospectively evaluate the chatbot's replies and compare expert evaluations with GPT-4 evaluations. This approach has streamlined the evaluation process and made it possible to automate chatbot accuracy evaluation with GPT-4. The team's efforts produced a golden standard dataset and evaluation best practice guidelines, which will improve the chatbot's performance and ultimately enhance the customer experience.
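The workflow described above (score historical chatbot replies against golden standard answers using an automated judge such as GPT-4) can be sketched as follows. This is a minimal illustration, not P&F's actual code: the `EvalCase` structure, the `judge` callable, and the exact-match placeholder judge are all hypothetical; in practice the judge would prompt GPT-4 to rate each reply against the expert-approved answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One historical customer question with its recorded reply."""
    question: str
    golden_answer: str   # expert-approved reference reply from the workshop
    chatbot_reply: str   # reply taken from the conversation history


def evaluate(cases: List[EvalCase],
             judge: Callable[[str, str, str], float]) -> float:
    """Score every historical case with a judge function and average.

    The judge returns a score in [0, 1]; with GPT-4 as judge, this is
    where the model would be prompted to compare reply vs. golden answer.
    """
    scores = [judge(c.question, c.golden_answer, c.chatbot_reply)
              for c in cases]
    return sum(scores) / len(scores)


def exact_match_judge(question: str, golden: str, reply: str) -> float:
    """Placeholder judge: exact (case-insensitive) match only.

    A real GPT-4 judge would grade semantic correctness instead.
    """
    return 1.0 if reply.strip().lower() == golden.strip().lower() else 0.0
```

Swapping `exact_match_judge` for a GPT-4-backed judge automates the evaluation without changing the surrounding loop, which is the main benefit of separating the dataset from the scoring function.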

Chatbot Evaluation Revolutionized
P&F organizes a workshop with the experts to create golden standard responses to the historical question dataset and evaluation best practice guidelines.