The algorithm maze: an intern’s tale of chatbot optimization

Ingrid Folland

Imagine being a 19-year-old who recently arrived in the UK, completed just one semester of a BSc in Computer Science, and secured an internship typically reserved for second and third-year students. That was me! Welcome to my journey of creating a chatbot evaluation system during my internship—a journey filled with challenges, learning, and triumph. 

The challenge

I was selected for an internship with Japeto AI through the ARISE innovation program. Japeto AI, funded by Innovate UK, develops innovative chatbot tools for healthcare, education, and charities. The team had an existing algorithm for analysing and scoring chatbot conversations, but it was unreliable and dependent on user and manager ratings. If the chatbot recognized an intent, it received a high rating, even if it didn’t answer the user’s question. This algorithm required manual intervention to rate conversations, leading to questions about its accuracy. 

My challenge was to gauge the effectiveness of our chatbots and create a reliable evaluation system. This involved assessing chatbot performance and identifying areas for improvement using the vast amounts of data generated from interactions. My task was to develop algorithms that could serve as a health score on Japeto’s dashboard, providing a quantifiable measure of the chatbot’s performance. 

The challenge was set. The stage was ready. 

Laying the groundwork: research, setup, and onboarding 

The first step in my journey was familiarizing myself with the team and the firm’s operations. I was onboarded and learned the standard operating procedures, which helped me understand the firm’s context and expectations. 

Onboarding with the Japeto team

Next, I researched how to devise an algorithm for evaluating chatbot performance. I explored successful use cases and various evaluation metrics online. After extensive research, I settled on using weighted metrics. This method involved assigning different weights to factors based on their importance and combining them into a comprehensive health score for the chatbot. 

The metrics I decided to include were: 

  • Number of Messages with a Thumbs Up 
  • Search in Conversation for “Thank You” or Similar 
  • Link Clicks 
  • Missed Intents 
  • Sentiment Score (Positive and Negative) 
  • Number of Messages with Thumbs Down 
  • Repeat Questions / Repeat Missed Intents 
  • NLU Agreements 
  • Complaints (Quantitatively) 
  • Similarity Score (Question and Response) 

With these metrics in mind, I was ready to start building the algorithm. The groundwork had been laid, and it was time to start the real work. I combined my metrics into an algorithm that best fit the layout, assigning a percentage to each metric according to its level of importance. For example, a link click could contribute up to 40% of the score, while a missed intent could bring it down by up to 20%. These percentages would add up to the total score. I assigned weightings ranging from 0.1 to 0.4 to the different metrics, based on internal discussions with the team and on a customer's perspective of how important each metric is to them. 
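
To make the weighting idea concrete, here is a minimal Python sketch of that kind of weighted-sum health score. The metric names, most of the weights, and the clamping to 0-100 are illustrative assumptions rather than the final production values; only the 0.4 link-click and 0.2 missed-intent figures come from the example above.

```python
# Minimal sketch of a weighted health score (illustrative weights, not the production values).
# Positive metrics push the score up, negative metrics pull it down; the result is clamped to 0-100.

POSITIVE_WEIGHTS = {
    "thumbs_up": 0.3,           # messages rated thumbs-up
    "thank_you": 0.2,           # "thank you" (or similar) found in the conversation
    "link_clicks": 0.4,         # a link click can contribute up to 40% of the score
    "positive_sentiment": 0.2,
    "nlu_agreement": 0.3,
    "similarity_score": 0.1,    # question/response similarity
}

NEGATIVE_WEIGHTS = {
    "thumbs_down": 0.3,
    "missed_intents": 0.2,      # a missed intent can bring the score down by up to 20%
    "repeat_missed_intents": 0.2,
    "negative_sentiment": 0.2,
    "complaints": 0.4,
}


def health_score(metrics: dict) -> float:
    """Compute a 0-100 health score from normalised metric values (each in 0..1)."""
    positive = sum(w * metrics.get(name, 0.0) for name, w in POSITIVE_WEIGHTS.items())
    negative = sum(w * metrics.get(name, 0.0) for name, w in NEGATIVE_WEIGHTS.items())
    raw = (positive - negative) * 100
    return max(0.0, min(100.0, raw))
```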

Constructing the code 

With the outline in hand, I stepped into the execution phase. For initial testing, I settled on using ChatGPT to generate some sample data. I simulated three scenarios: good, neutral, and bad chatbot interactions. I gave ChatGPT the metrics I was going to use and told it to simulate the following three scenarios: 

  1. Good Scenario: Users were satisfied, the conversation was rated thumbs-up, and there were positive sentiments in the users’ messages. 
  2. Neutral Scenario: Conversations were neither great nor terrible. 
  3. Bad Scenario: Conversations were rated thumbs-down, most intents were missed, and negative sentiment rose. 

This AI-generated sample data became my testing ground. Each variation was tested and its weightings adjusted until they produced sensible results. The results were solid; all versions performed well, further increasing our confidence in the reliability of our approach. 
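
To give a feel for that testing loop, here is how the three simulated scenarios could be scored with the sketch above; the metric values below are made up purely for illustration.

```python
# Hypothetical metric profiles for the three simulated scenarios (values normalised to 0..1).
scenarios = {
    "good":    {"thumbs_up": 0.9, "link_clicks": 0.7, "positive_sentiment": 0.8, "missed_intents": 0.0},
    "neutral": {"thumbs_up": 0.3, "link_clicks": 0.2, "positive_sentiment": 0.4, "missed_intents": 0.2},
    "bad":     {"thumbs_down": 0.8, "missed_intents": 0.7, "negative_sentiment": 0.9},
}

for name, metrics in scenarios.items():
    print(f"{name}: {health_score(metrics):.1f}")

# A sensible weighting should rank good > neutral > bad; if it doesn't,
# the weights are adjusted and the scenarios re-scored.
```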

However, our algorithm needed to be validated with real user data to ensure its practical applicability and reliability. 

Enter: User Data 

We tested the algorithm with data from the Japeto dashboard, applying it to real user interactions. By analysing actual conversations, we evaluated the chatbot’s performance in real-life scenarios, understanding the algorithm’s strengths and areas for improvement. 

We collected a diverse set of data, considering multiple scenarios: shorter conversations of one or two messages and longer conversations of around 50-60 messages. But wait: not enough data?

A minor setback, but nothing that cannot be fixed!

We set ourselves the task of compiling some data. We tested the chatbot every way we could: good conversations, bad conversations, and even conversations we knew the chatbot was not designed to handle. We wanted to test its limits. 

The data was still raw and had discrepancies. But dissatisfaction is a creative state, they say. I used the opportunity to dig deep into Excel and learn as much as I could about bending data to my will. With plenty of help from my team, who were present at every turn, I managed to filter the data using Excel functions, pivot tables, and cell references. It was a learning curve, but once I had worked through it, filtering the data became easier every time we added new cases. 

After filtering, we had data for 75 conversations in total. We then applied the weightings to this data set, using the Session ID as the unique identifier for each conversation and calculating a score for each one. With scores for every conversation in hand, we continued to use this data set for every variation of the algorithm. It provided a comprehensive view of the chatbot's performance, highlighting its successes and the areas that required enhancement. The algorithm's ability to adapt and provide accurate health scores was put to the test, and the results were analysed critically. 
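
As a rough sketch of that per-conversation step (reusing the health_score function from the earlier sketch, and assuming a filtered export with one row per message and hypothetical column names such as session_id and thumbs_up), the calculation could look something like this in pandas:

```python
import pandas as pd

# Assumed layout: one row per message, with flag/count columns for a subset of the metrics.
df = pd.read_csv("filtered_conversations.csv")

# Aggregate message-level flags into per-conversation metric counts, keyed by Session ID.
per_conversation = df.groupby("session_id").agg(
    thumbs_up=("thumbs_up", "sum"),
    thumbs_down=("thumbs_down", "sum"),
    link_clicks=("link_clicked", "sum"),
    missed_intents=("missed_intent", "sum"),
    messages=("message_id", "count"),
)

# Normalise counts by conversation length, then apply the weighted score to each conversation.
normalised = per_conversation.div(per_conversation["messages"], axis=0).drop(columns="messages")
per_conversation["health_score"] = normalised.apply(lambda row: health_score(row.to_dict()), axis=1)
print(per_conversation[["health_score"]].head())
```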

Through thorough testing and refinement, we were able to fine-tune the algorithm, ensuring it delivered reliable and practical insights. User data confirmed our approach and provided a foundation for improvement. 

The path to optimization 

With the algorithm set and real user data in hand, we compared different chatbot versions. Each variation had a unique approach, and we needed to determine which provided the most accurate and reliable health scores. This comparison was essential for selecting the best version for final implementation. 

Creating a flow chart of the algorithm

The high-penalty version quickly identified critical issues but often exaggerated minor problems, leading to an over-penalized view of the chatbot and inaccurate results (for example, penalizing cases where the user had genuinely typed gibberish and the message was rightfully classed as a missed intent). It was useful for urgent troubleshooting but didn't provide a balanced performance perspective. 

The user feedback emphasis version incorporated both positive and negative interactions, capturing comprehensive user feedback. However, it required high user engagement and risked biased feedback from more vocal users. 

We also replicated the original algorithm and compared its scores. To ensure accuracy, we manually rated all 75 conversations and calculated the difference between the expected values and the algorithm's output. 
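
The comparison itself boils down to simple error statistics. Here is a small sketch, assuming the manual ratings and the algorithm's scores are on the same 0-100 scale; the numbers in the example are made up.

```python
import statistics

def compare(expected: list[float], calculated: list[float]) -> tuple[float, float]:
    """Return the total absolute difference and the standard deviation of the per-conversation errors."""
    errors = [c - e for e, c in zip(expected, calculated)]
    total_abs_diff = sum(abs(err) for err in errors)
    spread = statistics.stdev(errors)
    return total_abs_diff, spread

# Example with made-up ratings for a handful of conversations:
manual = [80, 55, 20, 90, 35]       # scores from manual rating
algorithm = [78, 60, 25, 88, 30]    # scores produced by the algorithm
total, spread = compare(manual, algorithm)
print(f"total difference: {total}, std dev of errors: {spread:.2f}")
```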

The health score in the chatbot dashboard, showing the score for each conversation from 0 to 100

The “only positives” version emerged as the most balanced and reliable. By focusing solely on positive interactions, it highlighted the chatbot’s strengths. This version used thumbs-ups, link clicks, correctly recognized intents, net sentiment score, and NLU agreement, encouraging successful strategies while still offering a balanced view of performance without neglecting areas needing improvement.

Glitch in the matrix: only quantitative data? 

Everything was going well with the “Only Positives” algorithm, but we wondered: what if we removed human intervention entirely? 

We experimented by eliminating human feedback, such as thumbs-ups, and adjusted the weightings for the remaining metrics. The plan was now to create a fully automated evaluation system. 
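
Continuing the earlier sketch, removing the human-feedback metrics is essentially a matter of dropping those weights and rescaling the rest; which metrics count as “human feedback” here, and the rescaling itself, are assumptions for illustration.

```python
# Remove metrics that require a human in the loop (e.g. thumbs-up/thumbs-down ratings),
# then rescale the remaining positive weights so they still cover the same share of the score.
HUMAN_FEEDBACK = {"thumbs_up", "thumbs_down"}

automated_weights = {k: w for k, w in POSITIVE_WEIGHTS.items() if k not in HUMAN_FEEDBACK}
scale = sum(POSITIVE_WEIGHTS.values()) / sum(automated_weights.values())
automated_weights = {k: round(w * scale, 3) for k, w in automated_weights.items()}


def automated_health_score(metrics: dict) -> float:
    """'Only positives' score with no human feedback: link clicks, intents, sentiment, NLU agreement."""
    raw = sum(w * metrics.get(name, 0.0) for name, w in automated_weights.items()) * 100
    return max(0.0, min(100.0, raw))
```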

Uzair presenting the results to the Japeto developer team

The results were astounding. The revised “Only Positives” algorithm, now free from human input, surpassed our expectations. It emerged as the clear winner, with a difference of just 18 between expected and calculated values, and a standard deviation of only 0.14, demonstrating excellent accuracy and consistency. 

The end?

After hours of navigating numerical mazes, scrutinizing data, and working with spreadsheets, we successfully implemented the algorithm into the system. I collaborated closely with the developer team, explaining the processes and cross-checking data until it was fully integrated. 

Reflecting on my four-week journey, this internship was more than just a project—it was an invaluable learning experience. I gained skills in data analysis, machine learning, tokenization, and understanding intents. From improving the chatbot evaluation system to creating a fully autonomous algorithm, the journey was rich with insights and practical knowledge. 

This experience has been a crucial step in my journey towards becoming an ML engineer and has helped me map out my short-term steps towards that goal. I’m indebted to the team that helped me through the journey and hope our paths cross again in the future. 
