Evaluating Chatbots: Technical Rigour and Human Experience

Chatbots are everywhere now. They answer customer questions, help you book tickets, and even act like digital assistants. They’re powered by large language models, which is just a fancy way of saying computers trained on lots of text so they can “talk” like us.

Making a chatbot is no longer the hard part. Despite their ubiquity, many organisations lack a clear understanding of how well their chatbots actually perform.

Evaluating a chatbot effectively is far more complex than building one, especially if you’ve built it on a platform that offers limited access to evaluation metrics. We’ll break chatbot evaluation into two sides: technical performance and human experience.

The first side is technical: this is about numbers, data, and AI performance metrics that measure how accurate, efficient and reliable the chatbot is. The second side is human: this is about how people feel when they use the chatbot, and whether it’s helping them or just annoying them.

This article examines chatbot evaluation through both lenses, providing a guide for organisations aiming to optimise performance and user satisfaction.

Part One: Technical Evaluation of Chatbots

First we’ll break down the technical evaluation side, which focuses on measurable aspects of chatbot performance. These are the AI chatbot performance metrics that tell us whether the machine is doing its job properly.

Retrieval and Search Performance

A lot of commercial chatbots rely on Retrieval-Augmented Generation (RAG), combining the reasoning abilities of LLMs with precise search systems. Two key metrics evaluate this architecture:

Search Stability: This measures whether the chatbot consistently interprets variations of the same question. For example, “When is my bill due?” and “What’s the date of my next invoice?” should trigger the same answer. Stability across phrasing indicates robust understanding.
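To make this concrete, here is a toy sketch of a stability check. The retriever is a deliberately simple word-overlap lookup (a real system would use embeddings or a search index), and the score is just the share of paraphrases that agree with the majority retrieval result; the documents and variants are made-up illustration data.

```python
from collections import Counter

def retrieve(query: str, docs: dict) -> str:
    """Toy retriever: return the doc id sharing the most words with the query."""
    q_words = set(query.lower().replace("?", "").split())
    scores = {doc_id: len(q_words & set(text.lower().split()))
              for doc_id, text in docs.items()}
    return max(scores, key=scores.get)

def stability(paraphrases: list, docs: dict) -> float:
    """Fraction of paraphrases that agree with the most common retrieved doc."""
    results = [retrieve(q, docs) for q in paraphrases]
    top_doc, count = Counter(results).most_common(1)[0]
    return count / len(results)

docs = {
    "billing": "your bill invoice is due on the date shown in billing",
    "passwords": "reset your password from the account security page",
}
variants = ["When is my bill due?",
            "What's the date of my next invoice?",
            "bill due date"]
print(stability(variants, docs))  # 1.0 — all variants hit the billing doc
```

A score below 1.0 would flag phrasings that send users to the wrong answer.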

Search Relevance: This assesses whether the retrieved information is useful and appropriate. Traditional search uses metrics like Normalised Discounted Cumulative Gain (NDCG) or Precision@K, but chatbots often return only the top response. Relevance can be determined via curated datasets, A/B testing, or LLM-based evaluators that simulate human judgement.
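For reference, Precision@K and NDCG can each be computed in a few lines. The graded judgements below are illustrative, not real data:

```python
import math

def precision_at_k(relevances: list, k: int) -> float:
    """Precision@K: fraction of the top-K results judged relevant (1) vs not (0)."""
    return sum(relevances[:k]) / k

def ndcg_at_k(relevances: list, k: int) -> float:
    """NDCG@K: discounted gain of the actual ranking vs the ideal reordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded judgements for one query's ranked results (higher = more relevant)
judged = [3, 2, 0, 1]
print(round(precision_at_k([1 if r > 0 else 0 for r in judged], k=3), 3))  # 0.667
print(round(ndcg_at_k(judged, k=4), 3))                                    # 0.985
```

For a chatbot that returns only a single answer, Precision@1 (is the top answer relevant, yes or no?) is often the most honest version of this metric.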

High-quality retrieval ensures that chatbots consistently deliver accurate, contextually appropriate information.

A curated dataset is a carefully selected and organised collection of data that’s been prepared to serve a specific purpose. Unlike raw or unfiltered data, a curated dataset is reviewed, cleaned, and structured to ensure quality, relevance, and consistency.

Accuracy and Groundedness

Even the most advanced chatbots can make mistakes. Generative AI models are designed to produce answers in natural language, but they sometimes provide false information. This phenomenon is known as hallucination, where the AI generates a confident-sounding answer that isn’t actually based on verified facts.

The prevalence of hallucination varies depending on the context and AI model. For highly specialised legal queries, hallucination rates can be as high as 69–88%. However, more general, open-domain models report lower rates, around 15–30%. Advanced systems can achieve hallucination rates below 1%, particularly when the AI is designed to reference verified sources.

Groundedness addresses this issue by ensuring that every piece of information the chatbot provides can be traced back to a reliable, verified source, such as a document, database, or official website. In other words, groundedness acts as a built-in fact-check that prevents the AI from making unsupported claims.

In RAG systems, groundedness works through a two-step process. First, the chatbot retrieves relevant information from trusted sources. Then, it generates a response using only that retrieved information. This ensures that the AI bases its output on real data rather than its own assumptions.
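A minimal sketch of that two-step flow, with a toy keyword retriever and a hypothetical knowledge base standing in for the real data sources (a production system would pass the retrieved passages to an LLM as its only context, rather than returning them directly):

```python
# Hypothetical knowledge base; in practice this is a document store or database.
KNOWLEDGE_BASE = {
    "billing": "Invoice INV-204 is due on 3 October.",
    "security": "Passwords can be reset from the account settings page.",
}

def retrieve(query: str) -> list:
    """Step 1: pull passages whose words overlap the query (toy retriever)."""
    q = set(query.lower().split())
    return [text for text in KNOWLEDGE_BASE.values()
            if q & set(text.lower().split())]

def grounded_answer(query: str) -> str:
    """Step 2: answer only from retrieved passages; refuse if nothing found."""
    passages = retrieve(query)
    if not passages:
        return "I couldn't find that in our records."  # no guessing
    # A real system would generate from `passages` only, citing the source.
    return passages[0]

print(grounded_answer("When is my invoice due?"))
```

The key design choice is the refusal branch: an ungrounded model would guess, while this one admits it has no supporting source.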

For example, if a user asks, “When is my next invoice due?” a chatbot without groundedness might guess, “Next Monday,” which could be completely inaccurate. A grounded chatbot, on the other hand, checks the verified billing records and gives the correct date.

Task Completion

This one is simple: did the chatbot do what you asked? If you wanted to reset your password, did it help you reset it? If you asked to book an appointment, did the booking go through? Task completion is one of the clearest AI chatbot performance metrics.
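Measured over logged sessions, task completion is just a rate. The session fields below are illustrative, not a real logging schema:

```python
def task_completion_rate(sessions: list) -> float:
    """Share of sessions where the user's stated goal was actually completed."""
    completed = sum(1 for s in sessions if s["goal_completed"])
    return completed / len(sessions)

sessions = [
    {"goal": "reset_password", "goal_completed": True},
    {"goal": "book_appointment", "goal_completed": True},
    {"goal": "cancel_order", "goal_completed": False},
    {"goal": "reset_password", "goal_completed": True},
]
print(task_completion_rate(sessions))  # 0.75
```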

Intelligence and Response Quality

Sometimes a chatbot is technically correct but unhelpful. If you ask, “What’s the weather?” and it says, “I can’t check live data, look it up online,” that’s correct but not very intelligent. A better answer would suggest useful sources or explain why it can’t give real-time data. Developers measure qualities like coherence (does it make sense?), originality (is it thoughtful, not copy-paste?), and adaptability (can it handle new contexts?). These make a chatbot feel “smarter.”

Reliability: No-Response and Fallback Rates

What happens when the chatbot doesn’t know the answer? Some chatbots stay silent, while others say “I’m sorry, I don’t know.”

The no-response rate measures how often a chatbot fails to provide any answer at all when a user submits a query. A high no-response rate indicates that the system is struggling to interpret or handle certain types of questions, leaving users without the information they need.

The fallback rate, on the other hand, tracks instances where the chatbot responds with a default message, such as “I’m sorry, I don’t know,” or redirects the user to a human agent. While fallback messages are sometimes necessary, a high fallback rate can signal that the chatbot is brittle and unable to handle a wide range of interactions.
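Both rates fall straight out of a response log. The fallback phrases below are assumed examples, and `None` stands in for a turn where the bot produced nothing at all:

```python
# Assumed set of canned fallback replies for this hypothetical bot.
FALLBACKS = {"I'm sorry, I don't know.", "Let me connect you to an agent."}

def reliability_rates(responses: list) -> tuple:
    """Return (no-response rate, fallback rate) over a log of bot replies."""
    n = len(responses)
    no_response = sum(1 for r in responses if r is None) / n
    fallback = sum(1 for r in responses if r in FALLBACKS) / n
    return no_response, fallback

log = ["Your bill is due on Friday.", None,
       "I'm sorry, I don't know.", "Password reset link sent.",
       "I'm sorry, I don't know."]
no_resp, fb = reliability_rates(log)
print(no_resp, fb)  # 0.2 0.4
```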

Wasted Resources

Chatbots don’t have infinite memory. They use tokens (small chunks of text) to keep track of conversations. If they lose context, they can waste tokens by repeating themselves or forgetting important details. Token usage metrics help developers see where resources are wasted, which matters because wasted tokens cost more money and make responses slower.
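One crude but useful signal is counting tokens spent on verbatim-repeated replies. This sketch approximates tokens as whitespace-separated words; a real audit would use the model’s own tokenizer:

```python
def wasted_tokens(turns: list) -> int:
    """Rough count of tokens spent on verbatim-repeated bot replies.
    Tokens approximated as words; real systems use the model's tokenizer."""
    seen = set()
    wasted = 0
    for reply in turns:
        if reply in seen:
            wasted += len(reply.split())  # repeated answer = wasted budget
        seen.add(reply)
    return wasted

turns = ["Your invoice is due Friday.",
         "You can pay by card or transfer.",
         "Your invoice is due Friday."]  # bot lost context and repeated itself
print(wasted_tokens(turns))  # 5
```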

Bringing It Together: The Jchat Health Score

Looking at dozens of separate metrics can get messy. That’s why Jchat created a single health score that combines all these checks into one easy number.

The health score combines accuracy, relevance, message quality and user feedback. We don’t just use thumbs-up or thumbs-down, but also subtler signals, such as whether people rephrase their questions or whether their mood improves. The health score puts each signal on a normalised scale, averages them over the whole conversation, and produces a single number.
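The exact signals and weights in Jchat’s score aren’t specified here, so the sketch below only shows the general shape of such a composite: normalised signals, a weighted average, scaled to 0–100. The signal names and weights are purely illustrative.

```python
def health_score(signals: dict, weights: dict) -> float:
    """Weighted average of per-conversation signals, each already
    normalised to 0..1, scaled to a single 0..100 score."""
    total_w = sum(weights.values())
    score = sum(signals[name] * w for name, w in weights.items()) / total_w
    return round(score * 100, 1)

# Hypothetical signals for one conversation (names are illustrative)
signals = {"accuracy": 0.9, "relevance": 0.8,
           "message_quality": 0.7, "user_feedback": 1.0}
weights = {"accuracy": 3, "relevance": 3,
           "message_quality": 2, "user_feedback": 2}
print(health_score(signals, weights))  # 85.0
```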

Part Two: Human-Centred Evaluation

Now let’s talk about the non-technical side. You can build a chatbot that scores perfectly on AI metrics, but if people don’t like using it, it’s still a failure. This is where we measure the human experience.

Engagement and Usage

First, we need to ask: are people even using it? Activity volume tells us how many conversations are happening. Retention rate shows whether people come back for a second try. If people only use the chatbot once and never return, that’s a sign they didn’t find it helpful. Looking at usage by time of day can also show if the chatbot really acts like a 24/7 service, or if it’s mostly ignored.
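Retention reduces to a simple question over session logs: what share of users came back? The per-user session counts below are made-up:

```python
def retention_rate(user_sessions: dict) -> float:
    """Share of users with more than one session, i.e. who came back."""
    returning = sum(1 for count in user_sessions.values() if count > 1)
    return returning / len(user_sessions)

sessions_per_user = {"u1": 3, "u2": 1, "u3": 2, "u4": 1}
print(retention_rate(sessions_per_user))  # 0.5
```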

Bounce Rates and Conversation Length

A bounce happens when someone starts chatting but leaves right away without getting anything useful. High bounce rates mean the chatbot isn’t answering the right questions or is badly placed in the customer journey. Conversation length matters too. Very short chats might mean people gave up, while very long chats could mean the chatbot is taking too long to reach an answer.
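Both signals come straight from turn counts per session. The one-turn bounce threshold and the counts below are illustrative assumptions:

```python
from statistics import mean

def bounce_rate(session_turns: list, bounce_at: int = 1) -> float:
    """Share of sessions abandoned after at most `bounce_at` user turns."""
    return sum(1 for t in session_turns if t <= bounce_at) / len(session_turns)

turn_counts = [1, 6, 1, 4, 9, 1, 3]  # user turns per session
print(round(bounce_rate(turn_counts), 2))  # 0.43
print(round(mean(turn_counts), 2))         # 3.57 average conversation length
```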

User Sentiment

Explicit feedback such as thumbs-up or thumbs-down is relatively rare. Therefore, evaluation increasingly relies on indirect signals. Sentiment analysis evaluates whether the user’s emotional tone improves during the conversation. Positive shifts suggest the chatbot is helpful and empathetic; persistent negative sentiment signals potential issues with clarity, tone, or guidance.
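A sentiment shift can be sketched as the score of the last user turn minus the first. The tiny word lists here are stand-ins; production systems use a trained sentiment model rather than a lexicon:

```python
# Toy lexicons — illustrative only, not a real sentiment vocabulary.
POSITIVE = {"great", "thanks", "perfect", "helpful"}
NEGATIVE = {"useless", "wrong", "annoying", "confused"}

def sentiment(message: str) -> int:
    """Toy lexicon score: +1 per positive word, -1 per negative word."""
    words = [w.strip(",.!?") for w in message.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_shift(user_turns: list) -> int:
    """Positive shift = the user's mood improved over the conversation."""
    return sentiment(user_turns[-1]) - sentiment(user_turns[0])

turns = ["This is useless, my bill is wrong",
         "Okay, show me the invoice",
         "Perfect, thanks, that was helpful"]
print(sentiment_shift(turns))  # 5
```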

Confusion Indicators

One of the clearest signals of trouble is when users rephrase their questions. If I ask, “How do I reset my password?” and then immediately ask, “Okay, but how do I reset my account login password?” it means the first answer didn’t make sense. Most companies ignore this signal because it’s harder to measure than a thumbs-up click, but it’s extremely valuable, which is why we include it in the health score.
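A simple way to detect rephrasing is word-overlap (Jaccard) similarity between consecutive user turns; the 0.4 threshold is an assumed value you would tune on real logs:

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two queries (0..1)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def is_rephrase(prev: str, curr: str, threshold: float = 0.4) -> bool:
    """Flag consecutive user turns that are near-duplicates — a likely
    sign the first answer didn't land."""
    return jaccard(prev, curr) >= threshold

q1 = "how do i reset my password"
q2 = "how do i reset my account login password"
print(is_rephrase(q1, q2))  # True
```

Embedding similarity would catch rephrasings that share meaning but few words; the word-overlap version is just the cheapest baseline.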

Goal Alignment

Ultimately, the most important measure of user experience is whether the chatbot enables users to achieve their goals efficiently. This could include completing transactions, obtaining accurate information, or resolving issues without escalation. Goal completion rates connect the technical and human perspectives, demonstrating whether a technically competent chatbot delivers meaningful outcomes.

From a business perspective, one of the most important measures of a chatbot’s effectiveness is whether it helps the organisation achieve its objectives. For example, a chatbot can reduce pressure on staff by handling routine queries, freeing people to deal with more complex work. Constant availability is another major benefit for organisations.

Goal alignment measures whether the chatbot is contributing tangible value to an organisation and its users.

Trust and Responsibility

Finally, there’s the matter of trust. Users must be confident that the chatbot won’t expose their personal information, give harmful answers, or pretend to be something it’s not. A chatbot that fails in this area might still score high on technical metrics, but people won’t use it because it feels unsafe or untrustworthy.

Conclusion

Evaluating chatbots is a bit like checking both the engine and the driver. On the technical side, you need AI performance metrics like accuracy, relevance, task completion and resource efficiency. On the human side, you need to measure how people feel, whether they are confused or satisfied, and whether they actually achieved their goals.

The smartest organisations treat evaluation as an ongoing process, not a one-time test. Systems like Jchat’s health score show how you can pull together all these signals, technical and human, into one clear, interpretable measure.

In the end, a chatbot is successful only if it helps people get things done without frustration. Anything else is just noise dressed up as intelligence.

Emily Coombes

Hi! I'm Emily, a content writer at Japeto and an environmental science student.
