Japeto partnered with the State of Rio de Janeiro to evaluate the accuracy, quality, and relevance of a public-facing AI chatbot serving 500+ real users. The core finding: rigorous chatbot evaluation requires human review, real-world testing, and structured methodology — not just user ratings. For organisations deploying AI in high-stakes settings, this is what a responsible approach looks like.
- Test with real users from the outset — internal testing doesn’t reflect how the public actually interacts with your system
- User feedback alone is unreliable — negative ratings often reflect unmet expectations, not chatbot errors; human review is essential
- Measure three dimensions — accuracy, relevance, and quality; these correlate strongly, meaning one well-designed question can proxy the full picture for ongoing monitoring
- Retrieval-augmented generation (RAG) reduces hallucination risk — grounding responses in a controlled knowledge base is critical for healthcare, government, and social services contexts
- Evaluation is not a pre-launch checkbox — it must be continuous; performance degrades, use cases evolve, and new failure modes emerge after deployment
- Cosine similarity is useful but incomplete — combine it with human review and user feedback for a meaningful picture of chatbot health
At Japeto, we build conversational AI for organisations where getting it wrong has real consequences — charities helping vulnerable people, healthcare providers, educational institutions. So when the opportunity arose to collaborate with the Department of Digital Transformation for the State of Rio de Janeiro on a rigorous chatbot evaluation project, we jumped at it. This post shares what we built, what we found, and what it means for any organisation deploying public-facing AI.
The problem: good intentions aren’t enough
The state of Rio de Janeiro had developed a chatbot to help citizens navigate Portal RJ Digital — a platform consolidating over 2,500 public services covering healthcare, education, social assistance, taxation, and more. The chatbot’s job was to take a user’s plain-language description of their problem and surface the right service with a direct link.
It was a genuinely useful tool. But Madalena Alves dos Santos, Superintendent of Data and Results Management at Rio’s Department of Digital Transformation, was asking the right question: how do we actually know it’s working?
That’s the question that brought us together through The Turing Way Practitioners Hub, a programme that connects organisations working on responsible data science and AI.
As Madalena put it: “Having good intentions is not enough. We need proper methodologies to show how we’re managing risks — especially in government, where accountability and transparency matter so much.”
At Japeto, we’d been grappling with the same challenge. We had a basic internal method for reviewing chatbot conversations, but it wasn’t grounded in systematic data. This collaboration gave us the opportunity to build something better.
The technical setup
Rio’s chatbot prototype used the following stack:
- OpenAI’s API for natural language processing
- LangChain to structure user interactions
- FAISS (Facebook AI Similarity Search) as the vector search engine — comparing user queries against an embedded database of public services
- Flask to deploy the chatbot via HTTP endpoints
- JSON files to store conversations for tracking and analysis
When a user submits a query, the system performs an initial FAISS search. If no sufficiently relevant service is found, an interpretation agent reformulates the query and searches again. Users can also rate responses using like/dislike buttons, feeding back into performance monitoring.
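A minimal sketch of that two-stage flow, assuming a LangChain FAISS index over the service database and an OpenAI chat model acting as the interpretation agent. The index name, distance threshold, model, and prompt are our illustrative assumptions, not Rio's actual code:

```python
# Hypothetical sketch of the two-stage search described above.
# Index path, threshold, model name, and prompt are assumptions.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI

client = OpenAI()
index = FAISS.load_local(
    "services_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True
)

DISTANCE_THRESHOLD = 0.5  # FAISS returns L2 distance by default: lower = closer

def find_service(query: str):
    # First pass: search the embedded service database directly.
    doc, score = index.similarity_search_with_score(query, k=1)[0]
    if score <= DISTANCE_THRESHOLD:
        return doc

    # Second pass: an "interpretation agent" reformulates the query,
    # then the search runs again against the same index.
    reformulated = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rewrite this citizen request as a clear, formal "
                       f"description of a public service: {query}",
        }],
    ).choices[0].message.content

    doc, score = index.similarity_search_with_score(reformulated, k=1)[0]
    return doc if score <= DISTANCE_THRESHOLD else None  # None -> "service not found"
```

The threshold is the key design choice in a setup like this: it decides when the system trusts the first search and when it spends a second, reformulated pass.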
On our side, Japeto built a REST API to pipe data from Rio’s chatbot into our Japeto Chat analytics dashboard in real time. Each conversation record captured:
- The user’s message
- The chatbot’s response
- Cosine similarity score — a measure of how close the chatbot’s answer was to the expected correct answer
- User feedback (like/dislike)
This gave us a structured, queryable dataset to work with.
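To make that concrete, here is an illustrative sketch of the kind of logging endpoint such a pipeline needs. The route, file path, and field names are our own invention, not the production schema:

```python
# Illustrative sketch of a conversation-logging endpoint; field names
# are assumptions based on the record structure described above.
import json
from flask import Flask, request, jsonify

app = Flask(__name__)
LOG_PATH = "conversations.jsonl"

@app.route("/conversations", methods=["POST"])
def log_conversation():
    record = {
        "user_message": request.json["user_message"],
        "bot_response": request.json["bot_response"],
        "cosine_similarity": request.json["cosine_similarity"],  # 0.0-1.0
        "user_feedback": request.json.get("user_feedback"),  # "like"/"dislike"/None
    }
    # Append one JSON record per line for easy downstream querying.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return jsonify({"status": "ok"}), 201
```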
Real-world testing at scale
One of the most important decisions the team made was to test with real users — not developers, not internal staff.
Internal testing has a fundamental limitation: developers interact with systems differently from the people those systems are designed to serve. To get meaningful signal, you need to expose the chatbot to the kinds of queries it will actually receive in production.
The team chose Central do Brasil, a major transport hub in Rio de Janeiro city, as the testing location. With thousands of people passing through daily and a broad socioeconomic mix of visitors, it was ideal for capturing realistic usage patterns.
From 17 to 21 February 2025, passersby were invited to interact with the chatbot on laptop computers. Over 500 people participated, generating approximately 700 conversations.
One significant finding from this phase: spelling and grammatical errors were extremely common in user queries. This isn’t surprising — it reflects how people actually type — but it underscores a design requirement that’s easy to underestimate: chatbots serving general populations need robust tolerance for imperfect input. A system that only handles well-formed queries will fail a large proportion of real users.
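One common, inexpensive mitigation (a sketch of the general technique, not necessarily what Rio's prototype implements) is to normalise input before embedding it, stripping the casing and accent noise that is especially frequent in hastily typed Portuguese:

```python
# Sketch of basic input normalisation before embedding; a generic
# technique, not necessarily the approach used in Rio's prototype.
import unicodedata

def normalise(query: str) -> str:
    # Lowercase and strip diacritics ("Previdência" -> "previdencia"),
    # removing one frequent source of mismatch in Portuguese queries.
    query = query.strip().lower()
    decomposed = unicodedata.normalize("NFKD", query)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(normalise("  Matrícula ESCOLAR  "))  # -> "matricula escolar"
```

Normalisation only covers the cheap cases; genuine misspellings still rely on the embedding model's own tolerance for noisy text.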
A secondary remote testing phase made the chatbot available to a network of micro-entrepreneurs via SEBRAE (the Brazilian Service of Support for Micro and Small Enterprises), though the sample from this channel was smaller.
The evaluation model: why user feedback alone isn’t enough
After testing, Rio’s digital transformation team conducted a human review of the chatbot’s responses. This step deserves explanation, because it’s a point organisations often underestimate.
User frustration ≠ chatbot inaccuracy.
A significant proportion of users were searching for municipal or federal services — services that don’t exist in the state-level Portal RJ Digital database. In those cases, the chatbot correctly returned a “service not found” message. That’s the right answer. But users, understandably, found it frustrating.
If you rely solely on like/dislike ratings, you’ll misattribute these cases as failures. Your evaluation data becomes noisy in a way that’s hard to untangle without human judgment.
Human reviewers assessed responses against three dimensions:
- Quality — clarity and coherence of the chatbot’s response, beyond just whether the right service was returned
- Accuracy — whether the chatbot correctly matched the query to a service in the dataset, without hallucinating incorrect or misleading information
- Relevance — whether the suggested service was genuinely appropriate for the citizen’s actual need
One notable finding from the statistical analysis: relevance, accuracy, and quality scores were closely correlated. This is practically useful — it suggests that a single well-designed evaluation question could serve as a reasonable proxy for a more complex multi-dimensional review. That has implications for scalable, cost-effective monitoring frameworks.
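Once human review scores are tabulated, checking that correlation is straightforward. A minimal sketch with illustrative column names and toy scores:

```python
# Hypothetical check of how closely the three review dimensions track
# each other; column names and scores are illustrative, not Rio's data.
import pandas as pd

reviews = pd.DataFrame({
    "accuracy":  [5, 4, 5, 2, 4],
    "relevance": [5, 4, 4, 2, 5],
    "quality":   [4, 4, 5, 3, 4],
})

# Pairwise Pearson correlations; values near 1 suggest one dimension
# can stand in for the others in lightweight ongoing monitoring.
print(reviews.corr())
```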
Hallucination in high-stakes contexts
Hallucination — where a generative AI system confidently produces incorrect information — is a known risk with LLM-based systems. In this project, the team identified a small number of hallucinated outputs, including invalid links in chatbot responses.
In a consumer product context, a hallucinated answer is a bad experience. In a government context, or in healthcare or social services, the stakes are categorically higher. A chatbot that gives a user the wrong address for a hospital, or incorrect information about eligibility for a benefit, can cause real harm.
This is precisely why retrieval-augmented generation (RAG) — the approach used here, where the model pulls from a defined dataset rather than generating freely — is important for high-stakes deployments. It constrains the space of possible outputs to content that has been vetted. It doesn’t eliminate hallucination risk entirely, but it substantially reduces it.
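In outline, the grounding step of a RAG system looks something like this; the prompt wording and model name are our assumptions, not Rio's production configuration:

```python
# Minimal sketch of the RAG pattern described above: the model is
# instructed to answer only from retrieved, vetted content.
from openai import OpenAI

client = OpenAI()

def answer(query: str, retrieved_services: list[str]) -> str:
    context = "\n".join(retrieved_services)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer ONLY using the services listed below. If none of "
                "them matches the request, say the service was not found. "
                "Do not invent services or links.\n\n" + context
            )},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```

The instruction to admit "service not found" matters as much as the retrieval itself: as the testing phase showed, a correct refusal is the right behaviour when the service genuinely isn't in the dataset.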
It’s also why evaluation can’t be a one-time exercise. After official deployment, Rio’s team will continue monitoring accuracy using the methodology developed through this project — treating evaluation as an ongoing operational process, not a pre-launch checkbox.
What this means for organisations like yours
If you’re a charity, healthcare provider, or educational institution considering or already running a public-facing AI chatbot, here are the practical takeaways from this project:
Test with real users, not proxies. Internal testing has blind spots. The population you design for behaves differently from your internal team. Plan for public testing, even at small scale, before deployment.
Treat user feedback as a signal, not a verdict. Like/dislike data is useful, but it needs to be interpreted carefully. Negative feedback may reflect unmet expectations rather than system errors. Human review adds the context that automated signals can’t capture.
Build evaluation in from the start. An evaluation framework is much easier to design before a system goes live than after. Know what you’re measuring — accuracy, relevance, quality — and how you’ll measure it.
RAG is your friend in high-stakes settings. Grounding your chatbot in a defined, controlled knowledge base reduces hallucination risk and makes the system’s behaviour more predictable and auditable.
Cosine similarity is a useful but incomplete metric. It tells you how semantically close a response is to an expected answer, but it doesn’t capture everything that matters. Combine it with human review and user feedback for a fuller picture.
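For reference, the metric itself is just the normalised dot product of two embedding vectors. A toy sketch:

```python
# Cosine similarity between two embedding vectors: 1.0 = same direction,
# 0.0 = unrelated. The vectors here are toy examples, not real embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([1.0, 2.1, 2.9])))  # ~0.999
```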
One good question may be enough for ongoing monitoring. If your quality, accuracy, and relevance measures are correlated, you may not need to ask all three every time. A single well-calibrated question can give you actionable signal at lower cost.
Looking ahead
This project gave us at Japeto a clearer path towards a generalisable methodology for evaluating chatbot performance — one that can be applied across the kinds of sensitive, high-stakes contexts we work in. We’re now better positioned to compare accuracy across different LLM backends and to build evaluation into client projects from the outset, rather than treating it as an afterthought.
For Rio’s digital transformation team, the next step is moving from prototype to production — with a continuous monitoring framework in place from day one.
We’re proud of what this cross-organisational collaboration produced, and grateful to The Turing Way Practitioners Hub for creating the conditions that made it possible.