
Admin guide 204: Evaluating Atom's responses

Written by Riya Sebastian

As your employees rely on Atom to answer questions and guide them through requests, it becomes essential to ensure that Atom’s responses are accurate, consistent, and aligned with the expectations of your organization. Over time, changes to your knowledge base, service catalogs, workflows, permissions, or even updates to Atom itself can alter the answers employees receive.

With the Test suite, admins can test Atom’s responses to expected employee questions at scale. You can:

  • Evaluate how Atom answers real questions before and after changes

  • Establish a “baseline” for expected answer quality

  • Track improvements or drops in accuracy

  • Simulate how different employees experience Atom

  • Maintain long-term reliability as your content and processes evolve

A baseline or golden dataset is a collection of your key questions and expected answers that acts as the benchmark for how Atom should ideally respond.

This provides a repeatable, scalable way to maintain Atom’s accuracy and deepen trust in its answers, ensuring consistent performance as your organization grows and adapts.


Why testing matters

Establishing a baseline and testing Atom against it regularly is important because:

  • Answer accuracy shifts over time - As you add new content or update catalogs, Atom may interpret information differently. Testing helps you catch these changes early.

  • Employees experience Atom differently - Testing as a specific user or role ensures that Atom applies the right permissions, visibility rules, and access levels.

  • Your organization evolves - As priorities, processes, or policies change, new questions may become more important than they were at launch. Testing helps you keep Atom aligned with what matters today.

  • You need a clear benchmark for quality - A baseline dataset gives you a consistent reference point for evaluating whether Atom’s answers are accurate and reliable.


How testing Atom works

Testing Atom follows a three-stage process:

  1. Create a dataset: Import the questions you want to test.

  2. Review expected answers: Decide what the “correct” answer should look like. This becomes the baseline against which Atom is evaluated.

  3. Run tests and review results: Compare Atom’s answers in each test with your expected ones to understand how well Atom is performing.

Each dataset is tied to one specific user, ensuring consistent permission-based testing.

Creating a dataset

To create a new dataset to test Atom:

  • Navigate to Settings > Assistant > Test. You can see all the datasets created for your workspace and their latest run status here.

  • Click on New dataset.

  • Give your dataset a clear name that indicates what you’re testing, for example IT onboarding questions or HR policy checks.

  • Choose the user you want to test as. Atom responds exactly as it would for that employee, based on their permissions. For example, an HR partner may see policy details that a new hire cannot.

Each dataset supports only one user at a time. If you want to test for multiple user types, create a separate dataset for each; testing with different users helps you verify each role or region independently.

  • Upload a CSV file with your list of Questions. You can also include a column of Expected answers if you already have them. A minimal sketch of preparing such a file follows these steps.

  • Click on Create dataset.

Atomicwork then processes your questions and generates Atom’s first-pass answer for any question that doesn’t have an expected answer. You’ll be notified by email when the dataset is ready to review.
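If you prefer to build the CSV programmatically, here is a minimal sketch. The column headers (Question, Expected answer) and the sample rows are assumptions based on this guide; match the headers to the template shown in the New dataset upload dialog before importing.

```python
import csv

# Illustrative rows only; replace with your organization's real questions.
rows = [
    {
        "Question": "How do I view my payslip?",
        "Expected answer": "Open the HR portal, go to Pay > Payslips, and select the month you need.",
    },
    {
        "Question": "How do I request a new laptop?",
        # Leave blank to let Atom generate a first-pass answer for you to review.
        "Expected answer": "",
    },
]

# Column headers are an assumption; check them against the upload template.
with open("it_onboarding_questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Question", "Expected answer"])
    writer.writeheader()
    writer.writerows(rows)
```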

Reviewing and setting expected answers

Next, review the generated answers in your dataset and decide what a “correct answer” should look like for each question.

This is not the same as verified answers — you’re not creating a fixed response for Atom to use. You’re defining a benchmark to evaluate whether Atom’s future answers remain accurate.

For example, if your expected answer for “How do I view my payslip?” includes the correct steps for a US-based employee, Atom won’t repeat that answer word-for-word in the future. Instead, it will generate a response that conveys the same steps and information. The expected answer simply helps you confirm that future responses remain accurate.
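To make the distinction concrete, the sketch below illustrates checking that an answer conveys the same key facts rather than matching the baseline word-for-word. It is purely illustrative and is not Atomicwork’s evaluation logic; the portal names and steps are invented for the example.

```python
def conveys_same_facts(answer: str, required_facts: list[str]) -> bool:
    """Treat the expected answer as a set of facts, not a fixed sentence."""
    text = answer.lower()
    return all(fact.lower() in text for fact in required_facts)

# Key facts distilled from the expected answer (illustrative values).
required_facts = ["HR portal", "Pay > Payslips", "select the month"]

baseline = "Open the HR portal, go to Pay > Payslips, and select the month you need."
reworded = "In the HR portal, head to Pay > Payslips and select the month you want to view."

print(conveys_same_facts(baseline, required_facts))  # True
print(conveys_same_facts(reworded, required_facts))  # True, despite different wording
```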

  • Click on Review dataset for the dataset marked as Ready to review.

  • The review screen shows the list of questions in the dataset and whether an expected answer has been set for each.

  • Click on a question to view and edit its expected answer.

  • Click on Save to save any changes to the answer.

  • Once you've reviewed everything, click on Save at the top right.

Understanding dataset versioning

Dataset versioning tracks how your dataset changes over time, without requiring you to recreate or manually compare earlier versions.

How versioning works

  • Version 1 is created after your first complete review.

  • When you add, remove, or edit questions or expected answers, Atomicwork automatically creates a new version.

  • Each test run is linked to the specific version it used, so results always reflect the exact question set that was tested.

Versioning helps you:

  • Know exactly which version of your dataset was tested

  • Add or adjust questions without affecting earlier accuracy records

  • Maintain a clear history of how your data — and Atom’s performance — has improved

It ensures you’re always testing Atom against the right, up-to-date baseline while preserving past evaluations for reference.
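As a mental model, each test run carries a pointer to the dataset version it evaluated. The sketch below only illustrates that relationship; it is not Atomicwork’s internal data model.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DatasetVersion:
    version: int            # Version 1 is created after the first complete review
    questions: list[str]    # Adding, removing, or editing questions creates a new version

@dataclass
class TestRun:
    dataset_version: DatasetVersion   # Each run stays pinned to the version it used
    run_at: datetime
    success_rate: float

# A run recorded against version 1 keeps its results even after edits produce
# version 2, so earlier accuracy records are preserved while new runs use the
# latest baseline.
```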

Running a test

  • Navigate to your dataset and click on Run test at the top right.

  • Select the user to test as and click on Create run.

  • While a test is running, the dataset shows an In progress status.

Only one dataset can be run per workspace at a time to ensure accuracy and avoid conflicting results.

  • You’ll be notified by email when the test run is complete.

  • You will see an evaluation summary showing how closely Atom’s answers matched your expected ones. The summary includes:

    • Tested for – The user whose permissions and visibility were used during the test.

    • Success rate – The percentage of questions where Atom’s actual answer was close enough to your expected answer.

    • Tested by – The admin who initiated the test run. This helps your team keep track of who performed each evaluation.

    • Date and time – When the test was run, allowing you to track performance changes over time or compare results across versions.

  • Each question in the dataset is marked as Pass or Fail. A pass means Atom’s answer was close enough to the expected answer.

  • Click on any question to open the detailed comparison view, where you can see Atom’s actual answer, your expected answer, and the failure reason if the question failed.

Common reasons for a Fail include:

  • Missing or outdated knowledge

  • Unclear content

  • Permission constraints

  • Catalog mismatches

  • Differences in phrasing or factual detail

  • You can update your content, catalogs or expected answers, and rerun the same dataset to confirm improvements.

  • Click on Export at the top right for deeper offline analysis.
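If you export the results, a short script along these lines can summarize a run offline. The file name and the question, result, and failure_reason column names are assumptions for illustration; adjust them to match the columns in your actual export.

```python
import pandas as pd

# Column names are assumptions; check them against your exported file.
results = pd.read_csv("test_run_export.csv")

total = len(results)
passed = (results["result"] == "Pass").sum()
print(f"Success rate: {passed / total:.0%} ({passed} of {total} questions)")

# Group failures by reason to see where content, catalogs, or permissions need work.
failures = results[results["result"] == "Fail"]
print(failures["failure_reason"].value_counts())
print(failures[["question", "failure_reason"]].head(10))
```

For example, if 18 of 20 questions pass, the success rate is 90%.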


Best practices for effective testing

  • Build a representative “golden dataset.”

    Include the questions that matter most to your organization, whether they come from policies, workflows, or real employee conversations.

  • Test before and after any change.

    Any update to content, catalogs, or configuration may affect how Atom answers.

  • Use different users to test different roles or regions.

    Employees with different access may receive different responses.

  • Keep your expected answers clear and factual.

    The clearer the baseline, the easier it becomes to catch deviations.

  • Iterate your dataset over time.

    As real usage surfaces new priorities, evolve your dataset using versioning.

  • Review failed cases thoroughly.

    Failures often point to gaps in content, permissions, or phrasing that can be improved.


Frequent testing gives you a reliable, repeatable way to understand and improve how Atom responds to employees. By creating datasets, reviewing expected answers, and running tests over time, you can confidently maintain answer accuracy even as your organization’s content and processes change.

This ensures you are not only deploying AI but also maintaining it responsibly, reinforcing trust, and delivering consistent value to employees at every stage.
