This article describes our evaluation methodology, designed to ensure high standards of clinical quality, accuracy, and safety in the deployment of Qu. Our approach is structured around five components: scorecard-based evaluation, vignette-based testing, simulated real-world testing, real-world testing, and post-market monitoring.
We are currently building our corpus of vignettes and performing vignette-based testing; the remaining components of our safety and precision framework will follow.
Each consultation transcript is evaluated using a scorecard to ensure quality and safety. Adapting the OSCE (Objective Structured Clinical Examination) method, we assess Qu on elements such as avoidance of repetitive questions, response latency, and recall and precision measures. Key evaluation aspects are data gathering, clinical management, interpersonal skills, and, where relevant, hypothesis identification and accuracy.
The first three aspects are rated as clear pass, pass, fail, or clear fail (see our whitepaper for the detailed scorecard). Evaluation metrics vary by consultation intent: in health Q&A consultations, for example, no hypotheses are generated, so only data gathering, clinical management, and interpersonal skills are assessed.
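To make the scorecard concrete, here is a minimal sketch of how it could be represented in code. The class and field names are illustrative assumptions, not Quadrivia's actual schema, and for simplicity the hypothesis aspect is rated on the same four-point scale, which the whitepaper may handle differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Rating(Enum):
    """Four-point scale used for each scorecard aspect."""
    CLEAR_PASS = "clear pass"
    PASS = "pass"
    FAIL = "fail"
    CLEAR_FAIL = "clear fail"


@dataclass
class Scorecard:
    """One completed evaluation of a consultation transcript (illustrative)."""
    consultation_intent: str            # e.g. "symptom_assessment", "health_qa"
    data_gathering: Rating
    clinical_management: Rating
    interpersonal_skills: Rating
    # Only scored for intents that generate hypotheses (None for health Q&A).
    hypothesis_accuracy: Optional[Rating] = None

    def overall_pass(self) -> bool:
        """True if every applicable aspect is a pass or clear pass."""
        aspects = [self.data_gathering, self.clinical_management,
                   self.interpersonal_skills]
        if self.hypothesis_accuracy is not None:
            aspects.append(self.hypothesis_accuracy)
        return all(r in (Rating.PASS, Rating.CLEAR_PASS) for r in aspects)
```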
Currently, vignette transcripts are evaluated by accredited clinicians; we plan to automate this process with an AI agent, retaining periodic clinician review to maintain quality.
A clinical vignette is a description of a hypothetical patient’s clinical situation, used for teaching, assessment, and research purposes.
Qu follows a different agenda depending on the patient’s needs in the consultation. We therefore create a separate set of scenarios for each intent and represent each scenario as a vignette, so that each intent has its own set of vignettes.
Vignettes are needed because they let us cover every consultation intent systematically, repeat the same scenarios as Qu evolves, and scale testing without involving real patients in early iterations.
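As an illustration, a vignette could be represented as a small structured record grouped by intent. This is a sketch under our own assumptions: the field names (`findings`, `expected_outcome`) and the example content are invented for exposition, not taken from Quadrivia's vignette corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Vignette:
    """A hypothetical patient scenario, belonging to one consultation intent."""
    intent: str                        # e.g. "symptom_assessment"
    presenting_complaint: str
    # Facts the AI patient is allowed to reveal when asked about them.
    findings: dict[str, str] = field(default_factory=dict)
    expected_outcome: str = ""         # what a correct consultation should conclude


# Each intent gets its own vignette set.
vignette_sets: dict[str, list[Vignette]] = {
    "symptom_assessment": [
        Vignette(
            intent="symptom_assessment",
            presenting_complaint="chest pain for the last two hours",
            findings={"pain radiates to left arm": "yes", "smoker": "no"},
            expected_outcome="urgent escalation",
        ),
    ],
}
```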
To ensure scalability, we developed an “AI patient” that answers Qu’s questions based strictly on vignette data, defaulting to “I do not know” or “no” when information is missing. During testing, a clinician or AI agent interacts with Qu, generating a consultation transcript that is then evaluated for quality and accuracy using our scorecard. Currently, clinicians handle evaluations, with an “AI assessor” in development for future automation.
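The strict-grounding rule for the AI patient can be sketched as follows, reusing the `Vignette` class above. In practice the matching of questions to vignette findings would be done by a language model constrained to the vignette; the crude keyword match below is only a stand-in for that grounding step.

```python
def ai_patient_answer(question: str, vignette: Vignette) -> str:
    """Answer Qu's question strictly from vignette data; never invent facts."""
    q = question.lower().rstrip("?")
    for finding, value in vignette.findings.items():
        if finding.lower() in q:
            return value
    # The vignette is silent: closed (yes/no) questions default to "no",
    # everything else to "I do not know".
    if q.startswith(("do ", "does ", "did ", "have ", "has ", "are ", "is ")):
        return "no"
    return "I do not know"
```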
We conducted simulated real-world testing for Qu, engaging a diverse cohort of clinicians, including healthcare innovators, specialists, and general practitioners. This breadth of expertise provides valuable insight and helps ensure Qu meets the broad needs of the healthcare community.
Each clinician reenacts a consultation from a patient’s perspective, initiating dialogue with Qu and responding based on real-world experience. This authentic setup allows a robust evaluation of Qu’s decision-making and response accuracy.
Each transcript is carefully reviewed by the participating clinician using our Evaluation Scorecard or Clinician Feedback Questionnaire, allowing in-depth analysis of Qu’s performance. This process identifies areas for improvement, ensuring Qu is well-prepared to handle real-world healthcare challenges.
In the Real-World Testing phase, Quadrivia will recruit a diverse group of volunteers representing target demographics. Each participant will recall a recent health issue and consult about it twice: once with Qu and once with a human evaluator. Volunteers will provide the same information in both sessions, ensuring consistent details for comparison.
Consultations will be recorded and scored against the Evaluation Scorecard; key performance indicators include pass rates in both groups and analysis of outcome differences between Qu and the human evaluators.
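Using the `Scorecard` sketch from earlier, the headline KPIs could be computed along these lines. The function names are illustrative, and the statistics shown are the simplest possible summary, not a full outcome analysis.

```python
def pass_rate(scorecards: list[Scorecard]) -> float:
    """Fraction of consultations in which every applicable aspect passed."""
    return sum(s.overall_pass() for s in scorecards) / len(scorecards)


def compare_arms(qu: list[Scorecard], human: list[Scorecard]) -> None:
    """Report pass rates for both arms of the study and their difference."""
    qu_rate, human_rate = pass_rate(qu), pass_rate(human)
    print(f"Qu: {qu_rate:.1%}  human evaluator: {human_rate:.1%}  "
          f"difference: {qu_rate - human_rate:+.1%}")
```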
After market launch, all consultations via Qu will be securely stored as transcripts linked to patient profiles, and automatically evaluated using our Evaluation Scorecard to ensure service quality.
A random selection of transcripts will be reviewed periodically by Quadrivia clinicians, especially after major releases, following regulatory guidelines. Clinicians will assess each transcript with the Evaluation Scorecard and provide feedback, identifying safety and quality issues. Safety issues are prioritized by urgency, while quality issues are reviewed for appropriate solutions. This ongoing process helps Qu continuously align with clinical best practices and deliver accurate, safe symptom assessments for patients.
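One way such a periodic sample could be drawn is sketched below. The 5% default and the function name are assumptions made for illustration; the real sampling rate would follow the regulatory guidelines mentioned above.

```python
import random


def sample_for_review(transcript_ids: list[str],
                      fraction: float = 0.05,
                      seed: int | None = None) -> list[str]:
    """Draw a random subset of post-market transcripts for clinician review.

    The fraction is illustrative and would typically be raised after
    major releases, per the process described above.
    """
    rng = random.Random(seed)
    k = max(1, round(len(transcript_ids) * fraction))
    return rng.sample(transcript_ids, k)
```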
Qu will only be launched in a jurisdiction after it meets all local regulatory and safety standards. These include data privacy and security requirements (e.g., HIPAA compliance in the USA; GDPR and upcoming AI regulations in the EU) as well as medical device regulations (e.g., EU MDR and FDA classification and regulatory approval).