This article describes our evaluation methodology, designed to ensure high standards of clinical quality, accuracy, and safety in the deployment of Qu. Our approach is structured around five components: scorecard-based evaluation, vignette-based testing, simulated real-world testing, real-world testing, and post-market monitoring.
We are currently building our corpus of vignettes and performing vignette-based testing; the remaining components of our safety and precision framework will follow.
Each consultation transcript is evaluated using a scorecard to ensure quality and safety. Adapting the OSCE (Objective Structured Clinical Examination) method, we assess Qu on elements such as avoidance of repetitive questions, response latency, and recall and precision measures. Key evaluation aspects are data gathering, clinical management, interpersonal skills, and, where relevant, hypothesis identification and accuracy.
The first three aspects are rated as clear pass, pass, fail, or clear fail (see our whitepaper for the detailed scorecard). Evaluation metrics vary by consultation intent: in health Q&A consultations, for example, no hypotheses are generated, so only data gathering, clinical management, and interpersonal skills are assessed.
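To make the scorecard concrete, here is a minimal sketch of how it could be represented in code. The class and field names are illustrative assumptions, not Quadrivia's actual schema, and for simplicity the hypothesis aspect is rated on the same four-point scale, which the whitepaper may handle differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Rating(Enum):
    """Four-point scale used for each scorecard aspect."""
    CLEAR_PASS = "clear pass"
    PASS = "pass"
    FAIL = "fail"
    CLEAR_FAIL = "clear fail"


@dataclass
class Scorecard:
    """One completed evaluation of a consultation transcript (illustrative)."""
    consultation_intent: str            # e.g. "symptom_assessment", "health_qa"
    data_gathering: Rating
    clinical_management: Rating
    interpersonal_skills: Rating
    # Only scored for intents that generate hypotheses (None for health Q&A).
    hypothesis_accuracy: Optional[Rating] = None

    def overall_pass(self) -> bool:
        """True if every applicable aspect is a pass or clear pass."""
        aspects = [self.data_gathering, self.clinical_management,
                   self.interpersonal_skills]
        if self.hypothesis_accuracy is not None:
            aspects.append(self.hypothesis_accuracy)
        return all(r in (Rating.PASS, Rating.CLEAR_PASS) for r in aspects)
```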
Currently, vignette transcripts are evaluated by accredited clinicians; we plan to automate this process with an AI agent, retaining periodic clinician review to maintain quality.
A clinical vignette is a description of a hypothetical patient’s clinical situation, used for teaching, assessment, and research purposes.
Qu follows a different agenda depending on the patient’s needs in the consultation. We therefore create a separate set of scenarios for each intent and represent each scenario as a vignette, so that each intent has its own set of vignettes.
Vignettes are needed because they let us cover every consultation intent systematically, repeat the same scenarios as Qu evolves, and scale testing without involving real patients in early iterations.
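As an illustration, a vignette could be represented as a small structured record grouped by intent. This is a sketch under our own assumptions: the field names (`findings`, `expected_outcome`) and the example content are invented for exposition, not taken from Quadrivia's vignette corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Vignette:
    """A hypothetical patient scenario, belonging to one consultation intent."""
    intent: str                        # e.g. "symptom_assessment"
    presenting_complaint: str
    # Facts the AI patient is allowed to reveal when asked about them.
    findings: dict[str, str] = field(default_factory=dict)
    expected_outcome: str = ""         # what a correct consultation should conclude


# Each intent gets its own vignette set.
vignette_sets: dict[str, list[Vignette]] = {
    "symptom_assessment": [
        Vignette(
            intent="symptom_assessment",
            presenting_complaint="chest pain for the last two hours",
            findings={"pain radiates to left arm": "yes", "smoker": "no"},
            expected_outcome="urgent escalation",
        ),
    ],
}
```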
To ensure scalability, we developed an “AI patient” that answers Qu’s questions based strictly on vignette data, defaulting to “I do not know” or “no” when information is missing. During testing, a clinician or AI agent interacts with Qu, generating a consultation transcript that is then evaluated for quality and accuracy using our scorecard. Currently, clinicians handle evaluations, with an “AI assessor” in development for future automation.
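The strict-grounding rule for the AI patient can be sketched as follows, reusing the `Vignette` class above. In practice the matching of questions to vignette findings would be done by a language model constrained to the vignette; the crude keyword match below is only a stand-in for that grounding step.

```python
def ai_patient_answer(question: str, vignette: Vignette) -> str:
    """Answer Qu's question strictly from vignette data; never invent facts."""
    q = question.lower().rstrip("?")
    for finding, value in vignette.findings.items():
        if finding.lower() in q:
            return value
    # The vignette is silent: closed (yes/no) questions default to "no",
    # everything else to "I do not know".
    if q.startswith(("do ", "does ", "did ", "have ", "has ", "are ", "is ")):
        return "no"
    return "I do not know"
```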
We conducted simulated real-world testing for Qu, engaging a diverse cohort of clinicians, including healthcare innovators, specialists, and general practitioners. This breadth of expertise provides valuable insight and helps ensure Qu meets the broad needs of the healthcare community.
Each clinician reenacts a consultation from a patient’s perspective, initiating dialogue with Qu and responding based on real-world experience. This authentic setup allows a robust evaluation of Qu’s decision-making and response accuracy.
Each transcript is carefully reviewed by the participating clinician using our Evaluation Scorecard or Clinician Feedback Questionnaire, allowing in-depth analysis of Qu’s performance. This process identifies areas for improvement, ensuring Qu is well-prepared to handle real-world healthcare challenges.
In the Real-World Testing phase, Quadrivia will recruit a diverse group of volunteers representing target demographics. Each participant will recall a recent health issue and consult about it twice: once with Qu and once with a human evaluator. Volunteers will provide the same information in both sessions, ensuring consistent details for comparison.
Consultations will be recorded and scored against the Evaluation Scorecard; key performance indicators include pass rates in both groups and analysis of outcome differences between Qu and the human evaluators.
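Using the `Scorecard` sketch from earlier, the headline KPIs could be computed along these lines. The function names are illustrative, and the statistics shown are the simplest possible summary, not a full outcome analysis.

```python
def pass_rate(scorecards: list[Scorecard]) -> float:
    """Fraction of consultations in which every applicable aspect passed."""
    return sum(s.overall_pass() for s in scorecards) / len(scorecards)


def compare_arms(qu: list[Scorecard], human: list[Scorecard]) -> None:
    """Report pass rates for both arms of the study and their difference."""
    qu_rate, human_rate = pass_rate(qu), pass_rate(human)
    print(f"Qu: {qu_rate:.1%}  human evaluator: {human_rate:.1%}  "
          f"difference: {qu_rate - human_rate:+.1%}")
```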
After market launch, all consultations via Qu will be securely stored as transcripts linked to patient profiles, and automatically evaluated using our Evaluation Scorecard to ensure service quality.
A random selection of transcripts will be reviewed periodically by Quadrivia clinicians, especially after major releases, following regulatory guidelines. Clinicians will assess each transcript with the Evaluation Scorecard and provide feedback, identifying safety and quality issues. Safety issues are prioritized by urgency, while quality issues are reviewed for appropriate solutions. This ongoing process helps Qu continuously align with clinical best practices and deliver accurate, safe symptom assessments for patients.
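One way such a periodic sample could be drawn is sketched below. The 5% default and the function name are assumptions made for illustration; the real sampling rate would follow the regulatory guidelines mentioned above.

```python
import random


def sample_for_review(transcript_ids: list[str],
                      fraction: float = 0.05,
                      seed: int | None = None) -> list[str]:
    """Draw a random subset of post-market transcripts for clinician review.

    The fraction is illustrative and would typically be raised after
    major releases, per the process described above.
    """
    rng = random.Random(seed)
    k = max(1, round(len(transcript_ids) * fraction))
    return rng.sample(transcript_ids, k)
```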
Qu will only be launched in a jurisdiction after it meets all local regulatory and safety standards. These include data privacy and security requirements (e.g., HIPAA compliance in the USA; GDPR and upcoming AI regulations in the EU) as well as medical device regulations (e.g., EU MDR and FDA classification and regulatory approval).