Study Overview
Objective. To determine the prevalence and nature of the harm that could result from patients or consumers using conversational assistants for medical information.
Design. Observational study.
Settings and participants. Participants were recruited from an online job posting site and were eligible if they were aged ≥ 21 years and were native speakers of English. There were no other eligibility requirements. Participants contacted a research assistant by phone or email, and eligibility was confirmed before scheduling the study visit and again after arrival. Despite this screening, data from 4 participants were excluded after they disclosed at the end of their study sessions that they were not native English speakers. Participants were compensated for their time.
Each participant took part in a single 60-minute usability session. Following informed consent and administration of baseline questionnaires, each participant was assigned a random selection of 2 medication tasks and 1 emergency task (provided as written scenarios) to perform with each of 3 conversational assistants (Siri, Alexa, and Google Assistant), with the order of assistants and tasks counterbalanced. Before the participant's first task with each conversational assistant, the research assistant demonstrated how to activate it using a standard weather-related question. The participant was then asked to think of a health-related question and given 5 minutes to practice interacting with the conversational assistant using that question. Participants were then asked to complete the 3 tasks in sequence, querying the conversational assistant in their own words. A task was considered complete either when the participant stated that they had found an answer to the question or when 5 minutes had elapsed. At task completion, the research assistant asked the participant what they would do next given the information obtained during the interaction with the conversational assistant. After the participant completed the third task with a given conversational assistant, the research assistant administered the satisfaction questionnaire. After finishing with all 3 conversational assistants, participants were interviewed about their experience.
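The summary does not say how the order of assistants was counterbalanced. As a minimal sketch of one common approach, assuming a simple rotation through all possible orderings (the function name and the index-based assignment are hypothetical, not taken from the study), the 3 assistants give 6 orderings, and cycling through them spreads 54 participants evenly:

```python
from collections import Counter
from itertools import permutations

ASSISTANTS = ("Siri", "Alexa", "Google Assistant")

# All 3! = 6 possible presentation orders of the three assistants.
ORDERS = list(permutations(ASSISTANTS))


def assistant_order(participant_index):
    """Hypothetical rule: rotate through the 6 orders by enrollment index."""
    return ORDERS[participant_index % len(ORDERS)]


# With 54 participants, each order is used the same number of times.
counts = Counter(assistant_order(i) for i in range(54))
for order, n in counts.items():
    print(" -> ".join(order), n)
```

Under this assumed scheme each ordering is assigned to 9 participants; the study itself may have used a different procedure, and the counterbalancing of tasks is not modeled here.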
Measures and analysis. Interactions with conversational assistants were video recorded, with the audio transcribed for analysis. Because each task typically took multiple attempts before the participant either resolved it or gave up, usability metrics (including time, outcomes, and error analysis) were coded at both the task and the attempt level. Participant-reported actions for each medical task were rated for patient harm by 2 judges (an internist and a pharmacist) using a scale adapted from those used by the Agency for Healthcare Research and Quality and the US Food and Drug Administration. Scores were assigned as follows: 0 for no harm; 1 for mild harm, resulting in bodily or psychological injury; 2 for moderate harm, resulting in bodily or psychological injury adversely affecting functional ability or quality of life; 3 for severe harm, resulting in bodily or psychological injury, including pain or disfigurement, that interferes substantially with functional ability or quality of life; and 4 for death. The 2 judges first assigned ratings independently, then met to reach consensus on cases where they disagreed. Every harmful outcome was then analyzed to determine the type of error and the cause of the outcome (user error, system error, or both). The satisfaction questionnaire included 6 self-report items with response values on a 7-point scale ranging from “Not at all” to “Very satisfied.”
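The harm scale and the two-judge workflow described above can be sketched as follows. This is an illustration only: the enum member names, the function name, and the example ratings are invented for the sketch and do not come from the study data.

```python
from enum import IntEnum


class Harm(IntEnum):
    """Harm scale as described in the text (0 = no harm, 4 = death)."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3
    DEATH = 4


def needs_consensus(ratings_a, ratings_b):
    """Return task IDs where the two judges' independent ratings differ."""
    return sorted(task for task in ratings_a if ratings_a[task] != ratings_b[task])


# Made-up ratings for three hypothetical tasks, for illustration only.
internist = {"task-1": Harm.NONE, "task-2": Harm.SEVERE, "task-3": Harm.DEATH}
pharmacist = {"task-1": Harm.NONE, "task-2": Harm.MODERATE, "task-3": Harm.DEATH}

to_discuss = needs_consensus(internist, pharmacist)
print(to_discuss)
```

Only the tasks returned by `needs_consensus` would go to the consensus meeting; how the judges actually resolved disagreements is not specified beyond "met to reach consensus."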
Main results. A total of 54 participants completed the study. Their mean age was 42 years (SD 18), and individuals aged 21 to 24 years were overrepresented relative to the general US adult population (30% compared to 14%). Twenty-nine (54%) were female, 31 (57%) Caucasian, and 26 (50%) college educated. Most (52 [96%]) had high levels of health literacy. Only 8 (15%) reported using a conversational assistant regularly, while 22 (41%) had never used one, and 24 (44%) had tried one “a few times.” Forty-four (82%) used computers regularly.
Of the 168 tasks completed with reported actions, 49 (29.2%) could have resulted in some degree of harm, including 27 (16.1%) that could have resulted in death. An analysis of 44 potentially harmful cases yielded several recurring error scenarios, with blame attributed solely to the conversational assistant in 13 (30%) cases, to the user in 20 (46%) cases, and to both the user and the conversational assistant in the remaining 11 (25%) cases. In the most common harm scenario (9 cases, 21%), the participant failed to provide all the information in the task description, the conversational assistant responded correctly to the partial query, and the participant accepted that response as the recommended action. In the next most common scenario (7 cases, 16%), the participant provided a complete and correct utterance describing the problem, and the conversational assistant responded with only partial information. Overall self-reported satisfaction with conversational assistants was neutral, with a median rating of 4 (IQR 1-6).
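As a quick arithmetic check on the task-level figures above, the following sketch recomputes the two harm percentages from the reported counts and confirms that the three blame categories add up to the 44 analyzed cases. The helper name `pct` and the variable names are illustrative; the counts are those given in the text.

```python
def pct(count, total, digits=1):
    """Percentage of total, rounded to the given number of decimal places."""
    return round(100 * count / total, digits)


TASKS_WITH_ACTIONS = 168
any_harm = 49
possible_death = 27

ANALYZED_CASES = 44
blame = {"assistant only": 13, "user only": 20, "both": 11}

print(pct(any_harm, TASKS_WITH_ACTIONS))        # share of tasks with any potential harm
print(pct(possible_death, TASKS_WITH_ACTIONS))  # share of tasks with potential death
print(sum(blame.values()) == ANALYZED_CASES)    # blame categories cover all analyzed cases
```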