Outcomes by conversational assistant were significantly different (χ²₄ = 132.2, P < 0.001). Alexa failed for most tasks (125/394 [91.9%]), resulting in significantly more attempts made but significantly fewer instances in which responses could lead to harm. Siri had the highest task completion rate (365 [77.6%]), in part because it typically displayed a list of web pages in its response that provided at least some information to the participant. However, because of this, it had the highest likelihood of causing harm for the tasks tested (27 [20.9%]). Median user satisfaction with the 3 conversational assistants was neutral, but with significant differences among them. Participants were least satisfied with Alexa and most satisfied with Siri, and stated they were most likely to follow the recommendations provided by Siri.
Qualitatively, most participants said they would use conversational assistants for medical information, but many felt they were not quite up to the task yet. When asked about their trust in the results provided by the conversational assistants, participants said they trusted Siri the most because it provided links to multiple websites in response to their queries, allowing them to choose the response that most closely matched their assumptions. They also appreciated that Siri displayed its speech recognition results, which gave them more confidence in its responses and allowed them to modify their query if needed. Many participants expressed frustration with the systems, particularly with Alexa.
Conclusion. Reliance on conversational assistants for actionable medical information represents a safety risk for patients and consumers. Patients should be cautioned not to use these technologies for answers to medical questions they intend to act on without further consultation with a health care provider.
Commentary
Roughly 9 in 10 American adults use the Internet,1 and can easily access information through a variety of devices, including smartphones, tablets, and laptop computers. This ease of access has played an important role in shifting how individuals seek health information and interact with their health care providers.2,3 Online health information can increase patients’ knowledge of, competence with, and engagement in health care decision-making. Online health information seeking can also complement, and work in synergy with, provider-patient interactions. However, online health information is difficult to regulate, a problem compounded by the wide range of health information literacy among patients. Inaccurate or misleading health information can lead patients to make detrimental or even dangerous health decisions. These benefits and concerns apply equally to conversational assistants such as Siri (Apple), Alexa (Amazon), and Google Assistant, which patients and consumers increasingly use to access medical- and health-related information. Because these technologies are voice activated, they appear to address some health literacy limitations. However, they still have important limitations and pose safety risks,4 especially as conversational assistants are increasingly perceived as a trustworthy parallel to clinical assessment and counseling systems.5
There has been little systematic research exploring the potential risks of these platforms or characterizing their error types and error rates. This study aimed to determine the capabilities of widely used, general-purpose conversational assistants in responding to a broad range of medical questions posed by laypersons in their own words, and to systematically evaluate the potential harm that could result from patients or consumers acting on the resulting recommendations. The study authors found that when asked questions about situations requiring medical expertise, the conversational assistants failed more than half of the time and led study participants to report that they would take actions that could have resulted in harm or death. Further, the authors characterized several failure modes, including misrecognition of participant queries, participant misunderstanding of tasks and of the conversational assistants’ responses, and participants’ limited understanding of what the assistants are capable of understanding. This mismatch between users’ expectation that the assistants could follow conversational discourse and the assistants’ actual capabilities led to frustrating experiences for some study participants.