Calculation of Rate-dependent Metrics of Spontaneous Recall Using a Validated Automated Transcription and Diarization Approach
Abstract number :
1.095
Submission category :
11. Behavior/Neuropsychology/Language / 11A. Adult
Year :
2024
Submission ID :
1215
Source :
www.aesnet.org
Presentation date :
12/7/2024 12:00:00 AM
Published date :
Authors :
Presenting Author: Eden Tefera, B.S. – NYU Langone Health
Zehui Gu, MS – NYU Langone Health
Ayelet Rosenberg, MS – NYU Langone Health
Stephen Johnson, PhD – NYU Langone Health
Anli Liu, MD, MA – NYU Langone Health
Rationale: A major obstacle in clinical neuropsychological memory assessment is the laborious nature of data analysis to obtain measures of human cognition. A potential solution is to apply Natural Language Processing (NLP) to the dialogue between investigators and patients to extract and measure linguistic information from patients’ spontaneous speech. However, current methods of automated transcription can make errors in identifying speakers, which can affect rate-dependent measures such as utterance length or word fluency. To improve transcription performance, we sought to validate an automated transcription and speaker diarization system (automatic segmentation of audio recordings based on the identity of different speakers). We used audio collected during a Famous Faces (FF) Memory Task from patients with temporal lobe epilepsy (TLE) and healthy controls (HC).
Methods: Seventy adults (44 TLE and 26 HC) were included in this prospective study. Twenty FF in politics, sports, and entertainment (active 2008-2017) were shown to subjects, who were asked to spontaneously recall as much biographical detail as possible. Twenty-six subjects engaged remotely via Webex (1 continuous audio recording), while 44 engaged in person via laptop (60 separate audio recordings). We developed an automated transcription system using WhisperX (built on OpenAI’s Whisper) and pyannote to transcribe patient speech and assign speaker labels to text. After the initial transcription and diarization step, a human rater manually corrected the speaker labels in the automatically generated transcripts (human-validated). We generated several NLP measures relevant to memory assessment from the human-generated (gold-standard), human-validated, and automated transcriptions, and evaluated the transcription and diarization performance of our automated system in the two recording environments (Webex vs in person).
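To illustrate how rate-dependent measures depend on speaker labels, the following is a minimal Python sketch, not the authors' pipeline. It assumes a diarized transcript is already available as a list of segments with a speaker label, start and end times in seconds, and text. The `Segment` type, the `rate_metrics` function, the speaker names, and the small `FUNCTION_WORDS` list are all hypothetical and chosen only for illustration; the abstract does not specify how content words were defined.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized utterance (hypothetical structure for this sketch)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


# Illustrative function-word list; an assumption, not the study's definition.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "was", "he", "she", "it", "i", "you", "that", "this",
}


def rate_metrics(segments, speaker):
    """Compute word counts, speaking time, and word rates for one speaker."""
    own = [s for s in segments if s.speaker == speaker]
    # Lowercase each token and strip surrounding punctuation.
    words = [w.strip(".,!?;:\"'").lower() for s in own for w in s.text.split()]
    words = [w for w in words if w]
    content = [w for w in words if w not in FUNCTION_WORDS]
    duration = sum(s.end - s.start for s in own)
    return {
        "total_words": len(words),
        "content_words": len(content),
        "utterance_duration_s": duration,
        "word_rate_per_s": len(words) / duration if duration > 0 else 0.0,
        "content_word_rate_per_s": len(content) / duration if duration > 0 else 0.0,
    }


if __name__ == "__main__":
    transcript = [
        Segment("PATIENT", 0.0, 2.0, "He was the president."),
        Segment("INTERVIEWER", 2.0, 3.0, "Anything else?"),
        Segment("PATIENT", 3.0, 5.0, "He lived in Chicago."),
    ]
    print(rate_metrics(transcript, "PATIENT"))
```

Because every metric is computed only over the segments attributed to the chosen speaker, a mislabeled segment shifts both the word counts and the duration, which is why diarization accuracy matters for these measures.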
Results: We transcribed a cumulative 15 hours of patient speech. There were no significant group-level differences in Total Word Count, Content Word Count, Utterance Duration, Word Rate, or Content Word Rate across transcription methods. Median Word Error Rate (WER) for the human-validated transcripts was 16.8% when compared to the human-generated transcripts. The automated system had a significantly higher WER when transcribing interviews recorded in person compared to Webex (χ² (1, N = 70) = 7.82, p = .005). Median Diarization Error Rate (DER) for the automated transcripts was 3.33% before human validation. Testing environment did not have a statistically significant effect on DER (χ² (1, N = 70) = 1.74, p = .19).
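For readers unfamiliar with the two error metrics, here is a short Python sketch of how they are commonly defined; it is not the scoring code used in the study. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The DER function is a deliberately simplified, frame-based version that assumes non-overlapping speech and hypothesis speaker labels already matched to the reference labels; standard DER scoring also finds the optimal label mapping and often applies a forgiveness collar around boundaries. The function names and the 10 ms step are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def _label_at(segments, t):
    """Return the speaker active at time t, or None during silence."""
    for speaker, start, end in segments:
        if start <= t < end:
            return speaker
    return None


def simple_der(reference, hypothesis, step=0.01):
    """Simplified DER: mislabeled frames / frames of reference speech.

    Segments are (speaker, start_s, end_s) tuples. Counts missed speech,
    false alarms, and speaker confusion by sampling each frame midpoint.
    """
    end_time = max(end for _, _, end in reference + hypothesis)
    n_frames = int(round(end_time / step))
    errors = 0
    ref_frames = 0
    for k in range(n_frames):
        t = (k + 0.5) * step
        ref = _label_at(reference, t)
        hyp = _label_at(hypothesis, t)
        if ref is not None:
            ref_frames += 1
        if ref != hyp:
            errors += 1
    if ref_frames == 0:
        raise ValueError("reference contains no speech")
    return errors / ref_frames


if __name__ == "__main__":
    print(word_error_rate("he was the president of the united states",
                          "he was president of the united state"))
    print(simple_der([("A", 0.0, 2.0), ("B", 2.0, 4.0)],
                     [("A", 0.0, 2.5), ("B", 2.5, 4.0)]))
```

In the first example one word is dropped and one is substituted out of eight reference words; in the second, half a second of speaker B is attributed to speaker A out of four seconds of reference speech.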
Conclusions: Our automated system generated speech-to-text transcriptions with excellent word accuracy (median WER 16.8%) and outstanding performance on speaker diarization (median DER 3.33%). Automated transcription with speaker diarization has potential as an efficient and accurate means of studying rate-dependent features of spontaneous speech in naturalistic memory paradigms.
Funding: NINDS