Overview
This research investigated whether instruction-tuned large language models (LLMs) can reliably perform token-level classification of Correct Information Units (CIUs) in transcripts of aphasic discourse. CIUs are a discourse-assessment measure used to quantify communicative informativeness in aphasia. The study evaluated LLM performance against human annotations, focusing on whether automated identification could reduce the time demands and the reliance on trained raters associated with traditional CIU scoring.
Research Context
CIU scoring is a fundamental component of discourse assessment in aphasia, providing a measure of communicative informativeness distinct from linguistic form. The process requires trained human raters and is time-consuming. The study explored whether contemporary instruction-tuned LLMs could offer a reliable alternative for this specialized classification task. The Nicholas and Brookshire (1993) method for CIU status annotation served as the reference standard.
Approach
The study employed a dataset of sixteen picture-description transcripts, elicited using the Cat Rescue stimulus. These transcripts were annotated for CIU status according to the Nicholas and Brookshire (1993) criteria. The sample included discourse from four severity strata: control, mild aphasia, moderate aphasia, and severe aphasia. Four distinct, publicly available instruction-tuned LLMs were selected for benchmarking:
- Llama-3.1-8B
- Qwen2.5-7B
- Mistral-7B
- Phi-3-mini
Each LLM was evaluated under two main prompting conditions across five stratified random seeds:
- Zero-shot prompting
- Few-shot prompting (two variants: fixed global example selection and per-chunk local example selection)
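The prompting conditions above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the instruction wording, the prompt layout, the helper names (`build_prompt`, `select_local`), and the toy labelled examples are all hypothetical. In particular, the summary does not say how per-chunk local examples were chosen, so the word-overlap heuristic here is only an assumption standing in for that step.

```python
from typing import List, Tuple

# One labelled example: (token, label) pairs, label 1 = CIU, 0 = not a CIU.
Example = List[Tuple[str, int]]

# Hypothetical instruction text; the study's actual prompt wording is not given.
INSTRUCTION = (
    "Mark each token as 1 (Correct Information Unit) or 0 (not a CIU). "
    "Return one mark per token, separated by spaces."
)


def format_example(example: Example) -> str:
    """Render one labelled example as a tokens line and a labels line."""
    tokens = " ".join(tok for tok, _ in example)
    labels = " ".join(str(lab) for _, lab in example)
    return f"Tokens: {tokens}\nLabels: {labels}"


def build_prompt(chunk_tokens: List[str], examples: List[Example]) -> str:
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [INSTRUCTION]
    parts.extend(format_example(ex) for ex in examples)
    parts.append(f"Tokens: {' '.join(chunk_tokens)}\nLabels:")
    return "\n\n".join(parts)


def select_local(chunk_tokens: List[str], pool: List[Example], k: int) -> List[Example]:
    """Assumed per-chunk heuristic: keep the k pool examples that share
    the most word types with the chunk being labelled."""
    chunk_vocab = {t.lower() for t in chunk_tokens}

    def overlap(example: Example) -> int:
        return len(chunk_vocab & {tok.lower() for tok, _ in example})

    return sorted(pool, key=overlap, reverse=True)[:k]


# Toy examples invented for illustration; they are not from the study's data.
ex_a = [("um", 0), ("the", 1), ("cat", 1), ("is", 1), ("in", 1), ("the", 1), ("tree", 1)]
ex_b = [("the", 1), ("dog", 1), ("is", 1), ("barking", 1)]
ex_c = [("I", 0), ("don't", 0), ("know", 0)]
pool = [ex_c, ex_b, ex_a]
chunk = ["the", "man", "is", "uh", "stuck", "in", "the", "tree"]

fixed_examples = pool[:2]                        # fixed global: same examples for every chunk
local_examples = select_local(chunk, pool, k=2)  # per-chunk local: chosen per chunk

zero_shot_prompt = build_prompt(chunk, [])
few_shot_prompt = build_prompt(chunk, local_examples)
```

The only difference between the two few-shot variants in this sketch is which list of examples is passed to `build_prompt`; the prompt template itself stays the same.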
Performance assessment was conducted by comparing LLM outputs against consensus human labels. The evaluation metrics included accuracy, precision, recall, F1 score, and Cohen's kappa.
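As a minimal sketch of how these metrics can be computed from token-level labels, the function below derives all five from the confusion counts, treating CIU (label 1) as the positive class. The function name `binary_metrics` and the toy label sequences are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1, and Cohen's kappa for binary labels (1 = CIU)."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")

    # Confusion counts with CIU (1) as the positive class.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = len(gold)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the marginal label rates.
    p_observed = accuracy
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy labels invented for illustration: the prediction marks two non-CIU tokens as CIUs.
gold_labels = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
pred_labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
scores = binary_metrics(gold_labels, pred_labels)
```

In this toy case every true CIU is found (recall 1.0) while two extra tokens are labelled as CIUs (precision 0.75). Kappa comes out lower than accuracy because it discounts the agreement expected by chance.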
Findings
- Zero-Shot Prompting: Zero-shot prompting was found to be insufficient for reliable CIU classification across all tested LLMs.
- Few-Shot Prompting: Few-shot prompting conditions resulted in substantial gains in performance and enabled three of the four models to achieve competitive results.
- Viable Models: Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B demonstrated viable performance. Their mean few-shot F1 scores ranged from 0.776 to 0.817.
- Example Selection Impact: No significant differences were observed between fixed global and per-chunk local example selection methods under few-shot prompting.
- Unstable Model: Phi-3-mini exhibited unstable performance and did not produce reliable results in this classification task.
- Precision and Recall Characteristics: The viable models generally showed high recall but comparatively lower precision. This pattern suggested a tendency for systematic over-classification of tokens as CIUs.
- Severity-Based Variation: LLM performance varied across severity strata, with the weakest results observed for more severe aphasia.
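A severity-based breakdown of this kind can be obtained by grouping token-level labels by stratum before scoring. The sketch below assumes a hypothetical record layout of (stratum, human label, model label). The helper names and toy records are invented for illustration and do not reproduce any figures from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def f1_score(gold: List[int], pred: List[int]) -> float:
    """F1 for the CIU class (label 1), computed from confusion counts."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_by_stratum(records: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Group (stratum, gold, pred) records by stratum and score each group separately."""
    gold_by_stratum: Dict[str, List[int]] = defaultdict(list)
    pred_by_stratum: Dict[str, List[int]] = defaultdict(list)
    for stratum, gold, pred in records:
        gold_by_stratum[stratum].append(gold)
        pred_by_stratum[stratum].append(pred)
    return {
        stratum: f1_score(gold_by_stratum[stratum], pred_by_stratum[stratum])
        for stratum in gold_by_stratum
    }


# Toy records invented for illustration only.
toy_records = [
    ("control", 1, 1), ("control", 1, 1), ("control", 0, 0), ("control", 1, 1),
    ("severe", 0, 1), ("severe", 1, 1), ("severe", 0, 1), ("severe", 0, 0),
]
per_stratum_f1 = f1_by_stratum(toy_records)
```

In the toy "severe" group the model labels two non-CIU tokens as CIUs, which is the over-classification pattern described above and is what lowers that group's score.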
Why This Matters
The findings indicate that few-shot LLM prompting can support automated CIU identification without the need for gradient-based task training, potentially offering a method to streamline discourse assessment. However, the observed agreement with human annotation is not yet sufficient for fully autonomous application. These results suggest that LLM-based CIU scoring could contribute as a human-in-the-loop component within future discourse assessment systems, assisting human raters rather than entirely replacing them.
Potential Applications
The study suggests LLM-based CIU scoring as a promising component for human-in-the-loop discourse assessment systems. Such systems could potentially expedite the CIU scoring process by providing initial classifications for human review, thus reducing the time burden on trained raters in clinical or research settings.
Key Limitations Mentioned by Researchers
The study noted that although few-shot prompting improved performance substantially, agreement with human annotation remained insufficient for fully autonomous use. Two specific limitations of the LLM approach stand out: the systematic over-classification of tokens as CIUs (lower precision) by the viable models, and weaker performance on transcripts from speakers with more severe aphasia.