LLMs for Correct Information Unit Identification in Aphasic Discourse with Few-Shot Prompting

arXiv CS · June 16, 2026 · Engineering & Technology

Key Takeaways

Zero-shot prompting was insufficient for token-level CIU classification.
Few-shot prompting produced substantial performance gains, yielding mean F1 scores of 0.776 to 0.817 for Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B.
Viable models exhibited high recall but lower precision, suggesting systematic over-classification of CIUs.
Performance varied by discourse severity, with the weakest results observed in severe aphasia.
Phi-3-mini was unstable and did not yield reliable performance.

Why This Matters

LLM-based CIU scoring could support human-in-the-loop discourse assessment systems, streamlining scoring and reducing reliance on time-intensive manual annotation. The approach is not yet reliable enough to run autonomously, but it is a promising aid for quantifying communicative informativeness in aphasia.

Overview

This research investigated whether instruction-tuned large language models (LLMs) can reliably perform token-level classification of Correct Information Units (CIUs) in transcripts of aphasic discourse. CIUs are a discourse-assessment metric that quantifies communicative informativeness in aphasia. The study evaluated LLM output against human annotations to determine whether automated identification could reduce the time and trained-rater effort that traditional CIU scoring requires.

Research Context

CIU scoring is a fundamental component of discourse assessment in aphasia, providing a measure of communicative informativeness distinct from linguistic form. The process requires trained human raters and is time-intensive. The study explored whether contemporary instruction-tuned LLMs could offer a reliable alternative for this specialized classification task. The Nicholas and Brookshire (1993) method for CIU status annotation served as the reference standard.

Approach

The study employed a dataset of sixteen picture-description transcripts, elicited using the Cat Rescue stimulus. These transcripts were annotated for CIU status according to the Nicholas and Brookshire (1993) criteria. The sample included discourse from four severity strata: control, mild aphasia, moderate aphasia, and severe aphasia. Four distinct, publicly available instruction-tuned LLMs were selected for benchmarking:

Llama-3.1-8B
Qwen2.5-7B
Mistral-7B
Phi-3-mini

Each LLM was evaluated under two main prompting conditions across five stratified random seeds:

Zero-shot prompting
Few-shot prompting (two conditions: fixed global example selection and per-chunk local example selection)
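The difference between the prompting conditions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the summary does not give the study's prompt wording, chunking scheme, or selection criteria, so the instruction text, the example pool with its labels, the function names, and the word-overlap rule used for "local" selection are all hypothetical.

```python
import random

# Hypothetical instruction text; the study's actual prompt is not given in this summary.
INSTRUCTION = "Label each token 1 if it is part of a Correct Information Unit, else 0."

# Invented example pool of (chunk, per-token labels). Not real annotations.
EXAMPLE_POOL = [
    ("the cat is in the tree", [1, 1, 1, 1, 1, 1]),
    ("um the the dog is uh barking", [0, 0, 1, 1, 1, 0, 1]),
    ("the fireman brings a ladder", [1, 1, 1, 1, 1]),
    ("I don't know what that is", [0, 0, 0, 0, 0, 0]),
]


def format_example(text, labels):
    """Render one labeled chunk as a demonstration block."""
    pairs = " ".join(f"{tok}/{lab}" for tok, lab in zip(text.split(), labels))
    return f"Transcript: {text}\nLabels: {pairs}"


def select_global(pool, k, seed):
    """Fixed global selection: one seeded draw, reused for every chunk."""
    return random.Random(seed).sample(pool, k)


def select_local(pool, chunk, k):
    """Per-chunk local selection. ASSUMPTION: rank pool items by shared words."""
    words = set(chunk.split())
    ranked = sorted(pool, key=lambda ex: len(words & set(ex[0].split())), reverse=True)
    return ranked[:k]


def build_prompt(chunk, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [INSTRUCTION]
    parts += [format_example(text, labels) for text, labels in examples]
    parts.append(f"Transcript: {chunk}\nLabels:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    chunk = "the dog is barking loudly"
    print(build_prompt(chunk))                                       # zero-shot
    print(build_prompt(chunk, select_global(EXAMPLE_POOL, 2, 0)))    # few-shot, global
    print(build_prompt(chunk, select_local(EXAMPLE_POOL, chunk, 2))) # few-shot, local
```

In this sketch the global condition shows the model the same demonstrations for every chunk under a given seed, while the local condition picks demonstrations anew for each chunk.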

Performance was assessed by comparing LLM outputs against consensus human labels. The evaluation metrics included accuracy, precision, recall, F1 score, and Cohen's kappa.
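For readers unfamiliar with these metrics, the following Python sketch computes all five from two aligned sequences of binary token labels (1 = CIU, 0 = non-CIU). The function name and the toy labels are assumptions for illustration and are not data from the study; the formulas are the standard definitions.

```python
def binary_metrics(gold, pred):
    """Accuracy, precision, recall, F1, and Cohen's kappa for binary labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    # Confusion-matrix counts, treating label 1 (CIU) as the positive class.
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    n = len(gold)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    gold = [1, 1, 1, 0, 0, 1, 0, 1]
    pred = [1, 1, 1, 1, 0, 1, 1, 0]
    for name, value in binary_metrics(gold, pred).items():
        print(f"{name}: {value:.3f}")
```

Kappa is reported alongside accuracy because it discounts the agreement two labelers would reach by chance, which matters when one class dominates the token stream.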

Findings

Zero-Shot Prompting: Zero-shot prompting was found to be insufficient for reliable CIU classification across all tested LLMs.
Few-Shot Prompting: Few-shot prompting conditions resulted in substantial gains in performance and enabled three of the four models to achieve competitive results.
Viable Models: Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B demonstrated viable performance. Their mean few-shot F1 scores ranged from 0.776 to 0.817.
Example Selection Impact: No significant differences were observed between fixed global and per-chunk local example selection methods under few-shot prompting.
Unstable Model: Phi-3-mini exhibited unstable performance and did not produce reliable results in this classification task.
Precision and Recall Characteristics: The viable models generally showed high recall but comparatively lower precision. This pattern suggested a tendency for systematic over-classification of tokens as CIUs.
Severity-Based Variation: LLM performance varied depending on the discourse severity of the aphasic transcripts. The weakest results were observed in cases of more severe aphasia.
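The link between the precision and recall pattern and over-classification follows directly from the standard definitions. With TP, FP, and FN denoting true positives, false positives, and false negatives for the CIU class (notation introduced here, not taken from the study):

```latex
\[
  P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
\]
\[
  R > P
  \;\Longleftrightarrow\; TP + FN < TP + FP
  \;\Longleftrightarrow\; FN < FP
  \qquad (TP > 0)
\]
```

When recall exceeds precision, the number of tokens the model labels as CIUs (TP + FP) is therefore larger than the number the human raters labeled as CIUs (TP + FN), which is what over-classification means here.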

Why This Matters

The findings indicate that few-shot LLM prompting can support automated CIU identification without the need for gradient-based task training, potentially offering a method to streamline discourse assessment. However, the observed agreement with human annotation is not yet sufficient for fully autonomous application. These results suggest that LLM-based CIU scoring could contribute as a human-in-the-loop component within future discourse assessment systems, assisting human raters rather than entirely replacing them.

Potential Applications

The study suggests LLM-based CIU scoring as a promising component for human-in-the-loop discourse assessment systems. Such systems could potentially expedite the CIU scoring process by providing initial classifications for human review, thus reducing the time burden on trained raters in clinical or research settings.

Key Limitations Mentioned by Researchers

The study noted that while few-shot prompting improved performance substantially, agreement with human annotation remained insufficient for fully autonomous use. Two specific limitations of the LLM approach for this task are the systematic over-classification (lower precision) by the viable models and the weaker performance in severe aphasia.

Research Information

Institution: arXiv CS
Source: arXiv CS

About ICANEWS

ICANEWS is a global research journal for emerging researchers, publishing student and emerging researcher work across all fields.