Overview
This study investigates the provenance of social and STEM reasoning capabilities within language models by mapping specific pretraining corpus regions to benchmark performance. It employs training-data attribution as a tool for capability discovery, focusing on OLMo3-7B.
Research Context
Prior work using training-data attribution has often focused on factual knowledge rather than reasoning, and document-level scores have been noted as too noisy for identifying the corpus regions that support specific capabilities. This research addresses both limitations: it aggregates influence over a format-by-topic taxonomy instead of reading individual document scores, and it contrasts reasoning benchmarks with knowledge benchmarks.
Approach
The researchers applied gradient-based attribution, specifically TrackStar via Bergson, to a working set derived from the de-duplicated Dolma3 mix. Influence scores were aggregated across WebOrganizer's 24-format × 24-topic taxonomy, giving 576 distinct bins. A 2×2 experimental design compared benchmark pairs along two axes: domain (social vs. STEM) and capability type (reasoning vs. knowledge).
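The aggregation step can be illustrated with a minimal sketch. This is not the study's code: the record layout, the integer bin ids, and the choice of a per-bin mean (rather than a sum or another statistic) are all assumptions made for illustration, and the scores are toy values.

```python
from collections import defaultdict

N_FORMATS = 24  # format categories in the taxonomy
N_TOPICS = 24   # topic categories in the taxonomy


def aggregate_by_bin(records):
    """Average influence scores per (format, topic) bin.

    `records` is an iterable of (format_id, topic_id, score) tuples with
    integer ids in [0, 24). This layout is hypothetical.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fmt, topic, score in records:
        if not (0 <= fmt < N_FORMATS and 0 <= topic < N_TOPICS):
            raise ValueError(f"bin out of range: ({fmt}, {topic})")
        sums[(fmt, topic)] += score
        counts[(fmt, topic)] += 1

    # Empty bins stay at 0.0; a real analysis might mark them as missing.
    matrix = [[0.0] * N_TOPICS for _ in range(N_FORMATS)]
    for (fmt, topic), total in sums.items():
        matrix[fmt][topic] = total / counts[(fmt, topic)]
    return matrix


# Toy records, not data from the study.
toy_records = [(0, 3, 0.5), (0, 3, 1.5), (5, 7, -2.0)]
matrix = aggregate_by_bin(toy_records)
```

Averaging within a bin is one way to damp document-level noise; whether the study uses a mean, a sum, or a normalized score is not stated here.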
Benchmarking
- Social reasoning: SocialIQA
- Social knowledge: MMLU Social Sciences
- STEM reasoning: ARC-Challenge
- STEM knowledge: MMLU STEM
Causal Validation
To provide partial causal validation, targeted machine unlearning was performed. High-attribution topic bins, such as 'Literature' for SocialIQA, were forgotten, and the resulting degradation on the aligned benchmark was compared against within-bin random baselines.
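The comparison described above reduces to a simple difference of accuracy drops. The sketch below assumes accuracy as the metric and assumes the random baseline is run several times and averaged; the function names are hypothetical, and the numbers are placeholders, not results from the study.

```python
from statistics import mean


def degradation(acc_before, acc_after):
    """Accuracy lost after unlearning (positive means worse)."""
    return acc_before - acc_after


def excess_degradation(acc_before, acc_targeted, acc_random_runs):
    """Targeted drop minus the mean drop across random-baseline runs.

    A positive value means targeted forgetting hurt the benchmark more
    than the random baseline did.
    """
    if not acc_random_runs:
        raise ValueError("need at least one random-baseline run")
    targeted_drop = degradation(acc_before, acc_targeted)
    random_drop = mean(degradation(acc_before, a) for a in acc_random_runs)
    return targeted_drop - random_drop


# Placeholder accuracies for illustration only.
excess = excess_degradation(
    acc_before=0.70,
    acc_targeted=0.60,
    acc_random_runs=[0.68, 0.66],
)
```

With several random runs available, the spread of their drops would also indicate whether a positive excess is larger than run-to-run variation.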
Findings
- Social reasoning and STEM reasoning capabilities in OLMo3-7B draw upon qualitatively distinct regions of the pretraining corpus.
- The contrast between the corpus regions supporting social and STEM capabilities is sharper for reasoning tasks than for knowledge-level tasks.
- Forgetting high-attribution topic bins degraded performance on the aligned benchmark more than the within-bin random baselines did. For instance, forgetting the 'Literature' topic bin particularly affected SocialIQA performance.
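The second finding compares how far apart two benchmarks' bin-level influence profiles are. One way to put a number on that contrast, assuming each benchmark's 576 bin scores are flattened into a vector, is cosine distance. The metric is an assumption chosen for illustration and may differ from the one used in the study; the vectors below are toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def contrast(u, v):
    """Cosine distance: near 0 for aligned profiles, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(u, v)


# Toy two-bin profiles, not data from the study.
disjoint = contrast([1.0, 0.0], [0.0, 1.0])   # profiles share no bins
identical = contrast([1.0, 1.0], [1.0, 1.0])  # profiles coincide
```

Under this framing, the finding corresponds to the SocialIQA vs. ARC-Challenge contrast being larger than the MMLU Social Sciences vs. MMLU STEM contrast.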
Why This Matters
Identifying distinct corpus regions that support social versus STEM reasoning makes language model capabilities more interpretable in terms of their training data. Capability discovery through training-data attribution, with partial causal validation through targeted unlearning, could inform the development and refinement of large language models.
Potential Applications
All code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints have been open-sourced. This could facilitate further research into training-data attribution and the provenance of capabilities in language models.