Predicting Student Video Behavior with Multimodal LLMs: 77 Million Events from 66 Online Courses
Research accepted at AIED 2026 demonstrates that multimodal large language models (MLLMs) can reliably predict student video interaction patterns (watching, pausing, skipping, rewinding) from video content alone, treating these interactions as proxies for cognitive load.
The Approach
The pipeline leverages MLLMs to compute embeddings of short video segments, then trains a neural classifier to identify interaction peaks — moments where students are most likely to pause, skip, or rewind.
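The two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the MLLM embedding step is mocked with random vectors, and the embedding size, `embed_segment` helper, and `MLPClassifier` choice are all assumptions for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def embed_segment(segment_id: int, dim: int = 512) -> np.ndarray:
    """Stand-in for an MLLM embedding of a short video segment."""
    return rng.normal(size=dim)

# Synthetic training data: segment embeddings plus binary "peak" labels,
# which in the real pipeline come from aggregated student interactions.
X = np.stack([embed_segment(i) for i in range(200)])
y = rng.integers(0, 2, size=200)  # 1 = interaction peak, 0 = not

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X, y)

# Score unseen segments: probability that each is an interaction peak.
new_segments = np.stack([embed_segment(i) for i in range(200, 205)])
peak_prob = clf.predict_proba(new_segments)[:, 1]
```

The design point is that the classifier is small and cheap to train; all of the heavy lifting sits in the frozen MLLM embedding step.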
The Data
- 77 million video control events
- 66 online courses
- Spanning multiple academic fields
- Events treated as implicit signals of cognitive processing
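One way such raw control events can become "interaction peak" labels is to bin pause/skip/rewind events per segment and flag segments with anomalously high counts. The z-score approach and 2.0 threshold below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def find_peaks(event_counts: np.ndarray, z_thresh: float = 2.0) -> np.ndarray:
    """Return indices of segments with anomalously many control events."""
    mu, sigma = event_counts.mean(), event_counts.std()
    z = (event_counts - mu) / (sigma + 1e-9)
    return np.flatnonzero(z > z_thresh)

# Toy example: 10 segments, one with a burst of pauses.
counts = np.array([3, 4, 2, 5, 3, 40, 4, 2, 3, 4])
print(find_peaks(counts))  # segment 5 stands out
```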
Key Findings
- Reliable prediction — MLLM embeddings reliably predict interaction peaks
- Cross-domain generalization — Works on unseen academic fields
- Interpretable — Predictions encode theory-relevant instructional concepts
- Cost-efficient — Practical for pre-screening educational video design
The Theory Connection
The work draws on multimedia learning theory, which prescribes instructional design that keeps cognitive load manageable. Using concept activation vectors, the researchers show that the model's predictions correspond to theoretically meaningful instructional features, coded by GPT-5.
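The concept-activation-vector idea can be sketched briefly: learn a direction in embedding space that separates segments labeled with an instructional concept from other segments, then check whether the peak predictor is sensitive to that direction. Everything below is synthetic; in the paper the concept labels came from GPT-5, and the sensitivity measure here (cosine with a linear predictor's weights) is a simplifying assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 64
concept_dir = rng.normal(size=dim)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic embeddings: concept-positive segments are shifted along
# a hidden concept direction the probe should recover.
pos = rng.normal(size=(100, dim)) + 2.0 * concept_dir
neg = rng.normal(size=(100, dim))
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)

probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # the CAV

# Sensitivity of a (synthetic, linear) peak predictor to the concept:
# cosine between its weight vector and the CAV. Positive alignment
# means the concept helps drive peak predictions.
peak_weights = concept_dir + 0.05 * rng.normal(size=dim)
sensitivity = float(cav @ peak_weights / np.linalg.norm(peak_weights))
print(sensitivity)
```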
Why This Matters
- Instructors can pre-screen video designs before deployment
- EdTech platforms can identify problematic content at scale
- Researchers can empirically examine multimedia learning theory at unprecedented scale
- Students benefit from better-designed educational content
Practical Application
Imagine uploading a lecture video and immediately seeing which segments are likely to cause cognitive overload, confusion, or disengagement. That is what this pipeline enables.
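As a hypothetical pre-screening report built on top of such a model: given per-segment peak probabilities, list the timestamps an instructor should review. The 15-second segment length and 0.7 threshold are illustrative assumptions.

```python
def flag_segments(peak_probs, seg_seconds=15, threshold=0.7):
    """Return (timestamp, probability) pairs for segments worth reviewing."""
    flags = []
    for i, p in enumerate(peak_probs):
        if p >= threshold:
            start = i * seg_seconds
            flags.append((f"{start // 60}:{start % 60:02d}", round(p, 2)))
    return flags

# Per-segment peak probabilities from a trained classifier.
probs = [0.1, 0.2, 0.85, 0.3, 0.95, 0.4]
print(flag_segments(probs))  # → [('0:30', 0.85), ('1:00', 0.95)]
```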