Predicting Student Video Behavior with Multimodal LLMs: 77 Million Events from 66 Online Courses
Research accepted at AIED 2026 demonstrates that multimodal large language models (MLLMs) can reliably predict student video interaction patterns (watching, pausing, skipping, rewinding) from video content alone, treating these interactions as proxies for cognitive load.
The Approach
The pipeline leverages MLLMs to compute embeddings of short video segments, then trains a neural classifier to identify interaction peaks — moments where students are most likely to pause, skip, or rewind.
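The two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: the MLLM embedding step is mocked with random vectors, and the embedding size, `embed_segment` helper, and `MLPClassifier` choice are all assumptions for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def embed_segment(segment_id: int, dim: int = 512) -> np.ndarray:
    """Stand-in for an MLLM embedding of a short video segment."""
    return rng.normal(size=dim)

# Synthetic training data: segment embeddings plus binary "peak" labels,
# which in the real pipeline come from aggregated student interactions.
X = np.stack([embed_segment(i) for i in range(200)])
y = rng.integers(0, 2, size=200)  # 1 = interaction peak, 0 = not

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X, y)

# Score unseen segments: probability that each is an interaction peak.
new_segments = np.stack([embed_segment(i) for i in range(200, 205)])
peak_prob = clf.predict_proba(new_segments)[:, 1]
```

The design point is that the classifier is small and cheap to train; all of the heavy lifting sits in the frozen MLLM embedding step.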
The Data
- 77 million video control events
- 66 online courses
- Spanning multiple academic fields
- Events treated as implicit signals of cognitive processing
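One way such raw control events can become "interaction peak" labels is to bin pause/skip/rewind events per segment and flag segments with anomalously high counts. The z-score approach and 2.0 threshold below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def find_peaks(event_counts: np.ndarray, z_thresh: float = 2.0) -> np.ndarray:
    """Return indices of segments with anomalously many control events."""
    mu, sigma = event_counts.mean(), event_counts.std()
    z = (event_counts - mu) / (sigma + 1e-9)
    return np.flatnonzero(z > z_thresh)

# Toy example: 10 segments, one with a burst of pauses.
counts = np.array([3, 4, 2, 5, 3, 40, 4, 2, 3, 4])
print(find_peaks(counts))  # segment 5 stands out
```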
Key Findings
- Reliable prediction — MLLM embeddings reliably predict interaction peaks
- Cross-domain generalization — Works on unseen academic fields
- Interpretable — Predictions encode theory-relevant instructional concepts
- Cost-efficient — Practical for pre-screening educational video design
The Theory Connection
The work draws on multimedia learning theory, which prescribes instructional design that keeps cognitive load manageable. Using concept activation vectors, the researchers show that the model's predictions correspond to theoretically meaningful instructional features, coded by GPT-5.
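The concept-activation-vector idea can be sketched briefly: learn a direction in embedding space that separates segments labeled with an instructional concept from other segments, then check whether the peak predictor is sensitive to that direction. Everything below is synthetic; in the paper the concept labels came from GPT-5, and the sensitivity measure here (cosine with a linear predictor's weights) is a simplifying assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 64
concept_dir = rng.normal(size=dim)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic embeddings: concept-positive segments are shifted along
# a hidden concept direction the probe should recover.
pos = rng.normal(size=(100, dim)) + 2.0 * concept_dir
neg = rng.normal(size=(100, dim))
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)

probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # the CAV

# Sensitivity of a (synthetic, linear) peak predictor to the concept:
# cosine between its weight vector and the CAV. Positive alignment
# means the concept helps drive peak predictions.
peak_weights = concept_dir + 0.05 * rng.normal(size=dim)
sensitivity = float(cav @ peak_weights / np.linalg.norm(peak_weights))
print(sensitivity)
```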
Why This Matters
- Instructors can pre-screen video designs before deployment
- EdTech platforms can identify problematic content at scale
- Researchers can empirically examine multimedia learning theory at unprecedented scale
- Students benefit from better-designed educational content
Practical Application
Imagine uploading a lecture video and immediately seeing which segments are likely to cause cognitive overload, confusion, or disengagement. That is what this pipeline enables.
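As a hypothetical pre-screening report built on top of such a model: given per-segment peak probabilities, list the timestamps an instructor should review. The 15-second segment length and 0.7 threshold are illustrative assumptions.

```python
def flag_segments(peak_probs, seg_seconds=15, threshold=0.7):
    """Return (timestamp, probability) pairs for segments worth reviewing."""
    flags = []
    for i, p in enumerate(peak_probs):
        if p >= threshold:
            start = i * seg_seconds
            flags.append((f"{start // 60}:{start % 60:02d}", round(p, 2)))
    return flags

# Per-segment peak probabilities from a trained classifier.
probs = [0.1, 0.2, 0.85, 0.3, 0.95, 0.4]
print(flag_segments(probs))  # → [('0:30', 0.85), ('1:00', 0.95)]
```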