Predicting Student Video Behavior with Multimodal LLMs: 77 Million Events from 66 Online Courses

2026-04-07 · 1 min read

Research accepted at AIED 2026 demonstrates that multimodal large language models (MLLMs) can reliably predict student video interaction patterns (watching, pausing, skipping, rewinding) from video content alone, using these interactions as proxies for cognitive load.

The Approach

The pipeline leverages MLLMs to compute embeddings of short video segments, then trains a neural classifier to identify interaction peaks — moments where students are most likely to pause, skip, or rewind.
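The summary does not publish the embedding model, classifier architecture, or peak-labeling rule, so the sketch below is only a minimal illustration of that training stage, with placeholder data standing in for real segment embeddings and peak labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder data: one MLLM embedding per short video segment, plus a
# binary label marking whether that segment is an interaction peak
# (a moment with unusually many pauses, skips, or rewinds). Real
# embeddings would come from the chosen multimodal model.
X = rng.normal(size=(5000, 1536))   # stand-in for segment embeddings
y = rng.integers(0, 2, size=5000)   # stand-in for peak labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Small feed-forward classifier over the frozen embeddings.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), random_state=0)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]
print(f"held-out AUC: {roc_auc_score(y_test, probs):.3f}")
```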

The Data

The study draws on 77 million video interaction events logged across 66 online courses.

Key Findings

  1. Reliable prediction — MLLM embeddings reliably predict interaction peaks
  2. Cross-domain generalization — Works on unseen academic fields
  3. Interpretable — Predictions encode theory-relevant instructional concepts
  4. Cost-efficient — Practical for pre-screening educational video design

The Theory Connection

The work draws on multimedia learning theory, which prescribes how instructional design should manage cognitive load. Using concept activation vectors, the researchers show that the model's predictions correspond to theoretically meaningful instructional features, as coded by GPT-5.
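Concept activation vectors (in the style of Kim et al.'s TCAV) can be sketched roughly as: fit a linear probe that separates embeddings of concept-coded segments from random segments, take the probe's normal vector as the concept direction, and test whether nudging embeddings along that direction raises the peak classifier's output. The concept labels and probe setup below are assumptions for illustration, not the paper's published recipe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 1536

# Hypothetical inputs: embeddings of segments coded as showing some
# instructional concept (e.g., a worked example) vs. random segments.
concept_embs = rng.normal(loc=0.1, size=(200, dim))
random_embs = rng.normal(loc=0.0, size=(200, dim))

# 1) Fit a linear probe separating concept from random embeddings.
X = np.vstack([concept_embs, random_embs])
y = np.array([1] * len(concept_embs) + [0] * len(random_embs))
probe = LogisticRegression(max_iter=1000).fit(X, y)

# 2) The CAV is the unit normal of the probe's decision boundary.
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# 3) Sensitivity test: does nudging embeddings along the CAV raise the
#    peak classifier's predicted probability? `clf` is any fitted
#    classifier with predict_proba, e.g. the one sketched earlier.
def tcav_score(clf, embs, cav, eps=1e-2):
    base = clf.predict_proba(embs)[:, 1]
    nudged = clf.predict_proba(embs + eps * cav)[:, 1]
    return np.mean(nudged > base)  # fraction with positive sensitivity
```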

Why This Matters

The findings point to a cheap, content-only way to evaluate instructional video design before a video ever reaches students, rather than after interaction data has accumulated.

Practical Application

Imagine uploading a lecture video and immediately knowing which segments will cause cognitive overload, confusion, or disengagement — that's what this pipeline enables.
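As a hypothetical sketch of that workflow (the helper name and threshold below are illustrative, not from the paper): embed each segment, score it with the trained classifier, and flag anything above a cutoff for redesign review.

```python
import numpy as np

def flag_risky_segments(segment_embeddings, clf, threshold=0.8):
    """Return indices of segments the classifier marks as likely
    pause/skip/rewind peaks. `clf` is a fitted peak classifier and
    `threshold` is an arbitrary illustrative cutoff."""
    probs = clf.predict_proba(np.asarray(segment_embeddings))[:, 1]
    return [i for i, p in enumerate(probs) if p >= threshold]

# A designer could then review only the flagged segments of a lecture
# before the video reaches any students.
```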

↗ Original source · 2026-04-07