Measuring event segmentation
An investigation into the stability of event boundary agreement across groups
Background
To study how people spontaneously divide continuous experience into distinct events (event segmentation), we usually ask participants to watch a movie and simultaneously mark the boundaries between events (segmentation task).
Performance in this task is typically quantified as segmentation agreement: how well participants' boundary markings overlap with one another. Although segmentation agreement can be relatively high, the segmentation task remains inherently subjective & ambiguous. It is therefore unclear what kind of information segmentation agreement values reflect.
Questions
In this paper we ask whether measures (new and existing; Fig 1) that quantify segmentation agreement can:
- Capture responses that are non-random and movie-specific.
- Stabilize over increasing sample size.

Method
First, we bootstrapped segmentation task performance collected in-lab and online to create samples of varying sizes (Fig 2.A). Then, for each agreement measure, we calculated agreement values using true, randomized, or mismatched data to test the sensitivity and specificity of each measure (Fig 2.B). Finally, we tested each measure's ability to capture non-random and movie-specific responses, the minimum sample size required to observe these properties, and the minimum sample size required to observe stable performance (Fig 2.C).
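As an illustration of the bootstrap-and-compare logic (not the paper's exact pipeline), the sketch below uses a simple leave-one-out correlation as a stand-in agreement measure and compares true responses against a time-shuffled (randomized) control. All data here are simulated, and all names and parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def agreement(sample):
    """Mean leave-one-out correlation between each participant's binary
    boundary vector and the rest of the sample's average boundary vector."""
    sample = np.asarray(sample, dtype=float)
    scores = []
    for i in range(len(sample)):
        rest = np.delete(sample, i, axis=0).mean(axis=0)
        if rest.std() == 0 or sample[i].std() == 0:  # skip degenerate rows
            continue
        scores.append(np.corrcoef(sample[i], rest)[0, 1])
    return float(np.mean(scores)) if scores else np.nan

# Simulated data: 20 participants, 300 time bins, a shared boundary profile
n_subj, n_bins = 20, 300
true_bounds = rng.random(n_bins) < 0.05                     # movie-specific boundaries
data = (rng.random((n_subj, n_bins)) < 0.6) & true_bounds   # noisy hits
data = data | (rng.random((n_subj, n_bins)) < 0.02)         # false alarms

# Randomized control: shuffle each participant's responses in time
shuffled = np.array([rng.permutation(row) for row in data])

def bootstrap_agreement(data, size, n_boot=200):
    """Draw bootstrap samples of a given size; return their agreement values."""
    vals = []
    for _ in range(n_boot):
        idx = rng.choice(len(data), size=size, replace=True)
        vals.append(agreement(data[idx]))
    return np.array(vals)

true_vals = bootstrap_agreement(data, size=6)
null_vals = bootstrap_agreement(shuffled, size=6)
print(np.nanmean(true_vals), np.nanmean(null_vals))
```

With responses aligned to a shared boundary profile, true-data agreement exceeds the shuffled baseline; mismatched data (pairing responses with the wrong movie) could be tested the same way by swapping in another movie's sample.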

Main Findings
For almost all measures, we found:
- When considering multiple samples, non-random & movie-specific agreement can be captured with performance from as few as 2 people.
- BUT, for a single sample to reliably capture signal-driven behavior, a larger sample is needed (6-18 people; Table 1).
- Agreement improved with increasing sample size and eventually stabilized (Fig 3) once the sample was large enough (10-16 people).
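To make "stabilized" concrete, one simple (hypothetical, not the paper's) plateau criterion is: the smallest sample size after which every further one-step gain in agreement stays below a tolerance. The curve below is a simulated agreement-vs-sample-size curve, not our data:

```python
import numpy as np

# Toy agreement-vs-sample-size curve: rises, then plateaus (simulated)
sizes = np.arange(2, 31)
curve = 0.6 * (1 - np.exp(-sizes / 5))

def stabilization_point(sizes, curve, tol=0.01):
    """Smallest sample size after which every further one-step gain in
    agreement stays below `tol` (a simple plateau criterion)."""
    gains = np.abs(np.diff(curve))
    for i in range(len(gains)):
        if np.all(gains[i:] < tol):
            return int(sizes[i + 1])
    return None

print(stabilization_point(sizes, curve))  # plateau size under this toy criterion
```

On this toy curve the criterion lands in the low teens, broadly consistent with the 10-16 range reported above, but the exact value depends entirely on the chosen tolerance.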


Practical takeaways
- The segmentation task remains a powerful tool for capturing how people spontaneously divide experience.
- But, as our data show, it has its limits.
- So we should be mindful when using one group's segmentation to infer another group's boundary placement.
Moving forward, we suggest that future studies using the segmentation task should consider:
- Type of agreement to measure (group/individual level).
- Mode and medium of collected data (e.g., online/in-lab, commercial/everyday movie).
- Sample size needed to capture reliable, stable, & signal-driven performance.