This study investigates whether addressees visually attend to speakers’ gestures in interaction and whether this attention is modulated by changes in social setting and display size. We compare a live face-to-face setting to two video conditions. In all conditions, the face dominates as a fixation target and only a minority of gestures draw fixations. The social and size parameters affect gaze mainly when combined, and in the opposite direction from that predicted, with fewer gestures fixated on video than live. Gestural holds and speakers’ gaze at their own gestures reliably attract addressees’ fixations in all conditions. The attraction force of holds is unaffected by changes in social and size parameters, suggesting a bottom-up response, whereas speaker-fixated gestures draw significantly less attention in both video conditions, suggesting a social effect for overt gaze-following and visual joint attention. The study provides and validates a video-based paradigm enabling further experimental but ecologically valid explorations of cross-modal information processing.