This paper trials new experimental methods for the analysis of natural language reasoning and the development of critical ordinary language philosophy in the wake of J.L. Austin. Philosophical arguments and thought experiments are strongly shaped by default pragmatic inferences, including stereotypical inferences. Austin suggested that contextually inappropriate stereotypical inferences are at the root of some philosophical paradoxes and problems, and that these can be resolved by exposing the underlying verbal fallacies. This paper builds on recent efforts to empirically document inappropriate stereotypical inferences that may drive philosophical arguments. We demonstrate that previously employed questionnaire-based output measures do not suffice to exclude relevant confounds. We then report an experiment that combines reading time measurements with plausibility ratings. The study seeks to provide evidence of inappropriate stereotypical inferences from appearance verbs, inferences that have been suggested to lie at the root of the influential ‘argument from illusion’. Our findings support a diagnostic reconstruction of this argument. They provide the missing component of a proof of concept for an experimental implementation of critical ordinary language philosophy that is in line with the ambitions of current ‘evidential’ experimental philosophy.