Abstract
The issues of double-counting, use-constructing, and selection effects have long been the subject of debate in the philosophical as well as the statistical literature. I have argued that it is the severity, stringency, or probativeness of the test (or lack of it) that should determine whether a double-use of data is admissible. Hitchcock and Sober ([2004]) question whether this ‘severity criterion’ can perform its intended job. I argue that their criticisms stem from a flawed interpretation of the severity criterion. Taking their criticism as a springboard, I elucidate some of the central examples that have long been controversial, and clarify how the severity criterion is properly applied to them.

1. Severity and Use-Constructing: Four Points (and Some Clarificatory Notes)
1.1. Point 1: Getting beyond ‘all or nothing’ standpoints
1.2. Point 2: The rationale for prohibiting double-counting is the requirement that tests be severe
1.3. Point 3: Evaluate severity of a test T by its associated construction rule R
1.4. Point 4: The ease of passing vs. ease of erroneous passing: Statistical vs. ‘Definitional’ probability
2. The False Dilemma: Hitchcock and Sober
2.1. Marsha measures her desk reliably
2.2. A false dilemma
3. Canonical Errors of Inference
3.1. How construction rules may alter the error-probing performance of tests
3.2. Rules for accounting for anomalies
3.3. Hunting for statistically significant differences
4. Concluding Remarks