Cognition

Volume 172, March 2018, Pages 11-25

Original Articles
Perceptions of randomness in binary sequences: Normative, heuristic, or both?

https://doi.org/10.1016/j.cognition.2017.11.002

Abstract

When people consider a series of random binary events, such as tossing an unbiased coin and recording the sequence of heads (H) and tails (T), they tend to erroneously rate sequences with less internal structure or order (such as HTTHT) as more probable than sequences containing more structure or order (such as HHHHH). This is traditionally explained as a local representativeness effect: Participants assume that the properties of long sequences of random outcomes—such as an equal proportion of heads and tails, and little internal structure—should also apply to short sequences. However, recent theoretical work has noted that the probability of a particular sequence of, say, heads and tails of length n occurring within a larger (>n) sequence of coin flips actually differs by sequence, so P(HHHHH) < P(HTTHT). On this alternative account, people apply rational norms based on limited experience. We test these accounts. Participants in Experiment 1 rated the likelihood of occurrence for all possible strings of 4, 5, and 6 observations in a sequence of coin flips. Judgments were better explained by representativeness in alternation rate, relative proportion of heads and tails, and sequence complexity than by objective probabilities. Experiments 2 and 3 gave similar results using incentivized binary choice procedures. Overall, the evidence suggests that participants are not sensitive to variation in the objective probabilities of a sub-sequence occurring; they appear to use heuristics based on several distinct forms of representativeness.

Introduction

Many of the judgments that humans make are based on the abstraction of patterns in events that occur in the world. These patterns can take many forms: the weather (deciding whether to take a coat or an umbrella based on the temperature and rainfall of previous days), the behavior of other individuals (guessing when a co-author is likely to complete a manuscript draft based on their previous timeliness), or the behavior of wider groups of people (forecasting sales for upcoming months based on figures from recent months).

One of the challenges of any pattern-detection system, whether human or artificial, is to separate signal from noise: to extract, and base predictions on, systematic patterns that appear in the environment, and ignore observations that are—to the system at least—random. If distinguishing between regularity (which has predictive value) and randomness (which does not) is a basic requirement for making successful predictions about the environment, it is surprising that, in higher-level cognition at least, humans are relatively poor at recognizing randomness (for reviews, see Bar-Hillel and Wagenaar, 1991, Falk and Konold, 1997, Nickerson, 2002, Nickerson, 2004; for a similar overview of randomness production, see Rapoport & Budescu, 1997).

Most empirical research examining human (mis-) understanding of randomness has used equiprobable binary outcomes (see Oskarsson, Van Boven, McClelland, & Hastie, 2009, for a review), such as the occurrence of red or black on a roulette wheel (e.g., Ayton & Fischer, 2004), or the birth order of boys and girls in a particular family (Kahneman & Tversky, 1972). The most common scenario is the occurrence of heads and tails when repeatedly tossing a fair, unbiased coin (e.g., Caruso et al., 2010, Diener and Thompson, 1985, Kareev, 1992). Across a variety of tasks—including choosing the most random of a set of sequences (e.g., Wagenaar, 1970), classifying individual sequences as random or non-random (e.g., Lopes & Oden, 1987), and predicting future outcomes of a sequence of coin tosses or roulette wheel spins (e.g., Ayton & Fischer, 2004)—participants appear to mischaracterize the outputs of a random generating mechanism.

The mischaracterizations that people make are similar across different types of task. They include (using Hahn & Warren’s, 2009, characterization): (a) a preference for negative recency between trials rather than independence, meaning that in binary outcomes there is an expectation of an alternation rate between outcomes of greater than 0.5; (b) a belief that in short sequences, equiprobable outcomes should occur equally often; and (c) a belief that an unstructured or unordered appearance indicates that a sequence of outcomes is more random, and hence more likely to arise from a random process (see, e.g., Falk and Konold, 1997, Wagenaar, 1970). These biases lead participants to show a gambler’s fallacy for random events (e.g., Ayton & Fischer, 2004), or a hot-hand bias for events under human control (Gilovich, Vallone, & Tversky, 1985; but see Miller and Sanjurjo, 2014, Miller and Sanjurjo, 2016): Following a run of the same outcome, such as five heads in a row in a coin-tossing procedure, participants rate the probability of that outcome occurring again as lower than after other strings when they believe the sequence is generated randomly (the gambler’s fallacy, showing negative recency), and as higher when the sequence could be under human control (the hot hand, showing positive recency). Similar effects are seen using continuous outcome measures in forecasting: participants make forecasts that reflect an assumption of serial dependence in a time series when outcomes are in fact random (Reimers & Harvey, 2011).

In one of the most influential studies of randomness perception, Kahneman and Tversky (1972) conducted two experiments in which participants estimated the relative frequency of two birth orders of boys (B) and girls (G) across families with six children in a city: GBGBBG or BGBBBB. Participants judged that there would be far fewer families with BGBBBB than GBGBBG, suggesting that a more representative 1:1 ratio of boys and girls was judged more likely. However, participants also rated BBBGGG as less likely to occur than GBGBBG, suggesting that the structure of the sequence, as well as the ratio of outcomes, was important. Their account, based on local representativeness (which we discuss below), has been the dominant explanation for human judgments of random sequences.

In this paper, we examine some of the ways in which people mischaracterize randomness. Specifically, explanations for deviations from normativity in randomness tasks have traditionally taken a heuristics and biases approach (Kahneman & Tversky, 1972). More recent theoretical approaches have emphasized the potential for apparent biases to reflect rational judgments in situations with limited experience. We discuss these two approaches now.

The set of arguments that comes from the heuristics and biases literature suggests performance can be characterized as the application of a representativeness heuristic to short sequences of outcomes. We would expect a random binary sequence of infinite length to have a number of properties: It should contain the same proportion of each outcome; it should have an alternation rate of around 0.5; and it should not contain any internal structure that allows it to be compressed (these properties are discussed further below). The heuristic account argues that people expect these properties of infinite-length random sequences to hold in short, exact strings of random outcomes as well. If a string lacks them, it is judged to be less random, or less likely to be generated by a random process. But in reality, all exact strings of a given length are equiprobable: for example, in a series of four coin tosses, the chance of tossing four heads in a row (HHHH) is the same as the chance of tossing HTTH, at one in sixteen. By misapplying a representativeness heuristic to short, exact strings of outcomes, participants would rate unrepresentative-looking outcomes (such as HHHH) as less likely to occur through a random process than more representative-looking outcomes (e.g., HTTH).
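To make the equiprobability point concrete, the short Python sketch below (our illustration, not from the original article) enumerates all 2^4 = 16 exact strings of four tosses and confirms that HHHH and HTTH each have probability 1/16.

```python
from itertools import product

# All 2^4 = 16 equally likely exact strings of four fair coin tosses.
strings = ["".join(s) for s in product("HT", repeat=4)]

print(len(strings))                          # 16
print(strings.count("HHHH") / len(strings))  # 0.0625 = 1/16
print(strings.count("HTTH") / len(strings))  # 0.0625 = 1/16
```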

The notion of representativeness has, however, been criticized as nebulous and untestable. Gigerenzer (1996) argued that many heuristics like representativeness lack theoretical specification, and therefore offer enough flexibility to risk being unfalsifiable, and can between them be used to make a post hoc account of almost any experimental finding. Ayton and Fischer (2004) noted that representativeness was used to account for both the gambler’s fallacy, and its opposite, the hot-hand fallacy. Falk and Konold (1997) also noted that there was no a priori way of predicting how representativeness might affect performance on a task, making falsifiable predictions difficult.

Kahneman and Tversky (1972) did make some attempt to define representativeness in binary randomness tasks. As noted above, they suggested that the relative proportions of the two outcomes might be important. In addition, strings containing more alternations (e.g., HHTHTH, which contains four alternations) typically appear more representative of a random generation process than strings containing fewer alternations (e.g., HHHHTT, which contains one alternation). Strings with relatively few alternations tend to contain long runs of a single outcome type, which appear unrepresentative of a random generation process. (Of course, these attributes are not independent: Strings with high alternation rates tend to contain shorter runs, and vice versa. See Scholl & Greifeneder, 2011, for an attempt to disambiguate the roles of run length and alternation rate in longer sequences of outcomes.)
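These properties are straightforward to operationalize. A minimal sketch (ours; the function names are illustrative) counting alternations, alternation rate, and the longest run in a string:

```python
def alternations(s: str) -> int:
    """Number of adjacent pairs that differ, e.g. 'HHTHTH' -> 4."""
    return sum(a != b for a, b in zip(s, s[1:]))

def alternation_rate(s: str) -> float:
    """Proportion of adjacent pairs that alternate; 0.5 expected from a fair coin."""
    return alternations(s) / (len(s) - 1)

def longest_run(s: str) -> int:
    """Length of the longest run of identical outcomes."""
    best = run = 1
    for a, b in zip(s, s[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

print(alternations("HHTHTH"), alternation_rate("HHTHTH"))  # 4 0.8
print(alternations("HHHHTT"), longest_run("HHHHTT"))       # 1 4
```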

Finally, a random generation process should produce sequences that are incompressible; that is, that contain no internal structure allowing them to be expressed more concisely than by giving the entire sequence. For example, HHHHHHHHHHHH could be compressed as (H × 12), and HHTHHTHHTHHT as (HHT × 4); in contrast, HTHHTTTHTHHT is not so easily compressed. On this basis, Kahneman and Tversky noted that strings of outcomes that can be given descriptive short-cuts (e.g., HTHTHT being “HT three times”) appear less random. This was more formally codified in Falk and Konold’s (1997) Difficulty Predictor (DP). Although DP primarily attempted to capture the subjective difficulty of encoding a sequence of outcomes, it is closely related to Kolmogorov complexity (Griffiths & Tenenbaum, 2003; see also Gauvrit, Singmann, Soler-Toscano, & Zenil, 2016, for a method of calculating Kolmogorov-Chaitin complexity for short binary strings; for longer sequences, formal and subjective compressibility may diverge due to cognitive limitations). As complexity is one way of defining the randomness of a sequence, use of DP in judgments could be seen as reflecting the misapplication of a norm in which participants make their judgments based on the entropy of a sequence, rather than its probability of occurrence.
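As a toy illustration of compressibility (this is not Falk and Konold’s DP, nor a Kolmogorov measure; just a sketch of the intuition), the following finds the shortest block that tiles a string, recovering the descriptive short-cuts above:

```python
def shortest_repeating_unit(s: str) -> str:
    """Shortest block that, repeated, reproduces s (s itself if none shorter exists)."""
    for k in range(1, len(s)):
        if len(s) % k == 0 and s[:k] * (len(s) // k) == s:
            return s[:k]
    return s

for s in ["HHHHHHHHHHHH", "HHTHHTHHTHHT", "HTHHTTTHTHHT"]:
    unit = shortest_repeating_unit(s)
    print(f"{s} -> ({unit} x {len(s) // len(unit)})")
# HHHHHHHHHHHH -> (H x 12)
# HHTHHTHHTHHT -> (HHT x 4)
# HTHHTTTHTHHT -> (HTHHTTTHTHHT x 1), i.e. no short-cut of this kind exists
```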

The idea that several different properties may contribute to the representativeness of a string introduces further degrees of freedom to the heuristic-based account, and renders it correspondingly difficult to test. In particular, the relative influence of proportions, alternations, and compressibility under an account of local representativeness remains untested. Examining the extent to which one kind of representativeness is more important than others in guiding randomness performance could help with understanding the representations and processes involved, and constrain local representativeness predictions for other situations. This is one of the aims of the current experiments.

An alternative set of arguments treats apparent biases in randomness judgments as adaptive responses to environmental experience. Several authors have noted that events in the world may exhibit negative recency, that is, immediately following an outcome, the same outcome is less likely to occur again. For example, after several days of rain, the nature of weather patterns may make it less likely that rain will continue the following day (see, e.g., Ayton and Fischer, 2004, Pinker, 1997).

More abstractly, participants may confuse sampling with replacement and sampling without replacement (see Fiorina, 1971, Morrison and Ordeshook, 1975, for early discussions of this possibility, and Rabin, 2002, for an attempt to model the idea). If I draw beads without replacement from an urn containing 10 red and 10 green, then after drawing 4 reds in a row, the probability of the next bead being green is greatly increased (10 of the remaining 16 beads are green, so the probability is 0.625 rather than 0.5). Many real-world samples involve drawing without replacement, which may encourage more general assumptions of negative recency in randomness judgments, either through overgeneralization or through misconstruing the experimental environment (Ayton and Fischer, 2004, Hahn and Warren, 2009).

There is also a set of models that build on counterintuitive properties of random sequences, suggesting that erroneous or biased judgments might reflect the (mis-) application of alternative norms; that is, accurately representing one’s experience of random sequences, but misapplying that experience when asked to make judgments or choices. For example, Kareev (1992) demonstrated that participants who were instructed to generate random sequences tended to produce typical sequences with respect to the number of heads and tails they contained. This was accounted for by noting that across all 1024 possible sequences of 10 coin flips, 252 contain exactly 5 heads, whereas, for example, only 10 contain 9 heads. Thus, the most frequent number of heads is 5, and sequences containing exactly 5 heads are most typical of 10-item random sequences. Kareev used this observation to account for overalternation biases seen in randomness production: If participants generate typical sequences containing 5 heads and 5 tails, then these sequences will on average have an alternation rate higher than 0.5.
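Kareev’s counts, and the resulting overalternation, can be verified by exhaustive enumeration. A brief sketch (ours): across all 1024 ten-flip sequences, 252 contain exactly five heads, and those “typical” sequences have a mean alternation rate of 5/9 ≈ 0.56, above 0.5.

```python
from itertools import product

seqs = ["".join(t) for t in product("HT", repeat=10)]      # all 1024 sequences
print(sum(s.count("H") == 5 for s in seqs))                # 252
print(sum(s.count("H") == 9 for s in seqs))                # 10

def alternation_rate(s: str) -> float:
    return sum(a != b for a, b in zip(s, s[1:])) / (len(s) - 1)

typical = [s for s in seqs if s.count("H") == 5]
print(sum(map(alternation_rate, typical)) / len(typical))  # 0.5556 (= 5/9)
```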

As another example, Miller and Sanjurjo (2016) recently showed that in a short random binary sequence of outcomes, the expected proportion of streaks of three identical outcomes that are followed by the same outcome again is less than 0.5 (and conversely, of course, the expected proportion followed by the opposite outcome is greater than 0.5). Thus, evidence traditionally taken to show that the hot hand is a fallacy (Gilovich et al., 1985) may in fact suggest that it is not.
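Miller and Sanjurjo’s result can likewise be checked by enumeration. In the sketch below (ours; the sequence length of 10 and streak length of 3 are arbitrary choices), each sequence containing at least one streak of three heads contributes the proportion of such streaks that are followed by another head; the mean of these per-sequence proportions falls clearly below 0.5.

```python
from itertools import product
from statistics import mean

def prop_h_after_hhh(s: str):
    """Within one sequence: of the flips immediately following HHH, the share that are H."""
    nxt = [s[i + 3] for i in range(len(s) - 3) if s[i:i + 3] == "HHH"]
    return nxt.count("H") / len(nxt) if nxt else None

props = [p for s in ("".join(t) for t in product("HT", repeat=10))
         if (p := prop_h_after_hhh(s)) is not None]
print(mean(props))  # well below 0.5
```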

Most significantly, and of most relevance to this paper, Hahn and Warren (2009) have developed a theory employing the fact that in a short random binary sequence, some strings of specific outcomes are actually less likely to be observed than others (see Reimers, 2017, for a discussion of similarities between this theory and the work of Miller & Sanjurjo, 2016; see also Konold, 1995, Nickerson, 2007, and Kareev, 1992, for earlier psychologically-motivated work relating to this phenomenon, and Feller, 1968 for mathematical background). Hahn and Warren’s argument involved considering strings as component parts of longer sequences of events. While it is true that the two strings HHTHTT and HHHHTH are equally likely to occur given exactly six tosses of a coin, it is not the case that these strings are equally likely to occur at least once in any global sequence of finite length n > 6. The argument is presented in detail by Hahn and Warren (see also Sun et al., 2010, Sun and Wang, 2010a, Sun and Wang, 2010b), and summarized here. For this purpose, we use the term string to refer to a relatively short sequence of heads and tails that participants might be asked to make a judgment on, and global sequence to refer to a longer sequence of heads and tails, generated by tossing a coin, in which that string may appear. For example, the string THT (with length k = 3) appears three times in the global sequence HTHTTTTHTHT, which has length n = 11. Note that two of the occurrences overlap.

If the global sequence is infinitely long, then any two strings of the same length k will occur equally often in the long run. However, the distribution of these occurrences will not be the same for all strings: Occurrences of the string HHHH will tend to cluster. Suppose that HHHH appears at position t in the global sequence (where by ‘appears’ we mean ‘is completed’; i.e., the elements at positions t – 3, t – 2, t – 1, and t are all H). There is then a 50% chance that it will appear again at position t + 1 (i.e., if the coin toss on trial t + 1 yields H, then positions t – 2, t – 1, t, and t + 1 are all H). In contrast, the string HHHT cannot cluster in this way, because there is no way for two occurrences of HHHT to overlap: all occurrences of this string must be entirely separate. To illustrate this, Fig. 1 shows a raster plot of a simulated global sequence of 1000 coin flips, with bars marking the points at which each of the strings HHHH and HHHT occurred. Since the global sequence (n = 1000) is very long relative to the length of each string (k = 4), the total number of occurrences of HHHH and HHHT in the global sequence is approximately equal. However, the distributions are very different: Occurrences of HHHT are relatively regular (a ‘steady drip’), whereas occurrences of HHHH tend to arrive in irregular clusters, with large gaps in between, leaving long stretches in which HHHH does not occur at all.
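This clustering is easy to reproduce in simulation. A minimal sketch (ours) mirrors the Fig. 1 demonstration: in 1000 simulated flips, HHHH and HHHT complete about equally often, but HHHH occurrences can sit one flip apart (overlapping), whereas consecutive HHHT occurrences are always at least four flips apart.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only
flips = "".join(random.choice("HT") for _ in range(1000))

def completion_points(seq: str, s: str):
    """Indices at which string s is completed; overlapping occurrences all count."""
    return [i + len(s) for i in range(len(seq) - len(s) + 1) if seq[i:i + len(s)] == s]

for s in ("HHHH", "HHHT"):
    pts = completion_points(flips, s)
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    print(s, "count:", len(pts), "min gap:", min(gaps), "max gap:", max(gaps))
# HHHH: min gap of 1 (clusters) and long droughts; HHHT: min gap >= 4, steadier spacing.
```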

The upshot is that there are many more windows of a given sequence length (n > 4) that do not contain the string HHHH than do not contain the string HHHT: In a sequence of length, say, 20, the probability that the string HHHH does not occur (which we label PGN, standing for probability in the Global sequence of Non-occurrence, following Sun et al.’s terminology) is greater (at around 0.5) than the equivalent probability for HHHT (at around 0.25). Equivalently, the probability that HHHH occurs at least once as part of this global sequence of 20 tosses (labelled PGO, standing for the probability in the Global sequence of Occurrence) is less than for HHHT.
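These probabilities can be computed exactly by brute force over all 2^20 sequences of length 20 (slow but transparent; a recursion over partial-match states would be faster). A sketch (ours):

```python
from itertools import product

def pgn(s: str, n: int) -> float:
    """Exact P(string s never occurs in n fair coin flips), by exhaustive enumeration."""
    return sum(s not in "".join(seq) for seq in product("HT", repeat=n)) / 2 ** n

print(pgn("HHHH", 20))  # ~0.52: PGN for four heads in a row
print(pgn("HHHT", 20))  # ~0.25: PGN for HHHT, so PGO = 1 - PGN is higher for HHHT
```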

Hahn and Warren noted that people’s experiences of random sequences such as coin tosses are necessarily finite, and likely to be of moderate length, say 20 or 30 elements at most. Consequently, in the sequences that people have observed, there is a greater probability of not observing HHHH than not observing HHHT. When asked by a cognitive psychologist to pick which of HHHH or HHHT is more likely, if people assume this question refers to occurrence in a similar finite sample, then their preference for the latter should not be classed as an error; instead it is a sensible inference based on their experience and the statistical properties of the task at hand. More generally, Hahn and Warren stated that “There is not only a sense in which laypeople are correct, given a realistic but minimal model of their experience, that different exact orders are not equiprobable, it seems that the same experience might be able to provide a useful explanation of why some sequences are perceived to be special” (p. 457).

Although Hahn and Warren use PGO as the basis for their theory, they include a simplifying assumption: aside from streaks of a single outcome (e.g., HHHHH) and perfect alternations (e.g., HTHTH), people are assumed to treat all strings with the same proportion of heads and tails identically, and so do not differentiate between, say, HHHTT and HTHHT. As PGO does not vary much across strings with the same proportion of heads and tails, this simplifies the predictions made by the theory.

The central argument made by Hahn and Warren (2009) is that judgments may stem from participants over-extending their previous experience of genuine differences in probabilities-of-occurrence to artificial situations contrived by experimenters – the application of alternative norms. This is intriguing, and offers a more experiential alternative to the notion of a representativeness bias: on this account, participants accurately recall limited frequency information. Of course, the non-normativity of a judgment may be relatively inconsequential in many laboratory studies. However, representativeness-based biases occur both in memory for random sequences (Olivola & Oppenheimer, 2008) and in higher-stakes choices with real financial (e.g., Chen, Moskowitz, & Shue, 2016) or health (e.g., Kwan, Wojcik, Miron-Shatz, Votruba, & Olivola, 2012) outcomes. This suggests that, whatever the cause, the bias is not merely a consequence of low-stakes or hypothetical tasks.

The accounts above represent two sides of a broader debate on optimization. Some approaches (e.g., Kahneman & Tversky, 1972) assume that randomness judgments are one more example of the ways in which we deviate from optimality, adding to the canon of situations in which, perhaps because of processing or motivational limitations (Simon, 1957), we show suboptimal but functionally adequate judgment and decision making. The dominant alternative account suggests that our judgment and decision making reflects our limited experience with the environment (Hahn, 2014, Hahn and Warren, 2009; see also Miller & Sanjurjo, 2016, and Hertwig, Pachur, & Kurzenhäuser, 2005). On accounts of this nature, experimenters inadvertently encourage participants to give mathematically or logically incorrect answers by structuring the experimental stimuli in a way that does not reflect the environment to which participants are adapted. For more general reviews of these positions, see Oaksford and Chater (2007), Bowers and Davis (2012), and Gigerenzer (2007).

However, the use of heuristics and environmental optimization are not mutually exclusive. Hahn and Warren argue that their alternative norms need not be seen as an alternative to heuristics. Instead, we may have come to rely on heuristics precisely because they capture regularities in the environment reasonably well.

There appear to be four ways in which alternative norms, heuristics, and behavior could be related. The first is that in randomness judgment tasks, although people appear to use heuristics, in fact they do not: instead they use the alternative norm of probability of string occurrence, which generates behavior that happens to look like heuristic use because, for example, both accounts predict that people should find the sequence HHHHH particularly improbable relative to other 5-item strings. A second possibility is that alternative norms combine additively with heuristic use to improve judgment. A third is that alternative norms explain the existence and application of the heuristics we use in randomness tasks: we apply a representativeness heuristic because it captures the alternative norms in the environment better than assuming equiprobability does, even if it is not successful for all sequences, and of course fails in the less ecological tasks devised by psychologists. Finally, it is possible that although these alternative norms are a statistical reality, they have no influence on behavior, and similarities between their predictions and behavior are coincidental.

Kahneman and Tversky’s original experiments used just two examples of six-item birth-order strings, and consequently lack the sensitivity to assess the relative influence of different aspects of representativeness (e.g., relative proportion of outcomes, alternation rate, compressibility) or the use of alternative norms of the kind suggested by Hahn and Warren (2009). We know of no more systematic attempt to examine the factors affecting people’s judgments of the likelihood of occurrence of different strings of binary outcomes using an approach similar to Kahneman and Tversky’s. The closest example is perhaps that of Scholl and Greifeneder (2011), who attempted to disentangle alternation rate and longest run as predictors of perceived randomness in 20- or 21-item binary sequences. Of course, with sequences of this length, PGO would be near-zero in any plausibly experienced sequence of outcomes, making it essentially untestable. The aim of this paper is therefore to provide empirical evidence to determine (a) the extent to which people’s judgments in evaluating random sequences show sensitivity to alternative norms; and (b) what kinds of representativeness are important in determining perceptions of randomness in the kind of task that led to the development of local representativeness accounts. We note that this work focuses on the perception of random sequences presented as a single entity, as done by Kahneman and Tversky, rather than offering a general account of randomness perception and production.

Section snippets

Experiment 1

In Experiment 1, we made PGO (the probability of a string occurring at least once in a sequence) normative, by asking participants explicitly to estimate the probability of a string occurring at least once within a longer, finite global sequence. This was done in order to maximize the chance of detecting an influence of PGO on judgments. Our rationale for this was that if the alternative norm represented by PGO does not influence judgments when it is actually normatively appropriate, then it…

Experiment 2

Although Experiment 1 showed clear effects, it has limitations. The task of estimating the probability that a string occurs in a sequence at least once is both difficult to reason about and low in ecological validity. We also observed that several participants noted that they thought all strings had an equal probability of occurrence, and so gave similar ratings for all options. In Experiment 2, participants made a series of binary choices between pairs of strings, indicating…

Experiment 3

The results of Experiment 2 are strongly congruent with, and readily predicted by, those of Experiment 1. In a final experiment, we used a task that was procedurally even easier to understand than those of Experiments 1 and 2, namely choosing which of two strings is likely to occur first in a sequence of coin tosses. This makes the task particularly straightforward for participants, and does not depend on participants attending to the specifics of a string appearing “at least once” in a sequence of…
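This “which occurs first” race relates to the nontransitive games discussed by Gardner (1974; see the reference list). A simulation sketch (ours; the string pair and trial count are arbitrary illustrative choices) estimates such first-occurrence probabilities; for example, HHT precedes HTT roughly two times in three.

```python
import random

def p_first(a: str, b: str, trials: int = 100_000) -> float:
    """Estimate P(string a is completed before string b) in a run of fair coin flips."""
    wins = 0
    for _ in range(trials):
        seq = ""
        while True:
            seq += random.choice("HT")
            if seq.endswith(a):
                wins += 1
                break
            if seq.endswith(b):
                break
    return wins / trials

print(p_first("HHT", "HTT"))  # ~0.67: HHT tends to appear before HTT
```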

General discussion

A pervasive view in the cognitive psychology literature is that people’s perception of randomness is fundamentally biased: that judgments regarding the relative likelihood of different strings of events reflect non-normative heuristics relating to the local representativeness of those strings. Much of the basis for this reasoning comes from a very small number of exemplars used by Kahneman and Tversky (1972). Consequently, these studies lack the sensitivity to distinguish between judgments…

Acknowledgments

We are grateful to Peter Ayton, Nick Chater, and Yaakov Kareev for helpful discussions regarding this work. Mike Le Pelley was supported by an Australian Research Council Future Fellowship (FT100100260).

References (52)

  • Chen, D.L., et al. (2016). Decision making under the Gambler’s Fallacy: Evidence from asylum judges, loan officers, and baseball umpires. The Quarterly Journal of Economics.
  • Diener, D., et al. (1985). Recognizing randomness. The American Journal of Psychology.
  • Falk, R., et al. (1997). Making sense of randomness: Implicit encoding as a basis for judgment. Psychological Review.
  • Farmer, G.D., et al. (2017). Who “believes” in the Gambler’s Fallacy and why? Journal of Experimental Psychology: General.
  • Feller, W. (1968). An introduction to probability theory and its applications.
  • Fiorina, M.P. (1971). A note on probability matching and rational choice. Behavioral Science.
  • Gardner, M. (1974). On the paradoxical situations that arise from nontransitive relations. Scientific American.
  • Gauvrit, N., et al. (2016). Algorithmic complexity for psychology: A user-friendly implementation of the coding theorem method. Behavior Research Methods.
  • Gigerenzer, G. (1996). On narrow norms and vague heuristics: A reply to Kahneman and Tversky. Psychological Review.
  • Gigerenzer, G. (2007). Gut feelings: The intelligence of the unconscious.
  • Griffiths, T.L., et al. (2003). Probability, algorithmic complexity, and subjective randomness.
  • Griffiths, T.L., et al. (2004). From algorithmic to subjective randomness. Advances in Neural Information Processing Systems.
  • Hahn, U. (2014). Experiential limitation in judgment and decision. Topics in Cognitive Science.
  • Hahn, U., et al. (2009). Perceptions of randomness: Why three heads are better than four. Psychological Review.
  • Hertwig, R., et al. (2005). Judgments of risk frequencies: Tests of possible cognitive mechanisms. Journal of Experimental Psychology: Learning, Memory, and Cognition.
  • Kareev, Y. (1992). Not that bad after all: Generation of random sequences. Journal of Experimental Psychology: Human Perception and Performance.