
Rational Foundations of Fast and Frugal Heuristics: The Ecological Rationality of Strategy Selection via Improper Linear Models


Abstract

Research on “improper” linear models has shown that predetermined weighting schemes for the linear model, such as equally weighting all predictors, can be surprisingly accurate on cross-validation. We review recent advances that can characterize the optimal choice of an improper linear model. We extend this research to the understanding of fast and frugal heuristics, particularly to the ecologically rational goal of understanding in which task environments given heuristics are optimal. We demonstrate how to test this model using the Recognition Heuristic and Take the Best heuristic, show how the model reconciles with the ecological rationality program, and discuss how our prescriptive, computational approach could be approximated by simpler mental rules that might be more descriptive. Echoing the arguments of van Rooij et al. (Synthese 187:471–487, 2012), we stress the virtue of having a computationally tractable model of strategy selection, even if one proposes that cognizers use a simpler heuristic process to approximate it.


Notes

  1. The random error term, \(\epsilon\), is assumed to have mean zero and constant variance, \(\sigma^{2}\). Note that we do not assume any particular distributional form for \(\epsilon\), e.g., normality.

  2. This formulation was designed for binary cues. It is more difficult to construct a weighting vector that represents a lexicographic ordering over continuous cues. For example, with truly continuous cues, the second cue value could be arbitrarily larger than the first, with an essentially unbounded ratio, so no weight placed on the first cue would “drown out” the importance of the second cue. Some researchers investigating TTB with continuous cues therefore recode them as binary, for example by performing a midpoint split: all values above the midpoint are recoded to 1 and all values below it to 0 (see the sketch below).
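As an illustration of the midpoint-split recoding mentioned in Note 2, here is a minimal sketch in Python; the function name and example data are our own illustrative choices, not from the paper.

```python
import numpy as np

def midpoint_split(X):
    """Recode each continuous cue to binary: 1 above the cue's midpoint, 0 otherwise."""
    midpoints = (X.min(axis=0) + X.max(axis=0)) / 2.0
    return (X > midpoints).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 objects, 3 continuous cues
X_binary = midpoint_split(X)   # binary cue matrix suitable for TTB
```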

References

  • Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2006). The priority heuristic: Making choices without trade-offs. Psychological Review, 113, 409–432.

  • Czerlinski, J., Gigerenzer, G., & Goldstein, D. G. (1999). How good are simple heuristics? In G. Gigerenzer, P. M. Todd, & The ABC Research Group (Eds.), Simple heuristics that make us smart (pp. 97–118). New York: Oxford University Press.

  • Dana, J. (2008). What makes improper linear models tick? In J. Krueger (Ed.), Rationality and social responsibility: Essays in honor of Robyn Mason Dawes. Mahwah, NJ: Lawrence Erlbaum Associates.

  • Dana, J., & Dawes, R. M. (2004). The superiority of simple alternatives to regression for social science predictions. Journal of Educational and Behavioral Statistics, 29, 317–331.

  • Davis-Stober, C. P. (2011). A geometric analysis of when fixed weighting schemes will outperform ordinary least squares. Psychometrika, 76, 650–669.

  • Davis-Stober, C. P., Dana, J., & Budescu, D. (2010a). A constrained linear estimator for multiple regression. Psychometrika, 75, 521–541.

  • Davis-Stober, C. P., Dana, J., & Budescu, D. (2010b). Why recognition is rational: Optimality results on single-variable decision rules. Judgment and Decision Making, 5, 216–229.

  • Dawes, R. M. (1979). The robust beauty of improper linear models. The American Psychologist, 34, 571–582.

  • Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.

  • Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 13, 171–192.

  • Fasolo, B., McClelland, G. H., & Todd, P. M. (2007). Escaping the tyranny of choice: When fewer attributes make choice easier. Marketing Theory, 7, 13–26.

  • Flury, B., & Riedwyl, H. (1985). T2 tests, the linear two-group discriminant function, and their computation by linear regression. The American Statistician, 39, 20–25.

  • Gigerenzer, G. (1991). From tools to theories: A heuristic of discovery in cognitive psychology. Psychological Review, 98, 254–267.

  • Gigerenzer, G. (2008). Why heuristics work. Perspectives on Psychological Science, 3, 20–29.

  • Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1, 107–143.

  • Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650–669.

  • Gigerenzer, G., Todd, P. M., & the ABC Research Group. (1999). Simple heuristics that make us smart. New York: Oxford University Press.

  • Goldstein, D. G. (1997). Models of bounded rationality for inference. Doctoral thesis, The University of Chicago. Dissertation Abstracts International, 58(01), 435B. (University Microfilms No. AAT 9720040).

  • Goldstein, D. G., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition heuristic. Psychological Review, 109, 75–90.

  • Goldstein, D. G., & Gigerenzer, G. (2009). Fast and frugal forecasting. International Journal of Forecasting, 25, 760–772.

  • Hertwig, R., Davis, J. N., & Sulloway, F. J. (2002). Parental investment: How an equity motive can produce inequality. Psychological Bulletin, 128, 728–745.

  • Hogarth, R. M., & Karelaia, N. (2005). Ignoring information in binary choice with continuous variables: When is less more? Journal of Mathematical Psychology, 49, 115–124.

  • Hogarth, R. M., & Karelaia, N. (2006). “Take-The-Best” and other simple strategies: Why and when they work “well” with binary cues. Theory and Decision, 61, 205–249.

  • Kahneman, D., Slovic, P., & Tversky, A. (Eds.). (1982). Judgment under uncertainty: Heuristics and biases. Cambridge: Cambridge University Press.

  • Katsikopoulos, K. V. (2011). Psychological heuristics for making inferences: Definition, performance, and the emerging theory and practice. Decision Analysis, 8, 10–29.

  • Katsikopoulos, K. V., Schooler, L. J., & Hertwig, R. (2010). The robust beauty of ordinary information. Psychological Review, 117, 1259–1266.

  • Lehmann, E. L., & Casella, G. (1998). Theory of point estimation (2nd ed.). New York: Springer.

  • Marden, J. I. (2013). Multivariate statistics old school. Department of Statistics, The University of Illinois at Urbana-Champaign.

  • Martignon, L., & Hoffrage, U. (2002). Fast, frugal, and fit: Simple heuristics for paired comparison. Theory and Decision, 52, 29–71.

  • Schmidt, F. L. (1971). The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31, 699–714.

  • Shanteau, J., & Thomas, R. P. (2000). Fast and frugal heuristics: What about unfriendly environments? Behavioral and Brain Sciences, 23, 762–763.

  • van Rooij, I., Wright, C. D., & Wareham, T. (2012). Intractability and the use of heuristics in psychological explanations. Synthese, 187, 471–487.

  • von Winterfeldt, D., & Edwards, W. (1973). Costs and payoffs in perceptual research. Technical Report No. 011313-1-T, Engineering Psychology Laboratory, University of Michigan.

  • Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83, 213–217.

  • Wilks, S. S. (1938). Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 3, 23–40.


Acknowledgments

We thank Mirjam Jenny and Jean Whitmore for helpful comments.

Author information

Correspondence to Jason Dana.

Appendices

Appendix 1

Prior work has shown that simple heuristics, such as TTB, can outperform standard statistical methods, such as regression, in many task environments (Katsikopoulos et al. 2010). There is a parallel literature comparing improper linear models to standard regression techniques (Dana and Dawes 2004; Davis-Stober 2011; Dawes 1979; Wainer 1976), demonstrating that improper linear models often perform quite favorably in task environments where information is limited, i.e., where sample and effect sizes are small to modest. There is converging evidence that simple decision heuristics and improper linear models work well for similar reasons. For example, the loss function we have used, mean squared error, can be decomposed into the sum of an estimator’s squared bias and its variance. Said simply, one can improve overall accuracy (mean squared error) by systematically accepting a small amount of bias in exchange for reduced variability of the estimates. As described by Dana (2008) and Davis-Stober et al. (2010a), this is one interpretation of the strong performance of improper linear models relative to standard estimation methods: the improper models are biased because their weighting policies are pre-determined, yet their variance is greatly reduced for the same reason, whereas standard methods, such as regression, are unbiased in their estimates but more variable from sample to sample.
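To make this bias–variance argument concrete, the following simulation sketch (our own illustration, not from the paper; all settings such as n, p, the true weights, and the noise level are assumptions) compares out-of-sample error for OLS and for a fixed equal-weighting model scaled by the constant k defined in Eq. (5) below.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_train, n_test, n_reps = 5, 15, 1000, 2000
beta = np.ones(p)                           # true weights happen to be equal

ols_mse, unit_mse = [], []
for _ in range(n_reps):
    X = rng.normal(size=(n_train, p))
    y = X @ beta + rng.normal(scale=2.0, size=n_train)
    X_new = rng.normal(size=(n_test, p))
    y_new = X_new @ beta + rng.normal(scale=2.0, size=n_test)

    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # unbiased, high variance
    a = np.ones(p) / np.sqrt(p)                    # fixed equal-weight direction
    k = (a @ X.T @ y) / (a @ X.T @ X @ a)          # scaling constant from Eq. (5)
    b_unit = a * k                                 # biased, low variance

    ols_mse.append(np.mean((y_new - X_new @ b_ols) ** 2))
    unit_mse.append(np.mean((y_new - X_new @ b_unit) ** 2))

print(f"OLS test MSE:          {np.mean(ols_mse):.3f}")
print(f"Equal-weight test MSE: {np.mean(unit_mse):.3f}")
```

With training samples this small, the equal-weight model typically shows lower test error than OLS, despite its bias.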

In this appendix, we summarize the main results from Davis-Stober et al. (2010a) that underlie our approach. We refer readers to the original papers for detailed proofs. Throughout, we consider the standard regression model \(\textit{Y} = \varvec{X\beta } + \epsilon\), where \(\varvec{X}\) is a known \(n \times p\) design matrix, \(\epsilon \sim (0, \; \sigma ^{2}\varvec{I}_{n \times n})\), \(\varvec{\beta }\) is a set of unknown population weights, and the matrix \(\varvec{X'X}\) is full rank and positive-definite. We now formally define our estimator of \(\varvec{\beta }\) that conforms to the weighting policy of a suitably chosen improper linear model. We term this estimator the constrained estimator.

Definition

(Davis-Stober et al. 2010a). Define \(\varvec{a}\) to be an exogenously chosen \(p \times 1\) weighting vector, with \(\Vert \varvec{a}\Vert ^{2} > 0\). Assume \(\textit{Y} \sim (\varvec{X\beta }, \; \sigma ^{2}I_{n \times n})\) with i.i.d. sampling. The constrained estimator \({\hat{\varvec{\beta }}}_{\varvec{a}}\) is an estimator of \(\varvec{\beta }\) and is defined as follows:

$$\begin{aligned} {\hat{\varvec{\beta }}}_{\varvec{a}} = \varvec{a}k, \end{aligned}$$
(5)

where \(k=(\varvec{a'X'Xa})^{-1}\varvec{a'X'y}\).
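This definition translates directly into code. The following is a minimal sketch under the stated assumptions (\(\varvec{X'X}\) full rank, \(\varvec{a}\) chosen exogenously), not the authors' implementation.

```python
import numpy as np

def constrained_estimator(X, y, a):
    """Constrained estimator of Eq. (5): beta_hat_a = a * k,
    with k = (a'X'Xa)^{-1} a'X'y."""
    XtX = X.T @ X
    k = (a @ X.T @ y) / (a @ XtX @ a)
    return a * k

# Example: equal weights as the exogenously chosen policy
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.8, 0.6, 0.4]) + rng.normal(size=30)
a = np.ones(4) / 2.0                    # unit-norm equal-weight vector
beta_hat = constrained_estimator(X, y, a)
```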

This formulation allows us to compare the mean squared error (a.k.a. loss) of the constrained estimator \({\hat{\varvec{\beta}}}_{\varvec{a}}\) to more traditional estimators such as the standard regression estimator, Ordinary Least Squares (OLS). We can also calculate the loss of \({\hat{\varvec{\beta}}}_{\varvec{a}}\) under various choices of exogenously chosen weights, \(\varvec{a}\). In other words, given different predictor cue structures, \(\varvec{X}\), we can determine which choices of \(\varvec{a}\) result in a constrained estimator that incurs less loss than others.

A potential stumbling block is that the true value of the population weights, \(\varvec{\beta }\), is unknown. Because the parameter k is a scalar and cannot change the pre-determined relationships among the estimates in \({\hat{\varvec{\beta}}}_{\varvec{a}}\), which are fixed by \(\varvec{a}\), the constrained estimator is both biased and inconsistent. Further, we do not know, a priori, how biased the constrained estimator will be for any particular choice of \(\varvec{a}\). If our exogenously chosen weights are, perhaps by luck or by ecological rationality principles, very similar to the population weights, then the resulting constrained estimator will not be very biased, yielding a relatively small mean squared error. On the other hand, exogenously chosen weights that are quite different from \(\varvec{\beta }\) result in larger bias, hence larger mean squared error, all else being equal.

We address this problem by solving for the maximum possible mean squared error. The following theorem provides a tight upper bound on the mean squared error of \({\hat{\varvec{\beta}}}_{\varvec{a}}\); this bound can be expressed as a function of the exogenously chosen weights, \(\varvec{a}\), and the matrix \(\varvec{X}\).

Theorem 1

(Davis-Stober et al. 2010a). Assume \(\varvec{a}\) is chosen exogenously and, without loss of generality, let \(\Vert \varvec{a}\Vert = 1\), \(\Vert \varvec{\beta }\Vert ^{2} < \infty\), and \(\textit{Y} \sim (\varvec{X\beta }, \sigma ^{2}\varvec{I}_{n \times n})\) with i.i.d. sampling. The mean squared error (MSE) of \({\hat{\varvec{\beta}}}_{\varvec{a}}\) is bounded above by the following,

$$\begin{aligned} MSE_{\hat{\varvec{\beta }}_{\varvec{a}}} \le \Vert \varvec{\beta }\Vert ^{2} \left( \frac{\varvec{a}'(\varvec{X}'\varvec{X})^{2} \varvec{a}}{(\varvec{a}'\varvec{X}'\varvec{Xa})^{2}} \right) + \frac{\sigma ^{2}}{\varvec{a'X'Xa}}. \end{aligned}$$
(6)

This maximal MSE is attained when the population weights, \(\varvec{\beta }\), are a scale multiple of the vector

$$\begin{aligned} \varvec{\beta }^{\varvec{*}} = \varvec{X'Xa} - \varvec{a(a'X'Xa)}. \end{aligned}$$
(7)

Theorem 1 provides two important pieces of information. First, for any choice of weights \(\varvec{a}\) and matrix \(\varvec{X'X}\), it explicitly describes the maximum loss that the constrained estimator can incur. Second, it provides the population weights, \(\varvec{\beta }^{\varvec{*}}\), under which this maximal loss occurs. Not surprisingly, this worst-case set of population weights is geometrically orthogonal to \(\varvec{a}\) (Davis-Stober et al. 2010a). Under this “worst case” condition, the improper linear model being deployed, \(\varvec{a}\), is maximally unfit given the true weighting relationship between the dependent variable and the predictors, \(\varvec{\beta }^{\varvec{*}}\).
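Both quantities in Theorem 1 are straightforward to compute. The sketch below (our illustration; \(\Vert\varvec{\beta}\Vert^{2}\) and \(\sigma^{2}\) are supplied as assumed inputs) evaluates the bound of Eq. (6) and the worst-case weights of Eq. (7).

```python
import numpy as np

def mse_upper_bound(X, a, beta_norm_sq, sigma_sq):
    """Right-hand side of Eq. (6): maximal squared bias plus variance."""
    XtX = X.T @ X
    quad = a @ XtX @ a
    max_bias_sq = beta_norm_sq * (a @ XtX @ XtX @ a) / quad ** 2
    variance = sigma_sq / quad
    return max_bias_sq + variance

def worst_case_beta(X, a):
    """Eq. (7): direction of population weights attaining the bound.
    Requires a to have unit norm; the result is orthogonal to a."""
    XtX = X.T @ X
    return XtX @ a - a * (a @ XtX @ a)
```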

Theorem 1 also gives an explicit description of maximal loss in terms of the well-known bias–variance trade-off: mean squared error can be written as the sum of an estimator’s squared bias and its variance. More formally, for any estimator, \({\hat{\varvec{\beta}}}\),

$$\begin{aligned} MSE_{\hat{\varvec{\beta}}} = \Vert E({\hat{\varvec{\beta}}})-\varvec{\beta }\Vert ^{2} + tr(\varvec{\Psi }_{\hat{\varvec{\beta }}}), \end{aligned}$$
(8)

where \(\varvec{\Psi }_{\hat{\varvec{\beta }}}\) is the covariance matrix of \(\hat{\varvec{\beta }}\), and “tr” denotes the trace operator. Returning to Theorem 1, the maximal value of mean squared error for the constrained estimator is written as the sum of its squared bias, \(\Vert \varvec{\beta }\Vert ^{2} \left( \frac{\varvec{a}'(\varvec{X}'\varvec{X})^{2} \varvec{a}}{(\varvec{a}'\varvec{X}'\varvec{Xa})^{2}} \right)\), and its variance, \(\frac{\sigma ^{2}}{\varvec{a'X'Xa}}\); see Davis-Stober et al. (2010a) for the complete derivation and proof. The value \(\Vert \varvec{\beta }\Vert ^{2}\) in the bias term can be equated with overall effect size and does not depend on the direction of the population weights, only on the sum of their squared values. The remaining values in the bias and variance terms are functions only of \(\varvec{a}\) and \(\varvec{X'X}\).

From this interpretation, the relationship between a choice of \(\varvec{a}\) and the predictor variables is clear. Minimizing the maximal loss that one could incur depends upon optimizing the relationship between \(\varvec{a}\) and \(\varvec{X'X}\), as the product \(\varvec{X'Xa}\) features prominently in both the maximal bias and variance terms in Eq. (6). This is quite intuitive: for mean-centered predictor variables, \(\varvec{X'X} = (n-1)\varvec{C}\), where \(\varvec{C}\) is the covariance matrix among the predictor cues. In other words, to minimize maximum loss, the mini–max criterion, we need only find the best choice of \(\varvec{a}\) for a particular predictor cue covariance structure. The following theorem gives precisely this result.
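As a sanity check on the decomposition in Eq. (8), one can estimate the bias and variance of the constrained estimator by Monte Carlo and compare their sum to the directly estimated MSE. The sketch below does this with illustrative values of \(\varvec{\beta}\), n, p, and \(\sigma\) that we chose ourselves.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 40, 3, 1.0
X = rng.normal(size=(n, p))                     # fixed design matrix
beta = np.array([1.0, 0.5, 0.25])               # assumed population weights
a = np.ones(p) / np.sqrt(p)                     # unit-norm equal weights

estimates = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    k = (a @ X.T @ y) / (a @ X.T @ X @ a)
    estimates.append(a * k)
estimates = np.array(estimates)

mse = np.mean(np.sum((estimates - beta) ** 2, axis=1))
bias_sq = np.sum((estimates.mean(axis=0) - beta) ** 2)
var = np.trace(np.cov(estimates, rowvar=False))
print(f"MSE {mse:.4f}  =  squared bias {bias_sq:.4f}  +  variance {var:.4f}")
```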

Theorem 2

(Davis-Stober et al. 2010a). Assume \(\varvec{X}\) is given, and without loss of generality let \(\Vert \varvec{a}\Vert ^{2} = 1\). Define \(\lambda _{\max }\) as the largest eigenvalue of the matrix \(\varvec{X'X}\), and assume \(\Vert \varvec{\beta }\Vert ^{2} < \infty\). The weight vector that is mini–max with respect to all exogenously chosen \(\varvec{a} \in \mathbb {R}^{p}\) is the eigenvector corresponding to the largest eigenvalue of \(\varvec{X'X}\). The maximal value of \(MSE_{\hat{\varvec{\beta }}_{\varvec{a}}}\) is bounded by the following inequality:

$$\begin{aligned} MSE_{\hat{\varvec{\beta }}_{\varvec{a}}} \le \Vert \varvec{\beta }\Vert ^{2} + \frac{\sigma ^{2}}{\lambda _{\max }}. \end{aligned}$$
(9)

Putting everything together, Theorems 1 and 2 provide necessary and sufficient conditions for finding an optimal choice of improper weighting policy, according to the mini–max criterion, given a predictor cue covariance structure \(\varvec{C}\). By Theorem 2, this optimal weighting policy will be mini–max with respect to all other exogenously chosen weights that could be considered.
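In code, Theorem 2 amounts to a single eigendecomposition. The following sketch (our illustration, with assumed effect-size and noise values) returns the mini–max weighting vector and evaluates the bound of Eq. (9).

```python
import numpy as np

def minimax_weights(X):
    """Mini-max weights per Theorem 2: the unit-norm top eigenvector of X'X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                           # mean-center, so X'X = (n-1)C
a_star, lam_max = minimax_weights(X)

beta_norm_sq, sigma_sq = 1.0, 1.0                # assumed effect size and noise
print("worst-case MSE bound (Eq. 9):", beta_norm_sq + sigma_sq / lam_max)
```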

Appendix 2

Proof of Result 1

The proof follows by direct calculation.

$$\begin{aligned} \varvec{Cov}_{\varvec{cascade}}\varvec{a}_{1} = \left( \begin{array}{cccccc} 1+c2^{2p-2} &{} c2^{2p-3} &{} c2^{2p-4} &{} \cdots &{} \cdots &{} c2^{p-1}\\ c2^{2p-3} &{} 1+c2^{2p-4} &{} c2^{2p-5} &{} \cdots &{} \cdots &{} c2^{p-2}\\ c2^{2p-4} &{} c2^{2p-5} &{} 1+c2^{2p-6} &{} \cdots &{} \cdots &{} c2^{p-3}\\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \cdots &{} \vdots \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ c2^{p-1} &{} c2^{p-2} &{} c2^{p-3} &{} \cdots &{} c2^{1} &{} 1+c\\ \end{array}\right) \left( \begin{array}{c} \frac{1}{2^{0}}\\ \frac{1}{2^{1}}\\ \frac{1}{2^{2}}\\ \vdots \\ \vdots \\ \frac{1}{2^{p-1}}\\ \end{array}\right) , \end{aligned}$$

which is equal to

$$\begin{aligned} \left( \begin{array}{c} \frac{1+c2^{2p-2}}{2^{0}} + \frac{c2^{2p-3}}{2^{1}}+ \frac{c2^{2p-4}}{2^{2}} + \cdots + \frac{c2^{p-1}}{2^{p-1}}\\ \frac{c2^{2p-3}}{2^{0}} + \frac{1+c2^{2p-4}}{2^{1}}+ \frac{c2^{2p-5}}{2^{2}} + \cdots + \frac{c2^{p-2}}{2^{p-1}}\\ \frac{c2^{2p-4}}{2^{0}} + \frac{c2^{2p-5}}{2^{1}}+ \frac{1+c2^{2p-6}}{2^{2}} + \cdots + \frac{c2^{p-3}}{2^{p-1}}\\ \vdots \\ \vdots \\ \frac{c2^{p-1}}{2^{0}} + \frac{c2^{p-2}}{2^{1}}+ \frac{c2^{p-3}}{2^{2}} + \cdots + \frac{1+c}{2^{p-1}}\\ \end{array}\right) , \end{aligned}$$

which simplifies to

$$\begin{aligned} \left( \begin{array}{c} \frac{1+c2^{2p-2}}{2^{0}} + \frac{c2^{2p-4}}{2^{0}}+ \frac{c2^{2p-6}}{2^{0}} + \cdots + \frac{c2^{0}}{2^{0}}\\ \frac{c2^{2p-2}}{2^{1}} + \frac{1+c2^{2p-4}}{2^{1}}+ \frac{c2^{2p-6}}{2^{1}} + \cdots + \frac{c2^{0}}{2^{1}}\\ \frac{c2^{2p-2}}{2^{2}} + \frac{c2^{2p-4}}{2^{2}}+ \frac{1+c2^{2p-6}}{2^{2}} + \cdots + \frac{c2^{0}}{2^{2}}\\ \vdots \\ \vdots \\ \frac{c2^{2p-2}}{2^{p-1}} + \frac{c2^{2p-4}}{2^{p-1}}+ \frac{c2^{2p-6}}{2^{p-1}} + \cdots + \frac{1+c2^{0}}{2^{p-1}}\\ \end{array}\right) = \left( \begin{array}{c} \frac{1 + c\sum _{i=0}^{p-1}2^{2i}}{2^{0}}\\ \frac{1 + c\sum _{i=0}^{p-1}2^{2i}}{2^{1}}\\ \frac{1 + c\sum _{i=0}^{p-1}2^{2i}}{2^{2}}\\ \vdots \\ \vdots \\ \frac{1 + c\sum _{i=0}^{p-1}2^{2i}}{2^{p-1}}\\ \end{array}\right) = \left(1 + c\sum _{i=0}^{p-1}2^{2i}\right)\left( \begin{array}{c} \frac{1}{2^{0}}\\ \frac{1}{2^{1}}\\ \frac{1}{2^{2}}\\ \vdots \\ \vdots \\ \frac{1}{2^{p-1}}\\ \end{array}\right) . \end{aligned}$$

Thus, \(\varvec{a}_{1}\) is an eigenvector of \(\varvec{Cov}_{\varvec{cascade}}\) with eigenvalue \(\lambda _{\max }=1 + c\sum _{i=0}^{p-1}2^{2i}\). It is routine to show that all other eigenvectors of \(\varvec{Cov}_{\varvec{cascade}}\) have an eigenvalue equal to 1; thus \(\lambda _{\max }\) is maximal for any positive value of c. \(\square\)
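A quick numerical confirmation of this result (illustrative; p and c are arbitrary choices) exploits the fact that \(\varvec{Cov}_{\varvec{cascade}} = \varvec{I} + c\,\varvec{vv}'\) with \(v_{i} = 2^{p-i}\), which reproduces the matrix entries above, and checks the claimed eigenpair.

```python
import numpy as np

p, c = 5, 0.3                                    # arbitrary illustrative values
v = 2.0 ** (p - np.arange(1, p + 1))             # v_i = 2^{p-i}
Cov_cascade = np.eye(p) + c * np.outer(v, v)     # entries c*2^{2p-i-j} off-diagonal

a1 = 1.0 / 2.0 ** np.arange(p)                   # (1, 1/2, ..., 1/2^{p-1})
lam = 1 + c * np.sum(2.0 ** (2 * np.arange(p)))  # claimed eigenvalue

print(np.allclose(Cov_cascade @ a1, lam * a1))   # True: a1 is an eigenvector
```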


Cite this article

Dana, J., Davis-Stober, C.P. Rational Foundations of Fast and Frugal Heuristics: The Ecological Rationality of Strategy Selection via Improper Linear Models. Minds & Machines 26, 61–86 (2016). https://doi.org/10.1007/s11023-015-9372-z
