Breiman describes two contrasting approaches to statistical modeling: data modeling and algorithmic modeling. Data modeling chooses a simple (often linear) model based on intuition about the data-generating mechanism. The emphasis is on explanation and model interpretability; validation of model fit is of secondary importance and, when performed, typically relies on goodness-of-fit tests that are often not robust. Algorithmic modeling chooses the model with the highest predictive accuracy on validation data, with no consideration for the model's explanatory power.
Arguing that data models can’t solve novel real-world problems arising from massive data sets, Breiman proposes that we move away from data modeling, which can lead to “misleading conclusions,” and embrace algorithmic modeling, which is better suited to analyzing complex data. According to Breiman, a highly complex, accurate model that can’t be fully explained is more valuable than a simple linear model that we completely understand but that has no predictive accuracy.
Shmueli builds on Breiman’s article by differentiating between statistical modeling for causal explanation and statistical modeling for prediction. He explicitly uses the term modeling to “highlight the entire process involved, from goal definition, study design, and data collection to scientific use” (p. 290). Statistical modeling for causal explanation is referred to as explanatory modeling. Explanatory modeling is used to test causal theories and hypotheses, and as such it is common in the social sciences. Regression models are the most frequently used; they are almost always associational, with the causal interpretation supplied by the theory itself. Statistical modeling for prediction is called predictive modeling. Predictive modeling applies a statistical model or data mining algorithm to data for the purpose of predicting new or future observations. [^1]
Shmueli also mentions a third type of modeling, descriptive modeling, to distinguish it from the other two. Descriptive modeling is aimed at summarizing or representing the data structure in a compact manner and is mainly used by statisticians. Unlike explanatory modeling, descriptive modeling doesn’t rely on causal theory and as such focuses on the measurable level rather than on the construct level. Unlike predictive modeling, it is not aimed at prediction. Fitting a regression model can be descriptive if the model is used for capturing the association between the dependent and independent variables rather than for causal inference or for prediction.
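The contrast between these modeling goals can be sketched with a small, entirely invented example (not from either article): the same ordinary-least-squares fit serves a descriptive or explanatory purpose when we report and interpret its coefficients, and a predictive purpose when we judge it solely on held-out observations. All data, variable names, and numbers below are synthetic.

```python
import random

random.seed(0)

def ols(xs, ys):
    """One-predictor ordinary least squares: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Synthetic data: y depends linearly on x plus Gaussian noise.
x = [random.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# Descriptive/explanatory use: fit on ALL the data and report the
# coefficients -- the association between x and y is itself the output
# (a causal reading would have to come from theory, not from the fit).
a, b = ols(x, y)
print(f"association: y ~ {a:.2f} + {b:.2f} * x")

# Predictive use: hold out data and judge the model ONLY by how well it
# predicts new observations, regardless of interpretability.
train_x, test_x = x[:150], x[150:]
train_y, test_y = y[:150], y[150:]
a2, b2 = ols(train_x, train_y)
mse = sum((yi - (a2 + b2 * xi)) ** 2
          for xi, yi in zip(test_x, test_y)) / len(test_x)
print(f"holdout mean squared error: {mse:.2f}")
```

The point of the sketch is that nothing in the fitted equation changes across the three uses; what changes is the question asked of it, which is exactly Shmueli's argument that the modeling goal, not the statistical technique, defines the type of modeling.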
Shmueli writes that although explanatory modeling is commonly used for theory building and testing, predictive modeling is nearly absent in many scientific fields as a tool for developing theory. He asserts that prediction should be established as a necessary scientific endeavor beyond its practical utility, for the purpose of developing and testing theories. For example, large and rich data sets can be too complex for their patterns and relationships to be theorized in advance; predictive modeling can help uncover potential new causal mechanisms and lead to the generation of new hypotheses.