*“When we run a data analysis we can fool ourselves, and then with this conviction fool others.”*

David Donoho

# 1. Introducing the Book

# 2. A Brief Summary of Content

# 3. Special Features

## 3.1. Stability and Robustness

## 3.2. The Central Role of the Projects

## 3.3. Critical Thinking

## 3.4. Multivariate Methods as Exploratory Analyses

## 3.5. Computational Details

## 3.6. All-Levels Textbook

# 4. Prospective Readers

# 5. Wishful Additions

# 6. Contemplations

# Disclosure Statement

# References

A Review of *Veridical Data Science* by Bin Yu and Rebecca L. Barter

Published on Jun 26, 2024

*Editor-in-Chief’s Note: In this inaugural book review for* Harvard Data Science Review, *Yuval Benjamini and Yoav Benjamini provide a succinct summary and insightful reflection on* Veridical Data Science *by Bin Yu and Rebecca Barter (2024). The core premise of* Veridical Data Science *(VDS) is that data science results and findings must be demonstrably trustworthy to offer viable solutions to real-world problems. The book is founded on the PCS principles (predictability, computability, and stability) articulated by Bin Yu and her team in recent years. While predictability and computability are frequently emphasized in data science practice and theory, the book uniquely stresses the importance of stability as an integral and routine aspect of data science practice. The Benjamini duo discuss the potential uses and prospective readers of the book, concluding that its pedagogical excellence, diverse examples, and projects make* Veridical Data Science *a suitable textbook for students of all levels, in addition to being a valuable resource for data scientists in general. They also suggest content for a possible second volume, such as general design principles for stability that go beyond traditional robust designs.*

Most readers will agree with the above quotation, but the question of how to avoid such pitfalls in practice remains murky. In a new textbook, *Veridical Data Science*, Yu and Barter provide a potential roadmap for addressing this question: they argue that for analyses of data to offer *veridical* (truthful) answers to real-world problems, they need to be *demonstrably trustworthy*. The book then sets out to teach both the philosophy and the working routines for demonstrating trustworthiness. At the same time, it teaches the practice of modern data analysis, drawing on Yu’s long experience in teaching the influential PhD course “Applied Statistics and Machine Learning” at the University of California, Berkeley.

The core idea underlying veridical data science is that the demonstration of trustworthiness is achieved by a series of computational experiments. Central to the demonstration is the code (*computability*), which should both recount the analysis and facilitate different manipulations to assure that the results are veridical. The two properties to be demonstrated using the code are *predictability* (akin to replicability or even generalizability), which requires the models to describe new or held-out data well; and *stability*, meaning that the results do not change qualitatively when changes are made to the analysis cycle. Throughout the book, Yu and Barter demonstrate how these three main principles, predictability, computability, and stability (PCS), introduced by Yu and Kumbier (2020), materialize in the various stages of the data analysis life cycle. Students and their teachers will enjoy this lucid exposition of the topics and will gain from the ability to experience the more abstract original ideas through data projects. Working data scientists will gain from being exposed to the veridical approach to data science in a way that the papers on the subject could not have achieved.

We briefly review the content of the book, then discuss its special features and their relevance to prospective readers. We finish by reflecting on the book and its importance.

The first part of *Veridical Data Science* (*VDS*) introduces the concept of veridical data science, discusses the data science project life cycle, and ties them together with demonstrations. The second part deals with exploratory data analysis (EDA), including data description, data visualization, and linear relationships. The nuts and bolts of preparing data for analysis (should we remove or impute our missing values?) are discussed in the context of specific cases. Vivid examples are given of the need to study the data collection process and its relevance to the knowledge domain and the analysis goals. Further techniques covered in this part are principal component analysis and clustering.

The last part is about prediction, covering the least squares method for continuous variables and its extensions by variable transformations, nonlinear relationships, regularization, and cross-validation. Logistic regression is offered for binary responses, and decision trees and random forests are discussed for both types of response variables (neural networks and deep learning are excluded). This part ends with a discussion of how to choose a single fit from the many explored or, alternatively, how to construct an ensemble or prediction intervals that account for uncertainty in preprocessing and modeling choices.
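To make the combination of least squares, regularization, and cross-validation concrete, here is a minimal Python sketch. It is our own illustration on synthetic data, not code from the book (whose examples are in R); the function names `fit_ridge_1d` and `cv_error`, the penalty grid, and the data-generating model are all assumptions made for the example.

```python
import random
import statistics


def fit_ridge_1d(xs, ys, lam):
    """Fit y ~ a + b*x by least squares with an L2 penalty `lam` on the slope b."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    a = my - b * mx
    return a, b


def cv_error(xs, ys, lam, k=5, seed=0):
    """Mean squared prediction error on held-out folds, averaged over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_errors.append(
            statistics.fmean((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
        )
    return statistics.fmean(fold_errors)


# Synthetic data (an assumption for illustration): y = 1 + 0.5 x + noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Hypothetical grid of penalties; pick the one with the smallest held-out error.
penalties = [0.0, 1.0, 10.0, 100.0, 1000.0]
errors = {lam: cv_error(xs, ys, lam) for lam in penalties}
best_lam = min(errors, key=errors.get)
print(best_lam, errors)
```

The point of the sketch is that the penalty is chosen by how well each fit describes data it has not seen, which is the predictability check in miniature.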

In the traditional statistical idea of robustness, the result of an analysis should not vary by much if the assumed model for data generation is perturbed. In the book, this idea is expanded to an all-encompassing *stability*, where the entire process of data analysis is subject to perturbations: not only the data generation model but also the decisions of the analyst, at the preprocessing stage, the analysis, and the presentation of the results. Facing such perturbations, the result of the analysis or the prediction is required not to change (qualitatively) by much.
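A stability check of this kind can be sketched in a few lines of Python. The example below is our own illustration, not the book's procedure or code: it perturbs the data by bootstrap resampling and perturbs one analyst judgment call (dropping versus mean-imputing missing responses), then asks whether the qualitative conclusion, here the sign of a regression slope, survives. The data are synthetic and all function names are hypothetical.

```python
import random
import statistics


def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def drop_missing(rows):
    """Judgment call A: discard rows whose response is missing."""
    return [(x, y) for x, y in rows if y is not None]


def impute_mean(rows):
    """Judgment call B: replace missing responses by the observed mean."""
    fill = statistics.fmean(y for _, y in rows if y is not None)
    return [(x, fill if y is None else y) for x, y in rows]


def stability_check(rows, preprocessors, n_boot=200, seed=0):
    """Range of slopes over bootstrap resamples, for each preprocessing choice."""
    rng = random.Random(seed)
    ranges = {}
    for name, prep in preprocessors.items():
        slopes = []
        for _ in range(n_boot):
            sample = [rng.choice(rows) for _ in rows]  # data perturbation
            xs, ys = zip(*prep(sample))                # judgment-call perturbation
            slopes.append(slope(xs, ys))
        ranges[name] = (min(slopes), max(slopes))
    return ranges


# Synthetic data (an assumption): y = 2x + noise, every tenth response missing.
data_rng = random.Random(1)
rows = []
for i in range(100):
    x = data_rng.uniform(0, 10)
    y = 2.0 * x + data_rng.gauss(0, 1)
    rows.append((x, None if i % 10 == 0 else y))

ranges = stability_check(rows, {"drop": drop_missing, "impute_mean": impute_mean})
sign_is_stable = all(lo > 0 for lo, _ in ranges.values())
print(ranges, sign_is_stable)
```

The two preprocessing choices give numerically different slopes, but the conclusion that the relationship is positive holds across all perturbations, which is the sense of "not changing qualitatively" described above.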

The book features many data projects, case studies, and data-based exercises, ranging from environmental, medical, and social sciences to online shopping, with several of them revisited repeatedly through the stages of data analysis. This allows the reader to develop familiarity with the particular scientific questions and data collection issues of each data set. In that sense, the holistic approach of the PCS process during the project’s life cycle is exposed while a particular technique is taught.

The book is rich with data analysis, coding, and thought exercises and projects of various difficulties and lengths. We found these well designed and potentially useful for teaching the practicalities and philosophy of data analysis. As an example, we enjoyed the juxtaposition of the True/False question “A pre-processed dataset can contain missing values” with the open version “What do you think might be a possible cause of the missing values in the organ donation dataset discussed in the chapter.” As another example, a histogram is displayed of the years since diabetes was discovered in a group of current patients, and the question asks about two strange phenomena in the histogram and their possible causes (a single value of 100, and peaks at multiples of 5).

Principal component analysis and clustering analysis are introduced in the EDA section of the book, in contrast to popular textbooks such as *An Introduction to Statistical Learning* (James et al., 2013) and many machine learning courses that treat them as stand-alone ‘unsupervised learning’ methods. We agree that exploration of modern data sets cannot rely on one- or few-dimensional summaries alone. In high-dimensional data, visualizing and exploring the data requires identifying meaningful projections in variable space, sample space, or both. Variable projections help find outliers or data errors; subgroups may point to hidden variables that should be highlighted in prediction or explicitly included in the design of training, validation, and test sets. Thus, they enrich what lies under the wide umbrella of EDA. In addition, the multivariate methods are subjected to the PCS framework, with explicit methods to check their stability to data perturbations and algorithmic choices. We recommend trying this flow even for standard statistical machine learning courses.

Even though computational code plays an important role in the veridical approach to data science, the book is not code heavy. The R programming language that is used for the computations in the examples and exercises is presented sparingly in the text, usually by merely mentioning the relevant command. The complete underlying R code is presented in a supplementary GitHub repository, where each example in the text is directly linked to its relevant file, and code output is further discussed in the text. The authors seem to have accomplished their goal of writing a book that is as programming language–free as possible, but to get the full benefit from the text through the exercises, programming is needed. To enable that for a wider readership, the authors promise to offer a Python version to accompany the book in the near future (a Python package for veridical data science is available on GitHub, but it does not follow the book’s content).

Although the book is grounded in statistical machine learning, most of the results are introduced with only the required mathematical notations to describe concepts and methodologies, not to prove their properties. Basic college calculus as remembered after years of minimal use should suffice. The book is well written pedagogically, and the language is easy to absorb. This makes the book accessible to a wide and diverse readership.

*Veridical Data Science* can be assigned as a textbook for upper-level undergraduate or early graduate courses combining empirical work and an introduction to data science and statistical unsupervised and predictive modeling. We find it particularly compelling for the former, as few textbooks exist that teach the practice of running applied data science projects. Such a course, often named ‘Applied Data Science’ or ‘Data Science Laboratory,’ can complete a natural sequence that is found in most statistics graduate programs alongside applied and theoretical statistics and a theoretical machine learning course.

This is also an excellent book for a single course on data science, to support the education of students in other areas of science, whether physical, biological, or any other area where data analysis is likely to be used in after-university life. In a report of the data science steering committee for Israeli research universities, available online, it was found that among the researchers claiming to be involved in ‘data science,’ only a quarter are ‘data scientists’ in the sense that they devote more than 85% of their time to core data science activities. Therefore, undergraduate programs in engineering and management and students majoring in the sciences can all benefit from a course based on this book.

Finally, practitioners who devote most of their time to solving problems with data can also profit from a fresh look at the principles being offered and demonstrated by the book. The first part of the book, and the examples thereafter, can serve as an excellent thinking template for setting up veridical practices within a research group. As mathematics is used mainly as a language to describe concepts and methodologies, and not for proving their properties, the required level should be no barrier to practitioners who get their hands dirty working with data.

Formal thinking about uncertainty under different design choices is one of the great challenges confronting statisticians and data scientists generally. Modern approaches to studying this uncertainty include conformal inference and quantifications of the effect of dependence or multi-environment sampling. The study of causality and interpretation relies heavily on such inferential tools. The book does take first steps toward inference under the PCS framework, but full statistical inference incorporating all sources of uncertainty is still needed. This was noted by the authors, who add that such “*PCS inference*” ideas are being developed in Yu’s lab.

With such a framework for inference in place, we wonder about the design principles for stable methods. Classical robust methods, somewhat underrepresented in the book, were designed to address this goal at the level of deviations from the assumed data-generating models. The natural next step is to design methods that are demonstrably *stable* for general perturbations of the analysis pipeline. We therefore look forward to a second volume about veridical data science in the not-too-distant future.

In our years of teaching statistics/data science project–based applied courses, we never had a textbook that could accompany a major part of such a course. We find that *VDS* successfully fills this gap. Reading it brings to mind Tukey’s *Exploratory Data Analysis* (1977), and not only because the second part of *VDS* carries this title. In the era of hand-held computation and manual plotting, Tukey’s red book taught the techniques of exploring data, low and high. From the selection of graph paper and pen to box-and-whiskers and residual plots, it showed how they can bring to light hidden patterns in real data. (Indeed, Yoav used this book in the first years of his teaching career.) *Veridical Data Science* also deals with techniques, low and high, but in the modern era: from principles of writing data analysis code (“never edit the original data file”) and organizing the computing framework to enable teamwork, to principal component analysis.

But beyond the similarity in teaching techniques for data analysis, *Veridical Data Science* delves into the meaning and implications of the analysis for the problem being faced. In the authors’ words: “conducting data analysis while making human judgement-calls and using domain knowledge … in order to solve a real-world domain problem.”

In summary, *Veridical Data Science* teaches practical methods to study data and practical methods to empirically investigate the validity of their results for the problem being faced. It can be used successfully in class, but because it is more than a textbook, anyone analyzing data will benefit from exposure to the strategy for veridical predictive data science developed over the years in Professor Bin Yu’s lab.

Yuval served as a graduate student instructor for early versions of the Applied Statistics and Machine Learning course while a student at UC Berkeley.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). *An introduction to statistical learning*. Springer.

Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley.

Yu, B., & Kumbier, K. (2020). Veridical data science. *PNAS*, *117*(8), 3920–3929. https://doi.org/10.1073/pnas.1901326117

Yu, B., & Barter, R. L. (2024). *Veridical data science: The practice of responsible data analysis and decision making*. The MIT Press.

©2024 Yuval Benjamini and Yoav Benjamini. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.