
Fostering Better Coding Practices for Data Scientists

Published on Jul 27, 2023

Abstract

Many data science students and practitioners do not see the value in making time to learn and adopt good coding practices as long as the code ‘works.’ However, code standards are an important part of modern data science practice, and they play an essential role in the development of data acumen. Good coding practices lead to more reliable code and save more time than they cost, making them important even for beginners. We believe that principled coding is vital for quality data science practice. To effectively instill these practices within academic programs, instructors and programs need to begin establishing these practices early, reinforce them often, and hold themselves to a higher standard while guiding students. We describe key aspects of good coding practices for data science, illustrating with examples in R and in Python, though similar standards are applicable to other software environments. Practical coding guidelines are organized into a top 10 list.

Keywords: data acumen, data science, data science practice, data science education, code quality, code style


Media Summary

Many data science students and practitioners are reluctant to adopt good coding practices as long as the code ‘works.’ Yet meticulous attention to detail is an important characteristic of a data scientist.

Code standards are an important part of modern data science, and they play an essential role in ensuring the quality of data science in research and in the workforce. Responsible coding practices lead to more reliable code and save more time than they cost, making them important even for beginners. We believe that principled coding is vital for quality data science practice.

To effectively instill these habits of mind within academic programs, instructors and programs need to begin establishing these practices early, to reinforce them often, and to hold themselves to a higher standard while guiding students. We describe key aspects of good coding for data science, illustrating them with examples and motivation. Practical coding guidelines are organized into a top 10 list.


1. Introduction

Coding is an increasingly important part of statistical analyses (Hardin et al., 2021; Nolan & Temple Lang, 2010). The goal of code is not just to solve an immediate analysis problem but also to establish reusable workflows and to communicate. As projects and analyses become more sophisticated, it is important that structures and expectations be set to facilitate data science as a team sport in a sustainable and reproducible way (Horton et al., 2022).

Consensus reports (National Academies of Sciences, Engineering, and Medicine [NASEM], 2018) have highlighted the importance of workflow and reproducibility as a component of data science practice:

Modern data science has at its core the creation of workflows—pipelines of processes that combine simpler tools to solve larger tasks. Documenting, incrementally improving, sharing, and generalizing such workflows are an important part of data science practice owing to the team nature of data science and broader significance of scientific reproducibility and replicability. Documenting and sharing workflows enable others to understand how data have been used and refined and what steps were taken in an analysis process. This can increase the confidence in results and improve trust in the process as well as enable reuse of analyses or results in a meaningful way. (pp. 2–12)

Adhering to established coding standards is an important step in fostering effective analyses and an important component of data acumen. Unfortunately, attention to these issues has not been central to many curricula in statistics and data science.

1.1. Why Bother?

Many data science students, and some practitioners, do not make time to learn and adopt good coding practices. As long as the code ‘works,’ they are satisfied and ready to move on. But how do they know that ‘it works’? For data analysts, these issues are an important part of data science practice because they affect the bottom line:

Good coding practices lead to more reliable and maintainable code.

It is easier to notice and fix errors in well-written code. And it is less likely that the errors occur in the first place if the authors are using good practices. Trisovic et al. (2022) carried out a large-scale study on code quality and found that 74% of R files did not run successfully. After incorporating some automated code cleaning targeting “some of the most common execution errors,” 56% still failed.

Even when the code is correct, following good coding practices makes the code easier to read and understand, saving time and promoting good communication among team members.

In a blog post, Lyman (2021) wrote:

The value of high-quality code can be difficult to communicate. Some managers see it as a boondoggle, an expensive hobby for overly fastidious programmers, since investing in code quality can slow development over the short term and doesn’t appear to alter the user experience. But nothing could be further from the truth.

Learning and consistently using good coding practices takes some effort and some attention. But in the end, we agree that they are likely to save far more time than they consume (Ball et al., 2022). Well-written code is more likely to be correct, saving the time that would be required to fix it later, and easier to maintain, saving effort when it is necessary to modify or adapt the code in the future. Collaborations with other team members are likely to be more positive if the quality of individual contributions is higher.

We are convinced that

Good coding practice is important, even for beginners.

It is easier and more efficient to learn good coding practices as one learns to program than to unlearn bad habits later. This makes it especially important that the code beginners see meets the highest standards for coding practice. We cannot expect beginners to mimic these practices perfectly from the start, and we recommend focusing student attention (and feedback) on just a few key coding practices early on. But if they do not have a good model to emulate, we are impeding their progress unnecessarily. As an additional benefit, modeling good coding practices will make it easier for students (and others) to learn not only good coding technique but also the concepts and applications that the code is illustrating.

In this article, we will motivate the importance of principled coding, illustrate key aspects of good coding practices, and suggest ways that these practices can be included in the data science and statistics curriculum.

1.2. Prior Work

We acknowledge that much of what we discuss is not novel, but it is nonetheless important (and, we argue, underappreciated and underemphasized).

Many calls for better coding practices and enumerations of such practices exist. Computer science curricula have long emphasized these practices beginning in introductory programming courses and continuing throughout the curriculum (Borstler et al., 2017; Keuning et al., 2017), especially in courses like software engineering or in capstone projects courses (e.g., Berkeley’s CS169). Stegeman et al. (2014, 2016) have described rubrics and assessment for code quality in programming courses.

The importance of good coding practices is also recognized in industry (Ghani, 2022; “Google Style Guide,” 2019) and across the sciences (Aruliah et al., 2012; Filazzola & Lortie, 2022; Wilson et al., 2017) and social sciences (Gentzkow & Shapiro, 2022). Dogucu and Çetinkaya-Rundel (2022) motivate the importance of code quality, style guides, file organization, and related topics. Related work by Carey and Papin (2018) that describes rules for new programmers has relevance for teaching data analysis. Nolan and Stoudt (2021) offer a ‘Dirty Dozen’ set of helpful code recommendations, and Abouzekry (2012) provides 10 tips for better coding.

Code quality has also been the subject of prior research. Schulte (2008) introduced a block model to help study comprehension of program components (atoms, blocks, relations, and macrostructure). Keuning et al. (2017, 2019) have explored other aspects of teaching code quality.

While the particular coding practices enumerated vary some by author, programming language, and application area, the overall message is clear: Good coding practices are important across a wide range of contexts to ensure that people, especially those working in teams, are productive and that their work is reliable, maintainable, and reproducible. Furthermore, there is broad agreement about the basic contours of what constitutes good coding practice. Unfortunately, the abundance of such calls indicates that practice continues to fall short of principle.

1.3. A Motivating Example

An April 2021 Twitter post (Meyer, 2021) commented:

It is really painful when taking a graduate level data science course and the instructor’s code is considerably below any acceptable standard in the real world. Here is some real life code from a demo offered for the current homework…

We concur that this code, while short, is exceptionally hard to read. There are many ways that this could be improved, some of which are demonstrated below.

# R
n_test <- 8
data <- readr::read_csv(fname)   # code comment 1
train <- head(data, -n_test)
test <- tail(data, n_test)
train_ts <- ts(
  dplyr::select(train, 2:5),     # code comment 2
  start = c(2014, 1),
  freq = 52)

(1) Alternatively we could load the entire readr package with library(readr) and avoid the ::. For demonstration code, the explicit package reference can help the reader know which functions come from which packages. When using packages that are already familiar to the reader or when using many functions from the same package, loading with library() is more appropriate. We will demonstrate both styles in the various examples presented here.

(2) Ideally we would use column names rather than numerical indices here. dplyr::select() would be more useful in that case, especially in conjunction with functions like matches(), contains(), starts_with(), and so on.

The suggested revisions add white space to improve readability and clarify the type of subsetting that is happening by taking advantage of the head() and tail() functions. In addition to improved formatting, we specify that the select() function is coming from the dplyr package.1 We note that there are still aspects of the code that are brittle (e.g., assumptions regarding the ordering of the five variables in the data set). The code snippet in the tweet does not provide enough context to appropriately address these. Using the native pipe in R, we could rearrange the nested tasks, making the order of operations more transparent. We will see examples of this shortly, and of a similar approach called method chaining (see Augspurger, 2016), in Python.

The revised code is easier to read and understand (and maintain) than the original.

We believe that examples of this kind are all too common. This particular example is compelling because it shows that students notice the quality of the code instructors present and motivates why academics need to live up to industry standards.

2. Establishing Good Coding Practices

2.1. The Four Cs

As mentioned in the introduction, many lists of good coding practices have been published. These lists can provide useful guidance as one progresses—or, in the case of an instructor, leads students in their progression—toward better coding practices.

In addition to providing students with specific coding guidelines like our top 10 list below, we think it is also important that students understand the higher level goals that the specific guidelines are intended to support. These higher level goals take into account that computer code simultaneously communicates both to humans and to the computer and provide a framework for establishing a set of specific coding practices. We describe these as the four Cs for good code:

  • Correctness: It is important that the code be correct so that the computer does what is intended. This, in itself, is not profound. But we emphasize two things about correctness: First, correctness is a necessary but insufficient metric for good code. Second, the other goals support and promote correctness.

  • Clarity: It is also important that the code be clear, so that humans reading and writing the code can tell what it is intended to do, and easily make modifications as necessary. (This advice applies both to other humans and to the same human at some later date.)

  • Containment: It is helpful if the code is appropriately contained, to keep separate things that should be separate and together things that should be together. Vartanian (2022) refers to this idea as “low coupling, high cohesion.” Other authors refer to this as modularization. Proper containment includes things like preferring a data frame over several individual vectors, using functions to contain reusable code, and keeping code used across files or projects in a module or package. (A brief sketch follows this list.)

  • Consistency: Finally, it is useful if code exhibits internal consistency of style, naming conventions, and other coding practices.
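
As a small illustration of containment, consider the following sketch (with made-up values): a single data frame keeps related measurements together, whereas separate parallel vectors can silently fall out of sync.

# R
ages    <- c(23, 35, 41)      # separate vectors: reordering or subsetting one
weights <- c(61, 78, 82)      # can break its pairing with the other

subjects <- data.frame(       # better containment: each row keeps its values together
  age    = c(23, 35, 41),
  weight = c(61, 78, 82)
)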

The specific guidelines outlined below serve as concrete advice for developing practices that promote creating code that satisfies the four Cs.

2.2. A Progressive Approach

Before revealing our top 10 list, we want to emphasize the importance of taking a progressive approach, both for oneself and for students. Developing good coding practices takes time and attention. Any list of coding guidelines can exceed the available cognitive resources to take them on. As is true for any behavioral intervention, change takes time and effort.

This advice about developing a growth mindset and making slow, steady change (Dana Center, 2021) applies to the authors as well. We are often dismayed and chagrined at the quality of the code we provided students 5 or 10 years ago, and we are constantly updating our own coding examples to improve them.

2.3. Top 10 List

While the four Cs establish important goals for code, they do little to specify how the goals might be achieved. The 10 guidelines below provide additional specificity and are ordered roughly in the progression in which we encourage our students to develop them. By the completion of an undergraduate data science program, we would expect all students to be comfortable practicing all of the items on this list, which we are confident will also serve data science practitioners well throughout their careers.

2.3.1. Choose Good Names

Wilson et al. (2017), Lyman (2021), and Bryan (2015) note the importance of giving variables and functions meaningful names as a way to clarify the code and make it easier to read. But naming things can be difficult (Fowler, 2009), especially early in a project (when the scope may not yet be clear) and for novice coders. Having a set of general purpose guidelines to narrow the choices and using these across multiple projects can assist greatly in the selection of names.

  1. The length of names should be proportional to their scope.

    The more distance (measured in terms of number of people, human time, and lines of code) between definition and use, the more important it is that a name communicates clearly. The use of a single-character variable name may be acceptable as an index variable of limited scope or as a placeholder argument for a simple function. But even in these cases, a name that reflects what the indexing or placeholder represents is often preferred. Abbreviations are a two-edged sword. Used consistently by a community or team that is familiar with their meanings, they can help reduce the length of names. They can also become inscrutable to those less familiar with the project. Keeping a digest of abbreviations used can be very helpful.

  2. Use capitalization consistently.

    There are no absolute standards here, but adopting a strong local convention can help avoid errors. It is also wise to avoid using two names that differ only in their case. Such names are easily swapped for one another and are difficult to read aloud. Students should be introduced to naming schemes such as camelCase, PascalCase, and snake_case, but encouraged to stick with one of these as much as possible.2 We recommend avoiding the dot (.) as a delimiter in R since it serves another purpose in R’s S3 generic system (and in other programming languages as well).

  3. Avoid nondescript names.

    The ubiquitous d or df as the name for a data frame is a common example of a nondescript name. In a data analysis situation, the data and their provenance are important and the name of the variable containing the data being analyzed should communicate something about the data. Even in ‘generic’ examples, descriptive names can be helpful.

    # original R
    d <- read.csv("study-2023.csv")
    d2 <- d[ d$age >= 18, ]
    # improved R 
    AllSubjects2023 <- readr::read_csv("study-2023.csv") # code comment 1
    Adults2023 <- AllSubjects2023 |> filter(age >= 18)   # code comment 2

    (1) For most files of modest size, the use of readr::read_csv() rather than read.csv() will make very little difference although the format of the object returned may be slightly different. But read_csv() is faster, more consistent across operating systems, and a bit more predictable.

    (2) The pipe (|>) passes its left-hand side as the first argument of the function on the right-hand side.

    In situations where a variable is intended for frequent reassignment (in a loop, for example) or for names of formal arguments in functions, the considerations may be a little different. In these cases the name may say more about the role or hint at the intended use or data type.

External resources, like files, benefit from consistent naming patterns as well.

  1. Use delimiters to make parsing easier for humans and computers.

    When one adopts mixed case, underscores, or some other delimiter, consistent use of delimiters makes it easier for people to remember the name and opens the possibility of algorithmic processing based on names (a short parsing sketch follows this list). For file names, the use of underscores and hyphens to indicate two levels of chunking can be very useful: names such as my-file_2023-12-25.txt and some-other-file_2023-01-02.txt make it easy to separate the dates from the slugs (the unique identifier for the file), even when the slugs have different numbers of components. Requiring students to follow file-naming conventions (which likely include some identifier for the student) for files that they submit can be a useful way to encourage their use of a file-naming system.

  2. Choose file-naming conventions that sort naturally.

    For dates in file names, we recommend adopting the ISO 8601 standard (ISO, 2019) (YYYY-MM-DD, for example). This format sorts naturally without any special treatment of dates. Padding numbers with 0s so that all numbers use the same number of digits also serves this purpose, and numbers can be prepended to file names to force a particular sorting. (Leaving gaps for future insertions can be helpful as well.)
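
Using the file names from the first item above, the following brief sketch (base R only; not from the original article) shows how consistent delimiters support algorithmic processing by splitting each name into its slug and date.

# R
files <- c("my-file_2023-12-25.txt", "some-other-file_2023-01-02.txt")
parts <- strsplit(tools::file_path_sans_ext(files), "_")          # drop extension, split on "_"
slugs <- vapply(parts, function(p) p[1], character(1))            # "my-file", "some-other-file"
dates <- as.Date(vapply(parts, function(p) p[2], character(1)))   # ISO 8601 dates parse directly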

2.3.2. Follow a Style Guide Consistently

Good coding style includes choosing good names for files and variables, but also includes things like consistent use of white space and indentation; effective use of comments (Spertus, 2021); the choice of data types used for various purposes; and the particular ‘coding dialect’ used (Abouzekry, 2012; Wickham, 2022). Consistent use of white space and indentation can be automated in many modern editors and IDEs (integrated development environments), and we recommend teaching students to use such tools early in their statistics or data science programs. Alas, most of the remaining elements of style require continual human attention.

We recommend adopting a commonly used style guide (like the tidyverse style guide (Wickham, 2022) or Google’s slightly modified version (“Google Style Guide,” 2019) for R, or the PEP 8 style guide for Python (Rossum et al., 2023), perhaps with some local amendments. If a program can adopt a consistent style guide across its courses, that provides additional advantages and makes things simpler for students. In any case, adopting and following a style guide is good both for the improved readability of the resulting code and for practice in following a style guide.

In R, packages like styler (Müller & Walthert, 2022) and formatR (Xie, 2022) can assist with style consistency. More generally, modern editors and IDEs include support for various linters (see, for example, VanTol, 2023, for a discussion of linters for Python) that can detect violations of code style, inconsistencies, and other coding issues, sometimes suggesting improvements as code is being written.
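
As a brief sketch of what such tools offer (assuming the styler package is installed; the script name passed to the linter is hypothetical):

# R
library(styler)
messy <- c("x<-c( 1,2 ,3 )", "mean( x )")
style_text(messy)             # prints the restyled code: x <- c(1, 2, 3) and mean(x)
# A linter can flag style and potential correctness issues in a saved script:
# lintr::lint("analysis.R")   # hypothetical file name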

2.3.3. Create Documents Using Tools That Support Reproducible Workflows

Students familiar with copying and pasting output from one software application (e.g., Excel) into another (e.g., Word or PowerPoint) should be encouraged early in their careers to take advantage of other workflows like R Markdown (Baumer et al., 2014), Quarto (Allaire et al., 2022), and Jupyter Notebook (Granger & Pérez, 2021). These tools allow students to generate multiple document formats (including PDF, Word, and HTML) from a single source document that contains code in multiple languages (R, Python, Julia, Observable JavaScript, etc.), the results of executing that code, and formatted text discussing the process and results.

These tools provide a convenient way for students to prepare assignments (and for instructors to prepare learning materials), but more importantly, they train students to adopt reproducible workflows. Sandve et al. (2013) offer a set of rules for reproducible computational research. The first of these is that for each result, it is important to keep track of how it was produced. A workflow where output or graphics are copied from one place to another or where manual data wrangling steps are undertaken outside the documented workflow (Sandve et al.’s rule 5) obfuscates this important provenance.

Fostering reproducible analyses and workflow early on is valuable for students, and many instructors who use R emphasize the use of R Markdown (or more recently, Quarto) starting very early in their introductory courses (Baumer et al., 2014; Horton et al., 2022). But some practitioners unfortunately still rely on R scripts that produce auxiliary files that are then included elsewhere for reporting, perhaps via unautomated copy-and-paste steps. The use of modern document formats reinforces proper encapsulation and reproducibility since the documents must be self-contained and include all information needed to perform the analysis presented. Assignments that ask students to repeat an analysis with an augmented data set or slightly modified task (both common occurrences in practice), or to undertake workflows that require separate steps (complicated or time-consuming wrangling followed by later analysis of time-stamped data sets) can reinforce the power of these tools.

More advanced students can be taught about parameter-driven documents (Mahoney, 2022) and automated report generation (Beck, 2020) and can learn to create a wider variety of document types, including web pages, presentations, and dashboards.
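
As a sketch of what parameter-driven reporting can look like (the template and parameter names here are hypothetical), a single R Markdown template can be rendered repeatedly with different inputs:

# R
library(rmarkdown)
for (yr in 2020:2023) {
  render(
    "annual-report.Rmd",                  # template whose YAML header declares a `year` parameter
    params = list(year = yr),             # available inside the document as params$year
    output_file = paste0("annual-report-", yr, ".html")
  )
}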

2.3.4. Select a Coherent, Minimal, Yet Powerful Toolkit

In most languages, there are many ways to perform some common tasks. Even when several are equally good on their own merits, selecting a toolkit consisting of functions that work well together improves readability and reduces errors that arise from failing to switch from one standard to another (Çetinkaya-Rundel et al., 2022; Pruim & Horton, 2020). The tidyverse (Wickham et al., 2019) provides one example of “an opinionated collection of R packages designed for data science” (Tidyverse, n.d.), which share an underlying philosophy and grammar. As such, it provides a good model for the kind of attention that is needed to produce such a toolkit.

It is especially important that instructors make wise choices about the tool kit that they present to students. An ideal tool kit should be

  1. Coherent

    Elements of the tool kit that perform similar tasks should have similar structure. This makes the resulting code easier to read and makes it easier for students to recall (or even anticipate) code structures. Generally speaking, new code written by users should aim to mimic the style of and interoperate well with the main elements of the tool kit.

  2. Minimal

    In most cases, less is more. Learning a few things well, and learning how to combine them creatively to perform a wide range of tasks is easier and more useful than having a large tool kit of speciality functions that each do a specific, narrowly defined task. “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away” (de Saint-Exupéry, 1984, p. 39).

  3. Powerful

    Fighting against the desire to have the tool kit be small and coherent is the need to accomplish the tasks at hand (and future tasks as they arise). We want our tool kit to be as simple as possible, but no simpler.

Each team (or instructor) will need to balance these competing interests. We advocate against two extremes that are implicit (or sometimes even explicit) in many coding examples we see: avoiding the use of standard, well-supported packages (in favor of ‘base’ language constructs), and using code from myriad packages, often with competing styles and ‘mental models.’ A well-chosen, consistent tool kit both demonstrates good coding practice and makes it easier for students to learn the material in a given course. Building a toolkit around a focal package or collection of packages (like tidyverse or tidymodels in R, or numpy, pandas, and scikit-learn in Python; Harris et al., 2020; McKinney et al., 2010; Pedregosa et al., 2011) can provide a useful filter for selecting components of a toolkit.

A fortunate recent development is the converging of ideas in R and Python. This makes it possible to choose a bilingual tool kit with some level of coherence. Consider the following examples of data wrangling and plotting in R and in Python, motivated by Hilary Parker’s blog post (Parker, 2013) investigating trends in her name’s popularity over time.

#R 
library(dplyr) 
library(ggplot2) 
library(babynames) # loads babynames data
babynames |>
   filter(name %in% c("Hilary", "Hillary", "Hilarie", "Hillarie")) |>
   group_by(year, sex) |>
   summarise(prop = sum(prop)) |>
   ggplot() +
   geom_line(aes(x = year, y = prop, color = sex)) +
   labs(title = "Prevalence of Hilary et al.")
# Python 
import pandas as pd 
import altair as alt 
from pyreadr import read_r, download_file 
url = "https://github.com/hadley/babynames/raw/master/data/babynames.rda" 
babynames = read_r(download_file(url, "./babynames.rda"))["babynames"]   # code comment 1
(
  babynames
  .query("name in [Hilary, Hillary, Hilarie, Hillarie]")
  .groupby(by = ["year", "sex"], as_index = False)   # code comment 2
  .aggregate({"prop": "sum"})    
  .pipe(alt.Chart, title = "Prevalence of Hilary et al.")    
  .mark_line()   
  .encode(       
    x = alt.X("year").axis(format = "4d"),     
    y = "prop:Q",   
    color = "name:N") 
)

(1) Alternatively, the reticulate package provides a way to communicate data back and forth between R and Python. Using reticulate, we could access the data (converted from an R data frame to a pandas data frame) with babynames = r.babynames.

(2) Rossum et al. (2023) suggests (without much explanation) not putting a space around the assignment = in argument lists for functions. We choose to include the spaces for better legibility, especially for projected demonstration code, but we make our students aware of the more common styling in Python.

While some language differences are unavoidable (e.g., method chaining in Python in place of the pipe (|>) in R, strings in Python where nonstandard evaluation can be used to avoid them in R), and the differences in naming (query() vs. filter(), aggregate() vs. summarise(), etc.) are unfortunate, the basic approach supported by dplyr and pandas remains quite similar (and based on ideas from SQL). Similarly, both plotting systems are based on a grammar of graphics approach.3

2.3.5. Don’t Repeat Yourself (DRY)

Overuse of copy-paste-modify can affect code writing as well as document creation. This is usually an indication of a bad workflow or poor encapsulation. Frequent copying and pasting of code may indicate the need for a function that encapsulates the repeated code into one location, identifies (via its arguments) what changes, and simplifies code maintenance, since changes only need to be made in one location (McConnell, 2004). The principle of ‘Don’t Repeat Yourself’ has been a mainstay in computer programming at least since Hunt and Thomas (1999) and serves as a good guideline for statisticians and data scientists as well.

This is not meant to suggest that scaffolded assignments, where students are presented with incomplete code to complete, are not useful. We recognize that many students (and practitioners) often seek to solve problems by finding working code that does a similar task and modifying it to obtain the desired result. But students who rely on this as their sole mechanism for producing code are likely failing to learn important concepts and structures at a deep level and developing coding habits that will not serve them (or their colleagues) well. For beginners, this may require the instructor to modify the tool kit being used or to provide students with some (but not too many) additional functions (Pruim & Horton, 2020). Before long, however, students should be taught to write their own simple functions. The authors often do this by starting from existing code that performs a single task and asking students to generalize it into a reusable function. We illustrate this using the babynames example from Section 2.3.4.

This code can be generalized in several ways. Here we create a function that allows us to select several names and a range of years.4

# R
babynames_plot <-
  function(names, years = c(1880, 2017), data = babynames) {
    data |>
      filter(year >= years[1], year <= years[2], name %in% names) |>
      group_by(year, sex, name) |>
      summarise(prop = sum(prop)) |>
      ggplot() +
      geom_line(aes(x = year, y = prop, color = name)) +
      facet_grid(sex ~ ., scales = "free") +
      labs(
        title =
          paste0("Prevalence of ", paste(names, collapse = ", ")))
  }
babynames_plot(c("John", "Jon", "Jonathan"), years = c(1950, 2000))
babynames_plot(c("John", "James", "Mary"), years = c(1920, 2020))

A similar approach can be taken in Python:

# Python
import pandas as pd
import altair as alt
def babynames_plot(names, years = [1880, 2017], data = babynames):
    return (
      data
      .query("name in @names")
      .query("year >= @years[0] and year <= @years[1]")
      .pipe(alt.Chart, title = "Prevalence of " + ", ".join(names))
      .mark_line()
      .encode(
         x = alt.X("year").axis(format="4d"),
         y = "prop:Q",
         color = "name:N")
      .properties(width = 800, height = 250)
      .facet('sex', columns = 1)
      .resolve_scale(y = 'independent')
    )
  
babynames_plot(["John", "Jon", "Jonathan"], years = [1920, 2020])

The ability to write functions also opens up the possibility to use many general purpose functions that take functions as arguments (see Wickham, 2019, chap. 9, and Section 2.3.6).

Functions (and data sets) that are useful in multiple projects can be saved in an R or Python package. Users of R and Python will already be familiar with packages, since most of the code that they use comes from packages. But it is also important to learn how to create packages, not only to share widely via PyPI, CRAN, Bioconductor, or GitHub, but also for use within ‘local’ projects.

Wickham (2015) noted that “R packages are the fundamental units of reproducible R code.” Packages provide standard mechanisms for documentation and testing. We have found that code encapsulated and made accessible in a package can easily and robustly be included in files within and across projects. While there are some details to be learned, the RStudio IDE and the devtools package (Wickham et al., 2022) take care of many of these details for R users, greatly simplifying the process of creating a package. These tools can also automate feedback: for example, usethis::use_github_action() can configure a GitHub Action that automatically runs devtools::check().
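
As a rough sketch of this workflow (the package and file names are hypothetical), the basic steps look something like the following:

# R
library(usethis)
library(devtools)

create_package("~/projects/babyplots")   # generate and open a package skeleton
use_r("babynames_plot")                  # create R/babynames_plot.R to hold the function
use_test("babynames_plot")               # create a matching testthat test file
document()                               # build help files from roxygen2 comments
check()                                  # run R CMD check on the package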

Students who learn to create packages also gain a firmer understanding of how packages work, which makes them better users of packages created by others.

By the end of their programs, data science majors should be facile and comfortable writing functions and ideally have had experience creating and maintaining one or more packages.

2.3.6. Take Advantage of a Functional Programming Style

Even when they are not technically functional programming languages, many languages—including R and Python (David-Williams, 2023)—allow a functional programming style, and this programming style has become central to many data science workflows.

What is a functional style and why is it important for data science? Hadley Wickham (2019) offers this description:

It’s hard to describe exactly what a functional style is, but generally I think it means decomposing a big problem into smaller pieces, then solving each piece with a function or combination of functions. When using a functional style, you strive to decompose components of the problem into isolated functions that operate independently. Each function taken by itself is simple and straightforward to understand; complexity is handled by composing functions in various ways.

And this enumeration of benefits:

Recently, functional techniques have experienced a surge in interest because they can produce efficient and elegant solutions to many modern problems. A functional style tends to create functions that can easily be analysed in isolation (i.e., using only local information), and hence is often much easier to automatically optimise or parallelise.

In R, adopting a functional style also makes it easier to write more efficient code by taking advantage of many functions written in this style that bring about a significant boost in performance for many common tasks. For loops, for example, should mostly be avoided in R and replaced with equivalent (but more efficient) functional programming structures. For this reason Burns (2011) describes the use of for loops as “speaking R with a C accent – a strong accent” (p. 17).

The key concept of iterating over an object (a list, a vector, etc.) is very important, but it is not synonymous with writing for loops. Both R and Python provide other ways to do this that are more efficient and more elegant than using for loops. Consider the simple task of summing the first 100 integers. A student with previous programming experience was likely taught to approach this task like this:

# R
s <- 0 
for (i in 1:100)  {
   s <- s + i
} 
s

[1] 5050

A much more efficient and R-like way to do this is simply:

# R 
sum(1:100)

[1] 5050

R also includes many functions that are ‘vectorized’ so that the following two lines produce equivalent vectors y containing the logarithms of each value contained in x (but the second is more efficient, clearer, and easier to embed in data transformation workflows).

# R 
# poor use of for loop 
y <- numeric(length(x)); for (i in 1:length(x)) { y[i] <- log(x[i]) } 
# better 
y <- log(x)

The Vectorize() function makes it easy for users to create their own vectorized functions that work in the same way as log() and many other functions. Additionally, R includes many functionals (see, for example, Wickham, 2019, chap. 9) in the ‘apply’ family, both in base R and in packages like purrr (Henry & Wickham, 2020). These functionals provide fine control over how a function is applied to each item in a list-like structure (or parallel lists) and how the results are returned. Learning to use these functions makes code both more efficient and more readable.
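
As a brief sketch (with made-up data) of what these functionals look like in practice:

# R
library(purrr)

samples <- list(a = rnorm(10), b = rnorm(25), c = rnorm(100))

sapply(samples, mean)     # base R functional; simplifies the result to a named vector
map_dbl(samples, mean)    # purrr equivalent; always returns a double vector

# iterating over parallel lists: element-specific trimming via map2_dbl()
map2_dbl(samples, c(0, 0.1, 0.2), \(x, trim) mean(x, trim = trim))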

Functional programming style, although often not a major component of an introductory programming course, is useful in many other languages as well, including Python. (See Kuchling, 2023, for an introduction to functional programming in Python.) The functional programming toolkit contains a similar collection of tools and concepts, regardless of the language, so functional programming skills transfer between languages.

The compositional aspect of functional programming combines well with the use of the pipe operator (|>) in R, which can make sequences of operations easier to read and write. A similar approach that leans more on an object-oriented implementation in Python is method chaining. Each method performs a simple task and returns an object (often of the same type as the original) for which an additional method can be selected and executed, as was illustrated in Section 2.3.5.

2.3.7. Employ Consistency Checks

Things do not always work. And code written today may be used in the future in ways that were not anticipated. Avoiding brittle code (like referring to a column in a data frame by number rather than by name), building in checks of assumptions, emitting informative messages, and running automated code tests can reduce the frequency of downstream errors that may otherwise go undetected. Incorporating such checks can ensure that code fails safely when there are issues or inconsistencies.

When students are introduced to creating packages, they should learn to create unit tests. But much earlier, students should be taught to check their work and their data for obvious problems by answering a few simple questions like

  • Does my data frame have the anticipated shape?

  • Do numerical and/or graphical summaries give plausible results?

  • Are quantitative features in the expected range (e.g., ages nonnegative)?

  • Have I tried some examples where I know what the correct answer is?

  • What happens if I …?

Learning to anticipate (and check for) potential problems is a key step in the progression of a data scientist. Full-blown unit testing of the sort supported by the testthat (Wickham, 2011) or assertr (Fischetti, 2021) packages in R or unittest (“unittest — Unit testing framework,” 2023) or pytest (Krekel et al., 2004) in Python may not be needed from the start. But it is good to emphasize the importance of checking for correctness early on. Data audits are a useful part of any workflow. Visualizations or tables can help convince us that a data transformation was performed correctly. Testing code on small examples, or on examples where the correct result is known, can help reassure us that the code will work on other examples. For instructors, the examples that students construct can also reveal how students are thinking about the task and how their solution might fail.
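
For instance, a unit test for the babynames_plot() function from Section 2.3.5 might look like the following minimal sketch (assuming the testthat package):

# R
library(testthat)

test_that("babynames_plot() returns a ggplot object", {
  p <- babynames_plot(c("Mary", "John"), years = c(1950, 1960))
  expect_s3_class(p, "ggplot")
})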

From these early efforts at testing, we can build to more robust checks of correctness. Defensive coding (McNamara & Horton, 2018) is an approach where some investment in runtime checks can help avoid undetected errors. As a simple example, consider the function we created in Section 2.3.5. Our function assumes that (at least) two years are specified and that the first is smaller than the second. It is safer to test for the expected input type and format and to emit a helpful error message when something unexpected is provided. Here we test that one or two numeric values are provided for years.

# R 
# simple defensive coding enforces that years will have two values in 
# non-decreasing order when we get to rest of function. 
baby_plot <- function(names, years = c(1880, 2017)) {
   if (!is.numeric(years) || length(years) < 1 || length(years) > 2) {
      stop("years should be a numeric vector of length 1 or 2.")   
   }
   years <- range(years) 
   # rest of function 
}

2.3.8. Learn How to Debug and to Ask for Help

Sometimes we know the code we have written is not working, but what then? Developing some rudimentary debugging skills, including the use of a debugger, can make finding and fixing errors much less frustrating and time-consuming. Unfortunately, many instructors do not teach systematic ways to debug code. The aptly named What They Forgot to Teach You About R (Bryan & Hester, 2019, chap. 11) includes a chapter on this important topic specific to R. McConnell (2004) and Thomas and Hunt (2019) include sections on general principles of debugging (along with many other useful tips for improving coding practices).

Another vital skill is learning to create a (minimal) reproducible example (reprex) (Stack Overflow, n.d.). The creation of an example that clearly and reproducibly demonstrates a problem, but does not include any extraneous elements, is often the first step toward identifying and fixing the problem. The reprex package (Bryan et al., 2022) in R is useful for making sure that an example is isolated from things outside the example and makes it easy to share the example with others.
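
As a small sketch of this in practice (assuming the reprex package), wrapping a self-contained snippet in reprex() renders the code together with its output in a form ready to paste into a question or issue:

# R
library(reprex)

reprex({
  library(dplyr)
  starwars |>
    filter(species == "Droid") |>
    count(homeworld)
})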

Reproducible examples can also be useful teaching devices. Consider the following example, which demonstrates some of the differences between as.numeric() and readr::parse_number(), each of which attempts to convert string data to numeric data: as.numeric() returns NA (with a warning) for values like "$1.23" and "1,234", while parse_number() drops the non-numeric characters and returns 1.23 and 1234.

# R
dplyr::tibble(
  text = c("5", "5.3", "$1.23", "1,234"),
  `as.numeric(text)` = as.numeric(text),
  `parse_number(text)` = readr::parse_number(text)
)

2.3.9. Get (Version) Control of the Situation

Version control systems (e.g., Git, typically used with a hosting service such as GitHub) are a key part of a workflow that fosters many good code practices, including code review (Beckman et al., 2021; Bryan & Hester, 2020; Fiksel et al., 2019) and distribution of tasks in a team. Version control tools can also help make software and analyses more robust (Sandve et al., 2013; Taschuk & Wilson, 2017). Graduates of a data science program should be fluent in the use of at least one version control system, and programs should give some thought to where and how this fluency will be developed.

In our experience, students can begin to use basic git commands (commit/push/pull) fairly early in their programs, perhaps primarily through an interface included in an IDE like RStudio (RStudio Team, 2015) or Visual Studio Code (Microsoft, 2023) or by using GitHub desktop (“GitHub Desktop,” 2023), to avoid the command line interface. But students need to have explicit instruction later in their programs to understand what a commit actually is and to learn how to handle a wider range of situations (effective use of branches, reverting to old versions, creating and reviewing pull requests, handling merge conflicts, rebasing, etc.).
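
For R users, the initial setup can even be scripted; a minimal sketch (assuming the usethis package, an RStudio project, and stored GitHub credentials):

# R
library(usethis)

use_git()      # initialize a git repository and offer to make an initial commit
use_github()   # create a GitHub repository for the project and push to it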

The Learn Git Branching tutorial (Cottle, 2023) provides an excellent introduction to git with graphical representations of how git commands affect a repository.

Legacy et al. (2023) describe an approach to teaching undergraduate students to utilize sophisticated version control as part of an agile data analysis framework.

2.3.10. Be Multilingual

While proficiency in a language that supports data analysis well is important for any statistician or data scientist, a willingness to use other languages and the ability to select an appropriate language for a given task are also important. The use of document formats like R Markdown, Quarto, or Jupyter Notebook that support multiple languages provides an easy way for users more familiar with one language to incorporate multiple languages in their workflow. The reticulate (Ushey et al., 2022) package in R even provides a simple mechanism for passing data back and forth between R and Python.

For example, consider the following (somewhat contrived) example, where data wrangling is done in R, a machine learning model is fit in Python using Scikit-learn (Pedregosa et al., 2011), and the results are plotted in R. (The example is contrived because each of these tasks could be done in either language.) The reticulate package provides an object r for retrieving data from an R session for use in Python and an object py for retrieving Python data for use in R.

# R
library(reticulate)
library(dplyr)
library(tidyr)
library(palmerpenguins)
penguins <-
   penguins |> 
   filter(island == "Dream") |>
   mutate(
     species = case_when(               # code comment 1 
       species == "Adelie" ~ 0,
       species == "Chinstrap" ~ 1
     )
   ) |>
   drop_na()

X_penguins <- model.matrix(             # code comment 2
  species ~ -1 + ., 
  data = penguins
)   
y_penguins <- penguins[["species"]]

(1) This recodes species as 0/1 for use in the Python ML algorithm below.
(2) model.matrix() recodes categorical data using indicator variables.

# Python
import numpy as np
from sklearn.svm import SVC                          # code comment 1
from sklearn.preprocessing import StandardScaler     # code comment 2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_penguins = r.X_penguins
y_penguins = np.array(r.y_penguins)
indices = np.arange(X_penguins.shape[0])
penguins_split = train_test_split(
   X_penguins, y_penguins, indices,
   test_size = 0.3, random_state = 0
   )
X_train, X_test, y_train, y_test, indices_train, indices_test = penguins_split
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipe.fit(X_train, y_train)
pipe.score(X_train, y_train)
pipe.score(X_test, y_test)
predictions = pipe.predict(X_penguins)

(1) SVC = Support Vector Classifier
(2) StandardScaler transforms each column to have mean 0 and standard deviation 1

Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
1.0
0.9459459459459459

As a final note, we mention the usefulness of learning the basics of shell languages such as Bash and common command-line utilities, something often entirely unfamiliar to our students. Familiarity with basic shell commands to create directories and manage files can eventually be leveraged for scripting and ‘batch’ processing, taking advantage of additional utility functions.

2.3.11. Summary

We conclude this section with a succinct enumeration of our top 10 list.

  1. Choose good names.

  2. Follow a style guide consistently.

  3. Create documents using tools that support reproducible workflows.

  4. Select a coherent, minimal, yet powerful tool kit.

  5. Don’t Repeat Yourself (DRY).

  6. Take advantage of a functional programming style.

  7. Employ consistency checks.

  8. Learn how to debug and to ask for help.

  9. Get (version) control of the situation.

  10. Be multilingual.

This list is intended to be illustrative rather than exhaustive. Each is actionable and teachable, and together they help achieve the four Cs from Section 2.1 (correctness, clarity, containment, and consistency) that typify high-quality code.

3. Discussion

Improved coding practices are vital to good statistics and data science. Our past practices may have been ‘good enough,’ but the increasing complexity of analyses and the need to address increasingly sophisticated questions requires us to up our game. A 2019 National Academies report called for educational institutions, professional societies, researchers, and funders to work to improve computational reproducibility (NASEM, 2019). Adopting good coding practices is one part of addressing this need. Defining and explaining the concept of code quality is therefore a challenge faced by educators (Borstler et al., 2017) and industry alike. Unfortunately, typical reward structures do not pay sufficient attention to code quality or other aspects of responsible analysis.

There are open questions regarding when and how these practices should be included in the statistics and data science curriculum. The NASEM (2018) report indicated that data acumen requires multiple opportunities to engage in the entire data analysis cycle. No matter how much we teach, students must be given the time to learn and to consolidate their knowledge into their habits and workflow.

Instructors play an important role (Keuning et al., 2019; Theobold et al., 2021). In academic programs it is important to begin establishing these practices early, to reinforce them often, and to expect students to adopt more and more habits of good programmers as they progress through their programs of study.

Based on our experience, we offer the following advice to instructors seeking to improve the quality of the code their students produce.

  1. Hold yourself (as instructor) to a higher standard while gently guiding students to better coding practices.

    In reference to internet protocols, Postel (1980) coined the phrase “be conservative in what you send out and liberal in what you accept”: this philosophy seems equally appropriate when teaching or mentoring coding practices. A good music teacher will always demonstrate good technique and musicianship, even if the student is not capable of performing at the same level, and perhaps not even able to appreciate some aspects of the teacher’s playing. The same should be true of the code that instructors present to students. Dogucu and Çetinkaya-Rundel (2022) discuss the importance of such role modeling in the context of teaching reproducibility.

    Instructors should seek opportunities to improve their own coding practices as well as the practices of their students. Collaborating with colleagues to review code examples and exchange ideas (and increase consistency throughout the program) is a good place to start. Seeking out authors who exemplify good code, and using or imitating their code is also helpful. Making code publicly available and participating in community-supported open source projects are additional ways to improve one’s coding practices.

  2. Start small and adopt a progressive approach that provides ample opportunities to practice and gradually becomes more strict.

    In addition to modeling good technique and musicianship, a good music teacher also focuses teaching attention on those areas where the student can benefit most. This assessment takes into account both the immediate reward for the student (through improvement that the student can readily appreciate) and long-term goals (e.g., avoiding or breaking a bad habit early that will be harder to break later, even if the student does not yet understand the importance of the particular habit). Once this assessment is complete, the music teacher must select an appropriate set of exercises, études, and pieces that both isolate particular skills and help incorporate them into the way they routinely practice and perform—and encourage the student to practice, practice, practice.

    The task of a data science instructor is similar. It is important to distinguish which details matter a lot and which are more minor. Over the course of a multiyear program, a progressive approach that expects higher quality code from students in each subsequent course should help students to endeavor to write good code as a matter of course and appreciate the importance of doing so.

  3. Use live coding demonstrations to model appropriate practice.

    Talk through your own coding, debugging, and analysis process as you code so students can hear your thought process. It is valuable to get the students involved, having them make code suggestions, or suggest improvements to your code. Wickham (2018) provides an excellent example of an expert analyst demonstrating their workflow and process with a growth mindset (which includes making multiple errors and corrections throughout the process).

  4. Regularly comment on student code practices—good and bad—and include these in assessment rubrics.

    Students will often search the internet for help and start using code that is nothing like what has been demonstrated in class. Beginners are typically unable to bring the two into alignment, if that is even possible. If we value good coding practices, then we must assess them and provide formative feedback so that students understand what is valued and know when they are making (or in need of making) progress. Stegeman et al. (2016) provide some useful guidance about assessment of code quality in the context of introductory programming courses, much of which applies to data science courses as well.

  5. Ask students to address code issues before providing help on debugging.

    Code issues like styling, formatting, and naming can be improved before the main debugging issue is addressed. Not only does this make it easier for the instructor to read and understand the student’s code, but the process of improving the code may lead the student to discover for themselves what is not working.

  6. Provide opportunities for students to collaborate on code and to refactor their own code.

    Giving students the opportunity to read other students’ code and to see how their classmates respond to their own code can reinforce the importance of using good coding practices. Code revision (e.g., Bryan, 2018) and improvement is just as important for developing good coding practices as multiple drafts are in a composition course.

Many practical challenges remain. Consider approaches to naming: the highly heterogeneous practices used within the R and Python communities do not do us any favors in this realm. The following set of base R functions is one example: row.names(), rownames() (but row.names() is preferred), colnames() (but no col.names()), colSums(). There is no apparent method to the maddening naming choices here.

This diversity of style conventions makes it hard for users and models poor coding practice. In this context, it is even more important for analysts and instructors to make consistent, clear naming choices and to establish and follow a style guideline.

To close, we quote Peter Norvig (personal communication, March 19, 2019), who noted the critical importance of a “meticulous attention to detail” in data scientists. Many job descriptions also feature this as an attribute of successful job candidates. We believe that such attention to detail will be easier to inculcate within a structure that rewards building better code.


Acknowledgments

Thanks to NSF IISE award 1923388 (DSC-WAV) for partial support of this project.

Disclosure Statement

Randall Pruim, Maria-Cristiana Gîrjău, and Nicholas J. Horton have no financial or non-financial disclosures to share for this article.


References

Abouzekry, A. (2012, May 23). 10 tips for better coding. SitePoint. https://www.sitepoint.com/10-tips-for-better-coding

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022, January 10). Quarto. Zenodo. https://doi.org/10.5281/zenodo.5960048

Aruliah, D. A., Brown, C. T., Chue Hong, N. P., Davis, M., Guy, R. T., Haddock, S. H. D., Huff, K., Mitchell, I. A., Plumbley, M.D., Waugh, B., White, E. P., & Wilson, P. (2012). Best practices for scientific computing. PLOS Biology, 12(1), Article e1001745. https://doi.org/10.1371/journal.pbio.1001745

Augspurger, T. (2016, April 4). Modern Pandas (Part 2): Method chaining. Tom’s Blog. https://tomaugspurger.github.io/posts/method-chaining/#method-chaining

Ball, R., Medeiros, N., Bussberg, N. W., & Piekut, A. (2022). An invitation to teaching reproducible research: Lessons from a symposium. Journal of Statistics and Data Science Education, 30(3), 209–218. https://doi.org/10.1080/26939169.2022.2099489

Baumer, B. S., Çetinkaya-Rundel, M., Bray, A., Loi, L., & Horton, N. J. (2014). R Markdown: Integrating a reproducible analysis tool into introductory statistics. Technology Innovations in Statistics Education, 8(1). https://escholarship.org/uc/item/90b2f5xh

Beck, M. (2020, November 16). Automated reporting in Tampa Bay with open science. Openscapes. https://www.openscapes.org/blog/2020/11/16/tampa-bay-reporting/

Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M. (2021). Implementing version control with Git and GitHub as a learning objective in statistics and data science courses. Journal of Statistics and Data Science Education, 29(1), 1–35. https://doi.org/10.1080/10691898.2020.1848485

Borstler, J., Störrle, H., Toll, D., van Assema, J., Duran, R., Hooshangi, S., Jeuring, J., Keuning, H., Kleiner, C., & MacKeller, B. (2017). I know it when I see it: Perceptions of code quality ITiCSE ’17 Working Group Report. In J. Sheard & A. Korhonen (Eds.), Proceedings of the 2017 ITiCSE Conference on Working Group Reports (pp. 70–85). ACM. https://dl.acm.org/doi/abs/10.1145/3174781.3174785

Bryan, J. (2015). Naming things. Reproducible Science Workshop. http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf

Bryan, J. (2018, July 21). Code smells and feels [Video]. YouTube. https://www.youtube.com/watch?v=7oyiPBjLAWY

Bryan, J., & Hester, J. (2019). What they forgot to teach you about R. https://rstats.wtf/

Bryan, J., & Hester, J. (2020). Happy Git and GitHub for the useR. Happy Git. https://happygitwithr.com

Bryan, J., Hester, J., Robinson, D., Wickham, H., & Dervieux, C. (2022). reprex: Prepare reproducible example code via the clipboard. CRAN. https://CRAN.R-project.org/package=reprex

Burns, P. (2011). The R Inferno. Burns Statistics. https://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Carey, M. A., & Papin, J. A. (2018). Ten simple rules for biologists learning to program. PLOS Computational Biology, 14(1), Article e1005871. https://doi.org/10.1371/journal.pcbi.1005871

Çetinkaya-Rundel, M., Hardin, J., Baumer, B., McNamara, A., Horton, N. J., & Rundel, C. (2022). An educator’s perspective of the tidyverse. Technology Innovations in Statistics Education, 14(1). https://doi.org/10.5070/T514154352

Cottle, P. (2023). Learn Git Branching. https://learngitbranching.js.org

Dana Center. (2021). Data science course framework. https://www.utdanacenter.org/sites/default/files/2021-05/data_science_course_framework_2021_final.pdf

David-Williams, S. (2023, June). Functional programming in data engineering with Python — Part 1. Medium. https://medium.com/data-engineer-things/functional-programming-in-data-engineering-with-python-part-1-c2c4f677f749

de Saint-Exupéry, A. (1984). Airman’s odyssey. Harcourt Brace Jovanovich. https://books.google.com/books?id=nIOZdLHReUMC

Dogucu, M., & Çetinkaya-Rundel, M. (2022). Tools and recommendations for reproducible teaching. Journal of Statistics and Data Science Education, 30(3), 251–260. https://doi.org/10.1080/26939169.2022.2138645

Fiksel, J., Jager, L. R., Hardin, J. S., & Taub, M. A. (2019). Using GitHub Classroom to teach statistics. Journal of Statistics Education, 27(2), 110–119. https://doi.org/10.1080/10691898.2019.1617089

Filazzola, A., & Lortie, C. J. (2022). A call for clean code to effectively communicate science. Methods in Ecology and Evolution, 13, 2119–2128. https://besjournals.onlinelibrary.wiley.com/doi/pdfdirect/10.1111/2041-210X.13961

Fischetti, T. (2021). assertr: Assertive programming for R analysis pipelines. CRAN. https://CRAN.R-project.org/package=assertr

Fowler, M. (2009). TwoHardThings. https://www.martinfowler.com/bliki/TwoHardThings.html

Gentzkow, M., & Shapiro, J. M. (2022). Code and data for the social sciences: A practitioner’s guide. https://web.stanford.edu/~gentzkow/research/CodeAndData.xhtml

Ghani, U. (2022, January 13). 4 tips to improve code quality. Atlassian. https://www.atlassian.com/blog/add-ons/4-tips-to-improve-code-quality

GitHub Desktop. (2023). GitHub. https://desktop.github.com

Google style guide. (2019). Google. https://google.github.io/styleguide/Rguide.html

Granger, B. E., & Pérez, F. (2021). Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2), 7–14. https://doi.org/10.1109/MCSE.2021.3059263

Hardin, J., Horton, N. J., Nolan, D., & Temple Lang, D. (2021). Computing in the statistics curricula: A 10-year retrospective. Journal of Statistics and Data Science Education, 29(Supp. 1), S4–6. https://doi.org/10.1080/10691898.2020.1862609

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Rio, J. F., Wiebe, M., & Peterson, P. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2

Henry, L., & Wickham, H. (2020). purrr: Functional programming tools. CRAN. https://CRAN.R-project.org/package=purrr

Horton, N. J., Alexander, R., Parker, M., Piekut, A., & Rundel, C. (2022). The growing importance of reproducibility and responsible workflow in the data science and statistics curriculum. Journal of Statistics and Data Science Education, 30(3), 207–208. https://doi.org/10.1080/26939169.2022.2141001

Hunt, A., & Thomas, D. (1999). The pragmatic programmer. Addison Wesley.

ISO. (2019). ISO 8601 — Date and time format. https://www.iso.org/iso-8601-date-and-time-format.html

Kaplan, D., & Pruim, R. (n.d.). ggformula: Formula interface to the grammar of graphics. GitHub. https://github.com/ProjectMOSAIC/ggformula

Keuning, H., Heeren, B., & Jeuring, J. (2017). Code quality issues in student programs. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education. ACM. https://dl.acm.org/doi/10.1145/3059009.3059061

Keuning, H., Heeren, B., & Jeuring, J. (2019). How teachers would help students to improve their code. In B. Scharlau & R. McDermott (Eds.), Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education (pp. 119–125). ACM. https://dl.acm.org/doi/10.1145/3304221.3319780

Kibirige, H., Lamp, G., Katins, J., gdowding, austin, Finkernagel, F., matthias-k, Funnell, T., Arnfred, J., Blanchard, D., Kishimoto, P. N., Astanin, S., Chiang, E., Sheehan, E., stonebig, Willers, B., smutch, Halchenko, Y., GK, P., … RK, M. (2023, July 21). Has2k1/Plotnine: V0.12.1. Zenodo. https://doi.org/10.5281/zenodo.8171350

Krekel, H., Oliveira, B., Pfannschmidt, R., Bruynooghe, F., Laugher, B., & Bruhin, F. (2004). Pytest 7.1. GitHub. https://github.com/pytest-dev/pytest

Kuchling, A. M. (2023). Functional programming HOWTO. Python Documentation. https://docs.python.org/3/howto/functional.html

Legacy, C., Zieffler, A., Baumer, B. S., Barr, V., & Horton, N. J. (2023). Facilitating team-based data science: Lessons learned from the DSC-WAV project. Foundations of Data Science, 5(2), 244–265. https://doi.org/10.3934/fods.2022003

Lyman, I. (2021, October 18). Code quality: A concern for businesses, bottom lines, and empathetic programmers. Stack Overflow. https://stackoverflow.blog/2021/10/18/code-quality-a-concern-for-businesses-bottom-lines-and-empathetic-programmers

Lyttle, I., Jeppson, H., & Altair Developers. (2023). altair: Interface to ‘Altair.’ CRAN. https://CRAN.R-project.org/package=altair

Mahoney, M. (2022, August 5). How to use Quarto for parameterized reporting. https://www.mm218.dev/posts/2022-08-04-how-to-use-quarto-for-parameterized-reporting

McConnell, S. (2004). Code complete (2nd ed.). Microsoft Press.

McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

McNamara, A., & Horton, N. J. (2018). Wrangling categorical data in R. The American Statistician, 72(1), 97–104. https://doi.org/10.1080/00031305.2017.1356375

Meyer, A. [@austingmeyer]. (2021, April 10). It is really painful when taking a graduate level data science course and the instructor's code is considerably below any. [Image attached] [Tweet]. Twitter. https://twitter.com/austingmeyer/status/1380942918593183744

Microsoft. (2023). Visual Studio Code: Code editing. Redefined. https://code.visualstudio.com

Müller, K., & Walthert, L. (2022). styler: Non-invasive pretty printing of R code. CRAN. https://CRAN.R-project.org/package=styler

National Academies of Science, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://nas.edu/envisioningds

National Academies of Science, Engineering, and Medicine. (2019). Reproducibility and replicability in science. The National Academies Press. https://nap.edu/25303

Nolan, D., & Stoudt, S. (2021). Communicating with data: The art of writing for data science. Oxford University Press.

Nolan, D., & Temple Lang, D. (2010). Computing in the statistics curriculum. The American Statistician, 64(2), 97–107. https://doi.org/10.1198/tast.2010.09132

Parker, H. (2013, January 30). Hilary: The most poisoned baby name in US history. https://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-Learn: Machine learning in Python. Journal of Machine Learning Research, 12(October), 2825–2830. https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf

Plotly Technologies. (2015). Collaborative data science. https://plot.ly

Postel, J. (1980). DoD Standard Internet Protocol. ACM SIGCOMM Computer Communication Review, 10(4), 12–51. https://datatracker.ietf.org/doc/html/rfc760

Pruim, R., & Horton, N. (2020, May 14). Less volume, more creativity – Getting started with the mosaic package. Project MOSAIC. http://www.mosaic-web.org/mosaic/articles/LessVolume-MoreCreativity.html

Riederer, E. (2020, September 6). Column names as contracts. https://emilyriederer.netlify.app/post/column-name-contracts

van Rossum, G., Warsaw, B., & Coghlan, N. (2023). PEP 8 style guide for Python code. Python. https://peps.python.org/pep-0008

RStudio Team. (2015). RStudio: Integrated development environment for R. RStudio. http://www.rstudio.com/

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLOS Computational Biology, 9(10), Article e1003285. https://doi.org/10.1371/journal.pcbi.1003285

Schulte, C. (2008). Block model: An educational model of program comprehension as a tool for a scholarly approach to teaching. In M. E. Caspersen (Ed.), Proceedings of the Fourth International Workshop on Computing Education Research (pp. 149–160). ACM. https://doi.org/10.1145/1404520.1404535

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman & Hall/CRC. https://plotly-r.com

Spertus, E. (2021, July 5). Best practices for writing code comments. Stack Overflow. https://stackoverflow.blog/2021/07/05/best-practices-for-writing-code-comments

Stack Overflow. (n.d.). How to create a minimal, reproducible example - Help Center. Stack Overflow. https://stackoverflow.com/help/minimal-reproducible-example

Stegeman, M., Barendsen, E., & Smetsers, S. (2014). Towards an empirically validated model for assessment of code quality. In Simon & P. Kinnunen (Eds.), Koli Calling ’14: Proceedings of the 14th Koli Calling International Conference on Computing Education Research (pp. 99–108). ACM. https://dl.acm.org/doi/10.1145/2674683.2674702

Stegeman, M., Barendsen, E., & Smetsers, S. (2016). Designing a rubric for feedback on code quality in programming courses. In J. Sheard & C. S. Montero (Eds.), Koli Calling ’16: Proceedings of the 16th Koli Calling International Conference on Computing Education Research (pp. 160–164). ACM. https://dl.acm.org/doi/10.1145/2999541.2999555

Taschuk, M., & Wilson, G. (2017). Ten simple rules for making research software more robust. PLOS Computational Biology, 13(4), Article e1005412. https://doi.org/10.1371/journal.pcbi.1005412

Theobold, A. S., Hancock, S. A., & Mannheimer, S. (2021). Designing data science workshops for data-intensive environmental science research. Journal of Statistics and Data Science Education, 29(Supp. 1), S83–94. https://doi.org/10.1080/10691898.2020.1854636

Thomas, D., & Hunt, A. (2019). The pragmatic programmer (2nd ed.). Addison Wesley.

Tidyverse. (n.d.). https://www.tidyverse.org/

Trisovic, A., Lau, M. K., Pasquier, T., & Crosas, M. (2022). A large-scale study on research code quality and execution. Scientific Data, 9, Article 60. https://doi.org/10.1038/s41597-022-01143-6

unittest — Unit testing framework. (2023). Python documentation. https://docs.python.org/3/library/unittest.html

Ushey, K., Allaire, J. J., & Tang, Y. (2022). reticulate: Interface to ’Python’. CRAN. https://CRAN.R-project.org/package=reticulate

VanderPlas, J., Granger, B., Heer, J., Moritz, D., Wongsuphasawat, K., Satyanarayan, A., Lees, E., Timofeev, I., Welsh, B., & Sievert, S. (2018). Altair: Interactive statistical visualizations for Python. Journal of Open Source Software, 3(32), Article 1057. https://doi.org/10.21105/joss.01057

VanTol, A. (2023). Python code quality: Tools & best practices. Real Python. https://realpython.com/python-code-quality

Vartanian, E. (2022). 6 coding best practices for beginner programmers. https://www.educative.io/blog/coding-best-practices#lowhigh

Wickham, H. (2011). testthat: Get started with testing. The R Journal, 3(1), 5–10. https://journal.r-project.org/archive/2011/RJ-2011-002/index.html

Wickham, H. (2015). R Packages. O’Reilly Media. https://r-pkgs.org/

Wickham, H. (2018, January 8). Whole game [Video]. YouTube. https://www.youtube.com/watch?v=go5Au01Jrvs&t=3s

Wickham, H. (2019). Advanced R (2nd ed.). Chapman & Hall/CRC.

Wickham, H. (2022). Tidyverse style guide. tidyverse. https://style.tidyverse.org

Wickham, H., Averick, M., Bryan, J., Chang, W., D’Agostino McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., . . . Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), Article 1686. https://doi.org/10.21105/joss.01686

Wickham, H., Hester, J., Chang, W., & Bryan, J. (2022). devtools: Tools to make developing R packages easier. CRAN. https://CRAN.R-project.org/package=devtools

Wilkinson, L., Wills, D., Rope, D., Norton, A., & Dubbs, R. (2006). The grammar of graphics (2nd ed.). Springer New York. https://books.google.com/books?id=NRyGnjeNKJIC

Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), 1–20. https://doi.org/10.1371/journal.pcbi.1005510

Xie, Y. (2022). formatR: Format R code automatically. CRAN. https://CRAN.R-project.org/package=formatR


©2023 Randall Pruim, Maria-Cristiana Gîrjău, and Nicholas J. Horton. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.

Comments
Amelia McNamara:

We just wrapped up the eCOTS R Pedagogy reading group discussion of this paper, and one thing we discussed is that “The length of names should be proportional to their scope” only works when you’re using a modern IDE that has things like tab completion. I think the reason so many people default to using df as the name of their dataset is that, before tab completion, you wanted names you would be repeatedly typing to be as short as possible. If you’re using (for example) RStudio and have tab completion, you can have a long dataset name with no issues.

Nicholas Jon Horton:

The way that modern tools like a good IDE can allow for longer names without adding typing burden is a very helpful observation. But even if not using an IDE, it’s less ambiguous to use “Houston_Housing_tall” rather than “df” as a name for a dataset. As Randy can attest, I’m a “reformed” lazy typist on this front.

More generally, part of our hope here is that instructors and students can spend more time thinking about naming, since it is a very hard problem.

I liked the following query on Stack Overflow, which stated: “I always try to make my variables as short as they can be while maintaining the proper level of meaning.”

I think that is excellent advice.

https://stackoverflow.com/questions/7044163/how-long-is-too-long-for-a-variable-name