Column Editor’s Note: Data science questions can be found, and answered, in a wide range of contexts. Not all involve big data, and not all deal with very complicated questions. In this issue's Minding the Future column, Anthony Shen demonstrates some aspects of the data science pipeline (data wrangling, bringing subject matter knowledge into the picture, visualizing and interpreting results) to analyze club participation at his school during the COVID-19 pandemic. Students: Do you have examples of your own where you have used data science principles to analyze a problem of interest to you? If so, consider submitting a proposal to be featured in the Minding the Future column of an upcoming issue.
Keywords: data science process, high school research, clubs, COVID-19
Data science has quickly become one of the most prominent buzzwords in academia, yet it is often misperceived. For most people, the term data science conjures images of massive computers with thousands of CPUs building complex models from millions of data points that only a professor or a banker in a corporate suit understands. This picture is one depiction of data science, and perhaps the most common. However, it makes data science seem unfriendly to the general public, especially to young enthusiasts who come to believe that they cannot use data science in their own day-to-day lives. In fact, data science is accessible, and something that any high schooler can begin. This article describes one high schooler’s experience with data science and shows how aspiring data scientists can collect data, analyze it, and draw conclusions from anything around them, even their school’s yearbook.
Like the author of any good novel, article, or research paper, a data scientist must begin the journey from data to discovery by identifying a topic to research. The best part is that the topic can be about anything and come from anywhere. I was looking for a speech topic for my school’s annual TEDxYouth conference when I came across a Stanford University article that discussed the impacts of the pandemic on children both socially and academically. The article discussed how pandemic stress had “physically aged teens’ brains,”1 making their brains seem older than those of teens in past years. This finding served as the basis for my speech, which argued that the pandemic had set children back socially and academically compared to children before the pandemic. Until I gave the speech, I had not realized how much this situation would resonate with my audience. Parents and teachers came up to me to discuss their thoughts on my speech, which prompted further conversations with other teachers and students about whether they had noticed similar things. I asked my math teacher to look at the grade distributions in precalculus and calculus classes over the years and even brought the topic up with fellow senate members.
As members of the Hong Kong International School (HKIS) Senate, we noticed a slight change in school culture as we entered the 2022–2023 school year. Many senate members, including myself, noticed that the school was no longer an after-school hub for students to hang out and relax and that the plaza, one of the most striking features of our school, was often empty. Instead, after the bell rang, flocks of students would head directly toward the buses. This social shift also made its way into the attendance of after-school activities and clubs. HKIS has a vibrant club system that boasts 80-plus different clubs and student organizations each year. Multiple club leaders, including myself, felt that there was a drop in membership, yet could not explain the reason for such a drop. While other senate members felt that it was impossible to find data on such a change and understand it, the question sparked my curiosity and challenged me to look further into this phenomenon, throwing me into the data science world.
However, finding a topic to research and delve into was only the tip of the iceberg. How would it be possible to find consistent and clear data that could capture the qualitative social behaviors of my classmates? To verify whether my observations were true, I needed to source data from my school environment. There were many hurdles in finding data on the HKIS club system. The first hurdle was that HKIS does not have a consistent platform for tracking member counts across clubs. One of the unique parts of the HKIS club culture is that most clubs run independently with little tracking from the student body, so it is difficult to source accurate and reliable data on member participation in these organizations. Another difficulty was the need for data from before, during, and after the pandemic, meaning I would have to rely on preexisting data. The answer, which might not be apparent at first, was located within the school yearbook. Having been at HKIS since the third grade, I had amassed a stack of yearbooks, which sat on the bottom shelf of my bookshelf, gathering dust. While the yearbook may just seem like a reason for students to complain about their ugly yearbook photos, it also provides a catalog of all the HKIS clubs each year. The yearbook not only provided a standardized way to measure the number of members in each club every year but also provided other valuable information, such as the total number of students in the high school each year, the number of clubs in the high school, and the number of new clubs each year. Just like that, the data you need can be obtained quite easily. Data does not always need to be collected in massive experiments done in million-dollar facilities. It can be collected in your home or even sourced from existing materials.
Throughout this process, I also reached out to my teachers to ask whether they had any academic data and whether I could use it.
The next step was to go through nine yearbooks and count the number of people in each club photo, counting twice on separate days to ensure that each number was correct. Having sat on the HKIS Senate for many years, I immediately realized that the raw data needed to be preprocessed first: certain clubs may have undergone a rebranding, a name change, or a merger, meaning the same club could appear more than once in the counts. Such matters must be considered during the preprocessing stage. Without understanding how HKIS clubs changed and rebranded over time, it would be difficult to find such duplicates and consolidate them in the data set. However, I knew that clubs needed to go through a renewal process and that many clubs had leaders who continued to run the club over multiple years. This knowledge, combined with the knowledge of the club faculty advisors, allowed me to identify the clubs that needed to be “cleaned.” Preprocessing and cleaning the data is a step that is often overlooked by young data scientists, including myself, but it made my data much more consistent and filled in the gaps. For example, the math club at my school had undergone five name changes during its history, and by figuring that out, I was able to piece together a 7-year picture of the club’s history instead of treating it as five different clubs with different numbers.
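The consolidation step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the club names, the alias table, and the head counts are all invented placeholders, and the function names `canonical_name` and `consolidate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table: every name a club has used over the years maps
# to one canonical name. These names are invented, not taken from HKIS data.
ALIASES = {
    "Mathletes": "Math Club",
    "Math Society": "Math Club",
    "Math Club": "Math Club",
}


def canonical_name(name):
    """Return the canonical club name, or the cleaned name if no alias is known."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)


def consolidate(records):
    """Group (year, club_name, head_count) records into {club: {year: count}}.

    Two records that resolve to the same club in the same year are flagged
    rather than silently combined, since that usually signals a merger or a
    counting mistake that needs a human decision.
    """
    history = defaultdict(dict)
    for year, name, count in records:
        club = canonical_name(name)
        if year in history[club]:
            raise ValueError(f"{club} appears twice in {year}; review by hand")
        history[club][year] = count
    return {club: dict(years) for club, years in history.items()}


# Placeholder head counts, for illustration only.
records = [
    (2017, "Mathletes", 12),
    (2018, "Math Society", 15),
    (2019, "Math Club", 14),
    (2019, "Chess Club", 9),
]
club_history = consolidate(records)
```

With this structure, the three differently named entries become one multi-year history for a single club, which is the same idea as piecing together the math club's record across its name changes. Building the alias table is the part that needs subject matter knowledge; the code only applies it.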
Data analysis is naturally the next step in the process. I used both linear regression and categorical regression models to help me notice irregularities or trends. However, data analysis does not have to be complicated. In fact, the simpler the better; as long as it helps you achieve your purpose and leads you to draw a conclusion about your hypothesis, it will do. The most surprising part of my entire research process was the drastic difference between my hypothesis and my findings. Contrary to my anecdotal opinion, club participation actually increased at the beginning of the pandemic, with clubs that did not need to meet in person experiencing a larger increase in participation than clubs that did meet in person (see Figure 1). This seems confusing, but on second glance, when we consider clubs as an auxiliary measure for social participation, the data indicates that there was a larger shift away from clubs to other social measures such as sports or fine arts.
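As a rough illustration of how simple such an analysis can be, the sketch below fits an ordinary least-squares line to a yearly participation measure. It is an assumption-laden example rather than the model from the study: the helper names `memberships_per_student` and `fit_line` are hypothetical, and every number is a made-up placeholder, not a result from the yearbooks.

```python
def memberships_per_student(total_memberships, enrollment):
    """Normalize club head counts by school size so years are comparable."""
    if enrollment <= 0:
        raise ValueError("enrollment must be positive")
    return total_memberships / enrollment


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder data for illustration only: years since the first yearbook,
# total club memberships counted that year, and total enrollment.
years = [0, 1, 2, 3]
memberships = [800, 1200, 1600, 2000]
enrollment = [800, 800, 800, 800]

rates = [memberships_per_student(m, e) for m, e in zip(memberships, enrollment)]
slope, intercept = fit_line(years, rates)
```

A positive slope would suggest participation per student rising over the period and a negative slope a decline. Fitting the same line separately for clubs that meet in person and clubs that do not is one simple way to compare the two groups.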
Although the findings were not consistent with my hypothesis, a situation many statisticians face, the findings themselves highlight the importance of clubs during a pandemic. While all other opportunities for social interaction were limited, many students found a sense of community through clubs within HKIS. This has further implications for social and emotional well-being. Especially during the tough times of the pandemic, many people sought to join a club to interact with their fellow students. This study is informative for schools as well. During difficult times, schools should work to leverage their clubs and activities culture, as it can help create communities and get students through.
Apart from the findings, it is also important to notice and write down the limitations of the study. Using the yearbook as a baseline is a unique and effective way to measure club participation. However, it may not always be the most accurate method, as the yearbook does not always represent the number of committed members of a club. It is virtually impossible to measure the number of committed members within a club because it would require attending almost every club meeting for close to 100 different organizations within the school. The yearbook photo session is a one-time measure, while club participation is dynamic rather than static. Additionally, more information would need to be collected to see whether the findings from this case study can be generalized to other schools during COVID-19. These limitations help inform my next steps after conducting the initial study.
Looking at these results through the lens of my own school, they informed numerous decisions I made after I took on the role of the presiding officer at my school. Recognizing that school clubs were one of the most important avenues for social connection, I took measures to change club culture and increase participation. Holding regular club check-ins throughout the year allowed the Senate to learn how to better support these clubs and to monitor their participation rates. I spoke about my findings at the school mental health student panel and discussed why these patterns exist and how clubs and COVID-19 impacted our school community. The power of data science comes not from the experiments and the analysis, but rather from how the findings can help each and every one of us push for change within our own community.
Throughout this process, I learned what it means to be a data scientist and how easy it can be to collect and analyze data. More often than not, there is a plethora of resources available for an up-and-coming data scientist to use, and unorthodox methods of obtaining data might even be better in some cases. Data, no matter how big or confusing, can always be broken down, similar to how a complex and difficult task can be broken down into many smaller ones. As one of the fields that has taken and will continue to take the world by storm, data science is at the forefront of creating social change, informing us about how we can tackle the problems that plague the communities around us. No matter where I am, I hope that I will be able to continue questioning, collecting, and eventually drawing conclusions about the phenomena that I notice around me, and hopefully after reading this, you can too.
Anthony Shen has no financial or non-financial disclosures to share for this article.
©2024 Anthony Shen. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.