
Data Quality: Comments on “Statistics and Data Science for Cybersecurity”

Published on Mar 20, 2023

Congratulations to Professors Hero, Kar, Moura, Neil, Poor, Turcotte, and Xi for this thoughtful, revolutionary, cutting-edge, and highly impactful article about improving statistical methodology for scientific studies of cybersecurity. As mentioned in the article, interconnected data services continue to grow in complexity, exposing new vulnerabilities to cyber attacks and outages. However, cybersecurity software has not evolved quickly enough to cope with the increasing sophistication of attackers. This disparity clearly signals the importance of this article.

This article discusses some very timely issues, such as “data science in cybersecurity,” “cybersecurity enterprise systems,” and “physical layer cybersecurity and resilient decision-making algorithms.” It proposes challenges under each issue, and these challenges are likely to become major research topics in the coming years. I have only a few comments to offer, as given below.

  1. Data. Cybersecurity data can be huge; collecting the relevant data on a network can quickly reach hundreds of terabytes. Additionally, there are different sources of data to consider, each bringing its own challenges. Thus, storing the data and performing computations on it are important parts of analyzing cybersecurity data. Furthermore, a large volume of data does not ensure its quality. In fact, cybersecurity data typically contains a great deal of noise, some of it introduced deliberately, which makes it rather challenging to deal with.

  2. Complexity. The complexity of cybersecurity data introduces other complications. Attacks could appear as anomalies, or some new data could have very different properties than what had been observed so far. Thus, models and machine learning methods need to handle new and unpredicted data in a responsible way.

  3. Confidentiality. As the article points out, it is important to consider both the security of communication, achieved using encryption, and the protection of privacy, achieved through some privacy mechanism. The authors opt for an information-theoretic notion of privacy, which they state is framed in terms of conditional entropy. One potential concern about using conditional entropy to measure privacy loss is that it assumes a model for the private data, and (presumably) only provides guarantees with high probability, perhaps failing to protect outliers. The problem with this approach is that real people’s data are not random in this sense, and it is particularly those who have unusual (i.e., outlier) data that need the most protection. The authors may want to reconsider differential privacy for their privacy framework, as it measures privacy loss in the worst case over databases, without assuming any model for the data.
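
To make the model-free nature of differential privacy in comment 3 concrete, here is a minimal sketch of the standard Laplace mechanism applied to a counting query. It is not taken from the article under discussion. The function names, the failed-login example data, and the choice of epsilon are assumptions made purely for illustration. The guarantee comes from the noise scale (sensitivity divided by epsilon), not from any distributional assumption on the records, so an outlier record is protected to the same degree as a typical one.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: failed-login counts per host (illustrative only).
failed_logins = [0, 3, 12, 1, 40, 0, 7]
noisy = private_count(failed_logins, lambda n: n > 5, epsilon=1.0)
print(f"noisy number of hosts with more than 5 failed logins: {noisy:.2f}")
```

A smaller epsilon gives stronger privacy but a noisier answer. Repeated queries consume additional privacy budget, which this sketch does not track.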


I would like to thank my Purdue colleagues for their insightful comments on my review; specifically, Dr. Jordan Awan and Dr. Timothy Keaton.

Disclosure Statement

Dennis K. J. Lin has no financial or non-financial disclosures to share for this article.

©2023 Dennis K. J. Lin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
