Rejoinder: The Emerging Role of Data Science in Cybersecurity

Published on Apr 20, 2023

We thank the discussants for their perspectives on our presentation (Hero et al., 2023) of emerging challenges in cybersecurity and the role of data science in addressing these challenges. We enjoyed reading the discussants’ comments, as they amplify and reinforce our principal point: that statistical methods of data science will have an increasingly important impact on cybersecurity solutions. Each of the discussants’ narratives stands on its own and we are in agreement with most of their points.

The comments in the contribution of discussants Sanna Passino, Adams, Cohen, Evangelou, and Heard (2023) are highly relevant, both from a foundations standpoint and from an applications perspective. We thank them for their remarks on challenges, data structures, and future directions in statistical cybersecurity. The proposal by Sanna Passino et al. (2023) to segment multimodal data into a few well-defined data structures and design fusion approaches for anomaly analysis over multiclass data is timely and could potentially enable a systematic approach to cybersecurity analysis of complex systems. We heartily agree that data structures, data fusion, and streaming methods will be important components in developing domain-specific data-driven cybersecurity solutions.

Concerning data structures, Sanna Passino et al. (2023) give three common examples: graphs, point processes, and textual data. We note that data can either come directly in these forms or, more often, be derived from more complex data types, such as images and video, for example as spatial correspondence graphs, event-marking point processes, or text captions, respectively. In the latter case, it may be beneficial to consider these data types as unstructured and in context, rather than solely considering the extracted data structures. Below we add to the discussants’ comments on the importance of each of the three data structures to cybersecurity.

Textual Data

Among applications in textual data, Sanna Passino et al. (2023) rightly point out challenges with latent Dirichlet allocation (LDA) models and interpretability. The issue of interpretability cannot be emphasized enough. Security analysts, the typical consumers of alerts generated by these methods, are very skeptical, both because of the relatively high rate of false positives they are exposed to and because of the cost of responding to a false positive. Therefore, they will simply ignore alerts that are not readily understandable. While one can bring additional context to help prove the viability of an alert, interpretable models that can indicate, directly from their parameters, why the evidence is malicious are critical to adoption by security analysts.

Among applications of models for textual data, Sanna Passino et al. (2023) mention analysis of textual data for detection and parsing of computer log data. These are both excellent examples, but a new application has also emerged recently, with the use of large language models (LLMs) as an investigative and hunting tool for cyberthreats. Currently, analysts use a query interface, for example, SQL, to follow an investigative workflow. This workflow is interactive, a process of ‘following one’s nose.’ The user makes a query and, based upon the results, may make an additional query. This iterative process means that agile interaction with the data system is a key efficiency principle. LLMs, and their chat implementations, may become the de facto way of proceeding with these interactive investigations. Rather than an SQL query, which is very rigid and precise, an analyst could interact with natural language, providing less precision but greatly improved flexibility and ease of use. We expect the rapid adoption of LLM-based chat interfaces for cyberthreat investigation in the coming few years.
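
To make the query-driven workflow concrete, the following is a minimal Python sketch using the standard-library `sqlite3` module. The table `auth_events`, its columns, and the sample rows are hypothetical and invented purely for illustration; they are not drawn from any real system. The second query is chosen on the basis of the first query's result, which is the 'following one's nose' pattern described above.

```python
import sqlite3

# Hypothetical in-memory log table; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_events (ts TEXT, account TEXT, host TEXT, success INTEGER)"
)
conn.executemany(
    "INSERT INTO auth_events VALUES (?, ?, ?, ?)",
    [
        ("2023-01-01T02:00", "svc_backup", "db01", 0),
        ("2023-01-01T02:01", "svc_backup", "db01", 0),
        ("2023-01-01T02:02", "svc_backup", "db01", 1),
        ("2023-01-01T02:05", "svc_backup", "hr02", 1),
        ("2023-01-01T09:00", "alice", "mail01", 1),
    ],
)

# Step 1: which accounts have failed logins, most failures first?
failed = conn.execute(
    "SELECT account, COUNT(*) AS n FROM auth_events "
    "WHERE success = 0 GROUP BY account ORDER BY n DESC"
).fetchall()

# Step 2 depends on the answer to step 1: where did that account then succeed?
suspect = failed[0][0]
hosts = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT host FROM auth_events "
        "WHERE account = ? AND success = 1 ORDER BY host",
        (suspect,),
    )
]
print(suspect, hosts)
```

Each step requires the analyst to know the schema and write an exact query. A natural-language interface would aim to remove that requirement, trading some precision for flexibility.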

Point Processes, Graphs, and Data Fusion

Sanna Passino et al. (2023) rightly highlight point processes and graphs as key data structures when analyzing enterprise data for security purposes. Moreover, combining these concepts gives us the right structure to model an enterprise’s behavior: that of a dynamic graph of interconnected nodes. While graphs have been studied extensively, as have point processes, the combination of these two is a fruitful area of research for security, and is still nascent. The dynamics of the point processes are critical, since the normal operation of an enterprise is non-stationary (the null), as are attack behaviors (the alternative). The connectivity of the graph is also critical, since attacks typically involve more than one entity (machine, user, file, application, etc.).
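
One way to picture a dynamic graph of point processes is as a graph whose edges each carry their own sequence of event times. The Python sketch below is a toy data structure under that assumption; the class name `DynamicGraph`, its methods, and the sample events are hypothetical and are not taken from the discussants' work or from any particular tool.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source entity, destination entity)


class DynamicGraph:
    """Toy dynamic graph: every directed edge stores its own event times."""

    def __init__(self) -> None:
        self.event_times: Dict[Edge, List[float]] = defaultdict(list)

    def observe(self, t: float, src: str, dst: str) -> None:
        """Record one event at time t on the edge src -> dst."""
        self.event_times[(src, dst)].append(t)

    def edge_rate(self, edge: Edge, start: float, end: float) -> float:
        """Empirical event rate on an edge over the window [start, end)."""
        times = self.event_times.get(edge, [])
        return sum(start <= t < end for t in times) / (end - start)

    def new_edges(self, split: float) -> Set[Edge]:
        """Edges whose first observed event occurs at or after `split`."""
        return {e for e, times in self.event_times.items() if min(times) >= split}


graph = DynamicGraph()
for t, src, dst in [
    (0.0, "alice", "fileserver"),
    (1.0, "alice", "fileserver"),
    (2.5, "bob", "mail"),
    (6.0, "alice", "fileserver"),
    (7.0, "alice", "dc01"),
    (7.5, "alice", "dc01"),
]:
    graph.observe(t, src, dst)

baseline_rate = graph.edge_rate(("alice", "fileserver"), 0.0, 5.0)
recent_new = graph.new_edges(5.0)
print(baseline_rate, recent_new)
```

The per-edge rate captures the point-process view, while the appearance of a previously unseen edge captures the graph view. A realistic model would replace the windowed count with a fitted non-stationary intensity.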

Sanna Passino et al. (2023) mention the critical need to combine different types of data together. We could not agree more, but there are challenges specific to cybersecurity that need to be addressed. Data rates are exceedingly high in security analysis, with streams of events such as network or endpoint data at times arriving at rates greater than one million events per second. Yet attacks happen only rarely, perhaps on the order of one attack per year. The vast difference between the data rates and attack rates means that any single question we might ask of the data, if that question does not have perfect precision, will result in enormous numbers of false positives. Therefore, to achieve acceptable precision in cybersecurity, data fusion can be especially powerful. In fact, human analysts, when confronted with a new alert, almost invariably seek corroborating or counterfactual evidence to confirm or deny the alert. Many false positives have to do with human activity that looks malicious but is benign in context. Network administrators pinging a machine for uptime may look like reconnaissance by an adversary. Security researchers conducting tests of software may look like they are running malware. By integrating counterfactual evidence into analysis of data streams, fusion can eliminate vast swaths of false positives. Data fusion also remains of the highest priority for an investigator after false positives are eliminated. The task is to identify all areas of the enterprise that have been touched by the attack, in order to stop the breach and remediate the assets involved. Automatic combination of all attack signals greatly accelerates this phase of breach response, and yet is still nascent in today’s security tools.
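
The base-rate argument above can be checked with a few lines of arithmetic. In the Python sketch below, the event rate and attack rate are the orders of magnitude quoted in the text; the per-event false-positive rate, the assumption that every attack is detected, and the assumption that two fused signals fire independently on benign events are illustrative simplifications, not claims about any real detector.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def false_alerts_per_year(events_per_second: float, false_positive_rate: float) -> float:
    """Expected number of benign events flagged per year."""
    return events_per_second * SECONDS_PER_YEAR * false_positive_rate


def precision(true_alerts: float, false_alerts: float) -> float:
    """Fraction of alerts that correspond to real attacks."""
    return true_alerts / (true_alerts + false_alerts)


events_per_second = 1_000_000  # order of magnitude quoted in the text
attacks_per_year = 1           # order of magnitude quoted in the text
single_fpr = 1e-6              # hypothetical: one false alert per million events

# A single detector, assuming it catches every attack.
single_false = false_alerts_per_year(events_per_second, single_fpr)
single_precision = precision(attacks_per_year, single_false)

# Fusion as a simple AND of two signals, assuming they err independently
# on benign events (a strong assumption, made only for illustration).
fused_false = false_alerts_per_year(events_per_second, single_fpr * single_fpr)
fused_precision = precision(attacks_per_year, fused_false)

print(f"single detector: {single_false:,.0f} false alerts/year, precision {single_precision:.2e}")
print(f"fused detectors: {fused_false:,.1f} false alerts/year, precision {fused_precision:.3f}")
```

Under these assumptions, even a detector that is wrong only once per million events produces tens of millions of false alerts per year, while requiring independent corroboration reduces that to a few dozen. Real signals are rarely independent, so the actual gain from fusion would be smaller.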

We thank Michailidis (2023) for pointing out the broader context of anomaly detection and discussing the related change-detection problem. We agree that the online setting will become increasingly important, especially as increasingly complex critical infrastructure systems will require shorter and shorter threat response times. Learning to update a time-evolving baseline against which to detect anomalies will indeed be crucial for such applications and we agree that methods adapted from statistical process control could be worthwhile. Michailidis makes another important point, that anomaly detection methods should account for the effect of the anomaly on system viability and user experience. This reinforces our point that developing methods that account for the importance of anomalies will be valuable, as discussed in the Hou et al. (2018) reference in our article. Finally, we cannot agree more with Michailidis's (2023) entreaty to the community to pursue the release of new, large-scale, well-documented, and curated data sets, which are sorely needed to develop and test next-generation anomaly detection methods.
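
As a rough illustration of a time-evolving baseline in the spirit of statistical process control, the Python sketch below tracks an exponentially weighted moving average (EWMA) and variance of a stream of counts and flags points that deviate by more than `k` standard deviations. The function name `ewma_alerts`, the parameter values, and the sample counts are hypothetical; this is one simple choice among many and is not the method proposed by Michailidis (2023).

```python
from typing import Iterable, List


def ewma_alerts(
    counts: Iterable[float], alpha: float = 0.1, k: float = 3.0, warmup: int = 5
) -> List[int]:
    """Return indices whose value deviates from the running EWMA baseline
    by more than k running standard deviations."""
    mean = None
    var = 0.0
    alerts: List[int] = []
    for i, x in enumerate(counts):
        if mean is None:
            mean = float(x)  # initialize the baseline with the first observation
            continue
        deviation = x - mean
        std = var ** 0.5
        if i >= warmup and std > 0 and abs(deviation) > k * std:
            alerts.append(i)
            continue  # do not let a flagged point contaminate the baseline
        # Standard exponentially weighted updates of the mean and variance.
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
    return alerts


# Hypothetical per-minute event counts with one obvious spike at index 9.
counts = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50, 10, 11]
print(ewma_alerts(counts))
```

Excluding flagged points from the update keeps a single spike from inflating the baseline, but it also means that a genuine, persistent level shift would keep raising alerts until the baseline is reset. Handling that case is where change-detection methods come in.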

We thank Liang (2023) for correctly noting the importance of looking beyond individual links in a network when considering issues of security and privacy. The cited work, Liang et al. (2011), illustrates how viewing security as a high-layer networking issue can be synergistic with the emerging physical-layer notions of security, and this can improve our understanding of the vulnerabilities of a network to attacks. Liang (2023) provides a very useful commentary on how to employ and adapt information-theoretic security approaches to achieve privacy and confidentiality in multiuser systems such as the Internet of Things and distributed edge networks. Indeed, new challenges such as dynamism, limited-node computing resources, and latency-sensitive applications open up new research questions in both classical information-theoretic and modern cryptographic approaches to end-user security. We also believe that modern security research and protocol design should be tuned to the application context, for example, beyond securing point-to-point communication. It will be fruitful to investigate new measures of privacy and confidentiality that are task-centric, for example, differential privacy in distributed computing scenarios, or blockchain-like designs for decentralized smart contracts, which can be integrated with secure communication protocols that cater to various use cases.

We thank Chorti (2023) for expanding substantially on our discussion of physical layer security and its role in sixth-generation (6G) wireless networks. In particular, her notion of quality of security is an important new way of thinking about security in networks, as a counterpart to quality of service, which has been a key performance indicator used in evaluating and designing the many generations of wireless networks. The introduction of the new performance indicator brings with it the possibility of trading off between quality of security and quality of service, which are opposing criteria in many circumstances. Bringing security to the forefront in this way constitutes a major step in moving from the notion of security as an overlay to a more fundamental idea of security as an integral part of network design.

Finally, we thank Lin (2023) for reemphasizing our points on the considerable hurdles we face in developing a data-driven framework for cybersecurity. He mentions, in particular, the massive size of cybersecurity data, the high complexity of such data, and the maintenance of confidentiality of sensitive information. Indeed, cybersecurity data is often complex, large, and noisy. More importantly, cybersecurity data that is publicly available to researchers is often not standardized, may be missing some labels, and is lacking in contextual metadata. Developing statistical models for such data raises many interesting challenges that could lead to new research topics in statistics and data science. For example, little is known about reliable anomaly-detection methods when only a subset of the anomalies is labeled or when the nominal baseline is evolving over time, an important scenario highlighted by discussant Michailidis (2023). Regarding Lin's (2023) comments on confidentiality, we agree that implementing a privacy mechanism inside cybersecure systems will be valuable. We understand Lin’s concern about the model-dependent nature of information-theoretic approaches to privacy, as contrasted with the model-free, differential privacy approach. However, we emphasize that our article does not advocate one approach over the other. Indeed, in the article we discuss differential privacy and its connection to the information-theoretic notion of privacy. These are not orthogonal ideas and there will be value in considering them in combination for securing privacy in next-generation cybersystems.

Disclosure Statement

Alfred Hero, Soummya Kar, Jose Moura, Joshua Neil, H. Vincent Poor, Melissa Turcotte, and Bowei Xi have no financial or non-financial disclosures to share for this article.

Reference List

Chorti, A. (2023). Trust and physical layer security for 6G cyber-physical systems. Harvard Data Science Review, 5(1).

Hero, A., Kar, S., Moura, J., Neil, J., Poor, H. V., Turcotte, M., & Xi, B. (2023). Statistics and data science for cybersecurity. Harvard Data Science Review, 5(1).

Hou, E., Sricharan, K., & Hero, A. O. (2018). Latent Laplacian maximum entropy discrimination for detection of high-utility anomalies. IEEE Transactions on Information Forensics and Security, 13(6), 1446–1459.

Liang, Y. (2023). Securing distributed wireless edge networks via information-theoretic security approaches. Harvard Data Science Review, 5(1).

Liang, Y., Poor, H. V., & Ying, L. (2011). Secrecy throughput of MANETs under passive and active attacks. IEEE Transactions on Information Theory, 57(10), 6692–6702.

Lin, D. K. J. (2023). Data quality: Comments on “Statistics and Data Science for Cybersecurity.” Harvard Data Science Review, 5(1).

Michailidis, G. (2023). Challenges for anomaly detection in large-scale cyber-physical systems. Harvard Data Science Review, 5(1).

Sanna Passino, F., Adams, N. M., Cohen, E. A. K., Evangelou, M., & Heard, N. A. (2023). Statistical cybersecurity: A brief discussion of challenges, data structures, and future directions. Harvard Data Science Review, 5(1).

©2023 Alfred Hero, Soummya Kar, Jose Moura, Joshua Neil, H. Vincent Poor, Melissa Turcotte, and Bowei Xi. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.