The sharing of health-related data is governed by myriad federal and state (and sometimes international) requirements. Moreover, these health-related data sharing rules are evolving, which poses a substantial challenge to creating long-term data management and sharing plans for National Institutes of Health (NIH)-funded or conducted research. To comply with inconsistent and evolving legal standards, the author recommends use of HIPAA’s Expert Determination Method to de-identify data. In addition, in light of concerns about whether de-identification of data is sufficient protection of privacy, and in the absence of a federal law prohibiting the re-identification of individuals in de-identified data sets, the author recommends utilizing contractual controls on the use of de-identified data and restrictions on downstream disclosures of that data.
Keywords: data sharing, de-identification, HIPAA, Common Rule, NIH Certificates of Confidentiality, legal issues
Kristen Rosati is a practicing lawyer and past president of the American Health Law Association. She has deep experience in health information and genomic privacy, data sharing for research, and research compliance. The author counsels universities, academic medical centers, and researchers on compliance with health data privacy requirements and structuring research activities across jurisdictions.
In October 2020, the National Institutes of Health (NIH) released its revised Data Management and Sharing Policy (the DMS Policy) (Office of the Director, National Institutes of Health, 2020). The revised DMS Policy requires a data management and sharing plan for all NIH-funded or conducted research starting January 25, 2023.
NIH-supported or conducted research that involves sharing health-related data triggers potential legal barriers due to myriad federal, state, and international legal requirements that are quickly evolving. The article outlines the primary legal authorities that apply to NIH data sharing plans that involve health-related data and their differing regulatory requirements on data de-identification, including the federal Health Insurance Portability and Accountability Act of 1996 and its implementing regulations called the HIPAA Privacy Rule (2022), the Department of Health and Human Services (HHS) Policy for Protection of Human Research Subjects (also known as the Common Rule) (2022), and the NIH policy related to Certificates of Confidentiality (the NIH Certificates of Confidentiality Policy) (Office of Extramural Research, National Institutes of Health, 2017). To comply with these inconsistent and evolving legal standards, the author recommends use of HIPAA’s Expert Determination Method to de-identify data.
In addition, in light of the concerns about whether de-identification of data is sufficient protection of privacy, and in the absence of a federal law prohibiting the re-identification of individuals in de-identified data sets, the author recommends utilizing contractual controls on the use of de-identified data and restrictions on downstream disclosures of that data.
This article does not address the substantial literature on the science of data de-identification, nor the costs and benefits of various methods of data de-identification, all of which are covered in substantial detail elsewhere by data scientists. Instead, the article is intended to provide guidance to the data science and biomedical research communities in crafting data management and sharing plans that involve health-related data, as well as to vice presidents for research and institutional legal counsel who will need to approve those plans.
There are many federal, state, and international laws that apply to the use of health data for research. The Health Insurance Portability and Accountability Act of 1996 (HIPAA, 1996) is a major consideration for some (but not all) researchers. In addition, the federal Common Rule applies to “human subjects research” that is conducted or supported by a federal department or agency that has adopted the Common Rule, including HHS (of which the NIH is, of course, a part). NIH itself issued a policy on data sharing that implements the 21st Century Cures Act and modified how Certificates of Confidentiality are implemented. Federal law also provides heightened privacy protection for substance use disorder (SUD) information that originates from SUD treatment providers—called Part 2 programs (Confidentiality of Substance Use Disorder Patient Records, 2022). Moreover, many state laws impose requirements related to data sharing and de-identification of data, such as the California Consumer Privacy Act (2021). And where data sharing plans involve data collected about individuals who live in the European Economic Area, the European Union (EU) General Data Protection Regulation (GDPR) applies (Regulation (EU) 2016/679). This section discusses the federal requirements that primarily regulate NIH-funded research: HIPAA, the Common Rule, and the NIH Certificates of Confidentiality Policy.
While the HIPAA Privacy Rule has an outsized effect on data sharing, it does not apply to all actors in the data science community. Rather, it applies to HIPAA “covered entities,” which include most health care providers, health plans, and health care clearinghouses. HIPAA does not apply to universities (other than university-affiliated health systems and provider organizations), many research organizations, and most commercial entities in the research space such as pharmaceutical, medical device, and biotech companies. That HIPAA does not apply equally to all actors in the data sciences is often a cause of confusion (and of calls for a more level playing field across different types of organizations).
The HIPAA rules relating to the use of health-related data can be complicated to apply. Covered entities that are subject to HIPAA may internally use identifiable health information (called protected health information, or PHI, under HIPAA) for research or may disclose PHI externally to third parties for research, only if the requirements of at least one of nine different HIPAA research rules are met (Rosati, 2021b).
One of the most significant HIPAA issues in large-scale data sharing is de-identification of data. HIPAA provides two ways in which PHI may be de-identified so that it is no longer PHI protected by the Privacy Rule: The Safe Harbor Method and the Expert Determination Method. In November 2012, the HHS Office for Civil Rights (OCR) issued detailed guidance on the use of both methods (OCR, 2012).
First, a covered entity may follow the Safe Harbor Method of de-identification and remove or code all the HIPAA ‘identifiers’ in the information. These identifiers include 18 different data elements about individuals and their family members, household members, or employers. Some HIPAA identifiers are not intuitively ‘identifying,’ including dates related to patients (such as dates of service, dates of drug administration, dates of admission and discharge, dates of birth and dates of death) and any geographic designations smaller than a state (such as city and county).
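As a rough illustration of the Safe Harbor approach, the following Python sketch drops or generalizes a handful of fields in a single record. The field names are hypothetical, the three field groups cover only a few of the 18 identifier categories, and the choice to keep the year of each date is an assumption of the sketch; the regulation and the OCR guidance control what may actually be retained. It is an illustration of the idea, not a compliance tool.

```python
from datetime import date

# Hypothetical field names, used only for illustration. They cover a few
# of the 18 HIPAA identifier categories, not the full list.
DIRECT_IDENTIFIER_FIELDS = {"name", "email", "medical_record_number"}
SUB_STATE_GEOGRAPHY_FIELDS = {"city", "county"}
DATE_FIELDS = {"date_of_birth", "admission_date", "discharge_date"}


def safe_harbor_sketch(record: dict) -> dict:
    """Return a copy of `record` with the illustrative identifiers removed.

    Direct identifiers and sub-state geography are dropped. Dates are
    reduced to the year and stored under a new "<field>_year" key.
    """
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS or field in SUB_STATE_GEOGRAPHY_FIELDS:
            continue  # drop the field entirely
        if field in DATE_FIELDS:
            # Assumes the value is a datetime.date; keep only the year.
            cleaned[f"{field}_year"] = value.year
            continue
        cleaned[field] = value
    return cleaned


record = {
    "name": "Jane Doe",
    "city": "Phoenix",
    "state": "AZ",
    "admission_date": date(2021, 3, 4),
    "diagnosis_code": "J45",
}
print(safe_harbor_sketch(record))
```

In this toy record the admission date survives only as a year and the city is lost entirely, which hints at why a data set treated this way can lose much of its value for research.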
Because it is difficult to use the Safe Harbor Method of de-identification and still have a valuable data set for research, the expert determination method of de-identification is often employed for large-scale data sharing. Under this method, a ‘qualified’ statistical expert determines the risk is very small that the information could be used alone, or in combination with other available information, to identify the patient. A qualified expert under HIPAA is a “person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable.” As explained in the OCR guidance on de-identification, there “is no specific professional degree or certification program for designating who is an expert at rendering health information de-identified. Relevant expertise may be gained through various routes of education and experience. Experts may be found in the statistical, mathematical, or other scientific domains. From an enforcement perspective, OCR would review the relevant professional experience and academic or other training of the expert used by the covered entity, as well as actual experience of the expert using health information de-identification methodologies” (OCR, 2012).
The other alternative often employed in large-scale data sharing under HIPAA is to utilize a ‘limited data set.’ A limited data set is partially de-identified health information that excludes all direct HIPAA identifiers; however, a limited data set may include dates related to patients, geographic designations above the street level or post office box, and any other unique identifying number, characteristic, or code that is not expressly listed as a HIPAA identifier. The key to using or disclosing a limited data set is that a data use agreement is required both for internal research personnel and outside researchers before access is provided.
One of the challenges under HIPAA is whether genetic information may be treated as de-identified if it is not accompanied by any of the HIPAA ‘identifiers’ (Rosati, 2021a). OCR guidance has confirmed that genetic information is only PHI if it is ‘individually identifiable’, indicating that OCR does not treat all genetic information as PHI (OCR, 2002). This author thus takes the position that genetic information is not PHI if the data otherwise meet the Safe Harbor Method for de-identification (i.e., all HIPAA identifiers are excluded) or meet the expert determination method of de-identification (Rosati, 2021a). Moreover, genetic information without direct identifiers may be treated as a limited data set, which then may be shared pursuant to a HIPAA data use agreement.
The federal Common Rule applies to human subjects research that is conducted or supported by a federal department or agency that has adopted the Common Rule (Protection of Human Research Subjects, 2022). The de-identification standards under the Common Rule—which was recently revised and applies to all NIH-funded human subjects research—may be evolving. This may create confusion regarding how the Common Rule interacts with HIPAA de-identified data.
Under the Common Rule, an investigator does not conduct ‘human subjects’ research as long as both of the following conditions are met: (1) the data or biospecimens are not collected for currently proposed research; and (2) the research does not involve “identifiable private information” (Protection of Human Research Subjects, 2022; see also Office for Human Research Protections, 2008). The Common Rule defines “identifiable private information” as “private information for which the identity of the subject is or may readily be ascertained by the investigator or associated with the information” (Protection of Human Research Subjects, 2022). Today, data that is ‘de-identified’ or a ‘limited data set’ under HIPAA’s standards generally meets the Common Rule’s standard for data that is not identifiable private information.
However, the Common Rule standard may be changing. The revised Common Rule requires the federal departments and agencies that implement it to reexamine the meaning of identifiable private information and biospecimens, with the help of appropriate experts, within one year of publication of the final rule (a deadline that passed on January 19, 2018), and then every four years thereafter (Protection of Human Research Subjects, 2022). As of this writing, the federal agencies have not issued this guidance. As part of that reexamination, the federal agencies are required to determine whether there are technologies or techniques that, when applied to information or biospecimens previously considered nonidentifiable, could enable investigators to identify the subjects (Protection of Human Research Subjects, 2022).
If the federal agencies determine that particular technologies or techniques enable identification, those technologies or techniques will be placed on a list of technologies or techniques whose use requires research participant consent (or another basis under the revised Common Rule for using the data without consent, such as an institutional review board [IRB] waiver of consent) (Protection of Human Research Subjects, 2022). According to the preamble of the revised Common Rule, the expectation is that whole genome sequencing will be one of the first technologies evaluated for placement on this list.
It is unclear at this point how the revised Common Rule’s new framework for assessing ‘identifiability’ will interact with HIPAA’s de-identification standard. Data de-identified under HIPAA may not automatically be considered nonidentifiable under the Common Rule if that data was generated by a technology or technique that appears on the future list of technologies or techniques that produce identifiable information. For example, if the agencies put whole genome sequencing on that list, that sequencing data might be treated as identifiable, even if the genetic data currently is considered to be de-identified under the HIPAA standards. Moreover, it is unclear whether a new identifiability framework will apply to previously shared data.
For NIH-funded research, there is an additional complexity related to data de-identification: in 2017, NIH announced updates to its policy for issuing Certificates of Confidentiality to implement the 21st Century Cures Act (Office of Extramural Research, NIH, 2017). The revised NIH Certificates of Confidentiality Policy broadens the applicability of certificates and increases privacy protections for research participants. It is now included in the NIH Grants Policy Statement as a standard term and condition of award for new and noncompeting awards issued on or after October 1, 2017 (Office of Extramural Research, NIH, 2017, 2019).
The NIH’s definition of “identifiable, sensitive information” (also called “Covered Information” by the policy) is information about an individual “that is gathered or used during the course of biomedical, behavioral, clinical, or other research” where an individual is identified or for which there “is at least a very small risk, that some combination of the information, a request for the information, and other available data sources could be used to deduce the identity of an individual” (Office of Extramural Research, NIH, 2017). The NIH gives the following examples of covered information: “name, address, social security or other identifying number; and fingerprints, voiceprints, photographs, genetic information, tissue samples, or data fields that when used in combination with other information may lead to identification of an individual” [emphasis added] (NIH, 2020). The covered information definition thus is broader than HIPAA’s definition of “individually identifiable health information” and broader than the Common Rule’s current “identifiable private information” standard, especially as applied to genetic information. Any NIH-funded research that generates genetic information, even if de-identified under HIPAA and nonidentifiable under the Common Rule, is subject to the NIH Certificates of Confidentiality Policy.
These inconsistent and evolving legal standards for data de-identification pose a real challenge for the data science community and biomedical researchers, particularly as applied to the de-identification of genetic information. One way to minimize the potential future disconnect between HIPAA, the Common Rule, and the NIH Certificates of Confidentiality Policy is to use HIPAA’s expert determination method for de-identification. As noted above, under this method a qualified statistical expert determines that the risk is very small that the information could be used alone, or in combination with other available information, to identify the individual (HIPAA Privacy Rule, 2022). The expert will consider whether the intended recipient of the data has access to data that would lead to re-identification and may consider whether contractual controls are in place to prohibit the recipient from combining the data with other data sets (OCR, 2012).
As applied to genetic sequencing data, for example, sequencing data cannot itself identify a person; rather, a population register or other data set that includes sequencing data paired with identifiers (such as a publicly available genealogy database) would be necessary to identify an individual in the original sequencing data. The expert determination method must consider the recipient of the data (OCR, 2012). Thus, the expert determination method could conclude that a data set including genetic data is de-identified, if the data set will not be provided to recipients that maintain a population register. In addition, the expert determination method may consider contractual controls over the data to prevent re-identification (OCR, 2012). Thus, contractual controls on the recipient’s use of data are relevant to the de-identification determination, such as an express prohibition against re-identification of individuals in the data, restrictions on combining the data with other data sets that could increase the likelihood of re-identification, and limits on downstream disclosures of the data where not subject to the same contractual controls. In contrast, the Safe Harbor Method relies solely on the removal of HIPAA identifiers and does not depend on an analysis of the protections available outside the characteristics of the data set itself.
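The role of the recipient's other data can be illustrated with a toy linkage check. The Python sketch below uses hypothetical records and field names and relies on exact matching of shared attributes, a simplification chosen for illustration and not the analysis a qualified expert would perform. It shows that the same de-identified records become linkable to names only when the recipient also holds a register that pairs those attributes with identifiers.

```python
from collections import defaultdict


def uniquely_linked(deidentified, register, keys):
    """Return (record, name) pairs for records matching exactly one register entry.

    `keys` lists the attributes shared by both data sets. A record that
    matches zero entries, or more than one, is treated as not linked.
    """
    index = defaultdict(list)
    for entry in register:
        index[tuple(entry[k] for k in keys)].append(entry["name"])

    linked = []
    for rec in deidentified:
        matches = index.get(tuple(rec[k] for k in keys), [])
        if len(matches) == 1:
            linked.append((rec, matches[0]))
    return linked


# Hypothetical data: "marker" stands in for a sequencing-derived attribute.
KEYS = ["marker", "birth_year"]
DEIDENTIFIED = [
    {"marker": "A1", "birth_year": 1980, "finding": "x"},
    {"marker": "B2", "birth_year": 1975, "finding": "y"},
    {"marker": "D4", "birth_year": 1968, "finding": "z"},
]
REGISTER = [
    {"name": "Person 1", "marker": "A1", "birth_year": 1980},
    {"name": "Person 2", "marker": "B2", "birth_year": 1975},
    {"name": "Person 3", "marker": "B2", "birth_year": 1975},
]

# A recipient holding the register can link some records to a name ...
print(len(uniquely_linked(DEIDENTIFIED, REGISTER, KEYS)))
# ... while a recipient with no register cannot link any.
print(len(uniquely_linked(DEIDENTIFIED, [], KEYS)))
```

The comparison between the two calls mirrors the point in the text: whether a data set can be treated as de-identified depends in part on who receives it and what else that recipient holds.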
If HHS releases guidance treating certain genomic sequencing as potentially identifiable under the Common Rule, HHS hopefully will continue to treat data de-identified under HIPAA’s expert determination method as nonidentifiable under the Common Rule. Under the expert determination method, a statistical determination of a ‘very small risk’ of re-identification may be acceptable to the agencies to conclude that the particular set of genetic sequencing information cannot enable particular investigators to identify the subjects. While that Common Rule guidance on identifiability has not yet been issued, the research community is hopeful that HHS and the other Common Rule agencies will anticipate the need for flexibility to support the continued use of de-identified genetic information.
Not surprisingly, there is substantial discussion (and disagreement) about whether de-identification of data alone sufficiently advances the fundamental research ethics goals of respect for persons (autonomy), beneficence (no harm), and justice identified in the Belmont Report (Office of the Secretary, Department of Health, Education and Welfare, 1979). “Although privacy is not explicitly reflected in the original Belmont Report, it has since become a pressing concern and a source of potential harm that researchers should seek to mitigate” (Anabo et al., 2018, p. 144).
There are legitimate concerns that de-identification may not itself be sufficient to protect privacy. Rothstein, for example, argues that the use of de-identified health information, particularly when paired with biological specimens, creates privacy risks to both individuals and groups, and that the current regulatory system needs to be revised to extend protection to de-identified information in a manner that does not unduly burden research (Rothstein, 2015). There is also substantial debate within the data science community about whether health-related data can ever truly be de-identified (Ohm, 2010).
Individual consent traditionally has been the way in which the research community has protected individual autonomy, so that individuals are aware of privacy risks involved in research. However, individual informed consent is not an easy fit for large-scale data sharing, because of the substantial expense involved, the potential intrusion on privacy to seek consent, and the potential introduction of consent bias in research due to differing consent rates based on demographic characteristics (El Emam, 2013; El Emam et al., 2011; Froomkin, 2019). And as noted by Price and Cohen (2019), there are harms that can result from privacy ‘overprotection,’ including limits on innovation.
Moreover, there is no federal law that currently prohibits the re-identification of individuals whose data is used in de-identified data projects. Such a law—with appropriate exceptions such as permitting re-identification in research protocols approved by an IRB—would substantially advance the public’s comfort with the use of large-scale data sets and maintain the crucial role that large-scale sharing of de-identified health information plays in the ‘learning healthcare system’ (Olsen et al., 2007; see also Barth-Jones, 2016).
In light of the concerns about whether de-identification of data is sufficient protection of privacy, and in the absence of a legal prohibition against re-identification of individuals, the research community should exercise good stewardship to protect individuals whose data is included in large de-identified data sets. To be good stewards of health-related data, the author recommends that researchers (and their institutions) utilize contractual controls to impose restrictions on a recipient’s use of de-identified data and on downstream disclosures of that data. While a HIPAA data use agreement is only required for the use and disclosure of HIPAA limited data sets, and contracts are not legally required for disclosure of de-identified data (other than certain state law requirements for the sale of de-identified data), adding a layer of contractual control on use and disclosure of de-identified data sets likely will increase the protection of individuals represented in the data.
It is increasingly common to include the following content in data sharing agreements:
Express prohibitions against the re-identification of individuals in de-identified data sets: As noted, no federal law currently prohibits recipients of a de-identified data set from using that data to re-identify individuals in the data. While researchers likely have no incentive to re-identify individuals, downstream recipients of those data sets might financially benefit from re-identifying individuals for marketing and other commercial services. In the absence of a regulatory prohibition against re-identification, many institutions have begun imposing contractual prohibitions on re-identification.
Restrictions on combining de-identified data with other data sets: Combining de-identified data sets with those that include individual identifiers obviously raises the risk of re-identification, even if the intent is not to identify individuals in the data. Moreover, where data has been de-identified under the expert determination method, prohibitions against linking data without obtaining a revised certification that the data remains de-identified are warranted. As explained earlier, the expert will take into account a recipient’s ability to re-identify data; linking a de-identified data set to new data (even if also de-identified) may change the calculation of the risk of re-identification.
Imposition of security requirements to protect against inadvertent disclosure of de-identified data: Increasingly, data sharing agreements impose express security requirements to protect against unauthorized access, acquisition, destruction, use, modification and/or disclosure of the data. While these security requirements will vary greatly, by way of example these requirements might include: (a) a requirement for the recipient to have a comprehensive risk-based security, privacy and compliance framework that complies with applicable laws; (b) restriction of the location of the data to secure facilities with access only to authorized individuals; (c) technical access controls, including two-factor authentication and password complexity; (d) data encryption; (e) firewalls; (f) prohibition of copying or downloading data by non-administrators; (g) security monitoring to track potential vulnerabilities and intrusions; and (h) record retention/destruction specifications.
Prohibition against disclosure of de-identified data to third parties, unless subject to the same contractual controls: Because of concerns that downstream disclosure will result in the loss of data control or will result in non-compliance with the terms and conditions of an expert determination method de-identification, many organizations now prohibit the release of de-identified data to third parties that have not contracted with the data source. Where downstream disclosure is permitted, such as from data repositories that release de-identified data with open licenses, contracts might impose the requirement for the original recipient to pass down the same contractual controls or data license terms to other third parties.
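The four kinds of provisions described above can be tracked as a simple checklist when reviewing a draft agreement. The following Python sketch is only an organizational aid; the class, its field names, and the function are hypothetical and do not reflect any standard template or legal requirement.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class AgreementControls:
    """Hypothetical flags, one per provision type discussed in the text."""

    prohibits_reidentification: bool = False
    restricts_linking_to_other_data: bool = False
    imposes_security_requirements: bool = False
    restricts_downstream_disclosure: bool = False


def missing_controls(controls: AgreementControls) -> list[str]:
    """Return the names of provisions the draft agreement does not yet include."""
    return [f.name for f in fields(controls) if not getattr(controls, f.name)]


# Example: a draft that so far covers only two of the four provisions.
draft = AgreementControls(
    prohibits_reidentification=True,
    imposes_security_requirements=True,
)
print(missing_controls(draft))
```

A checklist like this records only whether a provision is present. Whether its wording is adequate remains a question for counsel.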
The sharing of health-related data is governed by myriad federal and state (and sometimes international) requirements that are difficult to understand and quickly evolving. This poses a substantial challenge to creating long-term data management and sharing plans for NIH-funded or -conducted research. Moreover, there are substantial concerns about whether de-identification of data is sufficient protection of privacy, and there is no current federal law prohibiting the re-identification of individuals in de-identified data sets. To keep in front of those evolving laws, the author recommends using HIPAA’s expert determination method to de-identify data. The author also recommends employing contractual controls on the use of de-identified data as a key to good data stewardship.
The author would like to thank the following people for their helpful suggestions on this article: Dr. Christine L. Borgman, Distinguished Research Professor, Information Studies, UCLA; Dr. Maryann Martone, Professor, Neurosciences, UCSD; and Dr. Richard Nakamura (retired), Former Director, Center for Scientific Review, National Institutes of Health.
Kristen B. Rosati has no financial or non-financial disclosures to share for this article.
Anabo, I., Elexpuru-Albizuri, I., & Villardon-Gallego, L. (2018). Revisiting the Belmont Report’s ethical principles in internet-mediated research: Perspectives from disciplinary associations in the social sciences. Ethics and Information Technology, 21(2), 137–149. https://doi.org/10.1007/s10676-018-9495-z
Barth-Jones, D. (2016). NCVHS Hearing: De-identification and HIPAA - Improving HIPAA de-identification public policy. https://www.ncvhs.hhs.gov/wp-content/uploads/2016/04/BARTH-JONES.pdf
California Consumer Privacy Act, Cal. Civ. Code 1798.100-1798.199.100. (2021). https://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?lawCode=CIV&division=3.&title=1.81.5.&part=4.&chapter=&article=
Confidentiality of Substance Use Disorder Patient Records, 42 C.F.R. Part 2 (2022). https://www.ecfr.gov/current/title-42/chapter-I/subchapter-A/part-2
El Emam, K. (2013). Guide to the de-identification of personal health information. CRC Press.
El Emam, K., Jonker, E., & Fineberg, A. (2011). The case for de-identifying personal health information. SSRN. http://doi.org/10.2139/ssrn.1744038
Froomkin, M. (2019). Big data: Destroyer of informed consent. Yale Journal of Health Policy, Law, and Ethics. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3405482
Health Insurance Portability and Accountability Act of 1996, Pub. L. No. 104-191, 110 Stat. 1936 (1996). https://www.govinfo.gov/app/details/PLAW-104publ191
HIPAA Privacy Rule, 45 C.F.R. Parts 160 and 164 (2022). https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-160 and https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164?toc=1
National Institutes of Health. (2020). Frequently asked questions (FAQs): Certificates of Confidentiality. https://grants.nih.gov/faqs#/certificates-of-confidentiality.htm?anchor=header11002
Office for Civil Rights. (2002). Frequently asked questions: Does the HIPAA Privacy Rule protect genetic information? https://www.hhs.gov/hipaa/for-professionals/faq/354/does-hipaa-protect-genetic-information/index.html
Office for Civil Rights. (2012). Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#uniquenumber
Office for Human Research Protections. (2008). Guidance on research involving coded private information or biological specimens. https://www.hhs.gov/ohrp/regulations-and-policy/guidance/research-involving-coded-private-information/index.html
Office of Extramural Research, National Institutes of Health. (2017). Notice of changes to NIH policy for issuing Certificates of Confidentiality (NOT-OD-17-109). National Institutes of Health. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-17-109.html
Office of Extramural Research, National Institutes of Health. (2019). NIH Grants Policy Statement, § 4.1.4.1. https://grants.nih.gov/grants/policy/nihgps/HTML5/section_4/4.1_public_policy_requirements_and_objectives.htm#Confiden
Office of the Director, National Institutes of Health. (2020). Final NIH Policy for Data Management and Sharing (NOT-OD-21-013). https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
Office of the Secretary, Department of Health, Education and Welfare. (1979). The Belmont report. https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-508c_FINAL.pdf
Ohm, P. (2010). Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review, 57, 1701. SSRN. https://ssrn.com/abstract=1450006
Olsen, L., Aisner, D., & McGinnis, J. M. (2007). The Learning Healthcare System: Workshop Summary. The National Academies Press. https://doi.org/10.17226/11903
Price, W. N., & Cohen, I. G. (2019). Privacy in the age of medical big data. Nature Medicine, 25, 37–43. https://doi.org/10.1038/s41591-018-0272-7
Protection of Human Research Subjects, 45 C.F.R. Part 46 (2022). https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-A/part-46?toc=1
Rosati, K. (2021a). Coping with evolving standards of data de-identification. In A. G. Gosfield (Ed.), Health Law Handbook (pp. 503–532). Thomson Reuters.
Rosati, K. (2021b, April 28–29). The regulation of research involving human participants [Conference session]. Changing the Culture of Data Management and Sharing: A Workshop, Virtual. The National Academies of Sciences, Engineering and Medicine. https://www.nationalacademies.org/event/04-29-2021/changing-the-culture-of-data-management-and-sharing-a-workshop
Rothstein, M. (2015). Ethical issues in big data health research. Journal of Law, Medicine & Ethics, 43(2), 425–429. https://doi.org/10.1111/jlme.12258
©2022 by Kristen B. Rosati. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.