Healthcare Data Access in the Age of Data Privacy with Terry Myerson, CEO at Truveta
The Challenge of Accessing Healthcare Data while Maintaining Privacy
Healthcare data is incredibly private, and it can be difficult to balance the need for accessibility with the need to maintain confidentiality. To address this challenge, there are several laws and regulations that govern the handling of healthcare data in the United States.
One such law is HIPAA, the Health Insurance Portability and Accountability Act. HIPAA sets standards for the protection of individually identifiable health information, known as protected health information (PHI), and requires covered entities to implement security measures to safeguard it. HIPAA also defines the process of de-identification, which is what allows health data to be made accessible while keeping personal details private.
De-identification has two paths under HIPAA. The first is called Safe Harbor. With Safe Harbor, all personally identifiable information (PII) such as social security numbers, names, addresses, and dates of birth is stripped out, and so is all time and all location.
Time and location are removed because they are two critical elements of re-identification: if you know that a baby was born at a certain location at a certain time, you have a much better chance of re-identifying that person. The problem is that data with no time or location is very hard to use for public health analysis, or for any study that needs to relate interventions at a point in time to outcomes.
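The stripping step described above can be sketched in a few lines of Python. This is only an illustration: the field names are hypothetical, and the two sets below are not an exhaustive list of what HIPAA treats as identifying.

```python
# Hypothetical field names, chosen for illustration only.
DIRECT_IDENTIFIERS = {"name", "ssn", "address", "birth_date"}
TIME_AND_LOCATION = {"admit_time", "zip_code"}


def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record without identifier, time, or location fields."""
    blocked = DIRECT_IDENTIFIERS | TIME_AND_LOCATION
    return {key: value for key, value in record.items() if key not in blocked}


record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "address": "1 Example St",
    "birth_date": "1980-07-01",
    "admit_time": "2023-07-01T10:15:00",
    "zip_code": "98101",
    "diagnosis": "pancreatic cancer",
}

print(strip_identifiers(record))  # {'diagnosis': 'pancreatic cancer'}
```

Only the clinical field survives, which shows why this path is hard to use for studies that depend on when or where something happened.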
Rare conditions make de-identification harder still. When few people share a diagnosis, each record is more distinctive, so other details may need to be redacted or the diagnosis itself reported less precisely: instead of a specific type of pancreatic cancer, just pancreatic cancer, or just cancer.
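For rare conditions, one way to trade diagnostic detail for anonymity is to walk up a hierarchy of labels. The sketch below assumes a hand-written hierarchy; the specific subtype name and the `DIAGNOSIS_LEVELS` table are hypothetical and stand in for whatever coding system a real data set would use.

```python
# Hypothetical hierarchy, ordered from most specific to least specific.
DIAGNOSIS_LEVELS = {
    "pancreatic adenocarcinoma": [
        "pancreatic adenocarcinoma",
        "pancreatic cancer",
        "cancer",
    ],
}


def generalize_diagnosis(diagnosis: str, level: int) -> str:
    """Return the diagnosis at the given level (0 = most specific).

    Unknown diagnoses are returned unchanged; levels past the end of the
    hierarchy return the most general label.
    """
    levels = DIAGNOSIS_LEVELS.get(diagnosis, [diagnosis])
    return levels[min(level, len(levels) - 1)]


print(generalize_diagnosis("pancreatic adenocarcinoma", 1))  # pancreatic cancer
print(generalize_diagnosis("pancreatic adenocarcinoma", 2))  # cancer
```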
The second path under HIPAA is defined by a statistical threshold of re-identification risk (the HIPAA Privacy Rule calls this path Expert Determination). The PII is still stripped out, but instead of removing all time and location, elements of uniqueness are redacted by reducing granularity. Time can be reduced from the second to the date, the month, the quarter, or the year, and location from a full ZIP code to a shorter ZIP prefix, a county, or a state.
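Reducing granularity can be illustrated with two small helper functions. This is a sketch under stated assumptions: the function names, the level names, and the `*` masking of ZIP digits are illustrative choices, not a description of any particular system.

```python
from datetime import datetime


def generalize_time(ts: datetime, level: str) -> str:
    """Coarsen a timestamp to the requested level of granularity."""
    if level == "second":
        return ts.isoformat(timespec="seconds")
    if level == "date":
        return ts.date().isoformat()
    if level == "month":
        return f"{ts.year:04d}-{ts.month:02d}"
    if level == "quarter":
        return f"{ts.year:04d}-Q{(ts.month - 1) // 3 + 1}"
    if level == "year":
        return f"{ts.year:04d}"
    raise ValueError(f"unknown level: {level}")


def generalize_zip(zip5: str, digits: int) -> str:
    """Keep the first `digits` digits of a ZIP code and mask the rest."""
    return zip5[:digits] + "*" * (len(zip5) - digits)


ts = datetime(2023, 7, 1, 10, 15, 30)
print(generalize_time(ts, "date"))     # 2023-07-01
print(generalize_time(ts, "quarter"))  # 2023-Q3
print(generalize_zip("98101", 3))      # 981**
```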
The appropriate level of granularity depends on the study. If a study has no need for geography, all geography can be taken out, which leaves room for more granularity elsewhere. If a study needs granularity down to the exact date (July 1 versus July 2 versus July 3), it has to accept less granularity in location or in some of the other quasi-identifiers in the data.
Truveta has designed a system that takes medical records and separates the strict identifiers, such as name, from the quasi-identifiers, such as gender and marital status, and from the fields that are not really identifying. It then statistically crafts a data set that maximizes utility at the threshold of re-identification risk that HIPAA requires.
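The interview does not say which statistical measure is used to assess re-identification risk. As a simple stand-in, the sketch below uses a k-anonymity style check: it counts how many records share each combination of quasi-identifier values and flags records whose combination is shared by fewer than `k` people. The field names and the sample records are made up for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group; a value of 1 means someone is unique."""
    return min(group_sizes(records, quasi_identifiers).values())


def risky_records(records, quasi_identifiers, k):
    """Records whose quasi-identifier combination is shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] < k]


# Made-up sample data.
records = [
    {"gender": "F", "zip3": "981", "year": 2023},
    {"gender": "F", "zip3": "981", "year": 2023},
    {"gender": "M", "zip3": "981", "year": 2023},
]

print(smallest_group_size(records, ["gender", "zip3", "year"]))  # 1
print(smallest_group_size(records, ["zip3", "year"]))            # 3
print(risky_records(records, ["gender", "zip3", "year"], k=2))
```

Dropping `gender` from the quasi-identifiers raises the smallest group from 1 to 3, which is the same effect as removing a field a study does not need.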
In summary, HIPAA provides two paths for de-identifying PHI: Safe Harbor, which strips out identifiers along with time and location, and a statistical path, which reduces the granularity of the data only as far as needed to reach the required threshold of re-identification risk.
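Balancing utility and anonymity also depends on the use case. Truveta contractually restricts its data to healthcare research aimed at improving health outcomes and strictly forbids its use for ad targeting or advertising to physicians or patients. Within that limit, each study shapes its data set. A study that links mothers and their children, for example, must give up other detail in geography, time, or diagnosis, because knowing how many children a mother has is itself a re-identification vector.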
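The trade-off between utility and anonymity can be pictured as a search over candidate granularity settings, keeping the most detailed setting that still meets a threshold. The sketch below is a toy version under loud assumptions: it uses a k-anonymity style group count as the threshold, string prefixes as the generalization, and a crude utility score (the number of characters kept). None of these choices come from the interview.

```python
from collections import Counter
from itertools import product


def smallest_group(rows):
    """Size of the smallest group of identical rows."""
    return min(Counter(rows).values())


def best_setting(records, k):
    """Return the (zip_digits, date_chars) pair with the highest crude utility
    score whose smallest group still has at least k records, or None.

    date_chars of 10, 7, and 4 keep the date, the month, and the year of a
    YYYY-MM-DD string; zip_digits of 5, 3, and 0 keep ZIP5, ZIP3, or nothing.
    """
    best = None
    for zip_digits, date_chars in product((5, 3, 0), (10, 7, 4)):
        rows = [(r["zip"][:zip_digits], r["date"][:date_chars]) for r in records]
        if smallest_group(rows) >= k:
            score = zip_digits + date_chars  # assumed utility measure
            if best is None or score > best[0]:
                best = (score, zip_digits, date_chars)
    return None if best is None else best[1:]


# Made-up sample data.
records = [
    {"zip": "98101", "date": "2023-07-01"},
    {"zip": "98102", "date": "2023-07-01"},
    {"zip": "98101", "date": "2023-07-02"},
    {"zip": "98102", "date": "2023-07-02"},
]

print(best_setting(records, k=2))  # (3, 10)
```

With this sample, keeping the full date is only possible once the ZIP code is cut to three digits, which mirrors the point that more granularity in time means less granularity in location.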
"WEBVTTKind: captionsLanguage: enyou talked about um Healthcare data being incredibly private uh so how do you get that level of accessibility while keeping um those sort of important personal details about People's Health private so hippo um which is our Healthcare Privacy Law in the United States defines the process of deidentification and deidentification um has two paths under hippo one is called Safe Harbor with Safe Harbor you you strip out all of the pii all the personal all your social security number your your birth date your name address all that gets stripped out but also all time and all location uh because time and location are two uh critical elements of reidentification if you know that this per that a baby was born at a certain location at a certain time you have a higher ability of reidentifying that person so you have this one path where you strip out all time on location the problem is that's really tough for any sort of Public Health analysis that's really tough if you're trying to look at you know outcomes that occur or interventions that occur at one point in time and outcome so the other path to find is a that there's a statistical threshold of reidentification now this gets much more complicated and it's really an AI problem to say okay we're obviously going to strip out all the pii done but then we're going to start redacting elements of um uniqueness you know we're going to you know we're going to reduce the granularity of time from like down to the second maybe down to the year maybe down to the quarter maybe down to the month we're going to redact vocation from zip five to zip four to zip 3 to you know to county or state but then you also need to consider if it's a rare disease situation then it is a um there's fewer people similar to this you need less granular you may need to redapt other details you know you may need to take it instead of saying you know a specific pancreatic cancer maybe you say just pancreatic cancer maybe you say 
just cancer and so you're going through the statistical thing you're creating this uh you want to provide data that has the right that that balances utility and anonymity but gets to this right statistical level of anonymity and so you got to take into consideration all kinds of factors you know if if a if you're doing a study which has no need for geography just take out all the geography and that you then can include more granularity in other places if you need a study that's going to have down to the date granularity I need to know if it's this absolute date of July one July two July three well then you need to have less granularity in location or maybe some of the other quasi identif fires in the data and so we've designed the system that takes this takes these medical records and the things which are strict identifiers like your name the Quasi identifiers um you know your gender your um marital status things like this and then you have things which are really not identifying about you and statistically tries that that crafts a data set that maximizes utility at the threshold of of reidentification risk that hippo requires okay that's absolutely fascinating and yeah just the the ways people can reconstruct who a person is just from Individual BS of information and then maybe matching with some other data so I'm curious um does the amount of detail that you keep in data set do you need to know what the use case is for that data before you craft this partially normalized um data set well we do restrict the use the the the data is used for healthcare research it's used to approve Health outcomes this is not you know we actually strictly forbid the data to be used for any sort of uh at targeting advertising Physicians or patients so it's a the use case is very clearly about improving Health outcomes and the and that's contractually something someone agrees to prior to getting to getting any access to the data but then we still absolutely you know we have studies 
being done that um link moms and their children and it's very important for studying maternal health and for studying vaccinations and you know what happens now if you want a linkage between moms and their children then you're going to take a lot less other details in geography or time or diagnosis because that's a you know knowing this mother has two children instead of one children again that's a reidentification vector and so knowing you know is it a public health study where you need GE Geographic granularity knowing it's a maternal health study that needs moms and their children linked knowing it's a um you know you're studying a specific procedure where you need down to the second granularity of what's happening in the operating room these these are things which you know we've got this incredible deidentification team and risk analytics team focused on creating the highest utility data sets to meet the needs of the study to be doneyou talked about um Healthcare data being incredibly private uh so how do you get that level of accessibility while keeping um those sort of important personal details about People's Health private so hippo um which is our Healthcare Privacy Law in the United States defines the process of deidentification and deidentification um has two paths under hippo one is called Safe Harbor with Safe Harbor you you strip out all of the pii all the personal all your social security number your your birth date your name address all that gets stripped out but also all time and all location uh because time and location are two uh critical elements of reidentification if you know that this per that a baby was born at a certain location at a certain time you have a higher ability of reidentifying that person so you have this one path where you strip out all time on location the problem is that's really tough for any sort of Public Health analysis that's really tough if you're trying to look at you know outcomes that occur or interventions that occur 
at one point in time and outcome so the other path to find is a that there's a statistical threshold of reidentification now this gets much more complicated and it's really an AI problem to say okay we're obviously going to strip out all the pii done but then we're going to start redacting elements of um uniqueness you know we're going to you know we're going to reduce the granularity of time from like down to the second maybe down to the year maybe down to the quarter maybe down to the month we're going to redact vocation from zip five to zip four to zip 3 to you know to county or state but then you also need to consider if it's a rare disease situation then it is a um there's fewer people similar to this you need less granular you may need to redapt other details you know you may need to take it instead of saying you know a specific pancreatic cancer maybe you say just pancreatic cancer maybe you say just cancer and so you're going through the statistical thing you're creating this uh you want to provide data that has the right that that balances utility and anonymity but gets to this right statistical level of anonymity and so you got to take into consideration all kinds of factors you know if if a if you're doing a study which has no need for geography just take out all the geography and that you then can include more granularity in other places if you need a study that's going to have down to the date granularity I need to know if it's this absolute date of July one July two July three well then you need to have less granularity in location or maybe some of the other quasi identif fires in the data and so we've designed the system that takes this takes these medical records and the things which are strict identifiers like your name the Quasi identifiers um you know your gender your um marital status things like this and then you have things which are really not identifying about you and statistically tries that that crafts a data set that maximizes utility at 
the threshold of of reidentification risk that hippo requires okay that's absolutely fascinating and yeah just the the ways people can reconstruct who a person is just from Individual BS of information and then maybe matching with some other data so I'm curious um does the amount of detail that you keep in data set do you need to know what the use case is for that data before you craft this partially normalized um data set well we do restrict the use the the the data is used for healthcare research it's used to approve Health outcomes this is not you know we actually strictly forbid the data to be used for any sort of uh at targeting advertising Physicians or patients so it's a the use case is very clearly about improving Health outcomes and the and that's contractually something someone agrees to prior to getting to getting any access to the data but then we still absolutely you know we have studies being done that um link moms and their children and it's very important for studying maternal health and for studying vaccinations and you know what happens now if you want a linkage between moms and their children then you're going to take a lot less other details in geography or time or diagnosis because that's a you know knowing this mother has two children instead of one children again that's a reidentification vector and so knowing you know is it a public health study where you need GE Geographic granularity knowing it's a maternal health study that needs moms and their children linked knowing it's a um you know you're studying a specific procedure where you need down to the second granularity of what's happening in the operating room these these are things which you know we've got this incredible deidentification team and risk analytics team focused on creating the highest utility data sets to meet the needs of the study to be done\n"