Is it OK to Share Anonymized Data?

The manager of HM Hospitales has recently announced that he has made available to the scientific community 2,157 anonymized medical records of COVID-19 patients treated at these hospitals.


A fine initiative that nonetheless raises a couple of questions:

Does personal or confidential data anonymization really provide a privacy guarantee?
Is publishing anonymized databases currently the best way to help the scientific community build accurate machine learning models and make headway in research, in this case biomedical research?

An anonymized database is vulnerable to what are called re-identification attacks: attempts to link ostensibly anonymous records to the records of another related database or data source in order to extract confidential information. For example, two researchers at the University of Texas at Austin managed to de-anonymize the movie ratings of Netflix users in a dataset published by the company for a competition designed to improve its recommendation system. The technique rested on a simple idea: in a dataset with a huge number of fields (here, one per movie), very few users give the same ratings to the same set of films. Because a user’s combination of ratings is unique, or almost unique, that user can be identified with only a little auxiliary information obtained from another source.

Their paper explains that a high-dimensional dataset like Netflix’s greatly raises the chances of de-anonymizing a record, while at the same time slashing the amount of auxiliary information required to do so. High dimensionality also makes the de-anonymization algorithms robust to data perturbation and to incorrect auxiliary information. The researchers showed this by cross-checking the Netflix ratings against the Internet Movie Database (IMDb), where many Netflix subscribers had also posted ratings of the movies they had seen; they thus managed to link IMDb user profiles, often carrying the users' real names, to Netflix ratings that were theoretically private. This turned out to be possible even when the subscriber had posted very few IMDb ratings and those ratings bore only a rough resemblance to the same subscriber’s Netflix ratings.
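The matching idea described above can be illustrated with a deliberately simplified sketch. It is not the researchers' actual algorithm: the scoring rule, the `tolerance` and `min_gap` parameters, and all the data are assumptions made up for illustration. It only shows how a handful of approximate public ratings can single out one record in a sparse dataset.

```python
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie title -> star rating (1 to 5)


def similarity(aux: Ratings, record: Ratings, tolerance: int = 1) -> int:
    """Count auxiliary ratings that the record matches within `tolerance` stars."""
    return sum(
        1
        for movie, stars in aux.items()
        if movie in record and abs(record[movie] - stars) <= tolerance
    )


def best_match(aux: Ratings, dataset: Dict[str, Ratings], min_gap: int = 2) -> Optional[str]:
    """Return the pseudonym that best matches `aux`, or None if no record stands out."""
    scores = sorted(
        ((similarity(aux, rec), pid) for pid, rec in dataset.items()),
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_pid = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    # Only claim a match when the best candidate clearly beats the runner-up.
    return top_pid if top_score - runner_up >= min_gap else None


# Hypothetical "anonymized" dataset: pseudonyms instead of names.
anonymized = {
    "user_017": {"Alien": 5, "Brazil": 4, "Solaris": 2, "Heat": 3},
    "user_342": {"Alien": 3, "Titanic": 5, "Heat": 1},
    "user_908": {"Brazil": 1, "Solaris": 5, "Amelie": 4},
}

# Hypothetical public profile: few ratings, not identical to the private ones.
public_profile = {"Alien": 4, "Brazil": 4, "Solaris": 3}

match = best_match(public_profile, anonymized)
```

With three noisy ratings the public profile still points to a single pseudonym, while a single rating (for example `{"Alien": 4}`) is ambiguous and yields no match.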


A well-known case in the medical sphere was the disclosure of the medical records of the Governor of Massachusetts, when it occurred to an MIT student, Latanya Sweeney, to cross-reference an anonymized medical database with the voter registration rolls of Cambridge, MA. The voter records contained, among other things, the name, address, zip code, date of birth, and gender of each of Cambridge’s 54,000 voters at that time, who were distributed across seven zip codes. Combining this information with the records of the anonymized database, she was able to find the governor’s medical records quite easily: only six people in Cambridge shared his date of birth; three of these were men and only one of them, the governor himself, lived in his zip code. The article “The 'Re-Identification' of Governor William Weld's Medical Information” describes this case, while also pointing out that the re-identification was possible because the governor was a public figure who had experienced a highly publicized hospitalization (he passed out at a public event and the footage was broadcast on all TV networks). Nonetheless, it is highly likely that the same procedure would also work for finding the information of any well-known person, or of someone who has shared too much information on the Internet.
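This kind of linkage can be sketched in a few lines. The example below is only an illustration under stated assumptions: the field names (`birth_date`, `gender`, `zip_code`), the helper `link_records`, and every record are invented, and real datasets would need cleaning and fuzzy matching. It joins an "anonymized" table to a public one on the shared attributes and keeps only the matches that are unique.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Record = Dict[str, str]


def link_records(
    medical: List[Record],
    voters: List[Record],
    keys: Sequence[str] = ("birth_date", "gender", "zip_code"),
) -> Dict[int, str]:
    """Map the index of each medical record to a voter name when the match is unique."""
    index = defaultdict(list)
    for voter in voters:
        index[tuple(voter[k] for k in keys)].append(voter["name"])

    matches = {}
    for i, rec in enumerate(medical):
        names = index.get(tuple(rec[k] for k in keys), [])
        if len(names) == 1:  # exactly one voter shares all the attributes
            matches[i] = names[0]
    return matches


# Fictional "anonymized" medical records: names removed, other attributes kept.
medical_records = [
    {"birth_date": "1950-03-14", "gender": "M", "zip_code": "00001", "diagnosis": "A"},
    {"birth_date": "1982-11-02", "gender": "F", "zip_code": "00002", "diagnosis": "B"},
]

# Fictional public voter roll.
voter_roll = [
    {"name": "Alice Example", "birth_date": "1950-03-14", "gender": "M", "zip_code": "00001"},
    {"name": "Bob Example", "birth_date": "1982-11-02", "gender": "F", "zip_code": "00002"},
    {"name": "Carol Example", "birth_date": "1982-11-02", "gender": "F", "zip_code": "00002"},
]

reidentified = link_records(medical_records, voter_roll)
```

Here the first record is re-identified because its combination of birth date, gender and zip code is unique in the voter roll; the second is not, because two voters share the same combination.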

Does this mean we should forgo the use of anonymized data for scientific research?

Probably not, or not yet. As things stand today, it does not seem that re-identification can be performed on a massive scale on all the records of any anonymized database. Although there is now a host of studies showing cases of re-identification in certain circumstances, few would call this an excessive price to pay for the huge scientific advances made possible by the exchange of anonymized medical datasets. It does serve as food for thought, however: if we want to share our datasets in the interests of medical research, we should give careful thought to the anonymization technique. It also shows that privacy may not be guaranteed, or simply that the database concerned is not suitable for anonymized publication. It is more than likely that new techniques will appear in the future that disclose all or part of the information we wanted to hide.

For that very reason, the time may now have come to consider data-sharing alternatives. Quite apart from anonymization, this idea gains traction if we also consider the following question: wouldn’t it be better to switch to a cooperation scenario in which hospitals, groups, organizations, etc. worked together in a federated learning network instead of each publishing its own anonymized database? Federated learning is a privacy- and confidentiality-preserving distributed computation model. It involves taking the machine learning models to where the data is rather than working with a single, centralized dataset. A collaboration of this type not only sidesteps the shortcomings of database anonymization and eases the legal constraints on sharing medical records but also brings much more data into play (i.e., not only the 2,157 medical records shared by HM Hospitales), thereby allowing the organizations to obtain more accurate models.
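To make "taking the model to the data" concrete, here is a minimal sketch of one common federated scheme, weighted averaging of locally trained parameters. It is an illustration only: the one-parameter model y = w * x, the function names, the learning rate and the two toy datasets are all assumptions, and a real deployment would add secure aggregation and many other safeguards.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave their site


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps on one site's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w: float, sites: List[Dataset], rounds: int = 50) -> float:
    """Each round, sites train locally and only their updated weights are averaged."""
    total = sum(len(d) for d in sites)
    for _ in range(rounds):
        local = [(local_update(w, d), len(d)) for d in sites]
        # The coordinator sees model weights and sample counts, never raw records.
        w = sum(w_site * n for w_site, n in local) / total
    return w


# Toy private datasets held by two hypothetical hospitals (roughly y = 2x).
hospital_a = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
hospital_b = [(4.0, 8.2), (5.0, 9.8)]

w_global = federated_average(0.0, [hospital_a, hospital_b])
```

The shared model ends up close to the slope that fits both datasets, even though neither hospital ever sends its records to the other or to the coordinator.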

Due to cases like these and GMV’s own experience with its clients, the company has always considered data privacy to be a crucial factor. So much so that GMV is now taking part in the project MPC-Learning: Secure and Protected Machine Learning by Secret Sharing. Co-funded by the R&D department of GMV’s Secure e-Solutions sector and the Spanish Ministry of Economic Affairs and Digital Transformation, the project focuses on the development of mathematical and computational techniques capable of numerical calculation without the need for sharing data.
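As a rough illustration of how a calculation can be carried out without sharing data, the sketch below uses additive secret sharing, a textbook building block of secure multi-party computation. It is not a description of the MPC-Learning protocol itself: the modulus, the function names and the example counts are assumptions chosen for the example.

```python
import secrets
from typing import List

MODULUS = 2**61 - 1  # all arithmetic is done modulo this (assumed) prime


def share(value: int, n_parties: int) -> List[int]:
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recombine shares; any proper subset of them reveals nothing about the value."""
    return sum(shares) % MODULUS


def secure_sum(private_values: List[int]) -> int:
    """Compute the total of the parties' inputs without any party revealing its own."""
    n = len(private_values)
    # Party i splits its value and sends share j to party j.
    distributed = [share(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return reconstruct(partial_sums)


# Hypothetical patient counts held privately by three hospitals.
total_patients = secure_sum([120, 75, 310])
```

Each hospital only ever sees random-looking shares from the others, yet the published partial sums recombine to the correct total.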


Authors: Luis Porras Díaz and Juan Miguel Auñón
