The manager of HM Hospitales recently announced that the group has made available to the scientific community 2157 anonymized medical records of Covid-19 patients treated in its hospitals. A commendable initiative that nonetheless raises a couple of questions:
- Whether personal- or confidential-data anonymization is really a privacy guarantee.
- Whether publishing anonymized databases is nowadays the best way of helping the scientific community build accurate machine learning models to make headway in research, in this case biomedical research.
An anonymized database is prone to what are called re-identification attacks: attempts to trace ostensibly anonymous records to the records of another, related database or data source and thereby extract confidential information. For example, two University of Texas researchers managed to de-anonymize Netflix viewers’ movie ratings in a dataset published by the company for a competition designed to improve its recommendation system. The technique rests on a simple idea: in a movie dataset with a huge number of attributes, very few users give the same ratings to the same films. Given that a user’s ratings are unique, or almost unique, it should not be too difficult to identify that user with only a little auxiliary information obtained from another source.
The article explains that a high-dimensional dataset like Netflix’s greatly raises the chances of de-anonymizing a record while slashing the amount of auxiliary information required to do so; it also makes the de-anonymization algorithms robust to perturbations in the data and to erroneous auxiliary information. The researchers demonstrated this by cross-referencing the Netflix ratings with the Internet Movie Database (IMDb), where many Netflix subscribers had also posted ratings of the movies they had seen; they thereby managed to trace IMDb user profiles, often carrying real names, to those users’ theoretically private Netflix ratings. This proved possible even when a subscriber had published very few IMDb ratings and those bore only a rough resemblance to the same subscriber’s Netflix ratings.
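The linkage idea described above can be sketched in a few lines of Python. The sketch below is a hypothetical toy in the spirit of the researchers’ technique, not their actual algorithm: all records, tolerances and the separation margin are invented for illustration. It scores each anonymized record against a noisy auxiliary record and only claims a match when the best score clearly stands out from the runner-up.

```python
# Toy linkage attack: match a noisy auxiliary record (e.g. public IMDb
# ratings) against "anonymized" rating records. All data is invented.

def similarity(aux, record, rating_tol=1, date_tol=14):
    """Count auxiliary (movie -> (rating, day)) entries approximately
    matched by a candidate record, tolerating small perturbations."""
    score = 0
    for movie, (rating, day) in aux.items():
        if movie in record:
            r, d = record[movie]
            if abs(r - rating) <= rating_tol and abs(d - day) <= date_tol:
                score += 1
    return score

def best_match(aux, dataset, margin=1.5):
    """Rank anonymized records by similarity; claim a match only when
    the top score clearly dominates the runner-up."""
    scores = sorted(((similarity(aux, rec), uid)
                     for uid, rec in dataset.items()), reverse=True)
    top, second = scores[0], scores[1]
    if second[0] == 0 or top[0] >= margin * second[0]:
        return top[1]
    return None  # no sufficiently distinctive match

# Anonymized dataset: user id -> {movie: (rating, day of rating)}
anon = {
    "u1": {"A": (5, 10), "B": (3, 40), "C": (4, 90)},
    "u2": {"A": (1, 12), "D": (5, 30)},
    "u3": {"B": (2, 45), "E": (4, 70)},
}
# Auxiliary info from a public source, slightly perturbed
aux = {"A": (4, 12), "C": (4, 100)}
print(best_match(aux, anon))  # → u1
```

Note how only two approximate, noisy ratings suffice to single out "u1": high dimensionality makes most records nearly unique, which is exactly the property the article exploits.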
A well-known case in the medical sphere was the disclosure of the medical record of the Governor of Massachusetts, after an MIT graduate student, Latanya Sweeney, cross-referenced an anonymized medical database with the voter roll of Cambridge, in that same state. The voter roll contained, among other things, the name, address, ZIP code, date of birth and gender of Cambridge’s 54,000 voters at the time, distributed across seven ZIP codes. Combining this information with the records of the anonymized database, Sweeney found the governor’s medical record with the greatest of ease: only six people in Cambridge shared his date of birth; three of them were men, and only one of those, the governor himself, lived in his ZIP code. The article published under the title “The ‘Re-Identification’ of Governor William Weld’s Medical Information” reviews this case, while pointing out that the re-identification succeeded partly because the governor was a public figure who had undergone a highly publicized hospitalization (he collapsed at a public event and the footage was broadcast on every TV channel). Nonetheless, it is highly likely that the same procedure would also work for any well-known person, or for anyone who shares too much information on the internet.
Does this mean we should renounce the use of anonymized data for scientific research?
Probably not, or not yet. As things stand, re-identification does not yet seem feasible on a massive scale across all the records of any anonymized database. And although there is by now a host of studies showing re-identification in particular circumstances, few would call this an excessive price to pay for the huge scientific advances that the exchange of anonymized medical datasets makes possible. It does serve as food for thought, however: if we want to share our datasets in the interests of medical research, we should choose the anonymization technique carefully. It also shows that privacy may not be guaranteed, or simply that the database in question is not suitable for anonymized publication; it is more than likely that new techniques will appear in the future that disclose all or part of the information we wanted to hide.
Just maybe, for that very reason, the time has come to consider alternatives to sharing data. Beyond the shortcomings of anonymization, this idea gains force if we consider the following question: wouldn’t it be better to move to a cooperative scenario in which each hospital, group or organization works together in a federated learning network, instead of each publishing its own anonymized database? Federated learning is a privacy- and confidentiality-preserving distributed computation model: the machine learning models are taken to where the data is, rather than the data being pooled into a single, centralized dataset. A collaboration of this type not only sidesteps the shortfalls of database anonymization and the legal constraints on sharing clinical data, but also brings much more data into play (not just the 2157 medical records shared by HM Hospitales), thereby yielding more accurate models.
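A minimal sketch can make the federated idea concrete. The example below is a generic federated-averaging toy, not GMV’s system or any particular framework: each simulated “hospital” fits a linear model on its own synthetic data, and only the model coefficients (never the patient records) travel to a coordinator, which averages them weighted by local sample counts.

```python
# Minimal federated-averaging sketch with synthetic data.
# Only model coefficients leave each site; raw records never move.
import numpy as np

def local_fit(X, y):
    """Ordinary least squares on one site's private data."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, len(y)

def federated_average(updates):
    """Coordinator step: average coefficient vectors, weighted by
    each site's number of samples."""
    total = sum(n for _, n in updates)
    return sum(w * (n / total) for w, n in updates)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

sites = []
for n in (50, 80, 30):  # three hospitals with different amounts of data
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

global_w = federated_average([local_fit(X, y) for X, y in sites])
print(np.round(global_w, 2))  # close to [2., -1.]
```

Real deployments iterate this exchange over many rounds with more complex models and add protections on the updates themselves (e.g. secure aggregation), but the division of labor is the same: computation travels to the data, and only aggregates travel back.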
Because of cases like these, and its own experience with its clients, GMV has always considered data privacy a crucial factor. So much so that the company is now taking part in the project “MPC-Learning: Aprendizaje automático seguro y protegido mediante compartición de secretos” (MPC-Learning: secure and protected machine learning by secret sharing). Co-funded by the R&D department of GMV’s Secure e-Solutions sector and the Spanish Ministry of Economic Affairs and Digital Transformation (Ministerio de Asuntos Económicos y Transformación Digital), the project focuses on developing mathematical and computational techniques capable of numerical calculation without the need to share data.
Click here to find out more about MPC-Learning, GMV’s alternative solution
Authors: Luis Porras Díaz and Juan Miguel Auñón
The authors’ views are entirely their own and may not reflect the views of GMV