Data vs. Privacy in the time of COVID-19
How to protect fundamental rights amidst fear and uncertainty
Humanity is currently facing a global crisis caused by COVID-19. The silver lining of this tragic situation is that everybody is trying their best to contribute, according to their means and skills, and of course, by staying home.
We, the authors of this post, know next to nothing about epidemiology, virology or public health interventions. The little we do know is that one of the best ways to slow and eventually stop the chain of contagion is to identify those who are infected and retrace the contacts they had and the places they visited, so that people who have been in contact with them (either directly or via the environment) can be alerted, tested and, in the case of infection, quarantined.
Given the tragic circumstances, more and more voices are questioning the use of surveillance-type data to fight the pandemic. The underlying rationale is that Telcos and Internet service providers such as Google or Facebook already have the location history of most people. If they were to release the data, or provide unrestricted access to governments, it would be possible to trace back the contacts of those infected and, consequently, to have more fine-grained testing, interventions and quarantines.
While this reasoning is undeniably correct, it would seriously compromise the privacy of the whole population. Rights are gained little by little but can be lost in a single blow. What we accept today out of fear and solidarity might have serious ramifications for our future society.
Knowing the privacy implications, some advocate anonymizing or aggregating the aforementioned location history data. While this might preserve privacy to a certain degree, it reduces the utility of the data. Aggregates over neighborhoods are better than nothing, and they can be very useful for macroscopic models such as predicting the evolution of the epidemic curve, but they do not offer the resolution required for fine-grained interventions. The trade-off is between how much privacy is preserved and the final utility of the data.
Location history data is extremely powerful, and there has been a surge of cell-phone apps that track their users. A very successful one is Singapore’s state-sponsored app, TraceTogether, which tracks the user’s contacts (via Bluetooth proximity). To minimize the impact on privacy, the data is only made available to the authorities if the person is infected.
We believe that this approach is sound, for it provides fine-grained data without having to partner with, or opt into, the realm of state or corporate surveillance. Since we have plenty of experience and know-how in privacy-preserving data collection, we felt compelled to contribute in this area. Here goes our proposal.
A cell-phone app should track location (via GPS) and close contacts (via Bluetooth) and store this information locally, on the device. Both mechanisms are needed since Bluetooth only records simultaneous contacts, and infection can occur asynchronously via the environment (process of shedding, droplet on surface, self-inoculation). GPS alone is not good enough for very close contacts, due to its lack of precision, and it does not provide reliable information about indoor activity.
If the person turns out to be COVID-19 positive or is clearly symptomatic, she should flag herself as infected. Then and only then, parts of the data of her activity will be sent to a collector service.
What parts of the data should be sent? Not all of it (as is the case in TraceTogether); otherwise, we would be compromising the privacy of that person. Instead, she should only send the locations in which she has spent more than one minute, along with a timestamp. Locations and timestamps should be bucketed to minimize the potential for record linkage by the collector. Bluetooth contacts should not be sent at all. The Bluetooth id should also be sent.
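The bucketing of locations and timestamps could be sketched as follows. This is only an illustration under stated assumptions: the 20-meter cell size, the one-hour time bucket, and the function names `bucket_location` and `bucket_timestamp` are hypothetical choices, not part of the proposal.

```python
import math

# Assumed bucket sizes; the proposal does not fix them.
CELL_METERS = 20                 # side of a square location "box"
TIME_BUCKET_SECONDS = 3600       # one-hour time buckets
METERS_PER_DEGREE_LAT = 111_320  # rough average, good enough for a sketch


def bucket_location(lat, lon, cell_m=CELL_METERS):
    """Map a GPS fix to integer (row, col) grid indices about cell_m wide.

    The longitude step is derived from the latitude of the row's center,
    so every point in the same row uses the same grid. Not meant for use
    near the poles, where the longitude step degenerates.
    """
    lat_step = cell_m / METERS_PER_DEGREE_LAT
    row = math.floor(lat / lat_step)
    row_center_lat = (row + 0.5) * lat_step
    lon_step = cell_m / (
        METERS_PER_DEGREE_LAT * math.cos(math.radians(row_center_lat))
    )
    col = math.floor(lon / lon_step)
    return row, col


def bucket_timestamp(ts_seconds, bucket_s=TIME_BUCKET_SECONDS):
    """Round a Unix timestamp down to the start of its time bucket."""
    return int(ts_seconds // bucket_s) * bucket_s
```

Coarser buckets make it more likely that several people fall into the same box, at the cost of spatial and temporal resolution.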
To prevent the collector from tracking users, this data should not be sent in a single message. Instead, it should be split into individual, atomic messages that are sent at random time intervals over an extended period of time (hours). Needless to say, messages should not contain any user identifier, and they should be sent through Tor to avoid network-level identifiers such as the IP address. This way, we obfuscate the activity of the person to a certain degree.
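A minimal sketch of the splitting step, assuming each record is a `(cell, time_bucket)` pair and a six-hour sending window (both assumptions, as is the name `schedule_messages`). It only builds the sending plan; the transport through Tor is not shown.

```python
import random


def schedule_messages(records, window_seconds=6 * 3600, rng=None):
    """Turn records into atomic messages, each with its own random delay.

    records: iterable of (cell, time_bucket) pairs (assumed shape).
    Returns a list of (delay_seconds, message) sorted by delay. Each
    message carries one location/time pair and no user identifier.
    """
    # SystemRandom draws from the OS entropy source; a seeded
    # random.Random can be passed in for reproducible tests.
    rng = rng or random.SystemRandom()
    plan = [
        (rng.uniform(0, window_seconds), {"cell": cell, "time_bucket": bucket})
        for cell, bucket in records
    ]
    plan.sort(key=lambda item: item[0])
    return plan
```

Because every delay is drawn independently, the order in which messages arrive says nothing about the order in which the places were visited.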
It is to be expected that in densely populated areas many people will coincide in the same “box”, and because only stationary data should be sent, it should not be possible to know that the person who was at location <x_1, y_1> at time t_1 is the same person who was at location <x_2, y_2> at time t_2. In the extreme case, if there is only one infected person in a city, the collector could trivially infer that all received data comes from the same person. In practical terms, however, given that the number of infected people is on the order of thousands, it is safe to assume that collisions will occur.
Last but not least, it is very important to minimize trajectories, so only non-transient locations should be sent. We want to report the home location and the supermarket location, but not the path between home and supermarket, because that would link the two locations to the same person, which is something to avoid.
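One way to keep only non-transient locations is to scan the locally stored fixes and retain the cells where the device stayed for more than a minute. The sketch below assumes the fixes are time-ordered `(timestamp_seconds, cell)` pairs; that shape and the name `stationary_visits` are illustrative assumptions.

```python
def stationary_visits(fixes, min_dwell_s=60):
    """Return (cell, arrival_timestamp) for every stay longer than min_dwell_s.

    fixes: time-ordered list of (timestamp_seconds, cell) pairs (assumed).
    Cells that were only passed through, such as the path between two
    places, are dropped.
    """
    visits = []
    run_cell = run_start = run_end = None
    for ts, cell in fixes:
        if cell == run_cell:
            run_end = ts  # still in the same cell, extend the stay
            continue
        # The cell changed: close the previous stay if it was long enough.
        if run_cell is not None and run_end - run_start > min_dwell_s:
            visits.append((run_cell, run_start))
        run_cell, run_start, run_end = cell, ts, ts
    # Close the last stay.
    if run_cell is not None and run_end - run_start > min_dwell_s:
        visits.append((run_cell, run_start))
    return visits
```

For a trace that spends 90 seconds at home, crosses two street cells and then spends 80 seconds in a shop, only the home and shop cells are returned.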
Building the Risk Heatmap Based on Collected Data
The data provided by infected or symptomatic people is very valuable and, if private enough, could be opened to any institution able to put it to good use. One such category of institutions would be public health policy specialists, who could decide which hot-spots require cleaning interventions, or which areas are becoming a threat and need to be contained.
Another category would be epidemiologists, who could calculate the risk of contagion of a given area at the level of square meters. The data provided could, for example, inform specialists that in a given 20x20-meter area there have been 7 infected people in the last 48 hours. From this they could derive a function of the contagion rate depending on the area and the number of infected people over time. Such models, and there are many, could help close the circle, as they could be executed on the non-infected population to issue a risk-of-contagion metric.
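The per-area count from the example above could be derived from the collected messages roughly like this. It is a sketch under assumptions: reports are taken to be `(cell, timestamp_seconds)` pairs, and `build_heatmap` is a hypothetical name.

```python
from collections import Counter


def build_heatmap(reports, now, window_s=48 * 3600):
    """Count reports per cell that fall within the last window_s seconds.

    reports: iterable of (cell, timestamp_seconds) pairs (assumed shape).
    Messages carry no identity, so this counts reports, not distinct
    people: one person reported in a cell for several time buckets is
    counted once per bucket.
    """
    cutoff = now - window_s
    return Counter(cell for cell, ts in reports if cutoff <= ts <= now)
```

The resulting counts would be the input to whatever contagion model the epidemiologists provide; the counting itself makes no epidemiological claim.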
The app users could download the latest risk heat-map (sharded by region) as well as the latest models, and proceed to a self-evaluation of their risk, not from aggregates but from the fine-grained location information stored locally on their devices. In layman’s terms, instead of sending the data to “someone” to run a model, we could run the model directly on the data stored locally. No one needs to know that you left home to go to the supermarket and crossed paths with 5 people on the way. But thanks to the models developed by epidemiologists (not by us), the cell-phone could evaluate the risk of contagion of its owner based on their contact and location history.
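The local evaluation could look like the sketch below, which matches the device's private history against a downloaded heat-map without sending anything out. The data shapes and function names are assumptions, and `naive_risk_score` is a deliberately simple placeholder, not an epidemiological model; a real model would come from the specialists.

```python
def exposure_by_cell(history, heatmap):
    """Return {cell: infected_reports} for cells this device visited.

    history: list of (cell, timestamp_seconds) kept on the device (assumed).
    heatmap: mapping cell -> number of infected reports, as downloaded.
    """
    visited = {cell for cell, _ts in history}
    return {cell: heatmap[cell] for cell in visited if heatmap.get(cell, 0) > 0}


def naive_risk_score(history, heatmap):
    """Placeholder score: total infected reports over the visited cells.

    Only a stand-in for the downloaded model; it ignores timing,
    duration and Bluetooth contacts.
    """
    return sum(exposure_by_cell(history, heatmap).values())
```

Everything runs on the device: the heat-map is public, the history never leaves the phone, and only the owner sees the result.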
This ability would allow people to make well-informed decisions for themselves. If the risk of contagion given their latest activity is high, they might decide to self-quarantine for a few days. Or better yet, they might qualify to receive a real COVID-19 test from the authorities. Sadly, tests are scarce and must be prioritized. Detecting asymptomatic cases is key to breaking the contagion chain, since healthy carriers are the most likely to spread the disease.
The end result would be almost the same as what could be achieved with access to the full location history of the population (via Google or Telco data) in conjunction with the identity of the infected people. In other words, it offers nearly the same utility and value, but with far fewer privacy side-effects.
Of course, this approach is more cumbersome to develop: a) it requires high adoption, as almost everyone should use the app; b) development and deployment are technically more challenging; c) it is susceptible to people misreporting, since the system can tolerate some noise but can also be attacked with false information; and d) it cannot be applied retrospectively, as the application would need several days of data before being useful.
Note that a similar methodology could be applied to already existing location datasets, which could help mitigate the issues of adoption and the initial lack of data. In a nutshell, the institutions holding fine-grained location data should be able to run the sharing component and the risk assessment on behalf of their users, acting as the cell-phone app we described. Unfortunately, that would entail such institutions learning the identity of the infected people, which is also extremely sensitive. Still, it is an approach that would help bootstrap the system with limited impact on privacy.
Do we, the authors of this post, have such an app and system? No. But we have the knowledge, experience and resources to build it, quickly. If any [serious] institution thinks that this would be beneficial, then we can provide it. Our only demand would be that the system be open for any city or country to use.
A system like this is not a silver bullet. Neither tech nor data can be the only solution; they can only be tools for the real problem solvers: doctors, nurses, public health officials, logistics workers. Thank you for your hard work, keep fighting. We will do our part and stay at home.
As a matter of fact, Israel considered choosing this path as the cabinet unanimously approved the use of technology used for counter-terrorism to be deployed on civilians to monitor the spread of COVID-19. Read more: “Better Health Through Mass Surveillance?”, “Israel to track mobile phones of suspected coronavirus cases”, and “Israel Govt’s New ‘Shield’ App Tracks Your Coronavirus Exposure”.