U.S. epidemiologists say data secrecy on COVID-19 cases cripples intervention strategies | Science
Science’s COVID-19 reporting is supported by the Pulitzer Center and the Heising-Simons Foundation.
California was a COVID-19 success story—until suddenly it wasn’t. Early in the pandemic, the state seemed to have the new coronavirus under control, but the pandemic has begun to ride a wave there, with records set in daily cases several times this month, and deaths on the rise.
California officials whose COVID-19 responses were once hailed as enlightened are now receiving criticism—and some of the sharpest is coming from scientists seeking to help guide the state’s fight against the virus. Since April, epidemiologists from Stanford University and several University of California campuses have sought detailed COVID-19 case and contact-tracing data from state and county health authorities for research they hope will point to more effective approaches to slowing the pandemic. “It’s a basic mantra of epidemiology and public health: Follow the data” to learn where and how the disease spreads, says Rajiv Bhatia, a physician and epidemiologist who teaches at Stanford and is among those seeking the California data.
But the agencies have refused requests filed from April through late June, Science has learned. They cited multiple reasons including workload constraints and privacy concerns—even though records can be deidentified, and federal health privacy rules have been relaxed for research during the pandemic. As a result, Bhatia says, “In 4 months of the epidemic, collecting millions of records, no one in California or at the CDC [U.S. Centers for Disease Control and Prevention] has done the basic epidemiology.” Other states also fail to share highly specific information for their COVID-19 cases, which some scientists warn is hampering efforts to identify targeted measures that could stem the spread of SARS-CoV-2 without full-scale lockdowns.
Bhatia and other epidemiologists, in California and across the country, are especially aggrieved after recent news reports revealed states are feeding the same data the researchers are seeking to a federal contractor, Palantir Technologies, that has drawn criticism for data work supporting Immigration and Customs Enforcement deportations. For a data platform dubbed HHS Protect, Palantir is aggregating information on the spread of the new coronavirus on behalf of the U.S. Department of Health and Human Services (HHS), drawing on more than 225 data sets, including demographic statistics, community-based tests, and a wide range of state-provided data. (This week, sparking widespread concern among public health experts, epidemiologists, and others, HHS also directed hospitals to provide data on COVID-19 cases and patient information to the Palantir system—largely via a second contractor, TeleTracking Technologies—rather than to CDC as they have for decades; see sidebar.)
Aggregated COVID-19 case and death data by county, and often by age and race, is publicly available in much of the country. But few locales link those cases and deaths to other information typically collected on the individuals, such as their ZIP codes, occupations, living conditions, and known contacts with others ill with COVID-19. And according to the COVID Tracking Project, a volunteer organization launched by The Atlantic, no U.S. state or territory publicly provides a complete set of even such basic COVID-19 measures as total and pending tests; deaths and recovered patients; and current and cumulative hospitalizations, patients in intensive care units, and those using ventilators.
Bhatia and colleagues say detailed COVID-19 case data could be mined to find the combinations of factors most responsible for the “biggest bundles of hospitalizations and deaths.” He hypothesizes the data would, for example, confirm that even as commerce opens up, hospitalizations and deaths still primarily emerge from widely cited flashpoints, including elderly care facilities and large households that include infected essential workers who are asymptomatic or have mild symptoms and pass the disease to relatives who have risk factors making them more vulnerable to severe illness.
“We think you can be more strategic on your interventions if you know where exposures actually occur,” says Jeffrey Klausner, a physician and epidemiologist at the University of California, Los Angeles, who is also seeking the California data. For example, case data might confirm patchy evidence suggesting indoor dining is risky, but parks and beaches are generally safe. If so, reopening outdoor settings with reasonable precautions might boost the economy and allay fears that severe risk of infection is ubiquitous.
As the pandemic evolves, regular reassessment of granular data on cases is vital, says Natalie Dean, a University of Florida (UF) biostatistician. “We have this whole new world now, where we are opening things back up. We have this shifting set of environments—indoor dining, bars, open retail buildings, offices, gyms. When we think of what are pressure points, there’s a lot we just don’t know yet. … We have to have ‘a learning architecture’ in place where there’s always some level of reflection.”
In the absence of clear, localized data from public authorities, some clinics in California have relied on their own research. After conducting thousands of COVID-19 tests in Oakland, “We have been able to pinpoint where some of the outbreaks are, both geographically and in terms of setting,” leading to highly targeted health education and testing outreach, says Noha Aboelata, a physician who heads the city’s Roots Community Health Center, which primarily serves people of color in underserved communities. Without neighborhood-level intelligence for public health outreach, you get “a one-size-fits-all solution that might exacerbate the problem,” she says. “Withholding the information is going to lead to deaths.”
In a written response to questions from Science, the California Department of Public Health said even deidentified data “can be used alone or in combination with publicly available information to identify an individual.”
Caitlin Rivers, an epidemiologist at Johns Hopkins University’s Center for Health Security, calls reidentification a valid concern, but argues it would happen so rarely that the risk shouldn’t justify blanket denials of data requests during the pandemic. “There’s a lot of space in the middle that we haven’t really explored,” she adds. For example, to obviate some privacy concerns, Bhatia’s group requested case reports giving 10-year age ranges rather than specific ages, the week of COVID-19 onset rather than a specific date, and an occupational group rather than specific occupation.
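The kind of coarsening Bhatia’s group proposed can be illustrated with a short sketch. It is not the group’s actual method or any agency’s schema: the field names (`age`, `onset_date`, `occupation`) and the occupation-to-group mapping are hypothetical, chosen only to show how specific values could be replaced by 10-year age ranges, onset weeks, and broad occupational groups.

```python
from datetime import date

# Hypothetical mapping from specific occupations to broad groups (illustrative only).
OCCUPATION_GROUPS = {
    "nurse": "health care",
    "line cook": "food service",
    "bus driver": "transportation",
}


def age_band(age: int) -> str:
    """Return a 10-year age range such as '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def onset_week(onset: date) -> str:
    """Return the ISO year and week of onset, e.g. '2020-W27'."""
    iso_year, iso_week, _ = onset.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def generalize(record: dict) -> dict:
    """Replace specific fields in a case record with coarser ones.

    The input keys are assumptions made for this sketch, not a real schema.
    """
    return {
        "age_range": age_band(record["age"]),
        "onset_week": onset_week(record["onset_date"]),
        "occupation_group": OCCUPATION_GROUPS.get(record["occupation"], "other"),
    }
```

Coarsening of this sort lowers, but does not eliminate, the reidentification risk the state describes; how wide each range should be is a judgment call for the data holder.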
As a case study in the value of richer data, Bhatia turned to Florida, which compared with California offers fairly detailed information on each of the roughly 316,000 COVID-19 cases recorded there so far. The data set enabled him to graph, week by week, infections by age and whether the source of transmission was known. He found that early in the pandemic, the source was known for 80% of children, and 50% to 60% of adults.
As Florida relaxed restrictions on businesses and other aspects of life, known sources of transmission remained at similar levels, even though casual contact with strangers was apparently increasing. Because some of the unknown sources of transmission were certainly asymptomatic or mildly symptomatic family or friends, such a finding suggests crowded beaches are playing a smaller role in Florida’s surge in infections than, say, increased numbers of large family gatherings at home or repopulated offices. “If people know that 50% or 60% of infections are resulting from people they know, including family, friends, and co-workers, they may better interpret risk,” Bhatia says.
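A week-by-week tabulation like the one Bhatia describes could be computed along the following lines. This is a minimal sketch under stated assumptions, not his analysis: the record keys (`week`, `age_group`, `source_known`) are hypothetical and do not reflect the actual layout of Florida’s case file.

```python
from collections import defaultdict


def known_source_share(cases):
    """Map (week, age_group) to the fraction of cases with a known source.

    Each case is a dict with the hypothetical keys 'week', 'age_group',
    and 'source_known' (a bool); these names are assumptions for this sketch.
    """
    totals = defaultdict(int)
    known = defaultdict(int)
    for case in cases:
        key = (case["week"], case["age_group"])
        totals[key] += 1
        if case["source_known"]:
            known[key] += 1
    return {key: known[key] / totals[key] for key in totals}
```

Plotting these fractions over successive weeks for each age group would show whether the share of infections traced to a known contact shifts as restrictions change.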
Even Florida’s data exclude key data points that some researchers view as essential to map and respond to the pandemic most effectively—including ZIP codes; more complete racial designations; and specifics on cases in long-term care facilities, jails, and prisons. That hampers targeted responses within the state, says Thomas Hladish, an infectious disease researcher at UF who consulted extensively with state officials about COVID-19 data from March until this month. “A lot of the inconsistencies that you see are reasonably explained by well-intentioned people who are scrambling to reinvent [data fields and formats] on the fly without the appropriate technical background.” The Miami Herald also recently reported that municipal officials have not been able to get the state to provide case details they need to attack local outbreaks. The Florida Department of Health did not respond to Science’s requests for comment.
Epidemiologists praise more forthcoming agencies, especially those that use their own data to guide response. The New York City Department of Health and Mental Hygiene posts unusually complete, continually updated data sets on COVID-19 on its website—showing detailed information on tests, cases, and deaths for 177 discrete neighborhoods—and uses them to map hot spots of the disease. It offers probable and confirmed deaths by age, race or ethnicity, underlying conditions, and other factors. One clear finding: Lower income areas, with a higher concentration of large households, suffered from COVID-19 at many times the rate of most wealthy areas.
The city’s health commissioner, Oxiris Barbot, says the system was crucial in decreasing cases by about 94% and deaths by about 98% since the April peak. “The transparency in data helped to paint a picture of how acute a situation we were in and the degree to which we needed New Yorkers to comply with what we were asking them to do,” she says. “It helped highlight as early as possible the ways in which the virus was ravaging Black and brown communities.” And the granular data allowed a calibrated response—including offers of hotel rooms to help people living in crowded conditions isolate when diagnosed with COVID-19. “Had it not been for that data analysis we would have been much slower in the response, and … many more lives would have been lost,” Barbot says.
“These are the right type of efforts using the right type of data,” Bhatia says. Figuring out how to stop the pandemic is “the biggest and most impactful policy decision we’ve seen in our lifetimes,” he adds. But in California and elsewhere, “We’re trying to predict the future without analyzing the data that’s in front of us. That’s a failure.”
This story was supported by the Science Fund for Investigative Reporting.