If you have ever felt a chill and wondered whether someone, somewhere, could see your search history, now you know: yes, they can. But they’re using their powers for good. Microsoft scientists have published a study showing that by analyzing a large volume of anonymized queries from the Bing search engine, they may be able to identify internet users who are suffering from pancreatic cancer, even before the searcher has been diagnosed with the disease.
“We asked ourselves, ‘If we heard the whispers of people online, would it provide strong evidence or a clue that something’s going on?’” said Dr. Eric Horvitz, a coauthor of the study. Horvitz, Dr. Ryen White, also of Microsoft, and Columbia grad student John Paparrizos teamed up to study Bing searches indicating that someone had been diagnosed with pancreatic cancer. Starting from the point where queries suggesting the diagnosis appeared, they worked backward in time, hunting through each sample history for earlier search terms that could have shown the Bing user was experiencing symptoms.
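To make the "work backward" idea concrete, here is a minimal Python sketch of it, not the researchers' actual pipeline. The phrase lists, the `earlier_symptom_queries` function, and the sample history are all hypothetical placeholders; the article does not give the study's real term lists or matching rules.

```python
from datetime import date

# Hypothetical placeholder phrases, NOT the study's actual term lists.
DIAGNOSIS_MARKERS = ("just diagnosed with pancreatic cancer",)
SYMPTOM_TERMS = ("abdominal pain", "dark urine")


def earlier_symptom_queries(history):
    """Return symptom-like queries made before the first diagnosis-like query.

    `history` is a list of (date, query) tuples for one anonymized user.
    An empty list is returned if no diagnosis-like query is found.
    """
    ordered = sorted(history)  # oldest first
    for i, (_, query) in enumerate(ordered):
        if any(marker in query.lower() for marker in DIAGNOSIS_MARKERS):
            # Look only at what was searched before this point in time.
            return [
                (day, q)
                for day, q in ordered[:i]
                if any(term in q.lower() for term in SYMPTOM_TERMS)
            ]
    return []


# Made-up history for illustration only.
sample_history = [
    (date(2015, 6, 2), "I was just diagnosed with pancreatic cancer what now"),
    (date(2015, 2, 10), "abdominal pain after eating"),
    (date(2015, 3, 5), "cheap flights to denver"),
    (date(2015, 4, 20), "dark urine causes"),
    (date(2015, 7, 1), "abdominal pain relief"),
]
early_signals = earlier_symptom_queries(sample_history)
```

A real system would presumably need far more careful matching than substring checks, but the shape is the same: find the diagnosis signal first, then mine the history that precedes it.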
The researchers believe that patterns in those early searches can be red flags that warn of major health problems down the road. Writing in the Journal of Oncology Practice, they reported that they could identify between 5 and 15 percent of pancreatic cancer cases while keeping false positive rates as low as one in 100,000. This is similar to how rapid strep tests work: they don’t catch strep every time, but when they do report a positive result, it is very likely to be strep and not something else.
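The trade-off between catching few cases and rarely raising a false alarm can be worked through with a little arithmetic. The sketch below uses the 5 and 15 percent detection rates and the one-in-100,000 false positive rate reported above; the prevalence and population size are assumptions chosen purely for illustration, and `screening_outcomes` is a hypothetical helper, not anything from the study.

```python
def screening_outcomes(sensitivity, false_positive_rate, prevalence, population):
    """Estimate true positives, false positives, and the share of flags that are real.

    sensitivity: fraction of actual cases that get flagged
    false_positive_rate: fraction of healthy people who get flagged
    prevalence: fraction of the population with the disease (assumed here)
    """
    sick = population * prevalence
    healthy = population - sick
    true_pos = sick * sensitivity
    false_pos = healthy * false_positive_rate
    # Positive predictive value: of everyone flagged, how many are truly sick.
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv


# ASSUMPTIONS for illustration only: 1 in 10,000 searchers affected,
# one million searchers in total.
ASSUMED_PREVALENCE = 1 / 10_000
ASSUMED_POPULATION = 1_000_000

low = screening_outcomes(0.05, 1 / 100_000, ASSUMED_PREVALENCE, ASSUMED_POPULATION)
high = screening_outcomes(0.15, 1 / 100_000, ASSUMED_PREVALENCE, ASSUMED_POPULATION)

for label, (tp, fp, ppv) in (("5% detection", low), ("15% detection", high)):
    print(f"{label}: ~{tp:.0f} true flags, ~{fp:.0f} false flags, PPV {ppv:.0%}")
```

The point to take away is that how trustworthy a flag is depends heavily on how rare the disease is among searchers, which is why the assumed prevalence matters as much as the reported rates.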
With a background in both medicine and computer science, Dr. Horvitz said he began looking into this area after a phone conversation with a friend who had described symptoms. Based on their conversation, Dr. Horvitz advised his friend to seek medical attention. The friend was, in fact, diagnosed with pancreatic cancer, and died several months later.
While the anonymization means that the researchers can’t reach out to the individuals whose data they used, it’s clear that the next steps are practical and logistical. Scientists must learn how to use big data without mistaking quantity for quality of information. Refining the way we handle such biostatistics could enable a whole new class of inexpensive, data-powered health services. “Might there be a Cortana for health some day?” mused Dr. Horvitz.
It makes sense. How many times have you searched for symptoms online rather than go to the expense and trouble of seeing a doctor? This kind of data could be a diagnostic gold mine if we could isolate reliable search patterns. Google has already started surfing this wavefront, but its foray into predictive medicine, Google Flu Trends, mostly served as an example of how not to handle big data: its estimates often ran high. Could that simply reflect how easy it is to Google symptoms, compared to getting medical care? We don’t necessarily know that there’s a 1:1 relationship between people who search for flu symptoms and people who have the flu. It seems like more eyes on the problem, yet again, is the answer.
On the other hand, weren’t we just asking who guards the data? It seems like there are obvious HIPAA implications here. Any such database would be a tantalizing target for black hats and commercial interests. Is Minority-Report-esque precognition based on your search history something that you can consent to with a clickthrough TOS?
“I think the mainstream medical literature has been resistant to these kinds of studies and this kind of data,” Dr. Horvitz said. “We’re hoping that this stimulates quite a bit of interesting conversation.” Next they’ll be telling us we should make our browser histories freely available — for science.