Do Computer Scientists Hold the Key to Treating Cancer?

An ACM-sponsored blog post with this title was published in the Huffington Post today. Here is the link. The version published there was accidentally not the latest draft, which appears below.

This ancient assassin, first identified by a pharaoh’s physician, has been killing people for more than 4600 years. As scientists found therapies for other lethal diseases (such as measles, influenza, and heart disease), cancer moved up this deadly list and will soon be #1; 40% of Americans will face cancer during their lifetimes, with half dying from it. Most of us ignore cancer until someone close is diagnosed, but society could instead zero in on this killer by recording massive data to discover better treatments before a loved one is in its crosshairs.

We now know that cancer has many subtypes, but all of them involve unlimited cell growth caused by problems in DNA. Some people are born with precarious DNA, and others acquire it later. When a cell divides, it sometimes miscopies a small amount of its DNA, and these errors can overwhelm a cell’s defenses to cause cancers. Thus, you can get it without exposure to carcinogens. Cigarettes, radiation, asbestos, and so on simply increase the copy error rate. Speaking figuratively, every time a cell reproduces, we roll the dice on cancers, and risky behavior increases cancers’ chances.

Most cancer studies today use partial genomic information and have under 1000 patients. One wonders whether their conclusions wouldn’t improve if they used complete genomes and increased the number of patients by factors of 10-100.

Given cancer’s gravity and nature, shouldn’t scientists be able to decode full genomes inexpensively to fight this dreaded disease better informed? Now they can! The plot below shows the dropping cost of sequencing a genome since 2001.

Moore’s Law, which drives the information technology revolution, delivered a 100-fold improvement in 15 years, yet the cost to sequence a genome has dropped 100,000-fold to $1000 per genome, which many consider the tipping point of affordability.

Dropping cost of sequencing
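To put the two rates side by side, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that both improvements were steady exponentials over the same 15-year window; the function name is made up for this example.

```python
import math


def years_per_doubling(fold_improvement: float, years: float) -> float:
    """Years needed for each 2x improvement, assuming steady exponential progress."""
    return years / math.log2(fold_improvement)


# Figures from the text; the shared 15-year window is an assumption.
moore_doubling = years_per_doubling(100, 15)
sequencing_halving = years_per_doubling(100_000, 15)

print(f"Moore's Law: 2x roughly every {moore_doubling:.2f} years")
print(f"Sequencing cost: halves roughly every {sequencing_halving:.2f} years")
```

Under that assumption, sequencing cost halved about two and a half times as often as Moore’s Law doubled.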

This graph should be a call to arms for computer scientists, as the war on cancer will require Big Data. If the 1.7 million Americans who will get cancer in 2016 were to have their healthy cells and tumor cells sequenced, it would yield 1 exabyte (1 billion times 1 billion bytes) of raw data. The UC Berkeley AMPLab, collaborating with Microsoft Research and UC Santa Cruz, joined the battle in 2011, which we announced in a New York Times essay. We have been championing cloud computing and open-source software ever since.
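As a rough sanity check on the exabyte figure, the sketch below works backward to the raw data volume it implies per sequenced sample. It assumes exactly two samples per patient (one healthy, one tumor); the per-sample size is derived from the numbers above, not a measured value.

```python
EXABYTE = 10**18  # 1 billion times 1 billion bytes

patients = 1_700_000      # Americans expected to get cancer in 2016 (from the text)
samples_per_patient = 2   # assumption: one healthy sample plus one tumor sample

bytes_per_sample = EXABYTE / (patients * samples_per_patient)
print(f"Implied raw data per sample: about {bytes_per_sample / 1e9:.0f} GB")
```

That works out to a few hundred gigabytes of raw data per sample.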

The good news is that the same technology that can decode cancer tumors can identify unknown pathogens, allowing software our collaboration developed to help save a life. A teenager went to medical specialists repeatedly and was eventually hospitalized for five weeks without a successful diagnosis. He was placed in a medically induced coma after developing brain seizures. In desperation, the doctors sent a spinal fluid sample to UCSF for genetic sequencing and analysis. Our program first filtered out the human portion of the DNA data, which was 99.98% of the original three million pieces of data, and then sequenced the remaining pathogen. In just two days total, UCSF identified a rare infectious bacterium. After the boy was treated with antibiotics, he awoke and was discharged. Although our software is only part of this process, previously doctors had to guess the causative agent before testing for a contagious disease. Other hospitals and the Centers for Disease Control and Prevention now use this procedure.
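To see how aggressive that filtering step is, the following sketch computes how many pieces of data remain once the human portion is removed. It uses only the two figures quoted above and treats "three million" as exact, which is an assumption.

```python
from fractions import Fraction

total_reads = 3_000_000                  # "three million pieces of data"
human_fraction = Fraction(9998, 10_000)  # 99.98% identified as human

human_reads = int(total_reads * human_fraction)
remaining_reads = total_reads - human_reads

print(f"Filtered out as human: {human_reads:,}")
print(f"Left for pathogen identification: {remaining_reads:,}")
```

Only a few hundred pieces of data out of three million are left for identifying the pathogen.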

The bad news, which we must change, is that genetic repositories are still a factor of 10-100 short of having enough cancer patients to draw statistically significant conclusions. We need so many patients because there are many cancer subtypes and the tumors within each subtype are notoriously diverse; most are unique, so it takes numerous samples to make real progress. Here are the obstacles to collecting that valuable data, even though the storage itself is already affordable and getting cheaper:

Who would pay? Like the chicken versus the egg debate, we don’t yet have conclusive data that show how genetic information leads to effective therapies for most cancers. Thus, despite lower costs, insurance companies won’t pay for sequencing. Although many believe it would yield bountiful insights, we can’t prove it.
If funding were found, would the hospitals share data? Some hospitals don’t share data to attract more patients and researchers, and many researchers consider the data private property (at least until they publish, and often after) despite funding from our tax dollars. For example, even the editor-in-chief of the New England Journal of Medicine recently referred to interested outsiders as “research parasites” who should not “use the data to try to disprove what the original investigators had posited.”
Even if hospitals and researchers were willing, would they be allowed to share data? While a cancer database will likely lead to breakthroughs, and cancer patients often are eager to donate their data to help others, medical ethicists worry more about patient privacy. Consequently, cancer studies regularly restrict data access to the official investigators of the research grant.

As Francis Collins, Director of the National Institutes of Health, said at the Davos meeting about accelerating progress on cancer:

“We need that Big Data to be accessible. It’s not enough to say that we are in a Big Data era for cancer. We also need to be in a Big Data access era.”

To make genomic data more accessible, the Global Alliance for Genomics and Health was founded in 2013 “to enable the responsible, voluntary, and secure sharing of genomic and clinical data.” While 375 organizations from 37 countries have joined, and its working groups are active, progress has been slow in actually getting organizations to share. Perhaps the main impact thus far is that the community now largely believes that such data will eventually be shared.

To make it so, society not only needs to find the funding and cut through the red tape to populate a million-genome cancer repository, but also needs to draft experts to design and build open-source software that leverages advances in cloud computing, machine learning, and privacy protection to make it useful. Recruitment should be easy, as there’s no more inspiring endeavor than helping save lives, including some you may know.

And the quicker it happens, the better we can fortify ourselves against this ancient assassin.