Systematic bias in the design of several underlying studies raises doubt over whether a serum proteomics test based on those studies can accurately identify ovarian cancer, two independent biostatisticians have argued.
The researchers, both of the University of Texas M.D. Anderson Cancer Center, Houston, report that they have been unable to reproduce the high sensitivity and specificity rates reported in a 2003 study of the technique (J. Natl. Cancer Inst. 2005;97:307–9).
The problem, said Keith A. Baggerly, Ph.D., and Kevin R. Coombes, Ph.D., lies not in the fundamental concept—that cancer-shed proteins in serum may be able to identify patients who have even very early-stage cancer—but in the way the data sets were processed in both the 2003 study and the original 2002 National Cancer Institute (NCI) study upon which it was based.
“We're not saying proteomics doesn't work,” Dr. Baggerly said in an interview. “It may very well work. But these data sets can't be used to say this approach works.”
The method involves using mass spectrometry to display proteins in serum as a series of peaks and valleys of varying strength. A computer-driven mathematical algorithm then finds patterns of peaks unique to the serum of patients with the disease. Several researchers are investigating proteomics' application in ovarian cancer, using different algorithms and spectrometers. All of the decoding work is being performed on three publicly available sets of spectral data, which were generated as part of the original proof-of-concept study by NCI researchers led by Emanuel F. Petricoin III, Ph.D. (Lancet 2002;359:572–7).
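By way of illustration only, the general idea can be sketched in a few lines of Python. The synthetic "spectra" and the off-the-shelf logistic-regression classifier below are stand-ins, not the algorithm or software used in any of the studies discussed here.

```python
# Illustrative sketch only: each spectrum is reduced to intensities at a
# handful of m/z peaks, and a generic classifier is trained to separate
# cancer from control sera.  All numbers here are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 100 spectra x 15 peak intensities per group,
# with a small upward shift at every peak in the "cancer" group.
controls = rng.normal(loc=1.0, scale=0.2, size=(100, 15))
cancers = rng.normal(loc=1.1, scale=0.2, size=(100, 15))

X = np.vstack([controls, cancers])
y = np.array([0] * len(controls) + [1] * len(cancers))   # 0 = control, 1 = cancer

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```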
Dr. Baggerly and Dr. Coombes reanalyzed the data used in a 2003 paper by Wei Zhu, Ph.D., and associates, of the State University of New York at Stony Brook. By using the same NCI data sets—samples from women with ovarian cancer, women with benign ovarian cysts, and healthy controls—but a new protein-recognition pattern, Dr. Zhu achieved perfect discrimination (100% sensitivity, 100% specificity) of patients with ovarian cancer, including early-stage disease, from normal controls (PNAS 2003;100:14666–71). Dr. Zhu's results were even better than those originally reported by Dr. Petricoin and colleagues in their 2002 study.
When Dr. Baggerly reanalyzed the Zhu data, he was unable to arrive at the same results. The Zhu study identified a pattern of 18 protein peaks that separated controls from cancers. In Dr. Baggerly's reanalysis, the pattern discriminated accurately in the first data set, which contained serum from all three groups, but not in the second data set, which contained serum only from cancer patients and healthy controls.
In the second data set, 13 of the 18 peak differences changed signs—that is, peaks associated with cancer in the first group were associated with controls in the second group, and peaks first associated with controls switched to cancers. “This reversal isn't consistent with a persistent difference between cancer samples and control samples,” Dr. Baggerly said.
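The kind of consistency check this implies can be sketched roughly as follows; the peak count, the data, and the helper function are invented for illustration and are not the code used in the JNCI reanalysis.

```python
# For each peak, compare the sign of the mean cancer-minus-control intensity
# difference in data set 1 with the sign in data set 2.  A biologically real
# marker should not flip direction between data sets.
import numpy as np

def mean_differences(cancer_spectra, control_spectra):
    """Mean cancer-minus-control intensity at each peak (columns = peaks)."""
    return cancer_spectra.mean(axis=0) - control_spectra.mean(axis=0)

rng = np.random.default_rng(1)
n_peaks = 18  # same number of peaks as in the Zhu pattern, but chosen at random here
cancer1, control1 = rng.normal(size=(50, n_peaks)), rng.normal(size=(50, n_peaks))
cancer2, control2 = rng.normal(size=(50, n_peaks)), rng.normal(size=(50, n_peaks))

diff1 = mean_differences(cancer1, control1)
diff2 = mean_differences(cancer2, control2)

flipped = np.sign(diff1) != np.sign(diff2)
print(f"{flipped.sum()} of {n_peaks} peak differences change sign between data sets")
```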
The researchers then chose 18 random protein peaks from the same regions of spectral data as Dr. Zhu's peaks. The random peaks separated cancer samples from controls up to 56% of the time, depending on the strength of the signals used. Because the pattern of protein expression was inconsistent between the data sets, they concluded, the values did not represent biologically important changes in the serum of cancer patients.
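A rough sketch of such a random-peak baseline, again with invented data and an assumed off-the-shelf classifier rather than the authors' own procedure, might look like this:

```python
# Repeatedly draw 18 peaks at random and ask how well they alone separate
# "cancer" from "control" spectra.  With pure-noise data and arbitrary labels,
# any apparent discrimination reflects chance and overfitting, not biology.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_spectra, n_total_peaks = 200, 300
X = rng.normal(size=(n_spectra, n_total_peaks))   # noise spectra, no real signal
y = rng.integers(0, 2, size=n_spectra)            # arbitrary group labels

scores = []
for _ in range(25):
    peaks = rng.choice(n_total_peaks, size=18, replace=False)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, peaks], y, cv=5).mean()
    scores.append(acc)

print(f"best cross-validated accuracy from random peaks: {max(scores):.2f}")
```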
The problem, Dr. Baggerly asserts, lies in the nonrandomized way the serum samples were processed and the spectra were acquired in the initial study by Dr. Petricoin and his colleagues.
“They ran all the controls on one day and all the cancers on the next day,” Dr. Baggerly said. “This is the worst kind of design when you are using a machine that can be subject to external factors,” such as changes in calibration or mechanical breakdown.
In fact, he said, a June 2004 study in which Dr. Petricoin participated also suffered from just such a problem (Endocr. Relat. Cancer 2004;11:163–78). This study used a different mass spectrometer, which began to break down on day 3 of running the samples. In a letter to the editor, Dr. Petricoin acknowledged the problem but said, “We cannot detect whether the cancer data acquired on the previous day were convincingly negatively affected by the spectrometer failure.”
Dr. Baggerly contends that a better design, in which sample processing is randomized, would make it possible to separate differences due to biology from those due to external factors.
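The design point can be illustrated with a trivial sketch in which the run order of samples is shuffled so that no single day is devoted to one group; the sample counts and batch size below are made up.

```python
# Randomize the order in which cancer and control samples are run, so that
# day-to-day instrument drift or breakdown is not confounded with disease status.
import random

random.seed(0)
samples = [("cancer", i) for i in range(100)] + [("control", i) for i in range(100)]
random.shuffle(samples)   # interleave the two groups across the whole run

runs_per_day = 50
for day, start in enumerate(range(0, len(samples), runs_per_day), start=1):
    batch = samples[start:start + runs_per_day]
    n_cancer = sum(1 for group, _ in batch if group == "cancer")
    print(f"day {day}: {n_cancer} cancer, {len(batch) - n_cancer} control samples")
```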
The failure to reproduce the results does not surprise Dr. Petricoin and his colleague Lance A. Liotta, M.D., who participated in the 2002 and 2004 studies; their commentary appears in the same journal. Each of the data sets, all of which are available online without restriction, was generated with different machines and methods in order to test those machines and methods.