Created by physicians for physicians, patients and their families.™

What is a Prediction?
Harry B. Burke, M.D., Ph.D.

INTRODUCTION
What is a prediction? A prediction is a person's chance of something happening to them in the future based on a model (like a map) that was created using people with the same disease whose outcome is known. For example, we can follow a group of women with breast cancer over time and observe their outcomes. Some women experience a recurrence, some die of their disease, and most live long lives. We can use a statistical method called an artificial neural network to combine the experiences of all these women. Now when a woman is diagnosed with breast cancer we can go to the model that was based on the experiences of women in the past, put the newly diagnosed woman's prognostic factors (for example, tumor size) into the model, and the model will give us the woman's chance of recurrence and survival for different treatments. The newly diagnosed woman can then pick the treatment that is best for her individual situation. A more detailed discussion follows.

DETAILED DISCUSSION
Predictive factors are required for predicting the natural history of the patient's disease, predicting the therapy that is optimal for the patient, and predicting the effectiveness of the treatment. (Burke, 1998a) Because predictive factors are predictive to the degree that they participate in the disease process, anything that participates in the disease process is a potential predictive factor. (Burke and Henson, 1999) Although the factor itself does not change, its functional type (risk factor, diagnostic factor, or prognostic factor) depends on whether it is being evaluated and used to determine the patient's risk of disease, the existence of disease, or the patient's prognosis and treatment.

The analysis and use of predictive factors is complicated by the movement down explanatory levels of analysis, from the demographic level, to the anatomic-cellular level, to the molecular genetic level, because the number and complexity of the factors increase with the movement to lower levels of analysis. (Burke and Henson, 1999) The movement occurs because the factors at higher levels are compound factors, and therefore are inherently less powerful than lower level factors. A compound factor is the realization of several unmeasured lower level factors. The movement to lower levels of analysis increases predictive power but also results in the proliferation of factors and the need for their integration in a predictive statistical model. (Burke and Henson, 1999) In addition, there are methodologic and technical issues unique to the identification, replication, and validation of molecular genetic factors. (Burke and Henson, 1999)

What are predictive factors?

A predictive factor predicts an outcome (risk of disease, existence of disease, or prognosis) by virtue of its relationship with the disease process that causes the outcome. (Burke, 1998a) Terms such as marker, biomarker, predictor, prognosticator, indicator, surrogate factor, and intermediate biomarker have been used to identify variables that are connected to medical outcomes. (Burke, 1998a) The meanings of these terms overlap and their undifferentiated use can cause confusion. We suggest that all predictive factors are markers of disease, in that they are in some way associated with the disease process, but that not all markers of disease have sufficient predictive power to be called predictive factors; many are only indirectly related to the disease process. (Burke, 1998a) We will use the term factor to identify markers of disease that either are, or have the potential to be, predictive for a given outcome in a specified statistical model.

There are three types of predictive factors: risk, diagnostic, and prognostic. (Burke, 1998a) They differ in their outcomes and the degree to which they are associated with their outcome. "Risk" is an ambiguous term; it can mean the risk of occurrence of disease or the chance of any event occurring. We will use the term "risk" to refer to the risk of disease occurrence. "Risk" when used in the context of "risk of recurrence" or "risk of death" will be called "probability", as in "probability of recurrence" and "probability of death". A risk factor's primary outcome is incidence of disease. The factor, either alone or in combination with other factors, is almost always much less than 100% predictive of the disease occurring by a specified time in the future. (Burke, 1998a) The reason for the poor predictive accuracy of risk factors is that, no matter how carefully the population at risk is selected, few people will clinically express the disease. Therefore, there is usually a high error rate in predicting who will exhibit the disease. The easiest prediction is when most of the people in the population will have the disease by the end of a specified time interval. Risk can be viewed as a propensity for the disease. A high grade squamous intra-epithelial lesion (HSIL), for example, is a cytologic risk factor for subsequent cervical cancer. It indicates a greater propensity for cervical cancer than a normal Papanicolaou smear.

A diagnostic factor's outcome is also incidence of disease. (Burke, 1998a) The factor, either alone or in combination with other factors, must be close to 100% predictive of disease. A biopsy that shows invasive cancer is 100% predictive of invasive cancer. A prognostic factor's primary outcome in lethal diseases is death. A prognostic factor is rarely a strong predictor in isolation from other prognostic factors. (Burke, 1998a, 1998c) Although prognostic factors are almost always stronger than risk factors simply because everyone in the population has the disease, when the disease process is complex it is rarely the case that one factor can accurately reflect the disease. This is especially true when the factor is assessed using cases that represent patients at different stages of the disease process. Tumor location(s) and lymph node involvement are prognostic factors for several, but not most, of the solid tumors.

Within a type (risk and prognostic) of predictive factor there are three subtypes: 1) natural history, 2) therapy specific, and 3) post therapy. (Burke, 1998a, 1998c) The subtypes are most useful for risk and prognostic factors because of their importance in directing interventions such as prevention and therapy. Natural history predictive factors predict the future occurrence (risk), current existence (diagnosis), or course (prognosis) of a disease when the patient never receives any preventive or therapeutic intervention. (Burke, 1998a, 1998c) Natural history should be the baseline against which all interventions are tested. (Burke, 1998a) An example of a natural history prognostic factor is any anatomic "extent-of-disease" factor such as tumor size. A therapy specific predictive factor assumes that there is an effective therapy and it predicts whether the patient will respond to a particular intervention (e.g., chemoprevention or chemotherapy). (Burke, 1998a, 1998c) A therapy specific factor is, as its name implies, specific to a particular treatment and must be assessed in a population that only received that treatment. (Burke, 1998c) An example of a therapy specific prognostic factor is estrogen receptor status in breast cancer, which predicts response to adjuvant hormonal treatment. A natural history predictive factor may also be a post therapy predictive factor if it changes its value after a treatment has been successful. Post therapy predictive factors require that patients respond to the intervention; they predict the success or the failure of the intervention. Disease recurrence requires that an effective treatment has been given to the patient and is a post therapy prognostic factor.

Determining whether a marker of disease is a predictive factor requires that: 1) the marker (now termed a variable because it is being quantified and modeled) be measured in a defined population, 2) the population be followed until enough outcomes have occurred (e.g., deaths), and 3) the relationship between the variable and the outcome be determined. (Burke, 1998a) If the variable predicts the outcome with "sufficient" accuracy (where sufficient varies with the question being addressed) in a specified model it is called a predictive factor. If the outcome that is predicted to occur always occurs, we say that the predictive factor and the outcome are 100% linked, i.e., that the factor has a 100% predictive accuracy.

The predictive power of a factor depends on both its intrinsic and extrinsic power. (Burke, 1998a) The intrinsic predictive power of a factor is related to its "connectedness" to the disease process. "Connected" means associated with the disease process (where "process" subsumes concepts such as cause, trigger, etc.). The less connected the factor is, the less predictive it is. (Burke, 1998a) A direct connection means that the factor is an integral (necessary, causal) part of the disease process itself. (Burke, 1998a) An indirect connection means that it is not an integral part of the disease process, but is related to the disease process, such as being a byproduct of it. The extrinsic predictive power of the factor depends on the question being asked, i.e., the specific factor-outcome relationship being examined. For some questions the factor-outcome relationship will not be strong (for example, a factor initiating the disease process and the eventual outcome of the patient), whereas for others it will be strong (for example, the initiating factor and the detection of the disease).

For a specific disease process and outcome the predictive accuracy of a factor depends on: 1) how closely connected the factor is to the disease process (individual factor power) and its orthogonality to all the known factors (degree of predictive overlap), 2) how easy it is to collect and measure, and 3) the degree to which the selected statistical method is able to capture the factor's predictive information and to integrate that information with that of other relevant factors. (Burke, 1998a)

Gasparini (1993) introduced the following distinction. "A prognostic indicator may be defined as any factor able, at the time of diagnosis (or surgery), to give information on clinical outcome." (p. 1208) "A predictive factor may be defined as any factor able to give information useful in selection of patients likely to respond to a specific, presently available form or combination of systematic adjuvant therapy." (p. 1209) There are several problems with this distinction. Prognosis is a prediction, thus prognosis is a sub-type of prediction. Since a prognostic factor is a sub-type of predictive factor, the two cannot be treated as equal categories while one is at the same time a sub-type of the other. Further, risk is a prediction and therefore a risk factor is a sub-type of predictive factor. But if a predictive factor must always be a factor that gives information regarding treatment in patients with disease, a risk factor cannot be a sub-type of predictive factor. Gasparini was probably trying to distinguish between natural history prognostic factors and therapy specific prognostic factors.

What are surrogate outcomes?

A surrogate outcome is the use of a predictive factor (risk, diagnostic, or prognostic) as an outcome in place of the true outcome. (Burke, 1994) All risk and prognostic factors can be used as surrogate outcomes. A surrogate outcome can be used to shorten the duration of a prospective risk or therapy study, or to clinically intervene prior to a patient reaching a true outcome. The term surrogate endpoint biomarker has been used to denote the use of predictive factors as an endpoint in a clinical study. The term biomarker can be applied to anything; it does not distinguish between anything, and is not scientifically useful. A better terminology is to name the surrogate outcome, then the type of factor (risk or prognostic), and then the factor itself.

One purpose of a screening program is the detection of a risk factor and the targeting of that risk factor for an intervention that will reduce or eliminate it. Such a program can use a risk factor as a surrogate outcome; for example, prostatic intraepithelial neoplasia can be used as a surrogate for prostate cancer. All surrogate outcomes in individuals not diagnosed with the disease are risk factors. (Burke, 1994) Risk factors are usually used as surrogate outcomes in order to more rapidly detect an intervention effect.

At least three components are necessary to use a predictive factor as a surrogate outcome: (1) the proper definition of the risk factor and a description of how to detect it, (2) the proper definition of the true outcome and a description of how to assess it, and (3) knowledge of the strength and direction of the relationship between the surrogate outcome and the true outcome over a specified time interval. (Burke, 1994) For a predictive factor to be a useful surrogate outcome it must be strongly connected to the true outcome (Burke, 1994; Bucher, 1999) and the shape and direction of the relationship must be known. (Burke, 1994) For example, disease recurrence can be used as a surrogate outcome because it is known to be strongly and positively associated with future death. It is usually the case that the closer the surrogate is to the true outcome the stronger it is as a predictor of the true outcome. (Temple, 1999) Surrogate outcomes, including those used as endpoints in clinical trials (surrogate endpoints), can never shorten the first investigation because the relationship between the risk factor and the true outcome must be known prior to the risk factor's use as a surrogate outcome. The only way to shorten the initial investigation of the relationship between a risk factor and the true outcome is through the use of specimen banks. (Burke and Henson, 1998b)

For screening that is used in conjunction with a prevention intervention, incidence of disease is a surrogate outcome for death from the disease. Incidence is an excellent surrogate outcome in this setting because it is causally linked to death from disease; if an individual never experiences clinical disease then that individual can never die from the disease. For screening used as a trigger for diagnostic testing the primary outcome is death from disease.

What are the criteria for a predictive factor?

Predictive factors should be: 1) accurate, 2) independent, and 3) useful. (Burke and Henson, 1993, 1999) Accurate means that the factor is, at its minimum accuracy, a powerful predictor for a subset of a clinical population or a modest predictor for a large segment of the population. Independent means that the factor retains predictive value when placed in a multivariate model that contains other relevant predictive factors. Useful means that the predictive factor is personally relevant to the patient or clinically relevant to a treatment. Personally relevant means that the factor can be used to inform the patient regarding their disease. Clinically relevant means that the factor can affect patient management and therefore outcome. Most powerful factors possess both aspects. When there is no effective therapy a factor's utility is in its informing the patients of their outcome so that they can prepare for it. The importance of the personal utility of a factor, its ability to provide information to patients regarding their outcome even when the outcome cannot be changed, should not be underestimated.

Although it has been suggested that biologic plausibility be a criterion for a prognostic factor, (McGuire, 1991) it is not necessary to understand the function of the factor in order to use it as a predictive factor. (Burke and Henson, 1999) The biologic plausibility requirement does not distinguish between the biologic function of a factor and its predictive utility. It is certainly the case that a factor's predictive value rests on its function in the disease process. But it is not necessary to know its function in order to use the factor predictively.

How should predictive factors be tested?

A putative predictive factor must go through three stages of testing before it is ready for clinical use. (Burke and Henson, 1999) The first stage is identification. (Burke and Henson, 1999) This is the discovery and initial characterization of the factor. The factor must be unambiguously described and its method of determination explained sufficiently for replication. The factor's predictive connection to a clinical outcome must be assessed (usually using an appropriate univariate statistical method). The clinical population used for the outcome assessment must also be described including the inclusion/exclusion criteria and what method was used to obtain the patients.

The second stage is replication. (Burke and Henson, 1999) Once a factor has been identified it must be replicated by independent researchers using the original assay method. Other assay methods that are commonly used to detect this type of factor should be employed by both the original researcher and by independent researchers. The original finding should be reproducible across assay methods and researchers using the same type of outcome and patient population. Failure to replicate previous  results will affect the interpretation and use of the prognostic factor. In addition, the accuracy method that is used to compare the factor across assay methods and researchers must be suitable for the comparison of two statistical models.

The third stage is validation. (Burke and Henson, 1999) Validation assesses the predictive power of the factor in other populations. The factor should be evaluated on a well defined independently collected patient population (not the same population that was used for identification and replication). The question being addressed is whether the factor retains its predictive power.

Guiding this three stage process is the understanding that for a factor to be clinically useful it must be assessed by a method that can be performed in many different types and levels of laboratories and it must be powerful enough to overcome intra-observer, inter-observer, and inter-institutional variance. (Burke and Henson, 1999)

There are two major validation problems related to prognostic factors. (Burke and Henson, 1998b) The first is the time from diagnosis to the analysis of outcomes (e.g., mortality). The longer this interval the longer the prediction time interval. To provide, for example, ten year survival predictions a patient population must be followed for ten years. The ten-year information is used to assess prognostic factor predictive accuracy and to provide ten-year outcome predictions to future patients. The second is the accrual of a sufficient number of outcomes so that the assessment of the factor is statistically reliable. Reliable means that a similar result would be observed if the analysis were repeated. One solution to these problems is the implementation of a specimen bank. (Burke and Henson, 1998b)

How can I combine factors to increase my predictive accuracy?

It is rarely the case that one factor is sufficiently predictive, i.e., that it is able to predict the outcome of interest with 100% accuracy, until the patient is very near the outcome. (Burke and Henson, 1999) The usual strategy when dealing with predictive factors is to combine several in a predictive model. The most useful grouping of factors is one in which all the factors are powerful and predictively orthogonal to each other, i.e., they represent independent aspects of the disease process. If they represent aspects of the disease that are not independent then their information will overlap and one will not add predictive power when combined with the other factors. The statistical method employed to combine the factors must be able to capture the complexity of the disease process that is represented by the factors being combined, e.g., nonlinearity and interactions.

Diagnosis is not an exception to the need to aggregate predictive factors to increase predictive power. When a pathologist makes a diagnosis based on a tissue slide the pathologist is using a set of diagnostic factors, for example, morphology and nuclear features. The task is relatively unambiguous because by the time the disease is clinically expressed it is usually well advanced; there is evidence of invasion. The predictive task is more difficult with early, pre-malignant lesions. The ability of pathologists to diagnose current or future clinical disease declines as we move earlier in the disease process. This problem will continue with the movement toward molecular genetic diagnosis. (Burke, 1996)

A predictive model is one or more predictive factors systematically related to each other and to an outcome. There are many ways to systematically organize factors. One common approach is to use a statistical method to relate one or more predictive factors to an outcome. For example, the mathematical formula generated by the logistic regression statistical method relates the predictive factors (input variables), in terms of their β-coefficients, to a binary disease outcome, e.g., recurrence, death, etc.
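
As a minimal sketch of how such a formula is applied, the following Python function turns a patient's factor values into a probability. The function name, the intercept, and the coefficient values are hypothetical and chosen only for illustration; they do not come from any fitted model.

```python
import math

def logistic_probability(intercept, coefficients, factors):
    """Return the modeled probability of a binary outcome for one patient."""
    # Linear predictor: intercept plus the coefficient-weighted sum of the factors.
    linear = intercept + sum(b * x for b, x in zip(coefficients, factors))
    # The logistic function maps the linear predictor onto the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for two factors (say, tumor size in cm and node status 0/1).
p = logistic_probability(-2.0, [0.5, 1.0], [2.0, 1.0])
print(p)
```

Each coefficient states how strongly its factor shifts the log-odds of the outcome when the other factors are held fixed.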

It should be noted that the predictive power of a factor in a statistical model is always dependent on the statistical method that was used to capture its power and on the other factors included in the model. (Burke, 1996) Because a particular model may not be efficient at capturing the power of the factors, and because all the relevant factors may not be included in the model, any statement of a factor's accuracy must include an explanation of why a specific statistical model was chosen and why the other factors were included in the model.

The primary descriptive methods for evaluating factors in cancer are: bins, stages, and indexes (either as discrete endpoint models or as Kaplan-Meier product-limit models). (Burke, 1993) The main inferential methods for combining factors are: decision trees; and regression methods including logistic, proportional hazards, and artificial neural networks. (Clark, 1994; Burke, 1995b)

Bins are the result of the mutually exclusive and exhaustive partitioning of discrete variables. Each combination of variable values is a bin and every patient is placed in the bin corresponding to their variable value combination. An example is the TNM classification of ovarian cancer. Tumor location (T1a, T1b, T1c, T2a, T2b, T2c, T3a, T3b, T3c), regional lymph node involvement (N0, N1), and existence of metastases (M0, M1) produce thirty-six bins.
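
The partitioning described above can be sketched in a few lines of Python. The category labels are those listed in the text; the `bin_of` helper and the dictionary layout of a patient record are assumptions made for illustration.

```python
from itertools import product

# Category labels as listed in the text for ovarian cancer.
T = ["T1a", "T1b", "T1c", "T2a", "T2b", "T2c", "T3a", "T3b", "T3c"]
N = ["N0", "N1"]
M = ["M0", "M1"]

# Every combination of values is one bin: 9 * 2 * 2 = 36.
bins = list(product(T, N, M))

def bin_of(patient):
    """Return the bin for a patient; each patient falls in exactly one bin."""
    return (patient["T"], patient["N"], patient["M"])

print(len(bins))
print(bin_of({"T": "T2b", "N": "N1", "M": "M0"}))
```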

For discrete variables, if there are enough patients in each bin, it can be shown that the frequency of the outcome in the population within each bin is the best predictor of the true outcome. (Burke, 1993) In other words, no prediction model can be more accurate than the bin model if the variables are discrete and the population very large. Problems with bin models include: 1) Continuous variables must be parsed into discrete variables, almost always resulting in a loss of predictive information and therefore a loss of accuracy. 2) As the number of discrete variables increases the number of bins increases exponentially. For example, if we wish to add 3 grades to the TNM of ovarian cancer, then the number of bins will increase to 108. In order to maintain accuracy there must be a corresponding exponential increase in the size of the patient population to fill each bin. 3) The proliferation of bins reduces the ability to understand the phenomena. Since the main reason for creating a bin model is usually ease of understanding and ease of use, bin models are rarely used in situations where there are more than two or three predictive factors.
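
A rough sketch of a bin model, under the assumption that each patient is recorded as a tuple of discrete values plus a binary outcome: the prediction for a bin is simply the observed outcome frequency in that bin. The function names and the toy records are hypothetical. The second function reproduces the bin-count arithmetic from the text.

```python
from collections import defaultdict

def fit_bin_model(patients):
    """Map each bin (a tuple of discrete values) to its observed outcome frequency."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [events, patients]
    for values, died in patients:
        counts[values][0] += int(died)
        counts[values][1] += 1
    return {b: events / total for b, (events, total) in counts.items()}

def number_of_bins(levels_per_variable):
    """Number of bins is the product of the number of levels of each variable."""
    n = 1
    for levels in levels_per_variable:
        n *= levels
    return n

# Toy records, invented for illustration only.
toy = [(("T1a", "N0", "M0"), False), (("T1a", "N0", "M0"), True), (("T3c", "N1", "M1"), True)]
model = fit_bin_model(toy)
print(model[("T1a", "N0", "M0")])
print(number_of_bins([9, 2, 2]), number_of_bins([9, 2, 2, 3]))
```

With only a handful of patients per bin the frequencies are unstable, which is the practical form of the sample-size problem noted above.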

A partial solution to some of the problems of a bin model is a stage model. A stage model is the combining of bins into super-bins. The justification for the grouping is the assumption that the factors selected are indexes of the "stages" of the disease process and that the combined bins represent a real stage in the disease process. For example, in breast cancer, the TNM staging system combines forty TNM classification bins into six super-bins (Stages I, IIA, IIB, IIIA, IIIB, IV) based on decreasing survival, and these super-bins are termed the TNM staging system.

A small set of stages has the potential to maintain explanatory simplicity and ease of use. Problems with stage models include: 1) The combining of bins into super-bins/stages reduces predictive accuracy. 2) Stage systems do not overcome the exponential increase in bins and in patients associated with adding a variable to the staging system; they just delay the problem at the cost of predictive accuracy. If the stages are held constant as variables (and their associated bins) are added to the staging system, the potential improvement in accuracy associated with the additional bins will be small to nonexistent. But if the stages are expanded to accommodate additional bins, the system loses its ease of understanding and usefulness. Thus, attempts to improve predictive accuracy by adding variables to a bin/stage model are rarely successful. 3) The problem of parsing continuous variables, with the resulting loss in predictive accuracy, remains.
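
The loss of accuracy from pooling bins can be illustrated with a small Python sketch. The grouping rule `toy_stage_of` and the patient records are hypothetical and are not the actual TNM staging rules; the point is only that every bin inside a stage receives the same pooled prediction.

```python
from collections import defaultdict

def fit_stage_model(patients, stage_of):
    """Pool bins into stages and return each stage's observed outcome frequency."""
    counts = defaultdict(lambda: [0, 0])  # stage -> [events, patients]
    for bin_values, died in patients:
        stage = stage_of(bin_values)
        counts[stage][0] += int(died)
        counts[stage][1] += 1
    return {s: events / total for s, (events, total) in counts.items()}

def toy_stage_of(bin_values):
    # Hypothetical grouping rule for illustration only; NOT the real TNM staging rules.
    t, n, m = bin_values
    if m == "M1":
        return "late"
    return "middle" if n == "N1" else "early"

# Toy records, invented for illustration only.
toy = [
    (("T1", "N0", "M0"), False),
    (("T2", "N0", "M0"), True),
    (("T1", "N1", "M0"), True),
    (("T1", "N0", "M1"), True),
]
stages = fit_stage_model(toy, toy_stage_of)
# The T1 and T2 bins had different outcomes, but both now receive the pooled "early" value.
print(stages["early"])
```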

Indexes associate numerical scores (usually based on a bounded, linear scale) with bins or groups of bins. The scores are parsed into discrete ranges, and each range is associated with a disease stage (usually a severity of illness system). Indexes offer some flexibility in the grouping of bins, but at the cost of further degradation in predictive accuracy. The simplest example of an index is the Apgar score. 
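
As a sketch of an index, the Apgar score sums five components, each scored 0, 1, or 2, into a bounded total from 0 to 10, which is then parsed into ranges. The component names follow the usual mnemonic; the three range boundaries used here are the commonly quoted ones and should be treated as illustrative rather than as clinical guidance.

```python
APGAR_COMPONENTS = ("appearance", "pulse", "grimace", "activity", "respiration")

def apgar_total(scores):
    """Sum the five component scores; each component must be 0, 1, or 2."""
    for name in APGAR_COMPONENTS:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")
    return sum(scores[name] for name in APGAR_COMPONENTS)

def score_range(total):
    """Parse the total into a discrete range (boundaries are illustrative)."""
    if total >= 7:
        return "7-10"
    if total >= 4:
        return "4-6"
    return "0-3"

newborn = {"appearance": 1, "pulse": 2, "grimace": 1, "activity": 1, "respiration": 2}
total = apgar_total(newborn)
print(total, score_range(total))
```

Many different component combinations map to the same total, and many totals map to the same range, which is where the index gives up predictive information.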

Any bin, group of bins, stages, or scores can be compared, in terms of outcome, with other bins, groups of bins, stages, or scores at the end of a single time interval, across objective time intervals, or across a series of event time intervals. These approaches usually deal with censoring by dropping censored cases at the time interval in which they are censored. The most common descriptive approach for comparing predictive factors across a series of event time intervals is the Kaplan-Meier product-limit method (inferential methods that can accommodate continuous variables and that usually require a proportional hazards assumption will be discussed later with regression methods). A Kaplan-Meier plot should always include confidence intervals around each line. A significant difference in a Kaplan-Meier comparison is usually assessed by a log-rank test (which assumes proportional hazards). It is important to note that there is currently no widely accepted method for comparing the accuracy of two Kaplan-Meier comparisons based on different stratifications of the same variables. The use of the log-rank p-value to select one stratification over another is incorrect because the log-rank test determines whether a factor stratification is likely to have occurred by chance. Extreme stratifications may result in a smaller p-value but they may also reduce predictive accuracy over the entire population.
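
The product-limit calculation can be sketched as follows. This is a minimal version that returns only the point estimates (no confidence intervals); the function name and the toy follow-up data are assumptions for illustration. Censored patients count as at risk up to and including the time at which they are censored, then drop out.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  follow-up time for each patient
    events: True if the event was observed at that time, False if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving this event time.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored cases leave the risk set after time t.
        n_at_risk -= removed
    return curve

# Toy data, invented for illustration: five patients, two of them censored.
print(kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False]))
```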

Decision trees split predictive factors to maximize predictive power using a loss function such as the log-likelihood and a greedy search algorithm. The most well known decision tree approach is the Classification and Regression Trees (CART) recursive partitioning method (Breiman, 1984). Empirically, we have never found decision trees to be the most accurate statistical method when compared to other regression methods. Their problems include the selection of the correct loss function, difficulty dealing with continuous variables, and overfitting when searching for the best predictors, especially when there are more than two or three splits.
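
A single greedy split can be sketched as below. This is not CART itself, only an illustration of the idea under stated assumptions: one variable, a binary outcome coded 0/1, and the Bernoulli log-likelihood as the criterion. A full tree would apply the same search recursively to each side of the chosen split.

```python
import math

def log_likelihood(outcomes):
    """Bernoulli log-likelihood of 0/1 outcomes under their own observed frequency."""
    n = len(outcomes)
    k = sum(outcomes)
    if n == 0 or k == 0 or k == n:
        return 0.0  # a pure (or empty) group is fit perfectly
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def best_split(values, outcomes):
    """Greedy search: try every threshold and keep the one with the highest log-likelihood."""
    best_threshold, best_score = None, -math.inf
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, outcomes) if x <= threshold]
        right = [y for x, y in zip(values, outcomes) if x > threshold]
        score = log_likelihood(left) + log_likelihood(right)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data, invented for illustration.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))
```

Because the search picks whichever threshold fits the sample best, repeated splitting on small groups fits noise, which is the overfitting problem mentioned above.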

Univariate regression methods are usually not appropriate for deciding whether a variable is or is not a predictive factor. These methods should not be used to assert that a factor is predictive because a factor should be assessed in the context of the other known factors. Further, some variables are only predictive when they are interacting with other factors (for example, most molecular genetic factors).

Logistic regression models the cumulative probability of a binary event occurring by a specific time. It uses a maximum likelihood loss function and a greedy search technique. It is a very efficient method for problems that have a binary outcome (e.g., recurrence, survival). Its limitation is that it must span a single time interval and does not distinguish when in the interval an event occurred.
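
The maximum likelihood fit can be sketched with a few lines of Python. The optimizer here is plain gradient ascent on the log-likelihood, which is an assumption made for brevity and not necessarily the search technique used by any particular statistical package; the function names, learning rate, iteration count, and toy data are likewise illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, outcomes, learning_rate=0.1, iterations=2000):
    """Fit intercept and coefficients by gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)  # weights[0] is the intercept
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, outcomes):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            error = y - p  # derivative of the log-likelihood w.r.t. the linear predictor
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        # Step uphill using the mean gradient over all patients.
        weights = [w + learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

def predict(weights, x):
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Toy data, invented for illustration: one factor, outcome 1 = event within the interval.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
outcomes = [0, 0, 1, 0, 1, 1]
w = fit_logistic(rows, outcomes)
print(predict(w, [1.0]), predict(w, [6.0]))
```

Note that the outcome records only whether the event occurred within the interval, not when, which is the limitation described above.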

"Proportional hazards" methods include the Weibull, exponential, and Cox. The Cox proportional hazards regression method (Cox, 1972) is the most commonly used. All three methods assume that the hazard of each patient is proportional to the hazards of all the other patients and that the degree of each patient's hazard is related to their relative risk. The Cox model cannot create empirical survival curves. For survival curves a baseline hazard must be introduced, e.g., Cox-Breslow estimates. (Breslow, 1974) In cancer, the proportional hazards assumption is often violated. Therefore, anyone using a Cox model must demonstrate that proportional hazards holds for the factors and outcome.
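
The proportionality assumption can be made concrete with a short sketch. It uses the form h(t | x) = h0(t) * exp(sum of beta_j * x_j) with a Weibull-shaped baseline hazard; the baseline parameters and the coefficient are hypothetical values chosen for illustration. Under this form the ratio of two patients' hazards is the same at every time point.

```python
import math

def weibull_baseline_hazard(t, shape=1.5, scale=10.0):
    """Hypothetical baseline hazard h0(t) for t > 0; shape and scale are illustrative."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def hazard(t, factors, coefficients):
    """Proportional hazards form: h(t | x) = h0(t) * exp(sum(beta_j * x_j))."""
    relative_risk = math.exp(sum(b * x for b, x in zip(coefficients, factors)))
    return weibull_baseline_hazard(t) * relative_risk

def hazard_ratio(t, factors_a, factors_b, coefficients):
    """Ratio of two patients' hazards at time t; the baseline cancels out."""
    return hazard(t, factors_a, coefficients) / hazard(t, factors_b, coefficients)

beta = [0.7]  # hypothetical coefficient for a single binary factor
print(hazard_ratio(1.0, [1.0], [0.0], beta))
print(hazard_ratio(8.0, [1.0], [0.0], beta))
```

If the observed ratio drifts over time in real data, the assumption is violated, which is why it has to be checked for each factor and outcome.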

Molecular genetic factors, for example, p53, c-erbB-2 (Her-2/neu), and pRB, exhibit the properties of complex systems: they are nonlinear and they are interactional, i.e., they act nonmonotonically and in concert with other molecular genetic factors. (Steele, 1993; Loomis, 1995; Buratowski, 1995; Sauer, 1995) Thus, capturing the factors as part of a complex system is critical to accurate prediction of the behavior of the system. Artificial neural networks are capable of capturing complex systems. (Burke, 1996)

The idea that learning can be viewed as the modification of information by repetitively passing it through processing nodes originated in the late 1940's as a way to model the physiology of neuronal processes. (Hebb, 1949) The operationalization of this idea was called an artificial neural network. Gradually it became apparent that this information theoretic approach to learning was very powerful and very general; it was useful in, and applicable to, many learning situations. Since statistics can be viewed as learning from the data, it is not unexpected that this approach would be mathematically proved and operationalized within the domain of statistics.

Artificial neural networks are universal approximators. It has been shown that any real, continuous function can be approximated to any degree of precision by a three-layer network with x in the input layer (patient variables), a hidden layer with sigmoidal transfer functions, and one layer of output units, as long as the hidden layer can be arbitrarily large. (Hornak, 1990, 1994)

Artificial neural networks, as a class of nonlinear regression and discrimination statistical methods, are of proven value in many areas of medicine. (Baxt, 1995; Dybowski, 1995; Westenskow, 1992; Tourassi, 1993; Leong, 1992; Gabor, 1992; von Osdol, 1994; Burke, 1997) They do not require a priori information regarding the phenomenon, they make no distributional assumptions, and with the appropriate method to avoid overfitting (i.e., loss of generalization by fitting the patterns in the training data too precisely), artificial neural networks are usually at least as accurate as classical statistical models and, depending on the complexity of the phenomena, can be more accurate. Artificial neural networks have, for example, been shown to be more accurate than logistic regression, CART (pruned or shrunk), and principal components analysis at predicting five-year breast cancer specific survival. (Burke, 1995b)

In medical research, the most commonly used artificial neural networks (ANN) are multilayer perceptrons that use backpropagation training. Backpropagation consists of fitting the parameters (weights) of the model by a criterion function, usually squared error or maximum likelihood, using a gradient optimization method. In backpropagation artificial neural networks, the error (the difference between the predicted outcome and the true outcome) is propagated back from the output to the connection weights in order to adjust the weights in the direction of minimum error. The usual artificial neural network employed in medical research is composed of three interconnected layers of nodes: an input layer with each input node corresponding to a patient variable, a hidden layer, and an output layer. All nodes after the input layer sum the inputs to them and use a transfer function (also known as an activation function) to send the information to the adjacent layer nodes. The transfer function is usually a sigmoid function such as the logistic. The connections between the nodes have adjustable weights that specify the extent to which the output of one node will be reflected in the activity of the adjacent layer nodes. These weights, along with the connections among the nodes determine the output of the network. The output of the network is a probability of the event for each patient.
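
The three-layer network and its backpropagation update can be sketched in plain Python. This is a minimal illustration under stated assumptions: a single output node, logistic transfer functions, a squared error criterion, one update per patient, and no safeguard against overfitting. The class name, learning rate, hidden layer size, and toy data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ThreeLayerNetwork:
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden node has one weight per input plus a bias (last entry).
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        # The single output node has one weight per hidden node plus a bias.
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # Each node sums its weighted inputs and applies the logistic transfer function.
        h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1]) for ws in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output[:-1], h)) + self.output[-1])
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, learning_rate=0.5):
        h, y = self.forward(x)
        # Error signal at the output node (squared error, logistic transfer function).
        delta_out = (y - target) * y * (1.0 - y)
        # Propagate the error back to each hidden node through its outgoing weight.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        # Adjust the weights in the direction that reduces the error.
        for j in range(len(h)):
            self.output[j] -= learning_rate * delta_out * h[j]
        self.output[-1] -= learning_rate * delta_out
        for j, ws in enumerate(self.hidden):
            for i in range(len(x)):
                ws[i] -= learning_rate * delta_hidden[j] * x[i]
            ws[-1] -= learning_rate * delta_hidden[j]

def mean_squared_error(net, data):
    return sum((net.predict(x) - t) ** 2 for x, t in data) / len(data)

# Toy data, invented for illustration: two binary variables and a binary outcome.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = ThreeLayerNetwork(n_inputs=2, n_hidden=3)
error_before = mean_squared_error(net, data)
for _ in range(500):
    for x, t in data:
        net.train_step(x, t)
error_after = mean_squared_error(net, data)
print(error_before, error_after)
```

In practice the error would be monitored on data held out from training, since error on the training data alone says nothing about generalization.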

How can I make sure that the method I have used to combine factors is a good one?

In order to assess and compare statistical models, it is necessary to distinguish between significance, accuracy, and importance. Significance suggests that it is unlikely that the predictions of either a trained statistical method (i.e., a statistical model) or a predictive factor are due to chance (e.g., by the chi-square test), where chance is set at a certain level. Significance is not necessarily accuracy. (Burke, 1998a) Accuracy is the association between the model's individual patient outcome predictions (the predicted outcome) and the individual outcomes of the test population (the true outcome). The importance of a factor or a model is based on whether the model or the factor possesses sufficient accuracy to be useful in answering a particular clinical question. Finally, the assessment of a model's or a factor's significance, accuracy, and importance must be based on test data set results, not on training data set results.
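A small worked example may help show why significance is not necessarily accuracy. The counts below are invented for illustration: 10,000 hypothetical patients, half of whom died, with a binary factor present in 55% of those who died and 45% of those who lived. The sketch computes the Pearson chi-square statistic for the two-by-two table and, for comparison, the receiver operating characteristic area of the factor, which for a single binary factor equals the average of its sensitivity and specificity.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def binary_factor_auc(a, b, c, d):
    """ROC area of a binary factor: the mean of sensitivity and specificity."""
    sensitivity = a / (a + b)   # factor present among patients who died
    specificity = d / (c + d)   # factor absent among patients who lived
    return (sensitivity + specificity) / 2

# Hypothetical counts (assumed for this example only).
# Rows: died, lived. Columns: factor present, factor absent.
a, b = 2750, 2250
c, d = 2250, 2750

chi2 = chi_square_2x2(a, b, c, d)
auc = binary_factor_auc(a, b, c, d)

CRITICAL_VALUE_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05
is_significant = chi2 > CRITICAL_VALUE_05

print(f"chi-square = {chi2:.1f}, significant at 0.05: {is_significant}")
print(f"ROC area   = {auc:.2f}")
```

With these counts the chi-square statistic is 100, far beyond the 0.05 critical value, yet the receiver operating characteristic area is only 0.55, barely better than chance. A factor like this is statistically significant but is unlikely to be accurate enough to be clinically important.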

There are several approaches to assessing the accuracy of a multivariate model and for comparing multivariate models (e.g., Goodman and Kruskal's Gamma, Kendall's Tau). The best currently available measure of predictive accuracy is the area under the receiver operating characteristic curve (Az). (Swets, 1996) It can be used to assess and compare the adequacy of statistical models. Az can be directly calculated from Somers' D (Somers, 1962) or it can be approximated by its trapezoidal area. (Bamber, 1975) The area under the curve is a nonparametric measure of discrimination. The receiver operating characteristic area is independent of both the prior probability of each outcome and the threshold cutoff for categorization. Its computation requires only that the prediction method produce an ordinal-scaled relative predictive score. In terms of mortality, the receiver operating characteristic area estimates the probability that the prediction method will assign a higher mortality score to the patient who died than to the patient who lived. The receiver operating characteristic area varies from zero to one. When the predictions are unrelated to survival, the score is 0.5, indicating chance accuracy. The farther the score is from 0.5 the better, on average, the prediction method is at predicting which of two patients with different outcomes will be alive. Whether the receiver operating characteristic areas of two models differ significantly can be tested following Hanley and McNeil (Hanley, 1982), by calculating their asymptotic variances, or by calculating the empirical variance using the bootstrap method. (Efron, 1993)
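The pairwise interpretation of the receiver operating characteristic area can be made concrete with a short Python sketch. The eight mortality scores and outcomes are invented for illustration, and the function names are hypothetical. The first function counts, over every pair of one patient who died and one who lived, how often the patient who died received the higher score (ties count one half); the second computes the trapezoidal area under the empirical curve; the third is a simple percentile bootstrap for the variability of the area, offered as one plausible reading of the bootstrap approach rather than the specific procedure of the cited authors.

```python
import numpy as np

def auc_concordance(scores, died):
    """Probability that a patient who died scored higher than one who lived."""
    scores = np.asarray(scores, dtype=float)
    died = np.asarray(died, dtype=bool)
    diff = scores[died][:, None] - scores[~died][None, :]
    concordant = np.sum(diff > 0)
    tied = np.sum(diff == 0)
    return float((concordant + 0.5 * tied) / diff.size)

def auc_trapezoid(scores, died):
    """Trapezoidal area under the empirical ROC curve."""
    scores = np.asarray(scores, dtype=float)
    died = np.asarray(died, dtype=bool)
    tpr, fpr = [0.0], [0.0]
    for threshold in np.unique(scores)[::-1]:      # highest cutoff first
        called_dead = scores >= threshold
        tpr.append(np.sum(called_dead & died) / np.sum(died))
        fpr.append(np.sum(called_dead & ~died) / np.sum(~died))
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return float(area)

def bootstrap_auc_interval(scores, died, n_boot=1000, seed=0):
    """Percentile bootstrap interval (2.5th to 97.5th) for the ROC area."""
    scores = np.asarray(scores, dtype=float)
    died = np.asarray(died, dtype=bool)
    rng = np.random.default_rng(seed)
    areas = []
    for _ in range(n_boot):
        idx = rng.integers(0, scores.size, size=scores.size)
        if died[idx].all() or not died[idx].any():
            continue  # a resample needs both outcomes to define the area
        areas.append(auc_concordance(scores[idx], died[idx]))
    low, high = np.percentile(areas, [2.5, 97.5])
    return float(low), float(high)

# Hypothetical mortality scores and outcomes (1 = died, 0 = lived).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
died = [1, 1, 0, 1, 0, 0, 1, 0]

az = auc_concordance(scores, died)       # 12 of 16 pairs ordered correctly
az_trap = auc_trapezoid(scores, died)
somers_d = 2.0 * az - 1.0                # Somers' D and Az are linearly related
low, high = bootstrap_auc_interval(scores, died)
```

For these eight patients both calculations give an area of 0.75, corresponding to a Somers' D of 0.5. With so few patients the bootstrap interval is very wide, which illustrates why an accuracy estimate should always be reported with its standard error or confidence interval.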

What should I look for when reading about predictive factors?

There is a great deal of variation in the reporting of predictive factor results. This variability makes it difficult to understand empirical results and to replicate and validate predictive factor research. A report regarding the discovery of a new prognostic factor or the validation of an existing factor should contain the following: (Burke, 1998a)

1. The name of the disease and the necessary and sufficient criteria used to diagnose the disease.
2. Where in the disease process the collected patient population is (e.g., early detected disease) and the patient inclusion and exclusion criteria.
3. The name and complete description of the prognostic factor.
4. The type of prognostic factor it is (i.e., natural history, therapy specific, or post therapy).
5. The outcome that was selected and a specific time interval, e.g., five-year breast cancer-specific survival. Except in special situations, it should not be a "lifetime" time interval.
6. When the prognostic factor was collected in the patient population (e.g., at disease discovery, prior to therapy, after therapy).
7. The specific laboratory method used to assess the factor (e.g., immunohistochemistry) and why that method rather than another method was selected.
8. If the prognostic factor is stratified, how the cut-point was selected and whether it was replicated in another population. If the variable value is based on a rater's judgment, then Cohen's kappa should be reported.
9. Relevant characteristics of the data set, including the data set size, population characteristics, and the number of events (outcomes).
10. If any patients received a therapy, the type of therapy should be reported and the patients should be stratified by therapy for all analyses.
11. The numerical estimate of the finding (usually a parameter estimate) and its confidence interval.
12. The level of significance should be set. If the investigators have not looked at the data, then a p-value of < 0.05 is usually acceptable. For multiple tests or data exploration an adjustment may be required.
13. The type of multivariate statistical method (e.g., logistic regression, Cox) used, why it was used, and any assumption tests (for example, proportional hazards).
14. If all the other relevant prognostic factors were not included in the multivariate model, which were left out and why.
15. The predictive accuracy of the multivariate model should be assessed (e.g., by the area under the receiver operating characteristic curve or R²), and the reason why a particular method was used should be explained.
16. The accuracy estimates of the multivariate model, including either standard errors or confidence intervals for the estimates (e.g., Az = .75, CI = .50 - 1.0), must be provided.

References

Altman DG, Lyman GH. Methodologic challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat 1998a;52:289-303.

Altman DG. Suboptimal analysis using 'optimal' cutpoints. Br J Cancer 1998;78:556-7.

Bamber D. The area above the ordinal dominance graph and the area below the receiver operating graph. J Math Psych 1975;12:387-415.

Baxt WG. Application of artificial neural networks to clinical medicine. Lancet 1995;346:1135-38.

Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks, 1984.

Breslow NE. Covariance analysis of censored survival data. Biometrics; 1974:80-99.

Buratowski S. Mechanisms of gene activation. Science 1995;270:1773-4.

Bucher HC, Guyatt GH, Cook DJ, Holbrook A, McAlister FA. Users guide to the medical literature XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. JAMA 1999;282:771-778.

Burke HB, Henson DE. Criteria for prognostic factors and for an enhanced prognostic system. Cancer 1993;72:3131-35.

Burke HB. Increasing the power of surrogate endpoint biomarkers: the aggregation of predictive factors. J Cell Biochem 1994;19S:278-82.

Burke HB, Hutter RVP, Henson DE. Breast Carcinoma. In: P. Hermanek, M.K. Gospodarowicz, D.E. Henson, R.V.P. Hutter, L.H. Sobin (eds.), UICC Prognostic Factors in Cancer. Berlin: Springer-Verlag, 1995a, 165-76.

Burke HB, Rosen DB, Goodman PH. Comparing the prediction accuracy of artificial neural networks and other statistical models for breast cancer survival. In: G. Tesauro, D.S. Touretzky, T.K. Leen, eds. Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press, 1995b, 1063-67.


Webmaster@CancerHome.com © Copyright 2000, Cancer Home Inc. All Rights Reserved.