

What is a Prediction?
Harry B. Burke, M.D., Ph.D.
INTRODUCTION
What is a prediction? A prediction is a person's chance of something happening
to them in the future based on a model (like a map) that was created using
people with the same disease whose outcome is known. For example, we can
follow a group of women with breast cancer over time and observe their outcomes.
Some women experience a recurrence, some die of their disease, and most
live long lives. We can use a statistical method called an artificial neural
network to combine the experiences of all these women. Now when a woman
is diagnosed with breast cancer we can go to the model that was based on
the experiences of women in the past, put the newly diagnosed woman's prognostic
factors (for example, tumor size) into the model, and the model will give
us the woman's chance of recurrence and survival for different treatments.
The newly diagnosed woman can then pick the treatment that is best for her
individual situation. A more detailed discussion follows.
DETAILED DISCUSSION
Predictive factors are required for predicting
the natural history of the patient's disease, predicting the therapy that
is optimal for the patient, and predicting the effectiveness of the treatment.
(Burke, 1998a) Because predictive factors are predictive to the degree that
they participate in the disease process, anything that participates in the
disease process is a potential predictive factor. (Burke and Henson, 1999)
Although the factor itself does not change its functional type, i.e., risk
factor, diagnostic factor, or prognostic factor, its type depends on whether
it is being evaluated and used to determine the patient's risk of disease,
the existence of disease, or the patient's prognosis and treatment.
The analysis and use of predictive factors is complicated by the movement
down explanatory levels of analysis, from the demographic level, to the anatomic-cellular
level, to the molecular genetic level because the number and complexity of
the factors increase with the movement to lower levels of analysis. (Burke
and Henson, 1999) The movement occurs because the factors at higher levels
are compound factors, and therefore are inherently less powerful than lower
level factors. A compound factor is the realization of several unmeasured
lower level factors. The movement to lower levels of analysis increases predictive
power but also results in the proliferation of factors and the need for their
integration in a predictive statistical model. (Burke and Henson, 1999) In
addition, there are methodologic and technical issues unique to the identification,
replication, and validation of molecular genetic factors. (Burke and Henson,
1999)
What are predictive factors?
A predictive factor predicts an outcome (risk of disease, existence of disease,
or prognosis) by virtue of its relationship with the disease process that
causes the outcome. (Burke, 1998a) Terms such as marker, biomarker, predictor,
prognosticator, indicator, surrogate factor, intermediate biomarker have been
used to identify variables that are connected to medical outcomes. (Burke,
1998a) The meanings of these terms overlap and their undifferentiated use
can cause confusion. We suggest that all predictive factors are markers of
disease; they are in some way associated with the disease process. But
not all markers of disease have sufficient predictive power to be called predictive
factors; many are only indirectly related to the disease process. (Burke, 1998a)
We will use the term factor to identify markers of disease that either are,
or have the potential to be, predictive for a given outcome in a specified
statistical model.
There are three types of predictive factors: risk, diagnostic, and prognostic.
(Burke, 1998a) They differ in their outcomes and the degree to which they
are associated with their outcome. "Risk" is an ambiguous term;
it can mean the risk of occurrence of disease or the chance of any event occurring.
We will use the term "risk" to refer to the risk of disease occurrence.
"Risk" when used in the context of "risk of recurrence"
or "risk of death" will be called "probability", as in
"probability of recurrence" and "probability of death".
A risk factor's primary outcome is incidence of disease. The factor, either
alone, or in combination with other factors, is almost always much less than
100% predictive of the disease occurring by a specified time in the future.
(Burke, 1998a) The reason for the poor predictive accuracy of risk factors
is that no matter how carefully the population at risk is selected, few
people will clinically express the disease. Therefore, there is usually a
high error rate in predicting who will exhibit the disease. The easiest prediction
is when most of the people in the population will have the disease by the
end of a specified time interval. Risk can be viewed as a propensity for the
disease. A high grade squamous intra-epithelial lesion (HSIL), for example,
is a cytologic risk factor for subsequent cervical cancer. It indicates a
greater propensity for cervical cancer than a normal Papanicolaou smear.
A diagnostic factor's outcome is also incidence of disease. (Burke, 1998a)
The factor, either alone, or in combination with other factors, must be close
to 100% predictive of disease. A biopsy that shows invasive cancer is 100%
predictive of invasive cancer. A prognostic factor's primary outcome
in lethal diseases is death. A prognostic factor is rarely a strong predictor
in isolation from other prognostic factors. (Burke, 1998a, 1998c) Although
prognostic factors are almost always stronger than risk factors simply because
everyone in the population has the disease, when the disease process is complex
it is rarely the case that one factor can accurately reflect the disease.
This is especially true when the factor is assessed using cases that represent
patients at different stages of the disease process. Tumor location(s) and
lymph node involvement are prognostic factors for several, but not most, of
the solid tumors.
Within a type (risk and prognostic) of predictive factor there are three subtypes:
1) natural history, 2) therapy specific, and 3) post therapy. (Burke, 1998a,
1998c) The sub-types are most useful for risk and prognostic factors because
of their importance in directing interventions such as prevention and therapy.
Natural history predictive factors predict the future occurrence (risk), current
existence (diagnosis), or course (prognostic) of a disease when the patient
never receives any prevention or therapeutic intervention. (Burke, 1998a,
1998c) Natural history should be the baseline against which all interventions
are tested. (Burke, 1998a) An example of a natural history prognostic factor
is any anatomic "extent-of-disease" factor such as tumor size. A
therapy specific predictive factor assumes that there is an effective therapy
and it predicts whether the patient will respond to a particular intervention
(e.g., chemoprevention or chemotherapy). (Burke, 1998a, 1998c) A therapy specific
factor is, as its name implies, specific to a particular treatment and must
be assessed in a population that only received that treatment. (Burke, 1998c)
An example of a therapy specific prognostic factor is estrogen receptor status
in breast cancer which predicts response to adjuvant hormonal treatment. A
natural history predictive factor may also be a post therapy predictive factor
if it changes its value after a treatment has been successful. Post therapy
predictive factors require that patients respond to the intervention; they
predict the success or the failure of the intervention. Disease recurrence
requires that an effective treatment has been given to the patient and is
a post therapy prognostic factor.
Determining whether a marker of disease is a predictive factor requires that:
1) the marker (now termed a variable because it is being quantified and modeled)
be measured in a defined population, 2) the population be followed until enough
outcomes have occurred (e.g., deaths), and 3) the relationship between the
variable and the outcome be determined. (Burke, 1998a) If the variable predicts
the outcome with "sufficient" accuracy (where sufficient varies
with the question being addressed) in a specified model it is called a predictive
factor. If the outcome that is predicted to occur always occurs, we say that
the predictive factor and the outcome are 100% linked, i.e., that the factor
has a 100% predictive accuracy.
The predictive power of a factor depends on both its intrinsic and extrinsic
power. (Burke, 1998a) The intrinsic predictive power of a factor is related
to its "connectedness" to the disease process. "Connected"
means associated with the disease process (where "process" subsumes
concepts such as cause, trigger, etc.). The less connected the factor is,
the less predictive it is. (Burke, 1998a) A direct connection means that the
factor is an integral (necessary, causal) part of the disease process itself.
(Burke, 1998a) An indirect connection means that it is not an integral part
of the disease process, but is related to the disease process such as being
a byproduct of the disease process. The extrinsic predictive power of the
factor depends on the question being asked, i.e., the specific factor-outcome
relationship being examined. For some questions the factor-outcome relationship
will not be strong (for example, between a factor initiating the disease process and
the eventual outcome of the patient), whereas for others it will be strong
(for example, between the initiating factor and the detection of the disease).
For a specific disease process and outcome the predictive accuracy of a factor
depends on: 1) how closely connected the factor is to the disease process
(individual factor power) and its orthogonality to all the known factors (degree
of predictive overlap), 2) how easy it is to collect and measure, 3) the degree
to which the selected statistical method is able to capture the factor's predictive
information and to integrate that information with that of other relevant
factors. (Burke, 1998a)
Gasparini (1993) introduced the following distinction. "A prognostic
indicator may be defined as any factor able, at the time of diagnosis (or
surgery), to give information on clinical outcome." (p. 1208) "A
predictive factor may be defined as any factor able to give information useful
in selection of patients likely to respond to a specific, presently available
form or combination of systematic adjuvant therapy." (p. 1209) There
are several problems with this distinction. Prognosis is a prediction, thus
prognosis is a sub-type of prediction. Since prognosis is a sub-type of predictive
factor, the two cannot be equal while, at the same time, one is a
sub-type of the other. Further, risk is a prediction and therefore a sub-type
of predictive factor. But if a predictive factor must always be a factor that
gives information regarding treatment in patients with disease, a risk factor
cannot be a sub-type of predictive factor. Gasparini was probably trying
to distinguish between natural history prognostic factors and therapy specific
prognostic factors.
What are surrogate outcomes?
A surrogate outcome is the use of a predictive factor (risk, diagnostic, or
prognostic) as an outcome in place of the true outcome. (Burke, 1994) All
risk and prognostic factors can be used as surrogate outcomes. A surrogate
outcome can be used to shorten the duration of a prospective risk or therapy
study, or to clinically intervene prior to a patient reaching a true outcome.
The term surrogate endpoint biomarker has been used to denote the use of predictive
factors as an endpoint in a clinical study. The term biomarker can be applied
to anything; it distinguishes nothing and is not scientifically
useful. A better terminology is to discuss a surrogate outcome, and then to specify the
type of factor (risk or prognostic) that serves as the surrogate.
One purpose of a screening program, the detection of a risk factor and then
targeting the risk factor for an intervention that will reduce or eliminate
it, can use a risk factor as a surrogate outcome; for example, prostatic intraepithelial
neoplasia can be used as a surrogate for prostate cancer. All surrogate outcomes
in individuals not diagnosed with the disease are risk factors. (Burke, 1994)
Risk factors are usually used as surrogate outcomes in order to more rapidly
detect an intervention effect.
At least three components are necessary to use a predictive factor as a surrogate
outcome: (1) the proper definition of the risk factor and a description of
how to detect it, (2) the proper definition of the true outcome and a description
of how to assess it, and (3) knowledge of the strength and direction of the
relationship between the surrogate outcome and the true outcome over a specified
time interval. (Burke, 1994) For a predictive factor to be a useful surrogate
outcome it must be strongly connected to the true outcome (Burke, 1994; Bucher,
1999) and the shape and direction of the relationship must be known. (Burke,
1994) For example, disease recurrence can be used as a surrogate outcome because
it is known to be strongly and positively associated with future death. It
is usually the case that the closer the surrogate is to the true outcome the
stronger it is as a predictor of the true outcome. (Temple, 1999) Surrogate
outcomes, including those used as endpoints in clinical trials (surrogate
endpoints), can never shorten the first investigation because the relationship
between the risk factor and the true outcome must be known prior to the risk
factor's use as a surrogate outcome. The only way to shorten the initial investigation
of the relationship between a risk factor and the true outcome is through
the use of specimen banks. (Burke and Henson, 1998b)
For screening that is used in conjunction with a prevention intervention,
incidence of disease is a surrogate outcome for death from the disease. Incidence
is an excellent surrogate outcome in this setting because it is causally linked
to death from disease; if an individual never experiences clinical disease
then that individual can never die from the disease. For screening used as
a trigger for diagnostic testing the primary outcome is death from disease.
What are the criteria for a predictive factor?
Predictive factors should be: 1) accurate, 2) independent, and 3) useful.
(Burke and Henson, 1993, 1999) Accurate means that the factor is, at its minimum
accuracy, a powerful predictor for a subset of a clinical population or a
modest predictor for a large segment of the population. Independent means
that the factor retains predictive value when placed in a multivariate model
that contains other relevant predictive factors. Useful means that the predictive
factor is personally relevant to the patient or clinically relevant to a treatment.
Personally relevant means that the factor can be used to inform the patient
regarding their disease. Clinically relevant means that the factor can affect
patient management and therefore outcome. Most powerful factors possess both
aspects. When there is no effective therapy a factor's utility is in its informing
the patients of their outcome so that they can prepare for it. The importance
of the personal utility of a factor, its ability to provide information to
patients regarding their outcome even when the outcome cannot be changed,
should not be underestimated.
Although it has been suggested that biologic plausibility be a criterion for
a prognostic factor, (McGuire, 1991) it is not necessary to understand the
function of the factor in order to use it as a predictive factor. (Burke and
Henson, 1999) The biologic plausibility requirement does not distinguish between
the biologic function of a factor and its predictive utility. It is certainly
the case that a factor's predictive value rests on its function in the disease
process. But it is not necessary to know its function in order to use the
factor predictively.
How should predictive factors be tested?
A putative predictive factor must go through three stages of testing before
it is ready for clinical use. (Burke and Henson, 1999) The first stage is
identification. (Burke and Henson, 1999) This is the discovery and initial
characterization of the factor. The factor must be unambiguously described
and its method of determination explained sufficiently for replication. The
factor's predictive connection to a clinical outcome must be assessed (usually
using an appropriate univariate statistical method). The clinical population
used for the outcome assessment must also be described including the inclusion/exclusion
criteria and what method was used to obtain the patients.
The second stage is replication. (Burke and Henson, 1999) Once a factor has
been identified it must be replicated by independent researchers using the
original assay method. Other assay methods that are commonly used to detect
this type of factor should be employed by both the original researcher and
by independent researchers. The original finding should be reproducible across
assay methods and researchers using the same type of outcome and patient population.
Failure to replicate previous results will affect the interpretation
and use of the prognostic factor. In addition, the accuracy method that is
used to compare the factor across assay methods and researchers must be suitable
for the comparison of two statistical models.
The third stage is validation. (Burke and Henson, 1999) Validation assesses
the predictive power of the factor in other populations. The factor should
be evaluated on a well defined independently collected patient population
(not the same population that was used for identification and replication).
The question being addressed is whether the factor retains its predictive
power.
Guiding this three stage process is the understanding that for a factor to
be clinically useful it must be assessed by a method that can be performed
in many different types and levels of laboratories and it must be powerful
enough to overcome intra-observer, inter-observer, and inter-institutional variance.
(Burke and Henson, 1999)
There are two major validation problems related to prognostic factors. (Burke
and Henson, 1998b) The first is the time from diagnosis to the analysis of
outcomes (e.g., mortality). The longer this interval the longer the prediction
time interval. To provide, for example, ten year survival predictions a patient
population must be followed for ten years. The ten-year information is used
to assess prognostic factor predictive accuracy and to provide ten-year outcome
predictions to future patients. The second is the accrual of a sufficient
number of outcomes so that the assessment of the factor is statistically reliable.
Reliable means that a similar result would be observed if the analysis were
repeated. One solution to these problems is the implementation of a specimen
bank. (Burke and Henson, 1998b)
How can I combine factors to increase my predictive accuracy?
It is rarely the case that one factor is sufficiently predictive, i.e., that
it is able to predict the outcome of interest with 100% accuracy, until the
patient is very near the outcome. (Burke and Henson, 1999) The usual strategy
when dealing with predictive factors is to combine several in a predictive
model. The most useful grouping of factors is one in which all the factors
are powerful and predictively orthogonal to each other, i.e., they represent
independent aspects of the disease process. If they represent aspects of the
disease that are not independent then their information will overlap and one
will not add predictive power when combined with the other factors. The statistical
method employed to combine the factors must be able to capture the complexity
of the disease process that is represented by the factors being combined,
e.g., nonlinearity and interactions.
Diagnosis is not an exception to the need to aggregate predictive factors
to increase predictive power. When a pathologist makes a diagnosis based on
a tissue slide the pathologist is using a set of diagnostic factors, for example,
morphology and nuclear features. The task is relatively unambiguous because
by the time the disease is clinically expressed it is usually well advanced
and there is evidence of invasion. The predictive task is more difficult with
early, pre-malignant lesions. The ability of pathologists to diagnose current
or future clinical disease declines as we move earlier in the disease process.
This problem will continue with the movement toward molecular genetic diagnosis.
(Burke, 1996)
A predictive model is one or more predictive factors systematically related
to each other and to an outcome. There are many ways to systematically organize
factors. One common approach is to use a statistical method to relate one
or more predictive factors to an outcome. For example, the mathematical formula
generated by the logistic regression statistical method relates the predictive
factors (input variables), in terms of their β-coefficients, to a binary
disease outcome, e.g., recurrence, death, etc.
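As a minimal sketch of how such a formula is applied to one patient, the function below turns an intercept and a set of β-coefficients into a probability of the binary outcome. The coefficient values and the two example factors (tumor size and number of positive nodes) are hypothetical numbers chosen only for illustration; they are not estimates from any study cited here.

```python
import math

def logistic_probability(intercept, coefficients, factors):
    """Probability of the binary outcome for one patient under a fitted logistic model."""
    # Linear predictor: intercept plus the sum of beta_j * x_j over all factors.
    linear = intercept + sum(b * x for b, x in zip(coefficients, factors))
    # The logistic function maps the linear predictor into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical model: intercept -2.0, beta = 0.5 per cm of tumor size,
# beta = 0.3 per positive lymph node (illustrative values only).
p = logistic_probability(-2.0, [0.5, 0.3], [2.0, 1.0])
print(round(p, 3))
```

Each β-coefficient is the change in the log-odds of the outcome per unit change in its factor, which is why the factors enter the formula additively before the logistic transformation.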
It should be noted that the predictive power of a factor in a statistical
model is always dependent on the statistical method that was used to capture
its power and on the other factors included in the model. (Burke, 1996)
Because a particular model may not be efficient at capturing the power of
the factors, and because all the relevant factors may not be included in the
model, any statement of a factor's accuracy must include an explanation of
why a specific statistical model was chosen and why the other factors were
included in the model.
The primary descriptive methods for evaluating factors in cancer are: bins,
stages, and indexes (either as discrete endpoint models or as Kaplan-Meier
product-limit models). (Burke, 1993) The main inferential methods for combining
factors are: decision trees; and regression methods including logistic, proportional
hazards, and artificial neural networks. (Clark, 1994; Burke, 1995b)
Bins are the result of the mutually exclusive and exhaustive partitioning
of discrete variables. Each combination of variable values is a bin and every
patient is placed in the bin corresponding to their variable value combination.
An example is the TNM classification of ovarian cancer. Tumor location (T1a,
T1b, T1c, T2a, T2b, T2c, T3a, T3b, T3c), regional lymph node involvement (N0,
N1), and existence of metastases (M0, M1) produce thirty-six bins.
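The bin count can be checked by enumerating every combination of the category labels listed above. This is only an illustrative sketch; the names TUMOR, NODES, METASTASES, BINS, and assign_bin are ours and are not part of any staging standard.

```python
from itertools import product

# Category labels as listed in the text for the ovarian cancer example.
TUMOR = ["T1a", "T1b", "T1c", "T2a", "T2b", "T2c", "T3a", "T3b", "T3c"]
NODES = ["N0", "N1"]
METASTASES = ["M0", "M1"]

# Every combination of variable values is one bin; the bins are mutually
# exclusive and together cover every possible patient.
BINS = list(product(TUMOR, NODES, METASTASES))

def assign_bin(tumor, nodes, metastases):
    """Return the bin (a tuple of labels) that a patient falls into."""
    key = (tumor, nodes, metastases)
    if key not in BINS:
        raise ValueError(f"unknown variable values: {key}")
    return key

print(len(BINS))  # 9 * 2 * 2 combinations
```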
For discrete variables, if there are enough patients in each bin, it can be
shown that the frequency of the outcome in the population within each bin
is the best predictor of the true outcome. (Burke, 1993) In other words, no
prediction model can be more accurate than the bin model if the variables
are discrete and the population very large. Problems with bin models include:
1) Continuous variables must be parsed into discrete variables, almost always
resulting in a loss of predictive information and therefore a loss of accuracy.
2) As the number of discrete variables increases the number of bins increases
exponentially. For example, if we wish to add 3 grades to the TNM of ovarian
cancer, then the number of bins will increase to 108. In order to maintain
accuracy there must be a corresponding exponential increase in the size of
the patient population to fill each bin. 3) The proliferation of bins reduces
the ability to understand the phenomena. Since the main reason for creating
a bin model is usually for ease of understanding and ease of use, bin models
are rarely used in situations where there are more than two or three predictive
factors.
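A bin model can be sketched in a few lines: the prediction for a bin is simply the observed frequency of the outcome among the patients in that bin. The patient records below are made-up toy data, and fit_bin_model and bin_count are hypothetical helper names used only for this illustration.

```python
import math
from collections import defaultdict

def fit_bin_model(patients):
    """patients: iterable of (bin_key, outcome) pairs, with outcome 0 or 1.

    Returns a dict mapping each observed bin to its outcome frequency.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [events, patients]
    for key, outcome in patients:
        counts[key][0] += outcome
        counts[key][1] += 1
    return {key: events / total for key, (events, total) in counts.items()}

def bin_count(levels_per_variable):
    """Number of bins is the product of the number of levels of each variable."""
    return math.prod(levels_per_variable)

# Toy data (invented for illustration): four patients in one bin, two in another.
patients = [
    (("T1a", "N0", "M0"), 0), (("T1a", "N0", "M0"), 0),
    (("T1a", "N0", "M0"), 1), (("T1a", "N0", "M0"), 0),
    (("T3c", "N1", "M1"), 1), (("T3c", "N1", "M1"), 1),
]
model = fit_bin_model(patients)
print(model[("T1a", "N0", "M0")], model[("T3c", "N1", "M1")])
print(bin_count([9, 2, 2]), bin_count([9, 2, 2, 3]))
```

With only two patients, the estimate for the second bin is unreliable, which is the practical side of the requirement that every bin contain enough patients; bins with no patients get no prediction at all.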
A partial solution to some of the problems of a bin model is a stage model.
A stage model is the combining of bins into super-bins. The justification
for the grouping is the assumption that the factors selected are indexes of
the "stages" of the disease process and that the combined bins represent
a real stage in the disease process. For example, in breast cancer, the TNM
staging system combines forty TNM classification bins into six super-bins
(Stages I, IIA, IIB, IIIA, IIIB, IV) based on decreasing survival, and these
super-bins are termed the TNM staging system.
A small set of stages have the potential to maintain explanatory simplicity
and ease of use. Problems with stage models include: 1) The combining of bins
into super-bins/stages reduces predictive accuracy. 2) Stage systems do not
overcome the exponential increase in bins and in patients associated with
adding a variable to the staging system, they just delay the problem at the
cost of predictive accuracy. If the stages are held constant as variables
(and their associated bins) are added to the staging system, the potential improvement
in accuracy associated with the additional bins will be small to nonexistent.
But, if the stages are expanded to accommodate additional bins, the system
loses its ease of understanding and usefulness. Thus, attempts to improve
predictive accuracy by adding variables to a bin/stage model are rarely successful.
3) The problem of parsing continuous variables, with the resulting loss in
predictive accuracy, remains.
Indexes associate numerical scores (usually based on a bounded, linear scale)
with bins or groups of bins. The scores are parsed into discrete ranges, and
each range is associated with a disease stage (usually a severity of illness
system). Indexes offer some flexibility in the grouping of bins, but at the
cost of further degradation in predictive accuracy. The simplest example of
an index is the Apgar score.
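As an illustration of an index, the Apgar score sums five component ratings, each scored 0, 1, or 2, into a bounded total from 0 to 10, and the total is then read in ranges. The sketch below assumes the commonly quoted cut points (0-3, 4-6, 7-10); the function names are ours, and the ranges are returned as plain labels rather than clinical interpretations.

```python
def apgar_score(appearance, pulse, grimace, activity, respiration):
    """Sum the five component ratings (each 0, 1, or 2) into a 0-10 index."""
    components = (appearance, pulse, grimace, activity, respiration)
    if any(c not in (0, 1, 2) for c in components):
        raise ValueError("each component must be 0, 1, or 2")
    return sum(components)

def apgar_range(score):
    """Parse the score into one of three discrete ranges (assumed cut points)."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 7:
        return "7-10"
    if score >= 4:
        return "4-6"
    return "0-3"

score = apgar_score(2, 2, 1, 2, 2)
print(score, apgar_range(score))
```

Many different component combinations map to the same total, and many totals map to the same range, which is the loss of predictive information that the text attributes to indexes.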
Any bin, group of bins, stages, or scores can be compared, in terms of outcome,
with other bins, group of bins, stages or scores at the end of a single time
interval, across objective time intervals, or across a series of event time
intervals. These approaches usually deal with censoring by dropping censored
cases at the time interval in which they are censored. The most common descriptive
approach for comparing predictive factors across a series of event time intervals
is the Kaplan-Meier product-limit method (inferential methods that can accommodate
continuous variables and that usually require a proportional hazards assumption
will be discussed later with regression methods). A Kaplan-Meier plot should
always include confidence intervals around each line. A significant difference
in a Kaplan-Meier comparison is usually assessed by a log-rank test (which
assumes proportional hazards). It is important to note that there is currently
no widely accepted method for comparing the accuracy of two Kaplan-Meier comparisons
based on different stratifications of the same variables. The use of the log-rank
p-value to select one stratification over another is incorrect because the
log-rank test determines whether a factor stratification is likely to have
occurred by chance. Extreme stratifications may result in a smaller p-value
but they may also reduce predictive accuracy over the entire population.
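The product-limit calculation itself is short. The sketch below, written under the usual convention that a patient censored at time t is still counted as at risk at t, multiplies the survival estimate by (1 - deaths / at risk) at each event time. The follow-up times are invented toy values, and the sketch does not compute the confidence intervals that a published plot should include.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the outcome was observed at that time, 0 if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored cases leave the risk set after time t.
        at_risk -= leaving
    return curve

# Toy data: six patients, two of them censored (event = 0).
curve = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 0, 1, 1, 0, 1])
for t, s in curve:
    print(t, round(s, 4))
```

Censored patients contribute to the denominator up to the time they are censored and are then dropped, which is how this approach handles censoring.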
Decision trees split predictive factors to maximize predictive power using
a loss function such as the log-likelihood and a greedy search algorithm.
The most well known decision tree approach is the Classification and Regression
Trees (CART) recursive partitioning method (Breiman, 1984). Empirically, we
have never found decision trees to be the most accurate statistical method,
when compared to other regression methods. Their problems include the selection
of the correct loss function, difficulty dealing with continuous variables,
and overfitting when searching for the best predictors especially when there
are more than two or three splits.
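To make the idea of a greedy split concrete, the sketch below searches one continuous factor for the threshold that maximizes the binomial log-likelihood of a binary outcome in the two resulting groups. It is a simplified illustration of a single split, not the CART algorithm (which also grows the tree recursively and prunes it); the function names and the toy data are ours.

```python
import math

def log_likelihood(outcomes):
    """Binomial log-likelihood of 0/1 outcomes at their own observed frequency."""
    n = len(outcomes)
    k = sum(outcomes)
    if n == 0 or k == 0 or k == n:
        return 0.0  # a pure (or empty) group is fit perfectly
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def best_split(values, outcomes):
    """Greedy search over midpoints between adjacent distinct values."""
    best_threshold, best_score = None, -math.inf
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, outcomes) if x <= threshold]
        right = [y for x, y in zip(values, outcomes) if x > threshold]
        score = log_likelihood(left) + log_likelihood(right)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data: tumor size in cm and a binary outcome (invented values).
sizes = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
outcome = [0, 0, 0, 1, 1, 1]
threshold, score = best_split(sizes, outcome)
print(threshold, score)
```

Because the search always takes the locally best threshold on the data at hand, repeating it over many splits is what makes overfitting a concern.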
Univariate regression methods are usually not appropriate for deciding whether
a variable is or is not a predictive factor. These methods should not be used
to assert that a factor is predictive because a factor should be assessed
in the context of the other known factors. Further, some variables are only
predictive when they are interacting with other factors (for example, most
molecular genetic factors).
Logistic regression models the cumulative probability of a binary event occurring
by a specific time. It uses a maximum likelihood loss function and a greedy
search technique. It is a very efficient method for problems that have a binary
outcome (e.g., recurrence, survival). Its limitation is that it must span
a single time interval and does not distinguish when in the interval an event
occurred.
"Proportional hazards" methods include the Weibull, exponential,
and Cox. The Cox proportional hazards regression method (Cox, 1972) is the
most commonly used. All three methods assume that the hazard of each patient
is proportional to the hazards of all the other patients and that the degree
of each patient's hazard is related to their relative risk. The Cox model
cannot create empirical survival curves. For survival curves a baseline hazard
must be introduced, e.g., Cox-Breslow estimates. (Breslow, 1974) In cancer,
the proportional hazards assumption is often violated. Therefore, anyone using
a Cox model must demonstrate that proportional hazards holds for the factors
and outcome.
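The proportional hazards assumption can be written compactly. In the notation below, which is introduced here for illustration and is not taken from the cited sources, h_0(t) is the baseline hazard, x_i is the vector of factor values for patient i, and β is the vector of coefficients:

```latex
h_i(t) = h_0(t)\,\exp\!\left(\beta^{\top} x_i\right),
\qquad
\frac{h_i(t)}{h_j(t)} = \exp\!\left(\beta^{\top}(x_i - x_j)\right)
```

The ratio on the right does not depend on t, which is what "proportional" means: the hazard ratio between any two patients is constant over time. The Cox method estimates β without specifying h_0(t), which is why a separate baseline hazard estimate is needed before survival curves can be drawn.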
Molecular genetic factors, for example, p53, c-erbB-2 (Her-2/neu), pRB, exhibit
the properties of complex systems: they are nonlinear and they are interactional,
i.e., they act nonmonotonically and in concert with other molecular genetic
factors. (Steele, 1993; Loomis, 1995; Buratowski, 1995; Sauer, 1995) Thus,
capturing the factors as part of a complex system is critical to accurate
prediction of the behavior of the system. Artificial neural networks are capable
of capturing complex systems. (Burke, 1996)
The idea that learning can be viewed as the modification of information by
repetitively passing it through processing nodes originated in the late 1940's
as a way to model the physiology of neuronal processes. (Hebb, 1949) The operationalization
of this idea was called an artificial neural network. Gradually it became
apparent that this information theoretic approach to learning was very powerful
and very general; it was useful in, and applicable to, many learning situations.
Since statistics can be viewed as learning from the data, it is not unexpected
that this approach would be mathematically proved and operationalized within
the domain of statistics.
Artificial neural networks are universal approximators. It has been shown
that any real, continuous function can be approximated to any degree of precision
by a three-layer network with x in the input layer (patient variables), a
hidden layer with sigmoidal transfer functions, and one layer of output units,
as long as the hidden layer can be arbitrarily large. (Hornak, 1990, 1994)
Artificial neural networks, as a class of nonlinear regression and discrimination
statistical methods, are of proven value in many areas of medicine. (Baxt,
1995; Dybowski, 1995; Westenskow, 1992; Tourassi, 1993; Leong, 1992; Gabor,
1992; von Osdol, 1994; Burke, 1997) They do not require a priori information
regarding the phenomenon, they make no distributional assumptions, and with
the appropriate method to avoid overfitting (i.e., loss of generalization
by fitting the patterns in the training data too precisely), artificial neural
networks are usually at least as accurate as classical statistical models
and, depending on the complexity of the phenomena, can be more accurate. Artificial
neural networks have, for example, been shown to be more accurate than logistic
regression, CART (pruned or shrunk), and principal components analysis at
predicting five-year breast cancer specific survival. (Burke, 1995b)
In medical research, the most commonly used artificial
neural networks (ANN) are multilayer perceptrons that use backpropagation
training. Backpropagation consists of fitting the parameters (weights) of
the model by a criterion function, usually squared error or maximum likelihood,
using a gradient optimization method. In backpropagation artificial neural
networks, the error (the difference between the predicted outcome and the
true outcome) is propagated back from the output to the connection weights
in order to adjust the weights in the direction of minimum error. The usual
artificial neural network employed in medical research is composed of three
interconnected layers of nodes: an input layer with each input node corresponding
to a patient variable, a hidden layer, and an output layer. All nodes after
the input layer sum the inputs to them and use a transfer function (also known
as an activation function) to send the information to the adjacent layer nodes.
The transfer function is usually a sigmoid function such as the logistic.
The connections between the nodes have adjustable weights that specify the
extent to which the output of one node will be reflected in the activity of
the adjacent layer nodes. These weights, along with the connections among
the nodes, determine the output of the network. The output of the network is
a probability of the event for each patient.
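The description above can be illustrated with a minimal sketch of a three-layer network trained by backpropagation. It is not the implementation used in the cited studies. It assumes logistic transfer functions, a maximum likelihood (cross-entropy) criterion, and plain per-patient gradient descent; the function names, the learning rate, and the two-variable toy data set are all hypothetical.

```python
import math
import random


def logistic(z):
    """Sigmoid transfer (activation) function."""
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, seed=0):
    """Small random weights; the last entry of each weight list is a bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, x):
    """Return hidden-node activities and the predicted event probability."""
    hidden, output = net
    h = [logistic(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
         for ws in hidden]
    p = logistic(sum(w * hj for w, hj in zip(output[:-1], h)) + output[-1])
    return h, p


def log_loss(net, X, y):
    """Mean negative log-likelihood of the true outcomes."""
    total = 0.0
    for x, t in zip(X, y):
        p = min(max(forward(net, x)[1], 1e-12), 1.0 - 1e-12)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y)


def train(net, X, y, epochs=1000, lr=0.5):
    """Backpropagation: push the output error back to every weight."""
    hidden, output = net
    for _ in range(epochs):
        for x, t in zip(X, y):
            h, p = forward(net, x)
            # Error at the output node (gradient of the log-loss w.r.t. its input).
            delta_out = p - t
            # Error propagated back to each hidden node, computed before
            # the output weights are changed.
            delta_hid = [delta_out * output[j] * h[j] * (1.0 - h[j])
                         for j in range(len(h))]
            # Adjust the weights in the direction that reduces the error.
            for j in range(len(h)):
                output[j] -= lr * delta_out * h[j]
            output[-1] -= lr * delta_out
            for j, ws in enumerate(hidden):
                for i in range(len(x)):
                    ws[i] -= lr * delta_hid[j] * x[i]
                ws[-1] -= lr * delta_hid[j]
    return net


# Hypothetical toy data: two scaled patient variables, outcome 1 = event.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7], [0.3, 0.3], [0.7, 0.8]]
y = [0, 0, 1, 1, 0, 1]

net = init_network(n_inputs=2, n_hidden=3, seed=1)
loss_before = log_loss(net, X, y)
train(net, X, y)
loss_after = log_loss(net, X, y)
print(loss_before, loss_after)
```

The sketch fits and evaluates on the same six patients only to show the mechanics. In practice the weights would be fit on a training set and the predictions assessed on a separate test set, with a method to avoid overfitting.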
How can I make sure that the method I have used to combine factors is a
good one?
In order to assess and compare statistical models, it is necessary to
distinguish between significance, accuracy, and importance. Significance
means that the predictions of a trained statistical method (i.e., a statistical
model) or of a predictive factor are unlikely to be due to chance, as judged by
a test (e.g., the chi-square test) at a prespecified level. Significance
is not necessarily accuracy. (Burke, 1998a) Accuracy is the association between
the model's individual patient outcome predictions (the predicted outcome)
and the individual outcomes of the test population (the true outcome). The
importance of a factor or a model is based on whether the model or the factor
possesses sufficient accuracy to be useful in answering a particular clinical
question. Finally, the assessment of a model's or factor's significance, accuracy,
and importance must be based on test data set results, not on training data
set results.
There are several approaches to assessing the accuracy of a multivariate model
and to comparing multivariate models (e.g., Goodman and Kruskal's Gamma,
Kendall's Tau). The area under the receiver operating characteristic
curve (Az) is the best currently available measure of predictive accuracy.
(Swets, 1996) It can be used to assess and compare the adequacy of statistical
models. Az can be directly calculated from Somers' D (Somers, 1962) or it can
be approximated by its trapezoidal area. (Bamber, 1975) The area under the
curve is a nonparametric measure of discrimination. The receiver operating
characteristic area is independent of both the prior probability of each outcome
and the threshold cutoff for categorization. Its computation requires only
that the prediction method produce an ordinal-scaled relative predictive score.
In terms of mortality, the receiver operating characteristic area estimates
the probability that the prediction method will assign a higher mortality
score to the patient who died than to the patient who lived. The receiver
operating characteristic area varies from zero to one. When the predictions
are unrelated to survival, the score is 0.5, indicating chance accuracy. The
farther the score is from 0.5 the better, on average, the prediction method
is at predicting which of two patients with different outcomes will be alive.
Significant differences in the receiver operating characteristic areas between
two models can be tested following Hanley and McNeil (Hanley, 1982), by calculating
their asymptotic variances, or calculating the empirical variance using the
bootstrap method. (Efron, 1993)
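The pairwise interpretation of the receiver operating characteristic area can be sketched directly in code. The example below is illustrative only: the function names, the scores, and the outcomes are hypothetical, ties are counted as one half, and the relationship D = 2 x Az - 1 is the usual one between Somers' D and the area for a two-valued outcome. The bootstrap function estimates the standard error of the difference between two models' areas by resampling patients; it is a simple stand-in and not the Hanley and McNeil procedure.

```python
import random
import statistics


def roc_area(scores, died):
    """Probability that a patient who died received a higher score than a
    patient who lived, taken over all such pairs (ties count one half)."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("both outcomes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def somers_d(scores, died):
    """Somers' D for a two-valued outcome, rescaled from the area."""
    return 2.0 * roc_area(scores, died) - 1.0


def bootstrap_area_difference(scores_a, scores_b, died, n_boot=1000, seed=0):
    """Observed difference in areas between two models scored on the same
    patients, with a bootstrap estimate of its standard error."""
    rng = random.Random(seed)
    n = len(died)
    observed = roc_area(scores_a, died) - roc_area(scores_b, died)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        d = [died[i] for i in idx]
        if all(d) or not any(d):
            continue  # a resample needs both outcomes to define the area
        a = roc_area([scores_a[i] for i in idx], d)
        b = roc_area([scores_b[i] for i in idx], d)
        diffs.append(a - b)
    return observed, statistics.stdev(diffs)


# Hypothetical test-set outcomes (1 = died) and mortality scores from two models.
died = [1, 1, 1, 0, 0, 0, 0, 1]
scores_a = [0.9, 0.8, 0.4, 0.3, 0.2, 0.5, 0.1, 0.7]
scores_b = [0.6, 0.4, 0.5, 0.5, 0.3, 0.6, 0.2, 0.4]

print(roc_area(scores_a, died))   # 0.9375 (15 of 16 pairs ordered correctly)
print(somers_d(scores_a, died))   # 0.875
print(bootstrap_area_difference(scores_a, scores_b, died, n_boot=200, seed=1))
```

Because the area depends only on the ordering of the scores, any monotone rescaling of a model's output leaves it unchanged, which is why the computation needs no threshold and no prior probability.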
What should I look for when reading about predictive factors?
There is a great deal of variation in the reporting of predictive factor
results. This variability makes it difficult to understand empirical results
and to replicate and validate predictive factor research. A report regarding
the discovery of a new prognostic factor or the validation of an existing
factor should contain the following: (Burke, 1998a)
1. The name of the disease and the necessary and sufficient criteria used to
diagnose the disease.
2. Where in the disease process the collected patient population is (e.g., early
detected disease) and the patient inclusion and exclusion criteria.
3. The name and complete description of the prognostic factor.
4. The type of prognostic factor (i.e., natural history, therapy specific, or
post therapy) it is.
5. The outcome that was selected and a specific time interval (e.g., five-year
breast cancer-specific survival); except in special situations, it should not
be a "lifetime" time interval.
6. When the prognostic factor was collected in the patient population (e.g., at
disease discovery, prior to therapy, after therapy).
7. The specific laboratory method used to assess the factor (e.g.,
immunohistochemistry) and why that method rather than another method was selected.
8. If the prognostic factor is stratified, how the cut-point was selected and
whether it was replicated in another population. If the variable value is based
on a rater's judgment, then Cohen's kappa should be reported.
9. Relevant characteristics of the data set, including the data set size,
population characteristics, and the number of events (outcomes).
10. If any patients received a therapy, the type of therapy should be reported
and the patients should be stratified by therapy for all analyses.
11. The numerical estimate (usually a parameter estimate) of the finding and its
confidence interval should be provided.
12. The level of significance should be set. If the investigators have not looked
at the data, then a p-value of < 0.05 is usually acceptable. For multiple tests
or data exploration an adjustment may be required.
13. The type of multivariate statistical method (e.g., logistic regression, Cox)
used, why it was used, and any assumption tests (for example, proportional
hazards) should be provided.
14. If all the other relevant prognostic factors were not included in the
multivariate model, which were left out and why.
15. The predictive accuracy of the multivariate model should be assessed (e.g.,
area under the receiver operating characteristic curve, R2), and the reason why
a particular method was used should be explained.
16. The accuracy estimates of the multivariate model, including either standard
errors or confidence intervals for the estimates (e.g., Az = 0.75, CI = 0.50 - 1.0),
must be provided.
References
Altman DG, Lyman GH. Methodologic challenges in the evaluation of prognostic
factors in breast cancer. Breast Cancer Res Treat 1998a;52:289-303.
Altman DG. Suboptimal analysis using 'optimal' cutpoints. Br J Cancer 1998;78:556-7.
Bamber D. The area above the ordinal dominance graph and the area below the
receiver operating graph. J Math Psych 1975;12:387-415.
Baxt WG. Application of artificial neural networks to clinical medicine. Lancet
1995;346:1135-38.
Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression
Trees. Pacific Grove, CA: Wadsworth and Brooks, 1984.
Breslow NE. Covariance analysis of censored survival data. Biometrics 1974;30:89-99.
Buratowski S. Mechanisms of gene activation. Science 1995;270:1773-4.
Bucher HC, Guyatt GH, Cook DJ, Holbrook A, McAlister FA. Users' guides to the
medical literature: XIX. Applying clinical trial results. A. How to use an
article measuring the effect of an intervention on surrogate end points. JAMA
1999;282:771-78.
Burke HB, Henson DE. Criteria for prognostic factors and for an enhanced prognostic
system. Cancer 1993;72:3131-35.
Burke HB. Increasing the power of surrogate endpoint biomarkers: the aggregation
of predictive factors. J Cell Biochem 1994;19S:278-82.
Burke HB, Hutter RVP, Henson DE. Breast Carcinoma. In: P. Hermanek, M.K. Gospodarowicz,
D.E. Henson, R.V.P. Hutter, L.H. Sobin (eds.), UICC Prognostic Factors in Cancer.
Berlin: Springer-Verlag, 1995a, 165-76.
Burke HB, Rosen DB, Goodman PH. Comparing the prediction accuracy of artificial
neural networks and other statistical models for breast cancer survival. In:
G. Tesauro, D.S. Touretzky, T.K. Leen (eds.), Advances in Neural Information
Processing Systems 7. Cambridge, MA: MIT Press, 1995b, 1063-67.
Webmaster@CancerHome.com © Copyright 2000, Cancer Home Inc. All Rights Reserved.