Survival analysis for white non-Hispanic female breast cancer patients.

BACKGROUND
Race and ethnicity are significant factors in predicting survival time of breast cancer patients. In this study, we applied advanced statistical methods to predict the survival of White non-Hispanic female breast cancer patients, who were diagnosed between the years 1973 and 2009 in the United States (U.S.).


MATERIALS AND METHODS
Demographic data from the Surveillance, Epidemiology, and End Results (SEER) database were used for this study. Nine states were randomly selected from 12 U.S. cancer registries. A stratified random sampling method was used to select 2,000 female breast cancer patients from these nine states. We compared four types of advanced statistical probability models to identify the best-fit model for the White non-Hispanic female breast cancer survival data. Three model-building criteria were used to measure and compare the goodness of fit of the models: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Deviance Information Criterion (DIC). In addition, we used a novel Bayesian method and the Markov Chain Monte Carlo technique to determine the posterior density function of the parameters. After evaluating the model parameters, we selected the model having the lowest DIC value. Using this Bayesian method, we derived the predictive survival density for future survival time and its related inferences.


RESULTS
The analytical sample of White non-Hispanic women included 2,000 breast cancer cases from the SEER database (1973-2009). The majority of cases were married (55.2%), the mean age at diagnosis was 63.61 years (SD = 14.24), and the mean survival time was 84 months (SD = 35.01). After comparing the four statistical models, the results suggested that the exponentiated Weibull model (DIC = 19818.220) provided the best fit for the White non-Hispanic female breast cancer survival data. Using precise estimates of the model parameters, this model predicted the survival times (in months) for White non-Hispanic women.


CONCLUSIONS
By using modern model-building criteria, we determined that the data best fit the exponentiated Weibull model. We incorporated precise estimates of the parameters into the predictive model and evaluated the survival inference for the White non-Hispanic female population. This method of analysis will assist researchers in drawing scientific and clinical conclusions when assessing the survival time of breast cancer patients.


Introduction
Accounting for 14.3% of all cancer-related deaths worldwide, breast cancer is also the most fatal cancer among females aged 20-59 (Ferlay et al., 2013; WHO, 2013b).
Globally, projections for new breast cancer cases in the year 2020 indicate an increase of 18%; in the United States, a 14.4% increase is projected (Ferlay et al., 2013). It has been suggested that from 2000 to 2009, the overall annual percentage change (APC) in breast cancer incidence increased in the U.S. For women aged 40-49, most racial and ethnic groups showed an overall APC increase of 1.1% (p<0.001) (Hou and Huo, 2013).
Breast cancer incidence is increasing among women in the U.S. In the year 2012, an estimated 226,870 new breast cancer cases and 39,510 deaths due to breast cancer occurred (ACS, 2013; Siegel et al., 2014). In the year 2010, breast cancer patients in the U.S. spent around $16.5 billion on care and medical services, which is approximately 13% of the total estimated national expenditure on cancer care. It is projected that this figure will reach $23.34 billion in the year 2020, making it the second largest expenditure in medical care for cancer (Mariotto et al., 2011; Montero et al., 2012). About 14% of women in the U.S. are expected to develop breast cancer in their lifetime, and the associated costs of care will affect their survival (ACS, 2013). Among these women, breast cancer affects ethnicities at different rates (DeSantis et al., 2011). A higher incidence of 417 cases per 100,000 is reported among White females compared to other racial groups (White et al., 2013).
The survival of breast cancer patients also depends on factors such as genetics, age at diagnosis, stage of the cancer, access to care, weight, physical activity status, alcohol consumption, disease co-morbidities, social, economic, and environmental factors, and ethnicity (Graeser et al., 2009; Kwan et al., 2010; Protani et al., 2010; Peairs et al., 2011; Sprague et al., 2011; ACS, 2013). Screening guidelines have also evolved based on research findings correlating breast cancer screening and survival times. Presently, it is recommended that women between ages 20 and 39 complete a clinical breast examination (CBE) every 3 years. Those who are asymptomatic but aged 40 years or older are recommended to receive a CBE every year (Robertson et al., 2011; Smith et al., 2013). Women who have a history of breast cancer in their families should start screening on a regular basis before age 40. The most recognizable signs and symptoms of breast cancer often appear in the later stages of the disease, making it imperative to detect, diagnose, and treat breast cancer early (Walker et al., 2013).
The Centers for Disease Control and Prevention (CDC) reported that between 1999 and 2010, White women had the highest incidence of breast cancer and the third highest mortality rate from the disease (CDC, 2012). However, White women have a higher five-year survival rate than other racial groups. Death rates among White women declined 2% from 1997 to 2007; this decline is not found in other racial or ethnic groups (DeSantis et al., 2011). Research shows that White women older than age 40 had higher rates of breast cancer than Black women in the same age group (Clarke et al., 2012).
According to the American Cancer Society, age-adjusted breast cancer incidence varies greatly from state to state. For example, incidence may range from 47.79 cases per 100,000 in Connecticut to 20.25 cases in rural Georgia, and death rates range from 27.6 per 100,000 in Alaska to 17.5 per 100,000 in New Hampshire (ACS, 2013). The contrast in incidence and mortality rates by race, ethnicity, and other determinants demonstrates a need for statistical modeling to predict survival times. Patients diagnosed with breast cancer visit clinics, healthcare units, and hospitals to receive modern treatments to improve their prognosis. Advancements in modern technology can help patients estimate their survival time. In addition, there is a strong demand for new uses of statistical analysis to facilitate new discoveries, diagnosis, and treatment planning.
The objectives of this paper are: i) to analyze demographic variables of the selected sample; ii) to demonstrate that the breast cancer survival data follow a specific probability model by using model selection criteria for goodness-of-fit tests; iii) to perform the Bayesian analysis of the posterior distribution for the parameters; and iv) to derive Bayesian survival inference for future times by using the best-fit model.

Materials and Methods
We used breast cancer data (N=657,712) from the Surveillance, Epidemiology, and End Results website (SEER, 2010). The data contain information on breast cancer patients from 1973 to 2009 and cover 12 cancer registries among the 50 U.S. states. A stratified random sampling method was used to select nine of the 12 available states to provide a representative sample of White non-Hispanic categories (Figure 1). The total SEER data included 4,269 males and 653,443 females. Among the 608,032 total females, 22,639 were White Hispanic, and 531,562 were White non-Hispanic. Men were not included in this study due to the small chance (0.70% of total patients) of breast cancer occurring within this group.

Selected patients and their demographic characteristics
A random sample of 2,000 White non-Hispanic cancer cases was included in the data analysis representing nine states in the U.S. (Figure 1). Simple random sampling (SRS) methodology was used to select a representative sample of patients and to minimize selection bias.
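The two-stage selection described above (choose nine of 12 registries, then draw 2,000 cases by simple random sampling) can be sketched as follows. This is only one reading of the design: the record layout, the registry labels, and the function name `two_stage_sample` are hypothetical, and the data are synthetic rather than SEER records.

```python
import random

def two_stage_sample(records, n_states, n_patients, seed=0):
    """Pick n_states registries at random, then draw a simple random
    sample (without replacement) of n_patients from those registries."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    states = sorted({r["state"] for r in records})
    chosen = set(rng.sample(states, n_states))
    pool = [r for r in records if r["state"] in chosen]
    return chosen, rng.sample(pool, n_patients)

# Synthetic illustration: 12 hypothetical registries with 500 records each.
records = [{"state": f"S{s:02d}", "id": i}
           for s in range(12) for i in range(500)]
chosen, sample = two_stage_sample(records, n_states=9, n_patients=2000)
```

Sampling without replacement within the pooled registries gives every eligible case the same chance of selection, which is the property that simple random sampling relies on to limit selection bias.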
Health care professionals use various statistical probability modeling techniques to determine the prognosis of cancer patients, often drawing on data stored in cancer registries or databases. Survival data taken from these databases follow several statistical probability models, for example the exponential, Weibull, Exponentiated Exponential (EE), Exponentiated Weibull (EW), Beta Generalized Exponential (BGE), and Beta Inverse Weibull (BIW) models, among others. For accurate predictions, it is imperative that the data fit the appropriate model. Different racial and ethnic groups may follow different distribution patterns, so it is important to use statistical methodologies to draw clinical inferences. Khan et al. (2014a) discussed in detail four types of statistical probability models: the EE model (EEM), the EW model (EWM), the BGE model (BGEM), and the BIW model (BIWM), which are used briefly in this study. The EEM has two parameters, shape (α>0) and scale (λ>0) (Khan et al., 2014b). The exponentiated Weibull model has three parameters: α>0 and β>0 are the shape parameters, and λ>0 is the scale parameter (Khan et al., 2014c). The beta generalized exponential model has four parameters: the shape parameter α>0, the scale parameter λ>0, and two additional parameters, a>0 and b>0, which vary the tail weight and introduce skewness (Barreto-Souza et al., 2010). The beta inverse Weibull (BIW) model is another type of statistical probability model, where β is the shape parameter and two extra parameters, a>0 and b>0, introduce skewness and tail weight (Khan et al., 2014c; 2014d). To explore the posterior probability of the parameters of the EEM, EWM, BGEM, and BIWM, an innovative Bayesian method may be used to obtain posterior inference.
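To make the three-parameter exponentiated Weibull model concrete, the sketch below codes its distribution, density, and survival functions. The text does not state which parameterization was used, so the form F(x) = [1 − exp(−λx^β)]^α is an assumption here (one common choice), and the function names are illustrative.

```python
import math

def ew_cdf(x, alpha, beta, lam):
    """Assumed form: F(x) = (1 - exp(-lam * x**beta)) ** alpha for x > 0."""
    if x <= 0:
        return 0.0
    return (1.0 - math.exp(-lam * x ** beta)) ** alpha

def ew_pdf(x, alpha, beta, lam):
    """Derivative of ew_cdf with respect to x."""
    if x <= 0:
        return 0.0
    z = lam * x ** beta
    return (alpha * beta * lam * x ** (beta - 1.0)
            * math.exp(-z) * (1.0 - math.exp(-z)) ** (alpha - 1.0))

def ew_survival(x, alpha, beta, lam):
    """Probability of surviving beyond time x."""
    return 1.0 - ew_cdf(x, alpha, beta, lam)

# Arbitrary illustrative values, not estimates from the study.
p_beyond_60 = ew_survival(60.0, alpha=2.0, beta=1.2, lam=0.01)
```

Setting α = 1 recovers the ordinary Weibull distribution under this parameterization, which is why the exponentiated form can only fit as well as or better than the Weibull in terms of likelihood.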
In healthcare research, Bayesian statistics has become more popular because of its use of parametric, model-based inference and its applicability to clinical diagnostics, potentially improving the field of translational research. In Bayesian estimation, both data and model parameters are treated as random variables: the data are "observed" variables and the parameters are "unobserved" variables. The joint posterior distribution of the parameters is proportional to the product of the likelihood and the prior. The likelihood depends on the model of the underlying process: it is the conditional distribution that specifies the probability of the observed data given any values of the parameters. Together, the prior and the likelihood contain all the available information about the parameters, and the resulting joint distribution can be manipulated in many ways to make inferences about the parameters given the data. Given a set of observed data, Bayesian inference yields the posterior distribution of the parameters, which allows population predictions when applied to datasets. For more information regarding the Bayesian method and its inference, readers are referred to other works (Khan et al., 2012a; 2012b; 2013a; 2013b).
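As a small illustration of "posterior is proportional to likelihood times prior", the sketch below evaluates the posterior of a single rate parameter on a grid. It deliberately uses a one-parameter exponential survival model with a flat prior instead of the models studied here, and the survival times are made up; the function name `grid_posterior` is hypothetical.

```python
import math

def grid_posterior(times, grid, log_prior):
    """Normalized posterior weights over a grid of rate values for an
    exponential survival model: posterior is proportional to likelihood x prior."""
    n, total = len(times), sum(times)
    # Exponential log-likelihood: n*log(rate) - rate*sum(times).
    log_post = [n * math.log(lam) - lam * total + log_prior(lam) for lam in grid]
    peak = max(log_post)  # subtract the peak to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]

times = [30.0, 55.0, 80.0, 120.0]           # made-up survival times (months)
grid = [0.0005 * k for k in range(1, 201)]  # candidate rates 0.0005 .. 0.1
post = grid_posterior(times, grid, log_prior=lambda lam: 0.0)  # flat prior
post_mean = sum(lam * w for lam, w in zip(grid, post))
```

Grid evaluation becomes impractical beyond a few parameters, which is one reason multi-parameter models are usually handled with Markov Chain Monte Carlo instead.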
SPSS software (IBM SPSS, 2011) was used to compute descriptive statistics. Data were extracted from nine of the 12 states, and a geographic map of White non-Hispanic female cases was drawn using Google Fusion Tables (Gonzalez et al., 2010). Mathematica version 8.0 (Wolfram Research, 2012), an advanced computational software package, was used to produce a graphical representation of the predictive density for a single future survival time for the selected sample and to obtain additional predictive inferences for the survival times. WinBUGS software (MRC Biostatistics Unit, 2013) was used to assess goodness of fit, to summarize the posterior parameters, and to execute related calculations.

Results
Frequency of the sample of breast cancer cases ranged from 366 (Connecticut) to 25 (Hawaii), and thus the percentage of White non-Hispanic females with breast cancer cases in the analysis varied from 1.3% to 18.3% for various states (Table 1). The second highest percentage of selected patients was observed from Washington and the next highest from Michigan.
The quartiles of age at diagnosis were 52, 63, and 75 years, respectively (Table 2). The mean and median age at diagnosis were close, suggesting an approximately normal distribution. The median survival time was 87 months, and survival ranged from 38 to 160 months for White non-Hispanic females. The majority of the females in the analytical sample were married (55%).
Khan et al. (2014b) discussed in detail three model selection criteria: the Akaike Information Criterion (AIC), the Deviance Information Criterion (DIC), and the Bayesian Information Criterion (BIC). The DIC, a measure of fit, is widely used to compare different models. The Markov Chain Monte Carlo (MCMC) method is used to obtain the posterior distribution of the parameters for the sample. DIC values can be either positive or negative; however, the model with the smaller value is considered better. As with the DIC and AIC, the model with the lower BIC value is considered the better of any two estimated models. The BIC is an asymptotic result derived under the assumption that the data distribution belongs to an exponential family. We used the log-likelihood functions of the models in WinBUGS (MRC Biostatistics Unit, 2013) and applied them to the White non-Hispanic survival data.
The AIC, BIC, and DIC values for the EE, EW, BGE, and BIW models are summarized in Table 3. Comparing the estimated AIC, BIC, and DIC values across the models, the EWM fits the survival times best because it produces the smallest values of AIC and DIC. Table 4 presents the summary results (mean, SD, MC error, median, and confidence intervals) of the parameters for the best-fit exponentiated Weibull model for White non-Hispanic female breast cancer cases. Figure 2 displays the graphical representation of the parameters.
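The three criteria discussed above follow standard textbook definitions, sketched below: AIC = 2k − 2 log L, BIC = k log n − 2 log L, and DIC = D̄ + p_D with p_D = D̄ − D(θ̄), where D is the deviance. The log-likelihoods and model labels in the example are invented for illustration and are not values from Table 3.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion for k parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik

def dic(deviance_draws, deviance_at_posterior_mean):
    """Deviance Information Criterion from MCMC output.
    deviance_draws: deviance (-2 log L) at each posterior draw.
    deviance_at_posterior_mean: deviance at the posterior mean parameters."""
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean  # effective number of parameters
    return d_bar + p_d

# Hypothetical maximized log-likelihoods: name -> (log L, number of parameters).
candidates = {"two-parameter": (-500.0, 2), "three-parameter": (-480.0, 3)}
n = 100
scores = {name: (aic(ll, k), bic(ll, k, n)) for name, (ll, k) in candidates.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
```

All three criteria trade fit against complexity, so an extra parameter is only rewarded when it improves the log-likelihood by more than its penalty.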
All the parameters produced skewed distributions. When plotted, alpha and beta are negatively skewed, and lambda is positively skewed. The range (95% CI) of these parameter values is described in Table 4. After 50,000 iterations, the kernel density appears smooth.

Survival inference
We developed a predictive survival model using the results of the best-fit model to describe the survival time of White non-Hispanic women with breast cancer. Based on the model selection criteria in Table 3, the data follow the exponentiated Weibull model. Applying the Bayesian survival model, we assume that the data set $x = (x_1, \ldots, x_n)$ represents the survival days of $n$ White non-Hispanic female breast cancer cases that follow the exponentiated Weibull model, and let $y$ be a future survival time. Then, following Khan et al. (2011), the predictive density of $y$ given the observed White non-Hispanic survival data $x$ is
$$p(y \mid x) = \int\!\!\int\!\!\int p(y \mid \alpha, \beta, \lambda)\, p(\alpha, \beta, \lambda \mid x)\, d\lambda\, d\beta\, d\alpha,$$
where $p(\alpha, \beta, \lambda \mid x)$ is the posterior density function and $p(y \mid \alpha, \beta, \lambda)$ is the probability density function of a future survival time $y$ under the best-fit exponentiated Weibull model.
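When posterior draws of (α, β, λ) are available from MCMC, the triple integral above is commonly approximated by averaging the model density over those draws. The sketch below shows that Monte Carlo approximation; it is not the Mathematica computation used in the study. The density form F(x) = [1 − exp(−λx^β)]^α is an assumed parameterization, and the three draws are hypothetical placeholders for real sampler output.

```python
import math

def ew_pdf(y, alpha, beta, lam):
    """Exponentiated Weibull density under the assumed parameterization."""
    if y <= 0:
        return 0.0
    z = lam * y ** beta
    return (alpha * beta * lam * y ** (beta - 1.0)
            * math.exp(-z) * (1.0 - math.exp(-z)) ** (alpha - 1.0))

def predictive_density(y, draws):
    """Monte Carlo estimate of p(y | x): the average of p(y | alpha, beta, lam)
    over posterior draws (alpha, beta, lam)."""
    return sum(ew_pdf(y, a, b, l) for a, b, l in draws) / len(draws)

# Hypothetical posterior draws; a real analysis would use thousands.
draws = [(7.1, 1.09, 0.019), (7.3, 1.10, 0.018), (7.25, 1.092, 0.0189)]
density_at_84 = predictive_density(84.0, draws)
```

Because each term in the average is itself a proper density, the estimate integrates to one for any number of draws; more draws only reduce the Monte Carlo noise.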
A graphical representation of the predictive density, based on the survival times of White non-Hispanic cases, is shown in Figure 3. The predictive density for the survival times appears to be unimodal and positively skewed, ranging from 28.629 to 170.4252 (95% CI). Table 5 contains summary statistics for future survival times of the patients. We identified that the survival times are higher for future patients compared with existing diagnosed patients. We obtained the raw and corrected moments for the survival inference for future White non-Hispanic female breast cancer patients. Since the kurtosis is <3, the future survival times follow a platykurtic distribution; that is, the distribution is flatter than a normal distribution, with a broader peak and lighter tails. Since the skewness is >0, the majority of the data falls to the left of the mean, with extreme values to the right.
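Skewness and kurtosis follow from the raw moments through the corrected (central) moments. The sketch below applies the standard conversion formulas; the function name is illustrative, and the worked example uses the uniform distribution on (0, 1), not the values in Table 5.

```python
def shape_from_raw_moments(m1, m2, m3, m4):
    """Convert raw moments E[Y^k], k = 1..4, into the variance,
    skewness, and (non-excess) kurtosis via central moments."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    skewness = mu3 / mu2 ** 1.5
    kurtosis = mu4 / mu2 ** 2  # equals 3 for a normal distribution
    return mu2, skewness, kurtosis

# Uniform(0, 1) has raw moments E[Y^k] = 1 / (k + 1).
variance, skewness, kurtosis = shape_from_raw_moments(1 / 2, 1 / 3, 1 / 4, 1 / 5)
```

For the uniform example the kurtosis falls below 3, so it is platykurtic in the same sense as the predictive distribution described above, while its skewness is zero because the distribution is symmetric.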

Discussion
For White non-Hispanic women diagnosed with breast cancer between 1973 and 2009 in the U.S., several statistical models were compared to identify the best fit for the breast cancer survival data. The sample consisted of 2,000 White non-Hispanic women; stratified random sampling was used at the state level, and simple random sampling was used within the nine states.
The mean (SD) age at diagnosis for breast cancer cases was 63.31 (14.24) years, with age 15 being the minimum age at diagnosis for White non-Hispanic women. Among women diagnosed with breast cancer, mortality rates are highest for those aged 50 years or older (SEER, 2010). Survival time ranged from 38 to 160 months, with a mean (SD) of 84.17 (35.01) months. The majority of these cases were married.
To speed up computation of the Bayesian posterior parameters and to draw their corresponding dynamic kernel densities, a reparameterization method was used for the exponentiated Weibull model. After running 50,000 Monte Carlo iterations, which yielded negligible MC errors, we obtained posterior inference for the parameters.
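The text does not describe the reparameterization or the sampler settings, so the following is only a generic illustration of the idea. It runs a random-walk Metropolis sampler on θ = log(rate) for a one-parameter exponential survival model with a flat prior on the rate; this is a stand-in for, not a reproduction of, the exponentiated Weibull fit in WinBUGS. The data summary, step size, and iteration count are made up.

```python
import math
import random

def log_post_theta(theta, n, total):
    """Log posterior of theta = log(rate) for n exponential survival times
    summing to `total`, under a flat prior on the rate. The extra +theta
    (n + 1 instead of n) is the Jacobian of the log transform."""
    return (n + 1) * theta - math.exp(theta) * total

def metropolis(n, total, iters=20000, step=0.8, seed=1):
    """Random-walk Metropolis on the log scale; returns draws of the rate."""
    rng = random.Random(seed)
    theta = math.log(n / total)  # start at the maximum-likelihood rate
    current = log_post_theta(theta, n, total)
    draws = []
    for _ in range(iters):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_post_theta(proposal, n, total)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(math.exp(theta))  # transform back to the rate scale
    return draws

draws = metropolis(n=4, total=285.0)  # made-up data summary
kept = draws[2000:]                   # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

Sampling on the log scale removes the positivity constraint, so every proposal is valid and the chain does not waste iterations on rejected negative values.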
Given the breast cancer survival model, we were able to determine the inference for posterior parameters using the Bayesian method. By using the Markov Chain Monte Carlo method, the inferences for the posterior parameters for the best-fit model are reported for White non-Hispanic females.
Based on the goodness-of-fit analysis, the breast cancer survival sample for White non-Hispanic women followed the exponentiated Weibull (EW) distribution. The lowest DIC value was 19818.220, indicating the best goodness of fit. Under the EW distribution, the mean (SD) values for α, β, and λ are 7.25 (0.1189), 1.092 (0.00643), and 0.0189 (5.34×10⁻⁴), respectively. The dynamic kernel density for each of the parameters is reported for White non-Hispanic females so that one can observe the shape of the kernel density. All parameters displayed skewed distributions.
For the survival inference, the best-fit statistical survival model and the Bayesian method were used to derive a predictive model for a single future survival time. A summary table of the predictive mean, standard error, and 95% future survival intervals is provided on the basis of the predictive density. According to the results, the shape of the future survival model for White non-Hispanic women is positively skewed. Figure 3 shows the graphical representation of White non-Hispanic female future survival times using the exponentiated Weibull distribution in the Bayesian method. Higher survival times are identified for future White non-Hispanic patients compared to the existing survival times. For this ethnic group, we report the predictive raw and corrected moments, predictive skewness, and kurtosis for future survival time in Table 5. The model is able to predict survival times accurately within 90%-99% confidence intervals while taking into account multiple parameters.
We identified a data-based statistical probability model from the 1973-2009 SEER database to demonstrate the effectiveness of predicting breast cancer survival for White non-Hispanic women. Statistical probability models are important for estimating posterior model parameters, predicting survival times within an ethnic group, and describing inferences for observations. To determine the best-fitting model, goodness-of-fit measures are imperative in selecting among statistical probability models for survival samples of an ethnic group. The AIC, BIC, and DIC model selection criteria were used to develop the statistical probability model for this ethnic group.
These findings will help healthcare researchers and practitioners predict a patient's possible survival time given the patient's current state and medical history. The findings may therefore improve knowledge and support scientific discovery and innovation. This may improve the diagnosis and treatment of breast cancer cases within the United States and the world.