Bayesian Method for Modeling Male Breast Cancer Survival Data

BACKGROUND
With recent progress in health science administration, a huge amount of data has been collected from thousands of subjects. Statistical and computational techniques are very necessary to understand such data and to make valid scientific conclusions. The purpose of this paper was to develop a statistical probability model and to predict future survival times for male breast cancer patients who were diagnosed in the USA during 1973-2009.


MATERIALS AND METHODS
A random sample of 500 male patients was selected from the Surveillance Epidemiology and End Results (SEER) database. The survival times for the male patients were used to derive the statistical probability model. To measure the goodness of fit tests, the model building criterions: Akaike Information Criteria (AIC), Bayesian Information Criteria (BIC), and Deviance Information Criteria (DIC) were employed. A novel Bayesian method was used to derive the posterior density function for the parameters and the predictive inference for future survival times from the exponentiated Weibull model, assuming that the observed breast cancer survival data follow such type of model. The Markov chain Monte Carlo method was used to determine the inference for the parameters.


RESULTS
The summary results of certain demographic and socio-economic variables are reported. It was found that the exponentiated Weibull model fits the male survival data. Statistical inferences of the posterior parameters are presented. Mean predictive survival times, 95% predictive intervals, predictive skewness and kurtosis were obtained.


CONCLUSIONS
The findings will hopefully be useful in treatment planning, healthcare resource allocation, and may motivate future research on breast cancer related survival issues.


Introduction
According to the World Health Organization (WHO), breast cancer accounts for 519,000 deaths worldwide.Surprisingly, 69% of those deaths were in developing countries, refuting the belief that breast cancer is a disease of developed countries (WHO, 2004).The symptoms of breast cancer include presence of a lump, which is commonly pea sized, in or near the breast or in the underarm that changes the appearance of the overlying skin and causes pain (Merck Manual of Diagnosis and Therapy, 2012).The complications of the breast cancer include chest pain, mastitis, metastasis, nipple discharge, and side effects of radiation therapy and chemotherapy (Hurria, 2012).The metastasis of the breast cancer occurs via hematogenous spread or lymphatic spread, generally to lung, pleura, liver, bone, adrenal glands, gastro intestinal track, pancreas and peritoneum (Mülle et al., 2001).
Early diagnosis of breast cancer is very important as its signs and symptoms appear in the later stages of the

Bayesian Method for Modeling Male Breast Cancer Survival Data
Hafiz Mohammad Rafiqullah Khan 1 *, Anshul Saxena 2 , Sagar Rana 3 , Nasar Uddin Ahmed 4   disease.Early diagnosis ensures early management of the breast cancer and decreases the mortality associated with the disease.Male breast cancers occur in patients experiencing gynecomastia due to hormonal problems and overabundance of estrogen precursors.These are usually due to side effects of therapeutic drugs, hormonal imbalance during puberty or side effects of androgenic and anabolic steroid abuse.According to the National Cancer Institute estimates for 2013, about 2,240 males are diagnosed annually and 410 males die of breast cancer in US (ACS, 2013).We are focusing on the survival estimations for male breast cancer patients because of paucity of statistical studies for male survival times.
In recent years, the healthcare industry collected an unprecedented amount of phenomic data (clinical, imaging, biochemical, cellular), and genomic data (DNA sequencing and microarray analysis), from thousands of subjects which are not fully utilized by the researchers and are not even available for public usage.These large databases require a new direction of statistical analysis and the implementation of newer analytical approaches to identify data based statistical probability distributions and to draw a scientific conclusion about future situations of disease patterns and death rates.
The data extracted from health related lab/experiments may follow several statistical probability distributions including exponentiated exponential and exponentiated Weibull.Statistical and computational techniques are immensely necessary to understand such informative data and to make scientific conclusions that are reliable.
The Exponentiated Exponential Model (EEM) is frequently used in modeling the data for engineering and biomedical sciences.The EEM is a generalization of the exponential distribution, which was introduced by Gupta and Kundu (1999) and received tremendous and widespread attention.The EEM has two parameters (scale and shape).A random variable x is said to have the exponentiated exponential distribution if its probability the density function (pdf) is given by where x>0, λ>0 and α>0.We can write x ~ EE (α, λ).The distribution of Eqn.(1) was introduced by Gupta and Kundu (1999).In the particular, if α= 1, is the exponential distribution.
In 1939, Waladdi Weibull, a Swedish physicist, introduced Weibull distribution.In 1993, Mudholkar and Srivastava presented the first exponentiated Weibull model (EWM) that contained distributions with bathtub shaped and unimodal failure rates.Since then, it has been extensively used for analyzing the lifetime data by Nassar andEissa (2003), andChoudhury (2005).
The probability density function (pdf) for the exponentiated Weibull model is given by where α and β are the shape parameters, and λ is the scale parameter respectively.
Statistical inference to predict future outcome using past observations is known as predictive inference.Most of the studies on EEM and EWM have addressed the estimation of parameters and reliability, and hazard functions but, there are not enough studies which are related to the posterior distribution of the parameters and predictive inference for future outcome.The derivation of predictive inferences is necessary for future advances in biomedical science.
The purpose of Bayesian inference is to develop the posterior distribution of the parameters given a set of observed data.According to Gelman (2004), the Bayesian method can be divided into three components.
The main goals of this paper are to: i) study some demographic and socio-economic variables; ii) review right skewed models EEM and EWM; iii) display different scenarios of the EEM and EWM by changing the parameter; iv) a justification to prove that the given sample data follows the EEM and EWM by using model selection criteria for goodness of fit tests; v) draw a Bayesian analysis of the posterior distribution of the parameters; vi) derive Bayesian predictive model for future survival time from the EEM and EWM; vii) utilize the predictive models in the breast cancer survival data sets to obtain the predictive inference for future response and the likelihood of males getting breast cancer and; viii) to obtain the predictive intervals for the future survival times.
This paper is organized as follows: A real breast cancer survival data example is discussed in details in section 2. Measures of goodness of fit tests and loglikelihood functions for both models are presented in Section 3. Section 4 addresses the Bayesian predictive model including the likelihood function, posterior density function, predictive density for response given a sample of observations from the EWM.In section 5 the results and discussion are included.The last section, which is 6, gives the concluding remarks.

A real data example
We formally requested and acquired the breast cancer patients' data (N=657,712) from the Surveillance, Epidemiology and End Results (SEER;1973-2009) website.SEER data contain breast cancer patients' information mostly on 12 states in the USA.A representative probability random sampling scheme was employed to draw a sample from the randomly selected nine states to represent race categories.This data was split according to gender (where males=4,269) and then a simple random sampling (SRS) method was used to select a sample of size 500 male breast cancer patients.

Measures of goodness of fit
The main methods currently used by several researchers to compare various models goodness of fit are: Akaike Information Criterion (AIC), Deviance Information Criterion (DIC), and Bayesian Information Criterion (BIC).The widely used method, DIC, is a Bayesian measure of fit which is used for overall comparison of different models, for example, public data by Congdon (2005;2007).It shows how good the model predictions fit the given data, while it represents the complexity fitness given each model of the data.Akaike (1973) generalized his work over factor analysis by introducing information criterion which later became popular as Akaike's information criterion or AIC.Bayesian information criterion (BIC) is an asymptotic result assumed that the data distribution is an exponential family and can only be used to compare estimated models when numerical values of the dependent variable are identical for all estimates being compared.The model with lower value of BIC is preferred over others.AIC, BIC, and DIC values are reported in Table 4 for the EEM and EWM based on the male survival times.
Assuming the data x=(x 1 , x 2 , . . ., x n ) represents n male breast cancer patients survival times, then one may obtain the log-likelihood function from the EWM specified by (2) which is given by  2013) considered a reparameterization some skewed models.One may utilize a reparameterization method by considering the log-likelihood functions from the EWM that is described as follows: Assume r 1 =log(a); r 2 =log(β); and r 3 =log(λ).Furthermore, we assume that r 1 , r 2 and r 3 are independently distributed.To obtain noninformative prior for r 1 , r 2 and r 3 , let a uniform prior distribution for r i be U(-a i, a j ), " i=1,2,3 .Then the joint posterior density is given by

Log L ((a, β, λ)|x)=n log(a)+n log(β)+n log(λ) +(a-
+(e r 1 -1) Similarly, one may obtain the log-likelihood function from the EEM (1) or from the log-likelihood function of the EWM (when β=1) and may use the reparameterization method to derive the joint posterior density for r 1 and r 2 .
By using the reparameterization method, one would obtain better performance of the posterior distributions for the parameters and their results are reported in Table 5.The posterior kernel densities for the parameters are given in Figure 5.   DIC infer better model fit of the data.Here, goodness of fit of survival times of male patients is being tested.In case of males, the data fits the EW distribution better than the EE distribution.The estimated value of BIC is the highest (12235.73)while the DIC value is the lowest (12225.30)for males in the case of EEM.estimated value of BIC is the largest (5540.67)while the DIC value is the least (5530.30)for males in the case of EWM.Comparing the estimated values of all AIC, BIC, and DIC based on the EEM and EWM in the case of males, the EWM fits better for the male survival times because it produces smaller values of AIC, BIC, and DIC.
Figure 4A portrays the different scenarios of the EE model where each graph is composed of three superimposed models with certain parameters.Figure 4A shows scenarios of the EEM with some values of the parameters.The left graph in Figure 4A contains three inverted J-shaped graphs that have a fixed α and increasing λ parameters.The red line graph shows more variability compared to the green and purple line graphs.The right graph in Figure 4A, contains three right skewed graphs when α is increasing and λ is fixed.In the right graph, λ is kept constant while values of α is increasing.It is observed that the green line is showing more variability as compared to the purple and red lines.In Bayesian approach, the knowledge of the distribution of the parameters is updated through the use of the observed data, resulting in what is known as the posterior distribution of the parameters.In the case of breast cancer data, we are interested in estimating the posterior distribution of the parameters assuming that observed random variables form an appropriate theoretical probability distribution.
Table 4 indicates the summary results of the posterior distribution of the parameters from the exponentiated Weibull by using the breast cancer males' patient data.By setting the values of the r 1 , r 2 and r 3 the results of the posterior distribution parameters α, β, and λ are estimated using the MCMC method.Markov chain Monte Carlo is a class of algorithms used in statistics for generating samples from a probability distribution (Gilks et al., 1996).The log-likelihood function is derived from the EWM and then by its parameter values which are assigned to appropriate probability distribution.The WinBugs software is used to obtain the summary results (Mean, SD, MC Error, Median, and Confidence Intervals) of the parameters.The early iterations are ignored in order to remove any biases of estimated values of the parameters resulting from the survival values of x utilized to initialize the chain, a process that is called burn-in.After removing the burnin samples, the remaining samples are treated as if the samples are from the original distribution.The procedure was conducted by 80,000 Monte Carlo repetitions to produce the inference for the posterior parameters in Table 5.
Figure 5 displays the graphical representation of the parameters' behavior in the case of exponentiated Weibull given the male survival data.It is noted that the shape parameter β plays approximately symmetric distribution and other model parameters are following skewed distributions.

The Bayesian predictive survival model
The Bayesian predictive method is growing more popular, finding new practical applications in the fields of health sciences, environmental sciences, and social sciences, among others.The Bayesian predictive approach which is used for the design and analysis of survival research studies in the health sciences is now widely used to reduce healthcare cost and to successfully allocate health care resources.
In this section, a predictive survival model for the breast cancer patients is developed by using a novel Bayesian method.It is found that the male cancer patients' data follow the EWM by using AIC, BIC, and DIC.Assuming the data x=(x 1 , . . ., x n ) represents n male breast cancer patients survival times that follow the EWM, let z be a future response (or future survival days), then the predictive density of z given the observed data x is p(z|x)=c c c p(z| a, β, λ) p(a, β, λ |x) dλ, dβ, da, where p(a, β, λ |x) is the posterior density function, and p(z| a, β, λ) represents the probability density function of a future response (z) that may be defined from model (2).The posterior density is given by where L(a, β, λ |x) is the likelihood function, p(a, β, λ) is the prior density for the parameters.
To derive the likelihood function, let x 1 , . . ., x n be a random sample of size n from model (2).Thus, x=(x 1 , . . ., x n )' forms an observed sample.Then given a set of data x=(x 1 , . . . ,x n ) from (2), the likelihood function is given by Ahmed (1992) discussed an estimation theory under uncertain prior information.Ahsanullah and Ahmed (2001) discussed in details on Bayes and empirical Bayes estimates of survival and hazard functions of a class of distribution.Ahmed and Tomkins (1995) estimated lognormal mean by making use of uncertain prior information.Khan et al. (2004) derived the Bayesian predictive model from the Weibull life model by means of a conjugate prior for the scale parameter and a uniform prior for the shape parameter.It is assumed that the prior density for the scale parameter (λ) is given by p(λ) a λ exp {-λ}, λ>0, (3 Considering Khan et al. (2004), the shape parameters, α and β, have a uniform prior over the interval (0, α) and (0, β), respectively, which is given below: p(a, β) a 1/aβ, a, β>0, ( 4Thus, the joint prior density is p(a, β, λ) a λexp{-λ}/aβ, a, β, λ>0, ( 5Considering the prior density in (5), the posterior density of α, β, and λ is given by A numerical integration command 'NIntegrate' in conjunction with the symbolic computational software Mathematica version 8.0, Wolfram Research (2012), is applied to plot the predictive density graph.The predictive means, standard deviations, predictive intervals, and the measures of skewness and kurtosis are obtained.The Mathematica package is also utilized to carry out all related calculations.
Figure 6 shows the graphical representation of the predictive density based on the male breast cancer patients' survival days.It is noted that the predictive density formed right skewed model.The summary results of male predictive means, standard errors, and predictive intervals for a future survival day are given in Table 6.The predictive shape characteristics, raw moments, corrected moments, and measures of skewness and kurtosis are also presented in the same table.
These findings are very important towards health care researchers and providers to characterize future disease patterns and to make an effective future plan in our health industry.

Results
We used a random sample consisting of 500 breast cancer male patients diagnosed during 1973-2009 in the USA.The mean±SD, of age at diagnosis for breast cancer patients was 66.19±12.45for males.The minimum age at diagnosis for males 30 years.In the sample, oldest males at the time of the diagnosis was 99 years.The mean±SD, of survival time for male was 84.14±75.25 months.For males, the survival time ranged from 1 to 407 months.In this sample, among 500 males, race was distributed as 87% White, 9% Black, and 4% other among male cases.There were nearly 3% people of Hispanic origin.The majority of these patients were married.The basic statistical calculations helped us to conclude that the breast cancer in male is diagnosed at a higher age.The late diagnosis makes survival time look shorter after diagnosis.
In the case of the goodness of fit analysis, the breast cancer data from male sample followed exponentiated Weibull distribution with lower DIC value as 5530.30.For EW distribution, mean±SD values for a, β and λ are 2.73±0.01,0.81±0.01 and 0.05±6.87×10 - , respectively.Rho values for males are r 1 =(1, 2), r 2 =(-1, 1), r=(-4, -3).The Markov Chain errors and confidence intervals for posterior parameters are given in Table 5.The dynamic kernel densities for each of the parameters are reported in Figure 5 so that one can observe the shape of the kernel density.
Figure 6 shows the graphical representation of the single future survival time.The shape of the future distribution of survival times is positively skewed.The raw and corrected moments are described in Table 6 based on males' survival times.

Discussion
A Statistical probability modeling has become essential for the modern practice of medicine, pharmaceutical, and clinical studies, the effective future direction of health care, and the health profession of education.Statistical probability models can describe the nature of data from any scientific field.From such statistical models, one can make statistical inferences about the parameters as well as the future observations because they play a very important role in decision making for future patterns of diseases in human health.
In this study, we selected a random sample of male breast cancer patients from the SEER  database.The methods for measuring the goodness of fit tests are used to select the best statistical probability models for the male based on breast cancer survival data.The sample is composed of male data from nine randomly selected states out of 12 states in the USA.To develop statistical probability models for survival days of males, we used model selection criterions, AIC, BIC, and DIC to measure the best fit to the breast cancer survival data.We found that the EWM best fits the male survival data.A detail analysis of the posterior models for the parameters is described with their summary results.A novel reparameterization method is performed in terms of the log-likelihood functions for the EW models to accelerate better performance of the Bayesian posterior parameters and to draw its corresponding dynamic kernel densities.The summary results of the posterior parameters are reported with very less MC errors which are negligible by using the MCMC method.The results are obtained after running 80,000 Monte Carlo repetitions.The results of the posterior distribution of parameters using the breast cancer patients' data will contribute a new addition to the knowledge of the model parameters.
A summary table for the predictive means, standard deviations, and the predictive intervals are obtained for the predictive density of a single future survival time.The measures of skewness and kurtosis of a future response from the predictive model are also given.Based on the results of skewness and kurtosis one would comment that the shape of the future survival model for the males is positively skewed.These findings will be extremely helpful for the healthcare researchers and providers to predict a male patient's possible future medical outcome given the patient's current and past history of reported conditions.
Descriptive statistics are obtained by using the SPSS software.An advanced computational software package, 'Mathematica version 8.0', is used to show the graphical representations of the EE and the EW models, to display the predictive density and also to achieve additional predictive inferences for male survival times.WinBugs software is used to check the goodness of fit tests, to obtain the summary results of the posterior parameters, to determine the kernel densities of the parameters, and also to carry out all related calculations.

Figure 3 .
Figure 3. Number of Male (n 1 =500) Patients Randomly Selected from Nine States (darker color represents a high density and lighter color is the low density) of Breast Cancer Patients Figure 4B portrays the different scenarios of the EW model where each graph is composed of four superimposed models with certain parameters.The graphical scenario of EW models where in the left graph, α is kept fixed with increased values of β and λ.The left graph displays three approximately normally distributed graphs, and one positively skewed graph.In the right graph, λ is kept fixed with decreased values of α and β.The right graph displays three approximately normally distributed graphs where one is very wide, and one positively skewed graph.In Figure 4B left graph, it is noted that the red line is showing more variability compared to the other three normal graphs and in the right graph, the orange line is more variable as

Figure
Figure 4. Scenarios of the A) EEM and B) EWM with Some Values of the Parameters

Table 1 . Frequency Distribution of Selected Male Breast Cancer Patients
Table 4 consists of AIC, BIC, and DIC values for the EE and EW models.This is a common way to test the goodness of fit models.Lower values of AIC, BIC, and