Modeling Age-specific Cancer Incidences Using Logistic Growth Equations: Implications for Data Collection

Introduction
Cancers have long been a major cause of human death and disability-adjusted life year loss (Ma et al., 2006). In 2002, new cancer cases and deaths numbered 10.86 and 6.73 million respectively worldwide (Parkin et al., 2000). In 2005, over 7.6 million people died of cancer, accounting for 13% of all deaths (Parkin et al., 2005). Cases and deaths are predicted to rise to 15 and 10 million by 2020 (Parkin et al., 2001). WHO data showed that malignant tumors accounted for 5% of the total burden caused by all diseases worldwide in 2005 (World Health Organization, 2006). More recent investigations revealed that cancer was the leading cause of death in cities, ahead of cerebrovascular diseases and cardiopathy (China Ministry of Health, 2008). Estimated direct and indirect economic loss due to the …
Escalating cancer threats and harms have attracted tremendous efforts to explore the epidemiology of the diseases via large-scale secular registry or surveillance systems (Goss et al., 2014; Ullrich et al., 2014). These efforts have accumulated vast data that allow for the establishment of mathematical models simulating cancer incidence and mortality rates for different age groups, time periods or cohorts. Leung and colleagues proposed a model for analyzing cervical cancer incidence using maximum likelihood and Bayesian methods and data from the Hong Kong Cancer Registry (Leung et al., 2006). Tyson and coworkers established a model incorporating the effects of age, year of diagnosis, and year of birth on incidence trends of renal cell carcinoma using data from the United States National Cancer Institute's Surveillance, Epidemiology, and End Results public-use registry (Tyson et al., 2013). The most commonly adopted approaches for modeling cancer rates are time series and age-period-cohort (APC) models (Meira et al., 2013; Ocana-Riola et al., 2013). Usually, time series models assume a Poisson distribution of cancer counts and include autoregressive error terms and/or time trends (Wingo et al., 1998; Knorr-Held et al., 2001), while APC models generally consist of three components, i.e., age (A), period (P), and cohort (C) (Jurgens et al., 2014). Of all the variables studied so far, age seems to have the greatest effect on cancer mortality and incidence rates (Dyzmann-Sroka et al., 2014). The performance of models therefore depends heavily on how the influence of age on cancers is incorporated. In contemporary models, the methods used to simulate the relationship between age and cancer rates are mainly linear (Lee et al., 2011), polynomial (Wingo et al., 1998), piecewise linear (Kim et al., 2000), spline, log linear (Du et al., 2014) or power curves (Moller, 2004).
Typical S-shaped line graphs of age-specific cancer incidence and mortality rates are clearly observable for almost all cancers worldwide (Bouchbika et al., 2013; Al-Hashimi et al., 2014; Wei et al., 2014). If the S-shaped line represents the true general pattern, most methods used in previous models (including linear, log linear and power curves) may not fit well, at least for some age ranges (e.g., under 35 or over 75 years). Some curves (e.g., piecewise linear and spline curves) can adequately approximate any S-shaped line, yet this requires much more detailed data on observed age-specific cancer counts. Besides, most previous work in this regard focuses mainly on predicting or analyzing cancer epidemics, with little attention paid to informing relevant data collection.
This paper models age-specific cancer incidence rates using logistic growth equations. Although there is evidence that such equations describe cancer cell proliferation well under various conditions (Foryś et al., 2003), publications linking them with age-specific cancer rates are limited. In particular, the paper explores the performance of logistic growth models under different scenarios of data completeness. This may reveal clues for reshaping contemporary data collection, e.g., cancer registry, surveillance or evaluation initiatives. Given the huge amount of scarce resources invested annually in these initiatives (Hutchison et al., 1997), they merit continuous scrutiny and refinement.

Data source
All source data used in this study came from the China Cancer Registry Report 2012, the latest available annual report of its kind so far (He et al., 2012). It draws on data collected in 2009 by 72 sites throughout China covering 85.47 million urban and rural Chinese residents, and provides incidence and mortality rates of all and over 20 specific cancers by age, gender and registry site. A sample dataset was given in our previous paper (Chen et al., 2014), and detailed characteristics of the data will be described separately.

Formulae used
Based on the empirical patterns of reported cancer incidence rates across ages for all and specific cancers, the study adopted a three-parameter logistic growth equation (Formula 1). In this formula, $t$ stands for age; $p_t$, the cancer incidence rate at a given age $t$; $p_{max}$, the highest cancer incidence rate for all ages; $k$, the growth rate; while $b$ serves as a baseline growth rate that determines the location of "the rapidly growing phase" of an S-curve along the age spectrum. In addition, the study used Formula 2 to identify the optimal model from a set of potential models for a given type of cancer and to evaluate the performance of the models selected. In Formula 2, $R$ represents the goodness of fit of the model concerned, while $p_{ot}$ and $p_{st}$ stand for the observed and simulated (or predicted) cancer incidence rates for a given age $t$ respectively.

$$p_t = \frac{p_{max}}{1 + e^{b - kt}} \qquad (1)$$

(2)
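As a minimal illustration of Formula 1, the sketch below evaluates the logistic curve and the age at which it reaches half of its ceiling, which follows from setting b - kt = 0. The function names and the value of p_max are hypothetical choices made for this example; b and k reuse the illustrative values quoted later in the text and are not fitted estimates.

```python
import math


def logistic_rate(t, p_max, b, k):
    """Formula 1: incidence rate at age t for parameters (p_max, b, k)."""
    return p_max / (1.0 + math.exp(b - k * t))


def midpoint_age(b, k):
    """Age at which the curve reaches p_max / 2, i.e. where b - k*t = 0."""
    return b / k


# Hypothetical parameter values, chosen only to illustrate the curve's shape.
p_max, b, k = 400.0, 10.5, 0.17

t_half = midpoint_age(b, k)                        # about 61.8 years
rate_at_half = logistic_rate(t_half, p_max, b, k)  # about p_max / 2
rate_at_birth = logistic_rate(0, p_max, b, k)      # close to zero
rate_at_85 = logistic_rate(85, p_max, b, k)        # approaching p_max
```

Under this reading, b shifts the rapidly growing phase along the age axis (through b/k), while k controls how steep that phase is.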

Selection of input data
In terms of how input data were selected, the study performed 3 types of modeling, namely full age-span fitting, multiple 5-year-segment fitting and single-segment fitting. Full age-span fitting used the registered cancer incidence data covering all ages (i.e., from age 0 through to age 85), while the other two types of fitting used only part of the data. The multiple 5-year-segment fitting first divided the whole age span into segments of 5 consecutive ages (e.g., ages 0-4, ages 5-9 etc.) and then entered into modeling the observed age-specific cancer incidence rates for every other segment (e.g., ages 0-4, ages 10-14, …, ages 70-74, ages 80-84), every other 2, every other 3, every other 4 and every other 5 segments respectively. Single-segment fitting selected only one segment of ages and entered the corresponding incidence rates into modeling. These segments covered 15, 30, 45, 60 and 75 consecutive ages respectively. Considering that the location along the age span covered by a data segment of the same length may result in different model parameters, the study set beginning, middle and end as 3 criteria for selecting single-segment data. For example, for the segment of 15 ages, the study established 3 different logistic growth equations based on registered cancer incidence rates for ages 0-14 (beginning segment), ages 38-52 (middle segment) and ages 70-84 (end segment) respectively.
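The multiple 5-year-segment selection can be sketched as follows. This is an interpretation, not the authors' program: it assumes that "every other n segments" means keeping one 5-year segment and then skipping the next n, which reproduces the "every other" example given above (ages 0-4, 10-14, …, 80-84). The function name and the age span 0-84 are assumptions made for the example.

```python
def select_segments(ages, segment_len=5, skip=1):
    """Keep one segment of `segment_len` consecutive ages, then skip `skip` segments.

    Assumed reading of "every other n segments": skip = n.
    """
    ages = list(ages)
    # Cut the age span into consecutive segments of equal length.
    segments = [ages[i:i + segment_len] for i in range(0, len(ages), segment_len)]
    # Keep every (skip + 1)-th segment, starting with the first one.
    kept = segments[::skip + 1]
    return [age for segment in kept for age in segment]


all_ages = range(0, 85)  # ages 0-84, i.e. 17 five-year segments (assumed span)

every_other = select_segments(all_ages, skip=1)    # 0-4, 10-14, ..., 80-84
every_other_3 = select_segments(all_ages, skip=3)  # 0-4, 20-24, 40-44, 60-64, 80-84
```

The selected ages, together with their observed incidence rates, would then form the input to the fitting procedure.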

Algorithms for model building
Asian Pacific Journal of Cancer Prevention, Vol 15, 2014. DOI: http://dx.doi.org/10.7314/APJCP.2014.15.22.9731
Because available statistical software does not allow for logistic growth modeling using segmental input data, the study employed a self-developed mini-program to perform all the computation. Written in C#, the program runs on a webpage built with Microsoft Visual Studio 2008. For each model building, the webpage accepts an intended age set (e.g., {61; 62; 63; 64; 65}) and a corresponding set of observed cancer incidence rates (e.g., {351; 362; 373; 384; 395}, in 1/100000) as input and then produces a best-fit parameter set (e.g., {p_max=200, b=10.5, k=0.17}) and a goodness of fit (e.g., R=0.98). This computation proceeds in 5 steps.
Step 3 selects one element from each of the 3 serial parameter sets and generates a complete set (5000×200×80 elements in total) of potential parameter combinations.
Step 4 uses Formula 2 to compare the goodness of fit between the registered cancer incidence set entered via the webpage and that predicted by Formula 1 using each of the potential parameter combinations.
Step 5 outputs the parameter combination that has the largest R.
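The exhaustive search in Steps 3 to 5 can be sketched in Python as below. This is not the authors' C# program. Because Formula 2 is not reproduced in this text, the sketch uses the coefficient of determination as an assumed stand-in for R. The parameter grids are small hypothetical ones rather than the 5000×200×80 combinations mentioned above, and the "observed" rates are synthetic values generated from known parameters so that the search has a known answer.

```python
import itertools
import math


def logistic_rate(t, p_max, b, k):
    """Formula 1: incidence rate at age t."""
    return p_max / (1.0 + math.exp(b - k * t))


def goodness_of_fit(observed, predicted):
    """Assumed stand-in for Formula 2: coefficient of determination.

    Requires the observed rates to vary (otherwise ss_tot is zero).
    """
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


def grid_search(ages, observed, p_max_grid, b_grid, k_grid):
    """Steps 3-5: try every parameter combination and keep the largest R."""
    best_params, best_r = None, -math.inf
    for p_max, b, k in itertools.product(p_max_grid, b_grid, k_grid):
        predicted = [logistic_rate(t, p_max, b, k) for t in ages]
        r = goodness_of_fit(observed, predicted)
        if r > best_r:
            best_params, best_r = (p_max, b, k), r
    return best_params, best_r


# Synthetic input: rates generated from known (hypothetical) parameters.
ages = list(range(0, 85, 5))
true_params = (400.0, 10.5, 0.17)
observed = [logistic_rate(t, *true_params) for t in ages]

# Small hypothetical grids standing in for the three serial parameter sets.
p_max_grid = [100.0 * i for i in range(1, 11)]  # 100, 200, ..., 1000
b_grid = [i / 2 for i in range(10, 31)]         # 5.0, 5.5, ..., 15.0
k_grid = [i / 100 for i in range(5, 31)]        # 0.05, 0.06, ..., 0.30

best_params, best_r = grid_search(ages, observed, p_max_grid, b_grid, k_grid)
```

A brute-force grid makes no assumption about starting values and accepts any subset of ages as input, at the cost of evaluating every combination. Its resolution is limited by the grid spacing.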

Full age-span fitting models
As shown in Table 1 and Figure 1, the majority of curves representing observed age-specific cancer incidence rates fit very well with predictions by logistic growth models, with estimated goodness of fit (R) over 0.96. Yet the R values for some types of cancer, e.g., cervical cancer and breast cancer, were quite low, ranging from 0.25 to 0.92. The 3 parameters defining the logistic growth models showed substantial variations: $P_{max}$ ranged from 8 to 2248; $b$, from 5.50 to 14.80; and $k$, from 0.08 to 0.44.
Figure 1. Predicted vs Registered Age-specific Cancer Incidence Rates. Red lines represent predicted incidence rates and blue lines, actual incidence rates; the Y-axis represents cancer incidence rate in 1/100000 and the X-axis, age. Data source: China cancer registry report 2012.

Multiple-5-year-segment fitting models
Table 2 and Figure 2 display findings from multiple-5-year-segment modeling. Goodness of fit (R) decreased as the number of data segments left out increased. Yet it remained fairly high even when only one fourth of the segments of observed data were entered into modeling, and this applied to all types of cancer. The differences between the R values of models for different cancer types (excluding cervical and breast cancer) built upon "every other segment" of observed data ( […] observed data (Table 2, column 17) ranged from only 0 to 0.05. However, starting from the column of "every 5 other segments", the R values dropped dramatically. Similarly, although all 3 parameters of the simulated logistic growth equations varied as the number of segments of data entered for modeling changed, most of these variations remained minimal (less than 10%) until the column of "every other 4 segments" and did not show a clear decreasing or increasing trend.
Figure 2. Predicted vs Registered Age-specific Cancer Incidence Rates Using Different Segments of Input Data. Blue lines represent actual incidence rates; red, green, purple, light blue and brown lines represent predicted incidence rates using every other 1, 2, 3, 4 and 5 segments of input data respectively; the Y-axis represents cancer incidence rate in 1/100000 and the X-axis, age. Data source: China cancer registry report 2012.
*Note: Source data came from age-specific incidence rates of top ten and all cancers from China cancer registry report 2012; $P_{max}$, $b$ and $k$ represent the parameters in the logistic equation, $y_t = P_{max}/(1 + e^{b-kt})$, where $t$ stands for age and $y_t$, incidence rate for age $t$; R stands for goodness of fit between predicted and observed age-specific cancer incidence rates; NA stands for not applicable.

Single-segment fitting models
Table 3 and Figure 3 resulted from single-segment fittings. Goodness of fit (R) increased as the length of the data segment increased, and this increase depended on the location of the segment of data entered for fitting. For a segment of the same length (e.g., 15 ages), the older the ages covered by the data segment, the higher the resulting R. For the segment covering the oldest part of the age span, all the R values turned out to be very high. The modeled parameters were also linked to the length and age range covered by the data segment. For data segments covering the beginning ages, all 3 parameters increased as the length changed from 15 ages to 75 ages; while for data segments covering the end ages, $P_{max}$ increased yet $b$ and $k$ decreased as the length increased.

Discussion
Although typical S-shaped line graphs of age-specific cancer incidence rates are clearly observable in almost all cancer registry and other relevant epidemiological reports worldwide, their relation to logistic growth equations has not been fully addressed. The current study demonstrated that logistic growth models closely describe the incidence rates across age groups for most types of cancer. This may be explained as follows: a) onset of clinically detectable cancers results from the counteraction between cancer cell occurrence and removal (Baker et al., 2013); b) a cancer cell occurs after a normal somatic cell has been damaged multiple times (say n times) through exposure to the same or different risk factors (Shaukat et al., 2013); c) a certain level of risk exposure defines a corresponding chance ($q$) for a normal somatic cell to be damaged once and hence the chance ($q^n$) for an innate cell to mutate into a cancer cell in a unit time period; d) given c, as time ($t$) passes and the somatic cell is damaged more and more times, its chance ($p$) of becoming malignant increases exponentially ($p \cong q^{n-qt}$); e) the level of lifetime exposure to cancer risk factors starts relatively low at birth, increases during childhood and adolescence (due to initiation of unhealthy or unprotected behaviors), remains highest in adulthood and begins to decrease gradually in late life (due to reduced smoking, drinking etc.) (Katulanda et al., 2014; Chockalingam et al., 2013); f) cancer cell removal or immunity shows a lifetime trend similar to that of risk exposure (Wu et al., 2012). Therefore, the early low and relatively stable phase of the S-shaped age-specific cancer rates may reflect the combined effect of low cancer cell occurrence vs. high immunity; the rapidly growing part, exponentially increasing occurrence vs. high and stable immunity; and the late high and relatively stable stage, diminishing occurrence due to reduced risk exposure vs. declining immunity.
Linking the logistic growth law with age-specific cancer rates suggests that describing cancer incidence or mortality rates along the whole age span amounts to estimating the parameters of the equations rather than uncovering counts for each age. Such a shift of focus may greatly reduce the resources required, since logistic equations generally involve only a few parameters (e.g., 3 parameters in our case) and estimating these requires much less data than has usually been collected. This is of particular significance to cancer registries. As suggested by our simulations (Table 2 and Figure 2), the work volume of the current China national cancer registry could be reduced by three fourths without severely damaging its capacity to produce age-specific cancer incidence rates. This should also apply to other registries. Given that over fifty countries have large-scale operating cancer registry systems that consume huge amounts of scarce resources year after year (Izquierdo et al., 2000; Tangka et al., 2010), a growth model-guided rethinking merits special attention. Even though a segmental cancer registry may sound unacceptable to some, the findings suggest priority age groups for monitoring and controlling the data quality of registry systems.
Logistic growth analysis may also inform data collection for intervention or hypothesis assessments. As shown in Table 3 and Figure 3, for data segments of the same length, the older the ages covered by the data, the higher the goodness of fit of the resulting model. This suggests that, for studies evaluating the effect of an intervention or an influencing factor on cancer rates using limited age groups, backward sampling (i.e., choosing from the oldest age group backward to younger ones) may work better than forward selection (from ages 0-5 to 6-10 and then to 11-15 etc.). For studies that have yielded data showing differences in cancer rates between two groups (say, intervention vs control) of middle ages (say, ages 30-59), simulated logistic growth equations may be used to estimate the extended difference (say, for ages 60-69, or even 60 and over) between the two groups. However, the goodness of fit of models based on data covering the middle segment of ages is only moderate.
In addition, logistic growth equations may help assess data collection biases and/or errors under certain circumstances. If there is sufficient evidence to believe that certain age-specific cancer rates follow the logistic growth law, then the goodness of fit estimates (Rs) can also be viewed as a quality indicator of the observed cancer counts. Of the ten cancers included in Table 1, cervical and breast cancers showed clear deviations from logistic equations. Excluding these two cancers, all the cancer-specific pairs of Rs (Table 1, column 17 vs 21) showed a consistent trend, i.e., for any given cancer, the R of the model built upon observed data from urban residents was higher than or at least equal to that from rural residents. This may indicate better cancer registration in urban than in rural China. The Rs for models of different cancers varied much more widely, ranging from 0.92 for nasopharyngeal cancer to 0.98 for esophageal and pancreatic cancers (Table 1, column 5). This suggests a need for tailored data quality control or improvement, with special attention paid to cancers with the lowest Rs. The varied biases and errors in the rates for different cancers in our case may be attributed to a whole range of reasons, including the number of cases registered (e.g., too few for nasopharyngeal cancer), physical symptoms and signs, ease of obtaining cancer tissues for pathologic diagnosis, availability of auxiliary examination techniques etc.
Finally, readers are cautioned about a number of issues. First, this study used only the simplest logistic growth equations, and they do not fit the observed data very well for some cancers, e.g., cervical and breast cancers. Such problems may be addressed by adding more parameters and introducing more sophisticated growth equations. Second, the parameters presented in this paper were all average estimates derived from pooled cancer counts reported by 72 CNCR sites in 2009. Age-specific incidence and mortality bands with means and 95% confidence intervals, rather than single mean estimates, may be produced by building a similar set of logistic growth equations using the data from each CNCR site (72 sets in total) and then performing bootstrap re-sampling and jackknife correction (Dexter et al., 2013; Yu et al., 2013). Third, this paper focuses primarily on implications for data collection and pays no attention to identifying trends and components in the cancer rates. Fourth, apart from goodness of fit, this paper did not provide other performance indicators (sensitivity, specificity etc.) of the models used, owing to space limits. Most of these will be addressed separately in a forthcoming paper titled "modeling age-specific cancer incidence using logistic growth equations: jackknife-corrected bootstrap estimates".