Application of Crossover Analysis-logistic Regression in the Assessment of Gene-environmental Interactions for Colorectal Cancer

Introduction
In the view of modern genetics, the genesis and development of complex multifactorial human diseases result from specific environmental factors, genetic factors (mainly genetic susceptibilities), and the interactions between these two types of factors, and usually proceed through multiple stages. Complex diseases, including colorectal cancer, are affected by multiple gene loci and environmental factors (Arafa et al., 2011; Zhao et al., 2012). An important topic in current genetic epidemiology and bioinformatics is the effective processing and analysis of the interactions between critical SNP (single-nucleotide polymorphism) sites involved in common complex multifactorial human diseases (Tomlinson et al., 2007; Reeves et al., 2008; Darbary et al., 2009; Xiong et al., 2009; Gao et al., 2010). SNPs are DNA sequence polymorphisms resulting from single-nucleotide mutations. They are the third-generation genetic markers in humans and play an important role in identifying disease-related genes, elucidating phenotypic differences among individuals, and interpreting disease susceptibilities in different populations and individuals. Previous studies have shown that the genesis and development of complex diseases are not caused entirely by genetic factors; rather, they result from the interactions between genetic variations and environmental factors (Chatterjee et al., 2006; Wong et al., 2010). Each individual gene is likely to have only a weak association with disease rather than a major genetic effect, and such weak effects are more easily modified by the environment. If the interactions between genes and the environment (including gene-gene and gene-environment interactions) are neglected, it may not be possible to describe the effect of genetic mutations truthfully and precisely. Therefore, to prevent disease and establish public health policies, it is important to properly analyze and assess the interactions between genes and environment.
One of the greatest challenges facing human geneticists is the identification and characterization of susceptibility genes for common complex multifactorial human diseases. This challenge is partly due to the limitations of parametric statistical methods for detecting gene effects that depend solely or partially on interactions with other genes and with environmental exposures. How to analyze gene-gene and gene-environment interactions for a complex multifactorial human disease is therefore increasingly important. In biology there are two main models of interaction between gene loci and environmental factors: the additive interaction model and the multiplicative interaction model (Ruth, 1996). In considering the joint effects of risk factors in disease causation, however, epidemiologists have debated intensely about what interaction is, where it comes from, and how to detect it (Rothman, 1986). Selecting an appropriate analytical method for each interaction model is therefore very important. In recent years, statistical methods for studying gene-gene and gene-environment interactions have advanced rapidly. These methods mainly include logistic regression, stratified analysis, the generalized relative risk model (Moolgavkar et al., 1987), multifactor dimensionality reduction (Hahn et al., 2003), and methods based on composite linkage disequilibrium (Wu et al., 2008). Each method has its own advantages and disadvantages. However, the traditional regression model may introduce larger errors and increase the type I or type II error rate when interactions are analyzed, thereby reducing test power. In our previous work (Yang et al., 2009), the risk factors for colorectal cancer were analyzed using the chi-square test, and the gene loci and environmental factors associated with the development of colorectal cancer were identified. The present study further analyzed and explored the interactions between genes and environment using crossover analysis combined with logistic regression. Using a case-control design, colorectal cancer patients in Chongqing, China, were sampled to explore the risk factors related to the genesis of colorectal cancer and the effect of gene-environment interactions on this disease.

Data Source
The data used in this study are from the case-control study of colorectal cancer in Chongqing, China, conducted by the Department of Health Toxicology at the Third Military Medical University (Yang et al., 2009). Among the 432 pathologically diagnosed colorectal cancer patients, 237 were males and 195 were females, with an average age of 52 years (44, 60). Using the hospital control method, patients matched on age, gender, and birthplace were selected from the orthopedics department of the same hospitals and screened to exclude colorectal cancer and colorectal cancer-related diseases. A total of 788 such individuals were selected as the healthy control group. Among them, there were 438 males and 350 females, with an average age of 55 years (46, 65). All controls provided written informed consent, completed a semiquantitative food frequency questionnaire, and provided blood samples, as did the CRC patients. The study protocol was approved by the Third Military Medical University Ethics Committee, and informed consent was obtained from all participants. The study complied with the Helsinki Declaration.
The survey contents included general information (gender and age), the polymorphism distribution of genes related to ethanol metabolism (the distribution of homozygotes and heterozygotes at the loci rs2075633, rs17033, rs1229984, rs4767939, rs4767944, rs671, rs16941669, rs886205, rs7296651, rs1329149, rs2249695, rs8192772, rs8192775, and rs915908), and lifestyle habits (smoking and alcohol consumption). To avoid bias, a standard questionnaire was used in which each survey item had a specific definition. The survey was carried out as a face-to-face interview, and some items, such as the amount of alcohol and cigarettes consumed, were estimated quantitatively. Using age 60 as the demarcation point, the subjects were divided into two groups: the elderly group and the young and middle-aged group. Alcohol consumption was divided into two categories: healthy drinking (people who did not drink and people who drank no more than 15 g per day) and non-healthy drinking (people who drank more than 15 g per day). Based on smoking habits, the subjects were divided into nonsmokers and smokers (including those who had quit smoking).
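The dichotomization described above can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, and the choice to place a subject aged exactly 60 in the elderly group is an assumption, since the text names 60 as the cut point without saying which side it falls on.

```python
from dataclasses import dataclass


@dataclass
class Exposure:
    """Binary coding of the three lifestyle/demographic factors (hypothetical container)."""
    elderly: int             # 1 = elderly group, 0 = young and middle-aged group
    unhealthy_drinking: int  # 1 = more than 15 g of alcohol per day
    smoker: int              # 1 = current or former smoker, 0 = nonsmoker


def code_exposures(age_years: float, alcohol_g_per_day: float, ever_smoked: bool) -> Exposure:
    # Assumption: age 60 itself counts as elderly (the source only gives 60 as the cut point).
    elderly = 1 if age_years >= 60 else 0
    # Nondrinkers and people drinking no more than 15 g per day are "healthy drinking".
    unhealthy_drinking = 1 if alcohol_g_per_day > 15 else 0
    # People who have quit smoking are still coded as smokers.
    smoker = 1 if ever_smoked else 0
    return Exposure(elderly, unhealthy_drinking, smoker)


# Hypothetical subject: 63 years old, 20 g of alcohol per day, former smoker.
example = code_exposures(63, 20.0, True)
```

Keeping the cut points in one function makes it easy to rerun the analysis with a different boundary convention.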

Statistical methods
In a previous report (Yang et al., 2009), the risk factors for colorectal cancer were analyzed using the chi-square test, and the gene loci and environmental factors associated with the development of colorectal cancer were identified. However, the interactions between these factors were not fully analyzed. Building on that work, this article further explores the interactions between colorectal cancer-related genes and environmental factors and their impact on the development of the disease, and tests for the existence of such interactions by combining several statistical methods. In biology there are two main models of interaction between gene loci and environmental factors: the additive interaction model and the multiplicative interaction model. Under a statistical model, the presence or absence of interaction depends on the scale of measurement (additive or multiplicative). Selecting an appropriate analytical method for each interaction model is therefore very important. In general, logistic regression can detect multiplicative interaction but cannot assess additive interaction, whereas crossover analysis can assess both additive and multiplicative interaction. In addition, the existence of multiplicative gene-environment interactions was assessed using the Akaike information criterion (AIC) of logistic regression models in combination with crossover analysis.
DOI: http://dx.doi.org/10.7314/APJCP.2012.13.5.2031
Logistic regression model: Logistic regression is a common method for analyzing multiplicative interaction among categorical variables (Hosmer et al., 1990). It can use either alleles as the genetic variable, under the assumption of a multiplicative genetic model, or genotypes as the genetic variable, under the assumption of a particular genetic model (such as the dominant or recessive model) (Kooperberg et al., 2001; Ruczinski et al., 2003; Kooperberg et al., 2005).
For the example of using genotypes as a genetic variable, the modeling procedure is described as follows.
Let D denote disease, E an environmental factor, and G the genotype at the disease-related locus, which has two alleles: the susceptibility allele M and the normal allele m. The exposure rate of the environmental factor in the population is P(E), and the frequency of the susceptibility allele is $P_M$. Assuming that the locus is in Hardy-Weinberg (H-W) equilibrium, the frequencies of the genotypes MM, Mm, and mm in the population are $P_M^2$, $2P_M(1-P_M)$, and $(1-P_M)^2$, respectively. Assuming that the environmental and genetic factors are distributed independently in the population, the logistic regression model can be set up as

$$\operatorname{logit} P(D=1 \mid G, E) = \beta_0 + \beta_g G + \beta_e E + \beta_{ge} G E \quad (1)$$

where G and E are indicator variables (1 = susceptible genotype or exposed, 0 = otherwise). The baseline prevalence of the disease in the population is $e^{\beta_0}/(1+e^{\beta_0})$. The odds ratios for genes, environmental factors, and gene-environment interactions are $OR_g = \exp(\beta_g)$, $OR_e = \exp(\beta_e)$, and $OR_{ge} = \exp(\beta_{ge})$, respectively. $OR_g$ is the odds ratio for the genetic factor among individuals not exposed to the environmental factor, and $OR_e$ is the odds ratio for the environmental factor among individuals who do not carry the susceptible genotype. When $OR_{ge} = 1$, there is no interaction between environmental and genetic factors; when $OR_{ge} \neq 1$, an interaction exists. When $OR_{ge} > 1$ ($OR_{ge} < 1$), the environmental factors promote (inhibit) the expression of susceptibility genes. In other words, genetic factors increase (decrease) the susceptibility of the human body to environmental factors. The partial regression coefficients in the model can be used to explain the meaning of the OR under different combinations of G and E.
In the logistic regression model for gene-gene interactions, the environment variable in equation (1) is replaced with one of the genotype variables. Otherwise, the principle is the same.
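To make the link between the regression coefficients and the odds ratios explicit, the following short derivation may help. It assumes that equation (1) has the usual form with an intercept $\beta_0$ and 0/1 indicator coding for G and E; the intercept symbol and the coding are assumptions rather than details stated in the text.

```latex
\begin{align*}
\mathrm{odds}(G, E) &= \exp(\beta_0 + \beta_g G + \beta_e E + \beta_{ge} G E) \\
\frac{\mathrm{odds}(1,0)}{\mathrm{odds}(0,0)} &= e^{\beta_g} = OR_g \\
\frac{\mathrm{odds}(0,1)}{\mathrm{odds}(0,0)} &= e^{\beta_e} = OR_e \\
\frac{\mathrm{odds}(1,1)}{\mathrm{odds}(0,0)} &= e^{\beta_g + \beta_e + \beta_{ge}}
  = OR_g \cdot OR_e \cdot e^{\beta_{ge}}
\end{align*}
```

Under these assumptions, $e^{\beta_{ge}} = 1$ exactly when the joint odds ratio equals the product of the two single-factor odds ratios, which is the sense in which logistic regression tests interaction on the multiplicative scale.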
Crossover analysis: Crossover analysis (Hosmer et al., 1992; Hallqvist et al., 1996; Garcia et al., 2008) is one of the most common methods for analyzing gene-environment interaction in genetic epidemiological research. Data from population-based case-control studies, case-control studies with the subjects' parents, case-control studies with the subjects' siblings, and cohort studies can all be analyzed by crossover analysis for gene-environment interactions.
Table 1 shows the basic research units in a 2×4 crossover analysis of the interaction between genes and environmental factors, that is, the four possible combinations formed by the two binary variables, genes (G) and environmental factors (E). The odds ratio (OR) for exposure to both factors relative to exposure to neither is denoted $OR_{ge}$ (abbreviated as A). The odds ratios for exposure to the genetic factor only and to the environmental factor only are denoted $OR_g$ and $OR_e$ (abbreviated as B and C, respectively). Cases and controls exposed to neither factor serve as the common reference group (OR = 1).
Here, the combined effect of genes and environment includes not only the individual effects of genes and environment but also the superposition of these individual effects (additive effect) and the multiplicative effect of genes and environment. By using different models, we can determine whether the two types of factors interact and how strong the interaction is.
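The 2×4 layout described above can be sketched as follows. The counts are hypothetical and are not taken from the study; the Woolf (log-based) 95% confidence interval is a common choice added here for illustration and is not stated in the text.

```python
import math


def crossover_odds_ratios(cases, controls):
    """Odds ratios for a 2x4 crossover table.

    cases / controls: dicts keyed by (g, e) with g, e in {0, 1}, holding counts.
    The (0, 0) cell (exposed to neither factor) is the common reference group.
    Returns {(g, e): (odds_ratio, ci_low, ci_high)} for the three exposed cells.
    """
    ref_cases, ref_controls = cases[(0, 0)], controls[(0, 0)]
    results = {}
    for cell in [(1, 1), (1, 0), (0, 1)]:
        a, b = cases[cell], controls[cell]
        odds_ratio = (a * ref_controls) / (b * ref_cases)
        # Woolf standard error of log(OR); requires all four counts to be non-zero.
        se_log = math.sqrt(1 / a + 1 / b + 1 / ref_cases + 1 / ref_controls)
        ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log)
        ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log)
        results[cell] = (odds_ratio, ci_low, ci_high)
    return results


# Hypothetical counts, for illustration only.
cases = {(0, 0): 100, (1, 0): 50, (0, 1): 60, (1, 1): 90}
controls = {(0, 0): 200, (1, 0): 80, (0, 1): 100, (1, 1): 60}
table = crossover_odds_ratios(cases, controls)
or_ge, or_g, or_e = table[(1, 1)][0], table[(1, 0)][0], table[(0, 1)][0]  # A, B, C
```

Cells with zero counts would need a continuity correction or exact methods, which this sketch does not attempt.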
In crossover analysis, whether an interaction exists depends on the chosen model. The major parameters for assessing interaction under the additive model proposed by Rothman (2002) are as follows.
The attributable proportion of interaction (API) is the most broadly used parameter for determining whether gene-environment interactions exist. It indicates the proportion of the total effect that can be attributed to the interaction of the two factors and is calculated by the following formula:

$$API = \frac{OR_{ge} - OR_g - OR_e + 1}{OR_{ge}} = \frac{A - B - C + 1}{A} \quad (2)$$

API reflects the percentage of the total effect that is due to the interaction between genes (G) and environmental factors (E). If API ≠ 0, an additive interaction exists between genes (G) and environmental factors (E), and the larger |API| is, the stronger the interaction. If API = 0, there is no additive interaction between genes (G) and environmental factors (E).
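As a small worked example, the API can be computed directly from the three odds ratios. The sketch assumes the standard Rothman form, API = (A - B - C + 1) / A, with A, B, and C as defined for the crossover table; the function name and the input values are hypothetical.

```python
def attributable_proportion(or_ge: float, or_g: float, or_e: float) -> float:
    """Attributable proportion due to interaction on the additive scale.

    Assumes API = (A - B - C + 1) / A, where A is the joint odds ratio and
    B, C are the single-factor odds ratios (all relative to the unexposed group).
    """
    return (or_ge - or_g - or_e + 1.0) / or_ge


# Hypothetical odds ratios: A = 3.0, B = 1.25, C = 1.2.
api = attributable_proportion(3.0, 1.25, 1.2)   # (3.0 - 1.25 - 1.2 + 1) / 3.0

# When the joint effect is exactly additive (A = B + C - 1), the API is zero.
api_additive = attributable_proportion(1.45, 1.25, 1.2)
```

A point estimate like this says nothing about sampling variability, which is why a significance test or confidence interval is still needed.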
Because API is a point estimate, hypothesis testing is needed to determine whether the interaction is statistically significant. The detailed procedure is as follows.
If the statistic $T = S^2/U < \chi^2_{1,0.05}$, then P > 0.05, and the interaction between genes (G) and environmental factors (E) is not considered statistically significant.
The parameter of the multiplicative model, $M = OR_{ge}/(OR_g \times OR_e)$, reflects the ratio of the multiplicative interaction between genes and environment. When the ratio equals one, the two factors fit the multiplicative model and no interaction exists; a ratio greater than one indicates a positive interaction (a synergistic effect of biological significance), whereas a ratio less than one indicates a negative interaction (an antagonistic effect of biological significance).
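The multiplicative parameter can be illustrated in the same way. The sketch assumes M = A / (B × C), the usual ratio of the joint odds ratio to the product of the single-factor odds ratios; the function names and the tolerance used to decide "equal to one" are hypothetical choices.

```python
def multiplicative_ratio(or_ge: float, or_g: float, or_e: float) -> float:
    """Ratio M of the joint odds ratio to the product of the single-factor odds ratios.

    Assumes M = A / (B * C); M = 1 means the data fit the multiplicative model.
    """
    return or_ge / (or_g * or_e)


def describe_interaction(m: float, tol: float = 1e-9) -> str:
    """Label the direction of the multiplicative interaction (tol is an arbitrary choice)."""
    if abs(m - 1.0) <= tol:
        return "none"
    return "positive" if m > 1.0 else "negative"


# Hypothetical odds ratios: A = 3.0, B = 1.25, C = 1.2, so M is about 2.
m = multiplicative_ratio(3.0, 1.25, 1.2)
label = describe_interaction(m)
```

Under a logistic model with a G×E product term and 0/1 coding, this ratio corresponds to the exponentiated interaction coefficient, which is why the two approaches can be compared on the multiplicative scale.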

Analysis with addition of interaction terms in multivariate logistic regression model
The crossover analysis table can only analyze the interaction of two binary factors; the effects of risk factors that are not involved in the crossover are not taken into account. Therefore, it is necessary to combine crossover analysis with multivariate regression analysis (i.e., logistic regression analysis based on the multiplicative model) to obtain more reliable information. To achieve this combination, statistically significant interaction terms are added to the model obtained from multivariate logistic regression (Garcia et al., 2008), and the Akaike information criterion (AIC) is used to assess the goodness of fit of the model:

$$\mathrm{AIC} = -2\ln L + 2m \quad (7)$$

where $-2\ln L$ is minus twice the natural logarithm of the likelihood function, and m is the number of covariates of the model in the regression equation.
Upon adding an interaction term to the original main effects model, a decrease in the AIC relative to that of the main effects model indicates that a multiplicative interaction may exist for this term, provided that the P value of the term is less than 0.05. That is, the best model is the one that minimizes the AIC, and there is no requirement for the models to be nested (Liddle, 2007).
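The AIC comparison can be sketched as below. The log-likelihood values and covariate counts are hypothetical placeholders, not results from this study, and the function names are illustrative.

```python
def aic(log_likelihood: float, n_covariates: int) -> float:
    """AIC = -2 ln L + 2m, with m the number of covariates in the model."""
    return -2.0 * log_likelihood + 2.0 * n_covariates


def aic_change(loglik_main: float, m_main: int, loglik_inter: float, m_inter: int) -> float:
    """AIC of the model with the interaction term minus AIC of the main effects model.

    A negative value means the interaction model is preferred by AIC.
    """
    return aic(loglik_inter, m_inter) - aic(loglik_main, m_main)


# Hypothetical fits: main effects model with 4 covariates, then one added product term.
delta = aic_change(loglik_main=-780.0, m_main=4, loglik_inter=-776.5, m_inter=5)
interaction_supported = delta < 0  # still subject to the P < 0.05 requirement on the term
```

Because each added term costs 2 AIC units, the product term must improve $-2\ln L$ by more than 2 for the AIC to decrease.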

Results
First, the genotype distributions were tested for goodness of fit to Hardy-Weinberg equilibrium. Except for rs915908, whose genotype distribution did not satisfy the Hardy-Weinberg law, the genotype distributions of the other 13 loci conformed to the Hardy-Weinberg law (P > 0.05), and the results are consistent with previous findings.
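A goodness-of-fit test of this kind can be sketched as follows, assuming the common chi-square test with one degree of freedom for a biallelic locus; the text does not state which test variant was used, and the genotype counts below are hypothetical.

```python
import math


def hardy_weinberg_test(n_MM: int, n_Mm: int, n_mm: int):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium at a biallelic locus.

    Returns (chi2, p_value). The test has 1 degree of freedom
    (3 genotype classes - 1 - 1 estimated allele frequency).
    """
    n = n_MM + n_Mm + n_mm
    p = (2 * n_MM + n_Mm) / (2 * n)          # estimated frequency of allele M
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
    observed = [n_MM, n_Mm, n_mm]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts (MM, Mm, mm).
chi2_fit, p_fit = hardy_weinberg_test(100, 200, 100)   # exactly the expected proportions
chi2_dev, p_dev = hardy_weinberg_test(150, 100, 150)   # marked heterozygote deficit
```

A P value above 0.05 is read as no evidence of departure from equilibrium; the sketch assumes both alleles are observed, so that no expected count is zero.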

Results of logistic regression analysis
The results of univariate analysis showed that loci rs671 and rs1329149, age, and alcohol drinking are correlated with the pathogenesis of colorectal cancer to a certain extent. The factors that were statistically significant in the univariate analysis were entered into a multivariate unconditional stepwise logistic regression analysis. In this analysis, the groups with heterozygote GA and homozygote GG at locus rs671 were combined into one group (because these two groups showed no statistically significant difference compared to the control group). The groups with heterozygote TC and homozygote CC at locus rs1329149 were also combined into one group. Forward LR selection at the chosen significance level was applied to select variables, and the results are shown in Table 2. Loci rs671 and rs1329149, age, and alcohol consumption are correlated with the morbidity of colorectal cancer. Based on the ORs, all four factors are risk factors for the pathogenesis of colorectal cancer, which is consistent with results in the literature (Yang et al., 2009). Next, the interactions between these factors were analyzed by combining crossover analysis and logistic regression.

Results of crossover analysis
Using the crossover analysis, the above four risk factors were analyzed to determine whether additive interactions and multiplicative interactions were present.The results from the crossover analysis are shown in Table 3.
The crossover analysis results in Table 3 indicate that although the additive interactions between any two of the four risk factors are not statistically significant (P > 0.05) by the $\chi^2$ test, the API values are positive (API > 0), which could suggest biological significance. Moreover, the parameter of the multiplicative model, M, indicates that a positive multiplicative interaction may exist between these factors, except for the pairs rs671 and rs1329149, and rs671 and age, for which negative multiplicative interactions were found (M = 0.988 < 1 and M = 0.727 < 1). At the same time, the positive multiplicative interactions between rs671 and alcohol drinking (M = 3.160) and between rs1329149 and alcohol drinking (M = 2.603) may be stronger than those between the other factors.

Results of crossover analysis-logistic regression analysis
The results of adding interaction terms to the multivariate logistic regression model are shown in Table 4. The AIC for the models with the product terms rs671*alcohol drinking and rs1329149*alcohol drinking decreased compared to that of the main effects model (Δ < 0), while the AIC increased after other interaction terms were introduced. This indicates that multiplicative interactions may exist between rs671 and alcohol consumption and between rs1329149 and alcohol consumption (P < 0.05, statistically significant), which is consistent with the results of the multiplicative model obtained from the crossover analysis above.

Discussion
Exploring the interactions among risk factors (gene-environment) for complex diseases is central to the emerging field of genetic epidemiology and is an important topic in etiological research, because the presence of such interactions and the different interaction models have different public health significance (Mitchell et al., 2000). Thus, the study of gene-environment interactions for complex diseases is important for improving accuracy and precision in the assessment of both genetic and environmental influences. An understanding of gene-environment interaction also has important implications for public health. It aids in predicting disease rates and provides a basis for well-informed recommendations for disease prevention (Ottman, 1996).
Through a case-control study with a large sample size, this study investigated the risk factors for colorectal cancer using several statistical methods. The gene loci rs671 and rs1329149, age, and alcohol consumption were identified as risk factors that affect the pathogenesis of colorectal cancer. The results showed that people carrying homozygous AA at locus rs671 or homozygous TT at locus rs1329149, elderly people, and people with unhealthy alcohol drinking habits are more susceptible to colorectal cancer. Further crossover analysis showed that the additive interactions among these four risk factors are not statistically significant by hypothesis testing (P > 0.05). That is, although the API for pairs of factors was relatively large (maximum API = 0.729), it did not reach statistical significance. This may be because few cases and controls carried these factor combinations, which yields wide confidence intervals and unstable estimates of the combined effects of two factors; a larger sample size is therefore needed to overcome this problem. Although logistic regression (Hosmer et al., 1990) is a common method for analyzing interactions among categorical variables and can statistically infer the interaction effect under a multiplicative model of the independent variables, it cannot be used to determine the interaction effect under an additive model. Fortunately, as a basic method for case-control studies in epidemiology, crossover analysis has clear advantages over other methods (stratified analysis, the chi-square test, logistic regression, the log-linear model, the generalized relative risk model, etc.): explicit theoretical meaning, abundant information, straightforward and simple calculation, and stable performance. First, the crossover analysis table presents most of the information on the basic epidemiological units intuitively and visually, which allows broader judgment and insight. Second, by using crossover analysis to analyze the interaction between two given factors, we obtained not only the main effects of genes and environmental factors but also the interaction effects under different models (additive and multiplicative). That is, by means of different models, the existence and the degree of interaction between two factors can be determined. Third, the biggest advantage of crossover analysis is that it can assess not only the multiplicative interaction of genes and environmental factors but also the additive interaction. Finally, crossover analysis is widely applicable to the analysis of gene-environment interactions in group case-control studies, matched case-control studies, case-parent control studies, case-sibling control studies, and cohort studies.
However, the statistical testing in crossover analysis has some limitations and needs further improvement. If interactions among more than two factors are to be analyzed, multiple stratifications are required. Under such conditions, the numbers of cases and controls in some strata may be very small or even zero, and the calculation becomes very complicated. More importantly, crossover analysis does not take into account the effect on the interaction of factors that are not involved in the interaction term. To address this problem, interaction terms between each pair of risk factors are added to the selected covariates in the multivariate logistic regression model, which adjusts for the effect of the other factors on the interaction. In addition, when the interaction between genes and environmental factors is studied, the analysis may be distorted if there are confounding factors. In this situation, the confounding factors should be controlled for before crossover analysis so that the final result reflects the true degree of interaction. Therefore, crossover analysis and multivariate logistic regression should be combined when analyzing practical problems in order to obtain more extensive and reasonable information.
Higher-order interactions between genes and environment cannot be completely addressed by either logistic regression or crossover analysis. Currently, many researchers are proposing other methods, such as multifactor dimensionality reduction (Hahn et al., 2003), which is a powerful alternative to traditional parametric statistics such as logistic regression and may handle higher-order data better (Wu et al., 2011). The neural network method also has unique advantages in processing gene-environment interactions (Günther et al., 2009). In particular, genome-wide association studies of susceptibility genes for complex diseases are currently a hot research area, and many new breakthroughs have been obtained in this area (Elbers et al., 2009; Roukos, 2009). In the future, building on this study, we will further explore these methods in terms of their algorithms and theory and apply them to practical data processing.
In this paper, we obtained a comprehensive set of gene-environment (and gene-gene) interactions for colorectal cancer in Chongqing, China, using a method based on crossover analysis-logistic regression. Our work may have value for both clinical medicine and preventive medicine research. In conclusion, the method based on crossover analysis-logistic regression is successful in assessing additive and multiplicative gene-environment interactions and in revealing the synergistic effects of gene loci rs671 and rs1329149 with alcohol consumption in the pathogenesis and development of colorectal cancer.