RESEARCH ARTICLE

Finding Genes Discriminating Smokers from Non-smokers by Applying a Growing Self-organizing Clustering Method to Large Airway Epithelium Cell Microarray Data

Maryam Shahdoust 1 *, Ebrahim Hajizadeh 2, Hossein Mozdarani 3, Ali Chehrei 3

Introduction
Lung cancer is one of the most frequent human cancers; it is the leading cause of cancer-related death in males and the second leading cause of cancer death among females (Jemal et al., 2011). Smoking, particularly cigarette smoking, is one of the main contributors to lung cancer (Spira et al., 2004; Jemal et al., 2011). Cigarette smoke injures the airway epithelium cells exposed to it. A number of studies show that noncancerous large-airway epithelium cells of current and former smokers, with and without lung cancer, exhibit allelic loss (Wistuba et al., 1997; Powell et al., 1999), P53 mutation (Franklin et al., 1997), changes in DNA methylation in the promoter regions of several genes (Guo et al., 2004) and increased telomerase activity (Miyazu et al., 2005). Some microarray studies show that cigarette smoking up-regulates the expression of some lung cancer marker genes such as UCHL1 (Spira et al., 2004; Brendan et al., 2006; Beane et al., 2007; Spira et al., 2007; Cote et al., 2009; Pickett et al., 2009). Identifying the effects of smoking on airway gene expression may provide insight into the cause of this elevated risk and aid the diagnosis and prognosis of lung cancer. Therefore, to assess these alterations, it could be useful to find the genes that are differently expressed and distinguish smokers from non-smokers. Hsu et al. (2003) introduced an approach to cancer class discovery and marker gene identification based on growing self-organizing maps (GSOMs). The approach has three phases: cancer class discovery, marker gene identification and refinement. The approach applied in this article is part of the Hsu approach; it compares large airway epithelium cell gene expression in smokers and non-smokers in order to find genes that are expressed differently in the smokers group.

The data set is microarray gene expression data of large airway epithelium cells (Brendan et al., 2006). The clustering variable was the class discrimination score, which was calculated from the difference between the mean expression of each gene in the smokers and non-smokers groups. This paper aimed to identify the genes that discriminate smokers from non-smokers, in order to assess the effects of cigarette smoking on large airway epithelium cells, by applying a neural network clustering method, growing self-organizing maps (GSOM) (Alahakoon et al., 2000; Hsu et al., 2003), to compare the gene expression of large airway epithelium cells in normal smokers and non-smokers. By applying this approach, we were able to compare the expression of all genes at the same time in order to find differences and to identify the effects of cigarette smoking.

Data set
The data set included large airway epithelium cell microarray data from the left lung of 9 normal non-smokers and 13 normal smokers. Each sample comprised the expression levels of 7129 genes. The data were part of the Brendan et al. (2006) study on the up-regulation of expression of the ubiquitin carboxyl-terminal hydrolase L1 gene in the human airway epithelium of cigarette smokers, prepared in Dr Crystal's lab. The data have been deposited in the Gene Expression Omnibus, which is curated by the National Center for Biotechnology Information. The data set is available at www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2489 .
Data standardization is necessary before applying the algorithm. Equation (1) can be used to standardize the data (Hsu et al., 2003).

Identifying discriminating genes
To identify genes that were differently expressed between the smokers and the non-smokers in large airway epithelium cells, the hexagonal growing self-organizing map (GSOM) clustering method (Hsu et al., 2003) was trained on the "class discrimination scores" for the smokers group (Golub et al., 2000; Hsu et al., 2003). This score was calculated by equation (2) for all genes.

$$P(g,1,2) = \frac{m(g)_1 - m(g)_2}{s(g)_1 + s(g)_2} \qquad (2)$$

where $m(g)_1$ and $m(g)_2$ are the mean expression levels of gene $g$ in smokers (group 1) and non-smokers (group 2), and $s(g)_1$ and $s(g)_2$ are the standard deviations of the expression level of gene $g$ over all samples belonging to smokers and non-smokers, respectively. A high absolute value of $P(g,1,2)$ shows that gene $g$ is well suited for discriminating smokers from non-smokers. In the map of nodes produced by the GSOM, the node with the highest average class discrimination score contained the genes that discriminated smokers from non-smokers. We refer to these genes as discriminating genes hereafter.
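Equation (2) can be computed directly from the per-group expression levels of a gene. The following is a minimal sketch, not the authors' code: the function name and the toy values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def class_discrimination_score(group1, group2):
    """Return P(g,1,2) = (m1 - m2) / (s1 + s2) for one gene.

    group1 and group2 hold the expression levels of the gene in
    smokers (group 1) and non-smokers (group 2).
    """
    m1, m2 = mean(group1), mean(group2)
    # Assumption: sample standard deviation (n - 1 denominator).
    s1, s2 = stdev(group1), stdev(group2)
    return (m1 - m2) / (s1 + s2)


# Toy expression levels for one hypothetical gene (not real data).
toy_smokers = [5.0, 6.0, 7.0]
toy_non_smokers = [1.0, 2.0, 3.0]
toy_score = class_discrimination_score(toy_smokers, toy_non_smokers)
print(toy_score)
```

A positive score means higher mean expression in smokers and a negative score means higher mean expression in non-smokers. The sketch does not guard against a gene whose expression is constant in both groups, for which the denominator would be zero.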
To evaluate the prediction strength of the identified discriminating genes, a weighted vote for each gene was calculated by equation (3) for each sample (Golub et al., 2000; Hsu et al., 2003). The weighted vote of gene g for a sample is the vote of gene g for assigning the sample to the smokers group. The mean weighted votes of the identified discriminating genes were expected to be higher in the smokers group than in the non-smokers group.
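Equation (3) is not reproduced in this text. As an illustration only, the sketch below assumes the weighted-vote form commonly attributed to Golub et al., in which the vote of a gene for a sample is its class discrimination score multiplied by the distance of the sample's expression level from the midpoint of the two group means. The function name and toy numbers are hypothetical.

```python
def weighted_vote(x, m1, m2, score):
    """Vote of one gene for one sample.

    x     : expression level of the gene in the sample
    m1,m2 : mean expression in smokers (group 1) and non-smokers (group 2)
    score : class discrimination score P(g,1,2) of the gene
    Assumed form: score * (x - (m1 + m2) / 2); a positive vote favors
    the smokers group, a negative vote favors the non-smokers group.
    """
    boundary = (m1 + m2) / 2.0
    return score * (x - boundary)


# Toy values (not real data): group means 6.0 and 2.0, score 2.0.
vote_high = weighted_vote(7.0, 6.0, 2.0, 2.0)  # sample resembling smokers
vote_low = weighted_vote(1.0, 6.0, 2.0, 2.0)   # sample resembling non-smokers
print(vote_high, vote_low)
```

Under this assumed form, a gene with a large absolute score casts a stronger vote for the same distance from the midpoint, which is why higher votes are expected for smoker samples when the score is positive.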
The Mann-Whitney U test was applied to compare the weighted votes of the genes in non-smokers and smokers. Discriminant analysis was also applied to see how well these marker genes are able to discriminate smokers from non-smokers.
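The U statistic behind the Mann-Whitney test can be sketched in a few lines. This is an illustrative pairwise-counting version with hypothetical names and toy votes; it returns only the statistic, and obtaining a p-value would require a reference table or a statistics library (for example `scipy.stats.mannwhitneyu`).

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a against sample_b.

    Each pair (a, b) with a > b contributes 1 and each tie
    contributes 0.5, so U ranges from 0 to len(a) * len(b).
    """
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Toy weighted votes of one gene (not real data).
votes_smokers = [6.0, 4.5, 3.0]
votes_non_smokers = [-6.0, -2.0, 3.0]
u_smokers = mann_whitney_u(votes_smokers, votes_non_smokers)
u_non_smokers = mann_whitney_u(votes_non_smokers, votes_smokers)
print(u_smokers, u_non_smokers)
```

A value of U close to its maximum for one group indicates that the votes in that group are almost always higher than in the other group.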
According to previous experience, marker genes are either in the vicinity of the highest-average node on the trained GSOM or are included among the identified predictor genes. In this article, therefore, all the nodes in the vicinity of that node were also studied, and all the identified genes were checked against clinical studies.
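Once a GSOM has been trained, selecting the first ranked node reduces to averaging the class discrimination scores of the genes mapped to each node. The sketch below assumes the trained map is available as a plain gene-to-node assignment; the function name, gene labels, node identifiers and scores are hypothetical, and the GSOM training itself is not shown.

```python
from collections import defaultdict
from statistics import mean


def rank_nodes(scores, node_of):
    """Rank map nodes by the average score of their member genes.

    scores  : dict mapping gene -> class discrimination score
    node_of : dict mapping gene -> identifier of the node it falls in
    Returns a list of (node, genes) pairs, highest average first.
    """
    members = defaultdict(list)
    for gene, node in node_of.items():
        members[node].append(gene)
    return sorted(
        members.items(),
        key=lambda item: mean(scores[g] for g in item[1]),
        reverse=True,
    )


# Toy scores and node assignments (not real data).
toy_scores = {"gene_a": 2.0, "gene_b": 1.0, "gene_c": 0.1, "gene_d": -0.3}
toy_nodes = {"gene_a": 1, "gene_b": 1, "gene_c": 2, "gene_d": 2}
ranking = rank_nodes(toy_scores, toy_nodes)
first_node, first_genes = ranking[0]
print(first_node, first_genes)
```

The genes in the first entry of the ranking play the role of the discriminating genes, and the following entries are candidates for the high-average nodes examined alongside it.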

Results
The node with the highest average class discrimination score (the first ranked node) included seven genes (Table 1, Figure 1). Other nodes with high averages are in the neighborhood of the first ranked node, as expected (Table 2).
Table 3 shows the weighted votes of the discriminating genes for each sample. The weighted votes in smokers were higher than those in non-smokers. The Mann-Whitney U test results also showed that, for each gene, the weighted votes in the two groups were significantly different (p<0.05).
The discriminant analysis results showed that the seven genes identified in the first ranked node classified 100% of the samples correctly, according to their actual groups (Table 4).

Discussion
Considerable evidence indicates that smokers with particular mutations have a dramatically higher risk of developing lung cancer. Comparing the gene expression of smokers with that of non-smokers could therefore help to reveal the effect of smoking on airway gene expression. In this study we used microarray data from large airway epithelium cells to compare normal non-smokers with normal smokers. The aim of the study was to find the genes that could discriminate the smoker samples from the non-smoker samples. Finding these genes could be useful for studying the effect of cigarette smoking on large airway epithelium cells. We applied a neural network clustering approach, GSOM (Hsu et al., 2003). The clustering variable was calculated from the difference between the mean expression of each gene in the two groups. Clustering the genes gave a view of the gene expression alterations in smokers and identified the effects of cigarette smoking on airway gene expression. In the resulting GSOM map, the node with the highest average discrimination score included seven genes (discriminating genes): NQO1, ALDH3A1, H19, AKR1C1, ABHD2, GPX2 and ADH7.
NQO1, NAD(P)H:quinone oxidoreductase, is a detoxification enzyme that protects against the generation of reactive oxygen species and chemically induced oxidative stress, cytotoxicity, mutagenicity, and carcinogenicity (Joseph et al., 1998; Kiyohara et al., 2005; Saldivar et al., 2005; Kolesar et al., 2011). There is evidence that tobacco smoking strongly increases the expression of NQO1 (Cote et al., 2009; Pickett et al., 2009; Boyle et al., 2010; Timofeeva et al., 2010).
ALDH3A1, a member of the aldehyde dehydrogenases, was among the seven discriminating genes. Aldehyde dehydrogenase activity is a functional marker for lung cancer (Ucar et al., 2008; Sullivan et al., 2010; Muzio et al., 2011). ALDH3A1 is up-regulated by smoking (Beane et al., 2007; Petal et al., 2007). Aldehyde dehydrogenases such as ALDH3A1 are involved in the oxidation of toxic aldehydes produced by oxidative stress and exposure to tobacco smoke (Vasiliou et al., 2005; Muzio et al., 2011).
Another identified discriminating gene was AKR1C1. This gene belongs to the aldo-keto reductase (AKR) superfamily. The AKR1 family contains many of the human isoforms, including AKR1A and AKR1B.
In our study, AKR1C1, AKR1C3 and AKR1B10 were identified either in the first ranked node or in its vicinity. AKR1C1, AKR1C3 and AKR1B10 are up-regulated by cigarette smoking (Spira et al., 2004; Woenckhaus et al., 2006; Beane et al., 2007). AKR1B10 is a diagnostic marker of non-small cell lung carcinoma in smokers (Penning, 2005; Miller et al., 2012).
GPX2, glutathione peroxidase 2, is another discriminating gene. This gene is involved in xenobiotic metabolism, and there is some evidence confirming its induction by exposure to cigarette smoke (Spira et al., 2004; Woenckhaus et al., 2006; Brigelius-Flohe et al., 2012).
Another identified discriminating gene was H19, an imprinted maternally expressed transcript (Matouk et al., 2007; 2010). Some studies show up-regulation of H19 in respiratory epithelia exposed to cigarette smoke (Kaplan et al., 2003; Liu et al., 2010). Some studies suggest that overexpression and eventual loss of imprinting of H19 may represent early markers in the progression of airway epithelium toward lung cancer (Kaplan et al., 2003).
The last two identified discriminating genes were ABHD2 and ADH7. ABHD2, abhydrolase domain-containing protein 2, encodes a protein containing an alpha/beta hydrolase fold, which is a catalytic domain found in a very wide range of enzymes. The function of this protein has not been determined. Alternative splicing of this gene results in two transcript variants encoding the same protein (Entrez Gene, available at: http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=11057). ADH7, alcohol dehydrogenase 7, encodes the class IV alcohol dehydrogenase 7 mu or sigma subunit, a member of the alcohol dehydrogenase family. This family metabolizes a variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products (Seitz et al., 2007). At the time of writing, there were not enough studies confirming a relationship between alterations in these two genes and smoking, so further studies are recommended to determine the effect of cigarette smoking on the expression of these genes.
Also, in the vicinity of the identified discriminating genes there were two genes whose relevance to lung cancer is strongly confirmed by biological studies: CYP1B1 and UCHL1.
CYP1B1 belongs to cytochrome P450 family 1. CYP1B1 has a significant role in the oxidation of a variety of carcinogens. The gene is expressed in the lungs and is up-regulated in response to cigarette smoke (Spira et al., 2004; Nagaraj et al., 2005; Wenzlaff et al., 2005; Beane et al., 2007; Xu et al., 2012).
UCHL1, ubiquitin carboxyl-terminal hydrolase L1, is used as a marker of lung cancer (Hibi et al., 1999), and it is up-regulated in the large and small airway epithelium of cigarette smokers, including normal smokers and smokers with early chronic obstructive lung disease (Brendan et al., 2006; Orr et al., 2011; Hurst-Kennedy et al., 2012).
Most of the identified genes agree with the results of other recent, similar studies. For example, the study of Boyle et al. (2010), comparing the oral mucosa and airway epithelium transcriptomes of smokers and non-smokers, showed overexpression of CYP1A1, CYP1B1, AKRs, ALDH2A1, NQO1 and UGTs. Pickett et al. (2009) investigated the effects of cigarette smoke condensate on airway epithelium cells. Their findings demonstrated a strong increase in the expression of genes coding for xenobiotic and detoxifying functions, such as CYP1A1 and CYP1B1, and for antioxidants, such as GPX2 and NQO1. The results of the Beane et al. study indicated that many of the rapidly reversible genes, such as CYP1A1, CYP1B1, AKR1B10, AKR1C1 and ALDH3A1, are up-regulated by smoking and are involved in a protective or adaptive response to tobacco exposure and in the detoxification of tobacco smoke components.
Most of these articles applied common multivariate clustering methods such as hierarchical clustering, which has several drawbacks, such as being time-consuming and lacking robustness when the data are very noisy. By applying GSOM clustering to the differences in mean gene expression, we were able to compare 7129 genes between smokers and non-smokers in just a few minutes. The GSOM map also provided a visual way to find genes discriminating the smokers from the non-smokers and could suggest further studies of the co-expression of genes placed in the same node. In addition, calculating weighted votes made it possible to evaluate systematically the strength of the identified discriminating genes that were supposed to distinguish the two groups.
In our study, the majority of the genes in the first ranked node and its vicinity, such as NQO1 (Eom et al., 2009; Timofeeva et al., 2010; Guo et al., 2012; Liu et al., 2012), have long been of interest in lung cancer studies, and some of them, such as ALDH3A1, AKR1B10 and UCHL1, are known markers for lung cancer (Penning et al., 2005; Petal et al., 2007; Ucar et al., 2008). The identified genes, except ADH7 and ABHD2, had strong relevance to lung cancer and were supported by the existing literature, but we did not carry out any laboratory study to investigate the association of ADH7 and ABHD2, or of other genes placed in other high-average nodes, with smoking and lung cancer. Therefore, a large-sample experimental study is needed to study the altered expression of these genes in smokers compared with non-smokers.

Figure 1. Part of the GSOM Map Trained from Class Discrimination Scores for the Normal Smokers Group, Showing the Location of Marker Genes. Node number 183 is the node with the highest class discrimination score.