Comparison of the Gene Expression Profiles Between Smokers With and Without Lung Cancer Using RNA-Seq

Peng Cheng 1&, You Cheng 2&, Yan Li 1, Zhenguo Zhao 1, Hui Gao 1, Dong Li 1, Hua Li 1, Tao Zhang 1*

Introduction
Lung cancer is the most common cause of cancer-related death in both men and women throughout the world. It can be broadly classified into two main types based on morphological characteristics: non-small cell lung cancer and small cell lung cancer (Xiao et al., 2011). The many causes of cancer include carcinogens (such as those in tobacco smoke), ionizing radiation, and viral infection. Cigarette smoke contains over 60 known carcinogens (Hech, 2003), and tobacco smoke is the main contributor to lung cancer (Biesalski et al., 1998). Therefore, tissues from smokers with and without lung cancer provide a valuable resource for studying gene expression changes and for identifying lung cancer-related genes that may inform treatment.
RNA-Seq technologies are now widely used in diverse transcriptome studies (such as alternative splicing, gene expression, and gene fusions) and offer many attractive features (Mortazavi et al., 2008; Sultan et al., 2008; Maher et al., 2009; Zhao, 2009; Gan et al., 2010; Guttman et al., 2010; Trapnell et al., 2010; F, 2011; Geng Chen, 2011; Pflueger, 2011). Compared with microarrays, RNA-Seq has several advantages: it requires less RNA, produces lower background noise, and can detect new genes and/or transcripts (Marioni et al., 2008; Wang et al., 2009; Marguerat and Bahler, 2010; Nagalakshmi et al., 2010; Beane, 2011). To better understand the gene expression differences between smokers with and without lung cancer, we analyzed two related datasets from the Short Read Archive (Beane, 2011).
We quantified the expression of human genes in the two samples from smokers with and without lung cancer. We then compared their expressed genes and studied the gene expression distribution patterns in both samples. To further investigate the gene expression changes between smokers with and without lung cancer, we carried out differential expression analysis and identified a number of differentially expressed genes. The results reveal some interesting features of the gene expression profiles of smokers with and without lung cancer, and highlight that RNA-Seq technologies are powerful tools for studying the characteristics of the human transcriptome.

Materials and Methods
The RNA-Seq datasets of smokers with and without lung cancer were downloaded from the Short Read Archive (SRA) under the accession numbers SRX060175 (smokers without lung cancer) and SRX060176 (smokers with lung cancer). The human reference genome hg19 was downloaded from UCSC (http://genome.ucsc.edu/). We first extracted the human transcript sequences from hg19. The RNA-Seq reads of SRX060175 and SRX060176 were then mapped onto those transcripts, allowing two mismatches, using SeqMap (Hui Jiang, 2008). Gene expression levels were calculated using rSeq (Hui Jiang, 2008), and 0.1 RPKM (reads per kilobase of exon model per million mapped reads) was chosen as the expression threshold. Differential expression analysis between the two samples was carried out using DESeq in its mode for data without replicates. An adjusted P-value < 0.1 was used as the threshold for calling differentially expressed genes.
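As an illustration of the RPKM measure and the 0.1 RPKM threshold described above, the following minimal sketch computes RPKM from a read count, an exon model length, and the total number of mapped reads. It is not the rSeq implementation; the function names and the toy gene names, counts, and lengths are hypothetical and serve only to show the arithmetic.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # reads / (length in kb * mapped reads in millions)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def expressed_genes(counts, lengths, total_mapped_reads, threshold=0.1):
    """Return {gene: RPKM} for genes whose RPKM is higher than the threshold."""
    result = {}
    for gene, count in counts.items():
        value = rpkm(count, lengths[gene], total_mapped_reads)
        if value > threshold:
            result[gene] = value
    return result


# Hypothetical toy input (not taken from the datasets analyzed here).
toy_counts = {"geneA": 500, "geneB": 1, "geneC": 0}
toy_lengths = {"geneA": 2000, "geneB": 5000, "geneC": 1500}
toy_total_mapped = 25_000_000

toy_expressed = expressed_genes(toy_counts, toy_lengths, toy_total_mapped)
```

In this toy input only geneA passes the threshold, because 500 reads on a 2 kb exon model with 25 million mapped reads correspond to 10 RPKM.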
To study the gene expression profiles of smokers with and without lung cancer, we downloaded two related RNA-Seq datasets from the Short Read Archive (SRA) under the accession numbers SRX060175 (smokers without lung cancer) and SRX060176 (smokers with lung cancer) (Beane, 2011). The corresponding samples were from human large airway epithelial cells of smokers with and without lung cancer. They were sequenced on the Illumina Genome Analyzer IIx platform with the standard Illumina mRNA-Seq protocol. The reads are single-end and 36 bp in length. There are ~26.94 million and ~27.78 million reads in total for SRX060175 and SRX060176, respectively.
We used rSeq (Jiang and Wong, 2009) to quantify human gene expression. First, we extracted the human transcript sequences from the human reference genome hg19. Then we mapped (Hui Jiang, 2008; Marguerat and Bahler, 2010) the RNA-Seq reads of SRX060175 and SRX060176 to those transcripts, allowing two mismatches. We computed the gene expression levels using rSeq and chose 0.1 RPKM (reads per kilobase of exon model per million mapped reads) as the threshold. In the sample of smokers without lung cancer (SRX060175), 16,248 genes were expressed above 0.1 RPKM; in the sample of smokers with lung cancer (SRX060176), 16,321 genes were. Of these, 15,433 expressed genes are common to both samples, 815 genes are expressed only in SRX060175, and 888 genes are expressed only in SRX060176. The results show that most of the human genes expressed in the samples of smokers with and without lung cancer are the same.
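The overlap reported above can be reproduced with simple set operations. The sketch below is illustrative only: the function name and the toy gene identifiers are hypothetical, while the three totals in the consistency check are the values reported in the text.

```python
def compare_expressed(sample_a, sample_b):
    """Count genes shared by two samples and genes unique to each."""
    a, b = set(sample_a), set(sample_b)
    return {
        "common": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


# Hypothetical toy gene lists to show the call.
toy_overlap = compare_expressed(["g1", "g2", "g3"], ["g2", "g3", "g4", "g5"])

# Consistency check on the reported counts: each sample total minus the
# shared genes should give the number of sample-specific genes.
total_175, total_176, common = 16248, 16321, 15433
only_175 = total_175 - common
only_176 = total_176 - common
```

With the reported totals, the sample-specific counts come out as 815 and 888, in agreement with the text.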

Gene expression distribution
We investigated the distributions of gene expression levels in the two samples of smokers without and with lung cancer. For samples SRX060175 and SRX060176, respectively, 2,987 and 3,037 genes are expressed in the range of 0.1-1 RPKM; 7,255 and 7,675 genes in the range of 1-10 RPKM; 4,706 and 4,445 genes in the range of 10-50 RPKM; 752 and 679 genes in the range of 50-100 RPKM; and 548 and 485 genes at 100 RPKM or higher (Table 1). The majority of human genes in both samples are therefore expressed below 50 RPKM (92.00% for SRX060175 and 92.87% for SRX060176), and only a small portion of human genes are expressed at higher levels (Figure 1).
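The binning and the percentages quoted above can be checked with a short sketch. The bin edges and the per-bin counts are taken from the text; the function name and the assumption that each bin includes its lower edge and excludes its upper edge are ours.

```python
import bisect

BIN_EDGES = [0.1, 1, 10, 50, 100]  # lower edges of the RPKM ranges
BIN_LABELS = ["0.1-1", "1-10", "10-50", "50-100", ">=100"]


def bin_label(rpkm_value):
    """Assign an RPKM value to a range; None if below the lowest edge.

    Assumption: bins are half-open, [low, high).
    """
    if rpkm_value < BIN_EDGES[0]:
        return None
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, rpkm_value) - 1]


# Per-bin gene counts as reported in the text, in BIN_LABELS order.
counts_175 = [2987, 7255, 4706, 752, 548]
counts_176 = [3037, 7675, 4445, 679, 485]


def percent_below_50(bin_counts):
    """Percentage of expressed genes in the three bins below 50 RPKM."""
    return round(100 * sum(bin_counts[:3]) / sum(bin_counts), 2)


pct_175 = percent_below_50(counts_175)
pct_176 = percent_below_50(counts_176)
```

The per-bin counts sum to 16,248 and 16,321 genes, matching the totals of expressed genes, and the two percentages agree with the values given above.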
We also calculated the minimum, lower quartile, median, mean, upper quartile, and maximum gene expression values in these two samples (Table 1). The two most highly expressed genes in sample SRX060175 are TPT1 (7152.85 RPKM) and MALAT1 (6681.13 RPKM); in sample SRX060176 they are MALAT1 (14326.9 RPKM) and TPT1 (7331.23 RPKM). TPT1 is involved in calcium binding and microtubule stabilization, and MALAT1 is associated with metastasis and positively regulates cell motility via the transcriptional and/or post-transcriptional regulation of motility-related genes. We plotted the cumulative frequency of human gene expression levels in the samples of smokers with and without lung cancer (Figure 2). The curve of sample SRX060175 lies almost entirely above that of sample SRX060176, suggesting that most human genes are expressed at lower levels in smokers with lung cancer than in smokers without lung cancer.
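The cumulative comparison can be approximated from the binned counts in Table 1. The sketch below is only an illustration: the text does not state the exact plotting convention of Figure 2, so we assume a curve giving the fraction of expressed genes at or above each bin's lower edge, under which a higher curve indicates higher expression. The function name is hypothetical.

```python
from itertools import accumulate

LOWER_EDGES = [0.1, 1, 10, 50, 100]  # RPKM, lower edge of each bin
COUNTS_175 = [2987, 7255, 4706, 752, 548]  # smokers without lung cancer
COUNTS_176 = [3037, 7675, 4445, 679, 485]  # smokers with lung cancer


def fraction_at_or_above(bin_counts):
    """Fraction of expressed genes at or above each bin's lower edge."""
    total = sum(bin_counts)
    # Running sum from the highest bin downwards, restored to bin order.
    tail = list(accumulate(reversed(bin_counts)))[::-1]
    return [t / total for t in tail]


frac_175 = fraction_at_or_above(COUNTS_175)
frac_176 = fraction_at_or_above(COUNTS_176)

for edge, a, b in zip(LOWER_EDGES, frac_175, frac_176):
    print(f">= {edge:>5} RPKM: SRX060175 {a:.4f}  SRX060176 {b:.4f}")
```

Under this assumed convention, the SRX060175 fractions exceed the SRX060176 fractions at every edge above 0.1 RPKM, which is consistent with the interpretation that genes tend to be expressed at lower levels in the sample from smokers with lung cancer.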

Differential expression analysis
To further study the gene expression differences between smokers with and without lung cancer, we carried out differential expression analysis to identify the human genes that are differentially expressed between these two samples. To call differentially expressed genes, we used DESeq (Anders and Huber, 2010) in its mode for data without replicates. Using a threshold of adjusted P-value < 0.1, we found 27 genes differentially expressed in smokers with lung cancer versus smokers without lung cancer, of which 4 were down-regulated and 23 were up-regulated (Figure 3 and Table 2).
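The thresholding step can be sketched as follows. This is not DESeq itself: it only illustrates how adjusted P-values and log2 fold changes could be turned into up- and down-regulated calls, assuming the adjusted P-values are Benjamini-Hochberg corrected. The function names and all input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_regulation(log2_fold_changes, adjusted_p, threshold=0.1):
    """Count up- and down-regulated genes among the significant ones."""
    up = down = 0
    for lfc, padj in zip(log2_fold_changes, adjusted_p):
        if padj < threshold:
            if lfc > 0:
                up += 1
            elif lfc < 0:
                down += 1
    return {"up": up, "down": down}


# Hypothetical toy values (not results from this study).
toy_p = [0.001, 0.02, 0.03, 0.5]
toy_lfc = [2.0, -1.5, 0.8, 0.1]
toy_padj = benjamini_hochberg(toy_p)
toy_calls = call_regulation(toy_lfc, toy_padj)
```

In this toy example three of the four genes fall below the adjusted P-value threshold of 0.1, two with a positive and one with a negative log2 fold change.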
The differentially expressed genes have diverse functions and are involved in different pathways. Several of them have important lung-related functions. For example, HBB and HBA2 are involved in oxygen transport from the lung to the various peripheral tissues (Wajcman et al., 1992; Sanna et al., 1994). MALAT1 (metastasis associated lung adenocarcinoma transcript 1) is a large and infrequently spliced non-coding RNA; it is associated with metastasis and positively regulates cell motility via the transcriptional and/or post-transcriptional regulation of motility-related genes (Huang da, 2009; Tseng, 2009; Guo, 2010). Other differentially expressed genes have various functions: IL1B is involved in the inflammatory response and has been identified as an endogenous pyrogen; SPC25 acts as a component of the essential kinetochore-associated NDC80 complex, which is required for chromosome segregation and spindle checkpoint activity; SDC4 is a cell surface proteoglycan that bears heparan sulfate; and MUC2 coats the epithelia of the intestines, airways, and other mucous membrane-containing organs. We also carried out functional annotation clustering using DAVID (Huang da et al., 2009; Huang da, 2009; Guo, 2010; Geng Chen, 2011), but only three genes (MUC2, FCGBP, MUC5B) could be clustered together and met the criterion of adjusted P-value < 0.1.

Discussion
In this study, we investigated the gene expression differences between smokers with and without lung cancer using two transcriptome sequencing datasets downloaded from the Short Read Archive. We first estimated the gene expression levels in the two samples and found that the majority of their expressed genes are the same, indicating that the differences in expression profile between smokers with and without lung cancer may lie not in uniquely expressed genes but in subtle expression changes of shared genes. The results also show that most human genes are expressed at a low or moderate level in both samples, with only a small portion of human genes expressed at extremely high levels. This suggests that lung cancer does not appear to disturb the overall trend of the gene expression distribution. To learn more about the expression divergence between the two samples, we then inferred the genes differentially expressed between smokers with and without lung cancer. Because there are no replicates of these two samples, we used stringent criteria to call differential expression. In total, 27 genes were found to be differentially expressed between smokers with and without lung cancer. Further analyses suggested that some of these genes have crucial functions in lung tissue, whereas others are involved in diverse functional pathways.
The gene coding for the secreted intestinal mucin MUC2 has been mapped to chromosome 11 (p15). Inactivation of Muc2 causes lung tumor formation with spontaneous progression to invasive carcinoma, and this occurs in the absence of an overt inflammatory response. A reduced representation of goblet cells is characteristic of many aberrant crypt foci (ACF) in both humans and rodents, which are considered early preneoplastic lesions (Velcich et al., 2002). The MUC2 gene expression data support the hypothesis that the reduction in these cells, and thus in the mucus they produce, plays a role in tumor formation.
Tumors with increased expression of mucin genes tended to be associated with post-operative relapse, especially when MUC5B was overexpressed (p = 0.015). Tumors from smokers tended to have higher MUC5B and MUC5AC expression ratios than those from non-smokers (MUC5B: 1.71 vs. 0.76, p = 0.023; MUC5AC: 1.46 vs. 0.81, p = 0.040), and were more likely to overexpress mucin genes (52.9% of tumors from smokers vs. 23.1% of tumors from non-smokers, p = 0.039) (Yu et al., 1996). It is noteworthy that a high percentage of squamous-cell carcinomas also expressed mucin genes and proteins. This finding seems to validate "Yesner's diagram": lung cancers are derived from the same pluripotent cells, and squamous-cell carcinomas may preserve their mucin-secretory potential.
IgG Fc binding protein (FcγBP), which binds the Fc portion of IgG molecules, has been reported in mucin-secreting cells in the colon, small intestine, gall bladder, cystic duct, bronchus, submandibular gland, and cervix uteri, and in the fluids secreted by these cells in humans (O'Donovan et al., 2002). The FcγBP gene investigated in that study has potential as a genetic marker in lung cancer. In each of the malignant lung tumors tested, the ratio of FcγBP mRNA expression (relative to normal tissue) was less than one, whereas in all the benign lung tumors and in three out of four of the hyperplastic nodules the ratio was greater than one. Measurement of FcγBP mRNA expression in lung tumors and the surrounding normal tissues would thus have made it possible to predict the benign or malignant nature of these lung nodules.
Lung cancer widely affects human health and can lead to cancer-related death, so it is vital to study the molecular mechanisms that cause it. RNA-Seq is now a more flexible and more accurate technology than gene microarrays for investigating gene expression changes between healthy and unhealthy lung tissues. It allows the properties of human gene expression to be studied in detail and generates an unprecedented view of the human transcriptome. Our study shows that transcriptome sequencing data from smokers with and without lung cancer provide a good opportunity to compare the gene expression profiles of these two samples. RNA-Seq technologies are therefore powerful tools for revealing the characteristics of gene expression and enable gene activity to be studied more comprehensively. Our results uncover some interesting features of the gene expression profiles of smokers with and without lung cancer. We believe that more intriguing findings will be reported as sequencing technologies and bioinformatics algorithms progress. These advances are expected to bring many benefits to the treatment of human cancers.

Figure
Figure 1. Gene Expression and Distribution of Smokers With and Without Lung Cancer

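Figure 3. Differential Expression Between Smokers With and Without Lung Cancer. The plot shows the log2 fold changes against the base means; the dots coloured in red represent genes that are significant (adjusted P-value < 0.1) at 10% FDR.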