Mining Proteins Associated with Oral Squamous Cell Carcinoma in Complex Networks

Introduction
Oral squamous cell carcinoma (OSCC) is a major healthcare problem. It accounts for approximately 90% of oral malignancies and for more than 300,000 newly diagnosed cancers every year. Although significant progress has been made in cancer treatment, the death rate associated with OSCC remains unchanged, and the overall 5-year survival rate is estimated at about 50% (Choi and Myers, 2008; Pasini et al., 2012). Identifying high-risk factors may facilitate early diagnosis and treatment and lower the incidence of OSCC. Proteins are the final executors of physiological functions and play a key role in the development of cancer. Traditional research methods focus only on individual proteins. However, a better understanding of protein-protein interactions (PPIs) is crucial to investigating their roles in cancer development and identifying potential drug targets for use in clinical applications (Bonetta, 2010). Researchers need a network that can describe a large number of protein interactions clearly and explain their mutual influences on structure and function. With the help of high-throughput screening technologies and computational models, information can be integrated and PPI networks can be constructed. The PPI networks might help researchers determine the

Collection of OSCC related genes and proteins
A text search for "oral squamous cell carcinoma" in OMIM produced a list of 42 OSCC-related gene records (Oti et al., 2011). The list was submitted to the search tool of the HUGO Gene Nomenclature Committee (HGNC) to obtain the exact identifiers. HGNC stores all confirmed human genes and assigns each gene exactly one unique standard identifier (Seal et al., 2011). The genes were then mapped to their Swiss-Prot protein IDs. Four genes had no corresponding proteins. Although some genes encode more than one protein, only experimentally verified proteins were used here. A total of 38 OSCC-related proteins were retrieved and denoted as the seed set.

Expansion of OSCC related protein-protein interactions
I. The first neighbors of the seed proteins were found using OPHID (Zhang et al., 2010), which covers both known and predicted mammalian protein-protein interactions. II. Every interacting protein obtained in step I was expanded using the nearest neighbor method. For example, if A interacts with B and B interacts with C, as in A->B->C, only B is included because B is A's nearest neighbor (Chen et al., 2006; Ning et al., 2010). Protein-protein interaction pairs were selected if at least one protein of the pair was in the seed set. The seed set was expanded in this way and a new OSCC-interaction-protein set was produced. Only protein interactions recorded in the HPRD, BIND, or MINT databases were accepted, because all records in these three databases have been verified through human protein interaction experiments, which makes the interaction information more credible.
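
The expansion and filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the record layout (protein_a, protein_b, source_database) and the function name expand_seed_set are assumptions made for the example, and the protein labels A to D are placeholders.

```python
# Hypothetical record layout: (protein_a, protein_b, source_database).
TRUSTED_SOURCES = {"HPRD", "BIND", "MINT"}


def expand_seed_set(seeds, interactions, trusted=TRUSTED_SOURCES):
    """First-neighbor expansion of a seed set.

    Keeps a pair only if it comes from a trusted source and at least one
    of its two proteins is a seed. Returns (accepted_pairs, proteins).
    """
    seeds = set(seeds)
    accepted = set()
    for a, b, source in interactions:
        if source not in trusted:
            continue  # drop pairs from any other source database
        if a in seeds or b in seeds:
            # Interactions are undirected, so store each pair in sorted order.
            accepted.add(tuple(sorted((a, b))))
    proteins = seeds | {p for pair in accepted for p in pair}
    return accepted, proteins


# Placeholder data mirroring the A->B->C example: only B is added to seed A.
records = [("A", "B", "HPRD"), ("B", "C", "HPRD"), ("A", "D", "OTHER")]
pairs, proteins = expand_seed_set({"A"}, records)
```

With seed A, the pair B-C is rejected because neither protein is a seed, and A-D is rejected because its source is not one of the three accepted databases.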

Visualization of the PPI network
Pajek, a network analysis tool with embedded graph-drawing capabilities, was used to visualize and analyze the OSCC-related PPI network (Batagelj and Mrvar, 2011). Nodes and edges were used to represent proteins and protein interactions, respectively. After the proteins and their interactions were adjusted, the OSCC-related PPI network was constructed.
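
As an illustration of how a two-column list of interaction pairs can be handed to Pajek, the sketch below writes the pairs in Pajek's plain-text .net layout (a *Vertices section with numbered, quoted labels followed by an *Edges section). The function name to_pajek_net and the labels A, B, C are assumptions for the example; the source does not describe how the input file was actually prepared.

```python
def to_pajek_net(pairs):
    """Serialize undirected protein pairs as Pajek .net text."""
    proteins = sorted({p for pair in pairs for p in pair})
    # Pajek numbers vertices from 1.
    index = {name: i for i, name in enumerate(proteins, start=1)}
    lines = [f"*Vertices {len(proteins)}"]
    lines += [f'{index[name]} "{name}"' for name in proteins]
    lines.append("*Edges")  # *Edges marks undirected links
    lines += [f"{index[a]} {index[b]}" for a, b in pairs]
    return "\n".join(lines) + "\n"


# Placeholder pairs; real input would be the accepted PPI pairs.
text = to_pajek_net([("C", "B"), ("C", "A")])
```

The returned string can be saved to a file with a .net extension and opened in Pajek.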

Statistical evaluation of specificity and stability
The index of aggregation was calculated, defined as the ratio of the size of the largest sub-network to that of the whole network. Network size was measured as the total number of proteins in the sub-network and in the entire network. The specificity of the network was tested to show that the proteins interact genuinely rather than at random. The same number of protein pairs was selected at random 1000 times to generate the distribution of the index of aggregation and to calculate the p-value (Wagner and Fell, 2001; Maslov and Sneppen, 2002). Finally, we verified whether the degree centrality distribution of all the proteins obeys the power law.
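
The following sketch shows one way the index of aggregation and the randomization test could be computed. It is an assumption-laden illustration: the source does not say which pool the random pairs were drawn from, so background_pairs is a hypothetical list of candidate pairs, and the function names are invented for the example.

```python
import random
from collections import defaultdict


def index_of_aggregation(pairs):
    """Size of the largest connected sub-network divided by network size."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, largest = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal of one connected component.
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        largest = max(largest, size)
    return largest / len(adjacency) if adjacency else 0.0


def empirical_p_value(observed, background_pairs, n_pairs, trials=1000, seed=0):
    """Fraction of random samples whose index exceeds the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        sample = rng.sample(background_pairs, n_pairs)
        if index_of_aggregation(sample) > observed:
            hits += 1
    return hits / trials
```

Under this definition, 5 random samples out of 1000 exceeding the observed index give an empirical p-value of 5/1000 = 0.005.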

Evaluation of the contributions of each protein
The role of a protein in the network can be quantitatively evaluated. The ability of a protein to connect with other protein partners with high specificity reflects its contribution to the network, which was calculated using the following formula:

S_i = 2*ln(t(i)*0.9) - ln(t(i))    (Eq. 1)

In Eq. 1, S_i is the contribution score of protein i and t(i) is the number of connections of protein i. The fixed coefficient 0.9 is the confidence score: interactions that have been verified through human protein interaction experiments are assigned a high confidence score of 0.9 (Chen, 2006).
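
Eq. 1 can be evaluated directly, as in the short sketch below. The function name contribution_score is an assumption for illustration. Because the confidence is the same for every interaction, the expression simplifies algebraically to ln(0.81*t(i)), so the score grows with the logarithm of the number of connections.

```python
import math

CONFIDENCE = 0.9  # fixed confidence score used in Eq. 1


def contribution_score(t, confidence=CONFIDENCE):
    """Contribution score of a protein with t connections (Eq. 1)."""
    if t < 1:
        raise ValueError("t must be at least 1")
    return 2 * math.log(t * confidence) - math.log(t)


# Scores for a few hypothetical connection counts.
scores = {t: contribution_score(t) for t in (1, 3, 4, 40)}
```

For example, a protein with 4 connections scores ln(3.24), about 1.18, whereas one with 3 connections scores ln(2.43), about 0.89.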

OSCC-related genes and proteins
A total of 42 gene records were collected from the OMIM database, and 38 seed proteins were retrieved from HGNC (Table 1). In all, 1908 protein interaction pairs were acquired, of which only 750 PPI pairs were accepted. The details are available in the supplementary material.

Visualization of PPI network
From two columns of data covering the 750 pairs, Pajek produced the graph shown in Figure 1. The entire network contained several clusters of different scales, in which the number of proteins ranged from 2 to 626. The largest sub-network contained 626 proteins and the index of aggregation was 93.43%. When the same number of protein pairs was selected at random 1000 times, the index of aggregation exceeded 93.43% only 5 times. This corresponds to a p-value of 0.005 and indicates that the OSCC-related PPI network is statistically significant and specific. The distribution of the index of aggregation is shown in Figure 2. Degree centrality, a measure widely used in network analysis, is defined as the number of links incident upon a node. Two charts show the degree centrality of the OSCC-related network (Figure 3). In Figure 3A, the X-axis shows degree centrality and the Y-axis shows the number of proteins with that degree. In Figure 3B, both axes are log-transformed, and the curve fitting result shows that the distribution of degree centrality obeys the power law. Maslov and Sneppen (2002) showed that protein interaction networks are consistent with the power law distribution model. If proteins had connected randomly, the degree centrality distribution of the network would not have obeyed the power law (Ning et al., 2010). This suggests that the proteins are connected with each other biologically rather than randomly.
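
The log-log check described for Figure 3B can be sketched as a least-squares line fitted to log10(count) against log10(degree). The source does not state which fitting procedure was used, so this is only one plausible reading; the function names are assumptions. A roughly straight line with negative slope is consistent with a power law, although a straight-line fit on log-log axes is a rough check rather than a rigorous test.

```python
import math
from collections import Counter


def degree_counts(pairs):
    """Map each degree value to the number of proteins with that degree."""
    degree = Counter()
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(counts):
    """Least-squares slope and intercept of log10(count) vs log10(degree)."""
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(counts[k]) for k in counts]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degree values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic counts that follow count = 10000 * degree**-2 exactly.
slope, intercept = loglog_slope({1: 10000, 10: 100, 100: 1})
```

On the synthetic counts above the fitted slope is -2, the exponent used to generate them.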

Evaluating the contribution of each protein
Not all OSCC-related protein interactions carried the same level of confidence. The contribution of each node was evaluated based on the role of every protein in the network, as described in Eq. 1. The 30 top-ranked proteins are listed in Table 2; the scores of the other proteins are available in the supplementary material. The four highest-scoring proteins were SMAD4, CTNNB1, HRAS, and NOTCH1, each of which interacted with more than 40 proteins. Twenty-two proteins in core positions scored over 1.1. They were all included in the seed set, which was retrieved directly from OMIM, indicating that they had already been verified in previous studies. Four proteins (P53, EP300, SMAD3, and SRC) were not included in the seed set; they were not initially retrieved from the OMIM data by the automated procedure but were recovered from the interaction data using the nearest neighbor expansion method.

Discussion
The present study screened proteins drawn from databases of human protein interaction experiments. This screening method could diminish the influence of interfering and uncertain factors and help to evaluate the contribution of proteins more accurately. The proteins were integrated and analyzed to construct the OSCC-related protein interaction network, which supports more comprehensive and systematic research. The nearest neighbor expansion method not only validated existing OSCC protein targets but also mined targets absent from the initial seed set. The specificity and reliability of the PPI network were tested and found to be satisfactory. Important candidates for assessing OSCC risk and for use as therapeutic targets were mined. The recommended research method may also help to screen other target molecules for further study of OSCC.
The four highest-scoring proteins (SMAD4, CTNNB1, HRAS, and NOTCH1) were proposed as the most important candidates for assessing OSCC risk and as therapeutic targets. They have been confirmed to play an important role in the occurrence and development of OSCC. The SMAD4 protein acts as the common mediator in the Smad family and is called co-Smad. The complex of SMAD4 and the R-SMADs can target DNA-binding proteins to promote transcriptional responses of the TGF-β signaling pathway. In this way, SMAD4 plays a critical role in the suppression of carcinogenesis and the maintenance of tissue homeostasis. Loss of its expression may promote the development and metastasis of OSCC (Yang and Yang, 2010; Xia et al., 2013). CTNNB1 (β-catenin) belongs to the armadillo family and plays an important role in Wnt signaling. Furthermore, it contributes to adherens junctions through protein-protein binding and regulates E-cadherin-mediated cell-cell adhesion. Abnormal expression of CTNNB1 can affect oral cancer cell behavior (Duan et al., 2006; Leel et al., 2010). HRAS, a GTPase, has been proven to be a proto-oncogene, and its overactivation drives cells to uncontrolled division and thus carcinogenesis. The variant 'C' allele of the H-RAS T81C polymorphism was found to be associated with a higher risk of oral cancer (Murugan et al., 2009; Jayaraman et al., 2012). NOTCH1, a transmembrane protein with repeated extracellular EGF domains and NOTCH domains, functions in multiple processes such as differentiation, proliferation, and apoptosis. Overactivated Notch1 signaling facilitates tumor recurrence and drug resistance of cancer stem cells and cancer stem-like cells. However, activated NOTCH1 can increase the expression of p21WAF1/CIP1 and P53 and down-regulate Wnt/β-catenin signaling, which can induce apoptosis and cell cycle arrest in OSCC cells (Duan et al., 2006; Ravindran and Devaraj, 2012).
Four proteins, P53, EP300, SMAD3, and SRC, were mined using the nearest neighbor expansion method. Although these proteins were not included in the seed set, they were all found to interact with important seed proteins (Table 3). This indirectly suggests that they might play an important role in OSCC, and these proteins merit further research. P53 acts as a tumor suppressor, and its activation can initiate responses such as DNA repair, differentiation, senescence, and the inhibition of angiogenesis (Mroz and Rocco, 2010; Pasini et al., 2012). EP300, a transcriptional coactivator, promotes the maturation and differentiation of cells and prevents the growth of cancer. Studies suggest that EP300 mutations contribute to the development of colon cancer, breast cancer, and OSCC. It may also help predict cancer prognosis (Gayther et al., 2000). SMAD3, a mediator of the TGF-β signaling pathway, can combine with SMAD4 to activate the pathway. SMAD3 may have a bidirectional function in cancer development (Han and Wan, 2011). SRC is a proto-oncogene tyrosine-protein kinase encoded by the SRC gene. This protein phosphorylates specific tyrosine residues of other proteins, and its activation promotes angiogenesis, proliferation, and invasion of cancer (Cheng et al., 2011).
In summary, this work describes the construction of a protein-protein interaction network of OSCC. The four highest-scoring proteins, SMAD4, CTNNB1, HRAS, and NOTCH1, were identified, and four non-seed proteins, P53, EP300, SMAD3, and SRC, were mined using the nearest neighbor expansion method. These proteins affect the development and metastasis of OSCC through the regulation of transcriptional responses, differentiation, angiogenesis, proliferation, and apoptotic programs. The present study may help researchers identify crucial targets for the prevention and treatment of OSCC and guide medical research toward further pertinent study.