Early Detection of Lung Cancer Risk Using Data Mining

Introduction
Lung cancer is the most common cause of cancer death worldwide. Its incidence has increased rapidly, and it has become the most common cancer in men in most countries. Lung cancer accounts for around 1,095,000 new cancer cases and 951,000 deaths each year in men, and 514,000 cases and 427,000 deaths in women, representing about 12.7% of all new cancer cases each year and 18.2% of cancer deaths (Ferlay et al., 2010; Paul et al., 2011). Cancer is a group of diseases caused by uncontrolled cell growth. Lung cancer results from out-of-control cell growth that begins in one or both lungs. Lung cancer that spreads to the brain can cause difficulties with vision and weakness on one side of the body. Symptoms of primary lung cancers include cough, coughing up blood, chest pain, and shortness of breath.
Cigarette smoking is the most important cause of lung cancer. Cigarette smoke contains more than 4,000 chemicals, many of which have been identified as causing cancer. A person who smokes more than one pack of cigarettes per day has a 20-25 times greater risk of developing lung cancer than someone who has never smoked. About 90% of lung cancers arise due to tobacco use (Smith et al., 2012). However, other factors, such
Among the overall population of Bangladesh, the lifetime mortality risks (per 100,000 population) of lung cancer were 159.1 and 23.1 for males and females, respectively. The prevalence has been increasing at an alarming rate in developing countries like Bangladesh in recent years (Ferlay et al., 2010). The need for early diagnosis of lung cancer is therefore obvious, but diagnosis is costly in developing countries. This study therefore proposes a lung cancer risk prediction system, based on the most common risk factors, that is cost-effective and easy to use.
A widely recognized formal definition of data mining is: "Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information about data". Data mining includes several techniques for analysing data, such as classification, clustering, correlation, and association rules (Jayalakshmi and Santhakumaran, 2010), and has been used intensively and extensively by many organizations. In healthcare, data mining is becoming increasingly popular. It provides the methodology and technology to extract useful information from data for decision making.
Data pre-processing is a vital task in data mining. It is mainly used to make the data suitable for analysis and clustering by removing duplicate records and filling in missing values according to previously recorded data. A main benefit of data pre-processing is that it reduces memory use.
Clustering is the process of separating a dataset into subgroups according to their distinguishing features. Here, clustering separates the dataset into data that are relevant to lung cancer and data that are not. The AprioriTid (Lan et al., 2010) and Decision Tree (Yael and Elad, 2010) algorithms are then used to find frequent patterns in the dataset. These algorithms are simple and effective for this purpose. Frequent patterns are sets of data items that occur frequently in the data warehouse. Significant frequent patterns are the sets of items that are most responsible for lung cancer. Using these significant patterns, we implemented a prediction system for lung cancer.
The main goal of this research is to develop a system that a person can use to test his or her lung cancer risk level.

Materials and Methods
Data from 400 patients (200 lung cancer patients and 200 non-cancer patients) were obtained from different diagnostic centres. There were 200 male and 200 female patients, aged between 20 and 80 years. Based on previous studies, 20 risk factors were considered for lung cancer assessment in the Bangladeshi population: age, gender, heredity, previous health examination, use of anti-hypersensitive drugs, smoking, food habit, physical activity, obesity, tobacco, genetic risk, environment, mental trauma, intake of red meat, balanced diet, hypertension, heart disease, excessive alcohol, radiation therapy, and chronic lung diseases.
Data pre-processing is a vital step in data mining. Its main goal is to make the collected data suitable for analysis and clustering. A data warehouse sometimes contains duplicate records and records with missing values. Data pre-processing deletes the duplicate records and fills in the missing values according to previously recorded data. It also reduces memory use and normalizes the values used to represent information in the database.
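The two pre-processing operations described above, deleting duplicate records and filling in missing values from previously recorded data, can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function name `preprocess`, the representation of a record as a dictionary, the use of `None` for a missing value, and the choice of the most frequent recorded value as the fill-in are all assumptions made for illustration.

```python
from collections import Counter


def preprocess(records):
    """Drop exact duplicate records, then fill missing values (None).

    Assumption: a missing value is replaced by the most frequent value
    already recorded for the same attribute.
    """
    # Step 1: remove duplicates while keeping the original order.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy, so the input is not modified

    # Step 2: fill each missing value with the attribute's most common value.
    fields = list(unique[0].keys()) if unique else []
    for field in fields:
        observed = [r[field] for r in unique if r[field] is not None]
        if not observed:
            continue  # nothing recorded for this attribute, leave it as is
        most_common = Counter(observed).most_common(1)[0][0]
        for r in unique:
            if r[field] is None:
                r[field] = most_common
    return unique


# Hypothetical records, for illustration only.
records = [
    {"age": 45, "smoking": "yes"},
    {"age": 45, "smoking": "yes"},   # exact duplicate
    {"age": 60, "smoking": None},    # missing value
    {"age": 52, "smoking": "yes"},
]
cleaned = preprocess(records)
```

Other fill-in rules, such as a per-group mean for numeric attributes, would fit the same structure.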
Clustering is the process of partitioning the collected data into subgroups, where each group has a distinguishing feature. It is another demanding task in data mining. The clustering problem has been addressed in numerous contexts and has proven beneficial in many applications (Muhammad et al., 2011). The goal of clustering is to classify objects or data into a number of categories or classes, where each class contains objects with similar features. The main benefits of clustering are that each data object is assigned to a previously unknown class with a distinguishing feature, and that memory use is reduced.
K-means clustering (Amorim and Mirkin, 2012) is a widely recognized clustering technique used in robotics, disease-related, and artificial intelligence applications (Pradhan and Kumar, 2011). Here, k is a positive integer representing the number of clusters. The pre-processed data are clustered using the K-means algorithm with k equal to 2. This yields two clusters: one contains the data relevant to lung cancer, and the other contains the remaining, non-relevant data.
Frequent pattern mining is one of the most significant and vital topics of data mining. It is considered a principal data mining problem, which aims to find the frequent items or patterns in the data warehouse. Different kinds of algorithms are used to mine interesting frequent patterns, such as association rules, clusters, classifications, and correlations, from databases; examples include Apriori, AprioriTid, Decision Tree, and FP-Tree.
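The K-means step with k equal to 2 can be sketched in plain Python as follows. This is a generic textbook K-means on numeric feature vectors, not the code used in this study. The function name `kmeans`, the squared Euclidean distance, and the deterministic initialisation from the first k points are assumptions, and the example points are hypothetical.

```python
def kmeans(points, k=2, max_iterations=100):
    """Cluster numeric vectors into k groups; returns (labels, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical two-dimensional points forming two obvious groups.
points = [(1, 1), (1, 0), (0, 1), (9, 9), (8, 9), (9, 8)]
labels, centroids = kmeans(points, k=2)
```

K-means only separates the records into two groups. Deciding which of the two clusters is the one relevant to lung cancer is a separate step, for example by inspecting how many known cancer patients fall into each cluster.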
After clustering, the AprioriTid (Lan et al., 2010) and Decision Tree (Yael and Elad, 2010) algorithms are used to mine the frequent patterns. Both are efficient algorithms for extracting frequent patterns from the clustered dataset.
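To make the idea of frequent pattern mining concrete, the sketch below performs a basic level-wise, Apriori-style search. It is a simplified stand-in and does not reproduce the AprioriTid variant or the Decision Tree algorithm cited above. The function name `frequent_itemsets`, the absolute support threshold `min_support`, and the example transactions are assumptions for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    size = 2
    while current:
        # Candidate generation: join frequent itemsets of the previous level
        # and keep only candidates whose smaller subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in current for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        # Support counting: scan the transactions for each candidate.
        current = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support:
                current[cand] = count
        result.update(current)
        size += 1
    return result


# Hypothetical risk-factor sets for four patients.
transactions = [
    {"smoking", "tobacco", "obesity"},
    {"smoking", "tobacco"},
    {"smoking", "obesity"},
    {"tobacco"},
]
patterns = frequent_itemsets(transactions, min_support=2)
```

With these hypothetical transactions, the pair smoking and tobacco is frequent because it appears in two records, while tobacco and obesity appear together only once and are discarded.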
Here, $W_i$ is the weightage of each attribute and $F_i$ is the frequency of each rule. The significant frequent pattern is selected using Equation (2):
$$\mathrm{SFP} = S_w(n) \ge \varphi \quad \text{for all values of } n \qquad (2)$$
where SFP denotes the significant frequent pattern and $\varphi$ denotes the significant weightage.
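The selection rule in Equation (2) can be sketched as follows. The text here does not show how the weightage score Sw(n) is computed from W_i and F_i, so the sketch assumes, for illustration only, that the score of a pattern is the sum of W_i multiplied by F_i over its attributes. The function name `significant_patterns`, the weights, the frequencies, and the threshold are all hypothetical.

```python
def significant_patterns(patterns, weights, phi):
    """Keep the patterns whose weightage score is at least phi.

    patterns: {pattern name: {attribute: frequency F_i}}
    weights:  {attribute: weightage W_i}
    Assumption: score(n) = sum of W_i * F_i over the attributes of pattern n.
    """
    scores = {
        name: sum(weights[attr] * freq for attr, freq in attrs.items())
        for name, attrs in patterns.items()
    }
    # Equation (2): a pattern is significant when its score reaches phi.
    return {name: score for name, score in scores.items() if score >= phi}


# Hypothetical weights, frequencies, and threshold.
weights = {"smoking": 3, "obesity": 1}
patterns = {
    "pattern_1": {"smoking": 4, "obesity": 2},  # score 3*4 + 1*2 = 14
    "pattern_2": {"obesity": 3},                # score 1*3 = 3
}
selected = significant_patterns(patterns, weights, phi=10)
```

Under these assumed values, only `pattern_1` reaches the threshold and is kept as a significant frequent pattern.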

Results
The experimental results are divided into two parts: the discovery of significant frequent patterns, and the prediction tool for lung cancer.
The significant patterns for lung cancer prediction are extracted from the data in the data warehouse. The collected data are pre-processed by deleting duplicate records and filling in missing values. The pre-processed data are then clustered using the K-means algorithm with k equal to 2. Finally, significant frequent patterns are mined using the AprioriTid algorithm (Table 1) and the Decision Tree algorithm (Table 2).
Finally, the prediction tool for lung cancer is implemented using the significant patterns. Table 3 lists the frequent pattern parameters and their corresponding scores, and Figure 1 shows the lung cancer risk level, which is computed using Table 3.
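A prediction tool of this kind can be sketched as a lookup of per-factor scores followed by a mapping from the total score to a risk level. The sketch below is illustrative only: it does not use the scores in Table 3, and the function name `risk_level`, the factor scores, the cut-off values, and the level names other than a low level are hypothetical.

```python
def risk_level(answers, scores, cutoffs):
    """Sum the scores of the risk factors present and map the total to a level.

    answers: {factor: True if the factor applies to the person}
    scores:  {factor: score}, hypothetical values rather than those of Table 3
    cutoffs: list of (minimum total, level name), sorted by ascending minimum
    """
    total = sum(scores[factor] for factor, present in answers.items() if present)
    level = None
    for minimum, name in cutoffs:
        if total >= minimum:
            level = name  # keep the highest level whose minimum is reached
    return total, level


# Hypothetical scores and cut-offs, for illustration only.
scores = {"smoking": 6, "obesity": 2, "hypertension": 3}
cutoffs = [(0, "low"), (5, "medium"), (10, "high")]
answers = {"smoking": True, "obesity": False, "hypertension": True}
total, level = risk_level(answers, scores, cutoffs)
```

Keeping the scores and cut-offs as data rather than as fixed logic means they can be replaced by the mined values without changing the function.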

Discussion
Large numbers of people in Bangladesh and around the world have lung cancer, and most of them do not even know they have it. There is no remedy once the cancer has fully developed, and death is then inevitable. The ability to predict lung cancer therefore plays an important role in the diagnosis process. In this paper, we have proposed an effective lung cancer prediction system based on data mining. We have provided an efficient approach for extracting significant patterns from a data warehouse for the prediction of lung cancer. The proposed method is implemented in Java and can efficiently predict the risk of lung cancer.

Figure
Figure 2. Lung Cancer Prediction with Low Risk Level