Application of Data Mining Techniques to Explore Predictors of HCC in Egyptian Patients with HCV-related Chronic Liver Disease

Introduction
Hepatocellular carcinoma (HCC) is the most common primary malignant tumor of the liver and the fifth most frequent malignant tumor in the world (third in terms of mortality) (Parkin et al., 2001). An association has been observed between the occurrence of HCC and viral hepatitis (either HBV or HCV) (Saeed et al., 2012). Studies have demonstrated that, unlike in other countries of the Middle East, the fraction of HCC attributable to hepatitis C virus (HCV) is quite high in Egypt (Ezzat et al., 2005). Egypt has the highest prevalence of HCV in the world, ranging from 15% to 25% in rural communities (Strickland et al., 2002; Kurosaki et al., 2010). This high prevalence, in a country with a population of about ninety million, results in a massive number of HCV-infected patients (12-22 million) and, in turn, a massive number of HCV-related HCC cases. HCC surveillance is recommended for all patients with HCV-related liver cirrhosis by the European Association for the Study of the Liver (EASL), the American Association for the Study of Liver Diseases (AASLD) and the Asian Pacific Association for the Study of the Liver (APASL) (Bruix et al., 2001; Omata et al., 2010; Bruix et al., 2011).

1 Endemic Medicine and Hepatology Department, Faculty of Medicine, 2 Computer Science Department, Faculty of Computers and Information, Cairo University, Cairo, Egypt. *For correspondence: daliaomran2007@yahoo.com; daliaomran@kasralainy.edu.eg

Regular HCC surveillance by ultrasound examination (US) of the liver and/or alpha fetoprotein (AFP) may identify early HCC and reduce mortality (Wong et al., 2008). Given the constrained economy and lack of resources in Egypt, such large-scale screening is almost impossible. Moreover, repeated examinations are costly for patients and reduce their compliance. In clinical practice, over 60% of HCCs have been reported to be diagnosed at late stages, suggesting failures in surveillance. These failures are attributed mainly to poor patient compliance and to failure to detect small lesions (Altekruse et al., 2005).
These findings emphasize the need for a simple, inexpensive and noninvasive method that can predict HCC.

Identification of the risk factors associated with HCC development in HCV-related chronic liver disease (CLD) is essential for formulating personalized surveillance programs. Decision-tree analysis is a core component of data mining that can be used to build predictive models (Breiman et al., 1980). Its major advantage over logistic regression analysis is that the results are easy to understand: patients are allocated to subgroups by following a flowchart, and each subgroup defines a predicted probability of the outcome (LeBlanc and Crowley, 1995). This method has been used to define prognostic factors in various diseases such as prostate cancer (Garzotto et al., 2005), diabetes (Miyaki et al., 2002), melanoma (Averbook et al., 2002), colorectal carcinoma (Valera et al., 2007) and liver failure (Baquerizo et al., 2003), and for the prediction of virological response in HCV patients (Salim, 2009). The objective of the current study is to develop an economical, reliable mathematical model to predict HCC in patients with HCV-related CLD using routine workup. In areas with limited resources like Egypt, it is wise to restrict semiannual ultrasound surveillance to high-risk patients with HCV-related liver cirrhosis. This restriction will eventually reduce the unnecessary costs of screening all cirrhotic patients. We believe that data mining can be applied successfully to real health problems currently facing Egypt.

Materials and Methods
This cross-sectional retrospective study enrolled 315 patients of both sexes with HCV-related chronic liver disease (CLD), recruited from the Endemic Medicine Department, Cairo University Hospital. Informed consent was obtained from each patient according to the 1975 Helsinki Declaration, and the study was approved by the Cairo University ethical committee.
One hundred thirty-five (135) patients were diagnosed with HCC according to the criteria of the European Association for the Study of the Liver (EASL) (Bruix et al., 2001). One hundred sixteen (116) patients were diagnosed with liver cirrhosis on the basis of clinical, biochemical, and ultrasound findings. Sixty-four (64) patients were diagnosed with chronic hepatitis C. HCV infection was diagnosed by anti-HCV antibodies and HCV-RNA (Cobas Amplicor HCV Monitor v 2.0, Roche Diagnostic Systems, CA).

Data Collection, Feature selection and reduction
A subset of 29 features from the routine laboratory workup (categorical or numerical) was used for the model building process (Table 1). The dataset contained two demographic variables (age, gender), three hematological variables (hemoglobin, white blood cells, and platelets), eight biochemical variables (total bilirubin, albumin, AST, ALT, ALP, INR, creatinine and AFP), viral markers (for HBV and HCV), and Child class status (A, B or C), in addition to clinical examination findings for the presence of splenomegaly and ascites.
A number of data transformation techniques have been used to format and prepare the patient records to be processed by the learning algorithms.

Data mining
Using data mining analysis, we built a decision tree with the C4.5 learning algorithm (Weka J48). C4.5, published by Ross Quinlan in 1993, is a commonly used decision tree algorithm with high accuracy in medical classification and can handle both categorical and numerical data. The data set was evaluated to determine which variables contribute most to the diagnosis of HCC. Internal validation was performed using 10-fold cross-validation, a technique for assessing how the results of a statistical analysis will generalize to an independent data set.
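To illustrate how a C4.5-style tree can arrive at a numeric cutoff, the following minimal sketch searches for the threshold on a single numeric variable that maximizes the gain ratio. It is a simplified illustration and not the Weka J48 code used in the study: candidate thresholds are taken as midpoints between consecutive distinct values, pruning and other refinements are omitted, and the AFP values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain_ratio) for the best binary split 'value <= threshold'."""
    base = entropy(labels)
    total = len(values)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Simplification: candidates are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weights = (len(left) / total, len(right) / total)
        # Information gain: entropy before the split minus weighted entropy after it.
        gain = base - weights[0] * entropy(left) - weights[1] * entropy(right)
        # Split information penalizes very unbalanced splits.
        split_info = -sum(w * math.log2(w) for w in weights)
        ratio = gain / split_info
        if ratio > best[1]:
            best = (threshold, ratio)
    return best


# Hypothetical AFP values (ng/ml) and HCC labels (1 = HCC), for illustration only.
afp = [5, 12, 20, 35, 60, 150, 400, 900]
has_hcc = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, ratio = best_threshold(afp, has_hcc)
print(threshold, ratio)
```

In a full tree, the same search would be repeated for every attribute at every node, and the attribute with the highest gain ratio would become the split variable.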

Statistical Analysis
Patients were categorized into HCC and non-HCC groups.
Qualitative variables were expressed as number and percent and compared by the chi-square or Fisher's exact test.
Quantitative variables were expressed as mean and standard deviation (SD) and compared by Student's t-test.
Optimum cutoff values for serum AFP, serum AST and age were determined by data mining analysis. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy were calculated subsequently.
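As a reminder of how these five measures relate to one another, the sketch below computes them from the four cells of a 2x2 table. The function name and the counts are hypothetical and chosen only for illustration; they are not data from this study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic measures from true/false positives and negatives."""
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # also called precision
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, not taken from the study.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision corresponds to PPV, not to specificity; the two differ whenever the numbers of false positives and false negatives differ.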

Results
The decision tree algorithm was able to diagnose HCC with a recall (sensitivity) of 83.5% and a precision (positive predictive value) of 83.3% using only routine data. Of the 315 patients, 259 (82.2%) were correctly classified and 56 (17.8%) were incorrectly classified. Out of 29 attributes, serum AFP, with an optimal cutoff value of ≥50.3 ng/ml, was selected as the best (most decisive) predictor of HCC by the decision-tree model. To a lesser extent, male sex, presence of cirrhosis, AST >64 U/L, and ascites were associated with HCC (Figure 1).
AFP was found to be the most accurate single predictor of HCC (Table 3). Moreover, the presence of more than 2 of the five studied variables (i.e. a score >2) was associated with a 103.4-fold increase in the risk of HCC development and predicted HCC with a sensitivity of 96% and a specificity of 82%.
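The scoring rule can be sketched as a simple count of the five variables. The version below assumes that each variable contributes one point, which the text implies but does not state explicitly; the class and field names are hypothetical, and the example patients are invented.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    afp_ng_ml: float
    is_male: bool
    has_cirrhosis: bool
    ast_u_l: float
    has_ascites: bool


def risk_score(p):
    """Count how many of the five risk variables are present (assumed one point each)."""
    return sum([
        p.afp_ng_ml >= 50.3,  # AFP cutoff reported by the decision tree
        p.is_male,
        p.has_cirrhosis,
        p.ast_u_l > 64,       # AST cutoff reported by the decision tree
        p.has_ascites,
    ])


def is_high_risk(p):
    """A score above 2 marks the patient as high risk."""
    return risk_score(p) > 2


# Invented example patients.
patient_a = Patient(afp_ng_ml=120.0, is_male=True, has_cirrhosis=True, ast_u_l=40.0, has_ascites=False)
patient_b = Patient(afp_ng_ml=10.0, is_male=True, has_cirrhosis=False, ast_u_l=70.0, has_ascites=False)
print(risk_score(patient_a), is_high_risk(patient_a))
print(risk_score(patient_b), is_high_risk(patient_b))
```

Such a rule needs no software in practice; the code only makes the thresholds and the counting explicit.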

Discussion
HCC reduces quality of life and causes death within 6 months to 1 year of diagnosis (Bosch et al., 2005). In Egypt, the number of HCC cases has increased dramatically in the last few years. The registry of the pathology department, National Cancer Institute (NCI), Cairo (2003-2004) showed that HCC was the second most common malignancy in males, after carcinoma of the urinary bladder, and the fourth in females (Mokhtar et al., 2007). The stage of cancer dictates the therapeutic choice, making early detection a primary objective. Surveillance of HCC aims at detecting small tumors amenable to curative treatment, which may translate into improved patient survival. Regular screening of a large number of cirrhotic patients has a high cost and may add to the burden on a country like Egypt, which has the highest prevalence of HCV worldwide (Frank et al., 2000; Arguedas et al., 2003).
Data mining analysis has been integrated into bioinformatics research to explore hidden patterns in large datasets, and can thus be used to build prediction models or generate hypotheses (Han and Kamber, 2006). Conventional statistics examine a specific hypothesis, whereas data mining analysis can derive an algorithm from a large amount of data. Decision tree models are considered superior to traditional regression models in that they can be readily interpreted by medical professionals, simply by following the flowchart, without any specific knowledge of statistics (Witten and Frank, 2005).
In the current study, a decision tree model based on routinely available clinical and laboratory parameters was constructed for HCC prediction in patients with HCV-related CLD. Serum AFP ≥50.3 ng/ml was the best predictor of HCC and, to a lesser extent, male sex, presence of cirrhosis, AST >64 U/L, and ascites were also associated with HCC.
Serum AFP, a well-known noninvasive marker for the development of HCC (Daniele et al., 2004), was the first split variable (most decisive) in the predictive model for HCC and was also significantly associated with HCC development in the multivariate analysis. Patients with HCV-related CLD and a serum AFP level of 50.3 ng/ml or more are 252 times more liable to develop HCC. Our study proposed a cutoff of 50.3 ng/ml, with a sensitivity of 72%, a specificity of 99%, a positive predictive value of 99% and a negative predictive value of 72%. The AUROC was 0.833.
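For readers less familiar with these measures, the sketch below shows one way sensitivity and specificity at a fixed cutoff, and the AUROC, can be computed from marker values in the two groups. The AUROC is computed here as the probability that a randomly chosen HCC case has a higher value than a randomly chosen non-HCC case (ties counted as one half); the function names and the AFP values are hypothetical and do not come from the study data.

```python
def auroc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def sens_spec_at_cutoff(scores_pos, scores_neg, cutoff):
    """Sensitivity and specificity when 'score >= cutoff' is called positive."""
    sensitivity = sum(s >= cutoff for s in scores_pos) / len(scores_pos)
    specificity = sum(s < cutoff for s in scores_neg) / len(scores_neg)
    return sensitivity, specificity


# Hypothetical AFP values (ng/ml), for illustration only.
hcc_afp = [30.0, 75.0, 210.0, 800.0]
non_hcc_afp = [4.0, 9.0, 22.0, 55.0]

area = auroc(hcc_afp, non_hcc_afp)
sensitivity, specificity = sens_spec_at_cutoff(hcc_afp, non_hcc_afp, cutoff=50.3)
print(area, sensitivity, specificity)
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of a few hundred; a rank-based formula would be preferable for much larger datasets.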
In previously published studies, AFP (with different cutoffs) had a sensitivity of 39%-65%, a specificity of 76%-94%, and a positive predictive value of 9%-50% for the presence of HCC (Franca et al., 2004). The variation in the sensitivity and specificity of AFP across studies may be due to the diversity of the patient populations examined, varying study designs and differing cutoff values for normality. The AFP cutoff level for the diagnosis of HCC has been debated. An AFP value >400 ng/mL has been considered diagnostic for HCC in cirrhotic patients. However, such a cutoff is problematic in absolute diagnostic terms, since such high levels are not common in the presence of small tumors (<5 cm) and only 30% of HCC patients have levels higher than 100 ng/mL. Furthermore, up to 20% of patients with HCC do not produce AFP (Tao et al., 2010).
Recently, the American Association for the Study of Liver Diseases (AASLD) Practice Guidelines Committee recommended that ultrasound (US) examination alone (without AFP) should be used for HCC surveillance (Bruix et al., 2011). However, the interpretation of ultrasound is operator-dependent and can be difficult in obese persons. Moreover, it cannot differentiate malignant from benign nodules in the small cirrhotic liver, and the detection of small HCC in a cirrhotic liver by US is much more difficult than the detection of metastases in a normal liver, owing to disturbed parenchymal architecture (Saar and Kellner-Weldon, 2008). An effectiveness study recently demonstrated that ultrasound alone had a sensitivity of only 32% for early stage tumors, which increased significantly to 63% when it was used in combination with AFP (Singal et al., 2012).
According to our study, male patients are 2.9 times more liable to develop HCC. Previous studies reported that primary liver cancer is more prevalent among men than women. The gender-specific age-adjusted incidence rate ratio ranges from 1.3 to 3.6 worldwide (Ferlay et al., 2001). Male sex has been consistently shown to increase the risk of HCC development in HCV-infected persons (Degos et al., 2000). Androgens and androgen receptors have been suggested to induce and promote HCC (Strickland et al., 2002).
In the current study, liver cirrhosis increased the risk of HCC by 2.2 times. It is well known that HCC development is restricted largely to patients with cirrhosis or advanced fibrosis (Shiratori et al., 1995; El-Serag, 2000).
The presence of ascites in our study was associated with an increased risk of HCC development. It is well known that ascites signifies advanced liver disease. Shiratori et al. (1995) described the characteristics of 205 HCC cases from Japan and noted that HCV-associated HCC occurred in the presence of more severe liver disease than hepatitis B virus (HBV)-associated HCC.
According to our study, a serum AST level >64 U/L increases the risk of HCC by 9.5 times. High serum AST levels reflect severe hepatic fibrosis, portal inflammation and piecemeal necrosis that may eventually progress to HCC (Saar and Kellner-Weldon, 2008).
Identifying patients at risk of progressing to hepatocellular carcinoma may help in justifying health resource allocation by targeting high-risk patients. In areas with a constrained economy like Egypt, it is wise to restrict HCC surveillance to patients having >2 risk variables (out of the five risk variables proposed by our decision-tree model). This restriction will eventually reduce the unnecessary costs of screening all cirrhotic patients. It is also useful in the rural areas of Egypt, where CT facilities are not readily available. Using these risk variables, physicians can easily identify high-risk patients and refer them to specialized tertiary care centers. Limitations of our study include the small number of patients and the lack of evaluation of tumor markers other than AFP.
Conclusion: Data mining analysis explores data to discover hidden patterns and trends, and enables the development of models to diagnose HCC from simple laboratory data without imposing extra costs for additional examinations. To our knowledge, this study is the first to highlight a new cutoff value of AFP for the diagnosis of HCC (≥50.3 ng/ml). This low cutoff, when used in combination with the other unfavorable (risk) variables, can help in the early diagnosis of HCC. Data mining can be applied successfully to real health problems that Egypt is currently facing. More studies are needed to explore additional variables that may be associated with progression of HCV-related CLD to HCC. Identification of the risk factors associated with HCC development will result in better targeting of patients, giving the utmost benefit from HCC surveillance programs at the least possible cost by restricting screening to at-risk patients only.

Figure 1. Decision Tree Algorithm to Predict HCC

Table 1. Summary of Features (Attributes) Included in the Study
WBC, white blood cell count; Hb, hemoglobin; PLT, platelet count; ALP, alkaline phosphatase; AST, aspartate aminotransferase; ALT, alanine aminotransferase; AFP, alpha fetoprotein; INR, international normalized ratio; Anti HCV Ab, anti hepatitis C antibodies; HBsAg, hepatitis B surface antigen; HB core Ab, hepatitis B core antibodies