COMPUTER-AIDED DIAGNOSIS OF LUNG MALIGNITY USING MULTIDIMENSIONAL ANALYSIS OF TUMOUR MARKER DATA

The aim of this work is to assess the diagnostic performance of lung tumour markers. A data set of 182 patients was examined and divided into two main groups: 86 patients with diagnosed malignancy (confirmed by histology) and 96 with diagnosed benign tumours or tuberculosis. The following tumour markers were analyzed: carcinoembryonic antigen and cytokeratin 19 fragment, both sampled in the pleural exudate and in serum. In addition, the patient's age and gender were used as further variables in the original data matrix. Three laboratory tests were used for indicating lung malignancy in order to verify or predict the patient's diagnosis, not only by using the result of a chosen individual laboratory test but also by applying a multivariate statistical approach, which jointly utilizes all performed tests in the form of their optimal linear combination.


Introduction
The diagnosis of a disease can be confirmed or predicted not only by using appropriate laboratory tests individually but also by multivariate statistical analysis, which uses all performed tests simultaneously in the form of their optimal (usually linear) combination. This way of enhancing diagnostic effectiveness, which we advocate, is applied here to the results of laboratory analysis of lung tumour markers in serum as well as in pleural effusion (exudate).
Pleural effusion accompanies several kinds of lung illness in clinical practice. Malignancy is one of its main causes. More than 90 % of malignant pleural effusions are due to metastatic disease, mainly from primary lung or breast malignancies. The initial diagnostic approach includes thoracocentesis, cytology, and biochemical laboratory tests. However, the sensitivity of these non-invasive techniques is considered to be only 40 %-70 %. To improve upon these rates, a number of tumour markers (TM) in the pleural fluid have been intensively evaluated. The most common markers found to be of diagnostic significance were carcinoembryonic antigen (CEA), cancer antigen 15-3 (CA 15-3), cancer antigen 19-9 (CA 19-9), and cytokeratin 19 fragment (CYFRA 21-1).
CEA was identified in 1965 and has been widely used in the follow-up of various tumours, e.g. colorectal cancer (MROCZKO et al., 2007; DUFFY et al., 2007; YAMAMOTO et al., 2005) and breast cancer (NICOLINI et al., 2008; CHEN et al., 2006; SÖLÉTORMOS et al., 2004). The CYFRA 21-1 assay measures cytokeratin 19 fragment, the concentration of which increases with the extent of the malignant disease in non-small cell lung cancer. The serum CYFRA 21-1 distribution differs significantly according to histology, disease stage and performance status.
Lung cancer is the leading cause of cancer deaths in Europe. Levels of tumour markers, especially CEA and CYFRA 21-1, may help to establish the diagnosis of pleural malignancy (MATSUOKA et al., 2007; SHITRIT et al., 2005; OKAMOTO et al., 2005; FUHRMAN et al., 2000).
It should be added that the most effective positive test is histology of the appropriate tissue sample, but this approach is invasive and takes a long time. The use of TM may therefore save time needed for medical treatment in urgent cases.

Description of the studied data
Tumour markers were determined at the Institute for Tuberculosis and Respiratory Diseases (ITRD) in Poprad-Kvetnica, Slovakia. A data set of 182 patients was examined and two main groups of patient samples were created: 86 malignant (with malignancy confirmed by histology) and 96 with benign tumours or tuberculosis. The following tumour markers were analyzed in pleural effusion: CEA (coded as EXCEA) and CYFRA 21-1 (coded as EXCYF); the same markers were analyzed in serum (coded as SCEA and SCYF, respectively). In addition, the patient's age (coded as AGE) and gender (coded as SEXN) were used as variables in the original data matrix. For the multivariate classification techniques, the categorical classification variable DG (diagnosis) took two values: 1 indicating malignant disease and 2 indicating all other cases. This categorization was made on the basis of known histology results.

Multidimensional data analysis
Statistical calculations were performed using principal component analysis (PCA), cluster analysis (CA), linear discriminant analysis (LDA), and logistic regression (LR). Several commercial software packages were used: STATGRAPHICS Plus 5.1, SPSS 15 and JMP 6.0.2.

Analytical procedures
Tumour markers were analysed on the automatic analysers ELECSYS 1010 and ELECSYS 2010, which use immunoanalysis with electrochemically generated chemiluminescent detection. For determinations in pleural effusion, an original procedure developed at the ITRD was used.

Principal Component Analysis (PCA)
The data set of patients characterized by four variables, namely SCEA, EXCEA, EXCYF and SCYF, was used for a preliminary study. As described in detail in part 3.1, the variable SCYF was found to be of the least importance. Considering this as well as economic aspects, the variable SCYF was excluded from further study. Instead, the variable AGE, which is always available, was used in the detailed PCA study. PCA reveals a natural grouping of the studied objects as well as of the variables in a space of reduced dimension (Fig. 1). The first principal component (PC1) is a linear combination of the four original variables, optimized to preserve the maximal variance of the data. The variables are shown in the PCA biplot as rays connecting the variable position in the PC2-PC1 plane with the origin. The numbers 1 and 2 represent the category to which the investigated sample belongs (1: malignant, 2: non-malignant). Inspection of the biplot in Fig. 1 reveals that the PC1 axis represents malignancy. All tumour markers are positively correlated with PC1, which is in accordance with the observation that the malignant cases are located at high PC1 values and suggests that samples with a high PC1 value are likely to be malignant.
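The biplot construction described above can be sketched as follows. This is a minimal illustration and not the procedure implemented in the commercial packages used in the study: autoscaling of the variables is assumed, the function name pca_biplot_coordinates is hypothetical, and the data are synthetic (three correlated columns standing in for tumour markers and one unrelated column standing in for AGE).

```python
import numpy as np

def pca_biplot_coordinates(X):
    """Return (scores, loadings, explained_ratio) for a samples-by-variables array."""
    X = np.asarray(X, dtype=float)
    # Autoscale each variable (zero mean, unit variance) so that variables
    # measured in different units contribute comparably (an assumption here).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix: the rows of Vt are the component directions.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s        # sample coordinates (the points of a biplot)
    loadings = Vt.T       # variable coordinates (the rays of a biplot)
    explained_ratio = s**2 / np.sum(s**2)
    return scores, loadings, explained_ratio

# Synthetic illustration only, NOT the study data.
rng = np.random.default_rng(0)
n = 60
latent = rng.normal(size=n)  # hypothetical common factor behind the markers
X = np.column_stack([
    latent + 0.3 * rng.normal(size=n),  # stand-in for a first marker
    latent + 0.3 * rng.normal(size=n),  # stand-in for a second marker
    latent + 0.3 * rng.normal(size=n),  # stand-in for a third marker
    rng.normal(size=n),                 # unrelated variable, stand-in for AGE
])
scores, loadings, explained_ratio = pca_biplot_coordinates(X)
```

Plotting scores[:, 0] against scores[:, 1] gives the sample positions, and the first two columns of loadings give the ray end points. The sign of each component is arbitrary, so only the relative signs of the loadings within one component are meaningful.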

Cluster Analysis (CA)
Among the clustering techniques, Ward's method with the squared Euclidean distance metric was selected for clustering the variables. The obtained results agree with clinical expectations: EXCEA and SCEA are clustered with EXCYF, so that all tumour markers indicating a positive diagnosis are grouped together. The variable AGE is clustered with SEXN, which simply reflects the fact that the average age of the women is higher than that of the men.
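Ward's criterion can be illustrated with a short sketch. At each step the two clusters whose merging causes the smallest increase in the within-cluster sum of squares are joined; for clusters of sizes n_i and n_j this increase equals n_i n_j / (n_i + n_j) times the squared Euclidean distance between their centroids. The function name ward_cluster, the variable labels and the numbers are hypothetical and serve only as an illustration; they are not the study data, and the naive pairwise search is not how the statistical packages implement the method.

```python
from itertools import combinations

def ward_cluster(points, labels):
    """Agglomerate labelled vectors with Ward's criterion; return the merge history."""
    # Each cluster is stored as (member labels, centroid, size).
    clusters = [((lab,), list(p), 1) for lab, p in zip(labels, points)]
    history = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            _, ci, ni = clusters[i]
            _, cj, nj = clusters[j]
            sq_dist = sum((a - b) ** 2 for a, b in zip(ci, cj))
            # Ward cost: increase in within-cluster sum of squares on merging.
            cost = ni * nj / (ni + nj) * sq_dist
            if best is None or cost < best[0]:
                best = (cost, i, j)
        cost, i, j = best
        mi, ci, ni = clusters[i]
        mj, cj, nj = clusters[j]
        centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        history.append((mi, mj, cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mi + mj, centroid, ni + nj))
    return history

# Hypothetical standardized variables (one vector per variable); illustration only.
variables = {
    "var_a": [1.0, 2.0, 3.0, 4.0],
    "var_b": [1.1, 2.1, 2.9, 4.2],
    "var_c": [4.0, 1.0, 3.5, 0.5],
    "var_d": [3.8, 1.2, 3.6, 0.4],
}
history = ward_cluster(list(variables.values()), list(variables.keys()))
for left, right, cost in history:
    print(left, "+", right, "merge cost:", round(cost, 3))
```

To cluster variables rather than samples, each variable is passed as one vector of its (standardized) values over all samples, as in the toy dictionary above.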

Discriminant Analysis
With regard to the problem at hand, the main goals of the applied multivariate classification methods, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression (LR), are: (1) to create the diagnostic categories and the training data set using the entries of the individual samples with known diagnosis, (2) to build a classification model using the categorized patient samples in the training set, (3) to assign the not yet classified samples (belonging to the test set) to the selected classes. Figure 3 shows the LDA graphical output: the non-malignant patient samples (numbers 1-96) are located in a narrow cluster at very negative values of the first discriminant function (DF1), whilst the malignant samples (97-182) form a wide tailing cluster at higher DF1 values, which is typical behaviour in many clinical studies.
The classification performance obtained with the different software packages is collected in Table 1. The results concern three types of samples: (1) the training set samples (used for calculating the classification model), (2) the samples omitted from the training set one at a time according to the leave-one-out procedure (which was available only in the SPSS software), (3) the samples forming a special test set, which were not included in the training set. The number of patient samples is given by the denominator of the "true/all" ratio. Malignancy was predicted using the four variables SCEA, EXCEA, EXCYF and AGE, except for LR, where SEXN was also used. N/A means that the calculation was not possible with the cited software.
The predictive ability of the multivariate methods is expressed by the results for the last two types of samples. It is better for LDA (over 86 %) than for QDA. However, the best results (90 %) were achieved by logistic regression, where, in addition to the variables used in the other techniques, the patient's gender (woman/man) was used in the form of the binary variable SEXN.
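For two diagnostic classes, the three steps above can be sketched with Fisher's linear discriminant, in which the single discriminant function DF1 is a linear combination of the variables. The sketch below assumes equal prior probabilities and a pooled covariance matrix; the function names fit_fisher_lda and predict_lda are hypothetical, the data are synthetic, and the coefficients produced by the packages used in the study may be scaled differently.

```python
import numpy as np

def fit_fisher_lda(X, y):
    """Fit a two-class linear discriminant; returns (weights, threshold, classes)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)  # exactly two class codes are expected
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled within-class covariance matrix.
    Sw = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X) - 2)
    w = np.linalg.solve(Sw, m1 - m2)
    # Threshold at the midpoint of the projected class means (equal priors assumed).
    c = w @ (m1 + m2) / 2
    return w, c, classes

def predict_lda(model, X):
    """Return predicted class codes and the discriminant scores DF1."""
    w, c, classes = model
    df1 = np.asarray(X, dtype=float) @ w - c
    return np.where(df1 > 0, classes[0], classes[1]), df1

# Synthetic two-variable illustration only, NOT the study data.
# Class codes follow the DG convention of the text: 1 = malignant, 2 = other.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(3.0, 1.0, size=(50, 2)),
               rng.normal(0.0, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [2] * 50)
model = fit_fisher_lda(X, y)
pred, df1 = predict_lda(model, X)
accuracy = float(np.mean(pred == y))
```

In this sketch positive DF1 values are assigned to the first class code; the orientation of the axis is a convention and may be reversed in other software.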
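The leave-one-out procedure mentioned above can be sketched generically: each sample is removed in turn, the model is recalculated from the remaining samples, and the omitted sample is then classified. The sketch below returns the "true/all" counts. A nearest-centroid classifier is used only as a simple stand-in for the methods of the study, and all names and numbers are hypothetical.

```python
def leave_one_out(samples, labels, fit, predict):
    """Return (number correctly classified, number of samples) under leave-one-out."""
    correct = 0
    for i in range(len(samples)):
        # Training set without sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct, len(samples)

def fit_centroids(xs, ys):
    """Stand-in classifier: store the mean vector of each class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, ([0.0] * len(x), 0))
        sums[y] = ([t + v for t, v in zip(total, x)], count + 1)
    return {y: [t / n for t in total] for y, (total, n) in sums.items()}

def predict_centroid(centroids, x):
    """Assign x to the class with the nearest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

# Hypothetical, well separated toy samples; illustration only.
samples = [[5.0, 4.8], [4.6, 5.2], [5.3, 5.1], [0.9, 1.1], [1.2, 0.8], [1.0, 1.3]]
labels = [1, 1, 1, 2, 2, 2]
true_count, all_count = leave_one_out(samples, labels, fit_centroids, predict_centroid)
print(f"{true_count}/{all_count}")
```

Because the fit and predict functions are passed as arguments, the same loop can wrap any classifier that follows this two-function pattern.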
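How a binary variable such as SEXN enters a logistic regression model alongside a continuous variable can be sketched as follows. This is a minimal illustration fitted by plain gradient descent, which is not necessarily the estimation algorithm of the packages used in the study. The recoding of the diagnosis to 1 (malignant) and 0 (other), the function names, the learning rate and the toy data are all assumptions made for the example.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by gradient descent; y in {0, 1}; intercept is w[0]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi  # derivative of the log-loss with respect to z
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Estimated probability of class 1 for one sample."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: [standardized marker value, binary gender code].
# Illustration only, NOT the study data.
X = [[2.0, 1], [1.5, 0], [1.8, 1], [1.2, 0],
     [-1.0, 1], [-1.5, 0], [-0.8, 1], [-2.0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = malignant, 0 = other (assumed recoding of DG)
w = fit_logistic(X, y)
predicted = [1 if predict_proba(w, x) > 0.5 else 0 for x in X]
```

The fitted model is again a linear combination of the variables, here passed through the logistic function to give a probability; the 0.5 cut-off used for classification is a conventional choice and can be adjusted.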

Conclusions
Principal component analysis and cluster analysis make it possible to display a natural grouping of the samples belonging to individuals treated for lung diseases. The obtained results demonstrate very good applicability of the multivariate statistical methods used, both for graphical representation and for classification of the samples in a reduced number of dimensions. The patient's diagnosis may be predicted or verified not only by using the result of a selected individual laboratory test but also by utilizing all performed laboratory tests jointly, in the form of their optimal combination provided by an appropriate multidimensional statistical technique.

Fig. 3. Linear discriminant analysis of the samples denoted by the numbers on the vertical axis. DF1 denotes the only discriminant function. 86 samples correspond to malignancy confirmed by histology (Dg=1) and 96 samples to benign tumours or tuberculosis (Dg=2). Software: Statgraphics 5.1.
Tab. 1. Classification results for various multivariate methods and software.