Understanding Cancer Patients with Diagnostically Influential Factors using High Dimensional Data Embedding
Syed, A. S., Hajderanj, L., Guo, K. and Chen, D. (2022). Understanding Cancer Patients with Diagnostically Influential Factors using High Dimensional Data Embedding. In: Imoize, A. L., Hemanth, D. J., Do, D.-T. and Sur, S. N. (eds.) Explainable Artificial Intelligence in Medical Decision Support Systems. The Institution of Engineering and Technology (IET).
|Authors||Syed, A. S., Hajderanj, L., Guo, K. and Chen, D.|
|Editors||Imoize, A. L., Hemanth, D. J., Do, D.-T. and Sur, S. N.|
Analysing breast cancer data is a long-established research topic from both medical diagnosis and data modelling perspectives. A great many predictive models have been employed in modelling breast cancer data, e.g., predicting a patient’s survival rate given certain medical circumstances and the patient’s demographics. However, these predictive models tend to take a black-box approach and therefore can hardly provide explainable results that could be applied for diagnostic purposes, particularly when neural network-based models are used. On the other hand, identifying diagnostically influential factors with exploratory descriptive models has proven difficult due to the high dimensionality of the breast cancer data under consideration. For instance, the breast cancer data provided by SEER (the Surveillance, Epidemiology, and End Results Program) typically has more than 100 dimensions of numeric and categorical data types and could expand to about 1000 dimensions for analysis if orthogonal (one-hot) encoding is applied. Hence, effectively interpreting and understanding high dimensional data becomes crucial in modelling cancer data. For this reason, dimensionality reduction and manifold learning algorithms have been studied intensively, and many relevant algorithms are available, each with its own pros and cons. In this chapter, a comparative study is presented that aims to provide visualized, explainable insights into breast cancer survival rate analysis and to identify critical influential factors that strongly determine the likelihood of a patient’s survival. Two dimensionality reduction algorithms are considered for comparison: one is the popular t-SNE (t-distributed stochastic neighbor embedding) algorithm, and the other is the relatively new SDD (same degree distribution) algorithm.
The experiments have demonstrated that, based on the same embedding performance assessment metrics, the SDD algorithm can achieve much better data embedding results, which would be difficult or impossible to obtain with t-SNE. Furthermore, using the reliable embedding results from SDD, meaningful and explainable factors have been identified that capture the similarities among the patients who survived and the diversity among the patients who, unfortunately, died. Clusters of patients who survived are clearly recognizable in a two-dimensional embedding space, whereas the embedded points of patients who died are significantly scattered in the space. The complete code package used for the analysis is available for replication.
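To illustrate why one-hot encoding inflates dimensionality, the following is a minimal sketch in plain Python. It is not the chapter's code: the records, the column names (`age`, `tumor_size_mm`, `grade`, `laterality`, `marital_status`) and the helper `one_hot_encode` are hypothetical and are not taken from the SEER data.

```python
# Hypothetical toy records; column names and values are illustrative only
# and are NOT taken from the SEER data set.
records = [
    {"age": 52, "tumor_size_mm": 18, "grade": "II", "laterality": "left", "marital_status": "married"},
    {"age": 61, "tumor_size_mm": 25, "grade": "III", "laterality": "right", "marital_status": "single"},
    {"age": 47, "tumor_size_mm": 12, "grade": "I", "laterality": "left", "marital_status": "married"},
    {"age": 70, "tumor_size_mm": 30, "grade": "II", "laterality": "right", "marital_status": "divorced"},
]


def one_hot_encode(rows, categorical):
    """Expand each categorical column into one 0/1 column per observed category."""
    # Collect the sorted set of categories observed in each categorical column.
    categories = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        out = {}
        for col, value in row.items():
            if col in categorical:
                # One indicator column per category; exactly one is set to 1.
                for cat in categories[col]:
                    out[f"{col}={cat}"] = 1 if value == cat else 0
            else:
                # Numeric columns are passed through unchanged.
                out[col] = value
        encoded.append(out)
    return encoded


categorical_columns = ["grade", "laterality", "marital_status"]
encoded_records = one_hot_encode(records, categorical_columns)

original_dims = len(records[0])
encoded_dims = len(encoded_records[0])
print(f"dimensions before encoding: {original_dims}, after: {encoded_dims}")
```

In this toy case, three categorical columns with three, two and three observed categories turn five dimensions into ten. The same effect, applied to many categorical fields with many categories each, is how a data set with roughly 100 dimensions can grow to around 1000.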
|Keywords||Breast cancer survivability; t-SNE; Same Degree Distribution algorithm; Dimensionality reduction; Visualization; Data embedding; Classification|
|Book title||Explainable Artificial Intelligence in Medical Decision Support Systems|
|Publisher||The Institution of Engineering and Technology (IET)|
|Publication process dates|
|Accepted||10 Aug 2022|
|Deposited||31 Aug 2022|