Year of Graduation
Application of Machine Learning Methods for Analysis of Genome Data
Big Data Systems
This research presents a preliminary evaluation of the relationship between cancer breakpoint hotspots and two types of DNA secondary structures: stem-loops and quadruplexes. A reliable procedure for building and evaluating machine learning models was developed, and the random forest algorithm was used to test the main hypothesis of the research. The analysis indicates that a relationship between cancer breakpoint hotspots and the studied DNA secondary structures exists, although it is generally weak. Nevertheless, the best models for each cancer type showed a lift of recall, relative to a random choice model, ranging from 1.25 to 2.86 for stem-loop-based models and from 1.02 to 10 for quadruplex-based models. When the best model is selected for each cancer type, it is stem-loop-based for pancreatic, prostate, ovarian, uterine, brain and liver cancer, and quadruplex-based for the remaining types: blood, bone, skin and breast cancer.
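The lift of recall reported above can be illustrated with a minimal sketch. The abstract does not give the exact definition, so the code assumes a common one: the recall obtained when only the top-scored fraction of genome windows is labelled as hotspots, divided by the recall a random selection of the same size would achieve on average (which equals that fraction). The function name `lift_of_recall`, its parameters and the toy data are hypothetical and are not taken from the thesis.

```python
import numpy as np


def lift_of_recall(y_true, scores, top_fraction):
    """Recall within the top-scored fraction divided by random-choice recall.

    Assumed definition (not confirmed by the abstract): a random selection of
    the same size recovers, on average, a share of the positives equal to the
    share of windows selected.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)

    # Number of windows labelled as hotspots by the model (at least one).
    n_selected = max(1, int(round(top_fraction * len(scores))))

    # Indices of the windows with the highest predicted scores.
    top_idx = np.argsort(-scores, kind="stable")[:n_selected]

    # Share of true hotspots that fall inside the selected windows.
    recall = y_true[top_idx].sum() / y_true.sum()

    # Expected recall of a random choice model selecting the same number of windows.
    random_recall = n_selected / len(scores)

    return recall / random_recall


# Toy data: 10 genome windows, 2 of them true hotspots (label 1).
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
predicted = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4, 0.15, 0.25]
print(lift_of_recall(labels, predicted, top_fraction=0.2))
```

Under this assumed definition, a lift of 1 means the model is no better than random choice, which is why values close to 1.02 indicate a very weak relationship while values such as 10 indicate a strong one.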