Modern Methods of Data Analysis
- know the theory behind the predictive modeling process and its components: the types of predictive models and the key steps of model creation, such as data preprocessing, model construction, and assessment of model performance.
- know practical applications of predictive modeling with machine-learning algorithms to molecular biology databases.
- acquire the skills to use Python functions from different Python packages to apply different types of models, such as linear and nonlinear regression models, linear and nonlinear classification models, regression trees, and rule-based models.
- acquire the skills to use Python functions from different Python packages to preprocess the input data, i.e. calculate statistics, estimate skewness, apply an appropriate transformation, perform PCA, find between-predictor correlations, and generate dummy variables.
- acquire the skills to use Python functions to measure predictor importance and model performance, use filtering methods, and measure outcome error.
- apply the knowledge and tools of predictive analytics to bioinformatics applications.
- know the theory of machine-learning algorithms
- acquire the skills to implement machine-learning algorithms in Python
- apply the knowledge and tools of predictive analytics to real-life applications
- Big Data in Bioinformatics. Concepts of model building. Introduction to Big Data in bioinformatics. Progress in sequencing technologies. Key parts of predictive models. Concepts of model building. Data “spending”. Data splitting. Predictors. Candidate models. Optimal model. Performance estimation.
- Data Preprocessing. Unsupervised data processing. Techniques for adding, deleting, and transforming training set data. Reduction of data skewness and outliers. Feature engineering. Feature extraction. Surrogate variables as combinations of multiple predictors. Dummy variables. Principal component analysis.
- Linear regression models. Measuring performance in regression models. Linear regression. Partial least squares. Regularization. Ridge models. LASSO and elastic net.
- Multivariate adaptive regression splines. Piecewise linear approximation models. Multivariate adaptive regression splines (MARS). Feature importance in MARS models.
- Neural networks. Single perceptron. Multilayer perceptrons. Backpropagation. Activation functions. Error estimation. TensorFlow Playground.
- Support vector machines. K-nearest neighbors. Support vector machine algorithm. Kernels. K-nearest neighbors. Tuning parameters. Cross-validation.
- Measuring performance in classification models. Sensitivity and specificity. Receiver operating characteristic curves.
- Linear classification models. Logistic regression. Linear discriminant analysis. Partial least squares discriminant analysis. Penalized models. Nearest shrunken centroids.
- Nonlinear classification models. Nonlinear discriminant analysis. Neural networks. Flexible discriminant analysis. Support vector machines. K-nearest neighbors. Naïve Bayes.
- Decision trees. Basic regression trees and regression model trees. Basic classification trees. Bagged trees. Random forests. Boosting. Cubist. Case studies.
- Machine learning in bioinformatics. Examples of applying machine-learning algorithms to bioinformatics tasks, such as classification of RNA-seq expression data or prediction of functional genomic elements from sequence features.
- Homework 1
- Homework 2
- Homework 3
- Homework 4
- Written exam
- Interim assessment (module 2): 0.15 * Homework 1 + 0.15 * Homework 2 + 0.15 * Homework 3 + 0.15 * Homework 4 + 0.4 * Written exam
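
The interim assessment formula is a plain weighted sum, which can be sketched in Python as shown here. The weights come from the formula above. The dictionary keys, the function name `interim_grade`, the example scores, and the 10-point grading scale are assumptions made for illustration. The syllabus does not state a rounding rule, so none is applied.

```python
# Weights copied from the interim assessment formula in the syllabus.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "homework_1": 0.15,
    "homework_2": 0.15,
    "homework_3": 0.15,
    "homework_4": 0.15,
    "written_exam": 0.40,
}


def interim_grade(scores):
    """Return the weighted sum of the five assessment components.

    `scores` maps each key of WEIGHTS to a numeric score. All scores are
    assumed to be on the same scale (a 10-point scale in the example below).
    """
    missing = sorted(set(WEIGHTS) - set(scores))
    if missing:
        raise KeyError(f"missing scores for: {', '.join(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores on an assumed 10-point scale.
example_scores = {
    "homework_1": 8,
    "homework_2": 7,
    "homework_3": 9,
    "homework_4": 10,
    "written_exam": 6,
}

grade = interim_grade(example_scores)
print(f"Interim grade: {grade:.2f}")
```

Because the four homework assignments together carry 0.6 of the weight and the written exam 0.4, a low exam score can be partly offset by consistent homework results, as in the example scores.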
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press.
- Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2017). Data Mining: Practical Machine Learning Tools and Techniques (4th ed.). Cambridge, MA: Morgan Kaufmann. 654 pp. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=1214611