
Single-cell RNA-seq: Optimal Feature Selection Methods Using Scmap to Build Accurate Projections of Gene Expression Data Sets

Student: Maslov Ivan

Faculty: International Laboratory for Applied Network Research

Educational Programme: Applied Statistics with Network Analysis (Master)

Year of Graduation: 2020

Single-cell RNA-seq (scRNA-seq) is widely used to study the composition of complex tissues, since the technology allows researchers to define cell types using unsupervised clustering of the transcriptome. Nonetheless, due to differences in experimental methods and computational analyses, it is frequently challenging to directly match the cells identified in two different experiments. Scmap is a method for projecting cells from an scRNA-seq experiment onto the cell types or individual cells identified in a different experiment. In this study, we investigate several questions posed by Professor Blaz Zupan (personal communication): 1) what is the actual contribution of the feature selection approach to the nearest neighbor method; would replacing the ad hoc feature selection approach proposed in scmap with a more standard feature selection method improve or worsen the results? 2) is the proposed kNN good because it is good at identifying cases that should not be classified (a reject option), or because it is simply a great classifier? 3) would the same classification approach work on other data sets, say, from the field of gene expression (microarray) studies?

The results of the study are:
1. Feature selection methods allow classifiers to be more precise because they decrease the noise in the data. Replacing the proposed dropout-based method with Boruta worsened the results in the range of 0 to 200 features, but showed promising results as the number of features increased. Recursive Feature Elimination reduced the overall results. The Genetic Algorithm gave results similar to those of the proposed dropout-based method, so it can be used as an alternative feature selection method.
2. kNN proved to be a far superior classifier to the combination of random forest and recursive feature elimination, which confirms that it is a great classifier on its own and that the dropout-based feature selection method merely enhances its performance.
3. Unfortunately, scmap does not work well with data from microarray gene expression studies, because the gene selection step (which is critical for good results) is based on the dropout characteristic of scRNA-seq data. As this property is not a major feature of bulk (microarray) data, scmap did not produce any comprehensive results on it.
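The "reject option" raised in the second question can be illustrated with a minimal sketch of a nearest neighbor classifier that refuses to label a cell when even its best match is too dissimilar. This is not the scmap implementation or the code used in the thesis: the function names, the use of cosine similarity, the toy expression vectors, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_with_reject(query, reference, labels, k=3, threshold=0.7):
    """Majority vote among the k most similar reference cells.

    Returns "unassigned" when the best similarity is below `threshold`
    (a hypothetical cutoff, not a value taken from the thesis).
    """
    scored = sorted(
        ((cosine_similarity(query, ref), lab) for ref, lab in zip(reference, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = scored[:k]
    if not top or top[0][0] < threshold:
        return "unassigned"  # reject: no reference cell is similar enough
    votes = Counter(lab for _, lab in top)
    return votes.most_common(1)[0][0]


# Toy reference data set: 4 cells x 3 selected genes (made-up values).
reference = [
    [5.0, 0.1, 0.0],
    [4.0, 0.0, 0.2],
    [0.0, 3.0, 0.1],
    [0.1, 4.0, 0.0],
]
labels = ["alpha", "alpha", "beta", "beta"]

print(knn_with_reject([6.0, 0.2, 0.1], reference, labels))  # alpha
print(knn_with_reject([0.0, 0.0, 7.0], reference, labels))  # unassigned
```

The second query expresses only a gene that no reference cell type expresses, so it is rejected rather than forced into the nearest class. Separating the vote from the rejection step in this way is what makes it possible to ask whether a classifier performs well because of its neighbors or because of its refusals.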

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.
