• A
• A
• A
• ABC
• ABC
• ABC
• А
• А
• А
• А
• А
Regular version of the site
Campus inMoscow
Student
Title
Supervisor
Faculty
Educational Programme
Alena Lixonosova
Reconstruction of symbol sequences from observed subsequences and its application to genome assembly
School of Applied Mathematics and Information Science
Bachelor’s programme
2014
The graduate work deals with the problem of reconstruction of symbol sequences from their observed fragments.Some of existing theoretical formulations of this problem and the previously obtained results are concerned. For some particular cases there are effective algorithms for solving, they are studied in this work as well. Also, for problems that don’t have exact solutions in general case, some approximation algorithms are described.The Levenshtein algorithm for reconstruction of symbol sequences from their observed subsequences that can be used only for specific set of fragments are implemented in Python and its time complexity is estimated. The conclusion is made that this algorithm can hardly be used for practical purposes, because the result is correct only when the conditions of the theorem are satisfied. Although this algorithm can be applied only with certain restrictions, this is the most effective algorithm.One of the most popular practical applications of this problem is showed by the example of genome assembly. Some difficulties that arise during the genome assembly are discussed, because the given data does not satisfy conditions of the most known results.In this work an approximate algorithm for solving the problem of reconstruction from all their substrings with possible deletions where addition symbol is used, are proposed. The time complexity is estimated and the programming realization is provided. A series of experiments is made in order to estimate the effectiveness. It is shown that on the average this algorithm solve the problem correctly (reconstructs the word from which were obtained the fragments) for the set of all fragments with length more than n/2, where n is the length of the initial sequence. And for the result to be correct, the amount of deletions in fragment should be no more than twenty percent from the number of symbols in it.

Student Theses at HSE must be completed in accordance with the University Rules and regulations specified by each educational programme.

Summaries of all theses must be published and made freely available on the HSE website.

The full text of a thesis can be published in open access on the HSE website only if the authoring student (copyright holder) agrees, or, if the thesis was written by a team of students, if all the co-authors (copyright holders) agree. After a thesis is published on the HSE website, it obtains the status of an online publication.

Student theses are objects of copyright and their use is subject to limitations in accordance with the Russian Federation’s law on intellectual property.

In the event that a thesis is quoted or otherwise used, reference to the author’s name and the source of quotation is required.