Year of Graduation
Reconstruction of symbol sequences from observed subsequences and its application to genome assembly
School of Applied Mathematics and Information Science
The graduate work deals with the problem of reconstruction of symbol sequences from their observed fragments.Some of existing theoretical formulations of this problem and the previously obtained results are concerned. For some particular cases there are effective algorithms for solving, they are studied in this work as well. Also, for problems that don’t have exact solutions in general case, some approximation algorithms are described.The Levenshtein algorithm for reconstruction of symbol sequences from their observed subsequences that can be used only for specific set of fragments are implemented in Python and its time complexity is estimated. The conclusion is made that this algorithm can hardly be used for practical purposes, because the result is correct only when the conditions of the theorem are satisfied. Although this algorithm can be applied only with certain restrictions, this is the most effective algorithm.One of the most popular practical applications of this problem is showed by the example of genome assembly. Some difficulties that arise during the genome assembly are discussed, because the given data does not satisfy conditions of the most known results.In this work an approximate algorithm for solving the problem of reconstruction from all their substrings with possible deletions where addition symbol is used, are proposed. The time complexity is estimated and the programming realization is provided. A series of experiments is made in order to estimate the effectiveness. It is shown that on the average this algorithm solve the problem correctly (reconstructs the word from which were obtained the fragments) for the set of all fragments with length more than n/2, where n is the length of the initial sequence. And for the result to be correct, the amount of deletions in fragment should be no more than twenty percent from the number of symbols in it.