Goal of research
The goal of the research project carried out by the Laboratory of Bioinformatics is to study the role of DNA secondary structures in genome function. Of particular interest are the relationship between DNA secondary structures and the epigenetic code, the role of DNA secondary structures in chromatin organization, and DNA-protein interactions.
The objectives are: building machine learning models that recognize DNA secondary structures and the patterns of their association with other functional genomic elements; building machine learning models that predict the spatial organization of the genome; and updating the database and analyses of DNA-protein interactions.
Empirical base of research
The research was conducted in silico, by means of computational experiments.
Data from open international consortium projects were used: ENCODE, Roadmap Epigenomics, and The Cancer Genome Atlas.
The NPIDB database of nucleic acid-protein interactions (http://npidb.belozersky.msu.ru/) was used; it contains structures of DNA-protein and RNA-protein complexes. NPIDB was developed by, and is the intellectual property of, Sergei Spirin, a member of the laboratory.
During the project, software modules for data analysis were developed; they are publicly available and are products of the laboratory.
Results of research
A deep learning neural network model for recognizing Z-DNA sites in the human genome was developed and tested on the human genome.
A deep learning neural network model for recognizing quadruplexes in the human genome was developed and tested on the human genome.
A deep learning neural network model was developed to determine the functional role of quadruplexes using mutation maps.
Patterns of association of quadruplexes with histone marks in brain tissues and stem cells were identified and characterized. Deep learning neural network models for recognizing common and tissue-specific patterns were constructed.
Differentially methylated quadruplexes (G4s) associated with development and sexual differentiation were identified. Enrichment analysis of the differentially methylated G4s indicated that G4-based regulation may participate in a number of biological processes, such as cell differentiation, cytoskeleton organization and extracellular matrix.
A deep learning neural network model was constructed to determine the boundaries of topologically associating domains.
Breakpoint regions in the genomes of cancer patients were examined for the presence of DNA secondary structures. Machine learning methods were used to study how various factors (DNA secondary structures, epigenetic factors, and transcription factors) influence the probability of break formation in cancer genomes. Twelve DNA regions containing clusters of mutations from different patients were identified.
The physicochemical and structural properties of stem-loop structures at the ends of transposons in the human genome were studied using machine learning methods. Structures at the ends of pseudogenes were detected. Structural similarity between the ends of transposons, mRNA and pseudogenes was shown by machine learning methods.
Species-specific DNA regulatory elements in brain tissue were identified. Human-specific acetylation peaks were determined. Machine learning methods were used to distinguish human-specific from non-specific genome regions, and the sequences that contribute most to the classification were obtained.
For 10 families of transcription factors, the DNA shape parameters in the region of contact with the protein were calculated. Important parameters, by which some families differ significantly from the rest, were identified.
A C# program that uses the X3DNA and Curves+ packages was written for the cross-platform .NET Core framework. Using this program, the parameters of the DNA double helix were calculated for 4475 protein domains from 81 families. The program is suitable for further use on large data sets. The results are expected to be integrated into the NPIDB database.
Mutational signatures were constructed and analyzed for the genomes of 55 Brucella strains belonging to nine different species. Species-specific mutational signatures were identified.
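As an illustration of what constructing a mutational signature can involve, the following is a minimal sketch in Python. It assumes the widely used convention of classifying each single-base substitution by its trinucleotide context, normalized to the pyrimidine strand; the report does not state which classification the laboratory used, so this convention, the function names, and the input format (a list of context and alternative-base pairs) are all assumptions made for illustration only.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_substitution(context: str, alt: str) -> str:
    """Label a substitution by its trinucleotide context.

    `context` is the reference trinucleotide with the mutated base in the
    middle; `alt` is the new base. Purine references (A, G) are converted
    to the opposite strand so that the reference base is always C or T
    (an assumed, commonly used normalization).
    """
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def mutation_spectrum(mutations):
    """Return the relative frequency of each substitution class.

    `mutations` is an iterable of (context, alt) pairs (hypothetical
    input format).
    """
    counts = Counter(classify_substitution(c, a) for c, a in mutations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Toy input, not real Brucella data.
    toy = [("ACG", "T"), ("CGT", "A"), ("TTA", "C")]
    for label, freq in sorted(mutation_spectrum(toy).items()):
        print(label, round(freq, 3))
```

Spectra computed this way for each strain or species could then be compared to look for species-specific differences; the comparison method itself is not specified in the report.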
Level of implementation, recommendations on implementation or outcomes of the implementation of the results
The developed machine learning models can be used to annotate the genomes of different species with DNA secondary structures.
DNA secondary structures have the potential to serve as therapeutic targets in the treatment of various diseases.