Development of the Universal Phonetic Algorithm for IPA

Student: Kostyanitsyna Anastasia

Educational Programme: Fundamental and Computational Linguistics (Bachelor)

Year of Graduation: 2019

The idea of comparing words by their pronunciation rather than their spelling has long been discussed in the scientific community, and it led to the development of phonetic algorithms. A phonetic algorithm compares words based on how they are pronounced: it transforms strings written in graphemes into their phonetic representations and compares those representations using a distance metric. Although such algorithms have been described before, many of them were designed for a specific research task, for example, for a single language. This paper introduces the universal phonetic algorithm for IPA, an algorithm and its software implementation that calculate distances between word transcriptions. The universality of the algorithm is achieved by using the International Phonetic Alphabet (IPA). The IPA is a phonetic notation system created by the International Phonetic Association that uses a set of characters based on the Latin alphabet to represent the sounds of human spoken languages. The algorithm converts each IPA character in a transcription into a vector of phonological features that describes the sound. The features are taken from the phonological system of Chomsky and Halle. The distance between transcriptions is calculated using a modified version of the Levenshtein distance, in which the substitution cost is equal to the distance between the feature vectors of the two symbols. In addition to comparing phonetic transcriptions, the program can automatically transform strings into phonetic representations. The result of the study is an algorithm and its implementation as a package for the Python programming language. The package can be used to compare phonetic strings, a task that is widely used in clustering languages and dialects. Since it covers a wide range of phonetic symbols and phonological features, it is suitable for a wide variety of research data.
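The modified Levenshtein distance described above can be illustrated with a minimal sketch. This is not the thesis package: the feature table `FEATURES`, the function names `segment_distance` and `phonetic_distance`, the choice of a normalized Hamming distance between vectors, and the fixed insertion/deletion cost are all assumptions made for illustration. The six binary features are a toy subset loosely inspired by Chomsky and Halle's system, and each character is treated as one segment, so multi-character IPA segments (affricates, diacritics) would need a tokenizer first.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical toy feature table (NOT the feature set used in the thesis).
# Order: (syllabic, voice, nasal, continuant, labial, coronal)
FEATURES: Dict[str, Tuple[int, ...]] = {
    "a": (1, 1, 0, 1, 0, 0),
    "p": (0, 0, 0, 0, 1, 0),
    "b": (0, 1, 0, 0, 1, 0),
    "m": (0, 1, 1, 0, 1, 0),
    "t": (0, 0, 0, 0, 0, 1),
    "d": (0, 1, 0, 0, 0, 1),
    "n": (0, 1, 1, 0, 0, 1),
    "s": (0, 0, 0, 1, 0, 1),
    "z": (0, 1, 0, 1, 0, 1),
}


def segment_distance(a: str, b: str) -> float:
    """Share of features on which two segments differ, in the range [0, 1]."""
    va, vb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(va, vb)) / len(va)


def phonetic_distance(s: Sequence[str], t: Sequence[str],
                      indel_cost: float = 1.0) -> float:
    """Levenshtein distance whose substitution cost is the feature distance.

    Insertions and deletions cost `indel_cost` (an assumed constant).
    """
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost,                   # delete a
                curr[j - 1] + indel_cost,               # insert b
                prev[j - 1] + segment_distance(a, b),   # substitute a -> b
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(phonetic_distance("pat", "bad"))  # two voicing substitutions
    print(phonetic_distance("pat", "pa"))   # one deletion
```

With this weighting, "pat" and "bad" come out closer than "pat" and "pa", because two segments that differ in a single feature cost less to substitute than one segment costs to delete. In the plain Levenshtein distance the ordering would be reversed.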

