Year of Graduation
Development of the Universal Phonetic Algorithm for IPA
Fundamental and Computational Linguistics
The idea of comparing words by their pronunciation rather than their spelling has long been discussed in the scientific community, and it led to the development of phonetic algorithms. A phonetic algorithm compares words based on how they are pronounced: it transforms strings written in graphemes into phonetic representations and compares those representations using a distance metric. Although such algorithms have been described before, many of them were designed for a specific research task, for example, for a single language. This paper introduces the universal phonetic algorithm for IPA, an algorithm and its software implementation that calculates distances between word transcriptions. The universality of the algorithm is achieved by using the International Phonetic Alphabet (IPA), a phonetic notation system created by the International Phonetic Association that uses a set of characters based on the Latin alphabet to represent the sounds of human spoken languages. The algorithm converts each IPA character of a transcription into a vector of features that describes the sound phonetically. For this purpose, we have chosen the phonological feature system of Chomsky and Halle. The distance between transcriptions is calculated using a modified version of the Levenshtein distance, in which the substitution cost is equal to the distance between the feature vectors of the two symbols. In addition to comparing phonetic transcriptions, the program can automatically transform strings into phonetic representations. The result of the study is the algorithm and its implementation as a Python package. The package can be applied to the comparison of phonetic strings, a task widely used in clustering languages and dialects. Since it covers a wide range of phonetic symbols and phonological features, it is suitable for different kinds of research data.
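The modified Levenshtein distance described above can be illustrated with a minimal Python sketch. It is not the package's actual code or API: the feature table FEATURES, the five binary features, their values, and the function names substitution_cost and phonetic_distance are all assumptions made for illustration. The sketch also assumes that each character of the input is a single IPA symbol, that the distance between feature vectors is the share of features that differ, and that insertions and deletions cost 1.

```python
# Hypothetical toy feature table (NOT the package's real feature set).
# Each symbol maps to binary features: (voice, nasal, continuant, labial, coronal).
FEATURES = {
    "p": (0, 0, 0, 1, 0),
    "b": (1, 0, 0, 1, 0),
    "m": (1, 1, 0, 1, 0),
    "t": (0, 0, 0, 0, 1),
    "d": (1, 0, 0, 0, 1),
    "n": (1, 1, 0, 0, 1),
    "s": (0, 0, 1, 0, 1),
    "z": (1, 0, 1, 0, 1),
    "a": (1, 0, 1, 0, 0),
}


def substitution_cost(a, b):
    """Share of features on which two symbols differ (0.0 to 1.0)."""
    va, vb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(va, vb)) / len(va)


def phonetic_distance(s, t, indel_cost=1.0):
    """Levenshtein distance with a feature-based substitution cost.

    Assumes every character of s and t is one symbol present in FEATURES.
    """
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel_cost,                    # delete a from s
                curr[j - 1] + indel_cost,                # insert b into s
                prev[j - 1] + substitution_cost(a, b),   # substitute a with b
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(phonetic_distance("pat", "bat"))
    print(phonetic_distance("pat", "mat"))
```

In this toy table, "p" and "b" differ only in voicing, while "p" and "m" differ in voicing and nasality, so "pat" comes out closer to "bat" than to "mat". A plain Levenshtein distance would assign both pairs the same cost of 1.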