Scientists Improve Voiceprint Collection

Researchers from HSE University and Nizhny Novgorod State Linguistic University (LUNN) have developed a new method for ensuring quality in automatic voice recording. Resistant to background noise of 10 dB and higher, the new algorithm can operate in real time, making it possible to use the software to collect voice biometrics for a wide variety of purposes. The article was published in the journal Measurement Techniques.

Speech recognition technologies have been the focus of much research and development over the past decade, and significant progress has been made. This is evident in the rising popularity of voice assistants such as Siri. According to a forecast by the British analysts at Juniper Research, the number of digital voice assistants in use will reach almost 8 billion by 2023, up from 2.5 billion in 2018.

These technologies are attractive not only to mobile app developers but also to companies, such as call centres and banks, that verify their customers over the phone. However, there are numerous obstacles to the widespread adoption of voice identification systems. One of them is the poor quality of voice reference templates: from time to time, recognition algorithms reject a genuine user because the voiceprint template contains noise.

The problem is that voice biometric data is collected in offices, where there is usually a lot of background noise. Since even a pencil tapping on a desk might prevent the algorithm from identifying a speaker’s voice, it is essential to detect recordings with ambient noise while the voiceprint is being collected. The new method proposed by Professor Andrey Savchenko (HSE University) and Professor Vladimir Savchenko (LUNN) can reduce the rate of such errors to 2%.

“Companies are interested in having preventive tools at their disposal, for instance, a system that automatically detects a poor-quality recording before the client leaves the office. With this in mind, our goal is to develop an effective method that can process sound in real time on any device, from a cheap smartphone to a laptop or an office computer,” notes Professor Andrey Savchenko (HSE University).

To this end, the researchers proposed an algorithm that splits recorded speech into short frames and measures the pitch frequency in each of them. Their software assesses how stable the pronunciation is relative to its average level and displays the measured speech quality over time as a colour chart.
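The framing and per-frame pitch measurement described above can be illustrated with a short sketch. The article does not name the researchers’ pitch estimator, so this example swaps in a simple autocorrelation method; the function names, the 30 ms frame length and the 60 to 400 Hz search range are all hypothetical choices for illustration, not the authors’ published parameters.

```python
import numpy as np

# Hypothetical parameters: the article does not state the frame length or pitch range.
FRAME_MS = 30.0
F_MIN, F_MAX = 60.0, 400.0


def split_into_frames(signal: np.ndarray, sample_rate: int,
                      frame_ms: float = FRAME_MS) -> np.ndarray:
    """Cut a mono signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)


def estimate_pitch(frame: np.ndarray, sample_rate: int,
                   f_min: float = F_MIN, f_max: float = F_MAX) -> float:
    """Estimate one frame's pitch from the strongest autocorrelation peak in [f_min, f_max]."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags 0 .. len(frame) - 1.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sample_rate / best_lag


def pitch_track(signal: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return the estimated pitch frequency (Hz) of every frame."""
    frames = split_into_frames(signal, sample_rate)
    return np.array([estimate_pitch(f, sample_rate) for f in frames])


if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr) / sr                  # one second of audio
    voice = np.sin(2 * np.pi * 150 * t)     # synthetic 150 Hz "voice"
    print(np.round(pitch_track(voice, sr)))  # values close to 150
```

Autocorrelation is chosen here only because it is short and easy to follow; production systems typically use more robust pitch trackers that guard against octave errors.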

Pitch frequency (PF), the fundamental frequency of the voice, is a distinctive characteristic of human speech. It fluctuates, rising or falling depending on the speaker’s emotional state.

The system treats the initial part of a recording as a template and assigns it 100% quality. If the pitch frequencies estimated for the subsequent speech frames remain roughly stable, the recording is judged to be of good quality; if the values vary widely, the recording is considered faulty. Such faults may be caused, for example, by an interfering voice with a different pitch frequency.
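One plausible reading of this template-based check is sketched below, starting from a sequence of per-frame pitch estimates. The linear scoring rule, the colour bands and every threshold (`N_TEMPLATE`, `TOLERANCE`, `MIN_QUALITY`, `MAX_BAD_SHARE`) are hypothetical assumptions chosen for illustration; the article does not publish the authors’ formula.

```python
import numpy as np

# Hypothetical parameters: the article does not publish its thresholds.
N_TEMPLATE = 10       # initial frames treated as the 100%-quality template
TOLERANCE = 0.25      # relative pitch deviation at which quality drops to 0%
MIN_QUALITY = 50.0    # frames below this are counted as bad
MAX_BAD_SHARE = 0.1   # share of bad frames a good recording may contain


def frame_quality(pitches: np.ndarray, n_template: int = N_TEMPLATE,
                  tolerance: float = TOLERANCE) -> np.ndarray:
    """Score each frame (0-100%) by how far its pitch strays from the template mean."""
    template_mean = pitches[:n_template].mean()
    deviation = np.abs(pitches - template_mean) / template_mean
    quality = np.clip(1.0 - deviation / tolerance, 0.0, 1.0) * 100.0
    quality[:n_template] = 100.0  # the template itself is awarded 100% quality
    return quality


def is_recording_ok(quality: np.ndarray, min_quality: float = MIN_QUALITY,
                    max_bad_share: float = MAX_BAD_SHARE) -> bool:
    """Accept the recording if only a small share of frames has low quality."""
    return bool(np.mean(quality < min_quality) <= max_bad_share)


def quality_colour(q: float) -> str:
    """Map a quality value to a colour for a time chart (hypothetical bands)."""
    return "green" if q >= 80 else "yellow" if q >= MIN_QUALITY else "red"


if __name__ == "__main__":
    stable = np.array([150.0, 152.0, 148.0] * 10)              # one steady speaker
    interfered = np.concatenate([stable, np.full(20, 250.0)])  # second voice at 250 Hz
    print(is_recording_ok(frame_quality(stable)))      # True
    print(is_recording_ok(frame_quality(interfered)))  # False
```

Measuring deviation relative to the template mean, rather than in absolute hertz, keeps the same tolerance meaningful for both low- and high-pitched speakers.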

A major Russian bank is interested in the development and has already provided 30 recordings from its database for initial testing. The software’s assessments matched those of the people who check recording quality in 93.3% of cases.
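For a sense of scale, the reported agreement can be checked arithmetically. This is an inference from the two figures given, not a number stated in the article: with 30 test recordings, 93.3% is the only rounded share that corresponds to 28 matching verdicts, since 27/30 = 90.0% and 29/30 ≈ 96.7%. The helper name below is hypothetical.

```python
def agreement_rate(matches: int, total: int) -> float:
    """Share of recordings where the software's verdict matched the human reviewers'."""
    return matches / total


if __name__ == "__main__":
    print(f"{agreement_rate(28, 30):.1%}")  # prints 93.3%
```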

Date

September 30, 2019
