Researchers at the Hungarian Research Centre for Linguistics Have Automated the Transfer of a Database of Hungarian Prisoners of Soviet Camps, Published in February This Year, to the Hungarian Language

22.03.2021

After several decades, Russia finally provided Hungary with data on Hungarian prisoners of war and civilian deportees in 2019. Once the data of approximately 682,000 persons had been processed, the database operated by the National Archives of Hungary was made available on February 25 this year. The database can now be considered complete, making it a significant resource for research. It is also a significant resource for the general public, as anyone interested can now access the available information to find family members and loved ones who were imprisoned in Soviet camps. The automated transfer of the Cyrillic database to Hungarian was carried out by researchers of the ELKH Hungarian Research Centre for Linguistics (NYTK), under the leadership of Bálint Sass.

In 2019, the National Archives of Hungary purchased from the Russian State Military Archives a digitized, scanned image of cards that contained the basic details of approximately 682,000 Hungarian prisoners of war and deported civilians, as well as the database produced from these cards. The cards and database contain the key details of the person concerned: the surname and first name of the person registered as a prisoner, their paternal first name (as patronymic is customary in Russia), rank, place and date of birth, place and date of imprisonment, date of departure and camp they were discharged from, as well as the date of death, if the person lost their life.

Researchers at the Hungarian Research Centre for Linguistics have automated the transfer of a database of Hungarian prisoners of Soviet camps

Naturally, everything on the cards is written in Cyrillic, not only the information in the Russian language, but also in Hungarian, such as the surname, first name, and some elements of the geographical locations – such as the place of birth and site of captivity. A linguistic issue arose during processing of this data. The Hungarian-language personal details that had been dictated by the Hungarian prisoners were made available in Cyrillic, in the format that the recording soldier – usually a Russian – had written down after hearing these details. In addition, the data were further distorted when, during the 2010s, Russian workers compiled a database on the basis of these cards. The problem was they were entering data based on a 70-year-old handwritten Cyrillic version of Hungarian words – a language that they did not understand.

Most of the data processing work, including manual translations, were done by the staff of the National Archives of Hungary. The automated Russian-Hungarian transcription and restoration of personal names and geographic locations were the tasks performed by the staff of NYTK under the leadership of Bálint Sass, based on verification and feedback from the Archives. The task was to implement the transcription of, for example, “Ковач Йожеф → Kovács József”. The difficulty was that, due to distortions, letter-to-letter matching provided the correct solution in only the rarest of cases. There were many cases which were difficult to produce an algorithm for, such as: Цилбауер → Zielbauer (a surname), Дейло → Béla (a first name), Саотморской → Szatmár, Гонграмеде → Csongrád (county names), or Кишкупфьилстьгаза → Kiskunfélegyháza (a city). In many cases, there were several possible solutions that were all equally likely. In this case, it was no longer possible or worthwhile to select them in an automated way. For example: Эрин → Ernő; Ervin; Erik (a first name).

Details of the works can be found in a lecture given at this year's Conference on Hungarian Computational Linguistics and in the related publication, as well as in a lecture given at the 2020 Hungarian Science Festival. The automatic transcription and restoration tool can be found on github. All are available in Hungarian.

It is worth watching the edition of the television show Ez itt a kérdés (This is the question – in Hungarian) on February 22, 2021, starting at the 13th minute. In the archive footage played in the show, a former prisoner of war recalls how much it all depends on whether a person is listed as Hegyi or possibly Gegyi – as the h-g swap is one of the most typical misspellings. This short excerpt illustrates the linguistic problem that the staff of the Hungarian Research Centre for Linguistics undertook to address.The freely searchable public database, which opened on February 25, 2021, the Memorial Day for the Victims of Communism, is available – currently in Hungarian – on the website of the National Archives of Hungary.