A Transformer-based Parser for Syriac Morphology

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

  • Martijn Naaijer
  • Constantijn Sikkel
  • Mathias Coeckelbergs
  • Jisk Attema
  • Willem Van Peursen
In this project we train a Transformer-based model from scratch, with the goal of parsing the morphology of Ancient Syriac texts as accurately as possible. Syriac is a low-resource language, so only a relatively small training set was available. We therefore expanded the training set with Biblical Hebrew data. Five experiments were run: the model was trained on Syriac data only, on mixed Syriac and (un)vocalized Hebrew data, and first on (un)vocalized Hebrew data and then further on Syriac data. The models trained on both Hebrew and Syriac data consistently outperform the models trained on Syriac data only. This shows that the differences between Syriac and Hebrew are small enough that adding Hebrew data is worthwhile when training a model to parse Syriac morphology. Training models on data from multiple languages is an important trend in NLP, and we show that it works well even for the relatively small Syriac and Hebrew datasets used here.
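The transfer effect described in the abstract can be illustrated with a toy sketch (this is an assumption for illustration, not the authors' Transformer setup): a tiny suffix-based tagger whose statistics can be pooled across corpora, showing how Hebrew data can cover morphological patterns that are missing from a small Syriac-only training set. All word forms, tags, and function names below are illustrative placeholders.

```python
from collections import Counter, defaultdict

def train_tagger(corpus, model=None):
    """Count word-final characters per morphological tag.
    Passing a previous `model` continues training from it,
    mimicking the pretrain-then-fine-tune regime."""
    model = model if model is not None else defaultdict(Counter)
    for word, tag in corpus:
        model[word[-1]][tag] += 1
    return model

def tag(model, word):
    """Predict the most frequent tag seen with the word's final character."""
    suffix = word[-1]
    if model[suffix]:
        return model[suffix].most_common(1)[0][0]
    return "UNK"

# Illustrative transliterated data; the two languages share Semitic suffixes.
syriac = [("ktbt", "PERF-2MS"), ("ktbw", "PERF-3MP")]
hebrew = [("qtlt", "PERF-2MS"), ("qtlw", "PERF-3MP"), ("qtln", "IMPF-3FP")]

baseline = train_tagger(syriac)                        # Syriac only
transfer = train_tagger(syriac, train_tagger(hebrew))  # Hebrew first, then Syriac

print(tag(baseline, "nqtln"))  # "UNK": the -n suffix is unseen in Syriac-only data
print(tag(transfer, "nqtln"))  # "IMPF-3FP": pattern recovered from the Hebrew data
```

The same mechanism is what the mixed-data experiments exploit: statistics from the closely related language fill gaps in the low-resource one, here reduced to a minimal counting model.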
Original language: English
Title: Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023)
Number of pages: 7
Place of publication: Varna, Bulgaria
Publication date: 2023
Pages: 23-29
ISBN (print): 978-954-452-087-8
Status: Published - 2023
