Title: Language Representation Models for Low- and Medium-Resource Languages
Author: Daðason, Jón Friðrik
Contributor: Loftsson, Hrafn
Dates: 2025-11-17 (accessioned); 2025-11-17 (available); 2025 (issued)
Citation: Daðason, J F 2025, 'Language Representation Models for Low- and Medium-Resource Languages', Doctor, Reykjavik University, Reykjavík.
ISBN: 978-9935-539-78-6; 978-9935-539-79-3
URI: https://hdl.handle.net/20.500.11815/5957
Language: English (en)
Access: Restricted access (info:eu-repo/semantics/restrictedAccess)
Keywords: Natural language processing; Language models; Text filtering; Multilingual models
Type: Doctoral thesis

Abstract:
Transformer-based language models have proven to be extremely effective for a wide variety of natural language understanding tasks, including question answering, automatic text summarization, and sentiment analysis. These models are typically pre-trained on large, unannotated corpora using self-supervised tasks such as masked token prediction, often requiring weeks or months of training, and are then fine-tuned on practical tasks, which requires substantially less time and data by comparison. Since their introduction, Transformer models have grown exponentially in size, from approximately 100 million parameters in 2018 to over 600 billion in 2024, with the largest pre-training corpora growing from around 800 million tokens to over 14.8 trillion. However, many low- and medium-resource languages lack the extensive datasets and computational resources required to pre-train language models at this scale. Data-efficient pre-training techniques are therefore crucial for making effective use of the limited resources available for these languages.

In this thesis, we investigate various data-efficient pre-training strategies and evaluate their impact on downstream tasks in six low- to medium-resource languages: Icelandic, Estonian, Basque, Galician, Nepali, and Tajik. First, we analyze several text quality filtering techniques for discarding noisy data from web-crawled corpora. We propose a novel, language-independent filtering approach based on unsupervised clustering and outlier detection algorithms that achieves performance comparable to a rule-based approach. Second, we explore the effects of augmenting monolingual pre-training corpora with text from related and unrelated languages, as well as Python code, finding significant improvements in downstream performance on certain tasks for larger models. Our results support the hypothesis that linguistic similarity facilitates cross-lingual transfer. Finally, we compare several subword tokenization algorithms and evaluate their impact on downstream results when used in pre-trained language models. Our analysis reveals that the Unigram algorithm consistently yields the best results on downstream tasks, and that a vocabulary size of 64k outperforms smaller vocabularies by a statistically significant margin.

Our findings demonstrate that data-efficient pre-training techniques can substantially improve the performance of language models for low- and medium-resource languages. By optimizing the use of available data and resources, we achieve statistically significant improvements on downstream tasks under data-constrained conditions, paving the way for more effective natural language processing in resource-constrained settings. We release several datasets and tools compiled and developed during the work on this thesis, as well as multiple pre-trained Transformer-based language models.
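The masked token prediction objective mentioned in the abstract can be illustrated with a minimal sketch: a fraction of input tokens is selected as prediction targets, and the model must recover them from the corrupted sequence. The BERT-style 80/10/10 corruption scheme, the -100 ignore label, and the helper below are illustrative assumptions, not details taken from the thesis.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Build (inputs, labels) pairs for masked token prediction.

    Roughly mask_prob of the positions become prediction targets; of those,
    80% are replaced with the mask token, 10% with a random token, and 10%
    are left unchanged (the BERT-style scheme assumed here). Labels of -100
    mark positions that the loss should ignore (PyTorch convention).
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)
    return inputs, labels
```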
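The language-independent filtering approach is described only at a high level in the abstract (unsupervised clustering and outlier detection over web-crawled text). The sketch below shows one way such a filter could be assembled from simple document-level statistics and scikit-learn's IsolationForest; the feature set, the choice of algorithm, and the contamination parameter are assumptions for illustration, not the thesis's actual pipeline.

```python
# Illustrative outlier-based quality filter for a web-crawled corpus.
import numpy as np
from sklearn.ensemble import IsolationForest

def features(doc: str) -> list[float]:
    """Language-independent document statistics (assumed feature set)."""
    words = doc.split()
    n_chars = max(len(doc), 1)
    n_words = max(len(words), 1)
    return [
        len(words) / n_chars,                               # word density
        sum(c.isalpha() for c in doc) / n_chars,            # alphabetic ratio
        sum(c.isdigit() for c in doc) / n_chars,            # digit ratio
        len(set(words)) / n_words,                          # type/token ratio
        float(np.mean([len(w) for w in words])) if words else 0.0,  # mean word length
    ]

def filter_corpus(docs: list[str], contamination: float = 0.1) -> list[str]:
    """Keep documents judged to be inliers; drop detected outliers."""
    X = np.array([features(d) for d in docs])
    keep = IsolationForest(contamination=contamination, random_state=0).fit_predict(X)
    return [d for d, k in zip(docs, keep) if k == 1]  # 1 = inlier, -1 = outlier
```

In this setup, "noisy" documents (boilerplate, markup residue, lists of numbers) tend to fall far from the bulk of the feature distribution and are discarded without any language-specific rules.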
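The tokenization finding (Unigram with a 64k vocabulary performing best downstream) can be reproduced in spirit with the SentencePiece library, which provides an implementation of the Unigram algorithm. The corpus path, character_coverage setting, and example sentence below are placeholders; the thesis abstract does not specify this particular toolkit or configuration.

```python
# Train a Unigram subword tokenizer with a 64k vocabulary (illustrative settings).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # one sentence or document per line (hypothetical path)
    model_prefix="unigram_64k",
    model_type="unigram",
    vocab_size=64_000,
    character_coverage=0.9995,     # a common setting for alphabetic scripts
)

sp = spm.SentencePieceProcessor(model_file="unigram_64k.model")
# "Dæmi um íslenskan texta." = "An example of Icelandic text."
print(sp.encode("Dæmi um íslenskan texta.", out_type=str))
```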