Bibliography

Meelen, Marieke, and David Willis, “Towards a historical treebank of Middle and Early Modern Welsh, part I: workflow and POS tagging”, Journal of Celtic Linguistics 22 (2021): 125–154.

  • journal article
Citation details
Article
“Towards a historical treebank of Middle and Early Modern Welsh, part I: workflow and POS tagging”
Periodical
Journal of Celtic Linguistics 22 (2021), University of Wales Press.
Volume
22
Pages
125–154
Description
Abstract (cited)

This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for automatic exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts in a way comparable to similar tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. In this paper, the two first stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging are discussed in some detail, focusing in particular on major issues involved in defining word boundaries and in defining a robust and useful tagset.

Subjects and topics