Meelen, Marieke, and David Willis, “Towards a historical treebank of Middle and Early Modern Welsh, part I: workflow and POS tagging”, Journal of Celtic Linguistics 22 (2021): 125–154.
- journal article
This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, comparable to tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. The first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, with particular attention to the major issues involved in defining word boundaries and in defining a robust and useful tagset.
page url: https://codecs.vanhamel.nl/Meelen_and_Willis_2021_jceltl22ore
redirect: https://codecs.vanhamel.nl/Special:Redirect/page/58585
numerical alternative: https://codecs.vanhamel.nl/index.php?curid=58585
page ID: 58585
page ID tracker: https://codecs.vanhamel.nl/index.php?title=Show:ID&id=58585