- s. xx / s. xxi
This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for the automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, in a way comparable to tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. The first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, with particular attention to the major issues involved in defining word boundaries and a robust and useful tagset.
page url: https://codecs.vanhamel.nl/Id:Meelen_(Marieke)