Mittendorf, Ingo, and David Willis, Corpws hanesyddol yr iaith Gymraeg 1500–1850 = A historical corpus of the Welsh language, 1500–1850, Online: University of Cambridge, 2004–.
The corpus is a planned corpus that aims to reflect the rich diversity of the texts attested in Welsh during the period 1500–1850 by including texts and samples of texts from different stylistic levels and of varying geographical provenance. A number of the texts included are not available in adequate modern editions or are available only in modernised form; the corpus therefore also makes a number of texts easily accessible for the first time. It is hoped that this will encourage further linguistic, literary and historical research on these texts.
The corpus is encoded using Extensible Markup Language (XML) in a format that conforms to the standards of the Text Encoding Initiative (TEI). This should ensure its long-term preservation, and also allows flexibility in the way the texts of the corpus can be displayed and used. The corpus files can be viewed online here, and are also available for download here in a number of formats: as plain XML files; as viewable HTML documents in two formats (diplomatic and edited); as corpus files designed for use with the Concordance software package; and as web-based indexes and concordances. Although the corpus contains no grammatical tagging, the XML files contain some encoding designed to make the corpus more useful as a source for linguistic research. This encoding mainly concerns spelling and graphical variation. Original spelling is maintained, but tagging for scribal errors and extreme orthographic variation is included, and is used in the indexes and concordances. Other editorial conventions are documented here.
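As a rough illustration of how a diplomatic and an edited reading might be derived from a single encoded file, the following Python sketch renders both from a small TEI-style fragment. It assumes that scribal errors and spelling variants are marked with the standard TEI pairs sic/corr and orig/reg inside a choice element; the corpus's own conventions may differ and should be checked against its editorial documentation. The sample sentence is invented for the example and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample, NOT from the corpus. The <choice> pairs are standard TEI
# elements, but their use in this particular corpus is an assumption.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">A '
    "<choice><orig>vu</orig><reg>fu</reg></choice> yn y "
    "<choice><sic>tyy</sic><corr>ty</corr></choice>.</p>"
)


def local_name(elem):
    """Return the tag name without any XML namespace prefix."""
    return elem.tag.split("}")[-1]


def render(elem, edited):
    """Flatten an element to text, picking one reading inside each <choice>."""
    if local_name(elem) == "choice":
        wanted = ("reg", "corr") if edited else ("orig", "sic")
        for child in elem:
            if local_name(child) in wanted:
                return render(child, edited)
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, edited))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
diplomatic = render(root, edited=False)
edited = render(root, edited=True)
print(diplomatic)
print(edited)
```

Because tag names are compared without their namespace, the same function would also work on files that declare no TEI namespace.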
The corpus is arranged into different groups of text types in order to represent the stylistic diversity of the Welsh language, while allowing for differences in the specific range of text types actually available at different periods. The texts therefore include drama, personal letters, ballads, political (didactic) prose, scripture, historical narrative, narrative prose, and religious prose. For each text a representative sample of approximately 15,000 words is included. With texts whose total length is less than around 20,000 words, and also in the case of dramatic texts (the interludes), we have generally chosen to include the entire text. Overall the corpus contains around 420,000 words from 30 texts.
page url: https://codecs.vanhamel.nl/Mittendorf_2004
numerical alternative: https://codecs.vanhamel.nl/index.php?curid=1077
page ID: 1077
page ID tracker: https://codecs.vanhamel.nl/index.php?title=Show:ID&id=1077