Reviving the Maldwyn index to Welsh poetry: working document

Maldwyn used to be an online database created by the Centre for Advanced Welsh and Celtic Studies and the National Library of Wales with the aim of recording "every Welsh poem found in a manuscript written before 1830", which is no mean feat. For security reasons, the old website was taken offline, but the underlying data have been made available as an Excel file under a CC-BY 4.0 license, which basically means that the material can be (re-)used and shared as long as its creators are appropriately acknowledged and no new use restrictions are imposed on derivative output. CODECS has begun to prepare a new version of the database, which is intended to complement the catalogue. This page is a work in progress and will be used to outline the aims of a fresh new Maldwyn and some of the approaches being followed to get there.

Module

Aims

Complement not merge

Although it is ultimately also CODECS' aim to record every single poem, poet and manuscript, it would not be wise to try merging data from Maldwyn into the existing mechanics of the catalogue. Aside from the technical challenges that a mix-and-match approach might present, Maldwyn has immense value as an authoritative resource to be referenced and consulted on its own terms. There may also be reasons for CODECS to deviate in its treatment of the material, such as whether we should treat a text as one or two different poems, whether the first line of the poem should be modernised or not, etc. Making the data available as part of CODECS’ ecosystem comes with the opportunity to set up queries wherever it is relevant to do so. If we have an entry for, say, Madog Benfras, or one of his poems, or a relevant manuscript, then we can associate it with a query on Maldwyn. And of course, the benefits of linked data work in both directions: CODECS can in turn be used to enrich searches in Maldwyn and offer additional information.

The file in a nutshell

Contains over 46,000 records
Has a simple structure, with only eleven columns: BRN (identification number), First line, Last line, Poet (attribution), Metre, Type, Manuscript references, Print references, Subject (name), Subject (other), Subject (place). Where needed, a pipe symbol is used to delimit an array within a cell.
Is bilingual (Welsh and English), with a separate sheet for each language.
Is marred by character encoding issues.
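
As a minimal sketch of how the pipe-delimited cells might be unpacked once a row has been read from the spreadsheet, the Python snippet below splits a cell into a list of trimmed values. The function name split_cell and the dictionary standing in for a row are hypothetical; the sample value is the poet attribution quoted further down this page.

```python
def split_cell(cell: str) -> list[str]:
    """Split a pipe-delimited cell into a list of trimmed, non-empty values."""
    return [part.strip() for part in cell.split("|") if part.strip()]


# A dictionary standing in for one spreadsheet row (illustration only).
record = {"Poet": "ARTHUR | GRUFFYDD AB ADDA AP DAFYDD"}
poets = split_cell(record["Poet"])
print(poets)  # ['ARTHUR', 'GRUFFYDD AB ADDA AP DAFYDD']
```

Empty cells simply yield an empty list, which keeps later processing free of special cases.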

Implementation

Option 1

I have begun developing a MediaWiki extension that offers a couple of methods for extracting data from a modified version of the Excel spreadsheet (reasons for modifying it are explained below) along with a few supplementary spreadsheets. The aim is to allow data to be added to the database (MySQL) using Semantic MediaWiki's subobjects. It is likely that the search interface will be built using either FlexForm or Page Forms. When ready, the extension will be released as open-source software on GitHub, so that anyone running a MediaWiki-based wiki should be able to set up an instance of Maldwyn for themselves.

Option 2

I will also be looking into using an instance of Wikibase, the software behind Wikidata, to recreate the database. To be continued...

What to do?

Fix encoding

The Excel file is riddled with Unicode encoding errors of a kind that suggests they crept in during the database export, such as Ã¢ for â and Ã¼ for ü. I have been able to reverse engineer most of the incorrect characters in the "first lines" column, although cases remain where I'm less sure of the right solution.
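
The two examples given here are what one gets when UTF-8 bytes are read as Windows-1252. Assuming that this is indeed what happened during the export (an assumption, not something the file documents), a minimal Python sketch of the reversal could look as follows. The function name repair_mojibake is hypothetical, and this is not necessarily how the corrections on this project were actually made.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Windows-1252.

    Assumption: the damage is a single UTF-8 -> Windows-1252 mix-up.
    If the round trip fails, the input is returned unchanged so that
    doubtful cases can be reviewed by hand rather than guessed at.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("Ã¢"))  # â
print(repair_mojibake("Ã¼"))  # ü
```

The repair is all-or-nothing per string: a cell that mixes damaged and undamaged characters is left as it is, which is one way of isolating the cases where the right solution is less certain.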

Poets (attributions)

Example: ARTHUR | GRUFFYDD AB ADDA AP DAFYDD

Extending

I've begun keeping a spreadsheet to identify each poet, with further columns to associate them with normalised labels, CODECS IDs and whatever else may prove to be useful. [...]

Metre

[...]

Type

[...]

Manuscript references

What I would like to achieve is (a) for the user to be able to search records by manuscript, which, if I recall correctly, was not possible with the old interface; and (b) to link manuscripts to CODECS and, if possible, the wider linked data world, incl. Wikidata.

Tidying up and resolving references

Whatever the old database may have looked like, the structure of the present Excel file is not ideal for taking things further. The way manuscript attestations are recorded in the spreadsheet is by using a single cell for each record to contain an array of references. Each reference may consist of

an abbreviation, either for the repository, followed by the shelfmark/signature, or for a single manuscript. Most of the abbreviations are explained in a separate HTML table that the user would need to consult.
sometimes a lower-case roman numeral for the specific manuscript unit or volume
finally, the first folio/page/column number (the numbering system is rarely specified), although it is occasionally absent.

Some examples:

B 10314, 409 // For London, British Library, MS Additional 10314, p. 409. Fortunately, most references follow this type of format.
B 10314. 244 // Same manuscript, p. 244, using a dot not a comma
B.M. 14900, 91a // Accidentally using an alternative abbreviation for British Museum
B 54, // In this case, the reference may be to London, British Library, MS Additional 14867, but using the number assigned to the MS in Evans' "Report...".
He 113 R // Abbreviation for a single manuscript (Aberystwyth, National Library of Wales, MS 21700 = Heythrop MS), comma omitted, locus = f. 113r
Archifau C.M. Adran 772 // Not identified in the list of abbreviations
Bl.e.3. 55v // Oxford, Bodleian Library, MS Welsh e. 3, f. 55v
Bl.E.3.50 // Oxford, Bodleian Library, MS Welsh e. 3, f. 50r?
B1.e.3, 49a // Could this be an error for Bl.e.3 ?
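
To illustrate how such references might be taken apart, here is a rough Python sketch that splits a reference at its last comma and flags anything that does not fit the "identifier, locus" pattern. The names Reference, LOCUS and parse_reference are hypothetical, and the locus pattern (digits plus an optional letter) is an assumption drawn only from the examples above. Roman numerals for manuscript units are not handled.

```python
import re
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    identifier: str
    locus: Optional[str]
    regular: bool  # True only for a clean "identifier, locus" reference


# Assumed shape of a locus: digits, optionally followed by one letter (e.g. 91a, 55v).
LOCUS = re.compile(r"^\d+\s*[A-Za-z]?$")


def parse_reference(raw: str) -> Reference:
    text = raw.strip()
    identifier, comma, locus = text.rpartition(",")
    if not comma:
        # No comma at all (e.g. "B 10314. 244", "He 113 R"): flag for manual review.
        return Reference(text, None, False)
    identifier, locus = identifier.strip(), locus.strip()
    if not locus:
        # Trailing comma without a locus (e.g. "B 54,").
        return Reference(identifier, None, False)
    return Reference(identifier, locus, bool(LOCUS.match(locus)))


print(parse_reference("B 10314, 409"))
print(parse_reference("B 10314. 244"))
```

Note that a reference such as "B1.e.3, 49a" parses cleanly even though its identifier may be a typo, so a check against the list of known abbreviations would still be needed.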

A cursory glance makes clear that there is a lot to be ironed out (mistakes and inconsistencies in naming sources as well as in delineating the constituent parts of a reference) before the data can be sensibly parsed. To get a better sense of what needs to be done, I generated an imperfect list of unique identifiers based on the assumption that the string before the comma is typically the identifier. That is still the rule, but one with many exceptions. Presumably, the best course of action at this stage would be to apply a bit more rigour to the underlying data.
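
A list of this kind could be produced along the following lines. This is only a sketch in Python, not the script actually used: the function name tally_identifiers is hypothetical, it assumes that references within a cell are pipe-delimited, and the sample cells are assembled from the examples above purely for illustration.

```python
from collections import Counter


def tally_identifiers(cells):
    """Count candidate identifiers, taking the text before the first comma."""
    counts = Counter()
    for cell in cells:
        for ref in cell.split("|"):  # assumption: pipe-delimited references
            ref = ref.strip()
            if ref:
                counts[ref.split(",", 1)[0].strip()] += 1
    return counts


# Illustrative cells only, built from the example references above.
cells = [
    "B 10314, 409 | B.M. 14900, 91a",
    "B 10314. 244 | He 113 R",
    "B 10314, 409",
]
for identifier, count in tally_identifiers(cells).most_common():
    print(count, identifier)
```

Sorting by frequency tends to push irregular forms such as "B 10314. 244" to the bottom of the list, where they are easier to spot and correct.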

Extending references

I've started keeping a spreadsheet to identify each manuscript (e.g. "B 10314"), with further columns to associate them with modern labels, repositories, CODECS IDs, and comments if necessary.
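
As a sketch of how such a lookup spreadsheet could be put to use, the Python snippet below loads it from CSV and resolves an identifier to its modern label. The column names (abbreviation, label, repository, codecs_id, comment) and the function names are hypothetical, and the CODECS ID is deliberately left blank; the single sample row is taken from the first example reference above.

```python
import csv
import io

# Hypothetical layout for the lookup sheet, exported as CSV.
LOOKUP_CSV = """abbreviation,label,repository,codecs_id,comment
B 10314,"London, British Library, MS Additional 10314",British Library,,
"""


def load_lookup(handle):
    """Index the lookup rows by their abbreviation."""
    return {row["abbreviation"]: row for row in csv.DictReader(handle)}


def resolve(identifier, lookup):
    """Return the modern label for an identifier, or None if unidentified."""
    row = lookup.get(identifier)
    return row["label"] if row else None


lookup = load_lookup(io.StringIO(LOOKUP_CSV))
print(resolve("B 10314", lookup))
print(resolve("Archifau C.M. Adran 772", lookup))  # not in the sheet yet
```

Identifiers that resolve to None form a natural to-do list of manuscripts still awaiting identification.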

Print references

[...]

Subjects (names)

[...]

Subjects (other)

[...]

Subjects (places)

[...]

Additional work

[...]

Future?

[...]

Continue work on the database?

Monumental as the task of creating it has been, even Maldwyn is not comprehensive. I was not able to find Bodleian MS Cherry 14 or Balliol College MS 353, for instance, both of which contain copies of the poem beg. Ef a wnaeth Panthon. Many new editions and studies have also appeared since Maldwyn was last edited. To continue would call for a huge collaborative effort worldwide, and supporting such an effort would probably require a new digital environment in which experts can work together, but that is out of scope for this page.