Surviving Parsoid
Warning! What follows is primarily aimed at the MediaWiki community, including administrators of wikis who find themselves in similar difficulties. Much of it may be unintelligible to you if you are not familiar with some key concepts around the use of MediaWiki. Jargon like extensions, parser functions, wikitext/wiki markup, transclusion, (wiki) templates, multiple-instance templates and Semantic MediaWiki (SMW) will be thrown around a lot. Also, because this page documents an ongoing process, with new insights changing my preferred approaches along the way, it could use a bit of a rewrite.
General
Extensions like Variables and Arrays have one thing in common: they allow you to assign a value to a global variable and use it any number of times in serial fashion throughout the remainder of the wiki output. In the process, any new variable definition can incorporate output from previously defined variables, (other) parser functions and templates. Part of what makes this even more flexible is that the scope of this behaviour extends to content that is transcluded, such as template output or a wiki page embedded within a wiki page. It brings flexibility and adds a flavour of programming logic to wikitext.
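For illustration, here is a minimal pre-Parsoid sketch of that pattern using the Variables extension (the variable and parameter names are made up for the example):

<!-- Define once, reuse in serial fashion throughout the page (Variables extension): -->
{{#vardefine:author|{{{Author|Anonymous}}} }}
{{#var:author}} wrote this text.
<!-- A redefinition may incorporate the previous value: -->
{{#vardefine:author|{{#var:author}} (ed.)}}
Edited by {{#var:author}}.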
All that will change with Parsoid, seeing as it pushes for a more parallelised approach to parsing. I will not go into the details but if you are interested, this conference talk by C. Scott Ananian and Subbu Sastry available on YouTube may be a helpful introduction. Now it is not impossible that the extensions mentioned will retain some of their usefulness in a more limited capacity. Be that as it may, it does look like a lot of functionality will be lost in terms of their ability to manipulate, store and reference output from templates and parser functions.
For a rough comparison, imagine that you are writing static functions in PHP but you are no longer allowed to use variables except by way of passing them as arguments to other functions. That’s the kind of predicament we’re in.
Situations
Before moving on to solutions, let's have a brief look at the most problematic situations that are common for websites like CODECS:
Preprocess once
Variable-like constructs are heavily used to preprocess template input so that it can be used appropriately for data storage with Semantic MediaWiki (via #set and #subobject), sometimes more than once, as well as for visual presentation on the page and possibly for uses elsewhere. This includes more complex use cases in which the desired value is the outcome of sifting, comparing and combining incoming or previously modified variables.
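As a minimal sketch of the pattern (the property and parameter names are invented for the example), a value is computed once with Variables and then fed both to the storage functions and to the visible output:

<!-- Preprocess the incoming parameter once: -->
{{#vardefine:startdate|{{#time:Y-m-d|{{{Start date|}}} }} }}
<!-- Reuse it for storage... -->
{{#set: Has start date={{#var:startdate}} }}
{{#subobject: |Has start date={{#var:startdate}} |Has label={{{Label|}}} }}
<!-- ...and for presentation: -->
Start date: {{#var:startdate}}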
Across transclusions
The scope of wikitext expansion or transclusion in which a variable is defined may not be the same as the ones where it is actually used. Such cross-transclusion interaction, for lack of established terminology, may include the following situations:
- Passing variables from a parent template (where variables are defined) to its child templates (where those variables are used).
- Passing variables from the output of one multiple-instance template to the next such template in the series. A simple example is the use of an auto-incrementing counter, or assigning a value to a series of templates only once until a new one is defined along the way.
- Passing variables to other content that is transcluded by means other than template usage.
A child template, for lack of any established term, is an instance of a template assigned to a parameter of the main template, which we call the parent template. Child templates are often used as multiple-instance templates. Schematically, the source code would look like this:
{{MyParentTemplate |Items={{MyChildTemplate |Foo=bar1 }}{{MyChildTemplate |Foo=bar2 }} }}
Sometimes an instance of the template needs to fetch data from the parent template, or from a previous instance of the same template. One example of the latter is that you may have to enumerate your instances so that when you query data, they can be sorted in the right order of appearance, something which requires that your code has a way to get (previous) input from the page. A more elaborate variation on this is that you have a table with complex intermediate calculations and you use a template instance for each row to handle those calculations. Another aim that might call for this method is making information visually more compact: if the previous item already refers to the same subject, compare the content in each instance so we don't need to repeat information that's already provided. I could go on, but the point is that pre-Parsoid, variables used to provide the key to handling these use cases, without the need for additional functionality.
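To make the counter example concrete, this is roughly what the pre-Parsoid approach inside a multiple-instance template looked like, relying on Variables and ParserFunctions (the property names are invented):

<!-- Inside the child template: increment a page-wide counter... -->
{{#vardefine:rownum|{{#expr:{{#var:rownum|0}} + 1}}}}
<!-- ...and store it so query results can be sorted by order of appearance: -->
{{#subobject: |Has position={{#var:rownum}} |Has item={{{Item|}}} }}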
Use case: controlling the behaviour of a page
- Determining whether transcluded content should or should not import semantic annotations (because native support has been lacking in this regard).
- Determining whether sections of a page should be shown to the currently viewing user.
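The first case, for instance, could be handled pre-Parsoid with a simple flag variable that the transcluded content checks. A rough sketch (the variable and property names are made up):

<!-- On the transcluding page, set a flag before the transclusion: -->
{{#vardefine:noannotations|1}}
<!-- Inside the transcluded content, make storage conditional on that flag: -->
{{#ifeq:{{#var:noannotations}}|1||{{#set: Has title={{{Title|}}} }} }}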
Solutions
In a nutshell
The solutions below can be summarised as follows:
- Embed another template within a template because template arguments, unlike variables, can still be used throughout the template.
- Use the #tf-convert parser function of the TemplateFunc extension to work with multiple-instance templates.
- Handle arrays within a single parser function.
- Write custom parser functions to handle more complex use cases that are common in the wiki.
Not covered below
- One of the first recommendations you may receive when asking the community about solutions is to use Lua through the Scribunto extension. Lua is a very useful and efficient scripting language for embedded use that you should consider if you have the appropriate admin privileges to install it on the server. The if here is not a given, unfortunately. I have worked with Lua on wikis that are not my own, but for financial reasons, CODECS is tied to a shared hosting environment where Lua is not an option and so I won’t cover its benefits here.
- Widgets using the Widgets extension. This site does have many of them but since they do not allow for reusing and modifying their output, they offer no solution to many of the problems raised above.
- Mustache templates. While I have been using some of them through the TemplateFunc extension, Mustache is intended as a declarative, largely logic-less templating engine.
1. Embed content within additional wiki templates
While the output of 'printing' a variable with #var or #arrayprint may become uncertain, a template parameter such as {{{Foo|}}} can still be used and re-used reliably within the scope of a template. This means that you can always preprocess any provided content if necessary and assign that content directly to a template within your page or template.
It may not be a perfect solution. Nesting content by using a template as an additional layer does add a little to processing time. There is also the issue of maintainability when template nesting develops into a more intricate hierarchy. But the solution should be relatively easy to implement and may be the only one available to you.
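As a rough sketch of what this looks like in practice (Template:MyHelperTemplate and its Foo parameter are invented for the example):

<!-- On the page or in the parent template: preprocess once and hand the result over -->
{{MyHelperTemplate |Foo={{#time:Y-m-d|{{{Date|}}} }} }}

<!-- Inside Template:MyHelperTemplate, the parameter can be reused freely: -->
{{#set: Has date={{{Foo|}}} }}
Published on {{{Foo|}}}.

Because {{{Foo|}}} is expanded within the template's own argument frame rather than through a page-wide variable store, this pattern does not depend on the serial parsing order that Variables relies on.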
2. Use an alternative data strategy
2.1. Handle multiple-instance templates with TemplateFunc
TemplateFunc is an extension that I wrote in 2023, in large part to address this specific issue. Instead of using the multi-instance template directly, it reads the template data from the source code and creates a new dataset, with two main advantages:
- it adds an intermediary step which allows you to add to the data before the dataset gets used. This is especially useful if you need to provide an index number to each template instance, or something from the parent template.
- it allows you to assign the dataset to a new template, or to a parser function such as #widget.
Documentation will be available here: TemplateFunc extension.
2.2. Query data for presentation
This solution is rather specific to the situation where you store data with an extension like SMW or Cargo and present the same data on the page. Especially if you have some intricate, resource-intensive preprocessing going on before storing the data, it makes sense not to repeat the same preprocessing routine to present the same data. Instead, use a query to call the pre-prepared data and use an appropriate result format.
Regarding SMW: if the order in which values appear in the source code must be preserved, note that property values do not support this out of the box, but you can enable it using a so-called sequence map. An alternative is to introduce an additional property that simply stores the values as a single, delimited string and to process the string as an array when needed.
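For instance, once the preprocessed value has been stored (the property name below is invented), a simple query can bring it back for display instead of repeating the preprocessing:

<!-- Store the expensively preprocessed value once: -->
{{#set: Has citation text={{#var:citation}} }}
<!-- Elsewhere on the page, retrieve it with a query rather than recomputing it: -->
{{#show: {{FULLPAGENAME}} |?Has citation text }}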
3. Arrays
An issue with the Arrays and WSArrays extensions is that the definition of an array and its output are handled by separate parser functions, which are then picked up by the parser in serial fashion.
Any alternatives? Yes and no.
- The parser function #arraymap, which happens to be part of the Page Forms extension, should remain perfectly usable in the future, but it was not designed for more complex use cases (a minimal example follows this list).
- Luckily, a new extension called ArrayFunctions (full disclosure: developed by a colleague of mine) rose to the challenge of coming up with a Parsoid-proof solution. It boasts a fairly comprehensive set of parser functions for working with arrays, including separate ones for definition, manipulation and output. Unlike the functions of the older extensions, these are intended to be nested inside each other, which guarantees compatibility with Parsoid. The extension is particularly useful if you have access to Lua/Scribunto.
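A minimal #arraymap sketch (the Has keyword property is made up): split a comma-separated parameter and annotate each item.

<!-- Split the Keywords parameter on commas and turn each item into an SMW annotation: -->
{{#arraymap:{{{Keywords|}}}|,|@@|[[Has keyword::@@]]}}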
Nonetheless, I went on to create a new parser function of my own, #cr-array, for three main reasons: the nesting approach can make wiki code hard to maintain; given the amount of work ahead of me, it should also be easy to use it to replace relevant sections in the wiki; and finally, I had some additional wishes I wanted to act on.
#cr-array offers a number of actions, including 'map', 'count' and 'search-switch':
- 'map' - behaves similarly to Page Forms' #arraymap, but with additional options. For instance, it lets you sort, in ascending or descending order, by PHP's 'natural sorting' algorithm, and duplicate values can be removed. There are also parameters for the final separator, and a combination of offset/length to slice the array.
- 'count' - counts the items in the array.
- 'search-switch' (earlier, part of #cr-arraysearch-switch) - checks whether a certain value is included in the array and offers output based on whether the boolean evaluation is true or false.
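For illustration only, a call might look something like the sketch below; the parameter names are guesses on my part and the actual syntax of #cr-array may well differ.

<!-- Hypothetical call pattern; check the actual documentation for the real parameter names. -->
{{#cr-array: map
 |values={{{Keywords|}}}
 |delimiter=,
 |sort=asc
 |unique=yes
 |offset=0
 |length=5
}}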
To do
- regex for delimiters
When used in conjunction with Regex Fun, the Arrays extension also supports regular expressions to handle more advanced ways of separating items, such as the use of multiple delimiters. ArrayFunctions does not support this feature currently.
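With the Arrays extension plus Regex Fun, the definition step could use a regular expression as delimiter, roughly along these lines (a sketch; the exact delimiter handling may vary):

<!-- Split on commas or semicolons, with optional surrounding whitespace: -->
{{#arraydefine:refs|{{{References|}}}|/\s*[,;]\s*/}}
{{#arrayprint:refs}}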
4. PHP widgets: writing new parser functions
Sometimes there are common, relatively well-defined tasks that are sufficiently complex to warrant a new parser function. Over time, I ended up writing a collection of parser functions to take over functionality that was previously handled through wiki templates with the help of extensions like Variables and Arrays.
- #cr-citation - produces formatted citations for a range of publication types, something which requires a good deal of preprocessing based on many possible configurations of variables. It is accompanied by six other parser functions for relatively smaller tasks involved in producing formatted strings for data storage and reuse.
- #cr-date-translate - converts input for dates and date ranges to shorthands and machine-readable values.
- #cr-nav-tabs - a parser function created to deal more efficiently with dynamic content in Bootstrap-based tab navigation.
- Lots of minor functions, e.g. #cr-urlget - a simple parser function for fetching URL data, created as a somewhat more efficient alternative to a wiki template that wraps the use of #urlget, a regex to remove potentially harmful characters, and urldecode.
Final notes
Performance and maintenance
We have seen that in dealing with Parsoid, we often need to abandon approaches that were intended to make wiki code faster and more efficient, and to resort to methods that add, rather than strip away, complexity. For instance:
- Repetition of preprocessing, which will sometimes be inevitable.
- Additional templates and template nesting, adding another layer to the recursive parsing cycle.
- Additional extensions to be written and maintained.
Fortunately, it is expected that Parsoid will come with some overall gains in performance. The question is how these gains and losses will balance out. The outcome is not easily predictable and is highly specific to the situation at hand.
Personally, I’m not all that worried about the potentially negative impact on performance. If you are, it may be helpful to look at it from another perspective: does everything on the page need to be parsed right at the start when the page is loading? You could consider deferring the parsing of some sections on the page, especially when those sections are not directly visible to the user anyway. See Parse on demand for one possible approach I have been pursuing.