Reconciliation API

From CODECS: Online Database and e-Resources for Celtic Studies

The ReconciliationAPI extension lets you create and fine-tune API endpoints and services that can be used for entity reconciliation and word completion. Its current focus is on Semantic MediaWiki and MediaWiki core.

  1. Entity reconciliation following the specifications of the W3C Entity Reconciliation Community Group (v2). Reconciliation usually means matching a dataset you are working on against another from an external source. It is a key feature of OpenRefine, which, when coupled with the right tools (plugins or otherwise), allows you to look up and ‘reconcile’ against authority records in other databases, such as Wikidata, Library of Congress, or your own instance of Wikibase. Enabling a Reconciliation API means exposing data on your wiki through a reconciliation service to which tools like OpenRefine can connect.
  2. Word completion, also known as autocomplete or predictive search, is an essential and ubiquitous feature of user interfaces: the user starts typing and the application suggests a list of candidate entities/values for selection or confirmation, based on a set of criteria. This extension attempts to provide Autocomplete APIs:
    1. for Reconciliation API
    2. for internal use so that forms, search bars and other UI/UX features can be designed with ...

Strategies

There are two or three main ways in which an API can be ....

  1. Set up a new page in the Recon namespace (the recommended way).
  2. Add a JSON-encoded configuration profile to the URL (not supported yet).
  3. Use plain parameters, which is possible for some queries.

When it comes to wildcards and ..., a particular challenge is dealing with disparate behaviours depending on the backend being used, for example:

  • SQL (default)
  • SQL with Full-Text Search enabled
  • Elasticsearch

API actions

recon-manifest

Not yet implemented.

recon-suggest-entity

@todo: maybe drop additional source param (mw or smw, which is implicit from other params or data).

Get pages from a MediaWiki category or namespace:

api.php?action=recon-suggest-entity&source=mw&ns=...&substr=...
api.php?action=recon-suggest-entity&source=mw&cat=...&substr=...

Get pages based on a configuration profile:

api.php?action=recon-suggest-entity&profile=...&substr=...

Parameter      Description
profile        Page ID of the configuration profile in the Recon namespace.
substr         Substring.
substrpattern  Substring pattern... May be used to override the pattern set in the profile page.
source         mw (for MediaWiki) or smw (for Semantic MediaWiki).
cat            The category or categories (comma-separated) that must be searched.
ns             The namespace or namespaces (comma-separated) that must be searched, e.g. Main,Help. Example: [1]
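
As a rough illustration of how a client might assemble such a request, the Python sketch below builds the query string from the parameters listed above. The base URL https://example.org/w/api.php and the helper name suggest_entity_url are placeholders invented for this example; only the action and parameter names are taken from this page.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute your own wiki's api.php.
API_URL = "https://example.org/w/api.php"


def suggest_entity_url(substr, profile=None, source=None, ns=None, cat=None,
                       substrpattern=None):
    """Build a recon-suggest-entity request URL (hypothetical helper)."""
    params = {"action": "recon-suggest-entity", "substr": substr}
    if profile is not None:
        params["profile"] = profile
    if source is not None:
        params["source"] = source
    # ns and cat accept several values, joined with commas.
    if ns:
        params["ns"] = ",".join(ns)
    if cat:
        params["cat"] = ",".join(cat)
    if substrpattern is not None:
        params["substrpattern"] = substrpattern
    return API_URL + "?" + urlencode(params)


# Pages in two namespaces whose titles match "Moun".
print(suggest_entity_url("Moun", source="mw", ns=["Main", "Help"]))
```

Note that urlencode percent-encodes the comma separating multiple namespaces or categories; MediaWiki decodes it on receipt.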

recon-suggest-properties

To be renamed to recon-suggest-property?

api.php?action=recon-suggest-properties&format=json&source=smw&substr=...

Used to suggest a property.

recon-suggest-propvalues

api.php?action=recon-suggest-propvalue&source=smw&property=...&substr=...

Used to suggest a value based on the provided property name.
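
The two actions lend themselves to a two-step lookup: first suggest a property, then suggest values for the property that was picked. The Python sketch below only builds the two request URLs. The endpoint and function names are placeholders, and the action names are copied from the example URLs on this page as written (note that the value action appears there in the singular).

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute your own wiki's api.php.
API_URL = "https://example.org/w/api.php"


def property_suggest_url(substr):
    """Step 1: URL that asks for properties matching what the user typed."""
    return API_URL + "?" + urlencode({
        "action": "recon-suggest-properties",
        "format": "json",
        "source": "smw",
        "substr": substr,
    })


def propvalue_suggest_url(prop, substr):
    """Step 2: URL that asks for values of the chosen property."""
    return API_URL + "?" + urlencode({
        "action": "recon-suggest-propvalue",  # spelled as in the example URL
        "source": "smw",
        "property": prop,
        "substr": substr,
    })


print(property_suggest_url("Has ti"))
print(propvalue_suggest_url("Has title", "Moun"))
```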

Configuration options in LocalSettings.php

[...]

Semantic MediaWiki

SQL with Full-Text Search

SQL's Full-Text Search (FTS) is not an ideal approach for powering autocompletion, but it offers a couple of advantages over SMW's standard search behaviour, notably insensitivity to case and diacritics. Some effort has been made to harness its strengths in as user-friendly a way as possible.

When FTS is enabled for SQL (https://www.semantic-mediawiki.org/wiki/Help:Full-text_search), SMW supports two modes of string searching, each selected by a different operator placed after the :: separator:

  1. standard behaviour, now represented by like: rather than the standard tilde (~). E.g. [[Has title::like:Moun*]]
  2. Full-Text Search, now represented by the tilde (~). E.g. [[Has title::~Moun*]]. Additional special syntax includes +/- (IN BOOLEAN MODE) and double quotes for exact phrase matching.
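
To make the difference between the two operators concrete, here is a minimal Python sketch that assembles the query condition for either mode. The function name smw_condition and its full_text flag are invented for this example; the output format follows the two examples given in the list above and assumes FTS is enabled.

```python
def smw_condition(prop, value, full_text=False):
    """Return an SMW condition using like: (standard) or ~ (Full-Text Search)."""
    operator = "~" if full_text else "like:"
    return f"[[{prop}::{operator}{value}]]"


print(smw_condition("Has title", "Moun*"))                  # [[Has title::like:Moun*]]
print(smw_condition("Has title", "Moun*", full_text=True))  # [[Has title::~Moun*]]
```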

What does this mean for FTS in a search box?

  • Each word, or each new set of consecutive characters, that the user starts typing is evaluated individually: if it is at least as long as the number of characters configured through $smwgFulltextSearchMinTokenSize (default: 3) and is not a stopword, it will be matched against a token. If it is shorter, or indexed as a stopword, it may be ignored. [...]
  • A side-effect of tokenisation with FTS is that by default, the order of appearance is not taken into account: notably, it is not possible to match the full string only at its beginning.
  • However, phrase matching is possible, to an extent, by putting the phrase between double quotes. Again, shorter strings not treated as tokens will be ignored.
    • The trade-off is that phrase matching does not support the use of asterisks for partial matching on a token.
  • Another difference is in the evaluation of multiple tokens. By default, a search for Mount Badon matches any string that contains either "Mount" or "Badon". To find matches only where both words are included (AND, not OR), each token must be prefixed with a boolean plus sign: +Mount +Badon.
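
The token rules above can be approximated with a short Python sketch that sorts typed words into those likely to be matched and those likely to be ignored. This is an approximation for illustration only: the stopword set is made up (the real list depends on the database backend), and MIN_TOKEN_SIZE simply mirrors the default of $smwgFulltextSearchMinTokenSize.

```python
MIN_TOKEN_SIZE = 3  # mirrors the default of $smwgFulltextSearchMinTokenSize
STOPWORDS = {"the", "and", "for"}  # illustrative only, not the backend's list


def split_tokens(user_input, min_size=MIN_TOKEN_SIZE, stopwords=STOPWORDS):
    """Return (matched, ignored): words usable as tokens and words that are not."""
    matched, ignored = [], []
    for word in user_input.split():
        if len(word) >= min_size and word.lower() not in stopwords:
            matched.append(word)
        else:
            ignored.append(word)
    return matched, ignored


print(split_tokens("Mount of Badon"))  # (['Mount', 'Badon'], ['of'])
```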

What does this mean for our implementation?

  • It is up to site admins to determine what is most desirable for their use case:
    • Because the site admin is responsible for the query pattern in the profile, it is also up to the site admin to choose between the tilde (~) and like:.
    • The 'substrpattern', otherwise used to determine the position of asterisks, still needs more thought:
      • stringprefix: not supported by FTS.
      • tokenprefix / allchars: there is no such distinction in FTS.
  • Because the ordinary user of a search input should not be expected to be familiar with the nitty-gritty of SMW syntax, some behaviours are handled automatically:
    • Asterisks are appended automatically where they are wanted.
    • Boolean prefixes are added automatically. Care is taken that they are prefixed only to actual tokens of the expected length. A fatal error (RuntimeException) will occur if they are added to shorter phrases.
    • Double quotes, however, are
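
The automatic handling described in this list could look roughly like the following Python sketch. It is not the extension's actual code: the function name, the decision to leave short words untouched, and the handling of operators typed by the user are all assumptions made for illustration. Double quotes are not handled here.

```python
MIN_TOKEN_SIZE = 3  # mirrors the default of $smwgFulltextSearchMinTokenSize


def to_boolean_query(user_input, min_size=MIN_TOKEN_SIZE):
    """Add boolean prefixes and trailing asterisks to words long enough to be tokens."""
    parts = []
    for word in user_input.split():
        prefix = "+"
        # Keep an operator the user typed, but separate it from the word itself.
        if word[0] in "+-":
            prefix, word = word[0], word[1:]
        word = word.rstrip("*")  # avoid doubling an asterisk the user already typed
        if len(word) >= min_size:
            parts.append(f"{prefix}{word}*")
        elif word:
            # Too short to be a token: emit it bare, without a boolean prefix.
            parts.append(word)
    return " ".join(parts)


print(to_boolean_query("Mount Badon"))  # +Mount* +Badon*
print(to_boolean_query("Mount Ba"))     # +Mount* Ba
```

Emitting short words without a prefix is one way to avoid attaching a boolean operator to something that is not a token.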


[Additional guide:] How do you recommend I set up my semantic properties?

For each page that you want to be findable, use properties for the following:

  • Information to be displayed in search results:
    • Label. After all, a page title may not be the most informative or flexible choice. A natural option may be to use the special property "Display title of" (https://www.semantic-mediawiki.org/wiki/Help:Special_property_Display_title_of). For best multilingual support, you could also consider creating a property with the Monolingual Text datatype.
    • Description. One to identify the subject or page scope more precisely and maybe to disambiguate between multiple items of the same name (cf. Wikidata). Preferably under 200 characters.
    • Image thumbnail (@todo)
  • For internal use:
    • Searchable string, i.e. the string - invisible to the user - on which the search must operate. It is equivalent to the label above, but 'flattened' to be most efficiently usable in a search: lowercase, stripped of HTML tags and unhelpful punctuation, stripped of diacritics (e.g. a not á). A parser function will be made available to help you accomplish this on your wiki.
      • potentially, you may want to add alternate descriptors and spelling variants, either to this property or a dedicated additional one.
    • A string for improved sortability (@todo)
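
To illustrate what 'flattening' a label into a searchable string might involve, here is a small Python sketch. It is not the parser function mentioned above, which is not yet available. Treating all punctuation as unhelpful is an assumption; a real implementation would likely be more selective.

```python
import re
import unicodedata


def flatten(label):
    """Lowercase a label and strip HTML tags, punctuation and diacritics."""
    no_tags = re.sub(r"<[^>]+>", "", label)
    # Decompose accented characters, then drop the combining marks (á -> a).
    decomposed = unicodedata.normalize("NFD", no_tags)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: every non-word, non-space character counts as unhelpful punctuation.
    no_punct = re.sub(r"[^\w\s]", " ", no_marks)
    return " ".join(no_punct.lower().split())


print(flatten("<i>Táin</i> Bó Cúailnge!"))  # tain bo cuailnge
```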

For later consideration

Note to self

  • What about multilinguality? What about Monolingual Text?
  • What about Wikidata-style "Also known as"?
  • What about categories?
  • What about preemptively stripping away some characters from the string being typed?
    • On the @todo list: if the user prepends a boolean +/- but what follows is not of the expected token size, we should have it removed at preprocessing time.
  • Better handling of double quotes so we can support something like:
[[Display title of::~+"Dublin, Trinity College" +131* ]]

where the user would type: "Dublin, Trinity College" 131

maybe use preg_match_all( '/"([^"]+)"/', $str, $matches ), or use preg_replace to remove those sections first.
  • Any benefits or disadvantages from using an array of values containing searchable strings?
  • Should there be an option to 'flatten' a user's substring?
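
As a possible starting point for the double-quote handling noted above, the Python sketch below mirrors the preg_match_all idea with re.findall: quoted phrases are pulled out first, and the remaining words are treated as ordinary tokens. The function name is hypothetical, and placing phrases before loose words is a simplification rather than a requirement.

```python
import re

MIN_TOKEN_SIZE = 3  # mirrors the default of $smwgFulltextSearchMinTokenSize


def build_fts_value(user_input, min_size=MIN_TOKEN_SIZE):
    """Turn typed input with optional "quoted phrases" into a boolean FTS value."""
    # Same pattern as the preg_match_all suggestion: text between double quotes.
    phrases = re.findall(r'"([^"]+)"', user_input)
    # Remove the quoted sections so they are not tokenised a second time.
    remainder = re.sub(r'"[^"]+"', " ", user_input)
    parts = [f'+"{phrase}"' for phrase in phrases]
    for word in remainder.split():
        # Only words long enough to be tokens get a boolean prefix and asterisk.
        parts.append(f"+{word}*" if len(word) >= min_size else word)
    return " ".join(parts)


value = build_fts_value('"Dublin, Trinity College" 131')
print(f"[[Display title of::~{value}]]")
```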