Link Checking - Enabling PID Resolution Services


Author: Robin Dasler (DataCite)

As part of our ramp-up to developing the PID Graph, FREYA partners have worked on providing reliable and scalable PID resolution services that support functionality important to our project partners in various disciplines. One example of such functionality is link checking.

What do we mean by link checking?

When we talk about link checking, we're really talking about automatic verification of the URL that is contained in a PID's metadata. This is a critical piece of functionality missing from many PID platforms, and it would enable PID creators and providers to maintain their commitment to providing resolvable content. Regular checks of whether the landing page for a PID is still reachable are of crucial importance, but link checking doesn't have to stop there. Additional automated checks for PID landing page best practices can go one step further, helping PID creators and providers ensure the fundamental level of machine-readable metadata quality that is necessary for interconnected PID systems like those in the nascent PID Graph.

A link checker in practice

To support this necessary functionality on a wide-reaching infrastructural level, DataCite has developed a link checker as part of its DOI management services. This link checker is a custom-built crawler (based on the popular open-source tool [Scrapy]) that works its way through one DOI per DataCite Client per day and attempts to follow the URL registered in the metadata. The link checker then returns information about whether or not its attempt was successful and what it found at the other end.

In addition to verifying that it can reach the URL found in a DOI’s metadata, the link checker also looks for several characteristics that conform to [DataCite's best practices for DOI landing pages]. These are:

  • **HTTP status code** - Indicates whether the crawler could successfully reach the URL, and if not, why.
  • **Number and URL of any redirects** - Indicates whether the crawler was passed through other URLs before it reached its ultimate destination.
  • **Landing page** - Was there a text landing page present at the URL?
  • **DOI in the landing page** - Was a DOI found in the HTML for the landing page?
  • **Schema.org metadata** - Was there [schema.org](https://schema.org/) metadata found on the landing page? This is especially important for exposing metadata to other web services.
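To make these checks more concrete, here is a minimal Python sketch of how a report covering the five characteristics might be assembled once a URL has been fetched. It is not DataCite's implementation (which is a Scrapy-based crawler). The names `LinkCheckReport`, `build_report`, and `DOI_PATTERN` are hypothetical, the DOI pattern is deliberately loose, and treating any `text/*` response as a "text landing page" is an assumption made for illustration. The sketch uses only the standard library and works on an already-fetched response, so it performs no network access.

```python
import json
import re
from dataclasses import dataclass
from html.parser import HTMLParser

# Loose DOI pattern: an assumption for this sketch, not DataCite's own rule.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


@dataclass
class LinkCheckReport:
    status_code: int
    redirect_count: int
    redirect_urls: list
    has_landing_page: bool
    doi_in_page: bool
    has_schema_org: bool


def has_schema_org_metadata(html):
    """True if any JSON-LD block declares a schema.org @context."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # ignore malformed JSON-LD rather than failing the check
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "schema.org" in str(item.get("@context", "")):
                return True
    return False


def build_report(status_code, url_chain, content_type, body):
    """url_chain lists every URL visited: registered URL first, final URL last."""
    reachable = 200 <= status_code < 300
    is_text = content_type.split(";")[0].strip().lower().startswith("text/")
    has_page = reachable and is_text and bool(body.strip())
    return LinkCheckReport(
        status_code=status_code,
        redirect_count=max(len(url_chain) - 1, 0),
        redirect_urls=url_chain[1:],
        has_landing_page=has_page,
        doi_in_page=has_page and DOI_PATTERN.search(body) is not None,
        has_schema_org=has_page and has_schema_org_metadata(body),
    )


# Usage with a made-up response (example.org and 10.1234/example are placeholders).
sample_html = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "@id": "https://doi.org/10.1234/example"}
</script></head>
<body><p>Cite as https://doi.org/10.1234/example</p></body></html>"""

report = build_report(
    status_code=200,
    url_chain=["https://example.org/old", "https://example.org/dataset/1"],
    content_type="text/html; charset=utf-8",
    body=sample_html,
)
missing = build_report(404, ["https://example.org/gone"], "text/html", "")
```

Separating the fetch from the analysis keeps the checks easy to test against saved pages. In a real crawler, the status code, redirect chain, and body would come from the HTTP client.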


DataCite's link checker is available to DataCite's member organizations through either the DataCite DOI management platform [DOI Fabrica] or through the DataCite [REST API].

From link checker to PID Graph

The information in DataCite DOI records is fed into the Crossref-DataCite Event Data service, which collects links between PIDs of disparate types. Event Data in turn forms the backbone of DataCite's contribution to the PID Graph. Having the link checker in place helps to ensure that the information feeding into the PID Graph is complete, accurate, and at least minimally machine-actionable.

As an open source developer, DataCite makes its work on this link checker available via [GitHub]. Other FREYA partners and community members can take this work and build on it to create their own link checkers, helping to shape the information feeding into their own neighborhoods of the PID Graph.