URLapalooza!
Knowledge graphs use the best practices in graph modeling and network analysis to explore interlinked entities and their relationships. Scientific knowledge graphs build upon these principles to represent all kinds of scholarly knowledge, including but not limited to citations between publications, contributorship links between researchers and their works, and versioning information for datasets and software artifacts.
In order to make FREYA’s PID Graph interact with the web at large, Cobaltmetrics’ URI transmutation API now interlinks with DataCite’s GraphQL API to discover even more PIDs and URLs that identify a given web resource. Now, that’s a lot of acronyms, so it must be good! (Insert sarcasm punctuation here.) Allow me to explain.
Better a URL Today Than a PID Tomorrow
FREYA is a 3-year project funded by the European Commission under the Horizon 2020 programme. One of its outputs is a scientific knowledge graph named the PID Graph, which links persistent identifiers (a.k.a. PIDs) as a basis for a wide range of services.
With Cobaltmetrics, our biggest challenge is to discover URIs and PIDs that directly or indirectly identify the same resource. We believe that it is not up to citation aggregators to define what is citable. We observe all citation patterns, whether authors use standard identifiers like DOIs or copy-paste non-persistent URLs from the address bar of their browsers.
There are billions of resources without PIDs, e.g. old documents, grey literature, and most of the non-scholarly web. Even resources that were assigned PIDs are not necessarily cited using these identifiers: in 2019, we estimated that PIDs accounted for 2% of the URIs in our citation index. It follows that we cannot merely track PIDs and permalinks to monitor research outputs and the attention they receive.
Initiatives like FREYA advocate the adoption of PIDs for all kinds of scholarly entities, and we wholeheartedly support that. However, until PIDs become the default, our mission is to ensure that entities that were not blessed with PIDs can still be linked with the rest of the scholarly graph.
PID Avengers, Assemble!
In order to make scientific knowledge graphs like FREYA’s PID Graph interact with the web at large, our URI transmutation API integrates PID-to-PID graphs, PID-to-URL resolvers, URL-to-PID unresolvers, and URL unshorteners. (See our documentation for more information on data sources.) The resulting knowledge graph is a very large but simple graph with a single relationship between its nodes, namely “identifies the same resource as,” something similar to yet less strictly defined than owl:sameAs.
Our knowledge graph combines PIDs, non-persistent URLs, and even dangling URLs, i.e. URLs that do not resolve to a valid destination.
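To make the single-relationship idea concrete, here is a minimal sketch of how "identifies the same resource as" links can be grouped into sets of equivalent URIs with a union-find structure. This is an illustration, not our production implementation: the class name, the method names, and the example URIs are all hypothetical, and treating the relationship as fully transitive is a simplifying assumption.

```python
class EquivalenceIndex:
    """Union-find over URIs linked by 'identifies the same resource as'.

    Hypothetical helper for illustration; assumes the relation is transitive.
    """

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Register unseen URIs as their own representative.
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def link(self, a, b):
        """Record that URIs a and b identify the same resource."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, uri):
        """Return every known URI in the same equivalence class as uri."""
        if uri not in self._parent:
            return {uri}
        root = self._find(uri)
        return {u for u in list(self._parent) if self._find(u) == root}


index = EquivalenceIndex()
# Made-up identifiers: a DOI, its landing page, and a shortened link.
index.link("https://doi.org/10.1234/example", "https://example.org/articles/42")
index.link("https://example.org/articles/42", "https://short.example/abc")
print(sorted(index.equivalents("https://short.example/abc")))
```

Dangling URLs fit in the same structure: a URL that no longer resolves is still a node, and it stays linked to whatever identifiers it was once known to match.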
Since August 2020, our URI transmutation API can interact with DataCite’s GraphQL API to fetch, when applicable, additional identifiers for works with DOIs that are included in DataCite’s collections. Specifically, we extract all identifiers and so-called “related identifiers” with any of the following relation types, when available: IsIdenticalTo, HasVersion, IsVersionOf, IsNewVersionOf, IsPreviousVersionOf, IsVariantFormOf, and IsOriginalFormOf. For more information regarding the technical integration between the URI transmutation API and the PID Graph, see our dedicated case study.
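The filtering step can be sketched as follows. The relation types are the ones listed above; everything else is an assumption made for illustration. In particular, the record shape (`identifiers` and `relatedIdentifiers` lists with `relationType` fields) is loosely modeled on DataCite metadata property names and may not match the actual GraphQL response, and the function name and sample values are hypothetical.

```python
# Relation types we treat as pointing to "the same resource" (from the list above).
EQUIVALENCE_RELATIONS = frozenset({
    "IsIdenticalTo", "HasVersion", "IsVersionOf", "IsNewVersionOf",
    "IsPreviousVersionOf", "IsVariantFormOf", "IsOriginalFormOf",
})


def extract_equivalent_identifiers(work):
    """Collect a work's own identifiers plus qualifying related identifiers.

    `work` is an assumed dict shape, not a documented API response.
    Returns (identifier, identifier_type) pairs, de-duplicated, in input order.
    """
    found = []
    for entry in work.get("identifiers", []):
        found.append((entry["identifier"], entry.get("identifierType")))
    for entry in work.get("relatedIdentifiers", []):
        if entry.get("relationType") in EQUIVALENCE_RELATIONS:
            found.append(
                (entry["relatedIdentifier"], entry.get("relatedIdentifierType"))
            )
    # dict.fromkeys keeps the first occurrence of each pair and preserves order.
    return list(dict.fromkeys(found))


# Made-up record for illustration only.
sample_work = {
    "identifiers": [{"identifier": "https://example.org/ds/1", "identifierType": "URL"}],
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/example.v2", "relatedIdentifierType": "DOI",
         "relationType": "HasVersion"},
        {"relatedIdentifier": "10.1234/other", "relatedIdentifierType": "DOI",
         "relationType": "Cites"},
    ],
}
print(extract_equivalent_identifiers(sample_work))
```

Citation-style relations such as `Cites` are deliberately left out: they link distinct resources, so they would pollute a graph whose only relationship is equivalence.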
Muggle Scientists Develop Harry Potter “Marauder’s Map” Technology
Earlier this year, we were invited by the U.S. National Institutes of Health to give a presentation on FREYA’s PID Graph during a workshop on the role of generalist repositories to enhance data discoverability and reuse. Knowledge graphs are cool, but they are not terribly visual for less-technical audiences. Show too few nodes, and the audience might not grasp the scale of the problems at hand or of the resources built to address them. Show too many nodes, and all you get is a colorful cloud that conveys little information about the graph and its applications.
In order to explain why we love working with PIDs and scholarly knowledge graphs, our presentation drew an analogy between the PID Graph and Cobaltmetrics on one hand, and the Marauder’s Map from the Harry Potter series on the other hand. For the uninitiated:
The Marauder’s Map was a magical document that revealed all of Hogwarts School of Witchcraft and Wizardry. [It] showed every inch of the grounds, as well as all the secret passages […]. It was also capable of accurately identifying each person, and was not fooled by […] invisibility cloaks; even the Hogwarts ghosts were not exempt.
PIDs can provide unambiguous linking between entities of the same type, e.g. different versions of the same research output, or of different types, e.g. a research output and its contributors. There are multiple ways to explore a graph. For example, you might want to start from a given node (representing a person, an organization, etc.) and see what it is linked to. Or you might want to start with multiple nodes and test whether they are linked, and how (what are the nodes on the paths, what are the relationships between these nodes, etc.). But what if the IDs you have are not PIDs? This is where our URI transmutation API comes to your help.
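The second kind of exploration, testing whether two nodes are linked and how, can be sketched as a breadth-first search over labeled edges. This is an illustration under stated assumptions rather than how the PID Graph is queried in practice: the function name, the node labels, and the relationship names are hypothetical, and edges are treated as traversable in both directions.

```python
from collections import defaultdict, deque


def find_path(edges, start, goal):
    """Return the shortest chain of (node, relation, node) hops, or None.

    `edges` is an iterable of (source, relation, target) triples.
    Edges are walked in both directions (an assumption for this sketch).
    """
    neighbours = defaultdict(list)
    for source, relation, target in edges:
        neighbours[source].append((relation, target))
        neighbours[target].append((relation, source))

    previous = {start: None}  # node -> (node we came from, relation used)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for relation, nxt in neighbours[node]:
            if nxt not in previous:
                previous[nxt] = (node, relation)
                queue.append(nxt)

    if goal not in previous:
        return None
    # Walk back from the goal to the start, then reverse.
    path = []
    node = goal
    while previous[node] is not None:
        parent, relation = previous[node]
        path.append((parent, relation, node))
        node = parent
    path.reverse()
    return path


# Made-up nodes and relationship labels.
edges = [
    ("person:alice", "authored", "work:paper-1"),
    ("work:paper-1", "references", "dataset:survey-2"),
    ("person:bob", "created", "dataset:survey-2"),
]
print(find_path(edges, "person:alice", "person:bob"))
```

The returned hops answer both questions at once: which nodes sit on the path, and which relationship connects each pair.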
URI transmutation is the process of converting any URI into a set of equivalent URIs, equivalence being defined as directly or indirectly identifying the same resource. So you can start with any non-persistent ID, and we will do our best to find matching PIDs. The PID Graph and the URI transmutation API are quite complementary: while the graph provides unambiguous links between nodes, the API provides additional entry points to start exploring the graph.
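Once a URI has been transmuted into a set of equivalent URIs, the useful entry points into the PID Graph are the ones that are actually PIDs. A rough way to separate them is to look at the resolver host, as in the sketch below. The host list is a small, non-exhaustive assumption, the function name is hypothetical, and a real service would need to recognise many more identifier schemes than this.

```python
from urllib.parse import urlsplit

# A few well-known PID resolver hosts; deliberately incomplete.
PID_RESOLVER_HOSTS = {"doi.org", "dx.doi.org", "orcid.org", "ror.org", "hdl.handle.net"}


def split_pids(uris):
    """Split URIs into (pids, others) based on the resolver host.

    Both lists are sorted so the output is deterministic.
    """
    pids, others = [], []
    for uri in sorted(uris):
        host = urlsplit(uri).hostname or ""
        (pids if host in PID_RESOLVER_HOSTS else others).append(uri)
    return pids, others


# Made-up set of equivalent URIs for one resource.
equivalent_uris = {
    "https://example.org/articles/42",
    "https://doi.org/10.1234/example",
    "https://short.example/abc",
}
pids, others = split_pids(equivalent_uris)
print(pids)
print(others)
```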
Three noteworthy use cases of the PID+URL graph. Example #1 illustrates the search for direct and indirect contributors to a given work, starting from the work itself. Example #2 illustrates the search for people affiliated with a given institution. Example #3 illustrates the search for derivative works, starting from the service or API that produced a given dataset that was then referenced in a published work. In all three use cases, the PID Graph creates the inner structure of the network, and the URI transmutation API provides additional identifiers to access the outermost nodes.
In conclusion, scientific knowledge graphs are magical data structures that reveal all scholarly entities. They show every contributor and every contribution, as well as all the relationships between them. They are also capable of accurately identifying each entity, and are not fooled by non-persistent identifiers; even dead URLs are not exempt.
Interested in learning more about Cobaltmetrics? Try it out, check the public API, join our newsletter, and reach out at contact@thunken.com.
Original Blogpost:
https://medium.com/thunken/urlapalooza-45225db8e702