VIII. Authority Control

Authority control is the area of Linked Data transition that has caused the most concern. According to Maxwell’s Guide to Authority Work, “Authority work is so called because it deals with the formulation and recording of authorized heading forms in catalogue records,” such that, “names and other headings that are access points to records are given one and only one conventional form.” Prior to the internet, when humans and non-networked computers were the only consumers of information, heading forms were string based, which is to say that the written, human readable form of a heading was the functioning authority. Humans and computing systems could only match records if the values of individual fields were identical as strings. Thus, for example, two records, each of which recorded an Author field with the value “Mark Twain,” would be seen as connected through the Author field. But a collection of records with Author field values “Mark Twain,” “Twain, Mark,” “Samuel Langhorne Clemens,” and “Samuel L. Clemens” would not connect, despite the fact that all of these name forms refer to the same person. This is a familiar concept to catalogers.

From one perspective, Linked Data authorities function much the same as MARC’s human readable authorities. As with strings, when URIs are the same they stand as authority for the same named entity and for different entities when they are different. Thus, for example:

http://id.loc.gov/authorities/names/n79021164

matches

http://id.loc.gov/authorities/names/n79021164

but not

http://id.loc.gov/authorities/names/n79021165

As with authorities meant for human consumption, a variation of just one character (in the above case “4” to “5” in the last character of the string) results in treatment as a distinct authority.

When cataloging in MARC, the authorized, human readable version of a heading will always appear in the record access point regardless of how the name, subject, etc. may appear on the actual item, and cross referenced literal values may or may not be provided elsewhere in the record. In Linked Data cataloging, the same URI must be used to create a linkable node in the graph, but any given graph can contain any version of the human readable label (name, subject, etc.) without affecting the field’s linking function.
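
The distinction between the linking URI and the display label can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: triples are modeled as plain Python tuples, and the person URI, the work URIs, and the `AUTHOR` predicate are all hypothetical. Only the `rdfs:label` predicate URI is a real, standard identifier.

```python
# Standard RDF Schema label predicate.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical identifiers, invented for this illustration.
AUTHOR = "http://example.org/vocab/author"
PERSON = "http://example.org/authority/person/1"

# Two libraries describe different works, use the same person URI,
# and record different human readable labels for that person.
graph_a = {
    ("http://example.org/library-a/work/1", AUTHOR, PERSON),
    (PERSON, RDFS_LABEL, "Mark Twain"),
}
graph_b = {
    ("http://example.org/library-b/work/9", AUTHOR, PERSON),
    (PERSON, RDFS_LABEL, "Samuel L. Clemens"),
}

def works_by(graph, person_uri):
    """Return the works linked to person_uri; labels play no part in the match."""
    return {s for (s, p, o) in graph if p == AUTHOR and o == person_uri}

def labels_for(graph, uri):
    """Return the human readable labels a graph records for a URI."""
    return {o for (s, p, o) in graph if s == uri and p == RDFS_LABEL}

# Merging the two graphs links both works through the shared URI.
linked_works = works_by(graph_a | graph_b, PERSON)
```

Both works are found through the shared URI even though each graph carries a different label for the person.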

Given the above, it is not necessarily the case that moving to Linked Data dramatically affects how we work with authorities. We could, in fact, use the same centralized authority control systems that we use today and the workflows that surround them. Linked Data, however, opens the possibility for radically new forms of authority.
Figure 28 below depicts the current, centralized model of authority control.

Figure 28: Centralized authority control

By contrast, Figure 29 below depicts a completely decentralized model for authority control:

Figure 29: Decentralized authority control

The centralized model is the one with which we are currently familiar. Authority headings are managed by one or more centralized authorities. Individual libraries request headings from, and submit headings to, the appropriate managing authority. The decentralized model, by contrast, removes the authority managing organizations from the equation. Instead of going through central points of authority to manage authorities, libraries rely on each other.

In a completely decentralized authority model, rather than turning to a Library of Congress authority file, individual libraries would query each other’s Linked Data endpoints in search of authority URIs. For example, if cataloging a work credited on the title page as authored by “Mary W. Shelley,” a cataloger would submit a query to other libraries for any triple in their graph store with the label “Mary W. Shelley,” or “Mary Shelley,” or even just “Shelley.” If one or more matching triples were found, the cataloger would then pull the extended graph from the holding institution’s data store in order to disambiguate. Provided the cataloger determined from traversing this graph that it represented the same “Mary W. Shelley,” the cataloger would use the found URI in the local graph. In cases where no graph can be located by querying other libraries for triples with the label “Mary W. Shelley,” the cataloger would mint a URI locally and insert it into the local graph for the work being cataloged, making the new URI findable and usable by other libraries through the Linked Data gateway to the cataloger’s library.
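
The workflow described above can be sketched as a "reuse or mint" routine. This is only an illustration under stated assumptions: peer graph stores are modeled as in-memory sets of tuples rather than remote endpoints, label matching is exact rather than partial, the `confirm` callback stands in for the cataloger's disambiguation judgment, and the function names and local base URI are hypothetical.

```python
import uuid

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def find_candidates(peer_graphs, labels):
    """Collect URIs from peer graphs whose label exactly matches any given label."""
    wanted = set(labels)
    found = set()
    for graph in peer_graphs:
        for s, p, o in graph:
            if p == RDFS_LABEL and o in wanted:
                found.add(s)
    return found

def resolve_or_mint(peer_graphs, local_graph, labels, confirm,
                    base="http://example.org/local/person/"):
    """Reuse a peer URI the cataloger confirms; otherwise mint a local one."""
    for uri in sorted(find_candidates(peer_graphs, labels)):
        if confirm(uri):  # stand-in for human disambiguation
            return uri
    # No confirmed match: mint a new URI (hypothetical base) and publish its label.
    new_uri = base + uuid.uuid4().hex
    local_graph.add((new_uri, RDFS_LABEL, labels[0]))
    return new_uri

# A peer library already holds a URI labeled "Mary W. Shelley".
peers = [{("http://library1.com/entity/person/72312031", RDFS_LABEL, "Mary W. Shelley")}]
local = set()
reused = resolve_or_mint(peers, local, ["Mary W. Shelley", "Mary Shelley"], lambda uri: True)
minted = resolve_or_mint([], local, ["Mary W. Shelley"], lambda uri: True)
```

In the first call the peer's URI is reused; in the second, with no peers to consult, a new URI is minted and its label is added to the local graph so that other libraries could find it.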

The above system allows URIs to propagate organically through the extended library information network in a manner that is both efficient and provides a growing graph of context for disambiguation. Once a URI is in circulation, each library that uses the URI extends the graph of information available for other libraries to use in disambiguation. This extended graph would very quickly surpass the current level of context that surrounds existing authority methods.

There are, however, some potential difficulties with the completely decentralized model. Most obvious is the problem of finding an appropriate URI with a non-matching label. The current, authorized heading for “Mary W. Shelley” is “Shelley, Mary Wollstonecraft, 1797-1851.” A query for the label “Mary W. Shelley” would not find a URI labeled “Shelley, Mary Wollstonecraft, 1797-1851,” even though the two labels refer to the same person, who should be represented with the same URI. The solution to this problem is a reconciliation process commonly known as sameAs. A sameAs statement provides a mechanism for indicating that two URIs refer to the same entity. Thus, for example, if one graph assigns the following URI to Mary W. Shelley:

http://library1.com/entity/person/72312031

And another graph assigns the following, different URI to Shelley, Mary Wollstonecraft, 1797-1851:

http://library2.com/agent/person/q09eqe9mws

The following sameAs statement indicates that both URIs represent the same person, with the two name variants “Mary W. Shelley” and “Shelley, Mary Wollstonecraft, 1797-1851”:

Figure 30: sameAs entity linking

Once a sameAs statement has been made and published, it becomes available for others to take advantage of. A traversal for the “Mary W. Shelley” URI would find the sameAs statement and know that it also needs to query for the “Shelley, Mary Wollstonecraft, 1797-1851” URI in order to produce a complete graph of the referenced person, provided the querying institution has access to the graph that contains the sameAs statement.
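
A sameAs statement and the lookup it enables can be sketched as follows. The two person URIs are the ones given above; the use of the OWL `owl:sameAs` predicate, the direction of the statement, and the helper names are assumptions made for this illustration, since the text does not specify how the statement is encoded.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # assumed encoding of "sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

URI_1 = "http://library1.com/entity/person/72312031"
URI_2 = "http://library2.com/agent/person/q09eqe9mws"

triples = {
    (URI_1, RDFS_LABEL, "Mary W. Shelley"),
    (URI_2, RDFS_LABEL, "Shelley, Mary Wollstonecraft, 1797-1851"),
    (URI_1, OWL_SAME_AS, URI_2),  # the reconciliation statement
}

def equivalents(graph, uri):
    """Return uri plus any URI directly linked to it by sameAs, in either direction."""
    same = {uri}
    for s, p, o in graph:
        if p == OWL_SAME_AS:
            if s == uri:
                same.add(o)
            elif o == uri:
                same.add(s)
    return same

def labels_for_entity(graph, uri):
    """Gather labels attached to uri or to any of its direct sameAs equivalents."""
    uris = equivalents(graph, uri)
    return {o for s, p, o in graph if p == RDFS_LABEL and s in uris}

all_labels = labels_for_entity(triples, URI_1)
```

Starting from either URI, the lookup returns both name variants, but only because the graph being queried contains the sameAs statement.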

There are two primary obstacles to a completely decentralized authority model of URI creation and sameAs reconciliation. The first is the problem of determining the scope of query traversal. Were the entire library community to transition to Linked Data, the number of graph endpoints would be staggering. This number would continue to grow as commercial vendors and services enter the ecosystem. As such, traversing the entire knowledge graph represented by the Linked Data web is not computationally practical. Making such a traversal would require computing resources on the order of those currently commanded by major search engines, a level of technology support that is not now, and is not likely ever to be, within the grasp of even the largest libraries.

History provides a lesson in the above regard. In the early days of the internet it was common for people and institutions to perform their own crawls of the entire internet and store a local cache for searching. However, within a year of the advent of the World Wide Web, such traversals became impractical based on both time of crawl and space required to store crawl caches, and the search engine as service was born. Farming out crawling and caching functions to a handful of centralized systems solved the computing barriers of local crawling and caching.

As the number of cultural heritage institutions and supporting commercial interests increases, libraries will quickly face the same technological barrier that confronted information consumers of the early internet. As such, some form of centralized authority operations will be a technological necessity for the future Linked Data library ecosystem. There are, however, multiple forms that such an operation could take.

Several organizations that currently maintain widely used authority lists have already made their MARC-based authorities available as Linked Data. These include the Library of Congress, OCLC, and Getty. As more libraries move into the Linked Data ecosystem, we can reasonably expect that others will do the same. None of the organizations currently making their authorities available as Linked Data has changed the process through which it manages its authorities to reflect a Linked Data environment. The Library of Congress, for example, still employs the same NACO system of authority management. Its Linked Data gateways are simply a Linked Data representation of the Library of Congress authority files.

Similar to the Library of Congress, OCLC has made its WorldCat, FAST, and VIAF data available as Linked Data. As with the Library of Congress, the bulk of these services represent a re-presentation of traditional MARC-based data, with no significant modification of resource management practice. OCLC has, however, recently been engaged in a variety of pilot projects aimed at capitalizing on the potential of Linked Data to facilitate authority management.

Several of the OCLC Linked Data pilots have focused on solving the sameAs problem discussed earlier. The first iteration of the pilot service provided what can best be described as an authority registry: a system for centralized sameAs aggregation of authority URIs created by various institutions, including local libraries. This process has come to be known as URI reconciliation. Figure 31 presents an overview of the basic methodology:

Figure 31: Centralized authority reconciliation model

The above model allows individual libraries to submit locally created URIs to a third-party service for reconciliation. Needed local URIs would be created and submitted to the reconciliation service, where they would be aggregated through a sameAs relationship with other URIs that refer to the same entity. During search and discovery (whether by end-users or internally as part of the cataloging workflow), the aggregated set of URIs provided by the reconciliation service is used to build the graph to be presented to the user.
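
One way such a reconciliation service might aggregate submissions is a union-find structure, in which every submitted URI belongs to exactly one cluster of equivalents. This is a hypothetical sketch, not a description of OCLC's pilot: the class, its methods, and the third library URI are invented for illustration.

```python
class ReconciliationRegistry:
    """Hypothetical in-memory registry that groups equivalent URIs into clusters."""

    def __init__(self):
        self._parent = {}  # maps each URI to its parent within a cluster tree

    def _find(self, uri):
        """Return the representative URI of the cluster containing uri."""
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            uri = self._parent[uri]
        return uri

    def submit(self, uri, same_as=None):
        """Register uri, optionally asserting it is the same entity as same_as."""
        root = self._find(uri)
        if same_as is not None:
            other = self._find(same_as)
            if other != root:
                self._parent[other] = root  # merge the two clusters

    def cluster(self, uri):
        """Return every registered URI reconciled with uri, including uri itself."""
        root = self._find(uri)
        return {u for u in list(self._parent) if self._find(u) == root}

A = "http://library1.com/entity/person/72312031"
B = "http://library2.com/agent/person/q09eqe9mws"
C = "http://example.org/library3/person/shelley"  # hypothetical third URI
D = "http://example.org/library3/person/other"    # hypothetical unrelated URI

registry = ReconciliationRegistry()
registry.submit(A)
registry.submit(B, same_as=A)
registry.submit(C, same_as=B)
registry.submit(D)
shelley_cluster = registry.cluster(C)
```

A single lookup against the registry returns the full set of equivalent URIs, regardless of which library minted the one used to ask.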

The type of service described above (for which OCLC is currently planning a pilot) dramatically streamlines the process of building associative graphs. For example, consider a situation where three URIs have been minted for the same entity. In order to build an associative graph, an institution would first have to query the Linked Data ecosystem for the known label of the entity for which it is searching. This would return one URI. The institution would then have to re-query the ecosystem for all instances of the returned URI, looking for sameAs statements. And for each returned sameAs statement, it would again have to query the entire ecosystem for other sameAs relationships that contain references to URIs not already known. Building the complete list of sameAs relationships for an entity with three sameAs URIs in circulation would require a minimum of three and a maximum of five traversals of the ecosystem. A centralized reconciliation system similar to the one depicted in Figure 31 would reduce this to a constant two traversals, a significant improvement in efficiency that would yield savings in both query time and cost.
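
The repeated re-querying described above amounts to computing a transitive closure over sameAs statements. The sketch below is a simplified model, assuming all published sameAs pairs are available as one list and counting one ecosystem-wide query per URI examined. The count it reports follows from that simplification and is not meant to reproduce the traversal figures quoted above; the URIs are hypothetical.

```python
def same_as_closure(start_uri, same_as_pairs):
    """Follow sameAs pairs outward from start_uri until no new URIs appear.

    Returns the set of equivalent URIs and the number of per-URI queries
    issued under this simplified model (one query per URI examined).
    """
    known = {start_uri}
    frontier = [start_uri]
    queries = 0
    while frontier:
        uri = frontier.pop()
        queries += 1  # stands in for one query across the whole ecosystem
        for a, b in same_as_pairs:
            if a == uri:
                other = b
            elif b == uri:
                other = a
            else:
                continue
            if other not in known:
                known.add(other)
                frontier.append(other)
    return known, queries

# Three hypothetical URIs for one entity, linked in a chain.
U1 = "http://example.org/lib1/person/1"
U2 = "http://example.org/lib2/person/2"
U3 = "http://example.org/lib3/person/3"
pairs = [(U1, U2), (U2, U3)]

closure, query_count = same_as_closure(U1, pairs)
```

Because the number of queries grows with the number of URIs in circulation, the cost of the decentralized approach rises with each newly minted equivalent, whereas a registry lookup does not.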

The above discussion is meant only as an introduction to the problem of authority control. Its intent is to provide a foundation for understanding what is involved in considering Linked Data authority and, more importantly, to demonstrate that there are viable solutions to this perceived barrier to the adoption of Linked Data. As demonstrated by the current Linked Data offerings of major authority providers, it is possible to provide reliable Linked Data authority without changing anything about the way authority management is currently conducted. As such, the perceived Linked Data authority problem is, in fact, a Linked Data opportunity: a chance to improve operational efficiency and the depth of contextual information that surrounds authority headings.

Cornell University is currently midway through an IMLS grant devoted specifically to understanding and modeling processes for Linked Data authority, with a focus on seizing the opportunity of Linked Data transformation to improve both the quality of authority data and the efficiency of the workflows that create and manage it. This effort is already producing valuable results, and promises to conclude with a collection of community-developed principles and models for Linked Data authority control. Those with an interest in this area of Linked Data implementation should follow the work of this project.
