Thoughts on Referencing, Linking, Reference Rot

Last updated: December 28, 2013

3 links in a single anchor

Prepared by:
   Herbert Van de Sompel, Martin Klein, Robert Sanderson - Los Alamos National Laboratory
   Michael Nelson - Old Dominion University

Abstract

This document is motivated by the reference rot problem, a combination of the well-known link rot problem and the less discussed content decay problem. Link rot is about links that stop functioning whereas content decay is about the linked content changing over time, possibly to the extent that it stops being representative of the content that was initially referenced.

A common approach to addressing reference rot is to create an archival snapshot of a linked resource and to link to that snapshot instead of to the live resource. Unfortunately, this approach requires the web archive that hosts the snapshot to remain permanently available, a rather unrealistic proposition. If that archive becomes temporarily or permanently unavailable the link to the archived snapshot stops functioning: one link rot problem was replaced by another.

This document argues that the problem can be avoided by linking to the original resource while providing temporal context - the date the linked resource was accessed and/or the URI of an archived snapshot of the resource - as link attributes. Including the temporal context information, in combination with the use of existing archival infrastructure such as web archives, web archive APIs, and the Memento protocol, helps to ameliorate reference rot problems and can be considered a step towards increased web persistence.

Table of Contents

Motivation

Links to Web resources are subject to reference rot, a combination of:
  • Content decay: The content of the linked resource may change over time and, as a result, the degree to which that content remains representative of the content that was intended to be linked to may decrease over time.
  • Link rot: The linked resource may disappear altogether.
A common approach to address reference rot consists of the combination of:
perma link
There is good news and bad news regarding this common approach to address reference rot:
  • The good news is that a snapshot of the linked resource is taken and stored in a web archive. The more snapshots of a resource, the better our record that documents its evolution over time.
  • The bad news is that in order for the link to the archived snapshot to work, the archive that hosts the snapshot (perma.cc in the ongoing example) needs to remain permanently operational, a rather unrealistic expectation. If that archive goes temporarily or permanently off line, the link to the archived snapshot stops working: one link rot problem was replaced by another.
Interestingly, but not surprisingly, archived snapshots of the original resource http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/ also exist in other web archives:
These archived snapshots can be found by searching web archives for the original URI http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/: More generally, every existing web archive supports searching for archived snapshots by means of the URI of the original resource of which the archive took (a) snapshot(s). Also, all major web archives support the Memento protocol that allows accessing archived snapshots by using the URI of the original resource. In addition, some web archives have bespoke APIs that provide such functionality. Using the ongoing example, this means that, in case perma.cc would temporarily or permanently go off-line, snapshots can still be found in other web archives.

However, in the common approach to address reference rot, the original URI is replaced by the URI of the archived snapshot. As a result, this approach prohibits retrieving snapshots from other web archives, making its functioning totally dependent on the continued existence of the web archive that assigned the URI for the archived snapshot. From the perspective of web persistence this can hardly be regarded as a satisfactory solution. In order to maximize chances of future retrieval of snapshots of a linked resource, that resource's original URI must be maintained when linking to it.

Citing Web Resources

When citing web resources in scholarly literature or in Wikipedia articles, a practice exists to not only include the URI of the cited resource in the citation but also one or more of the following information elements:
  • The date the cited resource was visited.
  • The URI of an archival version of the cited resource, typically available from a web archive.
  • The date the cited resource was archived.
Wikipedia's http://en.wikipedia.org/wiki/Resonance_fm page has the following reference to a web resource:

10. "Mute magazine - Culture and politics after the net". Metamute.org. 2002-05-09. Retrieved 2010-02-09.

This reference contains:
  • The URI of the cited resource, http://www.metamute.org/en/Radio-Playtime, expressed as a hyperlink attached to the title of that resource.
  • The date the cited resource was visited, 2010-02-09, as a string.
Wikipedia's http://en.wikipedia.org/wiki/Coil_(band) page has the following reference to a web resource:

2. "Coil: Scatology, Horse Rotorvator, Love's Secret Domain". Liar Society (2004-10-30). Archived from the original on 6 February 2008. Retrieved 2007-02-12.

This reference contains:
  • The URI of the cited resource, http://liarsociety.tripod.com/blog/index.blog?from=20041130, expressed as a hyperlink connected to the string Archived from the original.
  • The date the cited resource was visited, 2007-02-12, expressed as a string.
  • The URI of an archival version of the cited resource, http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/blog/index.blog?from=20041130, expressed as a hyperlink attached to the title of the cited resource.
  • The date the cited resource was archived, 6 February 2008, as a string.
BioMed Central's research paper http://www.biomedcentral.com/1471-2148/6/95 has the following reference to a web resource:

86. ClustalW 1.8 (European Bioinformatics Institute) http://www.ebi.ac.uk/clustalw/ webcite

This reference contains:
  • The URI of the cited resource, http://www.ebi.ac.uk/clustalw/, expressed as a hyperlink connected to the URI string itself.
  • The URI of an archival version of the resource, http://www.webcitation.org/query.php?url=http://www.ebi.ac.uk/clustalw/&refdoi=10.1186/1471-2148-6-95, expressed as a hyperlink connected to the name of the archive where that version resides, webcite.

In order to cite a web resource, Wikipedia's Simple Citation Template and most scholarly citation styles including the Chicago citation style, the American Psychological Association style, and the American Chemical Society style list the URI and the access date of the cited resource as necessary information. In addition, the citation in Example 3 and Wikipedia's Full Citation Template include the URI of an archival version of the resource. The latter also includes the date the archival version of the cited resource was created.
Inclusion of this temporal context information - access date, archive url, and archive date - is motivated by the understanding that the web is ephemeral, that the URI of the cited resource (from now on referred to as linkedurl) is subject to reference rot.

The temporal context information provided in these web citations serves the following purposes:
  • The date the cited resource was visited, from now on referred to as versiondate, serves as a reminder that one should not assume that the content at the cited resource will be the same when visiting it some time after that date.
  • The URI of an archival version of the cited resource, from now on referred to as versionurl, allows revisiting a version of the page as it was when it was cited.
  • The date the cited resources was archived is somehow informative from a documentation perspective but seems less crucial as this information is typically also available when visiting the archived version of the resource.

The Case for Structured Temporal Context on Links

This temporal context information has, so far, been included in a way that is helpful for human consumption. Despite the many variations in expressing the information that is relevant for a web citation, a user can interpret it and connect the dots. Also, temporal context information has so far only been included in formal web citations. However, since all links are subject to reference rot, addition of such information should not be limited to formal citations of web resources, but should rather be applicable to all links to web resources.

There are compelling reasons to express temporal context information in a structured manner on links to support use by applications such as browsers, crawlers, search engines:
  • The many variations in expressing web citation information makes machine interpretation challenging.
  • In the current representation of information, the linkedurl and the versionurl look like two independent URIs despite the tight - temporal - relationship between them.
  • The approach used for formal web citations can not be used for links in general because it would e.g. require adding two links to the same anchor text.
  • The versionurl, if provided in a structured manner, can be used by applications such as browsers, to indicate and provide the option to retrieve the archived snapshot of the linked resource.
  • The combination of the linkedurl and the versiondate, if provided in a structured manner, can be used by applications such as browsers, to indicate and provide the option to obtain an archived snapshot of the linked resource that is temporally near to the versiondate, even if no versionurl is provided. The Memento protocol that specifies content negotiation in the datetime dimension provides this functionality in an interoperable manner, but it could also be provided by leveraging bespoke APIs of web archives.
The question then arises how to best convey the temporal context information so that applications can use it. And how to do so in a uniform manner, i.e. a manner that is independent on the venue conveying the information. With this regard, it is interesting to observe that in 1995, the definition of the anchor element included an optional URN attribute, possibly/likely provided to address concerns regarding web persistence. The attribute was deprecated and it is probably a fair guess that this happened because no infrastructure existed to act upon URNs. The HTML 5 development page for the anchor element includes a reminder that the URN attribute is obsolete.

There are several reasons to revisit the inclusion of attributes related to web persistence in select HTML elements, most importantly the anchor element <a>:
  • There is a growing concern regarding persistence at least in some pockets of the web:
    • Wikipedia has an active Link rot thread looking into the problem domain.
    • The Hiberlink project and CrossRef's OpCit explore the problem for scholarly communication. A pilot study that led to Hiberlink found disconcerting percentages of link rot and lack of archival versions for web resources referenced in the arXiv.org preprint collection and the thesis repository of the University of North Texas.
    • Reference rot has become a significant concern in legal cases that depend on web resources. The Perma effort at Harvard University has emerged to try and ameliorate the problem.
    • The Modern Language Association style for citing web resources no longer mandates the inclusion of the cited URI because "Web addresses are not static".
  • Infrastructure has emerged that can play a role in achieving an increased degree of web persistence:
In the below, various options are explored to convey structured temporal context information for links in HTML pages. They share the following design characteristics:
  • The linkedurl is considered the central information element and is conveyed as the value of the href attribute of the anchor (<a>) element. The major motivation for this choice is that the linkedurl is the URI by which the linked resource is known throughout the web, including in web archives.
  • Temporal context information, for example the versionurl of an archived snapshot of the linked resource, is provided as additional information that surrounds the linkedurl.
These options are presented:
  • Option 1: Using versiondate and versionurl attributes with the <a> element.
  • Option 2: Using a versionurl attribute with the <a> element.
  • Option 3: Using data-versiondate and data-versionurl attributes with the <a> element.
  • Option 4: Using a data-versionurl attribute with the <a> element.
  • Option 5: Using RDFa, microdata to express the necessary information.
The approaches are illustrated by means of the aforementioned formal web citation Example 2 and the below linking Example 4:
The New York Times article In Supreme Court Opinions, Web Links to Nowhere the anchor text a new, permanent link is linked to the archived snapshot http://perma.cc/0Hg62eLdZ3T created by perma.cc on October 2 2013, whereas the URI of the original resource http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/ is no longer available. The goal is to convey all available information - linkedurl, versionurl, versiondate - on the link.


Using versiondate and versionurl attributes with the <a> element

Define the additional, optional versiondate and versionurl attributes for the <a> element to convey the versiondate and versionurl information, respectively.
  • The linkedurl (value of the href attribute) can be used to access the current representation of the linked resource.
  • The URI provided as the value of versionurl can be used to directly access an archived snapshot of the linked resource.
  • The datetime provided as the value of versiondate in combination with the linkedurl (value of the href attribute) can be used with the Memento protocol or web archive APIs to obtain an archived snapshot of the referenced resource that is temporally near to the datetime.
This Option could be realized in HTML5 as the specification is still under development.

Example 2:
    2.	<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130"  
    versionurl="http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/
                                                       blog/index.blog?from=20041130" 
    versiondate="2007-02-12">"Coil: Scatology, Horse Rotorvator, Love's Secret Domain".</a> 
    Liar Society (2004-10-30). Archived from the original on 6 February 2008.
    
Example 4:
    It allows writers and editors to capture and fix transient information on the Web with a 
    <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"  
    versionurl="http://perma.cc/0Hg62eLdZ3T" 
    versiondate="2013-10-02">new, permanent link</a>. 
    

Using a versionurl attribute with the <a> element

Define an additional, optional versionurl attribute for the <a> element. This attribute is used to convey a URI.
  • The linkedurl (value of the href attribute) can be used to access the current representation of the linked resource.
  • In case a versionurl is known, it is used as the attribute value in which case it can be used to directly access an archived snapshot of the linked resource.
  • In case a versionurl is not known, a DURI is used as the attribute value. A DURI carries the linkedurl and the versiondate, information that can be used to obtain an archived snapshot of the linked resource.
This Option could be realized in HTML5 as the specification is still under development.

Example 2:
    2.	<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130"  
    versionurl="http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/
                                                   blog/index.blog?from=20041130">
    "Coil: Scatology, Horse Rotorvator, Love's Secret Domain".</a> 
    Liar Society (2004-10-30). Archived from the original on 6 February 2008. Retrieved 2007-02-12.
    
Example 4:
    It allows writers and editors to capture and fix transient information on the Web with a 
    <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"  
    versionurl="http://perma.cc/0Hg62eLdZ3T">new, permanent link</a>. 
    
There are drawbacks to this approach:
  • The DURI scheme is not well known and manually inputting a DURI could be cumbersome.
  • The HTTP URI of a resource version provided in the attribute is itself subject to link rot.
  • In case the URI of an archived resource is provided in versionurl, the versiondate is not provided in a machine-actionable manner. As a result, crucial information to pinpoint a temporally appropriate archived snapshot of the linked resource is missing.

Using data-versiondate and data-versionurl attributes with the <a> element

HTML5 allows the use of custom attributes that have names starting with data-. Using this feature, data-versiondate and data-versionurl can be introduced for the <a> element, with the same use and meaning as described in Option 1 for versiondate and versionurl, respectively.

This can be done for HTML5, and given the migration path offered from HTML1 and HTML4 to HTML5, this approach is potentially within reach for pre-HTML5 pages.

Example 2:
    2.	<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130"  
    data-versionurl="http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/
                                                              blog/index.blog?from=20041130" 
    data-versiondate="2007-02-12">"Coil: Scatology, Horse Rotorvator, Love's Secret Domain".</a>
    Liar Society (2004-10-30). Archived from the original on 6 February 2008.
    
Example 4:
    It allows writers and editors to capture and fix transient information on the Web with a 
    <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"  
    data-versionurl="http://perma.cc/0Hg62eLdZ3T" 
    data-versiondate="2013-10-02">new, permanent link</a>. 
    
This approach raises the question whether using data- attributes is acceptable in order to achieve cross-site interoperability. The HTML5 specification says the following with this regard:
  • Custom data attributes are intended to store custom data private to the page or application, for which there are no more appropriate attributes or elements.
  • These attributes are not intended for use by software that is independent of the site that uses the attributes.

Using a data-versionurl attribute with the <a> element

This approach uses the HTML5 data- extensibility mechanism described in Option 3 but applies it to the versionurl attribute with use and meaning described in Option 2.

Example 2:
    2.	<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130"  
    data-versionurl="http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/
                                                      blog/index.blog?from=20041130">
    "Coil: Scatology, Horse Rotorvator, Love's Secret Domain".</a> 
    Liar Society (2004-10-30). Archived from the original on 6 February 2008. Retrieved 2007-02-12.
    
Example 4:
    It allows writers and editors to capture and fix transient information on the Web with a 
    <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"  
    data-versionurl="http://perma.cc/0Hg62eLdZ3T">new, permanent link</a>. 
    
This approach inherits the drawbacks of both Option 2 and Option 3.

Using RDFa, microdata to express the temporal context information

RDFa or microdata markup could be used to convey the temporal context information pertaining to the linkedurl, provided in the href attribute of the <a> element. This approach requires the introduction of a vocabulary that defines the versiondate and versionurl properties.

This Option can be realized for various HTML flavors.

This approach could realistically be implemented in environments such as Wikipedia, for citations expressed using the web citation template that is computationally turned into HTML. The approach is significantly less attractive when HTML is authored manually and when links in general are concerned. Also, while this kind of markup can easily be consumed in a batch manner, for example by search engines, using it in real-time in browsers may present a challenge. Also, the approach reduces the degree of freedom regarding citation style: semantics can only be provided for information that is directly shown to the user.

Publication Date as Approximation for Versiondate

A page-wide default value for versiondate can be set by using the datePublished property introduced by schema.org as page-wide metadata (using ISO 8601 notation and UTC time zone):

<meta itemprop="datePublished" content="2012-02-24T12:15SZ">

This proposed use of meta indicates that, for those links that do not provide an explicit versiondate on an anchor element, the date provided in meta should to be used, i.e. an versiondate provided on an anchor element overwrites the datePublished provided in meta.

Applications

Temporal context information can be put to use in various applications. Search engines could use it, for example, to highlight frequently referenced snapshots of a resource. Also, when a web author adds versiondate information to a link, this could be taken as a hint for a pro-active archiving application that the linked resource should be archived. Such an application could be integrated with a desktop authoring tool, or a content management system like a MediaWiki, or it could be a crawler-based third party service, for example operated by a web archive that is looking to extend its archival collection.

A major use case for temporal context information is browsers. A challenge in using the information is the way in which to make it actionable by the user. One way to do so is by means of a right click on a linked resource. The Memento extension for Chrome uses this approach to present the following options in the context menu (watch the movie that illustrates the functionality):
  • Retrieve a snapshot of the resource as it existed around the date set in a calendar user interface element.
  • Retrieve the most recent snapshot of the resource (especially helpful in case of 404 responses)
The following options could be added to the context menu to leverage temporal context information:
  • Retrieve the archived snapshot with the provided versionurl.
  • Retrieve the archived snapshot that is temporally closest to the provided versiondate.
  • Retrieve the archived snapshot that is temporally closest to the provided publication date of the page that includes the link.
As an example, consider the http://en.wikipedia.org/w/index.php?title=Resonance_fm&oldid=478575545 version of the http://en.wikipedia.org/wiki/Resonance_fm Wikipedia page used in Example 1. The page-wide and link-specific approaches to convey versiondate can be combined in the following manner in the versioned page: Note that this example assumes Memento support by Wikipedia. A Memento extension for MediaWikis is available. Information about value that support for Memento can add to MediaWikis is described here.
Thanks for input and inspiration: Harihar Shankar, Lyudmila Balakierva