The second invitational Workshop on Linkage from Citations to Journal Literature was held at the Westin Copley Place in Boston on June 9, 1999. The workshop was co-sponsored by the National Information Standards Organization (NISO), the Digital Library Federation (DLF), the National Federation of Abstracting and Information Services (NFAIS), and the Society for Scholarly Publishing (SSP). The 37 attendees represented researchers, librarians, primary and secondary publishers, and vendors of a variety of information services.
The Boston workshop was intended as a follow-up to an earlier workshop on the same topic held in Washington, DC in February 1999. At the February workshop, a general model for reference linking was proposed, which was subsequently refined and elaborated by a smaller working group. Bill Arms, the chair of the working group, began the meeting by presenting the group's report, "A model for reference linking". He did not repeat the entire paper, which is available at http://www.lib.uchicago.edu/Annex/pcaplan/reflink.html, but rather highlighted certain conclusions and issues.
"A model for reference linking"
In the general model, publishers contribute data to three kinds of database systems: a reference database linking identifiers to citation data, a locations database linking identifiers to locations or URLs, and one or more content databases containing the journal articles themselves. The end user or client logically performs a three-step process: submitting a citation to the reference database to obtain an identifier, submitting the identifier to a locations database to obtain one or more locations, and using selected location information to obtain a copy of the desired content.
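The three-step process can be sketched in a few lines of Python. This is only an illustration of the model's flow, not code from the working group's report: the in-memory dictionaries, the identifier string, the example.org URLs, and all function names are hypothetical stand-ins for the reference, locations, and content databases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    journal: str
    year: int
    volume: str
    start_page: str
    first_author: str


# Hypothetical stand-ins for the three databases in the general model.
REFERENCE_DB = {
    Citation("Journal of Examples", 1999, "12", "345", "Smith"): "id:example-0001",
}
LOCATIONS_DB = {
    "id:example-0001": [
        "http://publisher.example.org/joe/12/345",
        "http://mirror.example.edu/joe/12/345",
    ],
}
CONTENT_DB = {
    "http://publisher.example.org/joe/12/345": "full text (publisher copy)",
    "http://mirror.example.edu/joe/12/345": "full text (mirror copy)",
}


def lookup_identifier(citation):
    """Step 1: citation -> identifier, or None if the citation is unknown."""
    return REFERENCE_DB.get(citation)


def resolve_locations(identifier):
    """Step 2: identifier -> list of locations (possibly empty)."""
    return LOCATIONS_DB.get(identifier, [])


def fetch_content(location):
    """Step 3: location -> content."""
    return CONTENT_DB[location]


def follow_reference(citation, choose=lambda locations: locations[0]):
    """Run all three steps; `choose` picks one of several locations."""
    identifier = lookup_identifier(citation)
    if identifier is None:
        return None
    locations = resolve_locations(identifier)
    if not locations:
        return None
    return fetch_content(choose(locations))
```

Passing a different `choose` function (for example, one that prefers a mirror) is where a client's preference among multiple locations would enter.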
Arms noted that although citations could theoretically be linked directly to locations, in practice working applications have been implemented using identifiers such as the PMID, the BibCode and the DOI. Using intermediate identifiers helps with persistence and with "flexible targets" (multiple satisfiers).
Identifiers may be embedded in citations, calculated from citations, or obtained by lookup in a reference database. Journal publishing currently follows a static model in which links for references are established at the point of publication and embedded in the article itself. In an alternative, dynamic model, reference links are established on demand. Although the latter can be expected to obtain more up-to-date results, success cannot be guaranteed.
In all cases, the quality of the metadata in the reference database is crucial. The working group compared some existing schemes and found considerable agreement on a minimal required set of elements but great differences in details and syntax. The group also found that Dublin Core does not easily represent citation information for articles, lacking clear guidelines on where to put the title of the containing journal, volume enumeration, etc. Arms made a plea to the DC community to address these problems.
Resolution of an identifier to a location presents two issues: how to know which resolver to submit the identifier to, and how to select from multiple locations. It is widely accepted that journal articles may be available at a number of locations beyond the primary publisher's own site, for performance, economic, or functionality reasons. In a central resolution model, holdings information for all copies would be available in a central locations database, and the client (library or individual user) would submit preferences along with the resolution request. In a distributed model, a local interceptor would sit between the user and a central locations database, and only identifiers not found in the interceptor would be forwarded on to the central resolver.
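The distributed model can be illustrated with a small sketch. The class names, identifiers, and URLs below are hypothetical, and the report does not prescribe any particular interface; the code only shows the forwarding rule, in which the interceptor answers from its own table and passes everything else to the central resolver.

```python
class CentralResolver:
    """Hypothetical central locations database."""

    def __init__(self, locations):
        self._locations = locations

    def resolve(self, identifier):
        return list(self._locations.get(identifier, []))


class LocalInterceptor:
    """Sits between the user and the central resolver."""

    def __init__(self, local_locations, central):
        self._local = local_locations
        self._central = central
        self.forwarded = []  # identifiers passed on to the central resolver

    def resolve(self, identifier):
        # Answer locally when the identifier is known here.
        if identifier in self._local:
            return list(self._local[identifier])
        # Otherwise forward the request to the central resolver.
        self.forwarded.append(identifier)
        return self._central.resolve(identifier)
```

For example, an interceptor built with a local entry for "id:a" would return the library's own copy for that identifier and forward a request for "id:b" to the central resolver.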
Panel Reaction
Larry Lannom from CNRI began by observing that the local interceptor was a variation on proxy caching, but with the significant difference that each interceptor would contain unique information, making them an interesting research problem. The most important architectural issue is to ensure that the multitude of local interceptors are interoperable.
Jim Ostell from the National Library of Medicine gave a short presentation on PubRef. The citation matcher component takes journal title, year, volume, start page number, and first author, and returns one or more identifiers. The LinkOut component matches user profiles against stored holdings information. Publishers, aggregators, and any type of commercial, academic or government service may all submit holdings of the materials they provide. Users can set up their own profiles, or can use profiles that institutions have set up on their behalf. For example, the National Institutes of Health have an NIH profile so that any user coming from the NIH home page will have only NIH-specific links turned on. Alternatively, users can request to see all links. An interesting development is that authors now want to provide themselves as "locations", which makes sense in the scientific community but raises interesting questions of publisher-validated vs. non-validated information.
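The two components described above can be approximated as follows. This is not the PubRef or LinkOut interface, which the report does not specify: the record layout, the "ex:" identifiers, the provider names, and the function names are all assumptions made for illustration. The sketch shows why a partial citation can return more than one identifier and how a profile narrows the links shown.

```python
def normalize(text):
    """Lower-case and collapse whitespace so minor variations still match."""
    return " ".join(str(text).lower().split())


# Hypothetical citation records held by the matcher.
RECORDS = [
    {"id": "ex:1001", "journal": "Journal of Examples", "year": 1998,
     "volume": "7", "start_page": "101", "first_author": "Garcia"},
    {"id": "ex:1002", "journal": "Journal of Examples", "year": 1998,
     "volume": "7", "start_page": "115", "first_author": "Chen"},
]

# Hypothetical holdings submitted by providers.
HOLDINGS = [
    {"id": "ex:1002", "provider": "publisher",
     "url": "http://publisher.example.org/1002"},
    {"id": "ex:1002", "provider": "aggregator",
     "url": "http://aggregator.example.com/1002"},
]


def match_citation(records, **fields):
    """Return identifiers of records that agree on every supplied field."""
    return [
        record["id"]
        for record in records
        if all(normalize(record[name]) == normalize(value)
               for name, value in fields.items())
    ]


def links_for(identifier, holdings, profile=None):
    """Return links for an identifier.

    `profile` is a set of enabled providers; None means "show all links".
    """
    return [
        holding["url"]
        for holding in holdings
        if holding["id"] == identifier
        and (profile is None or holding["provider"] in profile)
    ]
```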
Eric Hellman from Openly Informatics noted that the term "interceptor" is generally used to refer to software that blocks access to undesirable Internet sites, and drew parallels between these and the interceptors as proposed in the working group's paper. Internet interceptors contain huge lists of URLs to block, and may also block by doing pattern matching (looking for particular strings in URLs). The interceptor would redirect a blocked URL to a page indicating the site is blocked. Businesses like FamilyConnect.com maintain huge central databases and charge monthly fees for use of their proxy server.
Eric Hellman's Presentation
Open discussion
The open discussion which followed the panel focused initially on challenging a few key assertions in the report.
Mark Doyle from APS questioned why identifiers were needed, when peer review citations serve the same functions, are persistent, serve as human readable citations, and bypass the overhead of a centralized lookup service. If one knew one had valid metadata in a citation, it would be more efficient to go directly to the publisher's site. Jim Ostell noted that if a user has to go through the trouble of validating a citation, he might as well use a direct key. Evan Owens from the University of Chicago Press noted that some editors want authors to contribute identifiers for the references they use. One use of the reference database might be for authors to create a reference list for the publisher.
Deb Bendig from OCLC questioned the idea of local interceptors. Representing an aggregator herself, she didn't think that aggregators would want to provide files of current identifiers to each of their customers. This led to a discussion of whether it was preferable for interceptors to maintain individual identifiers or whether they should redirect by maintaining holdings information (institution A subscribed to title B from source C from date 1 to date 2). The former approach means aggregators and other third-party sources would have to cooperate in providing identifiers, and institutions would have to manage huge files of them. The latter approach gets complicated very quickly, as most aggregators add and drop individual titles with such frequency that libraries are hard pressed to keep track of content at the title level.
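The holdings-based alternative can be sketched as a table of statements of the form "institution A subscribed to title B from source C from date 1 to date 2". The class and function names and the sample values in the usage note are hypothetical; the point is only that redirection then depends on the article's publication date falling inside a subscription window.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Holding:
    institution: str
    title: str
    source: str
    start: date
    end: Optional[date] = None  # None means the subscription is still open

    def covers(self, published):
        """True if an article published on this date falls in the window."""
        return self.start <= published and (
            self.end is None or published <= self.end
        )


def sources_for(holdings, institution, title, published):
    """List the sources from which this institution can reach the article."""
    return [
        holding.source
        for holding in holdings
        if holding.institution == institution
        and holding.title == title
        and holding.covers(published)
    ]
```

With one holding from an aggregator covering 1995 through 1997 and another from the publisher starting in 1997, an article published in mid-1997 would be reachable from both sources, while one published in 1998 would be reachable only from the publisher. Every title an aggregator adds or drops means another row to maintain, which is the bookkeeping burden raised in the discussion.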
Jim Ostell pointed out this may be an opportunity for a service. LinkOut, for example, currently bases institutional profiles on journal titles. It could let institutions provide a list of aggregators instead, and allow algorithms to determine which copy to prefer if a title is held by multiple aggregators. The central LinkOut service could maintain the information as to which titles were available from each aggregator at any given time.
Scenarios
Bill Arms reviewed several reference linking scenarios. He noted that the early DOI model assumed that a user with a DOI in hand would be routed from the DOI system directly to the publisher of the article. In the current model multiple alternative locations may exist, and the user, a library, or the publisher itself may want to mediate which is selected. All three of these players have different goals, as do the authors.
Publishers want to satisfy their customers (libraries, authors, readers), to resolve identifiers to the definitive version of the article, to advertise their own offerings, and to collect good statistics.
Libraries want to satisfy readers, resolve identifiers to the definitive version, minimize costs, and collect good statistics.
Authors want to maximize their own readership, resolve identifiers to the most appropriate (often the current) version, and advertise related works of their own.
Readers have the most varied goals of all. They want to resolve identifiers to the most appropriate version, but this may differ depending on circumstances. They also want fast, direct access to content, and links to related works by topic.
There are already several reference linking services, including those developed or under development for PubRef, the Astrophysics Data Center, the DOI, and LANL. One has to assume that there will be multiple reference databases, location databases and content databases. As a next step, the community should consider what levels of agreement and tools are needed for cross linking these. These may include a registry of reference linking schemes, hand-off protocols, and citation analysis tools.
Priscilla Caplan asked how, given multiple lookup services, a user would know which one to use for any given citation. This led to a number of suggestions. There might be a known primary service for the field of knowledge. Possibly there could be a registry of lookup services, which a searcher could use to find the most appropriate. If there were only a small number of lookup sites, front end software could be written to search them all simultaneously. However, for these front ends to return intelligible results to the user, it may be necessary to standardize the response formats from the various lookup sites.
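The front-end idea can be illustrated with a short sketch. The two lookup services, their response formats, and the identifiers below are invented for the example; no real service interface is implied. Each service answers in its own native format, and a small adapter per service converts the answer into one shared shape so that the merged results are intelligible.

```python
from concurrent.futures import ThreadPoolExecutor


# Two hypothetical lookup services with different native response formats.
def physics_lookup(citation):
    """Returns a list of dicts, empty when nothing matches."""
    if citation["journal"] == "Example Physics Letters":
        return [{"identifier": "phys:0042"}]
    return []


def medicine_lookup(citation):
    """Returns a single string, or None when nothing matches."""
    if citation["journal"] == "Example Medical Journal":
        return "med:7781"
    return None


# Adapters convert each native response into a plain list of identifiers.
ADAPTERS = {
    "physics": lambda c: [row["identifier"] for row in physics_lookup(c)],
    "medicine": lambda c: [r for r in [medicine_lookup(c)] if r is not None],
}


def search_all(citation):
    """Query every service concurrently and merge results in one format."""
    names = sorted(ADAPTERS)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map preserves the order of `names`.
        results = list(pool.map(lambda name: ADAPTERS[name](citation), names))
    return [
        {"service": name, "identifier": identifier}
        for name, identifiers in zip(names, results)
        for identifier in identifiers
    ]
```

A registry of lookup services, as suggested in the discussion, would play the role of the ADAPTERS table here.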
Using the DOI for Reference Linking
David Sidman described a project of the International DOI Foundation (IDF) to build a prototype reference linking database, an activity initiated in direct response to the sense of urgency communicated at the first Workshop on Reference Linking. This is an implementation project which follows the general model presented in the working group's paper, and uses metadata elements similar to those described in that paper.
The reference lookup is to the DOI system as the phonebook is to the telephone system. Batch as well as individual lookup will be supported, so, for example, an article with 15 references could be submitted to the lookup database, and have 15 DOIs returned. The building of the reference database is integrated with the building of the identifier database. Both use technology developed by the Corporation for National Research Initiatives (CNRI). A single submission by a publisher goes to both databases.
The metadata stored will be a minimal set, including author, registrant, the date of publication and the date the master goes online, the type of object (journal article, abstract), the journal identifier, article identifier, article title, and enumeration (volume, issue, page).
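A record with the elements listed above, together with a batch lookup, might look like the sketch below. The field names, the matching key (journal identifier, volume, page), and the DOI values under the made-up prefix 10.9999 are assumptions for illustration; the prototype's actual schema and matching rules are not described in this report.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ArticleRecord:
    doi: str
    author: str
    registrant: str
    published: date       # date of publication
    online: date          # date the master goes online
    object_type: str      # e.g. "journal article" or "abstract"
    journal_id: str
    article_id: str
    title: str
    volume: str
    issue: str
    page: str


# Hypothetical sample data; the DOI is made up.
SAMPLE_RECORDS = [
    ArticleRecord(
        doi="10.9999/example.1", author="Smith", registrant="Example Press",
        published=date(1999, 3, 1), online=date(1999, 2, 15),
        object_type="journal article", journal_id="J1", article_id="A1",
        title="An Example Article", volume="12", issue="3", page="345",
    ),
]


def build_index(records):
    """Index records by an assumed matching key: journal, volume, page."""
    return {(r.journal_id, r.volume, r.page): r.doi for r in records}


def batch_lookup(index, references):
    """Return one result per submitted reference, in order.

    Each reference is a (journal_id, volume, page) tuple; unmatched
    references yield None so positions stay aligned with the input.
    """
    return [index.get(reference) for reference in references]
```

Keeping the output aligned with the input means a publisher submitting an article's reference list can see which references came back without a DOI.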
Currently the primary publishers Springer, Wiley, Academic, ACS, and AIP, and secondary publishers Dawson and ISI, are participating in the prototype project. The hope is to have enough of the system in place to be able to demonstrate it at the Frankfurt Book Fair in October.
SFX-Links
Herbert van de Sompel gave a presentation on extended services in the hybrid library environment. This model postulates a heterogeneous environment in which materials are both paper and digital, formally and informally published, and local and remote. Information sources include abstracting and indexing services, full text collections, library catalogs, preprint servers, document delivery services, web services and web search engines. A number of these sources are outside of the library's control. Nonetheless, libraries want to present information from any particular source in the context of the complete collection.
Linking services have traditionally worked by creating static "link bundles" of relationships. However, these relationships can be precomputed only in environments where you control all the resources, which is not the case for libraries. Instead of creating relationships between documents, the SFX linking model creates conceptual relationships between resources. In this model one defines sources (e.g. MEDLINE), services (e.g. holdings, table of contents, full text, etc.) and targets (e.g. the online catalog or Current Contents). "Thresholds" define what information is required for a link. The link source goes to the linking service, screened by thresholds, to determine what services can be offered.
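The role of thresholds can be shown with a small sketch. It is not taken from the SFX implementation: the metadata field names, the pairing of services with targets, and the threshold sets are assumptions chosen for illustration. A service is offered only when the source record carries every element its threshold requires.

```python
# Hypothetical service definitions: each names a target and the metadata
# elements (its threshold) that must be present before a link is offered.
SERVICES = [
    {"name": "full text", "target": "publisher site",
     "threshold": {"issn", "volume", "start_page"}},
    {"name": "holdings", "target": "online catalog",
     "threshold": {"issn"}},
    {"name": "table of contents", "target": "Current Contents",
     "threshold": {"issn", "volume", "issue"}},
]


def available_services(record, services):
    """Return names of services whose threshold the source record meets."""
    present = {field for field, value in record.items() if value}
    return [
        service["name"]
        for service in services
        if service["threshold"] <= present  # subset test
    ]
```

A record from a source such as MEDLINE that lacks a start page would, under these assumed thresholds, still be offered holdings and table of contents links but not a full text link.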
Two papers by Van de Sompel and Patrick Hochstenbach published in D-Lib Magazine describe the SFX framework more thoroughly: Reference Linking in a Hybrid Library Environment, Part 1: Frameworks for Linking, and Part 2: SFX, a Generic Linking Solution. For Part 1, see http://www.dlib.org/dlib/april99/van_de_sompel/04van_de_sompel-pt1.html
An upcoming paper, probably to appear in the September 1999 issue of D-Lib Magazine, will elaborate on the mechanism for local/selective resolution introduced in the recent SFX implementation, on which the University of Ghent, the Los Alamos National Laboratory, Wiley Interscience and the American Physical Society have collaborated. Using a pragmatic, easy-to-implement mechanism, information systems (A&I databases, full-text repositories, OPACs, ...) are informed about the existence and location of a local resolver when they are accessed by users of an institution hosting such a resolver. Based on that knowledge, the system can tunnel the information that needs local resolution to the institutional resolver rather than to the default target. The mechanism is modular and generic and can be applied to any piece of information for which local resolution is desired or required.
Since the problem statement for the SFX research is broader than the specific reference linking problem, the techniques developed so far also apply within this more specific domain. Identifiers (such as the DOI) as well as metadata contained in citations can be tunneled to an institutional resolver, which can then decide how to handle the request for resolution. In the simplest case, resolution can mean redirecting a user to the most appropriate copy of the cited paper; in the case of SFX, it means presenting the user with a wide variety of services for the given citation, including delivery of the full text. The described mechanism has been demonstrated successfully in the collaborative implementation mentioned above. It introduces an approach to tackling the so-called Harvard problem.
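The routing decision can be sketched as follows. How an information system actually learns about a local resolver is left to the upcoming paper, so this sketch simply assumes the resolver's base URL is already known; the function name, parameter names, and URLs are hypothetical.

```python
from urllib.parse import urlencode


def build_link(metadata, default_target, local_resolver=None):
    """Build the link for a citation.

    If the user's institution has announced a local resolver (assumed here
    to be known as a base URL), tunnel the identifier or citation metadata
    to it; otherwise link to the default target.
    """
    base = local_resolver if local_resolver is not None else default_target
    # Sort keys so the generated URL is deterministic.
    query = urlencode(sorted(metadata.items()))
    return base + "?" + query
```

A user from an institution with a resolver would be sent to that resolver, which could then redirect to the most appropriate copy or present a menu of services; all other users would follow the default link.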
Next steps
The workshop series was intended to formulate the issues related to reference linking in a compelling way in a short period of time. It was agreed that the workshop group should not perpetuate itself, but rather attempt to facilitate more targeted efforts to resolve some of the issues raised.
It was noted that an NSF/DLF funded research project on reference linking is already underway at Cornell.
Some action items enumerated at the close of the workshop include: