The fundamental importance of capturing cited-reference metadata in Institutional Repository deposits from Stevan Harnad on 2009-01-23 (American-Scientist-Open-Access-Forum)

From: Stevan Harnad <amsciforum_at_GMAIL.COM>
Date: Thu, 22 Jan 2009 19:27:31 -0500

On 22-Jan-09, at 5:18 AM, Francis Jayakanth wrote on the eprints-tech
list:

      Till recently, we used to include references for all the uploads
      that are happening into our repository. While copying and pasting
      metadata content from the PDFs, we don't directly paste the copied
      content onto the submission screen. Instead, we first copy the
      content onto an editor like notepad or wordpad and then copy the
      content from an editor on to the submission screen. This is
      specially true for the references.

      Our experience has been that when the references are copied and
      pasted on to an editor like notepad or wordpad from the PDF file,
      invariably non-ascii characters found in almost every reference.
      Correcting the non-ascii characters takes considerable amount of
      time. Also, as to be expected, the references from difference
      publishers are in different styles, which may not make reference
      linking straight forward. Both these factors forced us take a
      decision to do away with uploading of references, henceforth. I'll
      appreciate if you could share your experiences on the said matter.

The items in an article's reference list are among the most important
metadata, second only to the equivalent information about the
article itself. Indeed they are the canonical metadata: authors,
year, title, journal. If each Institutional Repository (IR) has those
canonical metadata for every one of its deposited articles as well as
for every article cited by every one of its deposited articles, that
creates the glue for distributed reference interlinking and metric
analysis of the entire distributed OA corpus webwide, as well as a
means of triangulating institutional affiliations and even name
disambiguation.
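
To make the "glue" concrete, here is a minimal sketch in Python of how
a cited reference might be matched to a deposited article once both
carry the canonical metadata. Everything in it is an assumption made
for illustration: the function names, the record fields (id, surname,
year, title) and the choice of key (first-author surname, year, first
few title words). It is not how EPrints or any existing service links
references.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(letters_only.split())


def match_key(surname: str, year, title: str, n_words: int = 4) -> str:
    """Hypothetical matching key: surname | year | first n title words."""
    title_start = " ".join(fold(title).split()[:n_words])
    return "|".join([fold(surname), str(year), title_start])


def link_references(deposits, references):
    """Return, for each reference, the id of the matching deposit or None."""
    index = {match_key(d["surname"], d["year"], d["title"]): d["id"]
             for d in deposits}
    return [index.get(match_key(r["surname"], r["year"], r["title"]))
            for r in references]
```

A key this crude will miss some matches and merge some distinct
articles, but it shows why having authors, year and title for both the
deposit and its cited references is what makes the linking possible.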

Yes, there are some technical problems to be solved in order to
capture all references, such as they are, while filtering out noise,
but those technical problems are well worth solving (and the solution
well worth sharing) for the great benefits they will bestow.
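
As one illustration of the noise problem Francis describes, the sketch
below normalizes a reference string pasted from a PDF using only
Python's standard library. The replacement table and the function name
clean_reference are hypothetical and far from exhaustive. Note that it
keeps legitimate accented letters in author names rather than forcing
everything to ASCII.

```python
import unicodedata

# Hypothetical, non-exhaustive table of characters that NFKC leaves alone
# but that often appear when text is copied out of a PDF.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2010": "-", "\u2013": "-", "\u2014": "-",  # hyphen, en dash, em dash
    "\u00ad": "",                   # soft hyphen
}


def clean_reference(text: str) -> str:
    """Normalize one pasted reference string, keeping accented letters."""
    # NFKC expands compatibility characters such as the "fi" ligature
    # (U+FB01) and turns the no-break space (U+00A0) into a plain space.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(PUNCTUATION.get(ch, ch) for ch in text)
    # Drop control characters (category Cc) that are not whitespace.
    text = "".join(ch for ch in text
                   if ch.isspace() or unicodedata.category(ch) != "Cc")
    # Collapse whitespace runs, including line breaks from PDF wrapping.
    return " ".join(text.split())
```

A step like this could run automatically at deposit time, so that
nobody has to correct characters by hand in notepad or wordpad.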

The same is true for handling the numerous (but finite) variant
formats that references may take: Yes, there are many, including
different permutations in the order of the key components,
abbreviations, incomplete components etc., but those too are finite,
can be solved once and for all to a very good approximation, and the
solution can be shared and pooled across the distributed IRs and
their software. And again, it is eminently worthwhile to make the
relatively small effort to do this, because the dividends are so
vast.
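
To show what "finite variant formats" could look like in practice,
here is a minimal sketch that tries a short list of patterns in turn
and returns the canonical fields (authors, year, title, journal). The
two patterns and the sample references in the usage note are made up
for illustration; a shared, pooled solution would need many more
patterns, or a trained extractor of the kind CiteSeer uses.

```python
import re
from typing import Optional

# Two hypothetical reference layouts; a real pooled list would be longer.
PATTERNS = [
    # Authors (Year) Title. Journal Volume(Issue): pages.
    re.compile(r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.?\s*"
               r"(?P<title>.+?)\.\s+(?P<journal>[^\d]+?)\s*\d"),
    # Authors. Title. Journal. Year;Volume(Issue):pages.
    re.compile(r"^(?P<authors>[^.]+)\.\s+(?P<title>.+?)\.\s+"
               r"(?P<journal>[^.\d]+)\.\s*(?P<year>\d{4})\b"),
]


def parse_reference(ref: str) -> Optional[dict]:
    """Return the canonical fields of a reference, or None if no pattern fits."""
    ref = ref.strip()
    for pattern in PATTERNS:
        match = pattern.match(ref)
        if match:
            return {name: value.strip()
                    for name, value in match.groupdict().items()}
    return None
```

For example, parse_reference("Author, A. and Writer, B. (2001) An
example title. Journal of Examples 1(2): 3-4.") and
parse_reference("Author A, Writer B. An example title. Journal of
Examples. 2001;1(2):3-4.") both yield the same year, title and
journal, even though the components appear in a different order.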

I hope the IR community in general -- and the EPrints community in
particular -- will make the relatively small, distributed,
collaborative effort it takes to ensure that this all-important OA
glue unites all the IRs in one of their most fundamental functions.

(Roman Chyla has since replied to eprints-tech with one potential
solution: "The technical solution has been there for quite some time,
look at citeseer where all the references are extracted automatically
(the code of the citeseer, the old version, was available upon
request - I dont know if that is the case now, but it was in the
past). That would be the right way to go, imo. I think to remember
one citeseer-based library for economics existed, so not only the
computer-science texts with predictable reference styles are possible
to process. With humanities it is yet another story.")

Stevan Harnad
Received on Fri Jan 23 2009 - 00:28:17 GMT
