Unethical harvesters

From: Arthur Sale <ahjs_at_OZEMAIL.COM.AU>
Date: Wed, 28 Oct 2009 12:01:09 +1100

I write to draw the list's attention to unethical behaviour by a
national harvester - the Australian Research Online gateway. This
gateway, operated by the National Library of Australia, has rejected
the OAI-PMH standard and has announced a local variant. This sort of
behaviour by harvesters must be firmly stamped on as soon as
possible. International standards are to be complied with, not
modified for budgetary convenience.



Responsibility

It is the responsibility of any OAI-PMH harvester such as ARO, ADT,
ROAR, OpenDOAR, OAIster, etc. to harvest correctly from all
OAI-PMH-compliant repositories that exist in the wild and which it
regards as its target group. Please examine that sentence carefully:
the responsibility is with a gateway (which ARO is) to harvest from
any compliant OAI-PMH interface, and not to misrepresent the data.
The National Library fails on both counts.



Remember that international standards such as OAI-PMH are designed to
permit global interchange of metadata. Any harvester that insists on
some individual or local restriction of the international standard is
irresponsible. I did not expect this of the National Library of
Australia. So far it seems to be globally unique in this behaviour.



Why does it fail? In a nutshell, possible hubris and probable
laziness. As to hubris, the NLA has produced a set of requirements
for harvesting with which it expects repositories to comply. Requiring
each repository to comply with its "requirements" rather than
the National Library of Australia (NLA) harvesting properly:

multiplies the work as each Australian repository has to
adapt its interface or opt-out (rather than the NLA doing the job
properly once),

introduces the chance of breaking an existing harvesting
arrangement if the repository changes its interface, and

would be absolutely fatal to the whole global enterprise if
another harvester came up with incompatible requirements.

In the case of my university it would definitely break our in-house
one-on-one harvesting for Government data reporting and would be
likely to have similar flow-on effects for our national PhD thesis
harvesting at the very least. If all harvesters were to come up with
idiosyncratic requirements, the world would be in a real mess and
harvesting, not to mention search engines, would be infeasible. Just
imagine if Google were to behave the same way in the HTML world! At
most these ARO "requirements" constitute a set of suggestions.






The probable laziness comes from programmers. It is trivially easy to
do a proper harvest from all the repositories that exist in Australia
(there are not that many, and even fewer software platforms). I can
think of at least two strategies, neither of which would take more
than an hour of a competent programmer's time. ADT and the rest of the
world's OAI harvesters can do it; why can't the NLA?



"Best Practice"

I hesitated to write this section because some will think it is
important. It isn't. The main issue is the one above. However, it is
bound to be raised by the NLA to justify their so-called
"requirements". This is the argument that their harvesting
"requirements" are good practice. In fact it is not difficult to
mount a case that the GNU EPrints scheme is better practice than the
ARO scheme. Consider these quotes from the Dublin Core Metadata
Initiative (the red is mine):

  "4.14. Identifier

Label: Resource Identifier

Element Description: An unambiguous reference to the resource within
a given context. Recommended best practice is to identify the
resource by means of a string or number conforming to a formal
identification system. Examples of formal identification systems
include the Uniform Resource Identifier (URI) (including the Uniform
Resource Locator (URL), the Digital Object Identifier (DOI) and the
International Standard Book Number (ISBN).

Guidelines for content creation:

This element can also be used for local identifiers (e.g. ID numbers
or call numbers) assigned by the Creator of the resource to apply to
a particular item. It should not be used for identification of the
metadata record itself."

[Using Dublin Core - The Elements,
http://dublincore.org/documents/usageguide/elements.shtml]

  "3. Element Content and Controlled Vocabularies

Each Dublin Core element is optional and repeatable, and there is no
defined order of elements. The ordering of multiple occurrences of
the same element (e.g., Creator) may have a significance intended by
the provider, but ordering is not guaranteed to be preserved in every
user environment."

[Using Dublin Core, http://dublincore.org/documents/usageguide/]



The NLA "requirements" specify that the relevant metadata must be in
a dc:identifier field, contrary to these guidelines. Further, ARO
"requires" that the first dc:identifier element be the metadata
identifier, despite clear indications that order does not matter.
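
To illustrate how a harvester might cope with dc:identifier elements in
any order, here is a minimal sketch in Python. It is an illustration
only: it is not ARO's code, nor necessarily one of the strategies
alluded to earlier. The function name pick_landing_url, the sample
record, the host eprints.example.edu.au, and the preference rule (a URL
on the repository's own host first, then any http(s) URL) are all
assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard Dublin Core element namespace used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def pick_landing_url(record_xml, repository_host=None):
    """Return the dc:identifier that looks like the item's web address.

    Every dc:identifier is examined, so the result does not depend on
    the order in which the repository emits them. The preference rule
    is an assumption of this sketch, not part of OAI-PMH or Dublin Core.
    """
    root = ET.fromstring(record_xml)
    identifiers = [
        (element.text or "").strip()
        for element in root.iter("{%s}identifier" % DC_NS)
    ]
    # Keep only identifiers that parse as http(s) URLs.
    urls = [i for i in identifiers if urlparse(i).scheme in ("http", "https")]
    # Prefer a URL hosted by the repository being harvested, if known.
    if repository_host:
        for url in urls:
            if urlparse(url).hostname == repository_host:
                return url
    # Otherwise fall back to the first URL found, or None if there is none.
    return urls[0] if urls else None


# Hypothetical record: the URL is deliberately NOT the first identifier.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example thesis</dc:title>
  <dc:identifier>Example citation string (2009)</dc:identifier>
  <dc:identifier>doi:10.0000/example</dc:identifier>
  <dc:identifier>http://eprints.example.edu.au/123/</dc:identifier>
</oai_dc:dc>"""

print(pick_landing_url(SAMPLE, "eprints.example.edu.au"))
# prints http://eprints.example.edu.au/123/
```

Because the function scans all dc:identifier values, a repository that
lists a citation string or a DOI first is harvested the same way as one
that lists its URL first. A real harvester would need further rules for
records carrying several URLs on the same host.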



Don't get me wrong. I am not on a crusade to change the way
repositories currently present their OAI-PMH elements, unlike ARO. I
really don't care much how they interpret the standards. But I do
care about the NLA assuming such a bullying stance in relation to
Australian repositories. Already at least two Australian repositories
have confessed to changing their OAI-PMH interface to suit ARO! If
this happens elsewhere, the consequences for open access are
significant as incompatibilities are bound to arise.



Conclusions

1. Readers of the list should be alert for similar unethical
behaviour in their territories.

2. ARO and the NLA should start harvesting from the Australian
OAI-PMH interfaces correctly, as soon as possible, just as the rest
of the world does.

3. In the meantime, mis-harvested repositories should be
withdrawn from the ARO gateway database.

4. If ARO does not comply, Australian repositories will need to
consider boycotting the service.



Arthur Sale

Emeritus Professor of Computer Science

University of Tasmania






Received on Wed Oct 28 2009 - 06:32:28 GMT
