How Green Open Access Supports Text- and Data-Mining

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Wed, 17 Oct 2007 04:26:20 +0100

    How Green Open Access Supports Text- and Data-Mining

        Stevan Harnad

    Version with hyperlinked references:
    http://openaccess.eprints.org/index.php?/archives/310-guid.html

    SUMMARY: Data-mining robots like SciBorg can harvest Green OA
    full-texts, self-archived in their authors' Institutional Repositories
    (IRs) and "repurpose" them for better functionality. A Green publisher
    has endorsed the author's posting of his Green OA postprint in his
    own IR, free for all. The postprint is the author's own refereed,
    revised final draft. The author can certainly revise that draft
    further, making additional corrections, updates and enhancements,
    including marking it up in XML and adding comments. Those corrections
    need not be done by the author's own hands: They could be done by a
    graduate student, a collaborator, a secretary, or a hired hand. The
    author could also have SciBorg "repurpose" his postprint -- under
    one trivial condition, easily fulfilled, which is that the locus
    of the enhanced postprint, the URL from which users must download
    it, remains the author's own IR, not a 3rd-party website. It would
    be highly inimical to the progress of Green OA mandates to insist
    instead that the Green publisher's endorsement to self-archive the
    postprint in the author's IR is "not enough" -- that the author must
    also successfully negotiate with the publisher the retention of the
    right to assign to 3rd-party harvesters like SciBorg the right to
    publish a "derivative work" derived from the author's postprint.

Peter Murray-Rust, in "Why Green Open Access does not support text-
and data-mining", wrote:
http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=702

> PM-R: "the first thing to do is to gather a corpus of documents... any
> other scientist should be able to have access to it. It therefore has
> to be freely distributable"

Agreed. So far this is just bog-standard OA. If the original
documents are self-archived as Green OA postprints in their
authors' Institutional Repositories (IRs), your SciBorg robot
http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html can harvest them
and data-mine them, and make the results freely accessible (but linking
back to the postprint in the author's IR whenever the full-text needs
to be downloaded).

> PM-R: "[At SciBorg] we are interested in machines understanding
> science"

Fine. Let your SciBorg machines harvest the Green OA full-texts and
"repurpose" them as they see fit.

> PM-R: "almost all articles are copyrighted and non-distributable.
> Publisher Copyright is a major barrier... you can't just go out and
> compile a wordlist or whatever as you may infringe copyright or
> invisible publisher contracts (we found that out the hard way)"

You can't do that if you are harvesting the publisher's proprietary
text, but you can certainly do that if you are harvesting the author's
Green OA postprints.

> PM-R: "PDFs are so awful... we have to repurpose them by converting to
> HTML, XML and so on"

Fine.

> PM-R: "Now the corpus is annotated. Expert humans go through line
> by line... It is this annotated corpus which is of most use to the
> scientific community"

Fine.

> PM-R: "So suppose I find 50 articles in 50 different repositories, all
> of which claim to be Green Open Access. I now download them, aggregate
> them and [SciBorg] repurpose[s] them. What is the likelihood that some
> publisher will complain? I would guess very high"

Complain about what, and to whom? A Green publisher has endorsed the
author's posting of his Green OA postprint in his IR, free for all. The
postprint is the author's own refereed, revised final draft. Now follow
me: Having endorsed the posting of that draft, does anyone imagine that
the publisher would have any grounds for objection if the author revised
it further, making additional corrections and enhancements? Of course
not. It's exactly the same thing: the author's Green OA postprint.

So what if the author decides to mark it up as XML and add comments? Any
grounds for objections? Again, no. Corrections, updates and enhancements
of the author's postprint are in complete conformity with posting his
postprint.

Suppose the author did not do those corrections with his own hands, but
had a graduate student, a secretary, or a hired hand do them for him,
and then posted the corrected postprint? Still perfectly fine.

Now suppose the author had your SciBorg "repurpose" his postprint: Any
difference? None -- except a trivial condition, easily fulfilled, which is
that the locus of the enhanced postprint, the URL from which users can
download it, should again be the author's IR, not a 3rd-party website
(that the publisher could then legitimately regard as a rival publisher
-- especially if they were selling access to the "repurposed" text).

So the solution is quite obvious and quite trivial: It is fine for the
SciBorg harvester to be the locus of the data-mining and enhancement of
each Green OA postprint. It can also be the means by which users search
and navigate the corpus. But SciBorg must not be the locus from which
the user accesses the full-text: The "repurposed" full-text must be
parked in the author's own IR, and retrieved from there whenever a user
wants to read and download it, rather than just to search and surf the
entire corpus via SciBorg.
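
The division of labour described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
a description of how SciBorg actually works: the names Record,
HarvesterIndex and download_url, and the example hosts, are all
hypothetical. The harvester keeps the annotations and the search
index, but every full-text link it hands out resolves to the author's
IR, and it refuses to register a record whose full text would be
served from its own host.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Record:
    """One harvested postprint (hypothetical structure, for illustration)."""
    title: str
    ir_url: str         # URL of the full text in the author's own IR
    annotations: tuple  # enhancements produced by the harvester


class HarvesterIndex:
    """Search and navigation live here; the full text does not."""

    def __init__(self, harvester_host):
        self.harvester_host = harvester_host
        self._records = []

    def add(self, record):
        # Reject any record whose download locus is missing or is the
        # harvester itself: the full text must stay in the author's IR.
        host = urlparse(record.ir_url).hostname
        if host is None or host == self.harvester_host:
            raise ValueError("full text must be hosted in the author's IR")
        self._records.append(record)

    def search(self, term):
        # Users search and surf the corpus via the harvester's index.
        term = term.lower()
        return [
            r for r in self._records
            if term in r.title.lower()
            or any(term in a.lower() for a in r.annotations)
        ]

    def download_url(self, record):
        # Reading or downloading always goes back to the author's IR.
        return record.ir_url


index = HarvesterIndex("harvester.example.org")
paper = Record(
    title="An Example Postprint",
    ir_url="https://ir.example.edu/123/",
    annotations=("chemistry", "marked up in XML"),
)
index.add(paper)
hits = index.search("xml")
```

Here hits contains the one record, found through its annotations, and
index.download_url(hits[0]) hands the user back to the IR copy rather
than to anything hosted by the harvester.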

Not only does this all sound silly: it really is silly. In the online
age, it makes no functional difference at all where a document is
actually physically located, especially if the document is OA! But we
are still at the interface between the paper age and the OA era. So we
have to be prepared to go through a few silly rituals, to forestall any
needless fits of apoplexy, which always mean delay (for OA).

So the ritual is this: It would be highly inimical to the progress of
Green OA mandates to insist that the publisher's endorsement to
self-archive the postprint in the author's IR is not enough -- that the
author must also successfully negotiate with the publisher the retention
of the right to assign to 3rd-party harvesters like SciBorg the right to
publish a "derivative work" derived from the author's postprint. That
would definitely be the tail wagging the dog, insofar as OA is
concerned, and it would put authors off providing Green OA (and
hence their institutions off mandating it) for a long time to come.

Instead, when SciBorg harvests a document from a Green OA IR, SciBorg
must make an arrangement with the author that the resultant "repurposed"
draft will be deposited by the author in the author's IR as an update of
the postprint. Then when a user of SciBorg wishes to retrieve the
"repurposed" draft, the downloading site must always be the author's IR,
not a draft hosted by and retrieved directly from SciBorg.

This ritual is ridiculous, and of course it is functionally unnecessary,
but it is pseudo-juridically necessary, during this imbecilic
interregnum, to keep all parties (publishers, lawyers, IP specialists,
institutions, authors) calm and happy -- or at least mutely resigned --
about the transition to the optimal and inevitable that is currently
taking place. Once it's over, and we have 100% Green OA, all this
papyrophrenic nonsense can be dropped.

Please, Peter, be prepared to adapt SciBorg to the exigencies of this
all-important (and all too slow-footed) transitional phase, rather than
trying to adapt the status quo to SciBorg, at the cost of still more
delays to OA.

> PM-R: "Only a rights statement actually on each document would allow
> us to create a corpus for NLP without fear of being asked to take it down"

No. Green OA authors with standard copyright agreements are not in a
position to license republication rights to SciBorg or any other 3rd
party. Let us be happy that they have provided Green OA at all, and let
SciBorg be the one to adapt to it for now, rather than vice versa.

    Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and
    Swan, A. (2007) Incentivizing the Open Access Research Web:
    Publication-Archiving, Data-Archiving and Scientometrics. CTWatch
    Quarterly 3(3). http://eprints.ecs.soton.ac.uk/14418/

Stevan Harnad
Received on Wed Oct 17 2007 - 05:14:32 BST
