Re: Free Access vs. Open Access from Stevan Harnad on 2003-08-12 (American-Scientist-Open-Access-Forum)

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Tue, 12 Aug 2003 00:10:55 +0100

On Mon, 11 Aug 2003, Matthew Cockerill wrote:

>sh> "The use one makes of those full texts is to read them,
>sh> print them off, quote/comment them, cite them, and use
>sh> their *contents* in further research, building on them.
>sh> What is "re-use"? And what is "redistribution" (when
>sh> everyone on the planet with access to the web has access
>sh> to the full-text of every such article)?"
>
> Having free access to articles on the publisher's website would certainly
> offer progress compared to the current status quo. But it would not offer
> anything like the benefits of true open access.

Free access to the current 20,000 journals (2 million articles yearly)
would be like the difference between night and day. Compared to that,
the difference between "free" and "true open" access amounts to just a
few degrees of luminosity.

But let me agree at once that if free access were gerrymandered so
the only thing the user could do was to browse the text on-screen,
without being able to download, save, grep, or print-off, then that
would indeed arbitrarily limit free access's usefulness. How many
(if any) of the several million free-access refereed-journal articles
currently on the web, however -- whether BOAI-1, BOAI-2, or otherwise --
are gerrymandered in that way? If (as I suspect) the answer is "very few"
or even "none that I know of," then this hypothetical constraint is not
worth another moment's thought or energy diverted from the real task at
hand, which is to turn night into day, as soon as possible!

> Here are just some of the
> reasons why re-use and re-distribution rights are vital to open access:
>
> (1) Digital permanence - it is not enough for the publisher to be the only
> body which curates the full archive of published research content. To ensure
> long term digital permanence of the scientific record, it is vital that
> articles should be deposited with multiple archives, and redistributable
> from and between those archives.

It seems to me that this is conflating (arbitrarily) two completely
independent matters. One is toll-free online *access* to the articles
in the 20K journals that are currently only accessible via tolls. The
other is the *preservation* of that toll-based corpus.

Well, preservation of that toll-based corpus was always a concern, in
on-paper days as in on-line days, and that concern has nothing whatsoever
to do with free (or open) access! We could have a failsafe preservation
system without free access, or we could have a failsafe preservation
system with free access; or we could have an uncertain preservation system
without free access (as we do now) or an uncertain preservation system
with free access (bringing the present system out into the light of day).

The preservation burden has to be (and will be, and is being) faced in
any case. Why on earth should that entirely orthogonal longterm
task be coupled in *any way* to the immediate and urgent problem of free
access today? And why should "open access" be linked with or defined in
terms of the eventual solution to the preservation problem, one way or
the other? (This is not an argument for indifference to preservation! It
is an argument for decoupling two completely independent desiderata,
so as not to slow the growth of open access with irrelevant added
burdens.)

> (2) A flexible choice of tools for searching and browsing
> The reason that Google exists is because the web is free for anyone to
> download and index. As a result, there is competition among search engines,
> and Google had the incentive to develop a better system for indexing web
> pages, which has since driven other search engine companies to improve the
> tools they offer.
>
> Compare this with the situation with scientific research. If the research
> resides only on the publisher's site, you don't have a free choice of what
> tools you use to search and browse it - you are stuck with what that
> particular publisher provides you with.

We are quite squarely in the domain of hypotheticals here. (Which
publisher's free-access corpus, inaccessible to google, are we talking
about?) But let us suppose that a publisher does provide free access --
not gerrymandered free access, but free access that allows individual
downloading, saving, grepping and printing:

First, I will bet that such a publisher will want to maximize the
visibility and impact of his journals' contents by allowing at least
the indexing metadata to be harvested, both by google, and by the OAI
search engines specializing in the refereed journal literature.

But even if we get doubly hypothetical here, and suppose the publisher
does *not* disclose the metadata to harvesters, there is
still a super-simple solution: Every author has an online
CV. Their CV will contain the metadata for every one of their
journal publications. (Such CVs can and will be OAI-compliant:
http://paracite.eprints.org/cgi-bin/rae_front.cgi ).
Add the URL for the free-access full-text on the publisher's website to
your CV entry and the circle is closed. (Better still, also self-archive
the full-text in your own institutional OAI-compliant repository!)
End of story.
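
The CV gambit above can be sketched in a few lines. This is only an
illustration, under stated assumptions: the class name CVEntry, the
function to_oai_dc, and the example.org URL are hypothetical and are not
part of paracite or any eprints software. Only the two Dublin Core
namespace URIs are the standard ones used for OAI "oai_dc" records.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Standard namespace URIs for unqualified Dublin Core as used in OAI records.
DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"


@dataclass
class CVEntry:
    """One journal publication on an author's online CV (hypothetical shape)."""
    title: str
    creators: list
    journal: str
    year: int
    fulltext_url: str  # toll-free copy on the publisher's or institution's site


def to_oai_dc(entry):
    """Render a CV entry as a Dublin Core record that a harvester could index."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")

    def add(tag, text):
        ET.SubElement(root, f"{{{DC}}}{tag}").text = text

    add("title", entry.title)
    for creator in entry.creators:
        add("creator", creator)
    add("source", entry.journal)
    add("date", str(entry.year))
    # The link to the free-access full text is what closes the circle.
    add("identifier", entry.fulltext_url)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    entry = CVEntry(
        title="An Example Article",
        creators=["A. Author", "B. Author"],
        journal="Journal of Examples",
        year=2003,
        fulltext_url="https://example.org/articles/example.pdf",
    )
    print(to_oai_dc(entry))
```

The point of the sketch is that the metadata and the full-text URL
travel together, so any harvester that reads the CV can find the
article whether or not the publisher's own site exposes its metadata.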

> This ties in with developments in Grid computing (e.g.
> http://www.escience-grid.org.uk/ ). With open access, published research
> would be available "on tap" via the grid, and scientists would be able to
> use their preferred choice of grid tools to access the data, rather than
> being stuck with the tools provided by the publisher.

As stated above, the CV/OAI gambit already trivially takes care of
closing the circle.

I agree, though, that for many research purposes, it is beneficial to
have not just the metadata but the full-text inverted and indexed, as
well as agent-harvestable. Again, if the publisher's free-access site
doesn't do this, the author's institutional site certainly can and will.
In fact, authors and their institutions are the ones with the most
direct interest in making sure their own research output is maximally
usable in this way.
http://www.ecs.soton.ac.uk/~harnad/Temp/unto-others.html

Let us not, however, conflate article-text archiving with
data-archiving. Data-archiving is important too, but it is an extra:
an independent new bonus of the online era, having nothing to do with
the question of toll-free access to article-texts. In the paper era, raw
data were not published, just summarized in what was published. Eventually
data will no doubt be incorporated into online publications in some way,
but until then there is certainly no need for authors to wait! They
can publish their article, as before, and, in addition, self-archive
the data on which their article is based in their own OAI-compliant
institutional research repository (the same repository in which
the full-text of their article can and should be self-archived too,
whether it appears in an open-access journal, a toll-access journal, or a
toll-access journal that offers toll-free access too). Again, the online
CV can close the circle, if it is not already closed of its own accord.

And this way, although it is functionally independent, data-archiving
can help speed the progress toward toll-free full-text access too.

> (3) Datamining
>
> With a million or so biomedical research articles being published each year,
> the sheer volume of output is an obstacle to the comprehension and synthesis
> of the results reported in that research. If the XML of the articles can be
> brought together in one place then the tools of datamining can be applied to
> it to extract useful but non-obvious information.

Agreed. See above. But before we get carried away with the potential
perks, let's not forget the still absent basics: Let there be Light
(toll-free full-text access), now! Leave the Solar-Energy and Club-Med
projects for when we already have our daily fill of photons.

> The simplest type of datamining is citation analysis
>
> Currently you need to pay ISI a lot of money to find out what cites what,
> but with true open access, citation analysis becomes trivial.

Perhaps not quite trivial. (There's still the problem of parsing,
identifying and linking the citations for all those articles without the
ultimate mark-up. But we're working on it: http://opcit.eprints.org/ ).

But again, this is an independent perk, because you could have universal
citation linking and analysis even *without* toll-free full-text access!
For an article's full reference list, like its indexing metadata (and
its accompanying empirical data), can all be self-archived by the author
(guess where?). We are in fact promoting this solution for royalty-based
books, whose authors, unlike journal article-authors, are unlikely
to want to make their full-texts accessible toll-free. Their metadata
and reference lists, however, are another matter, and can (and will)
be tucked into the institutional OAI-compliant repository too, with a
newfound indicator of global book citation impact as the harvestable
reward. http://www.ariadne.ac.uk/issue35/harnad/
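
To make the point concrete: once reference lists are self-archived,
"what cites what" is just an inversion of those lists. The sketch below
is a minimal illustration, not how citebase or opcit actually work. It
assumes the hard part (parsing each reference and resolving it to an
identifier) has already been done, and the function names and the
identifiers "A", "B", "C" are hypothetical.

```python
from collections import defaultdict


def invert_references(reference_lists):
    """Map each cited item to the sorted list of items that cite it.

    reference_lists maps a citing item's identifier to the identifiers
    in its (already parsed and resolved) reference list.
    """
    cited_by = defaultdict(set)
    for citing, references in reference_lists.items():
        for cited in references:
            cited_by[cited].add(citing)
    return {cited: sorted(citers) for cited, citers in cited_by.items()}


def citation_counts(reference_lists):
    """Count how many distinct archived items cite each item."""
    inverted = invert_references(reference_lists)
    return {cited: len(citers) for cited, citers in inverted.items()}


if __name__ == "__main__":
    # Hypothetical self-archived reference lists: A cites C; B cites C and A.
    archive = {"A": ["C"], "B": ["C", "A"]}
    print(invert_references(archive))
    print(citation_counts(archive))
```

Note that nothing here needs the full texts: the reference lists alone
are enough, which is why the same approach can serve book authors who
self-archive only their metadata and references.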

> So, for example, if you view a PubMed record:
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11667947&dopt=Abstract
> you already get links to all the full text articles in PubMed Central which
> cite that PubMed item
> http://www.pubmedcentral.gov/tocrender.fcgi?action=cited&tool=pubmed&pubmedid=11667947

And if you look at citebase, you will see how this generalizes to the
entire OAI-compliant literature:
http://citebase.eprints.org/cgi-bin/search

> The more true open access research that is published and archived at PubMed
> Central, the more useful this becomes for biomedical researchers. [Sure,
> "screen-scraping" HTML from free articles displayed on publisher sites could
> give some citation information, but with nothing like the ease, accuracy and
> reliability that it can be obtained with the use of XML data, as at PubMed
> Central].

Fine. But I'd rather have toll-free access to all 20K journals right
now than wait for these XML perks -- wouldn't you?

Again, toll-free access is one thing -- and extremely important,
already reachable, and already overdue -- whereas potential perks such
as citation-based navigation are another. Let there be light first;
then we can worry about calibrating the photometers on our Yashicas.

> Beyond citation analysis, there are many other forms of datamining that
> are possible:
> For more information see:
> http://www.biomedcentral.com/info/about/datamining/
>
> e.g. Research articles can be mined for details of protein interactions
> http://bioinfo.mshri.on.ca/prebind/

See above. Right now, it is an indisputable fact that open-access
publishing today (BOAI-2) is the solution only for that 5% of the
literature (of 20K journals) that has a suitable open-access journal
today. The immediate solution for all the rest is self-archiving (BOAI-1),
rather than continuing to wait for more open-access journals to spawn
and grow.

(If, in the meanwhile, toll-access publishers also want to help hasten
things along by providing free access, they are certainly welcome
to do so! I still regret -- for the sake of open access --
that the BOAI http://www.soros.org/openaccess/sign2.shtml?o was
not ready to count it as publisher support of open access if a
toll-access journal supported author self-archiving of their articles
http://www.ecs.soton.ac.uk/~harnad/Temp/rcoptable.gif: *Of course* that
is publisher support for open access! By the same token, I would certainly
consider it as publisher support for open access if a toll-access journal
made its full-text contents publicly accessible online toll-free. Even if
it was gerrymandered full-text access -- as long as they also supported
self-archiving!)

> And as scientific content is increasingly marked up using richer forms of
> semantically meaningful XML (e.g. CML for chemical structures, MathML for
> equations), the value of datamining will continue to increase.

All true. And it will all prevail eventually. But we need free access
*now*. We need just plain light, right now, not to
keep sitting in the darkness holding out for the "true
light." http://www.ecs.soton.ac.uk/~harnad/Temp/che.htm

> The BioLINK group are using BioMed Central's open access corpus as the raw
> material for a datamining competition, designed to stimulate progress in the
> development of tools for biological datamining.
> http://www.pdg.cnb.uam.es/BioLINK/BioCreative_task2.html

That is commendable and welcome. But it must not be forgotten what
percentage of the annual biological journal literature that sample
actually represents. We must not be held back to that small percentage
because we are informed that mere free access is not good enough -- not
"true open access." Such rarefied fussiness does not serve the cause of
either free or open access at this point.

> (4) Derivative works and compilations
> Say that a scientist performs a meta-analysis on a group of published
> clinical trials, and wants to make available the conclusions of that
> research. Or perhaps a datamining researcher has taken a corpus of 1000
> articles on breast cancer, and established some interesting conclusions.

All very welcome and valuable (indeed, inevitable) developments in the
online age. But I'd rather that progress toward free access for all 20K
did not wait for these perks. Indeed, the sooner we have free access,
the sooner the rest will come too.

> In a true open access environment, each is free to post the results of their
> research, *along with* the actual corpus of data which the research was
> based on (effectively, the raw data for that research).
> But in a non-open access environment, that raw data (i.e. the research
> articles) cannot be redistributed, which makes it far more difficult than it
> needs to be for other scientists to reproduce, critique and follow up the
> work.

I am afraid I have to disagree. As already noted above, for ordinary data
-- i.e., data in which it is not the full-texts of articles themselves
that *constitute* the data! -- authors are as free to self-archive (in
their institutional repositories) the empirical database underlying their
toll-access publications as they are to do so with the data underlying
their open-access publications. Data-archiving itself is another
thing for which there is no point sitting around awaiting the era of
universal open-access publishing. Data-archiving will encourage article
self-archiving, and both will hasten the era of universal open-access.

The special case where the empirical data *are* the full-texts of
published articles of course depends on access to those full-texts.
Without free online access, the digital meta-analyses you describe
could not even have been conducted. Now the online perk of being able
to separately archive the underlying data may not be there if the free
access did not allow full-text harvesting, but then whatever software did
manage to penetrate the access-barriers to do the analysis in the first
place could be made accessible to readers and users who wish to check
or replicate the data or analyses. So again, the circle is closed.

> Similarly, a scientist may wish to make a point by assembling a collection
> of certain articles or article fragments (perhaps they wish to assemble a
> comparison of the methods used for a certain technique).
> In an open access world, as long as they cite the sources, they are
> completely free to create and redistribute that compilation. Such a
> selective compilation may in itself be an extremely useful contribution to
> science.

I can't follow this at all. A compilation is a list of articles, whether
online or on-paper, whether toll-access or open-access. If the
full-texts of the articles are *free* access, all the compilation need list
is their URLs. (Ditto for article "fragments": try section number,
paragraph number, or even [yech!] PDF page number.)

> (5) Print redistribution rights - the National Health Service, for example,
> should be able to redistribute thousands of printed copies of an important
> research article (which it may have funded) to its doctors if it wishes to
> do so. It should not have to pay a hefty copyright fee for the privilege.

I have no views on this, but it has nothing to do with open access,
which even in the strict BOAI definition refers to online access, not
to multiple reprinting and redistribution rights. Besides, this is all
becoming moot in the online era: Why redistribute print copies instead of
URLs, if the texts are publicly accessible online toll-free?

(I think it is a big mistake, and clouds the issue, to try to link online
toll-free access arguments with paper-multiprinting rights. Don't forget
that those worthy paper-based arguments would have been just as worthy
in the paper era. So surely they are *not* what has changed in the
online era!)

> Certainly, print redistribution will likely become less significant in the
> future, but there is no logical reason that the scientific community should
> not be free to exchange and distribute the research that it has created in
> print form, as well as online.

The case for multiple reprinting/redistribution rights is *much* weaker
than the case for toll-free online access. Please let us not needlessly
weaken the case for free access by handicapping it with such
supererogatory extra burdens. Free access will erode the need to print,
even as it erodes publisher opposition to printing. But for now, the only
thing that fussing about print "redistribution" rights does is to provoke
needless opposition, to no good purpose. Keep it light, till everyone
sees the light.

Stevan Harnad

NOTE: A complete archive of the ongoing discussion of providing open
access to the peer-reviewed research literature online is available at
the American Scientist September Forum (98 & 99 & 00 & 01 & 02 & 03):

    http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html
                            or
    http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/index.html

Discussion can be posted to: american-scientist-open-access-forum_at_amsci.org
Received on Tue Aug 12 2003 - 00:10:55 BST
