Re: Blackwell Publishing & Online Open

From: Matthew Cockerill <matt_at_BIOMEDCENTRAL.COM>
Date: Tue, 8 Mar 2005 19:02:55 -0000

On Mon, 7 Mar 2005, Stevan Harnad wrote:

> -----Original Message-----
> From: American Scientist Open Access Forum
> Behalf Of Stevan Harnad
> Sent: 07 March 2005 21:25
> Subject: Re: Blackwell Publishing & Online Open
> (2) It is not at all clear, however, what reprinting, redistribution
> or re-use is *needed* for an article whose full-text is permanently
> available for free online to any user, any time, anywhere. (This seems
> to me to be based on obsolete paper-based thinking in an age of online
> access.)
> (3) As to online "datamining": I am not sure what you have in
> mind. The
> full-text is online, free, downloadable, and analyzable, by anyone,
> anywhere. It just may not be republished by a third party.

What I am saying is that there is huge scope for the scientific community to
'add value' to published scientific research in various ways.
Adding value to the literature is made dramatically easier (and is therefore
much more likely to happen) if the scientific research concerned can be
downloaded in a structured XML form. And similarly, many of these forms of
"added value" are both dramatically easier to implement, and dramatically
more useful in practice, if they are not hobbled by the constraint that the
original article/derivative versions may not be redistributed but can only
be linked to.

I can perhaps clarify what I mean by "added value" with an analogy.

Take the new service Google Maps.

This is basically built from 2 resources:
 1. Structured vector map data, along with zipcode/street address data,
licensed by Google from NAVTEQ and TeleAtlas
 2. Web page information harvested by Google robots (which happens to
frequently include zip code and/or address information)

Both of these resources have been available for years. Other mapping sites use
the very same NAVTEQ/TeleAtlas data, which is comprehensive, available in a
standard structured format, and reasonably priced, given that mapping is a
competitive market and someone could always collect the map data themselves
and undercut NAVTEQ if it grossly overcharged. Web page information is freely
accessible to robot webcrawlers, and search engines have been making use of it
since the mid-1990s.

What Google has done is add value by using its technical know-how to build a
powerful user interface which combines mapping and search technology, so
that you can do a search for anything you like, in combination with a named
location (e.g. swimming pools in Palo Alto, CA, or Buddhist temples in
Omaha, NE), and it will then show the closest relevant search results from
across the web, dynamically overlaid onto a map at an appropriate level of
magnification. This is a wonderful, innovative service, and it adds value to
the underlying NAVTEQ data in a way that goes beyond what previous licensees
of the data had managed to achieve.

Now compare this to the situation with the scientific literature. Imagine
that you have an equivalently impressive idea for adding value to that
literature: how do you go about turning it into a reality?

You can't easily license the data, because it is controlled by hundreds of
different publishers, each of whom you need to approach separately. Those publishers
have a monopoly on the research they have published, since you can't get
those research articles from anyone else, and so they can charge you
whatever they think they can get away with - there's no effective
competition. And the publishers don't even store the data in a common
structured format, so even if you succeed in getting funding and signing
licensing deals, or rely on spidering their website, you don't end up with
the structured information that you need to make the most effective use of
the data.

Structured information includes: reliably and consistently formatted
bibliographic citation information, mathematical formulae, chemical
structures, figure legends, author affiliation information, and so on. Sure,
much of this can be partially reverse-engineered from an unstructured
version of the document, but this process is complex, error-prone and a
major unnecessary hurdle to re-use. The more hurdles there are, the less
chance there is that any given idea for adding value will see the light of
day. It's not simply a question of whether something is possible - we need
to be making it *easy* to add value to the data.
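A toy illustration of the difference, assuming a simple invented article schema (the element names below are hypothetical, not any publisher's actual DTD): with structured XML each field is read directly, while from flat text the same fields must be reverse-engineered with brittle pattern matching.

```python
import xml.etree.ElementTree as ET
import re

# A hypothetical fragment of structured citation XML (invented schema,
# not any real publisher's DTD), and the same citation as flat text.
xml_fragment = """
<citation id="B1">
  <author><surname>Smith</surname><given>J</given></author>
  <title>On the folding of membrane proteins</title>
  <journal>J Mol Biol</journal>
  <year>2003</year>
</citation>
"""

flat_text = "1. Smith J: On the folding of membrane proteins. J Mol Biol 2003, 327:735."

# With structured XML, each field is unambiguous: just read the elements.
cite = ET.fromstring(xml_fragment)
record = {
    "surname": cite.findtext("author/surname"),
    "journal": cite.findtext("journal"),
    "year": cite.findtext("year"),
}
print(record)  # {'surname': 'Smith', 'journal': 'J Mol Biol', 'year': '2003'}

# From flat text, the same fields must be guessed with a brittle regex
# that breaks as soon as the citation style varies even slightly.
m = re.search(r"^\d+\.\s+(\w+)\s+\w+:.*?\.\s+(.+?)\s+(\d{4}),", flat_text)
print(m.groups())  # ('Smith', 'J Mol Biol', '2003')
```

Both routes recover the fields here, but only because this one citation happens to match the pattern; the XML route works for every article marked up in the schema.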

The irony in the comparison with NAVTEQ is that NAVTEQ spends a lot of money
to collect the data and yet this proprietary data ends up being far more
available for re-use than the traditional scientific literature, even though
the scientific literature is freely contributed by authors, who have
nothing to gain from it being restrictively licensed, and everything to gain
from wide re-use. It is a fantastic achievement on the part of traditional
publishers that they have got away with this for so long!

So anyway, here is a very partial list giving some specific examples of how
value can be added to scientific research articles.
Stevan, I know that you will object that some of these things could
theoretically be achieved without access to the XML, and without rights to
redistribution, but my point is not that these things are *impossible* in
the absence of structured XML and rights for reuse/redistribution, but that
all are made immeasurably *easier* if freely re-distributable structured
data is available. The practical result of this is that making the
re-usable and re-distributable XML available has an immensely stimulating
effect on innovation.

* Digital archives
The University of Potsdam is one of several institutions which maintain a
comprehensive local mirror of BioMed Central's open access content:
Similarly, OhioLink integrates copies of all BioMed Central open access
research articles into their local archive.

* Image mining/reuse
Several groups are working with BioMed Central's collection of open access
figures. One group plans to use the images and their legends as part of a
free searchable collection of biological images. Another is developing image
processing techniques to extract useful information from vector-based images.

* Text mining
Several different text mining groups are experimenting with applying
advanced information retrieval techniques to BioMed Central's XML data.
The UK e-Science program recently created a National Centre for Text Mining:
The following article by the directors of the centre is not a bad place to
start if you want to know what text mining is, and why it is important:

At the simple end of the text mining spectrum, consider the Collexis
concept-based indexing/search tool, a basic version of which is used by the
e-BioSci prototype. Collexis's concept-based search, amongst other things,
is multi-lingual - it allows you to use another language (e.g. French, or
German) to query an English language corpus of documents (and vice versa).
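A minimal sketch of the principle behind concept-based indexing, using a toy thesaurus and invented documents (this illustrates the idea only, not Collexis's actual implementation): terms in any language map to language-neutral concept identifiers, documents are indexed by concept, and a query in one language retrieves documents written in another.

```python
# Toy multilingual thesaurus: terms in several languages map to the
# same language-neutral concept ID. All entries here are invented.
THESAURUS = {
    "heart": "C001", "coeur": "C001", "herz": "C001",
    "cancer": "C002", "krebs": "C002",
    "kidney": "C003", "rein": "C003", "niere": "C003",
}

def concepts(text):
    """Map each recognised term in the text to its concept ID."""
    return {THESAURUS[w] for w in text.lower().split() if w in THESAURUS}

docs = {
    "doc1": "heart disease and kidney failure",
    "doc2": "advances in cancer therapy",
}

# Build an inverted index from concept ID to document IDs.
index = {}
for doc_id, text in docs.items():
    for c in concepts(text):
        index.setdefault(c, set()).add(doc_id)

# A French query ("coeur") retrieves the English document about the heart,
# because both sides are reduced to the same concept ID first.
query_concepts = concepts("coeur")
hits = set().union(*(index.get(c, set()) for c in query_concepts))
print(hits)  # {'doc1'}
```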

* Antbase
Antbase brings together taxonomic descriptions of thousands of ant species
into a single database.
As Donat Agosti has pointed out in this forum, Antbase is severely hampered
by the restrictions on re-use and re-distribution which cover much of the
taxonomic literature. It would also benefit greatly from the availability of
structured XML for research articles, especially if that XML were to include
a standardized representation of taxonomic information.

* PubMed's Bookshelf
This NCBI system adds value to scientific textbooks by slicing them up
into fragments, and then using textual analysis to associate those fragments
with the key terms that they relate to. As a result, when searching
PubMed and viewing an abstract, you can click the 'Books' link to highlight
phrases for which textbook links are available, to clarify the concept.
For an example, see:
This whole system relies on having structured XML data from the publisher.
It uses textbooks that have been made available with the consent of the
publisher, so it is not directly a case of the reuse of open access
research. But it *is* a demonstration of how much value you can
add if you are allowed access to the XML and given the right to
redistribute the content.

* Specialist tools for searching particular types of structured information
BioMed Central articles will soon contain embedded MathML versions of all
mathematical formulae.
How best to search this type of content is an active area of research:
The availability of open access research articles containing embedded MathML
will mean that researchers will be able to test out their math-searching
techniques on real-world data, and to make MathML-enhanced search available.
[The same will apply for other emerging domain-specific dialects of XML that
can be embedded within articles - e.g. CML for Chemistry]
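To make the idea concrete, here is a toy sketch of structure-aware math search over embedded MathML, using invented article fragments (the search technique shown is just one simple possibility, not any particular research group's method): because the markup preserves structure, one can search for "any expression raised to the power 2" rather than merely matching characters.

```python
import xml.etree.ElementTree as ET

# Two invented article fragments with embedded presentation MathML.
NS = {"m": "http://www.w3.org/1998/Math/MathML"}

articles = {
    "artA": """<body xmlns:m="http://www.w3.org/1998/Math/MathML">
                 <m:math><m:msup><m:mi>x</m:mi><m:mn>2</m:mn></m:msup></m:math>
               </body>""",
    "artB": """<body xmlns:m="http://www.w3.org/1998/Math/MathML">
                 <m:math><m:mfrac><m:mi>a</m:mi><m:mi>b</m:mi></m:mfrac></m:math>
               </body>""",
}

def has_square(xml_text):
    """True if the fragment contains any expression raised to the power 2."""
    root = ET.fromstring(xml_text)
    for sup in root.iterfind(".//m:msup", NS):
        exponent = sup.find("m:mn", NS)
        if exponent is not None and exponent.text == "2":
            return True
    return False

matches = [aid for aid, xml in articles.items() if has_square(xml)]
print(matches)  # ['artA']
```

A plain-text search for "2" would match almost everything; the MathML structure is what makes the query precise.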

* Biological database linking, as done by PubMed Central
Since PubMed Central holds a central collection of Open Access content in a
standard XML format, it is able to add a huge amount of value through linking
against the NCBI's biological databases. For example, all mentions of small
molecules (referred to by any common alias) within the text of the article
are identified and linked to the relevant compounds in NCBI's small molecule
database (PubChem), allowing the user to then find similar compounds, and to
follow the links to find other articles which relate to those similar
compounds - e.g. this article on COX-1 and COX-2 inhibitors links to the
following 'substances'.
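The alias-matching step can be sketched as simple dictionary-based entity linking (the aliases and compound IDs below are illustrative placeholders, not real PubChem records, and PubMed Central's actual pipeline is certainly more sophisticated):

```python
import re

# Toy alias dictionary mapping common names to a compound identifier.
# The aliases and IDs here are invented, not real PubChem records.
ALIASES = {
    "aspirin": "CMPD-1",
    "acetylsalicylic acid": "CMPD-1",
    "celecoxib": "CMPD-2",
}

# Longest aliases first, so multi-word names win over their substrings.
pattern = re.compile(
    "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True)),
    re.IGNORECASE,
)

def link_compounds(text):
    """Replace each recognised compound mention with a pseudo-link."""
    return pattern.sub(
        lambda m: f"[{m.group(0)}]({ALIASES[m.group(0).lower()]})", text
    )

sentence = "Celecoxib was compared with aspirin in the trial."
print(link_compounds(sentence))
# [Celecoxib](CMPD-2) was compared with [aspirin](CMPD-1) in the trial.
```

Against structured XML this lookup is run over clean article text; against scraped HTML, the same step first has to fight through navigation menus, footnote markers, and layout debris.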

Again, I'm not saying that all the above is *impossible* with self-archived
material in unstructured form. What I am saying is that it is all dramatically
easier if the final official XML version is open access and redistributable.
The explicit right of authors to self-archive their manuscript, and "free
access but only on the publisher's website" are certainly incremental steps
forward, and should be welcomed. But there's a whole lot more that will be
achievable once the bulk of the scientific literature is fully open access
in the Bethesda/Berlin/Budapest sense (i.e. available in a structured form,
and fully redistributable).

> (4) Moreover, both these publishers are *green*, which
> means that the
> author can self-archive his own final, revised, peer-reviewed draft as
> a supplement on his own website, including any enhancements
> (e.g., XML)
> that he may wish to add to it.

But in the long term, is it really the best use of resources invested by the
funder/employer for researchers to spend their time self-archiving their
research, identifying any significant changes which happened during
post-acceptance proofing and copy-editing, reconciling their various
versions, coordinating XML markup etc? Surely this can far more efficiently
be taken care of as an inherent part of the publishing process, rather than
being tacked on the end?

Matthew Cockerill Ph.D.
Technical Director
BioMed Central

Received on Tue Mar 08 2005 - 19:02:55 GMT

This archive was generated by hypermail 2.3.0 : Fri Dec 10 2010 - 19:47:48 GMT