Re: Comparing the Wellcome OA Policy and the RCUK (draft) Policy from Leslie Carr on 2005-05-20 (American-Scientist-Open-Access-Forum)

From: Leslie Carr <lac_at_ecs.soton.ac.uk>
Date: Fri, 20 May 2005 09:54:00 +0100

I welcome the Wellcome stance on OA archiving, and like Stevan,
believe that the issue at stake is one of strategy. After all, since
the 1999 formation of the Open Archives Initiative, repositories have
been built with an eye to interoperability because it is recognised
that they operate in a larger context than any single repository can
accommodate. Even Robert's remarks on subject-based science can seem
parochial in a world of increasing inter- and multi-disciplinarity!
From a pragmatic point of view I know that my repository can put
procedures in place to harvest Southampton University papers and
metadata from PMC so that they appear in our IR's record of our
institution's research (appropriately slaved to the PMC versions to
avoid wanton version proliferation).
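
As a rough sketch of what such a harvesting procedure might look like,
the Python below builds an OAI-PMH ListRecords request and pulls
identifiers and titles out of a response. The verb, parameter names and
namespaces come from the OAI-PMH 2.0 protocol; the base URL, the set
name and the sample response are placeholders made up for illustration,
not PMC's actual service details. A real harvester would also need to
follow resumption tokens and handle errors.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, trimmed to a single record for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


if __name__ == "__main__":
    # Placeholder endpoint and set name, not a real service.
    print(list_records_url("https://repository.example.org/oai",
                           set_spec="southampton"))
    print(parse_records(SAMPLE_RESPONSE))
```

The harvested identifiers are what the local record would point back
to, so that the IR copy stays slaved to the source version.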

But there are larger issues that Wellcome's position opens up, so
bearing in mind that I agree with Robert on all but strategy, I'd
like to explore some of his comments.

On 19 May 2005, at 18:36, Terry, Mr Robert wrote:

> it is important to remember that the Trust operates globally
> supporting 4000 researchers in more than 40 countries - we need a
> repository that meets all our needs today and PMC offers that.
I think that this is the heart of the matter - Wellcome has its own
specific requirements and it wants to control the software (and the
repository ingest process and hence the information) to ensure that
those requirements are filled.
As we shall see below, Wellcome's requirements fall into two
categories: document preservation and data integration. I don't
believe that "a separate funding body archive" is necessary to
fulfill either of these requirements.

> we want a long-term digital archive (i.e. not Word or PDF files but
> XML files) that will integrate the research literature with the data.
> We fund research from a scientific perspective, not its
> geographical location, and we want to ensure that when the
> literature is searched the search engine can go deeper than the
> metadata and provide links between, for example, genome sequence,
> chemical compounds or MRI scan images embedded in an article and
> databases such as PubChem and Genbank. It will move between the
> databases and PMC and vice versa - a Japanese or French team
> working on a gene but not publishing in English will be able to
> discover other research groups working on the same sequence. Teams
> working on drug compounds but investigating different uses will be
> able to discover who else is working on that compound either by
> searching the literature or the database.

As I understand it, the PMC ingest process involves the translation
of the submitted document into an XML-based format (with necessary
rounds of manuscript checking and reviewing by authors). Of course
this XML is only used internally - visitors to PMC see an HTML
presentation generated from the XML sources. With regard to research
data, it would appear that PMC takes data "as is" in whatever format
the author supplies (sometimes XML, more often Word, tar.gz bundles,
perl scripts etc). While I applaud the effort to have material stored
in XML, a look at the current contents of PMC shows that many of the
"Supplementary Materials" are provided in Word form.
Even those that are provided in XML have no DTD, Schema or
instructions on how they should be interpreted.

As far as Open Access is concerned, all of this functionality is
available from any Institutional Repository (whether it's EPrints or
DSpace). The storage of multiple formats of a document (text, HTML,
XML, PDF, Word, RTF, LaTeX, PostScript, JPEG, MPEG) can be
accommodated, as can the storage of supplementary research data. What
the repositories lack is the editorial processes to support document
translation - and Wellcome could easily provide that as a separate
service to interoperate with any archive.

[[ Also as far as Open Access is concerned, can I just say that
certainly PDF and probably Word are pretty much as good as "XML".
What gets lost in some "long term archiving" discussions, is that
there is no such document format as "XML". It is a meta-language for
defining document vocabularies. Even then (when you use a specific
DTD or Schema to enforce a particular grammar on your documents'
structure) all you have is a well-formed but (literally) meaningless
tree. What is required is a way of applying some sort of
interpretation to this tree (e.g. a way of rendering the document
onto a screen using CSS or XSL stylesheets to convert into HTML or
XSL-FO) and it is there that the complexity starts to come in. XML,
PDF and Word formats all rely on the existence of documented ways of
interpretation, and available software renderers. PDF and Word have
these, from various organisations. We are not living in the 1980s any
more, when formats were opaque and interoperability was a dirty word! ]]
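
To make the point concrete, here is a minimal Python sketch. The
element names and the rendering table are invented for illustration
and do not correspond to any real DTD or to PMC's internal format. The
parser happily returns a tree, but nothing in that tree says what a
"heading" is; the meaning only arrives with the separate mapping that
turns it into HTML.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up vocabulary: well-formed XML, but with no inherent meaning.
DOC = "<paper><heading>Results</heading><para>Yield was 42%.</para></paper>"

# Hypothetical interpretation: how this vocabulary should be rendered.
RENDER_AS = {"heading": "h1", "para": "p"}


def outline(elem, depth=0):
    """List the bare tree: element names only, no meaning attached."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


def to_html(elem):
    """Render the children of elem to HTML using the mapping above."""
    parts = []
    for child in elem:
        tag = RENDER_AS.get(child.tag)
        text = escape(child.text or "")
        # Elements without a rendering rule fall back to plain text.
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


if __name__ == "__main__":
    tree = ET.fromstring(DOC)
    print("\n".join(outline(tree)))
    print(to_html(tree))
```

Swap RENDER_AS for a different table and the same tree yields a
different document, which is exactly where the complexity lives.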

XML is DEFINITELY a thousand times preferable to Word or PDF when you
need to reuse (reformat, republish, repurpose) a document. I can
imagine situations where it will be useful to do this in an OA
context (e.g. representing papers for small, handheld devices), but
that is providing an added value service on top of Plain Old Open
Access.

> PMC already offers this functionality and that's vital to enhance
> the potential that the Internet offers.
Please excuse my unfamiliarity with PMC - can you give an example of
a PMC entry showing this integration (i.e. beyond listing supplementary
materials)?

> The life sciences have already moved beyond the need to read a word
> document on a local website
I definitely agree with you! And more - it's not only the life
sciences. It's all the experimental sciences. And engineering. And
social sciences.

> Institutional repositories may never offer the same degree of
> functionality until every single institution uses the same
> ingestion and storage system
You are thinking in terms of monolithic and centrally controlled
software. In the web-based, distributed and interoperable environment
in which we find ourselves, I could easily deposit my research
articles inside my Plain Old Institutional Repository and my research
data inside my Learned Society's Advanced Chemistry-Aware Repository,
and have the scientific record seamlessly and automatically tied
together. Document and data. Measurement, analysis and
interpretation, all interoperable, all open for scrutiny and use.
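
One way this tying together could work, sketched in Python under
loudly stated assumptions: each repository exposes simple metadata
records, and a dataset record carries a "relation" field holding the
identifier of the article it supports. The record layout, field names
and identifiers below are hypothetical, chosen only to illustrate the
idea of linking across repositories by shared identifiers.

```python
from collections import defaultdict

# Hypothetical records, as harvested from two separate repositories.
ARTICLES = [
    {"id": "oai:ir.example.ac.uk:101", "title": "Structure of compound X"},
    {"id": "oai:ir.example.ac.uk:102", "title": "A survey of repositories"},
]
DATASETS = [
    {"id": "oai:data.example.org:9001",
     "relation": "oai:ir.example.ac.uk:101"},
]


def link_records(articles, datasets):
    """Map each article identifier to the datasets that point at it."""
    by_article = defaultdict(list)
    for dataset in datasets:
        by_article[dataset["relation"]].append(dataset["id"])
    # Articles with no related data map to an empty list.
    return {a["id"]: by_article.get(a["id"], []) for a in articles}


if __name__ == "__main__":
    for article_id, data_ids in link_records(ARTICLES, DATASETS).items():
        print(article_id, "->", data_ids)
```

Neither repository needs to run the other's software; each only has to
expose identifiers that the other side can harvest and resolve.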

> OAI only links the metadata to files that might be in Word or PDF
> which may be unreadable in the years to come.
There is indeed no constraint within OAI on the formats in which its
items are to be provided. However, PDF documents could only become
unreadable if all the public PDF specifications were systematically
destroyed. (And no-one had bothered to create a translation program
from PDF to the majority formats of the day.)

There is a lot of work being undertaken in these topics by various
projects. The JISC-funded E-Bank project
(http://www.ukoln.ac.uk/projects/ebank-uk/), of which UKOLN and
Southampton are partners, is producing the kind of integration between
data and document that you
are describing, precisely for supplementing Institutional
Repositories. In particular, the project is taking the view that the
data format must be well-understood, and that it must be exposed to
harvesters to allow chemistry-specific searching. The new JISC
Digital Repositories programme will soon have a raft of related
data-based repository work.

Despite my comments about PDF and Word, I agree with Robert that
repositories should be managed with an understanding of preservation!
Our repository has a cheap policy of "including at least one safe
format" whereas Wellcome has a relatively expensive conversion
process in place. In the end we disagree about which formats are,
practically speaking, safe. I applaud Wellcome for putting their
money where their mouth is and providing a service.
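
A policy of "including at least one safe format" is cheap partly
because it is easy to check mechanically. The Python sketch below is
an illustration only: the function name is made up, and the set of
"safe" formats is an assumption for the sake of the example, since
which formats count as safe is precisely the point of disagreement.

```python
# Assumption: this list is a placeholder for a locally agreed policy.
SAFE_FORMATS = {"pdf", "html", "xml", "txt"}


def has_safe_format(filenames, safe_formats=SAFE_FORMATS):
    """Return True if at least one deposited file is in a safe format."""
    for name in filenames:
        # Take the text after the last dot as the file extension.
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext in safe_formats:
            return True
    return False


if __name__ == "__main__":
    print(has_safe_format(["paper.doc", "paper.pdf"]))
    print(has_safe_format(["paper.doc"]))
```

A deposit form could run a check like this at submission time and ask
the author for an additional format only when it fails.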

BUT, that service could easily be made to work within a network of
institutional repositories.
ALSO, the data integration could be made to work within a network of
institutional repositories.

So we're back to strategy, because there is no technical barrier
against Wellcome's policy working with Institutions.

Finally, I hope that Robert will accept an invitation to visit the
EBank project and to discuss the nature of scientific communication
and the advantages that our respective repositories can offer scientists.

---
Dr Leslie Carr
Eprints Technical Director
EBank Project partner

Received on Fri May 20 2005 - 09:54:00 BST
