Re: A Search Engine for Searching Across Distributed Eprint Archives from Stevan Harnad on 2004-10-20 (American-Scientist-Open-Access-Forum)

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Wed, 20 Oct 2004 16:06:48 +0100

On Wed, 20 Oct 2004, Donat Agosti wrote:

> Something, which bothers me and doesn't show up in most of the
> discussion of open access, is the construction of search tools across
> digital publications (and potentially millions of pages of legacy
> information). In the end, this will be the real issue, not just reading
> another publication face to face.

The real issue -- and the 1st, 2nd, 3rd and Nth priority today -- is
Open Access (OA) *content*: The full-texts of the 2.5 million annual
articles published in the world's 24,000 peer-reviewed journals are
still not openly accessible online (only about 20% of them are).

It is mere distraction and dreaming to worry about search tools when the OA
content is not yet there for them to search!

Having said that, cross-archive search tools (for the little OA content
we have so far) already *do* exist (and they are already far more powerful
than their sparse content yet deserves!):

    http://oaister.umdl.umich.edu/o/oaister/
    http://citebase.eprints.org/
    http://www.scirus.com/srsapp/

And (I promise you), providing more OA content is guaranteed to inspire
the creation of more and more such tools, with more and more powerful
capacities.

So please, don't worry about more powerful search tools when the cupboards are
still bare: Fill the cupboards and the search tools will come, hungrily!

> What do you think about that? It seems, that the big publishing houses
> are already thinking about that, and that they developed such facilities.

The big publishing houses' cupboards are *not* bare: They have the 100% Toll
Access content on which to provide ever more powerful search tools. Let's provide
100% Open Access content and then watch what happens!

> This of course is one of the most important tools, for data
> mining, extraction, or just finding the right piece of information. It
> also means, that we look beyond self-archived pdf documents to searchable
> documents with some mark up of their logic content included. Any ideas?

Two ideas:

(1) Provide the full-text Open Access content, and the tools for finding, mining
and extracting from it will come with the territory.

(2) The primary target is journal articles, which consist primarily of text. The
most powerful means of text-processing today is full-text inversion. (This is part
of the magic that Google does.) Enhance this with citation-linking (in place
of Google's ordinary linking), plus some hub/authority analysis, citation and
download ranking, co-citation analysis, co-text (semantic/similarity) analysis,
and full-text Boolean search, and I think you will have search capabilities to
surpass your wildest dreams.
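
To make the ingredients in (2) concrete, here is a minimal sketch of full-text
inversion, Boolean search and citation-count ranking. It is only an
illustration under stated assumptions: the toy corpus, the citation links and
every function name are hypothetical, and none of this is the implementation
used by OAIster, Citebase, Scirus or Google.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus (assumption): article id -> full text.
ARTICLES = {
    "a1": "Open access increases citation impact",
    "a2": "Self-archiving provides open access to journal articles",
    "a3": "Citation analysis of journal articles",
}

# Hypothetical citation links (assumption): citing id -> list of cited ids.
CITES = {"a2": ["a1"], "a3": ["a1", "a2"]}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(articles):
    """Full-text inversion: map each term to the set of articles containing it."""
    index = defaultdict(set)
    for doc_id, text in articles.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Boolean AND: articles that contain every one of the terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Boolean OR: articles that contain at least one of the terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def citation_counts(cites):
    """Count how many times each article is cited."""
    counts = defaultdict(int)
    for cited_ids in cites.values():
        for cited in cited_ids:
            counts[cited] += 1
    return counts


def rank_by_citations(hits, cites):
    """Order hits by citation count, most cited first; ties broken by id."""
    counts = citation_counts(cites)
    return sorted(hits, key=lambda doc_id: (-counts[doc_id], doc_id))


if __name__ == "__main__":
    index = build_inverted_index(ARTICLES)
    hits = search_and(index, "journal", "articles")
    print(rank_by_citations(hits, CITES))
```

Hub/authority, co-citation and similarity analysis would be layered on the
same two structures (the term index and the citation graph). The sketch has
only three toy articles to search, which is the point: the tools are the easy
part.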

The only missing element is the content. Please let's not forget that and
lapse into Oneirology instead of Open Access Provision!

Stevan Harnad
Received on Wed Oct 20 2004 - 16:06:48 BST