Re: Interoperability - subject classification/terminology

From: Stevan Harnad
Date: Tue, 11 Mar 2003 13:23:30 +0000

On Mon, 10 Mar 2003, David Goodman wrote:

> The reason I suggested classification is that various people in the
> subjects covered have told me that they use this archive by checking
> everything in their subject classification each day, and that the current
> rather straightforward classification suits them fine.

I assume that "this archive" refers to the Physics ArXiv, which is a
global, discipline-based archive. Some users monitor some topics daily
or weekly, and there are ways to accommodate their needs that include a
subject taxonomy. (Whether that taxonomy, and the classification of the
papers within it, is best done, in our online digital era, by human
classifiers and/or authors, rather than by a text-processing algorithm,
is another question.)

I was not referring, however, to global, discipline-based archives,
but to local, institutional archives. For local search and use they
certainly don't need a global taxonomy; and as bits of a harvested
distributed worldwide "virtual archive" they are surely better sorted
and navigated globally by cross-archive search tools than by local
classification schemes.

> People work in various ways, especially for current awareness. One of the
> many virtues of systems such as this is that they can be designed to be
> adaptable to individuals.

The current-awareness alerting system (likewise probably better if based
on text-processing algorithms rather than human classifiers and/or
authors) is not the same issue as the question of whether or not there
is any need to develop a classification system for local institutional
refereed-research output archives. (The Eprints software, for example,
has an alerting capability but no elaborate classification system.)

> I did not mention Boolean full-text searching, only because I assumed it.
> Stevan, would anyone design such a system without it--still, now?

Not only does the Boolean capability come with any inverted digital
full-text index, but (I'm betting) it can beat any human classification
scheme (with the help of the right text-processing algorithms).
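
To make the point concrete, here is a minimal sketch of Boolean search
over an inverted full-text index. It is an illustration only, not the
Eprints or ArXiv implementation: the function names, the toy documents
and their ids are all assumptions made up for the example.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, all_ids, must=(), any_of=(), must_not=()):
    """Boolean query: AND over `must`, OR over `any_of`, NOT over `must_not`."""
    hits = set(all_ids)
    for term in must:
        hits &= index.get(term.lower(), set())
    if any_of:
        hits &= set().union(*(index.get(t.lower(), set()) for t in any_of))
    for term in must_not:
        hits -= index.get(term.lower(), set())
    return hits

# Hypothetical full texts; no subject classification is stored anywhere.
docs = {
    "jcp-1": "Molecular dynamics of neuropeptide folding in solution",
    "br-1": "Neuropeptide Y receptors in the rat hypothalamus",
    "prb-1": "Spin waves in condensed matter systems",
}
index = build_index(docs)
result = search(index, docs, must=["neuropeptide"], must_not=["rat"])
```

The query retrieves papers by what their text contains, whatever journal
or subject heading they happen to sit under.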

> And I remain much less sanguine than you about the ability to accommodate
> all the fields of science -- let alone all academic knowledge -- in a
> single relatively simple system.

In one (local, institutional) archiving system or in one classification
system? I am sanguine about the first (though not necessarily all squashed
into a single university archive: many interoperable departmental ones
will probably work better) whereas I consider the second unnecessary and
a waste of time (beyond a very rudimentary, first-cut classification
scheme): computational algorithms on the full-text should do the rest. Not
human classifiers (including the authors). Remember: we are talking about
journal articles, not books or other works. Who ever searched the journal
literature on the basis of a fixed human classification of it? (And if
they did, how much mileage did they really get out of that taxonomy,
compared to computational sorting based on full-text analysis?)

> Anyone who has ever worked in a library can tell you about the
> unreliability of a rough arrangement by discipline and journal name.

Unreliability for what? Ambulatory, analog search? Of course. But we
are talking about digital data and digital search. Who searches the
journal system by taxonomy rather than, say, boolean word-search?

> What subject is Phys Rev B (Condensed Matter)? or J Chem Phys? or Brain
> Research?

Who cares?

If I am looking for stuff on neuropeptides, my boolean search will
retrieve any papers from the latter two journals regardless, as long as
they contain the indicators my algorithm specifies.

> And if you always remember journal names correctly, I congratulate you but
> wish you weren't unique. All your plans--as is inevitable--are shaped by
> your own preferences. So would mine be, but at least I realize
> it--sometimes.

No need to remember journal names correctly (fuzzy matches can be
fine-tuned -- see http://) and (in my
opinion) no longer any need for any prefabricated a-priori human
taxonomies (in searching the refereed research journal literature) --
though a-posteriori algorithmic ones can be generated on the fly.
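
As a rough illustration of fuzzy matching on journal names, here is a
sketch using Python's standard difflib module. This is an assumption for
the sake of example, not the method behind the link above; the journal
list, the match_journal helper and the 0.6 cutoff are all hypothetical.

```python
import difflib

# Hypothetical list of known journal titles.
JOURNALS = ["Physical Review B", "Journal of Chemical Physics", "Brain Research"]

def match_journal(query, names=JOURNALS, cutoff=0.6):
    """Return the known title closest to `query`, or None if nothing is close."""
    lowered = {name.lower(): name for name in names}  # case-insensitive lookup
    close = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[close[0]] if close else None

best = match_journal("Brain Reserch")  # a misspelled query
```

Raising or lowering the cutoff is the fine-tuning: a higher value
demands a closer spelling before a match is accepted.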

Stevan Harnad