Re: Interoperability - subject classification/terminology

From: Stevan Harnad
Date: Wed, 19 Nov 2003 19:29:32 +0000

On Wed, 19 Nov 2003, Chris Korycinski wrote:

> > But we are not talking here about books or book-indexing! We are
> > talking about the annual 2.5 million full-text refereed-journal
> > articles.
> ... in subjects outside science, remember.

I understand fully. My bet (that inverted full-text boolean search is
all that is needed to navigate the entire refereed-journal corpus, all
24,000 journals' worth, and would beat human classification any day)
does apply to all disciplines, both science and non-science.

But my bet does not apply to books and book indexes (although -- without
wagering! -- I do believe that software-based indexing and navigation
will prevail there too, as "semantic-web" tools grow and improve for
navigating large text corpora -- in any/every domain).

> My original comments apply just as well to articles as to books - many
> of the 'books' she works on are papers or conference proceedings.

If a set of articles is gathered together and published as an indexed
book, then that is a book! Nolo contendere. Book users don't have online
powers over the book; moreover, navigating just a single local book is a
much narrower and more focussed task than navigating the whole of the
journal literature in a field.

But please don't forget that journals never had (and never needed) subject
indices, the way books do, and one of the reasons is probably that,
apart from happening to have been accepted for the same issue, their
articles don't have much to do with one another -- and if they *were*
gathered into a book as a collection, they wouldn't be in the *same*
book, but many different, topic-specific ones.

Journal article space never had or needed a subject index in paper days;
a fortiori, it needs one even less in online days, with the possibility of
boolean inverted-text searching (as well as other digital prestidigitation,
such as similarity matching, latent semantic indexing, citation-linking,
citation-ranking, download-ranking, co-citation analysis, etc.).
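To make "boolean inverted-text searching" concrete, here is a minimal sketch
in Python. It is only an illustration, not any particular archive's
implementation: the three sample "articles", the document ids, and the helper
names (tokenize, build_index, search_and, search_or, search_not) are all
hypothetical, and the tokenizer is deliberately naive (lowercased
alphanumeric runs, no stemming).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (naive, no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Ids of documents containing every one of the given terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index, *terms):
    """Ids of documents containing at least one of the given terms."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_not(index, all_ids, term):
    """Ids of documents that do not contain the given term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical sample corpus, standing in for full article texts.
docs = {
    "a1": "Latent semantic indexing for journal article retrieval",
    "a2": "Citation ranking of refereed journal articles",
    "a3": "British foreign policy and the Maastricht treaty",
}
index = build_index(docs)

journal_and_citation = search_and(index, "journal", "citation")
maastricht_or_latent = search_or(index, "maastricht", "latent")
not_journal = search_not(index, docs, "journal")
```

With this toy corpus, the AND query matches only a2, the OR query matches a1
and a3, and the NOT query matches only a3. Note that "article" and "articles"
are distinct terms here; a real system would normally add stemming, and the
other tools listed above (similarity matching, citation-linking, and so on)
would sit on top of an index of this kind.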

> The reality is that these areas are intrinsically different as they often
> (I'm by no means saying always!) deal with concepts/points-of-view
> rather than facts. And concepts lie closer to the realms of metadata
> and hence are intractable by naive and simplistic schemes such as
> keyword/inverted file indexing.

I'm not sure I disagree. I agree that book-space, especially in some subjects,
needs something more than just keyword and inverted full-text searching: But my
guess is that that something more will turn out to consist of further
text-analytic software tools.

But if by metadata you mean that human judgment will have to do the tagging and
sorting, as in human indexing days, I doubt it (though I make no bets, outside the
one area I am pretty sure about: the annual 2,500,000 articles in the planet's
24,000 refereed journals -- across all disciplines and languages).

> Have a look at the example I gave... it was edited out of my posting!

Apologies. Here it is again:

> It is concepts, not words people want. The same concept is often expressed in
> different words, or, to take another example: "Major announced in Westminster that
> Maastricht was totally unacceptable".
> Is this about Westminster? Majors? The Netherlands? No. Try "British foreign
> policy" or something similar (depending on the thrust of the book).

My guess? This particular example, and countless others like it, are already a piece
of cake for some of the more sophisticated digital-text processors I mentioned.

> Believe me, this is a simple example compared to many sociological or philosophical
> texts and any inverted-file style of 'indexing' would produce complete rubbish.

Indexing, yes, but software text-analyzers? I don't suggest you make any wagers!

But my bet about the refereed journal corpus stands!

Cheers, Stevan

Some references on LSI and SW:
Received on Wed Nov 19 2003 - 19:29:32 GMT
