Re: subject classification from Stevan Harnad on 2008-06-25 (American-Scientist-Open-Access-Forum)

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Wed, 25 Jun 2008 12:54:04 +0100

On Wed, 25 Jun 2008, Neil Godfrey wrote:

> Can you clarify what you mean by "human" in this context -- as opposed to
> what?

I mean a human being (whether the author or a librarian) hand-tagging
a text in terms of a prefabricated subject classification scheme.

> I would be the last person to propose the entrenchment of a dinosaur,
> as I believe I demonstrated with my statements, "Whether they [subject
> classifications] should be used at the level I think you are addressing is
> another question. . . . . .

My own concern is very simple: My target content (peer-reviewed journal
article full-texts) is currently about 90% absent from IRs:
http://elpub.scix.net/cgi-bin/works/Show?178_elpub2008

The rationale behind the OAI-PMH was to minimize tags and keystrokes, so
as to maximise content provision.

Hence it is imperative that content-provision (refereed journal
articles, self-archived in IRs) is not slowed by requiring, requesting,
or expecting any needless keystrokes.

Hand-tagging in terms of any pre-fabricated classification scheme means
unnecessary keystrokes. The classification tagging, if any, can be done
by software, after deposit, either at the local IR level or the global
harvester level, or both.

> To what extent, for what purposes, and with what
> rationale is again another question", and my record for promoting RDF
> systems which replace any need for metadata schema. The problem remains
> that "human" systems will continue to use a word like "plant" (as per my
> last post) that can have X number of meanings falling under disparate
> categories.

I assume you mean human users. Those are the ones that the
machine-classification of full-texts -- together with sensible human
boolean search -- is meant to help. (Both should help robot searchers
and data-miners too.)
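
A minimal sketch of how Boolean full-text search can separate the senses
of an ambiguous word like "plant". The three sample texts, the document
ids, and the function names are invented for illustration; a real
harvester would use an inverted index rather than a linear scan.

```python
import re

def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """Return sorted ids of documents matching an AND / OR / NOT query.

    docs maps a document id to its full text (hypothetical sample data).
    """
    all_of = [t.lower() for t in all_of]
    any_of = [t.lower() for t in any_of]
    none_of = [t.lower() for t in none_of]
    hits = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        if not all(t in words for t in all_of):
            continue  # an AND term is missing
        if any_of and not any(t in words for t in any_of):
            continue  # none of the OR terms is present
        if any(t in words for t in none_of):
            continue  # a NOT term is present
        hits.append(doc_id)
    return sorted(hits)

# Hypothetical full texts, shortened to one line each.
docs = {
    "a1": "Photosynthesis rates in a tropical plant species",
    "a2": "Emissions from a coal-fired power plant",
    "a3": "Plant growth under elevated carbon dioxide",
}

# "plant AND NOT (power OR coal)" keeps only the botanical sense.
botanical = boolean_search(docs, all_of=["plant"], none_of=["power", "coal"])
```

Here the query terms, not a prior subject tag, do the disambiguating:
the searcher adds or excludes words until the unwanted sense drops out.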

> In response to the 2 case studies you raise:
>
> (1) -- the IR at the USQ began by depositing its graduate engineering
> projects, thinking they would have little/zilch interest beyond their
> local institution. But that proved not to be the case. They surprisingly
> had a considerable global interest. How to explain that in your model?

I didn't say local IR contents would not have interest beyond the
institution. I said local IR search would not have interest beyond
the institution. IRs are searched globally, at the harvester level.
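
A minimal sketch of the harvester side, assuming each IR exposes a
standard OAI-PMH endpoint. The repository base URLs are placeholders and
the helper name is invented; "verb", "metadataPrefix" and "from" are
request arguments defined by OAI-PMH itself.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL for one repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

# Placeholder endpoints; a real harvester would keep a registry of IRs.
REPOSITORIES = [
    "https://ir.example-university-a.edu/oai",
    "https://ir.example-university-b.edu/oai",
]

# One request per repository; the harvester merges the responses into a
# single index, and that pooled index is what users actually search.
harvest_urls = [list_records_url(base, from_date="2008-06-01")
                for base in REPOSITORIES]
```

The depositor types nothing extra for this to work: the harvester pulls
whatever minimal metadata the IR already exposes.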

> (2) -- what do you mean by "post-classification schemes"?

Hand-based pre-classification means you have the (possibly
extensible) subject categories, and you hand-tag the document in terms
of them. Computational post-classification means software has the
document plus the (possibly extensible) subject categories, and
automatically tags the document.
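
A minimal sketch of computational post-classification, assuming each
subject category is described by a handful of cue terms. The categories,
cue terms, and threshold are invented for illustration, and cue-term
matching is only a stand-in: a production system would more likely use a
statistical text classifier trained on already-classified documents.

```python
import re

# Hypothetical, extensible subject categories with their cue terms.
CATEGORIES = {
    "botany": {"plant", "photosynthesis", "species", "leaf"},
    "energy": {"plant", "power", "coal", "emissions"},
}

def auto_tag(text, categories=CATEGORIES, min_hits=2):
    """Tag a deposited text with every category whose cue terms it shares.

    A category is assigned when at least min_hits distinct cue terms
    occur in the text, so one ambiguous word alone decides nothing.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(name for name, cues in categories.items()
                  if len(words & cues) >= min_hits)

# Runs after deposit, with no keystrokes from the depositor.
energy_tags = auto_tag("Emissions from a coal-fired power plant")
botany_tags = auto_tag("Photosynthesis rates in a tropical plant species")
```

Adding a category later only means adding an entry to the table and
re-running the tagger over the existing texts, which is not possible
with tags that were assigned by hand at deposit time.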

> I would like to agree with you, if you are implying that the fallible
> ("human"?) can be removed from the equation.

Yes, that is all I am saying. (Please forgive me if I have misunderstood
what is being discussed: I have no quarrel with machine classification
of texts -- local or global; just with hand-tagging.)

Stevan Harnad

> 2008/6/25 Stevan Harnad <harnad_at_ecs.soton.ac.uk>:
>
> > This topic keeps recurring, year after year (try the google search
> > "site:listserver.sigmaxi.org amsci boolean full-text classification" or
> > "site:users.ecs.soton.ac.uk amsci boolean full-text classification") and
> > it's always the same two points:
> > (1) No, the limited local output of a single university collected in its
> > Institutional Repository will not be searched directly by either external
> > or internal users except for rare local administrative or actuarial
> > purposes. Consisting largely of one institution's own journal article and
> > thesis output, it will not be searched by scholars worldwide desperate to
> > know what University X has to say about global warming or E. Coli. Those
> > and all other topics will be searched at the global harvester,
> > pan-institutional level.
> >
> > (2) Even at the global harvester level, Boolean full-text search,
> > supplemented by machine data-mining-based a-posteriori post-classification
> > schemes (added to the database automatically) will outperform
> > (economically and ergonomically) human a-priori pre-classification schemes
> > any day of the week.
> >
> > There is no point trying to draw conclusions on the basis of the human
> > classification that people still happen to be faithfully doing. People
> > will go on doing things well past the point where they are needed or make
> > any sense. We should be discussing, case by case, what human
> > preclassification is allegedly needed for, whether that need really
> > exists, and where it does exist, whether computational post-classification
> > methods do not or would not fulfill the need at least as well as human
> > preclassification.
> >
> > Stevan Harnad
> >
> > On 25-Jun-08, at 1:31 AM, Neil Godfrey wrote:
> >
> > Au contraire, Arthur. Subject classifications are still very much widely
> > used, and not only LCSH, in most (surely all?) libraries throughout
> > Australia for starters. One non-library example: I recently did some work
> > for the Australian Agriculture and Natural Resources Online (AANRO)
> > database and they, too, are using subject classifications that are
> > consistent with international standards. The internationally recognized
> > controlled vocabularies are a vital key to the discoverability of
> > Australian research through FAO portals. And their use has nothing to do
> > with government threats. Controlled vocabularies enable a more systematic
> > search that keywords and data-mining lack. Dublin Core has built in
> > allowance for standard controlled vocabularies. And by extension,
> > therefore, so does OAI-PMH. (There has been a heated debate within the LC
> > community about this, and I am well aware of those who argue for the
> > scrapping of controlled vocabularies. But they have by no means settled
> > the debate at this point.)
> >
> > Controlled vocabularies certainly are used for subject search and
> > retrieval. Whether they should be used at the level I think you are
> > addressing is another question. We are still in transition mode, and some
> > users do access IRs via library catalogues -- via controlled vocabularies.
> > To what extent, for what purposes, and with what rationale is again
> > another question. But one cannot just ditch entire classes of users in one
> > blow without some prior investigation. Meanwhile, I have no argument with
> > anyone wanting to explore ways to integrate their library and IR
> > collections.
> >
> > IRs have represented in most cases limited types of collections to date,
> > and that has meant they have not had the same need for controlled
> > vocabularies (subject classifications) that have been necessary for the
> > more diversified library collections. But if we are looking into the
> > future where the various IRs can be subject to various federated searches,
> > and if RDF has any place in the future of this, then there is no need for
> > the scrapping of controlled vocabularies. We may well have the best of
> > both worlds (e.g. SKOS). It is easy now to rely on keywords and limited
> > search gateways, but when we start to look at federated searching and data
> > exchange and reuse across repositories, then controlled subject
> > vocabularies may well find a renewed functionality, even if it is one
> > hidden from users' eyes who see only the friendly portal.
> >
> > RDA, SWAP -- and, given the work currently being done re SKOS (work on
> > the semantic web that is designed to bring in these controlled
> > vocabularies) -- are not "magic". They represent years of hard work and
> > experience to bring the best of the know-how of those long experienced in
> > resource searching and retrieval. It appears that modules have "finally"
> > been developed for the prototype intros of the scholarly works application
> > profile (SWAP). RDA is just around the corner. They will need refining no
> > doubt. But it is premature to ditch the very idea of controlled
> > vocabularies at this point. Maybe they will go eventually, but if we ditch
> > them now then we risk ditching a whole swathe of organized information and
> > access points way too soon.
> >
> > Neil Godfrey
> >
> > http://metalogger.wordpress.com
> >
> >
> > 2008/6/25 Arthur Sale <ahjs_at_ozemail.com.au>:
> >
> > > Neil
> > >
> > > Just briefly, two points.
> > >
> > > · Nobody uses any subject classification for much at all, anywhere in
> > > the world. They are not worth spending much time thinking about, unless
> > > you have a government breathing over your shoulder.
> > >
> > > · The Australian RFCD codes were **very** obsolete, hence the "not
> > > elsewhere classified" problems within each discipline code. The ANZSRC
> > > is specifically designed to address this problem.
> > >
> > > Rather than wait for a magic solution, I'd rather forget it. Useless
> > > data does not demand action.
> > >
> > > Arthur Sale
> > >
> > > University of Tasmania
> > >
> > >
> > > *From:* Repositories discussion list
> > > [mailto:JISC-REPOSITORIES_at_JISCMAIL.AC.UK] *On Behalf Of* Neil Godfrey
> > > *Sent:* Wednesday, 25 June 2008 10:12 AM
> > > *To:* JISC-REPOSITORIES_at_JISCMAIL.AC.UK
> > > *Subject:* Re: subject classification
> > >
> > > Some libraries catalogue certain types of resources also found in their
> > > IRs, such as theses, and so their library catalogues will contain LCSHs
> > > and links to the resources in the repository. I believe it would be too
> > > much of an ask to expect academics to assign LCSH headings anyway.
> > > Deciding appropriate LCSH terms is not even always straightforward for
> > > trained cataloguers.
> > > The RFCD codes are an administrative tool for research reporting, and
> > > since IRs are seen as potential archives of much research publication as
> > > well as being tools to assist in reporting for funding cum status
> > > purposes, it makes sense to include these codes in Australian academic
> > > repositories. But there are too many "'XXX9999' not elsewhere
> > > classified" sub categories to make it useful as a subject search tool.
> > >
> > > Is it reasonable to hope that RDA and SWAP will point to the most
> > > rational ways to work with these complexities?
> > >
> > > Neil Godfrey
> > >
> > > http://metalogger.wordpress.com
> > >
> > > 2008/6/25 Arthur Sale <ahjs_at_ozemail.com.au>:
> > >
> > > *All* Australian universities with repositories implement the RFCD
> > > classification (*R*esearch *F*ields, *C*ourses and *D*isciplines), an
> > > older version of the ANZSRC (*A*ustralian and *N*ew *Z*ealand *S*tandard
> > > *R*esearch *C*lassification) which came out this year. A few
> > > institutions are already switching over and all will eventually. As
> > > Ingrid said, this is a research funding classification, produced by the
> > > Australian Bureau of Statistics and its New Zealand counterpart and used
> > > by the Australian Government for research assessment.
> > >
> > > We recognise that almost no-one browses repository search interfaces,
> > > with intra-institution searches dominating the few there are.
> > > Governmental and internal administrative uses for the classification
> > > metadata dominate, often through the OAI-PMH interface. *In other words,
> > > the classification is provided for the analysers of research, not
> > > users.*
> > >
> > > Arthur Sale
> > >
> > > University of Tasmania
Received on Wed Jun 25 2008 - 12:54:36 BST
