Eprints Open Archive Software from Robert Tansley on 2000-06-27 (American-Scientist-Open-Access-Forum)

From: Robert Tansley <rht96r_at_ecs.soton.ac.uk>
Date: Tue, 27 Jun 2000 17:32:37 +0100

There seems to be some confusion here. In San Antonio I questioned why
the partitions were in the open archives protocol at all. Their only
possible use, as I saw it, was as a means of assigning some subject
categorisation information to them; since for many reasons this isn't
possible, I think they should be taken out of the protocol that open
archives use.

OK; that's partitions, in the OA/Dienst protocol sense of the word, done
with. (Partitions might be a word we should avoid in the future as it
can be taken in different contexts.)

The original issue that prompted this discussion was, I think, that one
might want to set up, for example, a distributed chemistry archive; or
rather, you want a search service that will only ever give you papers
concerning the chemistry discipline. With the current OAMS set and
protocol, you can't do that. The best you could do is search for a bunch
of keywords that you think would catch all those records that pertain to
chemistry, while not picking up records that don't. How you judged your
success in this I don't know.

EPrints allows an archive administrator to specify a subject hierarchy.
It just so happens that it is this that is mapped to partitions in the
Dienst protocol, simply because that seemed the most natural mapping,
but in EPrints the subjects are not "partitions" by definition. (A
partition is defined in the Cambridge Dictionary as a dividing
structure, implying mutually exclusive divisions.) Rather, each record
in the archive can be assigned a number of subjects from this hierarchy.
Effectively, each record gets a repeatable "subject" tag that has values
from a controlled vocabulary.

This is what I would advocate for open archives; a repeatable metadata
field that has some value (or importantly, set of values) from a
controlled subject vocabulary. Obviously, this vocabulary doesn't and
shouldn't have to be large (maybe even just one entry for each major
discipline); it should just give some processable indication of what the
record is about. One can then search or browse a (distributed) archive
by subject, viewing those records which possess a given tag or set of
tags. The idea of "partitions" just doesn't come into it.
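
As a rough illustration of the repeatable subject field described above,
here is a minimal Python sketch. The record layout, the subject codes
("chem", "phys" and so on) and the function name records_with_subjects
are all made up for this example; they are not part of EPrints, OAMS or
the Dienst protocol.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary: one entry per major discipline.
SUBJECTS = {"chem", "phys", "bio", "comp"}


@dataclass
class Record:
    identifier: str
    title: str
    # Repeatable field: a record may carry several subject tags.
    subjects: set = field(default_factory=set)

    def __post_init__(self):
        # Reject tags that are not in the controlled vocabulary.
        unknown = set(self.subjects) - SUBJECTS
        if unknown:
            raise ValueError(f"subjects not in vocabulary: {sorted(unknown)}")
        self.subjects = set(self.subjects)


def records_with_subjects(archives, wanted):
    """Return records, from any archive, that carry every tag in `wanted`."""
    wanted = set(wanted)
    return [rec for archive in archives for rec in archive
            if wanted <= rec.subjects]


# Two hypothetical archives; some papers belong to two disciplines.
archive_a = [Record("a:1", "Reaction kinetics", {"chem"}),
             Record("a:2", "Quantum chemistry methods", {"chem", "phys"})]
archive_b = [Record("b:1", "Protein folding", {"bio", "chem"}),
             Record("b:2", "Compiler design", {"comp"})]

chemistry = records_with_subjects([archive_a, archive_b], {"chem"})
print([rec.identifier for rec in chemistry])  # ['a:1', 'a:2', 'b:1']
```

A record tagged with both "chem" and "phys" shows up in a chemistry
search and in a physics search alike, which is exactly what a mutually
exclusive partition cannot express.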

Of course, I understand that there are major issues here, not least of
which is deciding on the controlled vocabulary/subject hierarchy,
integrity of information and so on. I just think that this is a
worthwhile line of inquiry, and I explained it here to distinguish it
from the partition issue.

I think the reason the subject hierarchy/partition confusion first came
about was because, in EPrints, the most natural mapping for OA/Dienst
partitions was the subject hierarchy. This happened mainly because,
since partitions are part of the architecture, it was necessary to
provide some partition structure, even though none actually existed.

In view of the fact that in EPrints (and in fact in the OA subset of the
Dienst protocol) documents may appear in more than one partition, this
mapping might not be appropriate. In light of this, it might be better
that EPrints just presents itself to the protocol as a single Dienst
partition. Would this cause any problems? _Requiring_ an archive to
provide some (arbitrary) mutually exclusive partitions of its data
wouldn't seem ideal to me.
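
To make the mismatch concrete, here is a small Python sketch that checks
whether grouping records by subject yields mutually exclusive divisions.
The record identifiers and subject names are invented, and is_partition
is just a helper written for this message, not anything in EPrints or
Dienst.

```python
from itertools import combinations


def group_by_subject(record_subjects):
    """Map each subject to the set of record ids that carry it."""
    groups = {}
    for record_id, subjects in record_subjects.items():
        for subject in subjects:
            groups.setdefault(subject, set()).add(record_id)
    return groups


def is_partition(groups, all_ids):
    """True if the groups are pairwise disjoint and together cover all_ids."""
    disjoint = all(a.isdisjoint(b)
                   for a, b in combinations(groups.values(), 2))
    covered = set().union(*groups.values()) == set(all_ids)
    return disjoint and covered


# Hypothetical archive: record "r2" has two subjects.
record_subjects = {"r1": {"chem"}, "r2": {"chem", "phys"}, "r3": {"bio"}}

by_subject = group_by_subject(record_subjects)
print(is_partition(by_subject, record_subjects))  # False: "r2" is in two groups

# Presenting the whole archive as one division is trivially a partition.
single = {"all": set(record_subjects)}
print(is_partition(single, record_subjects))      # True
```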

R

Stevan Harnad wrote:
>
> Carl's technical wisdom is most welcome. I repeat, the generic
> archive-level subject categorization (partition) is meant to be
> a "satisficing" approximation, not an exact and final classification.
>
> The question is (in "minimalism-plus" terms) whether generic
> open-archiving software is more likely to be useful and effective NOW
> (in getting open archives to proliferate and fill) WITH or WITHOUT a
> default, first-approximation set of subject-headers. I don't know the
> answer for sure, but my guess is that it is better WITH.
>
> The default subjects can be turned off, or replaced, if the
> user/institution wishes (with the corresponding reduction in default
> interoperability), but is there any advantage to not providing any
> default minimal subject classification at all, for what is meant to be
> generic open archive software for any and every discipline, right now?
> [My own guess is that we are better off providing a default partition.]
>
> Second, let us not forget that one extremely central function (though
> not the only one) for institutional open archives will be
> institutional-researcher SELF-archiving. For this, the optimal scenario
> of an "information-specialist" doing the classification is not the
> right model. There WILL be institutional and individual
> SELF-classification for the subject matter of the papers in the
> archives. Again, on a minimalism-plus philosophy, why withhold the
> default partition, the possibility of turning it off, and the
> possibility of adding to it -- just because an optimal means is not
> within reach today?
>
> Let authors/institutions self-archive, and self-classify their own
> papers, and whatever shortfalls there are can be made up once the
> archives achieve their primary and most important objective, which is
> to proliferate and fill now!
>
> I also like and admire Carl's philosophy of "virtual collections,"
> providing different higher-level blends and compilations from the
> overall corpus. But right now we are talking about the primary corpus
> itself, and the default taxonomy it provides for self-describing
> itself. Let us not (in the interests of "absolute minimalism") mute its
> self-describing vocabulary because it is not optimal, and wait for
> meta-level collections and services to furnish it instead; as a
> compromise, the self-description fields can always be ignored by the
> higher-level collection. But let there be a default set now.
>
> The mother of all "collections" is the super-set of primary open
> archives itself. That generic set is what we are trying to create,
> using a "minimalism-plus" philosophy, avoiding both the Scylla of
> Absolute Minimalism (offering less functionality than is feasible
> today) and the Charybdis of Optimalism (which means holding back from
> open archiving today, to await the ideal implementation some day).
>
> Cheers, Stevan
>
> On Tue, 27 Jun 2000, Carl Lagoze wrote:
>
> > All,
> >
> > I have to jump in and say that IMHO there is a mistaken focus on
> > partitions as the proper means of segmenting up the information space
> > (of eprint archives, of OAI, of anything).
> >
> > First, a little history which I meant to relate in San Antonio but
> > never got around to. The idea of partitions was a direct result of the
> > beginning of our (NCSTRL) collaboration with LANL a few years ago. At
> > that point LANL had (and still has) this, in my opinion, somewhat
> > misguided and non-scalable legacy notion of fixed partitions in its
> > archive. They wanted to make these partitions visible at the protocol
> > level and thus was born the notion of repository partitions in Dienst.
> > (Paul and Simeon, please understand that I'm not meaning to disparage
> > your work or arXiv. As noted below the partition concept is just fine
> > for your application!)
> >
> > The Intention! -- These were and still are purely intended as a
> > repository local, administrative convenience. Basically a very simple
> > way of dividing up an individual repository and certainly not meant as a
> > means for partitioning up some larger information space.
> >
> > I have never felt very comfortable with this whole idea esp. the way
> > that it is implemented at LANL - e.g., authors decide which partition a
> > paper should be placed in and users search within partitions. This works
> > (maybe) just great in a closed and highly expert community such as those
> > who use the LANL archive but breaks down badly in other communities.
> > Esp. at the user end where searching within a partition makes little
> > sense.
> >
> > Extending this notion across repositories/archives really starts to
> > break down. We have seen this confusion in the OAI discussions. All of a
> > sudden we're trying to figure out what is the right way to
> > "universalize" partitions? What is the way of registering partitions?
> > What do these partitions mean anyway?
> >
> > The Reality! -- There is no "right" way to partition information spaces
> > (just as there is no "right" metadata). There are many ways to
> > partition information spaces that are customized to different user
> > groups. Furthermore, partitioning of information spaces is completely
> > independent of archive location (e.g., the set of information in a
> > partition may include some content from repository A, all from
> > repository B, some from repository zz, etc.). So, mapping individual
> > repository partitions to a global or even intranet cross-repository
> > partitioning system breaks down due to 1) projecting local decisions to
> > global decisions and 2) ignoring the fact that any one document in a
> > repository should be able to exist in more than one global partition.
> >
> > Solution? -- Back in '98 I wrote
> > http://www.dlib.org/dlib/november98/lagoze/11lagoze.html, in which I
> > talked about a collection abstraction in distributed information spaces.
> > At this point we implement such an idea in Dienst as the means of
> > creating the NCSTRL collection that spans multiple repositories. Our
> > implementation is imperfect but, I maintain, is on the right track. Over
> > the next year we have funding and people to push this to the next and
> > hopefully correct implementation that will allow organizations and
> > institutions to create flexible collections that do (hopefully) scale
> > across multiple repositories and make it possible to aggregate documents
> > for multiple communities.
> >
> > For now, please let's not try to push the partition thing beyond its
> > original goal of a merely repository-local administrative convenience.
> >
> > Finally, Rob and Stevan, please understand that I'm not trying to
> > criticize the work you've done on your eprint software. I'm really
> > looking forward to seeing it in action and working with you on the idea
> > of overlaying more features of the Dienst protocol on it as we try to
> > scale from individual archives to federated information spaces.
> >
> > Carl
> >
> > > -----Original Message-----
> > > From: Stevan Harnad [mailto:harnad_at_coglit.ecs.soton.ac.uk]
> > > Sent: Tuesday, June 27, 2000 6:56 AM
> > > To: Robert Tansley
> > > Cc: Eric F. Van de Velde; 'Stevan Harnad'; john.ober_at_UCOP.EDU;
> > > ken.weiss_at_UCOP.EDU; Carl Lagoze; Ed Sponsler
> > > Subject: uni- and multi-disciplinary settings on ePrints
> > >
> > >
> > > Dear Eric, Ken et al:
> > >
> > > The question of whether it will prove optimal for University Open
> > > Archives to be pluridisciplinary or unidisciplinary can be settled by
> > > actual practice.
> > >
> > > The ePrints archiving software is designed to be useable either way:
> > > Part of the local institution's parameter-setting and customization of
> > > the generic ePrints software can amount to turning other disciplines
> > > off if it is being used for just one department (or lab, or
> > > researcher).
> > >
> > > Also, there should be a generic spectrum of discipline partitions that
> > > ePrints provides as a default (we are still looking for the optimal
> > > default one to use, and recommendations are welcome!), and then these
> > > can be added to. To preserve overall interoperability, it would be
> > > best if such site-specific additions to an expanding open-partition
> > > space could be percolated to all open-archives in some systematic way
> > > (but this is a technical issue that exceeds my own technical grasp!:
> > > Carl?)
> > >
> > > What is certain is that, again, the philosophy of "minimalism plus"
> > > should prevail: We must not hold back, waiting for a final, ultimate,
> > > optimal solution, requiring more complicated compliance by
> > > individuals.
> > >
> > > Find an approximation that will "satisfice" to launch, fill, and bring
> > > up-to-speed a large number of universities' open archives right now.
> > > THEN the collective commitment that comes with having all those
> > > institutions' intellectual goods already minimum-plus-functional in
> > > the interoperable open-archives will ensure that the functionality
> > > grows, and that the growth comes in an already-shared collective
> > > convention.
> > >
> > > So: "Satisficing" approximate partitions for now, optimizing for
> > > later, once the open-archiving is irreversibly in motion.
> > >
> > > Cheers, Stevan
> > >
> > > On Tue, 27 Jun 2000, Robert Tansley wrote:
> > >
> > > > "Eric F. Van de Velde" wrote:
> > > > >
> > > > > Stevan, Rob,
> > > > > > The tech guru for our preprint service (currently consisting of
> > > > > > NCSTRL) is Ed Sponsler. He is in today and tomorrow, but then
> > > > > > takes a (well-deserved) vacation. So, it may take a bit before
> > > > > > we get into this.
> > > > > >
> > > > > > However, I believe we may have similar issues. Until now, the
> > > > > > primary usage of Dienst has been within the NCSTRL context. We
> > > > > > are struggling with decisions on how to implement a Caltech-wide
> > > > > > cross-disciplinary archive.
> > > > > >
> > > > > > Do we really have only one Caltech-wide archive with partitions
> > > > > > for individual options (departments)? However, can these
> > > > > > partitions easily participate in disciplinary federations?
> > > >
> > > > > This is an issue that hasn't fully been resolved by the open
> > > > > archives initiative, and is in fact the main issue I raised at the
> > > > > OA workshop in San Antonio. I will certainly be pushing to get this
> > > > > resolved.
> > > >
> > > > > > Another option is to create a repository for each department and
> > > > > > combine them through federation into a Caltech repository.
> > > >
> > > > > This does sound like a better option to me, as it would ease some
> > > > > of the difficulties involved in disciplinary federation (actually
> > > > > harvesting in the OA world.) Additionally, if individual archives
> > > > > are smaller, this does tend to improve their individual performance.
> > > > >
> > > > > As well as the departmental archives, you could quite easily have a
> > > > > Caltech "gateway" search engine, that could create an index covering
> > > > > all of the departmental archives, and search them all in a very
> > > > > efficient way. This separation of services (such as searching) and
> > > > > data provision brings many benefits.
> > > >
> > > > > > Occasionally, even the option of creating a repository for every
> > > > > > faculty member is mulled over, because there are quite a number
> > > > > > of "independence-minded" faculty in this place.
> > > > > > Question though is whether the federations remain manageable
> > > > > > under such a scenario...
> > > >
> > > > > You could allow each department a degree of freedom. For example,
> > > > > using the EPrints software, each department's archive could be
> > > > > given the departmental "look and feel", if they have one.
> > > > > Additionally the software allows each department to hold their own
> > > > > extra information about documents (for example, "funding body").
> > > > > Provided each archive supports the open archives protocol, and
> > > > > provides the same central metadata, the distributed searches
> > > > > performed by the Caltech search gateway are not affected.
> > > >
> > > > R
> > > >
> > > > > --Eric.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Stevan Harnad [mailto:harnad_at_coglit.ecs.soton.ac.uk]
> > > > > Sent: Thursday, June 22, 2000 9:02 AM
> > > > > To: Eric F. Van de Velde
> > > > > Cc: Rob Tansley
> > > > > Subject: Re: EPrints Software Beta
> > > > >
> > > > > Hi Eric,
> > > > >
> > > > > > The link will come to you shortly, from Rob Tansley. Meanwhile see:
> > > > > http://www.eprints.org/software.html
> > > > >
> > > > > Chrs, Stevan
> > > > >
> > > > > On Thu, 22 Jun 2000, Eric F. Van de Velde wrote:
> > > > >
> > > > > > Stevan,
> > > > > > > I would definitely be interested to take a look at this. Did
> > > > > > > you mean to include a link in your e-mail? I did not find a
> > > > > > > link to the Beta on the cogprints site.
> > > > > > --Eric.

--
 Robert Tansley                    Tel: +44 (0) 23 80594492
 IAM Research Group                Fax: +44 (0) 23 80592865
 Electronics & Computer Science    http://www.ecs.soton.ac.uk/~rht96r/
 University of Southampton
 Southampton SO17 1BJ, UK

Received on Mon Jan 24 2000 - 19:17:43 GMT
