Re: Estimates on data and cost per department for institutional archives?

From: JQ Johnson <jqj_at_DARKWING.UOREGON.EDU>
Date: Mon, 8 Dec 2003 15:07:02 +0000

Min-Yen KAN writes:
>we are currently trying to estimate the amount of data that
>will eventually flow into the institutional archive.

"archive"? "eventually"?

As Stevan Harnad has noted, the volume of data for preprints, postprints,
and theses is likely to be minuscule, even if you are successful in
getting widespread adoption by your faculty. At most institutions
the average annual peer-reviewed production [per author] is probably
between 1 and 5 papers. Let's say you get total buy-in from a faculty
of 1000, that every faculty member submits 10 papers per year (double
the top of that range), and that each paper is a 400KB PDF. That's 10K
items and 4GB/year at most, which is tiny in terms of disk space.
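The arithmetic above can be checked in a few lines; the figures (1000 faculty, 10 papers per author per year, 400KB per PDF) are the deliberately generous scenario from the paragraph, not measured data:

```python
# Back-of-envelope sizing for the generous scenario above.
faculty = 1000
papers_per_author = 10   # deliberately high; 1-5 is more typical
paper_size_kb = 400

items_per_year = faculty * papers_per_author            # 10,000 items
storage_gb_per_year = items_per_year * paper_size_kb / 1e6  # 4.0 GB

print(items_per_year, "items/year,", storage_gb_per_year, "GB/year")
```

Even this worst case is a rounding error against a modern disk budget.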

Note that a plausible first-order model of growth is initial exponential
expansion settling down to linear growth as the service matures. That
would be good news, because the cost of storage per byte declines
roughly exponentially over the same period (Moore's law).
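That first-order model can be sketched numerically. This is a hypothetical illustration, not a forecast: the ramp length, mature rate, and cost-halving period are all made-up parameters chosen only to show the shape of the curves.

```python
# Sketch of "exponential ramp, then linear growth" vs. falling storage cost.
# All parameters here are illustrative assumptions, not predictions.

def cumulative_gb(year, ramp_years=3, mature_rate_gb=4.0):
    """Annual intake doubles each year during the ramp, then holds
    steady at mature_rate_gb per year (i.e. linear growth)."""
    total = 0.0
    rate = mature_rate_gb / 2 ** (ramp_years - 1)  # starting intake
    for y in range(1, year + 1):
        total += rate
        if y < ramp_years:
            rate *= 2
    return total

def cost_per_gb(year, start_cost=1.0, halving_years=2):
    """Storage cost per GB, assuming it halves every halving_years."""
    return start_cost * 0.5 ** (year / halving_years)

for y in (1, 3, 6):
    print(y, cumulative_gb(y), cost_per_gb(y))
```

The point of the sketch: once intake goes linear while cost per byte keeps falling, the annual hardware bill for raw storage actually shrinks over time.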

Note that although the cost of storage and servers is minimal, the cost
of archival is potentially very large. If you agree with Stevan you
don't need to care much about long-term access. We don't agree, and so
to budget for preservation, which for us means regularly scheduled
(every 5 years or so) collection surveys and remediation through format
conversion (e.g. from PDF version 27 to PDF version 57, or HTML to XML,
or GIF to PNG, or ...). That's expensive. Achieving faculty buy-in and
self-archiving is expensive too; we don't think the top-down approach
(provost mandates) is likely to be successful in most places, so you
should budget a lot for marketing and hand-holding.

A huge issue in planning an institutional repository, though, is that
you are unlikely to collect just preprints, postprints, and printable
theses. A natural extension of a preprint goal would be to collect
supporting materials for those preprints. Such supporting materials may
be very large; it's not too unusual to have a multi-TB dataset in some
fields such as astronomy or biology. It only takes one such large
dataset to completely blow away any space calculations based only on
collecting the paper-publishable text. Even if you are collecting just
preprints and theses, the size estimates depend on how you are handling
acquisition of multimedia materials; if you collect theses in dance you
might have videos of performances, each of which is several GB.
Conclusion: space needs depend sensitively on the details of your
submission/collection policies, and on the behavior and needs of a tiny
fraction of your faculty clientele. [aside: we believe that if we DON'T
collect such unprintable items we'll never get faculty buy-in for
Stevan's laudable goal of collecting the printable peer-reviewed works]

We put quite a bit of effort into estimating the expected rate of
growth of our institutional repository, and eventually gave up. We
took a very pragmatic approach and sized our initial server based on
hardware we happened to have available (about 25GB at the moment), with
the expectation that we will radically increase the disk space (probably
into the low-TB range) over the next 1 to 3 years if the service catches
on. However, our IR goals and policies are quite different in detail
from Stevan's and probably yours, so the one thing guaranteed is that
your mileage will vary.

JQ Johnson                          Office: 115F Knight Library
Academic Education Coordinator      e-mail:
1299 University of Oregon           1-541-346-1746 (v); -3485 (fax)
Eugene, OR 97403-1299
Received on Mon Dec 08 2003 - 15:07:02 GMT