Re: Ranking Web of Repositories: July 2010 Edition

From: Peter Suber <peter.suber_at_GMAIL.COM>
Date: Sat, 10 Jul 2010 10:37:33 -0400

On Thu, Jul 8, 2010 at 8:59 AM, Leslie Carr wrote:
      If you assume that a repository is full of locally-authored research
      literature then you will find all sorts of counter-examples in one
      area or another. The "Rich Media" criterion goes some way to
      filtering out non-documents, but whether the items are "scholarly"
      or "local" or "equivalent to those in other repositories" is very
      difficult to ascertain.

For more detail on "rich media" or "rich files", see the Webometrics page on
methodology:  "Only the number of text files in Acrobat format (.pdf) ... are

This is a bug, not a feature.  A more useful ranking would try to count
full-text scholarly or peer-reviewed articles regardless of format.  I know
that's hard to do.  But it's a mistake to use any format as a surrogate for that
status, and especially a format as flawed as PDF.

Even if Webometrics wanted to reward some formats more than others, it should
not reward PDF.  The format is optimized for print or reading, not for use or
reuse.  PDFs are slow to load and often not even readable in bandwidth-poor
parts of the world.  They crash many browsers.  They often lack working links;
when they do have links, they force users to open the target in the same window
rather than a separate one, losing the file that took so long to load.  Users
can't deep-link to subsections.  Publishers can lock them to prevent cutting and
pasting.  Publishers can insert scripts to make them unreadable offline or after
a certain time.  PDFs impede text processing by users, text mining by software,
handicapped access ("read-aloud" software), and mark-up by third parties.  

PubMed Central scores low in the Webometrics rankings because it has no PDFs.
But PMC is one of the most populated and useful OA repositories in the world.
The format it uses instead of PDF, the NLM DTD coded in XML, is vastly superior
to PDF for every scholarly purpose.

I haven't had time to code my articles in XML.  But since even HTML is superior
to PDF for purposes of access and reuse, I self-archive in HTML rather than PDF
whenever I can.


Peter Suber
Received on Sun Jul 11 2010 - 07:46:52 BST
