Re: Ranking Web of Repositories: July 2010 Edition

From: Peter Suber <peters_at_EARLHAM.EDU>
Date: Sun, 11 Jul 2010 09:21:28 -0400

Hi Les:  You're arguing that Webometrics should count PDFs, and I fully agree.
I was only arguing that Webometrics should not *limit* its count to PDFs.
Sorry if I didn't make that clear.

BTW, I'd make the analogous case to publishers.  Publish in PDF if you like, but
never publish in PDF-only.  If you offer PDF editions, then also offer XML or
HTML editions.

     Best,      Peter

Peter Suber


On Sun, Jul 11, 2010 at 6:49 AM, Leslie Carr wrote:
      On 10 Jul 2010, at 15:37, Peter Suber wrote:
      For more detail on "rich media" or "rich files", see the
      Webometrics page on methodology:  "Only the number of text
      files in Acrobat format (.pdf) ... are considered."...This is
      a bug, not a feature.  A more useful ranking would try to
      count full-text scholarly or peer-reviewed articles regardless
      of format.  I know that's hard to do.  But it's a mistake to
      use any format as a surrogate for that status, and especially
      a format as flawed as PDF. Even if Webometrics wanted to
      reward some formats more than others, it should not reward

I think it should. The overwhelming majority of academic papers are
distributed online as PDF; the overwhelming majority of things in
repositories that are not PDF are not academic papers.

      The format is optimized for print or reading, not for use or
      reuse.  PDFs are slow to load and often not even readable in
      bandwidth-poor parts of the world.  They crash many browsers.
      They often lack working links; when they do have links, they
      require users to open in the same window rather than in a
      separate window, losing the file that took so long to load.
      Users can't deep-link to subsections.  Publishers can lock
      them to prevent cutting and pasting.  Publishers can insert
      scripts to make them unreadable offline or after a certain
      time.  PDFs impede text processing by users, text mining by
      software, handicapped access ("read-aloud" software), and
      mark-up by third parties.

This is an argument about what software/data formats researchers *should*
use; affecting their authoring and editorial processes is probably beyond
the scope of what we can expect from this league table.

      PubMed Central scores low in the Webometric rankings because
      it has no PDFs.

It does "have" PDFs - it might ingest articles in XML, but it certainly
exports them in PDF. Enquiring of Google (
filetype:pdf) shows that it has about 6,690,000 PDFs.

      But PMC is one of the most populated and useful OA
      repositories in the world.

This is something that needs investigating. If I had to guess why it ranks
so low, it might be because no-one is linking INTO pubmed; rather they are
linking to the original publishers. 

      The format it uses instead of PDF, the NLM DTD coded in XML,
      is vastly superior to PDF for every scholarly purpose. I
      haven't had time to code my articles in XML.  But since even
      HTML is superior to PDF for purposes of access and reuse, I
      self-archive in HTML rather than PDF whenever I can.

For the record, I completely agree with you about PDF / HTML / XHTML. If
only Microsoft Word (and LaTeX) had decent export facilities that produced
good "semantic" HTML.

Les Carr
Received on Sun Jul 11 2010 - 19:58:14 BST
