Re: Ranking Web of Repositories: July 2010 Edition

From: Leslie Carr
Date: Sun, 11 Jul 2010 11:49:43 +0100

On 10 Jul 2010, at 15:37, Peter Suber wrote:
      For more detail on "rich media" or "rich files", see the Webometrics
      page on methodology:  "Only the number of text files in Acrobat
      format (.pdf) ... are considered." ... This is a bug, not a feature.
      A more useful ranking would try to count full-text scholarly or
      peer-reviewed articles regardless of format.  I know that's hard to
      do.  But it's a mistake to use any format as a surrogate for that
      status, and especially a format as flawed as PDF. Even if
      Webometrics wanted to reward some formats more than others, it
      should not reward PDF.

I think it should. The overwhelming majority of academic papers are distributed
online as PDF; the overwhelming majority of things in repositories that are not
PDF are not academic papers.

      The format is optimized for print or reading, not for use or reuse.
      PDFs are slow to load and often not even readable in bandwidth-poor
      parts of the world.  They crash many browsers.  They often lack
      working links; when they do have links, they require users to open
      in the same window rather than in a separate window, losing the file
      that took so long to load.  Users can't deep-link to subsections.
      Publishers can lock them to prevent cutting and pasting.
      Publishers can insert scripts to make them unreadable offline or
      after a certain time.  PDFs impede text processing by users, text
      mining by software, handicapped access ("read-aloud" software), and
      mark-up by third parties.

This is an argument about what software/data formats researchers *should* use;
affecting their authoring and editorial processes is probably beyond the scope
of what we can expect from this league table.

      PubMed Central scores low in the Webometric rankings because it has
      no PDFs.

It does "have" PDFs - it might ingest articles in XML, but it certainly exports
them in PDF. Enquiring of Google ( filetype:pdf) shows
that it has about 6,690,000 PDFs.

      But PMC is one of the most populated and useful OA repositories in
      the world.

This is something that needs investigating. If I had to guess why it ranks so
low, it might be because no-one is linking INTO PubMed Central; rather they are
linking to the original publishers.

      The format it uses instead of PDF, the NLM DTD coded in XML, is
      vastly superior to PDF for every scholarly purpose. I haven't had
      time to code my articles in XML.  But since even HTML is superior to
      PDF for purposes of access and reuse, I self-archive in HTML rather
      than PDF whenever I can.

For the record, I completely agree with you about PDF / HTML / XHTML. If only
Microsoft Word (and LaTeX) had decent export facilities that produced good
"semantic" HTML.

Les Carr
Received on Sun Jul 11 2010 - 14:00:54 BST
