Re: Ranking Web of Repositories: July 2010 Edition

From: Isidro F. Aguillo <isidro.aguillo_at_CCHS.CSIC.ES>
Date: Mon, 12 Jul 2010 09:03:45 +0200

Dear all:

In fact, we have already taken some of your comments into account in the
latest editions of the ranking. Let me explain:

- The ranking is based on a 1:1 ratio between ACTIVITY and VISIBILITY, so
publishing a lot of OA papers is exactly as important as doing it in a way that
others (worldwide) can retrieve, use and link to them. The 1:1 ratio means the
weight of each is 50%. As stated in previous messages, Visibility is measured by
counting the total number of external inlinks.

- Regarding activity, we decided to follow your advice, so the value is
calculated giving more or less the same weight to these three variables:

* Number of papers, usually full text articles, using as a proxy the number of
items from Google Scholar
* Number of web pages: ALL the web pages (usually html or similar, but also
other formats) of the website
* Number of documents: A subset of the former, those files in rich formats like
pdf, ps, doc or ppt. It is probably true that pdf is not the best format and
perhaps we should consider other formats, but people are not using other
formats. The number of files in OpenOffice formats, XML, or others is
negligible and therefore useless for ranking purposes.
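
To make the weighting above concrete, here is a minimal sketch in Python. It is only an illustration, not the code we run: it assumes each indicator has already been normalised to a value between 0 and 1 (this message does not describe the normalisation), and it treats "more or less the same" as exactly one third each. The function names are hypothetical.

```python
def activity_score(papers: float, pages: float, documents: float) -> float:
    """Average the three activity indicators.

    Assumption: each input is already normalised to [0, 1], and the three
    indicators are weighted exactly equally (the message only says
    "more or less the same").
    """
    return (papers + pages + documents) / 3.0


def composite_score(visibility: float, papers: float,
                    pages: float, documents: float) -> float:
    """Combine VISIBILITY and ACTIVITY in a 1:1 ratio (50% each)."""
    return 0.5 * visibility + 0.5 * activity_score(papers, pages, documents)
```

Under these assumptions, a repository with the maximum number of inlinks but no measurable content would reach at most half of the top score, and each activity indicator contributes about one sixth of the total.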

- PMC. Our policy is not to rank repositories that lack their own domain or
subdomain. There are technical reasons for this, but also visibility ones. The
address of PMC is "absurdly" complex:

Regarding UK PMC, it is included in the ranking, but its position is lower
because it does not use suffixes in its file names. It has hundreds of
thousands of Adobe Acrobat (pdf) files that are not named *.pdf. This prevents
efficient filtering by file type by the major search engines.
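
The effect of missing suffixes can be illustrated with a small Python sketch. It assumes, purely for illustration, that a file-type filter looks only at the suffix of the URL path; how the search engines really classify files is not described here. The function name and the example.org addresses are hypothetical.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit


def count_by_suffix(urls):
    """Count URLs by the file suffix of their path, ignoring case.

    URLs whose path has no suffix are grouped under "(none)", even if the
    server would deliver a PDF for them.
    """
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path      # drop scheme, host and query string
        _, ext = splitext(path)        # ".pdf", ".html", or "" if no suffix
        counts[ext.lower() or "(none)"] += 1
    return counts


# Hypothetical addresses, for illustration only.
example_urls = [
    "http://repository.example.org/papers/1234.pdf",
    "http://repository.example.org/papers/5678.PDF",
    "http://repository.example.org/render?id=42",
    "http://repository.example.org/articles/9012",
]
suffix_counts = count_by_suffix(example_urls)
```

A filter of this kind finds only the first two files, however many PDFs are actually served from the other addresses.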

Best regards,

On 11/07/2010 15:21, Peter Suber wrote:
      Hi Les:  You're arguing that Webometrics should count PDFs, and I
      fully agree.  I was only arguing that Webometrics should not *limit*
      its count to PDFs.  Sorry if I didn't make that clear.

      BTW, I'd make the analogous case to publishers.  Publish in PDF if you
      like, but never publish in PDF-only.  If you offer PDF editions, then also
      offer XML or HTML editions.

      Best,
      Peter

Peter Suber


On Sun, Jul 11, 2010 at 6:49 AM, Leslie Carr <> wrote:
      On 10 Jul 2010, at 15:37, Peter Suber wrote:
      For more detail on "rich media" or "rich files", see the
      Webometrics page on methodology:  "Only the number of
      text files in Acrobat format (.pdf) ... are
      considered."...This is a bug, not a feature.  A more
      useful ranking would try to count full-text scholarly or
      peer-reviewed articles regardless of format.  I know
      that's hard to do.  But it's a mistake to use any format
      as a surrogate for that status, and especially a format
      as flawed as PDF. Even if Webometrics wanted to reward
      some formats more than others, it should not reward PDF.

I think it should. The overwhelming majority of academic papers are
distributed online as PDF; the overwhelming majority of things in
repositories that are not PDF are not academic papers.

      The format is optimized for print or reading, not for
      use or reuse.  PDFs are slow to load and often not even
      readable in bandwidth-poor parts of the world.  They
      crash many browsers.  They often lack working links;
      when they do have links, they require users to open in
      the same window rather than in a separate window, losing
      the file that took so long to load.  Users can't
      deep-link to subsections.  Publishers can lock them to
      prevent cutting and pasting.  Publishers can insert
      scripts to make them unreadable offline or after a
      certain time.  PDFs impede text processing by users,
      text mining by software, handicapped access
      ("read-aloud" software), and mark-up by third parties.

This is an argument about what software/data formats researchers
*should* use; affecting their authoring and editorial processes is
probably beyond the scope of what we can expect from this league

      PubMed Central scores low in the Webometric rankings
      because it has no PDFs.

It does "have" PDFs - it might ingest articles in XML, but it
certainly exports them in PDF. Enquiring of Google
( filetype:pdf) shows that it has about
6,690,000 PDFs.

      But PMC is one of the most populated and useful OA
      repositories in the world.

This is something that needs investigating. If I had to guess why it
ranks so low, it might be because no-one is linking INTO pubmed;
rather they are linking to the original publishers. 

      The format it uses instead of PDF, the NLM DTD coded in
      XML, is vastly superior to PDF for every scholarly
      purpose. I haven't had time to code my articles in XML.
       But since even HTML is superior to PDF for purposes of
      access and reuse, I self-archive in HTML rather than PDF
      whenever I can.

For the record, I completely agree with you about PDF / HTML /
XHTML. If only Microsoft Word (and LaTeX) had decent export
facilities that produced good "semantic" HTML.

Les Carr
Isidro F. Aguillo, HonPhD
Cybermetrics Lab (3C1)
Albasanz, 26-28
28037 Madrid. Spain
Editor of the Rankings Web
Received on Mon Jul 12 2010 - 12:53:53 BST
