Re: Ranking Web of Repositories: July 2010 Edition

From: Frederic MERCEUR <Frederic.Merceur_at_IFREMER.FR>
Date: Mon, 12 Jul 2010 16:06:05 +0200

Hello,

Personally, I am uncomfortable with this ranking because, to my mind, it
is incomplete and imprecise.

It is incomplete because repositories hosted in a subdirectory (e.g.
www.xxx.zz/repository) are not ranked, for technical reasons, even if, as Isidro
noted, "the number of these repositories is far lower than the
"non-repositories" listed in ROAR and OpenDOAR".

It is imprecise because it is based on automated web queries that are very
sensitive to noise. This is the case, for example, for the visibility indicator
(external inlinks). As far as I understand from Isidro's explanations, part of
this indicator is calculated with the Yahoo linkdomain operator:

linkdomain:http://my_site -site:my_site

I tested this query on a few of the ranked repositories, including our own. More
than 90% of the inlinks (and, in some cases, I guess more than 99%) are not
significant because they come from:

-    automatically generated spam web sites (e.g. www.find-pdf.com,
www.mypdffiles.com, ...) or automated sites such as http://www.123people.fr
-    automatic links from OAI harvesters
-    automatic links that come from other domains of the same university (e.g.
self-citation through automatically generated personal author pages)
-    automatic repetition of the same link: in some forums, a link on the main
page is duplicated automatically on every archive page, so one significant
manual link can produce several hundred insignificant automatic links
-    …
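
To make the categories above concrete, here is a minimal sketch of how a list of
inlink URLs could be cleaned before counting. It is only an illustration, not
the method used by the ranking: the host lists, the institution domain and the
function names are all hypothetical, and a real check would need curated lists.

```python
from urllib.parse import urlparse

# Hypothetical lists, for illustration only.
SPAM_HOSTS = {"www.find-pdf.com", "www.mypdffiles.com", "www.123people.fr"}
HARVESTER_HOSTS = {"harvester.example.org"}      # placeholder OAI harvester
INSTITUTION_DOMAIN = "example-university.zz"     # placeholder university domain


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def classify_inlink(url):
    """Assign an inlink to one of the noise categories, or 'candidate'."""
    host = host_of(url)
    if host in SPAM_HOSTS:
        return "spam"
    if host in HARVESTER_HOSTS:
        return "harvester"
    if host == INSTITUTION_DOMAIN or host.endswith("." + INSTITUTION_DOMAIN):
        return "same-institution"
    return "candidate"


def significant_inlinks(urls):
    """Keep at most one candidate link per linking host."""
    seen_hosts = set()
    kept = []
    for url in urls:
        if classify_inlink(url) != "candidate":
            continue
        host = host_of(url)
        if host in seen_hosts:
            # Same site linking again, e.g. a forum link repeated on archive pages.
            continue
        seen_hosts.add(host)
        kept.append(url)
    return kept
```

Keeping one link per host is a deliberately crude way to handle the repeated
forum links; it would also discard several genuine links from the same site.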

The other indicators (size, rich files, scholar) may also be unreliable for
similar reasons.

According to Isidro, all these points affect the numbers but not (much) the
ranking. This should be confirmed...
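
One possible way to check this claim would be to compare the ranking obtained
from the raw inlink counts with the ranking obtained from cleaned counts, for
instance with a Spearman rank correlation. The sketch below is an assumption on
my side, not something the ranking team does; the function names are
hypothetical and it assumes there are no tied values.

```python
def ranks(values):
    """Rank positions (1 = largest value); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(raw_counts, filtered_counts):
    """Spearman rank correlation between two lists of counts (n >= 2, no ties)."""
    n = len(raw_counts)
    if n < 2 or n != len(filtered_counts):
        raise ValueError("need two lists of equal length with at least 2 items")
    squared_diffs = sum(
        (a - b) ** 2 for a, b in zip(ranks(raw_counts), ranks(filtered_counts))
    )
    return 1 - 6 * squared_diffs / (n * (n * n - 1))
```

A value close to 1 would support the idea that the noise changes the numbers
but not the order; a clearly lower value would mean the ranking itself is
affected.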

Kind regards,
Fred


Isidro F. Aguillo wrote:
      Dear all:

      In fact, we have already taken some of your comments into account in
      the last editions of the ranking. Let me explain:

      - The ranking is based on a 1:1 ratio between ACTIVITY and
      VISIBILITY, so publishing a lot of OA papers is as important as
      publishing them in a way that lets others (worldwide) retrieve, use
      and link to them. The 1:1 ratio means that each carries a weight of
      50%. As stated in previous messages, Visibility is measured by
      counting the total number of external inlinks.

      - Regarding activity, we decided to follow your advice, so the value
      is calculated by giving more or less the same weight to these three
      variables:

      * Number of papers, usually full-text articles, using as a proxy the
      number of items in Google Scholar
      * Number of web pages: ALL the web pages of the website (usually html
      or similar, but also other formats)
      * Number of documents: a subset of the former, namely files in rich
      formats such as pdf, ps, doc or ppt. It is probably true that pdf is
      not the best format and perhaps we should consider other formats, but
      people are not using them. The number of files in OpenOffice formats,
      XML or others is negligible, and therefore useless for ranking
      purposes.
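
The weighting described above can be sketched as follows. This is only an
illustration under stated assumptions: the message does not say how the raw
counts are normalised or combined, so the sketch assumes each indicator has
already been scaled to the range 0..1, and it treats "more or less the same
value" as exactly equal weights. The function name is hypothetical.

```python
def composite_score(visibility, scholar, pages, rich_files):
    """Combine four indicators, each assumed to be pre-scaled to 0..1.

    Visibility (external inlinks) counts for 50%. Activity counts for the
    other 50% and is taken here as the plain average of its three variables,
    which is an assumption.
    """
    activity = (scholar + pages + rich_files) / 3.0
    return 0.5 * visibility + 0.5 * activity
```

With these assumptions, a repository with perfect visibility and no activity
gets the same score as one with perfect activity and no visibility.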

      - PMC. Our policy is not to rank repositories that lack their own
      domain or subdomain. There are technical reasons but also visibility
      ones. The address of PMC is "absurdly" complex:

      www.ncbi.nlm.nih.gov/pmc

      Regarding UK PMC, it is included in the ranking, but its position is
      lower because it does not use suffixes in its file names. It has
      hundreds of thousands of Adobe Acrobat (pdf) files without marking
      them as *.pdf. This prevents major search engines from filtering
      efficiently by file type.

      Best regards,


--
Fred Merceur
Ifremer / Bibliothèque La Pérouse
frederic.merceur_at_ifremer.fr
Tel: 02-98-49-88-69
Fax : 02-98-49-88-84
Archimer, Ifremer's Institutional Repository
Avano, a marine and aquatic OAI harvester
Please consider the environment before printing!
Received on Mon Jul 12 2010 - 15:47:54 BST
