Re: Ranking Web of Repositories: July 2010 Edition

From: Leslie Chan
Date: Mon, 12 Jul 2010 01:25:40 -0400

On 7/11/10 6:49 AM, "Leslie Carr" <lac_at_ECS.SOTON.AC.UK> wrote:

> On 10 Jul 2010, at 15:37, Peter Suber wrote:
>> For more detail on "rich media" or "rich files", see the Webometrics page on
>> methodology: "Only the number of text files in Acrobat format (.pdf) ... are
>> considered."...This is a bug, not a feature. A more useful ranking would try
>> to count full-text scholarly or peer-reviewed articles regardless of format.
>> I know that's hard to do. But it's a mistake to use any format as a
>> surrogate for that status, and especially a format as flawed as PDF. Even if
>> Webometrics wanted to reward some formats more than others, it should not
>> reward PDF.
> I think it should. The overwhelming majority of academic papers are
> distributed online as PDF; the overwhelming majority of things in
> that are not PDF are not academic papers.

This is rather circular. The view that academic papers should be fixed in
form and format is rather out of sync with the emergence of new forms of
scholarly expression enabled by the web. Here is an interesting commentary
in a recent THE:

"Academics in the humanities and social sciences need to question whether
the current narrowly conceived conventions of academic publication are in
our best interests. If reality is multifaceted, then writing that responds
to it needs to be multifaceted, too. Academics should be encouraged to
explore a heterogeneous range of formats, reaching different audiences and
finding new ways to write about research."

I think this discussion raises a fundamental question about the design of
IRs and their support for scholarship. IRs must do better at capturing the
diversity of scholarly contributions and formats, and at making them count
in a meaningful way.

>> The format is optimized for print or reading, not for use or reuse. PDFs are
>> slow to load and often not even readable in bandwidth-poor parts of the
>> world. They crash many browsers. They often lack working links; when they
>> do have links, they require users to open in the same window rather than in a
>> separate window, losing the file that took so long to load. Users can't
>> deep-link to subsections. Publishers can lock them to prevent cutting and
>> pasting. Publishers can insert scripts to make them unreadable offline or
>> after a certain time. PDFs impede text processing by users, text mining by
>> software, handicapped access ("read-aloud" software), and mark-up by third
>> parties.
> This is an argument about what software/data formats researchers *should* use;
> affecting their authoring and editorial processes is probably beyond the
> scope of what we can expect from this league table.

This points to the problem with league tables in general. Much like the
journal rankings in the Journal Citation Reports, such tables gloss over
what is important to different disciplinary needs and
authoring processes, and privilege quantitative measures that encourage
spurious ranking and comparison. Do we really need more output based

>> PubMed Central scores low in the Webometric rankings because it has no
> It does "have" PDFs - it might ingest articles in XML, but it certainly
> exports them in PDF. Enquiring of Google (
> filetype:pdf) shows that it has about 6,690,000 PDFs.

So PMC is being penalized by the ranking system because it is dynamic?

>> But PMC is one of the most populated and useful OA repositories in the
> This is something that needs investigating. If I had to guess why it ranks so
> low, it might be because no-one is linking INTO pubmed; rather they are
> linking to the original publishers.

How should we define the most "useful"? Should download and other usage
stats be taken into consideration, instead of only in-bound links?

>> The format it uses instead of PDF, the NLM DTD coded in XML, is vastly
>> superior to PDF for every scholarly purpose. I haven't had time to code my
>> articles in XML. But since even HTML is superior to PDF for purposes of
>> access and reuse, I self-archive in HTML rather than PDF whenever I can.
> For the record, I completely agree with you about PDF / HTML / XHTML. If
> Microsoft Word (and LaTeX) had decent export facilities that produced good
> "semantic" HTML.

Why wait for Microsoft? What has the open source community been doing on
this front? What about OpenOffice? Are there any good open source NLM DTD
conversion tools out there? Why has it taken so long?

Leslie (Chan)

> --
> Les Carr
Received on Mon Jul 12 2010 - 12:42:03 BST
