Re: Ranking Web of Repositories: July 2010 Edition from Isidro F. Aguillo on 2010-07-12 (American-Scientist-Open-Access-Forum)

From: Isidro F. Aguillo <isidro.aguillo_at_CCHS.CSIC.ES>
Date: Mon, 12 Jul 2010 14:00:29 +0200

On 12/07/2010 7:25, Leslie Chan wrote:
> On 7/11/10 6:49 AM, "Leslie Carr"<lac_at_ECS.SOTON.AC.UK> wrote:
>
>> On 10 Jul 2010, at 15:37, Peter Suber wrote:
>>> For more detail on "rich media" or "rich files", see the Webometrics page
>>> on methodology: "Only the number of text files in Acrobat format (.pdf)
>>> ... are considered."...This is a bug, not a feature. A more useful ranking
>>> would try to count full-text scholarly or peer-reviewed articles
>>> regardless of format. I know that's hard to do. But it's a mistake to use
>>> any format as a surrogate for that status, and especially a format as
>>> flawed as PDF. Even if Webometrics wanted to reward some formats more than
>>> others, it should not reward PDF.
>> I think it should. The overwhelming majority of academic papers are
>> distributed online as PDF; the overwhelming majority of things in
>> repositories that are not PDF are not academic papers.
> This is rather circular. The view that academic papers should be fixed in
> form and format is rather out of sync with the emergence of new forms of
> scholarly expression enabled by the web. Here is an interesting commentary
> in a recent THE:
>
> "Academics in the humanities and social sciences need to question whether
> the current narrowly conceived conventions of academic publication are in
> our best interests. If reality is multifaceted, then writing that responds
> to it needs to be multifaceted, too. Academics should be encouraged to
> explore a heterogeneous range of formats, reaching different audiences and
> finding new ways to write about research."
> http://www.timeshighereducation.co.uk/story.asp?storyCode=411466&sectioncode=26
>
> I think this discussion raises a fundamental question about the design of
> IRs and their support for scholarship. IRs must do better to capture the
> diversity of scholarly contribution and formats, and make them count in
> a meaningful way.
Dear Leslie:

You are completely right: other formats should be used, far better than
those currently available and, of course, open source/open access. Now
try to convince 1 billion Internet users to do that. NOBODY (well, a few
thousand) is using these other formats today (yet). I have the figures.

>>> The format is optimized for print or reading, not for use or reuse. PDFs
>>> are slow to load and often not even readable in bandwidth-poor parts of
>>> the world. They crash many browsers. They often lack working links; when
>>> they do have links, they require users to open in the same window rather
>>> than in a separate window, losing the file that took so long to load.
>>> Users can't deep-link to subsections. Publishers can lock them to prevent
>>> cutting and pasting. Publishers can insert scripts to make them unreadable
>>> offline or after a certain time. PDFs impede text processing by users,
>>> text mining by software, handicapped access ("read-aloud" software), and
>>> mark-up by third parties.
>> This is an argument about what software/data formats researchers *should*
>> use; affecting their authoring and editorial processes is probably beyond
>> the scope of what we can expect from this league table.
> This points to the problem with league tables in general. Much like the
> league tables in the Journal Citation Report with journal ranking, such
> tables gloss over what is important to different disciplinary needs and
> authoring processes, and privilege quantitative measures that encourage
> spurious ranking and comparison. Do we really need more output-based
> comparisons?
I have only a few numbers to support the "need" for league tables. QS,
the former editors of the THES Ranking of Universities, stated that they
received 18 million visitors per year; our Web Ranking is close to 5
million, and the Shanghai ranking probably reaches similar or even higher
levels.

>>> PubMed Central scores low in the Webometric rankings because it has no
>>> PDFs.
>> It does "have" PDFs - it might ingest articles in XML, but it certainly
>> exports them in PDF. Enquiring of Google (site:www.ncbi.nlm.nih.gov
>> filetype:pdf) shows that it has about 6,690,000 PDFs.
> So PMC is being penalized by the ranking system because it is dynamic?

Nobody is saying that. PMC is excluded because it does not have its own
domain or subdomain. You can disagree, but I dislike my papers being
"url-authored" by the library.

>>> But PMC is one of the most populated and useful OA repositories in the
>>> world.
>> This is something that needs investigating. If I had to guess why it ranks
>> so low, it might be because no-one is linking INTO pubmed; rather they are
>> linking to the original publishers.
> How should we define the most "useful"? Should download and other usage
> stats be taken into consideration, instead of only in-bound links?
As soon as (standardized) user statistics become available they will be
used. Good indicators need to be useful but also feasible.

>>> The format it uses instead of PDF, the NLM DTD coded in XML, is vastly
>>> superior to PDF for every scholarly purpose. I haven't had time to code my
>>> articles in XML. But since even HTML is superior to PDF for purposes of
>>> access and reuse, I self-archive in HTML rather than PDF whenever I can.
>> For the record, I completely agree with you about PDF / HTML / XHTML. If
>> only Microsoft Word (and LaTeX) had decent export facilities that produced
>> good "semantic" HTML.
> Why wait for Microsoft? What has the open source community been doing on
> this front? What about OpenOffice? Any good open source NLM DTD conversion
> tools out there? Why has it taken so long?
No answer to that. I am only mirroring the current situation.

> Leslie (Chan)
>
>> --
>> Les Carr
>>
>>

-- 
===========================
Isidro F. Aguillo, HonPhD
Cybermetrics Lab (3C1)
IPP-CCHS-CSIC
Albasanz, 26-28
28037 Madrid. Spain
Editor of the Rankings Web
===========================

Received on Mon Jul 12 2010 - 13:23:44 BST
