Re: Open Access Speeds Use by Others

From: Stevan Harnad <>
Date: Thu, 18 May 2006 21:16:28 +0100

Gunther Eysenbach (as also submitted to PLoS today) wrote in

> The introduction of the article and two accompanying editorials [1-3]
> already answer Harnad's questions why author, editors, and reviewers
> were critical of the methodology employed in previous studies, which
> all only looked at "green OA" (self-archived/online-accessible papers)

I didn't ask why the author and editors were critical of prior
self-archiving (green OA) studies; I asked why they said such studies
were "surprisingly hard to find" and why the two biggest and latest of
them were not even taken into account:

    Brody, T., Harnad, S. and Carr, L. (2005) Earlier Web Usage Statistics
    as Predictors of Later Citation Impact. Journal of the American
    Society for Information Science and Technology (JASIST) 56.

    Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year
    Cross-Disciplinary Comparison of the Growth of Open Access and How
    it Increases Research Citation Impact. IEEE Data Engineering Bulletin
    28(4) pp. 39-47.

And the reason all prior within-journal studies look at "green OA"
is that the majority of OA today is green; hence almost all OA/NOA
impact comparisons are based on green OA (self-archiving) rather than
on paid-OA (gold). (OA/NOA between-journal comparisons would be comparing
apples with oranges.)

> (hint 1: "confounding") (hint 2: arrow of causation: are papers online
> because they are highly cited, or the other way round?).

I am afraid I don't see Eysenbach's point here at all: What exactly
does he think is being confounded in within-journal comparisons
of self-archived versus non-self-archived articles? The paid-OA
effect? But among OA articles today there is almost zero within-journal
paid-OA, because so few journals offer it! (Hajjem et al.'s within-journal
comparisons were based on over a million articles, across 12 years, in
12 disciplines! Eysenbach's were based on 1492 articles, in 6 months,
in one journal.)

And is Eysenbach suggesting that his failure to find any significant
difference among author self-reports -- about their own article's quality
and its causal role in their decision about whether or not to pay for OA
(or to self-archive) in his sample of 237 authors -- is an objective test
of the arrow of causation? (I agree that Eysenbach's failure to find a
difference fails to support the hypothesis of a self-selection bias, but
surely that won't convince those who are minded to hold that hypothesis! I
would welcome rigorous causal evidence against the self-selection
hypothesis as much as Eysenbach would, but author self-reports are not
that evidence!)

> The statement in the PLoS editorial has to be seen against this
> background. None of the previous papers in the bibliography mentioned
> by Harnad employed a similar methodology, working with data from a
> "gold-OA" journal.

True. But so what? It is Eysenbach (and PLoS) who are focussed on gold-OA
journals; the rest of the studies are focussed on OA itself. Only about
10% of the planet's peer-reviewed journals are gold today, and most of
those are 100% gold, hence allow no within-journal comparisons. Very few
journals as yet offer authors the "Open Choice" (optional paid gold) that
would allow gold within-journal OA/NOA comparisons; and few authors are
as yet taking those journals up on it (about 15% in this PNAS sample),
compared to the far larger number that are self-archiving (also 15%,
as it happens, though that percentage too is still far too small!). The
difference in article sample sizes is about three orders of magnitude
(c. 1500 articles in Eysenbach's study versus 1.5 million in Hajjem et al.'s).

> The correct method to control for problem 1 (multiple confounders)
> is multivariate regression analysis, not used in previous
> studies.

Correct. But with the large, consistent within-journal OA/NOA differences
found across all journals, all disciplines and all years, in samples three
orders of magnitude larger than Eysenbach's, it is not at all clear
that controls for those "multiple confounders" are necessary in order to
demonstrate the reality, magnitude and universality of the OA advantage.
That does not mean the controls are not useful, just that they are not
yet telling us much that we don't already know.

> Harnad's statement that "many [of the confounding variables]
> are peculiar to this particular... study" suggests that he might
> still not fully appreciate the issue of confounding. Does he suggest
> that in his samples there are no differences in these variables
> (for example, number of authors) between the groups? Did he even
> test for these? If he did, why was this not described in these
> previous studies?

No, we did not test for "confounding effects" of number of authors: What
confounding effects does Eysenbach expect from controlling for number of
authors in a sample of millions of articles across a dozen disciplines
and a dozen years, all showing the very same, sizeable OA advantage? Does he
seriously think that partialling out the variance in the number of authors
would make a dent in that huge, consistent effect?

Not that Eysenbach's tentative findings on 1st-author/last-author
differences in his one-journal sample of 1492 are not interesting; but
those are merely minor differences in shading, compared to the whopping
main effect, which is: substantially more citations (and downloads)
for self-archived OA articles.

> The correct method to address problem 2 (the "arrow of causation"
> problem) is to do a longitudinal (cohort) study, as opposed to a
> cross-sectional study. This ascertains that OA comes first and THEN
> the paper is cited highly, while previous cross-sectional studies
> in the area of "green OA" publishing (self-archiving) leave open
> what comes first -- impact or being online.

I agree completely that time-based studies are necessary to demonstrate
causation, for those who think that the OA advantage might be based
on self-selection bias (i.e., that high-impact studies tend to be
preferentially self-archived, perhaps even after they have gained their
high impact), but Eysenbach's author self-report data certainly don't
constitute such a longitudinal cohort study! (Once we have reliable
deposit dates for self-archived articles, we will be able to do some
time-based analyses but, frankly, by that time the outcome is likely to
be a foregone conclusion.)

In the meanwhile, the fact that (a) the OA advantage does not diminish for
younger articles (as one would expect if it were a post-hoc effect),
that (b) OA increases downloads, and that (c) increased downloads in
the first 6 months are correlated with increased citations later on --
plus the logic of the fact that (d) unaffordability reduces access and
that (e) access is a necessary condition for citation -- all suggest that
most of the scepticism about the OA advantage is because of conflicting
interests, not because of objective uncertainty.

> Harnad - who usually carefully distinguishes between "green" and
> "gold" OA publishing -- ignores that open access is a continuum, much
> as publishing is a continuum [4],

I'm afraid I have no idea what Eysenbach means about OA being a
continuum: Time is certainly a continuum, and *access* certainly admits
of degrees (access may be easier/harder, narrower/wider, cheaper/dearer,
longer/shorter, earlier/later, partial/full) -- but *Open Access* does
not admit of degrees (any more than pregnancy does). OA means immediate,
permanent, full-text online access, free for all, now.

And, by the way, green OA is certainly not a lesser degree of gold OA!

For the innocent reader, puzzled as to why this would even be an issue:

Please recall that OA (gold) journals, whether total or optional gold,
need authors (and those gold journals with the gold cost-recovery
model need *paying* author/institutions). To attract authors, they need
to persuade them of the benefits of OA. So far so good. But there is
another thing they have to persuade them of, implicitly or explicitly,
and that is the benefits of gold OA over green OA. For if there *are*
no benefits, then surely it makes much more sense for authors to publish
in their journal of choice, as they always did, and simply self-archive
their articles, rather than switching journals and/or paying for gold OA!

This theme alas keeps recurring, implicitly or explicitly, in the
internecine green/gold squabbles, because green OA is indeed a rival
to gold OA in gold OA journals' efforts to win over authors. This is
regrettable, but a functional fact today, owing to the nature of OA and
the two means of providing it.

Is the effect symmetrical? Is gold OA a rival to green OA? The answer is
more complicated: No, an author who chooses gold OA (by publishing in an
OA journal) is not at all a loss for green OA, because the article is
nevertheless OA, and green OA's sole objective is 100% OA, as soon as
possible, and nothing else. (Besides, a gold OA article too can be
self-archived in the author's Institutional Repository if the author
or institution wishes! All gold journals are also green, in that they
endorse author self-archiving.)

But there is a potential problem with gold from the standpoint of green.
The problem is not with authors choosing gold. The problem is with gold
publishers promoting gold as superior to green, or, worse, with gold
publishers implying that green OA is not really OA, or not "fully" OA
(along some imaginary OA "continuum").

    "Free Access vs. Open Access" (thread started Aug 2003)

Why, you ask, would gold OA want to give the impression that green OA
was not really OA or not fully OA? This is because of the rivalry for
authors that I just mentioned. The causal arrow is a one-way one insofar
as competition for authors is concerned: green OA does not lose an
author if that author publishes in a gold OA journal, whereas gold OA
does lose an author if an author publishes in a green journal instead of
a gold one. However, if gold portrays green as if it were not really or
fully OA, and authors believe this, then it loses authors for green --
and loses them even if they do not elect to publish gold. Because there
is today something still very paradoxical, indeed incoherent, about
author behavior and motivation:

Authors profess to want OA. Thirty-four thousand of them even signed the
2001 PLoS Open Letter threatening
to boycott their journals if they did not provide (gold) OA (within 6
months of publication). (Most journals did not comply, and most authors
did not follow through on their boycott threat: how could they? There were
not enough suitable gold journals for them to switch to, and most authors
clearly were not interested in switching journals, let alone paying for
publication, then or now.)

Yet (and here comes the paradox): if those 34,000 signatories -- allegedly
so desirous of OA as to be ready to boycott their journals if they did
not provide it -- had simply gone on to self-archive all their papers,
they would be well on the road to having the OA they allegedly desired
so much! For the green road to 100% OA happens to be based on the
(golden!) rule: Self-Archive Unto Others As You Would Have Them
Self-Archive Unto You.

Why didn't (and don't) most authors do it (yet)? It is partly (let us
state it quite frankly) straightforward foolishness and inconsistency
on authors' part. They simply have not thought it through. This cannot
be denied.

But it is partly also the fault of the promotional efforts of
(well-meaning) OA advocates. Harold Varmus sent a mixed message with
his 1999 "E-biomed" proposal (which led to PLoS, the PLoS Open letter,
PubMed Central, Biomed Central, and eventually the PLoS (and BMC)
fleet of OA journals, including PLoS Biology). Was E-biomed a gold
proposal, a green proposal, both, or neither? The fact is that it was
an incoherent proposal -- a confused and confusing mish-mash of central
self-archiving, publishing reform/replacement and rival publishing --
and although it has undeniably led to genuine and valuable progress
toward (what was eventually baptized as) OA, it has left a continuing
legacy of continuing confusion too.

And we are facing part of that legacy of confusion now, with PLoS
thinking that the only (or the best) way to reach 100% OA is to publish
and promote gold OA journals. That is why PLoS Biology agreed to
referee the Eysenbach paper, which seemed to show that OA gold is the
only one that increases citation impact, not green self-archiving, which
is (when you come right down to it) not even "real" OA at all!

That is also why PLoS Biology editorialised that they found it
"surprisingly hard to find" evidence -- "solid evidence" -- that OA
articles are read and cited more. And that is why PLoS Biology was
happy to make an exception and publish the Eysenbach study, even though
scientometrics is not the subject matter of PLoS Biology, but (I'll
warrant) PLoS Biology would not have been happy to advertise in its
pages the fact that green OA self-archiving was enough to get articles
read and cited more!

So green OA does have a bit of an uphill battle against gold OA and the
subsidies and support it has received (because gold OA is an attractive and
understandable idea, whereas green OA requires a few more mental steps
to dope out -- though not many, as none of this is rocket science!).

But, to switch metaphors, the green road to 100% OA (sic) is far wider,
faster and surer than the golden road, and 100% OA really is beneficial
to research and researchers; so the green road of self-archiving is
bound to prevail, despite the extra obstacles. And the destination
(100% OA) is exactly the same for both roads. Indeed, I am pretty sure
that even the fastest way to reach 100% *gold* OA (i.e., not just 100%
OA but also the conversion of all journals to gold) is in fact to take
the green road to 100% OA first.

So gold is doing itself a disservice when it tries to devalue green.
Read on:

> and this study (and the priority claims in the editorial) was talking
> about the gold OA end of the spectrum.

Spectrum? Continuum? Degrees of OA?

> Publishing in an open access journal is a fundamentally
> different process from putting a paper published in a toll-access
> journal on the Internet. In analogy, printing something on a flyer
> and handing it out to pedestrians on the street, and publishing an
> article in a national newspaper can both be called "publishing",
> but they remain fundamentally different processes, with differences
> in impact, reach, etc. A study looking at the impact of publishing
> a newspaper can not be replaced with a study looking at the impact
> of handing out a flyer to pedestrians, even though both are about
> "publishing".

Oh dear! I have a feeling Eysenbach is going to tell us that making
*published* journal articles immediately and permanently accessible online,
free for all, by self-archiving them, is not OA after all, or not "full
OA". If the journal doesn't do it for you, and/or you don't pay for it,
it's not the real thing.

I wonder why Eysenbach would want to say that? Could it be because he is
promoting an OA (gold) journal (his own)? Could that also have been the
reason the PLoS editorial was so sanguine about Eysenbach's findings on
the OA gold advantage, and so dismissive of any prior evidence of an OA
green advantage?

> Finally, Harnad says that "prior evidence derived from substantially
> larger and broader-based samples showing substantially the same
> outcome". I rebut with two points here.
> Regarding "larger samples" I think rigor and quality (leading
> to internal validity) is more important than quantity (or sample
> size).

And if the studies -- large and small -- have *exactly* the same outcome?

> Going through the laborious effort to extract article and
> author characteristics for a limited number of articles (n = 1492)
> in order to control for these confounders provides scientifically
> stronger evidence than doing a crude, unadjusted analysis of a
> huge number of online accessible vs non-online accessible articles,
> leaving open many alternative explanations.

As I said, for those who doubt the causality and think the OA advantage
is just a self-selection bias, Eysenbach's study will not convince them
otherwise either. For those with eyes to see, the repeated demonstrations,
in field after field, of exactly the same effect on incomparably larger
samples will already have been demonstration enough. For those with eyes
only for gold, evidence that green enhances citations will never be
"solid evidence."

If Eysenbach and the editors had portrayed their findings as they should
have, namely, as yet another confirmation of the OA impact advantage,
with some further details about its fine-tuning, I would have done
nothing but praise it. But the actual self-interested spin and puffery
that instead accompanied this work -- propagating the frankly false idea
that this is the first "solid evidence" for the OA impact advantage,
and, worse, that it implies that self-archiving itself does not deliver
the OA impact advantage -- would have required not the lack of an ego,
but the lack of any real fealty to OA itself, to be allowed to stand.

> Secondly, contrary to what Harnad said, this study is NOT at all
> "showing substantially the same outcome". On the contrary, the effect
> of green-OA -- once controlled for confounders - was much less than
> what others have claimed in previous papers.

Let's be quite explicit about what, exactly, we are discussing here:

Eysenbach found that in a 6-month sample of 1492 articles in one
3-option journal (PNAS):

    "While in the crude analysis self-archived papers had on average
    significantly more citations than non-self-archived papers (mean,
    5.46 versus 4.66; Wilcoxon Z = 2.417; p = 0.02), these differences
    disappeared when stratified for journal OA status (p = 0.10 in the
    group of articles published originally as non-OA articles, and p =
    0.25 in the group of articles published originally as OA).

    "In a logistic regression model with backward elimination,
    which included original OA status and self-archiving OA status as
    separate independent variables as well as all potential confounders,
    self-archiving OA status did not remain a significant predictor
    for being cited. In a linear regression model, the influence of
    the covariate "article published originally as OA, without being
    self-archived" (beta = 0.250, p < 0.001) on citations remained
    stronger than self-archiving status (beta = 0.152, p = 0.02)."

To translate this into English (from an article with *exceedingly*
user-unfriendly data-displays, by the way, making it next to impossible
to extract and visualize results from the tables!): First, the numbers:

NOA (Not OA): 1159 articles
POA (Paid OA only): 176 articles
SOA (Self-Archived OA only): 121 articles
BOA (POA and SOA): 36 articles

The finding is that (with many other factors statistically isolated so
as to be measured independently: days since publication, number of
authors, article type, country, funding, subject, etc.):

In this sample, POA, SOA and BOA considered together, and POA considered
alone, all have significantly more citations than NOA, but SOA considered
alone ("stratified") does not. Also, if considered jointly (multiple
regression), both POA and SOA increase citations, but POA is the
stronger effect.
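
The stratification pattern Eysenbach reports -- a crude self-archiving
advantage that vanishes within the paid-OA and non-OA strata -- is easy
to illustrate. Here is a minimal sketch in Python with purely invented
numbers (not Eysenbach's data), showing how a raw difference in mean
citations can disappear once the groups are stratified by original OA
status, simply because self-archiving is correlated with membership in
the higher-cited stratum:

```python
# Illustrative (invented) data: each tuple is (originally_OA, self_archived, citations).
# The numbers are chosen only to show the stratification effect, not taken
# from Eysenbach's study.
papers = (
    [(False, False, 4)] * 6 + [(False, True, 4)] * 2 +  # non-OA stratum
    [(True, False, 8)] * 1 + [(True, True, 8)] * 3      # paid-OA stratum
)

def mean_citations(rows):
    return sum(c for _, _, c in rows) / len(rows)

# Crude comparison: self-archived papers look better...
crude_sa = mean_citations([p for p in papers if p[1]])
crude_nsa = mean_citations([p for p in papers if not p[1]])
print(f"crude: self-archived {crude_sa:.2f} vs not {crude_nsa:.2f}")

# ...but within each stratum the difference disappears, because
# self-archiving happens to be concentrated in the higher-cited stratum.
for stratum in (False, True):
    sa = mean_citations([p for p in papers if p[0] == stratum and p[1]])
    nsa = mean_citations([p for p in papers if p[0] == stratum and not p[1]])
    print(f"originally_OA={stratum}: self-archived {sa:.1f} vs not {nsa:.1f}")
```

Whether PNAS's paid-OA stratum really plays this confounding role for
self-archived articles across journals in general is, of course, exactly
the question at issue.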

Here are three simple hypotheses, in decreasing order of likelihood, as
to why this small PNAS study may have found that the citation counts and
their significance distributed themselves as they did: BOA > POA >> SOA > NOA

Hypothesis 1: The POA+ effect might be unique to high-profile 3-option
journals (POA, SOA, NOA) like PNAS (which are themselves a tiny minority
among journals) and occurs because the POA articles are more visible
than the SOA articles. (The POA + SOA = BOA articles do the best of
all!) So the POA authors *do* get something more for their money (but that
something is not *OA* but high-profile POA in a high-profile journal) --
at least for the time being. This extra POA-over-SOA advantage will of
course wash out as SOA and Institutional Repositories for self-archiving
become widespread.

Hypothesis 2: The POA+ effect might result at least in part from QB
(self-selection Quality Bias) because the decision (by a self-selected 15%
subset of PNAS authors) to *pay* for POA is influenced by the author's
underlying sense of the potential importance (hence impact) of his
article: I agree with the critics that simply asking authors about how
important they think their article is, and whether that influenced their
decision to pick POA or SOA or NOA, and failing to detect any significant
difference among the authors, does not settle this matter, and certainly
not on the basis of such a small and special sample. (But I think QB
is just one of many components of the OA citation advantage itself,
and certainly not the only determinant or even the biggest one.)

Hypothesis 3: The POA+ effect might be either a chance result from this
small sample size or a temporary side-effect of the 3-option journals
in early days:

NOA: 1159 articles, 86.2% cited at least once
POA:  176 articles, 94.3% cited at least once
SOA:  121 articles, 90.1% cited at least once
BOA:   36 articles, 97.2% cited at least once

(Note that our own and Lawrence's 2001 finding had been that the
proportion of OA articles increases in the higher citation ranges,
being lowest among articles with 0-1 citations.)

Eysenbach finds that with logistic regression analysis separating
the independent effects of POA, SOA and other correlates, SOA has no
significant independent effect in his PNAS sample. Now let's
test whether that replicates in larger samples, both in terms of number
of articles, journals, and time-base. (*Failure* to find a significant
effect in a small sample is far less compelling than success in
finding a significant effect in a small sample!)
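
That last point about sample size can be made concrete with the
cited-at-least-once proportions above. A quick two-proportion z-test
sketch in Python (using pooled standard error; a simplification, since
the published analysis is a full logistic regression with covariates)
shows that the SOA-versus-NOA difference of roughly 4 percentage points
falls short of significance at n = 121, but the very same effect size
would be clearly significant with a tenfold larger sample:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportion z statistic with a pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Proportions cited at least once, from the PNAS sample above.
p_soa, n_soa = 0.901, 121    # self-archived only
p_noa, n_noa = 0.862, 1159   # not OA

z_small = two_proportion_z(p_soa, n_soa, p_noa, n_noa)
z_large = two_proportion_z(p_soa, n_soa * 10, p_noa, n_noa * 10)

print(f"z at actual sample size: {z_small:.2f}")  # below 1.96: not significant
print(f"z at 10x sample size:   {z_large:.2f}")   # above 1.96: significant
```

This is just the familiar point about statistical power: a null result
in a small sample says little about whether the effect is real.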

> Harnad, a self-confessed "archivangelist", co-creator of a
> self-archiving platform, and an outspoken advocate of self-archiving
> (speaking of vested interests) calls the finding that self-archived
> articles are... cited less often than [gold] OA articles from the same
> journal "controversial". In my mind, the finding that the impact of
> nonOA<greenOA<goldOA<green+goldOA is intuitive and logical: The level
> of citations correlates with the level of openness and accessibility.

I don't dispute that POA can add more citations, just as BOA can; maybe
self-archiving in 10 different places will add still more. But what does
this imply, right now, practically speaking? And, even more important,
how likely is it that this sort of redundancy will continue to confer
significant citation advantages once a critical mass of the literature
is in interoperable Institutional Repositories (green SOA) rather than
few and far between, as now? It is indeed intuitive and logical that
the baseline 15% of the literature as a whole that is being spontaneously
self-archived somewhere, somehow on the Web, across all fields, has
somewhat less visibility right now than the 15% of PNAS articles that PNAS
is making OA for those authors who pay for it (POA). That's a one-stop
shopping advantage for PNAS articles, against PNAS articles, in a
high-profile store, today.

But the true measure of the SOA advantage today (at its 15% spontaneous
baseline) is surely not to be found in PNAS but in the statistically far
more numerous, hence far more representative full-spectrum of journals
that do not yet offer POA. (I would be delighted if those journals took
the Eysenbach findings as a reason for offering a POA option! But not
at the expense of authors drawing the absurd conclusion -- not at all
entailed by Eysenbach's PNAS-specific results -- that in the journals
they currently publish in, SOA alone would not confer citation advantages
at least as big as the ones we have been reporting.)

Regarding my self-confessed sin of archivangelizing, however, I do protest
that my first and only allegiance is to 100% OA, and I evangelize the
green road (and promote the self-archiving software) only because it is
so resoundingly obvious that it is the fastest and surest road to 100%
OA. (If empirical -- or logical -- evidence were ever to come out showing
the contrary, I assure you I too would join the gold rush!)

> Sometimes our egos stand in the way of reaching a larger common goal,
> and I hope Harnad and other sceptics respond with good science rather
> than with polemics and politics to these findings.

Well, first, let us not get carried away: There's precious little science
involved here (apart from the science we are trying to provide Open
Access to). The call to self-archive in order to enhance access and
impact is so obvious and trivial that, as I noted, the puzzle is only
why anyone would even have imagined otherwise.

But when it comes to polemics and politics (and possibly also egos),
it might have kept things more objective if the results of Eysenbach's
small but welcome study confirming the OA impact advantage had not been
hyped with editorial salvos such as:

    "solid evidence to support or refute... that papers freely available
    in a journal will be more often read and cited than those behind a
    subscription barrier... has been surprisingly hard to find..."

Or even the heavily-hedged:

    "As far as we are aware, no other study has compared OA and non-OA
    articles from the same journal and controlled for so many potentially
    confounding factors."

> Unfortunately, in this area a lot more people have strong opinions
> and beliefs than those having the skills, time, and willingness to
> do rigorous research. I hope we will change this, and I reiterate a
> "call for papers" in that area []

May I echo that call, adding only that the rigorous research might
perhaps be better placed in a journal specializing in scientometrics and
in rigorously peer-reviewing it, rather than in The Journal of Medical
Internet Research, or even PLoS Biology.

I close with some replies to portions of another version of Eysenbach's
response which appeared on

> Harnad's point that the PLoS paper is about the "citation advantage
> of open access" and that there have been "previous papers about the
> citation advantage of open access" (mostly his own studies, mostly
> not published in peer-reviewed journals) is as meaningful as saying
> "this paper is about a cancer treatment, and there are previous
> papers about cancer treatments, so this one doesn't add anything".

That's not what I said. I said this:

    "[T]he only new knowledge from this small, journal-specific sample was
    (1) the welcome finding of how early the OA advantage can manifest
    itself, plus (2) some less clear findings about differences between
    first- and last-author OA practices, plus (3) a controversial finding
    that will most definitely need to be replicated on far larger samples
    in order to be credible: "The analysis revealed that self-archived
    articles are also cited less often than OA [sic] articles from the
    same journal."

And I do think all of this is as far away from rigorous oncological
research as it is from rocket science!

> The statement made by the reviewers and editors of the PLoS paper
> that this is the first study looking at the citation advantage of
> an open access/hybrid journal remains correct until somebody can
> show me a reference where this has been done before.

But who ever contested that far more modest and circumspect statement
(which was certainly not the one the accompanying PLoS editorial
made)? This is indeed "the first study looking at the citation advantage
of an open access/hybrid journal"; indeed, it's the first such study of
PNAS. But it's certainly not the first study looking at the citation
advantage of OA in general, or OA self-archiving in particular, and
looking at it within journals -- within *many* journals, many disciplines,
and many years.

> In analogy, a small carefully designed cohort study showing a
> relationship between smoking and cancer with 1500 patients, obtaining
> through questionnaires and interviews additional variables which could
> account for the association and controlling for these confounders and
> still coming to the conclusion that there is a relation between smoking
> and cancer is scientifically stronger evidence than a quick-and-dirty
> uncontrolled cross-sectional study showing an association between smoking
> and cancer, even if this is done in a population of millions.

Indeed it would. And I forgot to add to my list (4) that Eysenbach
had tested the hypothesis that the OA citation advantage is merely the
result of a self-selection bias by asking 247 authors whether it was,
and they replied that it wasn't...

Stevan Harnad
Received on Sat May 20 2006 - 14:42:42 BST
