Davis et al's 1-year Study of Self-Selection Bias: No Self-Archiving Control, No OA Effect, No Conclusion

From: Stevan Harnad <amsciforum_at_GMAIL.COM>
Date: Thu, 31 Jul 2008 16:06:21 -0400

      Davis, PM, Lewenstein, BV, Simon, DH, Booth, JG, &
      Connolly, MJL (2008) Open access publishing, article
      downloads, and citations: randomised controlled
      trial. British Medical Journal 337: a568

Overview (by SH):

Davis et al's study was designed to test whether the "Open Access
(OA) Advantage" (i.e., more citations to OA articles than to non-OA
articles in the same journal and year) is an artifact of a
"self-selection bias" (i.e., better authors are more likely to
self-archive or better articles are more likely to be self-archived
by their authors).

The control for self-selection bias was to select randomly which
articles were made OA, rather than having the author choose. The
result was that a year after publication the OA articles were not
cited significantly more than the non-OA articles (although they were
downloaded more).

The authors write:
      "To control for self selection we carried out a
      randomised controlled experiment in which articles from a
      journal publisher's websites were assigned to open access
      status or subscription access only"

The authors conclude:
      "No evidence was found of a citation advantage for open
      access articles in the first year after publication. The
      citation advantage from open access reported widely in
      the literature may be an artefact of other causes."

Commentary: 

To show that the OA advantage is an artefact of self-selection bias
(or any other factor), you first have to produce the OA advantage and
then show that it is eliminated by eliminating self-selection bias
(or any other artefact).

This is not what Davis et al did. They simply showed that they could
detect no OA advantage one year after publication in their sample.
This is not surprising, since most other studies don't detect an OA
advantage one year after publication either. It is too early.

To draw any conclusions at all from such a 1-year study, the authors
would have had to include a control condition, in which they managed to
find a sufficient number of self-selected self-archived OA articles
(from the same journals, for the same year) that do show the OA
advantage, whereas their randomized OA articles do not. In the
absence of that control condition, the finding that no OA advantage
is detected in the first year for this particular sample of journals
and articles is completely uninformative.

The authors did find a download advantage within the first year, as
other studies have found. This early download advantage for OA
articles has also been found to be correlated with a citation
advantage 18 months or more later. The authors try to argue that this
correlation would not hold in their case, but they give no evidence
(because they rushed to publish, three years early, a study originally
intended to run four years).

(1) The Davis study was originally proposed (in December 2006) as
intended to cover 4 years:
      Davis, PM (2006) Randomized controlled study of OA
      publishing (see comment) 

It has instead been released after a year.

(2) The Open Access (OA) Advantage (i.e., significantly more
citations for OA articles, always comparing OA and non-OA articles in
the same journal and year) has been reported in all fields tested so
far, for example:
      Hajjem, C., Harnad, S. and Gingras, Y. (2005) Ten-Year
      Cross-Disciplinary Comparison of the Growth of Open
      Access and How it Increases Research Citation
      Impact. IEEE Data Engineering Bulletin 28(4) pp. 39-47.

(3) There is always the logical possibility that the OA advantage is
not a causal one, but merely an effect of self-selection: The better
authors may be more likely to self-archive their articles and/or the
better articles may be more likely to be self-archived; those better
articles would be the ones that get more cited anyway.

(4) So it is a very good idea to try to control methodologically for
this self-selection bias: The way to control it is exactly as Davis
et al have done, which is to select articles at random for being made
OA, rather than having the authors self-select.

(5) If it then turns out that the citation advantage for randomized
OA articles is significantly smaller than the citation advantage for
self-selected OA articles, then the hypothesis that the OA advantage
is all or mostly just a self-selection bias is supported.
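
To make that comparison concrete, here is a minimal illustrative
sketch in Python. All of the numbers in it are hypothetical (they are
not from Davis et al or from any of the studies cited here); the point
is only the shape of the test: estimate the citation advantage of the
randomized-OA group and of a self-selected-OA group, each against the
non-OA articles from the same journals and year, and ask whether the
two advantages differ.

    # Illustrative sketch only -- hypothetical citation counts, not
    # data from Davis et al or from any of the studies cited here.
    import numpy as np

    rng = np.random.default_rng(0)

    # Three groups of articles from the same journals and year:
    non_oa        = rng.poisson(4.0, 200)   # subscription-access controls
    randomized_oa = rng.poisson(4.2, 200)   # OA assigned at random
    selected_oa   = rng.poisson(6.0, 200)   # OA by author self-archiving

    def advantage(oa_group, control):
        # Citation advantage = ratio of mean citations, OA over non-OA.
        return oa_group.mean() / control.mean()

    # Bootstrap the difference between the two advantages.
    diffs = []
    for _ in range(10_000):
        c = rng.choice(non_oa, non_oa.size)
        r = rng.choice(randomized_oa, randomized_oa.size)
        s = rng.choice(selected_oa, selected_oa.size)
        diffs.append(advantage(s, c) - advantage(r, c))

    lo, hi = np.percentile(diffs, [2.5, 97.5])
    print("self-selected OA advantage: %.2fx" % advantage(selected_oa, non_oa))
    print("randomized OA advantage:    %.2fx" % advantage(randomized_oa, non_oa))
    print("95%% bootstrap CI for the difference: [%.2f, %.2f]" % (lo, hi))
    # Only if the self-selected advantage is reliably larger than the
    # randomized one is the self-selection hypothesis supported -- and
    # only once both are measured over an interval long enough for an
    # OA advantage to appear at all.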

(6) But that is not at all what Davis et al. did.

(7) All Davis et al did was to find that their randomized OA articles
had significantly higher downloads than non-OA articles, but no
significant difference in citations.

(8) This was based on the first year after publication, when most of
the prior studies on the OA advantage likewise find no significant OA
advantage, because it is simply too early: the early results are too
noisy! The OA advantage shows up in later years (1-4).

(9) If Davis et al had been more self-critical, seeking to test and
perhaps falsify their own hypothesis, rather than just to confirm it,
they would have done the obvious control study, which is to test
whether articles that were made OA through self-selected
self-archiving by their authors (in the very same year, in the very
same journals) show an OA advantage in that same interval. For if
they do not, then of course the interval was too short, the results
were released prematurely, and the study so far shows nothing at all:
It is not until you have actually demonstrated an OA advantage that
you can estimate how much of it might be due to a self-selection
artefact!

(10) The study shows almost nothing at all, but not quite nothing,
because one would expect (based on our own previous study, which
showed that early downloads, at 6 months, predict enhanced citations
a year and a half or more later) that Davis's increased downloads too
would translate into increased citations, once given enough time.
      Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web
      Usage Statistics as Predictors of Later Citation
      Impact. Journal of the American Society for
      Information Science and Technology
      (JASIST) 57(8) pp. 1060-1072.
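
For readers who want to see the form of that usage/citation analysis,
here is a minimal sketch, again with purely hypothetical numbers
rather than Brody et al's data: it simply checks whether articles'
downloads in the first six months after publication rank-correlate
with the citations they accumulate later.

    # Illustrative sketch only -- hypothetical usage and citation
    # numbers, not Brody et al's data.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)
    n = 500

    # Hypothetical downloads in the first 6 months after publication.
    early_downloads = rng.lognormal(mean=3.0, sigma=1.0, size=n)

    # Assume, for illustration only, that citations accumulated 18+
    # months later partly track that early usage.
    later_citations = rng.poisson(0.05 * early_downloads + 1.0)

    rho, p = spearmanr(early_downloads, later_citations)
    print("Spearman rho = %.2f, p = %.3g" % (rho, p))
    # A positive, significant correlation is what licenses reading an
    # early download advantage as a predictor of a later citation
    # advantage.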

(11) The findings of Michael Kurtz and collaborators are also
relevant in this regard. They looked only at astrophysics, which is
special, in that (a) it is a field with only about a dozen journals,
and every research astronomer has subscription access to them -- and
these days also free online access via ADS -- and (b) it is a field
in which most authors self-archive their preprints very early
in arXiv -- much earlier than the date of publication.
      Kurtz, M. J. and Henneken, E. A. (2007) Open Access does
      not increase citations for research articles from The
      Astrophysical Journal. Preprint deposited in arXiv
      September 6, 2007.

(12) Kurtz & Henneken found the usual self-archiving advantage in
astrophysics (i.e., about twice as many citations for OA papers than
non-OA) but when they analyzed its cause, they found that most of the
cause was the Early Advantage of access to the preprint, as much as a
year before publication of the (OA) postprint. In addition, they
found a self-selection bias (for preprints -- which is all that were
involved here, because, as noted, as of publication, everything is
OA): The better articles by the better authors were more likely to
have been self-archived as preprints.

(13) Kurtz's results do not generalize to all fields, because in other
fields it is not true that (a) they already have 100% OA for their
published postprints, or that (b) many authors tend to self-archive
their preprints before publication.

(14) However, the fact that early preprint self-archiving (in a field
that is 100% OA as of postprint publication) is sufficient to double
citations is very likely to translate into a similar effect, in a
non-OA field, if one reckons on the basis of the one-year access
embargo that many publishers are imposing on the postprint. (The
yearlong "No-Embargo" advantage in other fields might not turn out to
be so big as to double citations, as with the preprint Early
Advantage in astrophysics, because at least there is some
subscription access to the postprint, but the counterpart of the
Early Advantage for the postprint is likely to be there too.)

(15) Moreover, the preprint OA advantage is primarily Early
Advantage, and only secondarily Self-Selection.

(16) The size of the postprint self-selection bias would have been
what Davis et al tested -- if they had done the proper control, and
waited long enough to get an actual OA effect to compare against.

(17) We had reported in a pilot study that there was no statistically
significant difference between the size of the OA advantage for
mandated and unmandated self-archiving:
      Hajjem, C & Harnad, S. (2007) The Open Access Citation
      Advantage: Quality Advantage Or Quality Bias? Preprint
      deposited in arXiv January 22, 2007.

(18) We will soon be reporting the results of a 4-year study on the
OA advantage in mandated and unmandated self-archiving that confirms
these earlier findings: Mandated self-archiving is like Davis et al's
randomized OA, and it does not reduce the OA advantage at all -- once
enough time has elapsed for there to be an OA Advantage at all. 

Stevan Harnad
American Scientist Open Access Forum
Received on Thu Jul 31 2008 - 21:11:03 BST
