Re: Future UK RAEs to be Metrics-Based

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Tue, 28 Mar 2006 12:19:27 +0100

  Date: Tue, 28 Mar 2006 08:13:32 -0500
  From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
  To: ASIS&T Special Interest Group on Metrics <SIGMETRICS_at_LISTSERV.UTK.EDU>
  Subject: Re: Future UK RAEs to be Metrics-Based

The UK has a "dual" funding system: (1) conventional direct research
grant applications, with peer review of competitive proposals (RCUK) and
(2) top-sliced funding accorded to departments (not individuals) based on
past departmental research performance (RAE). The RAE was a monstrously
expensive and time-consuming exercise, with paper collection and
submission of all kinds of performance markers, including 4 full-text
papers, for peer-re-review by RAE panels. It turned out that the RAE's
outcome -- each departmental RAE "rank" from 1 to 5*, with top-sliced
funding given according to the rank and number of researchers submitted
-- was highly correlated with total citation counts for the department's
submitted researchers (r = .7 to .9+) and even more highly correlated
with prior RCUK funding (.98).

So RAE rank correlates highly with prior RCUK (and European) funding and
almost as highly with citations (and with other metrics, such as the
number of doctorates awarded). The RAE rank is based on the data received
and evaluated by the panel -- not through multiple regression, but
through some sort of subjective weighting, including a "peer-re-review"
of already published, already peer-reviewed articles. (I very much doubt
that many of those articles are actually read: the panels are not
specific experts in their subject matter, as the original journal
peer-reviewers were meant to be. It is far more likely that their ranking
of the articles is based on the reputation of the journals in which they
were published; and there is definitely pressure in the departments to
submit preferentially those articles that have appeared in high-quality,
high-impact journals.)

So what is counted explicitly is prior funding, doctorates, and a few
other explicit measures; in addition, there is the "peer-re-review" --
whatever that amounts to -- which is no doubt *implicitly* influenced by
journal reputations and impact factors. However, neither journal impact
factors nor article/author citations are actually counted *explicitly*;
indeed, it is explicitly forbidden to count citations for the RAE. That
makes the high correlation of the RAE outcome with citation counts all
the more remarkable -- and it makes the even higher correlation with
prior funding, which *is* counted explicitly, much less remarkable.

The multiple regression ("metric") method is not yet in use at all. It
will now be tried out in parallel with the next RAE (2008), which will
be conducted in the usual way, with the metrics exercise run alongside it.

Prior funding counts are no doubt causal in the present RAE outcome
(since they are explicitly counted), but that is not the same as saying
that research funding is causal in generating research performance
quality! Funding is no doubt a necessary precondition for research
quality, because without funding one cannot do research; but to what
extent prior funding levels in and of themselves are causes of research
quality variance (over and above being a Matthew Effect or
self-fulfilling prophecy) is an empirical question about how good a
predictor individual research-proposal peer review is for allotting
departmental top-sliced funding to reward and foster research
performance.

Hence the causality question is in a sense a question about the causal
efficacy of the UK's dual funding system itself, and about the relative
independence of its two components. For if they are indeed measuring and
rewarding the very same thing, then the RAE and the dual system may as
well be scrapped, and individual RCUK proposal funding simply scaled up
proportionately with the redirected funds.

I am not at all convinced that the dual system itself should be scrapped,
however; just that the present costly and wasteful implementation of the
RAE component should be replaced by metrics. And those metrics should
certainly not be restricted to prior funding, even though it was so
highly correlated with RAE ranking. They should be enriched by many other
metric variables in a regression equation, composed and calibrated
according to each discipline's peculiar profile as well as its internal
and external validation results. And let us supplement conservative
metrics with the many richer and more diverse ones that will be afforded
by an online, open-access full-text corpus, citation-interlinked, tagged,
and usage-monitored.

Stevan Harnad

On 28-Mar-06, at 6:39 AM, Loet Leydesdorff wrote:

> > SH: To repeat: The RAE itself is a predictor, in want of
> > validation. Prior funding correlates 0.98 with this predictor
> > (in some fields, and is hence virtually identical with it),
> > but is itself in want of validation.

> Do you wish to say that both the RAE and the multivariate regression
> method correlate highly with prior funding? Is the latter perhaps causal
> for research quality, in your opinion?
>
> The policy conclusion would then be that both indicators are very
> conservative. Perhaps, that is not a bad thing, but one may wish to
> state it straightforwardly.

  Date: Tue, 28 Mar 2006 12:19:27 +0100 (BST)
  From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
  To: SIGMETRICS_at_LISTSERV.UTK.EDU, oaci-working-group_at_mailhost.soros.org
  Subject: Re: Future UK RAEs to be Metrics-Based

  This is an anonymised exchange from a non-public list concerning
  scientometrics and the future of the UK Research Assessment Exercise.
  I think it has important general scientometric implications.
  By way of context: The RAE was an expensive, time-consuming
  submission/peer-re-evaluation exercise, performed every 4 years. It turned
  out a few simple metrics were highly correlated with its outcome. So it
  was proposed to scrap the expensive method in favour of just using
  the metrics. -- SH

---------- Forwarded message ----------

On Tue, 28 Mar 2006, [identity deleted] wrote:

> At 8:34 am -0500 27/3/06, Stevan Harnad wrote:
> >SH: Scrap the RAE make-work, by all means, but don't just rely on one
> >metric! The whole point of metrics is to have many independent
> >predictors, so as to account for as much as possible of the
> >criterion variance:
>
> This seems extremely naive to me. All the proposed metrics I have
> seen are *far* from independent - indeed they seem likely to be
> strongly positively associated.

That's fine. In multiple regression it is not necessary that each
predictor variable be orthogonal; they need only predict a significant
portion of the residual variance in the target (or "criterion") after
the correlated portion has been partialled out. If you are trying to
predict university performance and you have maths marks, English marks
and letters of recommendation (quantified), it is not necessary, indeed
not even desirable, that the correlation among the three predictors
should be zero. That they are correlated shows that they are partially
measuring the same thing. What is needed is that the three jointly, in a
multilinear equation, should predict university performance better than
any one of them alone. Their respective contributions to the variance
can then be given a weight.
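
By way of illustration only, here is a toy sketch (Python/numpy, with
invented numbers rather than real marks or RAE data) of that point: the
three predictors below are strongly intercorrelated, yet jointly they
account for noticeably more of the criterion variance than any one of
them alone.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    # A shared component makes the three predictors intercorrelate strongly...
    shared = rng.normal(size=n)
    maths   = shared + 0.6 * rng.normal(size=n)
    english = shared + 0.6 * rng.normal(size=n)
    letters = shared + 0.6 * rng.normal(size=n)

    # ...but the criterion also reflects each predictor's specific component.
    performance = maths + english + letters + rng.normal(size=n)

    def r_squared(X, y):
        # Proportion of criterion variance accounted for by a least-squares fit.
        X1 = np.column_stack([np.ones(len(y)), X])   # add an intercept term
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return 1 - (y - X1 @ beta).var() / y.var()

    print("predictor intercorrelation (maths, english): %.2f"
          % np.corrcoef(maths, english)[0, 1])
    print("maths alone       R^2 = %.3f" % r_squared(maths[:, None], performance))
    print("all three jointly R^2 = %.3f"
          % r_squared(np.column_stack([maths, english, letters]), performance))

Nothing in that depends on the predictors being orthogonal; linear
independence, plus a separate contribution from each, is enough.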

The analogy is with vectors: a linear combination of several vectors may
yield another, target vector. The combination need not be one of
orthogonal vectors, only of linearly independent ones.
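
A two-dimensional toy case of the same point (again just an invented
illustration): the two vectors below are far from orthogonal, yet, being
linearly independent, they reproduce the target exactly.

    import numpy as np

    a = np.array([1.0, 0.0])
    b = np.array([1.0, 1.0])          # not orthogonal to a (a . b = 1)
    target = np.array([3.0, 2.0])

    # Solve for the coefficients c in c[0]*a + c[1]*b = target.
    c = np.linalg.solve(np.column_stack([a, b]), target)
    print(c)                          # [1. 2.]  i.e. target = 1*a + 2*b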

Three other points:

(1) The RAE ranking itself is just a predictor, not the criterion that is
being predicted and against which the predictor(s) need to be validated.
The criterion is research performance/quality. Only metrics with face
validity can be taken to be identical with the criterion, as opposed to
mere predictors of it, and the RAE outcome is certainly not face-valid.

(2) Given (1), it follows that the *extremely* high correlation between
prior funding and RAE rank (0.98 was mentioned) is *not* a desirable
thing. The predictive power of the RAE ranking needs to be increased by
adding more (semi-independent but not necessarily orthogonal) predictor
metrics to a regression equation (such as funding, citations, downloads,
co-citations, completed PhDs, and many other potential metrics that will
emerge from an Open Access database and from digital performance
record-keeping CVs, customised for each discipline), rather than by
replacing the ranking with a single, one-dimensional predictor metric
(prior funding) that happens to co-vary almost identically with the
prior RAE outcome in many disciplines.

(3) Validating predictor metrics against the target criterion is
notoriously difficult when the criterion itself has no direct
face-valid measure. (An example is the problem of validating IQ tests.)
The solution is partly internal validation (validating multiple
predictor metrics against one another) and partly calibration, which
is the adjustment of the weight and number of the predictor metrics
according to corrective feedback from their outcome: In the case of the
RAE multiple regression equation, this could be done partly on the basis
of the 4-year predictive power of metrics against their own later values,
and partly against subjective peer rankings of departmental performance
and quality as well as peer satisfaction ratings for the RAE outcomes
themselves. (There may well be other validating methods.)
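
To make the calibration idea concrete, here is a sketch of what one such
step might look like (Python/numpy; the metric names and the 4-year lag
are taken from the discussion above, but the departments, numbers and
weights are all invented for illustration): fit the weights of the
current metrics so as to predict a later peer ranking, then check the
fitted equation on departments held out of the fitting.

    import numpy as np

    rng = np.random.default_rng(1)
    n_depts = 200

    # Current-cycle departmental metrics (standardised; purely illustrative).
    funding, citations, downloads, phds = rng.normal(size=(4, n_depts))
    X = np.column_stack([np.ones(n_depts), funding, citations, downloads, phds])

    # Hypothetical peer ranking four years later, against which to calibrate.
    peer_rank_later = (0.5 * funding + 0.3 * citations + 0.15 * downloads
                       + 0.05 * phds + 0.3 * rng.normal(size=n_depts))

    fit = np.arange(n_depts) < 150    # departments used to fit the weights
    test = ~fit                       # departments held out for checking
    weights, *_ = np.linalg.lstsq(X[fit], peer_rank_later[fit], rcond=None)

    pred = X[test] @ weights
    print("held-out correlation with later peer ranking: %.2f"
          % np.corrcoef(pred, peer_rank_later[test])[0, 1])
    print("fitted weights (intercept, funding, citations, downloads, PhDs):",
          np.round(weights, 2))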

> This sounds perilously close to what I used to read in the software
> metrics literature, where attempts were made to capture 'complexity'
> in order to predict the success or failure of software projects.
> People there adopted a
> measure-everything-you-can-think-of-and-hope-something-useful-pops-up
> approach. The problem was that all the different metrics turned out
> to be variants of 'size', and even together they did not enable good
> prediction.

It is conceivable that all research performance predictor metrics will
turn out to be measuring the same thing, and that none of them
contributes a separate, independent component to the variance of the
outcome; but I rather doubt it. At the risk of arousing other prejudices,
I would make an analogy with psychometrics: Tests of cognitive
performance capacity (formerly called "IQ" tests: maths, spatial,
verbal, motor, musical, reasoning, etc.) are constructed and validated
by devising test items and testing them first for reliability (i.e.,
how well they correlate with themselves on repeated administration)
and then for cross-correlation and external validation. The (empirical)
result has been the emergence of one general or "G" factor, on which the
weight or "load" of some tests is greater than that of others, so that
no single test measures it exactly; hence a multiple regression battery,
with each test weighted according to the amount of variance it accounts
for, is preferable to relying on any single test. The outcome is one
large underlying G factor, with a component in every one of the tests,
plus a constellation of special factors, associated with special
abilities supplementing the G factor, each adding a smaller but still
significant component to the variance, and varying by individual and
field in their predictive power.
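
A toy numerical version of that psychometric pattern (invented scores
and loadings, not real test data): generate a battery of tests that all
share one general factor, and the first principal component of their
correlation matrix duly loads on every test while accounting for the
largest share of the variance.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    g = rng.normal(size=n)                     # the general factor

    battery = {}
    for name, load in [("maths", 0.8), ("verbal", 0.7), ("spatial", 0.6),
                       ("reasoning", 0.75), ("musical", 0.4)]:
        # Each test = its loading on g plus a test-specific component.
        battery[name] = load * g + np.sqrt(1 - load**2) * rng.normal(size=n)

    scores = np.column_stack(list(battery.values()))
    corr = np.corrcoef(scores, rowvar=False)

    # Principal components of the correlation matrix (eigenvalues ascending).
    eigvals, eigvecs = np.linalg.eigh(corr)
    loadings = np.abs(eigvecs[:, -1]) * np.sqrt(eigvals[-1])

    print("share of variance on the first component: %.2f"
          % (eigvals[-1] / eigvals.sum()))
    for name, loading in zip(battery, loadings):
        print("%-10s loads ~ %.2f" % (name, loading))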

The controversy has been about whether the fact that the tests are
validated on the basis of positive correlations among the items is the
artifactual source of the positive manifold underlying G. I am not
a statistician or a psychometrician, but I think the more competent,
objective verdict (the one not driven by a priori ideological views)
has been that G is *not* an artifact of the selection for positive
correlations, but a genuine empirical finding about a single general
(indeed biological) factor underlying intelligence.

I am not saying there will be a "G" underlying research performance!
Just that the multilinear (and indeed nonlinear) regression method can
be used to tease out the variance and the predictivity from a rich and
diverse set of intercorrelated predictor metrics. (It can also sort out
the duds: those that are either redundant or predict nothing of interest
at all.)

> > SH: Metrics are trying to measure and evaluate research performance,
>
> I think you mean 'predict' - not the same thing at all

They measure the predictor variable and try to predict the criterion
variable. As such, they are meant to provide an objective (but
validated) basis for evaluation.

> >SH: not just to 2nd-guess the present RAE outcome,
> >nor merely to ape existing funding levels. We need a rich multiple
> >regression equation, with many weighted predictors, not just one
> >redundant mirror image of existing funding!
>
> Well.... In fact 'existing funding' *may* actually be a good
> predictor of whatever it is we want to predict (see [deleted]'s recent
> posting)!

To repeat: The RAE itself is a predictor, in want of validation. Prior
funding correlates 0.98 with this predictor (in some fields, and is
hence virtually identical with it), but is itself in want of validation.
This high correlation with the actual RAE outcome is already rational
grounds for scrapping the time-wasting and expensive ritual that is the
present RAE, but it is certainly not grounds for scrapping other metrics
that can and should be weighted components in the metric equation that
replaces the current wasteful and redundant RAE. The metric predictors
can then be enriched, cross-tested, and calibrated. (It is my
understanding that RAE 2008 will consist of a double exercise: yet
another iteration of the current ergonomically profligate RAE ritual
plus a parallel metric exercise. I think they could safely scrap the
ritual already, but the parallel testing of a rich battery of actual
and potential metrics is an extremely good -- and economical -- idea.)

> We can only test such hypotheses when we are clear what it
> is we want to predict, and what we mean by 'accuracy' of prediction.

In the first instance, in the decision about whether or not to scrap the
expensive and inefficient current RAE ritual, it is sufficient to
predict the current RAE outcome with metrics.

In order to go on to test and strengthen the predictive power of that
battery of metrics, the metrics need to be enriched and diversified,
internally validated and weighted against one another (and against the
prior RAE), and externally validated against the kinds of measures I
mentioned (subjective peer evaluations, predictive power across time,
perhaps other outcome metrics, etc.).

> Even if we knew this, I'm not sure the right data is available. But
> in the absence of such a proper investigation, let's not pretend that
> the answer is obvious, as you seem to be doing.

The answer is obvious insofar as scrapping the prior RAE method is
concerned, given the strong correlations. The answer is also obvious
regarding the fact that multiple metrics are preferable to a single
one. Ways of strengthening the predictive power of objective measures
of research performance are practical and empirical matters we need to
be analysing and upgrading continuously.

Stevan Harnad
Received on Tue Mar 28 2006 - 12:26:16 BST
