about true names ... searching and measuring.

From: Bernard Lang <Bernard.Lang_at_inria.fr>
Date: Fri, 18 Aug 2006 18:36:47 +0200

* guedon <jean.claude.guedon_at_umontreal.ca>, on 18-08-06, wrote:
> The author disambiguation is indeed a really important issue. It affects
> all kinds of things, ranging from the Science Citation Index to even
> some commercial offerings. For example, while searching an author in a
> Springer journal the other day, I noticed that their own search engines
> distinguished between the author's name with full first name from the
> same author's name with just the initial... I had to search through two
> lists of articles instead of one.
> ....

This is actually just a special case of a general problem, regarding
the relation between entities and names. I have been thinking for
some time about it, actually before the issue arose for e-prints,
because I was wondering about a proper way to build bibliographies.

An author may have several names, not just with varying initials and
full first names. A woman may use a husband's name or a maiden name,
and may even remarry (I know a female scientist who publishes under
the name of her first husband, though she is divorced and remarried).
Another well-known scientist simply changed his first name because he
did not like it.

So, many names for the same scientist is a common occurrence.

But the reverse is also true. Though I did write a good number of
them, I did not write all scientific papers published by Bernard Lang.
There is at least one other scientist by that name, possibly more, and
in a field that is not that remote from mine.

So: many scientists for the same name.

Both problems can be solved by the solution proposed by Jean-Claude,
which is of course the natural one. Though I would hate for the
publisher to choose my personal ISBN ... whatever the name for that
ID. This should be handled some other way.
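
A minimal sketch of the two relations just described (many names for one
scientist, many scientists for one name), assuming a permanent identifier
of the kind Jean-Claude proposes. The class name, the identifier format
"author:0001" and the name variant "B. Lang" are hypothetical, chosen only
for illustration:

```python
from collections import defaultdict


class AuthorRegistry:
    """Many-to-many mapping between name strings and author identifiers."""

    def __init__(self):
        self._ids_by_name = defaultdict(set)   # normalised name -> author IDs
        self._names_by_id = defaultdict(set)   # author ID -> names as published

    def add(self, author_id, name):
        """Record that this author has published under this name."""
        self._ids_by_name[name.casefold()].add(author_id)
        self._names_by_id[author_id].add(name)

    def candidates(self, name):
        """All authors known under this name (there may be several)."""
        return sorted(self._ids_by_name.get(name.casefold(), set()))

    def names(self, author_id):
        """All names one author has published under."""
        return sorted(self._names_by_id.get(author_id, set()))


registry = AuthorRegistry()
# Hypothetical identifiers, for illustration only.
registry.add("author:0001", "Bernard Lang")
registry.add("author:0001", "B. Lang")
registry.add("author:0002", "Bernard Lang")  # a different scientist, same name

print(registry.candidates("Bernard Lang"))
print(registry.names("author:0001"))
```

A search by name returns candidates rather than a single answer, and a
search by identifier gathers every name variant into one list of articles.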

But that is not all. The same problem exists for institutions.  My
institution, INRIA, used to be called IRIA.  We got the N (for
National) only after 13 years of existence.  And I did publish as
Bernard Lang from INRIA papers that were written by Bernard Lang from
IRIA.  The paper actually gives IRIA as my institution (though it also
says that my address is at INRIA) and the publisher thinks my
institution is INRIA.

There is also a place in the Netherlands called CWI, which was formerly
the Mathematisch Centrum  (apologies if the spelling is wrong).

So, how do you search by author institution?
Or how do you search the reports of an institution if you do not know
all its names?

And I am pretty sure there are unrelated institutions that have the
same name, though I did not bother finding an example.

Shouldn't we have unique identifiers in this case too?
Possibly ... though things may be more complicated, since institutions
can merge or split (I have examples of both), though many researchers
are not aware of that.

The problem arises again with publications.

The same, identical, paper can sometimes appear in several media, and
it is useful to know that to make search easier (though we should
assume the problem is not essential for e-documents, which should be
easier to access).

The same physical book can be part of a series of books, as well as
the proceedings of a conference belonging to two usually independent
series of annual conferences.  So it can be referenced as the book, as
proceedings of conference A or as proceedings of conference B.  But it
is the same thing, and while a library may think it does not have
proceedings of conference B, it actually has proceedings of conference
A ... the same. I did not make up this example, I have seen it.

There is also the issue of conference series that change their names.
etc ...

Conclusion: the name is not the thing.
  One name may denote many things and one thing may have many names.
And that is generally true for many real life entities.

[side note: I wonder how sorcerers deal with this issue ...  but
maybe ISBN-like references are what they call the true name of things
...  for more scientific details on this, read for example "A Wizard
of Earthsea" by Ursula K. Le Guin.]

Having unique references (ISBN style) will solve part of the problem.
However, that will not solve all problems, and we may want whatever
database structure we are using to be able to describe other relations
such as splits and merges between institutions, between associations,
between collections, between conferences ... to make it easier to
retrieve documents, to analyse publication structures, or to evaluate
the productive output of x or y.

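
A minimal sketch of such a structure for institutions, assuming each one
has a unique identifier and that renames, merges and splits are recorded
as succession links. The class, the method names and the identifiers
("inst:1", ...) are hypothetical, for illustration only:

```python
from collections import defaultdict, deque


class InstitutionGraph:
    """Institutions keyed by unique ID, linked by succession relations."""

    def __init__(self):
        self.names = {}                   # institution ID -> name
        self._links = defaultdict(set)    # ID -> directly related IDs

    def add(self, inst_id, name):
        self.names[inst_id] = name

    def succeed(self, old_ids, new_ids):
        """Record a rename (1 -> 1), a merge (n -> 1) or a split (1 -> n)."""
        for old in old_ids:
            for new in new_ids:
                self._links[old].add(new)
                self._links[new].add(old)

    def lineage(self, inst_id):
        """IDs reachable through any chain of renames, merges or splits."""
        seen = {inst_id}
        queue = deque([inst_id])
        while queue:
            current = queue.popleft()
            for other in self._links.get(current, ()):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        return seen

    def search_names(self, inst_id):
        """Every name worth trying when searching for this institution."""
        return sorted(self.names[i] for i in self.lineage(inst_id))


graph = InstitutionGraph()
# Hypothetical identifiers, for illustration only.
graph.add("inst:1", "IRIA")
graph.add("inst:2", "INRIA")
graph.succeed(["inst:1"], ["inst:2"])      # rename
graph.add("inst:3", "Mathematisch Centrum")
graph.add("inst:4", "CWI")
graph.succeed(["inst:3"], ["inst:4"])      # rename

print(graph.search_names("inst:2"))
print(graph.search_names("inst:4"))
```

Following links in both directions is deliberately generous: after a
split it also returns the sibling institutions, which suits retrieval but
would need refining before being used to measure the output of x or y.
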
It would also be useful to relate variations of the same paper.

Sorry if I have been stating what some of you consider obvious.
Jean-Claude's remarks were obvious to me, and I assumed that if he
felt it useful to state them, there was a chance that my further
developments might be useful too.

Bernard Lang

PS My late colleague and boss, Gilles Kahn, acting as scientific
director of INRIA, spent considerable time on local reports trying to
determine from the names of conferences and publications, what
conferences or publications were actually named.  This should not

* guedon <jean.claude.guedon_at_umontreal.ca>, on 18-08-06, wrote:
>
> I believe that scientific and scholarly authors ought to be given a
> permanent identifier which ought to accompany their publication in any
> journal that carries peer review. In effect, it would be the equivalent
> of an ISBN.
>
> The easiest way to begin implementing this PAI (Permanent Author
> Identifier) might be for a group of journals to come together and agree
> that when a paper is submitted, the author must supply his/her permanent
> identifier. If he/she does not have one, indicating so would mean that
> the cooperating publisher would assign one immediately and would place
> it in an open database. Universities could encourage their students to
> take up such an identifier as soon as these are on a track (e.g.
> doctoral studies) that should lead to some publishing.
>
> In conclusion, I do not claim to have clear strategies about this PAI,
> but the need for one appears very high to me. In particular, it would be
> very useful for institutional repositories and the OA movement in
> general.
>
> Google and other large search engines might be interested in supporting
> such a development. It would greatly enhance the capability of Google
> Scholar. Countries that do not use the Latin script or use it with funny
> diacritical marks (as in Guédon) might also find it useful to have their
> scientists unambiguously visible in the whole world, even though this
> might decrease the number of "scientists" for any given country.
>
> Best,
>
> jcg
>
> On Friday 18 August 2006 at 08:51 -0400, Timothy Miles-Board wrote:
> > The EPrints team have been looking at this issue in some detail. The current
> > version of EPrints has "clone" and "new version" options which save having
> > to re-enter metadata for similar/different versions of an existing deposit.
> > However, this doesn't help much if you are starting a new deposit. The
> > approach we've been favouring of late is auto-completion (like Google
> > Suggest http://labs.google.com/suggest), whereby the depositor begins typing
> > the first few characters of the name of a co-author and is presented with a
> > pop-up list of suggestions. The behind-the-scenes logic that determines what
> > to suggest can be customised to an individual repository's requirements e.g.
> > suggest from the list of registered users, suggest by looking up in the
> > institution's user account (e.g. LDAP) server, suggest according to an
> > internal database list of institutional and non-institutional users. The
> > previous deposits that you have made can also inform the list of suggestions
> > e.g. frequent/recent co-authors can be promoted to the top of the list of
> > suggestions.
> > 
> > This is not just about minimising keystrokes - the suggestion mechanism we
> > implemented is also able to carry additional data about the authors being
> > suggested. You mention the potential for cross-linking an author's work
> > between archives. In order to do this you need to be able to uniquely
> > identify them. Author disambiguation is potentially important for the
> > Research Assessment Exercise (RAE) in the UK. When an author's name is
> > autocompleted, the ID of that author is also attached.
> > 
> > We have also successfully applied the auto-completion technique to keywords
> > and journal names (with the ISSN number of the journal being passed with the
> > suggestion and used to auto-fill the ISSN field upon selection of the
> > intended journal by the user).
> > 
> > Although for the moment we've decided not to include it in the next version
> > of EPrints (3.0), it will be in a future version. In the meantime, I'd be
> > happy to describe our technique in more technical detail on the eprints.org
> > wiki if that would be useful (creating an autocompleting field in the
> > EPrints deposit form using an open source AJAX library is straightforward;
> > the complicated bit comes in designing the (independent) program that makes
> > appropriate and useful suggestions in response to the user's keystrokes).
> > 
> > It is also worth noting that EPrints 3.0 will have a number of new options
> > for importing data e.g. users can create new deposits by cutting and pasting
> > BibTeX/EndNote/etc entries from a bibliography file into a textbox and
> > hitting a button.
> > 
> > Tim
> > 
> > --
> > Timothy Miles-Board
> > EPrints Services
> > Southampton, UK    tmb_at_ecs.soton.ac.uk
> > http://www.eprints.org/services/
> > Consultancy - Training - Hosting
> > 
> > 
> > 
> > On Tue, 15 Aug 2006 11:08:58 +0100, Andrew A. Adams
> > <a.a.adams_at_READING.AC.UK> wrote:
> > 
> > >Regarding this note, one of the things we're struggling with in setting up a
> > >pilot of an IR at the University of Reading (the School of Systems
> > >Engineering and the School of Maths, Meteorology and Physics are jointly
> > >piloting an IR for the University) is that of manually inputting local
> > >institutional co-authors. It's one of the weaknesses, IMHO, of the GNU
> > >eprints software that it doesn't have two methods of author input - selection
> > >from a list of institutional users already registered, and free text input of
> > >non-institutional authors. In fact, even with non-institutional authors, it's
> > >quite common to regularly author joint papers with the same
> > >non-co-institutional author a number of times, if one has a productive external
> > >collaboration. I would prefer, rather than manually entering each author name
> > >in free text, to have a search system available for "registered authors" not
> > >all of whom need to be registered users of the system (which deals with the
> > >issue of people leaving institutions and stopping being registered users but
> > >remaining as authors for their prior papers). If a new co-author is to be
> > >entered, then minimising the number of keystrokes and the utility of having
> > >more than just free-text name-entry only available, though not necessarily
> > >mandated, should be considered. As the IR grows then, if it is deemed useful,
> > >people can be employed to add extra information onto the non-user author
> > >details, such as affiliation at the time the paper was deposited, and
> > >possibly cross-links to other IRs containing the works of that author (which
> > >could also be useful for authors moving between institutions).
> > >
> > >
> > >--
> > >*E-mail*a.a.adams_at_rdg.ac.uk********  Dr Andrew A Adams
> > >**snail*27 Westerham Walk**********  School of Systems Engineering
> > >***mail*Reading RG2 0BA, UK********  The University of Reading
> > >****Tel*+44-118-378-6997***********  Reading, United Kingdom
             Le brevet logiciel menace votre entreprise
               Software patents threaten your company
    Soutenez la Majorité Économique - Support the Economic Majority
Bernard.Lang_at_inria.fr             ,_  /\o    \o/    Tel  +33 1 3963 5644
http://pauillac.inria.fr/~lang/  ^^^^^^^^^^^^^^^^^  Fax  +33 1 3963 5469
            INRIA / B.P. 105 / 78153 Le Chesnay CEDEX / France
         Je n'exprime que mon opinion - I express only my opinion
Received on Fri Aug 18 2006 - 18:58:47 BST
