# Bayesian observation modeling in presence-only data

*Seminar*

- Time: 15:45
- Date: 18 April 2013
- Venue: Building 54 room 10031

## Event details

Statistics Research Thursday Seminar Series

The prevalence of presence-only samples, e.g. in ecology or criminology, has led to a variety of statistical approaches. Aiming to predict ecological niches, species distribution models provide probability estimates of a binary response (presence/absence) in light of a set of environmental covariates. Similarly, statistical models to predict crime use propensity indicators from observable attributes inferred from incidental data. However, the associated challenges are confounded by non-uniform observation models; even in cases where observation is driven by seemingly irrelevant factors, these may distort estimates about the distribution of occurrences as a function of covariates due to unknown correlations. We present a Bayesian non-parametric approach to addressing sampling bias by carefully incorporating an observation model in a partially identifiable framework with selectively informative priors and linking it to the underlying process. Any available information about the role of various covariates in the observation process can then naturally enter the model. For example, in cases where sampling is driven by presumed likelihood of detecting an occurrence, the observation model becomes a proxy of the presence/absence model. We illustrate our methods on an example from species distribution modeling and a corporate accounting application.

We make many statistical tools available for the analysis of randomized response data, and our presentation gives an overview that includes logistic regression (useful when the sensitive question is the response variable to be related to explanatory variables), loglinear analysis (useful to investigate relations between various sensitive questions), item response theory models (useful for investigating the extent of individual sensitive behaviour from a number of sensitive questions) and count data models (counting the number of sensitive behaviours).
In addition, extensions to the above statistical models are proposed that accommodate respondents who do not follow the randomized response design by saying "no" to whatever sensitive question is asked (the untruthful-responses problem). We argue that in the past randomized response has not been used as often as it could be, and that there are probably three reasons for this. Firstly, randomized response research is expensive, as larger sample sizes are needed to obtain the same precision of estimates. Secondly, researchers mistakenly believe that randomized response only allows for the estimation of the prevalence of sensitive behaviour. Thirdly, randomized response is not believed to solve the untruthful-responses problem. These three reasons can be refuted as follows. Firstly, data collection over the internet has become easier over the years and can solve the sample size problem. Secondly, as indicated above, a whole range of tools for the analysis of randomized response data exists. Thirdly, new statistical methodology is available to handle respondents who do not comply with the randomized response design.

## Speaker information

Ioanna Manolopoulou, University College London. Lecturer at the Department of Statistical Science.