The University of Southampton
The Alan Turing Institute

Data Science Seminar: Efficient Extraction of Event-Centric Sub-Collections from the Web and Large Scale Web Archives

11:00 - 12:00
11 May 2016
Room 3077, Building 32, Highfield Campus, University of Southampton SO17 1BJ

The Web and Web archives are invaluable sources for following the traces of recent and past events, in particular for researchers in the Digital Humanities, journalists and historians. On the one hand, the large size of the data and their distributed nature make their analysis daunting, especially for non-computer scientists. On the other hand, most research questions only require a smaller relevant subset of the Web or a Web archive, such as the snapshots of Web pages describing one particular event or topic. For example, these sub-collections can reflect the ongoing refugee crisis in Europe, the Fukushima nuclear disaster in 2011, the German federal election in 2009, or the FIFA World Cup 2006. In this talk, I present our recent work to create methods that facilitate the extraction of event-centric sub-collections from the Web and Web archives. The creation of sub-collections raises several challenging research questions with respect to crawler guidance, indexing and relevance estimation. On the Web, our methods are facilitated through social media guidance using Twitter and enable efficient monitoring, gathering and analysis of fresh online content regarding current events. In Web archives, we propose flexible re-crawling methods coupled with topical and temporal relevance estimation and light-weight indexing. We discuss the opportunities and challenges of these approaches and present a framework for creating sub-collections.

Dr Elena Demidova is a Senior Research Fellow at the WAIS Group, ECS, University of Southampton. Elena has been a post-doctoral researcher at the L3S Research Centre in Hannover, Germany. She received her Ph.D. degree from the Leibniz Universität Hannover (Germany) in 2013 and her M.Sc. from the Universität Osnabrück (Germany) and the University of Twente (The Netherlands) in 2006. Her main research interests include multilingual text processing, Web archives, data quality analytics and Open Data. Elena has held leading roles in large-scale EU projects, most recently ARCOMEM FP7 IP, ALEXANDRIA, KEYSTONE COST Action and WDAqua ITN. Her work has been published in major conferences and journals, including ACM SIGIR, ACM CIKM, and IEEE TKDE, and she has been a reviewer and committee member for numerous scientific events and publications, most recently as a PC member for the ISWC, ESWC and ACM CIKM conferences and as guest editor for IJSWIS.

