With the advancement of Web 2.0, user-generated content (UGC) has grown explosively across platforms such as social media (Twitter, Facebook, YouTube), forums (Quora, Reddit), and e-commerce sites (Amazon, Alibaba, eBay). Increasingly, critical information is posted on these platforms embedded in other modalities, such as images and videos. Such posts are not accessible to conventional web crawling technology, which focuses on a post's text alone. Extracting embedded text for analysis can support many roles, including business analysts, policymakers, and law-enforcement agencies. For example, on e-commerce sites such as eBay and Alibaba, medical products are advertised with ingredient lists that are visible only in the product's image (see Appendix Figure 1). Law enforcement and medical regulators would like to quantify the online trade in illegal or harmful ingredients, providing intelligence to focus enforcement interventions and improve policy decision-making. Similarly, policymakers are interested in analysing polarization, including opposition and hate speech within political discourse, for example through short comments embedded within shared images that offer insights into sentiment and sarcasm (see Appendix Figure 2).
Recent advances in computer vision and natural language processing (NLP) using deep learning techniques have made automated multimodal information extraction from such embedded UGC viable. Researchers have investigated diverse multimodal document understanding tasks on images, such as representation learning [1-2], table-structure recognition [3-4], text-region detection [5-6], and entity extraction [7-9]. A few recent studies have used Optical Character Recognition (OCR) [11,13] and Convolutional Neural Network (CNN) methods to detect ingredient labels for food and cosmetics in order to check whether products are Halal. Further, a few studies have automatically classified images containing embedded text for online hate [14-16] and sentiment, using multimodal feature extraction such as OCR and pre-trained CNN models. Motivated by these studies on multimodal document understanding, this project will explore how effectively multimodal deep-learning NLP models can extract information from text embedded within posted images, in support of social science focussed downstream applications. Specifically, this project will focus on extracting ingredient labels from images of medicine products (to support illegal ingredient analysis) and comments embedded in images posted online in political discourse (to support political sentiment analysis).
This project will use publicly available datasets such as the Facebook Hateful Meme dataset, CORDS, and PubTables-1M to build a multimodal information extraction model. Further, we will curate annotated resources from Reddit forums, social media, and e-commerce sites to perform information extraction for the social science topics we are interested in (i.e., ingredients from images of medicine products for sale; hatred and sentiment classification of political image captions). We will apply state-of-the-art document understanding methods to the multimodal information extraction task. First, we will extract features from the image using a representation learning method such as DocFormer. Next, we will detect text regions and areas of interest in the image using techniques such as CascadeTabNet and DocExtractor, and extract the text itself using Optical Character Recognition (OCR) models such as Tesseract OCR.
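As an illustration of the step that would follow OCR in such a pipeline, the minimal Python sketch below filters low-confidence words, rebuilds text lines, and checks the recovered text against a watchlist. It is a sketch under stated assumptions, not the project's implementation: the `OcrWord` schema is hypothetical (loosely modelled on the word-level fields an OCR engine such as Tesseract can report), and the sample words, confidence threshold, and watchlist are made up for illustration. Plain substring matching stands in for the entity extraction model the project would actually train.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One word as an OCR engine might report it (hypothetical schema)."""
    text: str
    conf: float  # recognition confidence, assumed to lie in 0-100
    line: int    # index of the text line the word belongs to
    left: int    # x-coordinate of the word's bounding box, in pixels


def rebuild_lines(words, min_conf=60.0):
    """Drop low-confidence or empty words, then join each line left to right."""
    by_line = defaultdict(list)
    for word in words:
        if word.conf >= min_conf and word.text.strip():
            by_line[word.line].append(word)
    return [
        " ".join(w.text for w in sorted(by_line[key], key=lambda w: w.left))
        for key in sorted(by_line)
    ]


def find_flagged(lines, watchlist):
    """Return watchlist terms found in the recovered text (case-insensitive)."""
    text = " ".join(lines).lower()
    return sorted(term for term in watchlist if term.lower() in text)


# Illustrative input only: made-up words, confidences, and watchlist.
words = [
    OcrWord("Ingredients:", 95.0, 0, 10),
    OcrWord("Sibutramine", 88.0, 1, 10),
    OcrWord("15mg,", 91.0, 1, 120),
    OcrWord("caffeine", 72.0, 1, 180),
    OcrWord("~~", 12.0, 1, 260),  # OCR noise, removed by the confidence filter
]
lines = rebuild_lines(words)
flagged = find_flagged(lines, {"sibutramine", "phenolphthalein"})
print(lines)
print(flagged)
```

The confidence threshold trades recall for precision: raising `min_conf` removes more noise but can also discard small or blurred text, which is exactly the kind of content this project targets, so in practice it would need tuning on annotated data.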
This project will set a new baseline for multimodal information extraction of small text embedded in images, co-designed and tailored for use in social science applications. Outputs from this project will be:
i. Dataset - a new multimodal dataset for training algorithms that extract information from small text embedded in images, tailored for criminology and political science applications. In addition, this dataset will be used to explore the potential for comparative studies on different social science topics, such as patterns of word usage in a political context (emotive embedded text used to influence individuals) and a criminal context (advertising language used to obfuscate illegal ingredients in medical products).
ii. NLP models - a novel multimodal NLP model for information extraction.
iii. Lessons learnt - experience and lessons learnt from applying modern deep-learning NLP to this type of social science problem area.
iv. Exploitation - a stakeholder co-designed exploitation strategy leading to a proposal submitted to a funder (e.g., MHRA, DEFRA, or DSTL DASA).