COVID-19 UK Social Media Dataset for Public Health Research

We present a benchmark database of public social media postings from the United Kingdom related to the Covid-19 pandemic for academic research purposes, along with some initial analysis, including a taxonomy of key themes organised by keyword. This release supports the findings of a research study funded by the Scottish Government Chief Scientist Office that aims to investigate social sentiment in order to understand the response to public health measures implemented during the pandemic.

Updates:

v1.1: Extended data into 2021
v1.0: Initial release

Structure

The dataset is delivered as a set of CSV files. Messages are separated by network and month, and each row consists of: Date, Message ID (Tweet ID/Crowdtangle ID), Theme ID.

A key to the themes identified for each message can be found below: 
0: Test & Protect 
1: Shielding 
2: Care homes 
3: Covid survivors 
4: Resumption of health services 
5: Mental health & loneliness 
6: Trust in Scottish Government 
7: Routemap to exit lockdown 
8: Impact on BAME population 
9: Inequalities 
10: Community cohesion/solidarity 
11: Education 
12: Environment 
13: Quality of life 
14: Social/Family 
15: Leisure/Entertainment 
16: Travel 
17: Business restrictions 
18: Work 
19: Hygiene 
20: Shopping 
21: Unemployment 
22: Business growth 
23: Other
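
As a sketch of how a released file might be read, the following Python maps each row's Theme ID to its label from the key above. The column order (Date, Message ID, Theme ID) follows the structure described earlier, but the absence of a header row and the sample values are assumptions; adjust to the files as delivered.

```python
import csv
import io

# Theme key as listed above.
THEMES = {
    0: "Test & Protect",
    1: "Shielding",
    2: "Care homes",
    3: "Covid survivors",
    4: "Resumption of health services",
    5: "Mental health & loneliness",
    6: "Trust in Scottish Government",
    7: "Routemap to exit lockdown",
    8: "Impact on BAME population",
    9: "Inequalities",
    10: "Community cohesion/solidarity",
    11: "Education",
    12: "Environment",
    13: "Quality of life",
    14: "Social/Family",
    15: "Leisure/Entertainment",
    16: "Travel",
    17: "Business restrictions",
    18: "Work",
    19: "Hygiene",
    20: "Shopping",
    21: "Unemployment",
    22: "Business growth",
    23: "Other",
}


def read_rows(handle):
    """Yield (date, message_id, theme_label) tuples from an open CSV file.

    Assumes three columns in the order Date, Message ID, Theme ID.
    """
    for row in csv.reader(handle):
        # Skip blank lines, malformed lines and any header row.
        if len(row) < 3 or not row[2].strip().isdigit():
            continue
        # Keep the message ID as a string so long IDs are never rounded.
        yield row[0], row[1], THEMES.get(int(row[2]), "Unknown")


# In-memory sample; the values are made up for illustration.
sample = io.StringIO("2020-06-23,1275000000000000000,1\n")
rows = list(read_rows(sample))
```

Reading message IDs as strings rather than numbers avoids precision loss if the files are later passed through tools that treat large integers as floats.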

Collection 

For Twitter data, we designed a containerised streaming listener application that could be deployed quickly on any available host. It connects to the network endpoint, establishes a stream connection with a number of possible pre-selected filters, and continuously consumes and logs incoming messages. This application was connected to the endpoint on the 23rd of June at 09:38 GMT, and so the set contains messages from this point onwards.

In this study, we determined that harvesting the widest possible set of messages within our defined regional boundaries was of primary importance, and so we defined our stream filter parameters to harvest all messages tagged with a geographical location within the United Kingdom. Unfortunately, the Twitter 1.1 API specification in use at the time of development allowed only bounding box location filters, and only returned Tweets that had been tagged with Place information derived from the 'fine-grained location' permission enabled by users.
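
To illustrate the bounding box constraint, the sketch below formats a box as the comma-separated `locations` value used by the v1.1 streaming filter (longitude, latitude pairs, south-west corner first) and checks whether a point falls inside it. The coordinates are a rough illustrative box around the United Kingdom, not the values used in the study, and the class and constant names are hypothetical.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """A rectangle given by its south-west and north-east corners."""

    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude

    def as_locations_param(self) -> str:
        # Order: south-west longitude, latitude, then north-east longitude, latitude.
        return ",".join(str(v) for v in (self.west, self.south, self.east, self.north))

    def contains(self, lon: float, lat: float) -> bool:
        return self.west <= lon <= self.east and self.south <= lat <= self.north


# Illustrative coordinates only (an assumption, not the study's filter).
UK_BOX = BoundingBox(west=-8.65, south=49.86, east=1.77, north=60.86)
```

A single rectangle of this kind inevitably covers some neighbouring territory as well (this illustrative box includes Dublin, for example), which is one practical consequence of a filter limited to bounding boxes.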

These messages were queued via Google Cloud Pub/Sub, through which we sent the extracted post fields to a Cloud Function written in Python that cleaned and sequenced the data into our preferred format, before inserting it into a BigQuery database.
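
A minimal sketch of the field-extraction step in such a function follows. It assumes the Pub/Sub payload is a JSON-encoded Tweet delivered base64-encoded in the event's `data` field, as is the case for Pub/Sub-triggered background functions; the function name and the choice of output fields are illustrative, and the BigQuery insert is left out.

```python
import base64
import json


def extract_row(event: dict) -> dict:
    """Decode a Pub/Sub event and keep a small set of Tweet fields.

    The selected fields are an assumption for illustration; the project's
    actual schema is not described in this document.
    """
    tweet = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # "place" may be missing or null when no Place is attached.
    place = tweet.get("place") or {}
    return {
        "message_id": tweet["id_str"],
        "created_at": tweet.get("created_at"),
        "place": place.get("full_name"),
        "text": tweet.get("text", ""),
    }
```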

We gained access to Facebook data through the Crowdtangle platform, a Facebook-owned venture which allows access to public post and group data through both visual dashboard tools and an API. In order to harvest Covid-related material, we manually curated lists of important groups, pages and profiles through the web interface. These lists were sorted into Scotland-only and UK-wide categories.

A scheduled script was set up to connect to the API, retrieve all lists attached to the project dashboard, retrieve all posts made by list members during the preceding 24 hours, then directly upload the results to the BigQuery database. Unfortunately, the lack of a streaming interface precludes near-real-time updates, but this frequency of updates was judged to be acceptable by the project team. Since Facebook posts were available for archive retrieval, we collected posts beginning from the 1st of January 2020 at 00:01 GMT.

Processing 

Pre-processing of the data was limited to the removal of line ending characters (\n and \r) and annotation with a theme ID, determined by keyword frequency analysis against a pre-determined list of themes and keywords. Messages to which no theme label could be applied were tagged with a valence marker that excluded them from further analysis, and they have not been published in the released dataset.
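
The two pre-processing steps can be sketched as follows. The keyword lists are hypothetical stand-ins for two of the themes (the study's own lists are not reproduced here), and the scoring rule, which counts keyword occurrences per theme and takes the highest, is one plausible reading of keyword frequency analysis rather than the exact method used.

```python
import re

# Hypothetical keyword lists keyed by theme ID (1: Shielding, 16: Travel).
THEME_KEYWORDS = {
    1: ["shielding", "shield"],
    16: ["travel", "flight", "quarantine"],
}


def strip_line_endings(text: str) -> str:
    """Remove \\n and \\r characters from a message."""
    return text.replace("\r", "").replace("\n", "")


def assign_theme(text: str):
    """Return the theme ID whose keywords occur most often, or None.

    Ties are resolved in favour of the theme listed first in
    THEME_KEYWORDS; that tie-break is an assumption of this sketch.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = {
        theme_id: sum(words.count(keyword) for keyword in keywords)
        for theme_id, keywords in THEME_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # A score of zero means no keyword matched, so no label is applied.
    return best if scores[best] > 0 else None
```

A return value of `None` corresponds to the unlabelled messages that were excluded from further analysis.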

Hydration

In order to comply with network policies for researchers conducting data collection via the Twitter and Crowdtangle platforms, we are able to share only the IDs of material that we collected. This precludes us from sharing the location or text of our collected posts directly.

Several tools enable researchers to rehydrate this data to return the full content of the post or profile. For Twitter, the DocNow Hydrator and the Tweepy Python library can both fulfil this function. For Crowdtangle-provided data, however, the only option is to apply for access to the platform and use the official API.
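
When hydrating Tweet IDs in bulk, requests generally need to be batched; the v1.1 lookup endpoint accepted up to 100 IDs per call. The helper below is a sketch of that batching. `fetch` is a hypothetical callable standing in for whichever client performs the lookup (with Tweepy 4.x this might wrap `API.lookup_statuses`, though that should be checked against the installed version and current platform access rules).

```python
from typing import Callable, Iterable, Iterator, List


def batches(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Split an iterable of IDs into lists of at most `size` items."""
    batch: List[str] = []
    for message_id in ids:
        batch.append(message_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def hydrate(ids: Iterable[str], fetch: Callable[[List[str]], list]) -> list:
    """Call `fetch` once per batch and collect everything it returns.

    Deleted or protected posts may simply be absent from the results,
    so the output can be shorter than the input.
    """
    results = []
    for batch in batches(ids):
        results.extend(fetch(batch))
    return results
```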

Note: there are two separate IDs available for Facebook posts via the Crowdtangle API, the platform ID used by Facebook itself, and the Crowdtangle ID used by the analytics platform. We have provided the Crowdtangle ID in our dataset, and so when hydrating posts the API endpoint http://api.crowdtangle.com/ctpost/:id should be used.
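
As an illustration, the request URL for a single post could be assembled as below. The endpoint path is the one given above; passing the API key as a `token` query parameter is an assumption to verify against the official API documentation, and the function name is hypothetical.

```python
from urllib.parse import quote, urlencode

# Endpoint as given in this document; "{id}" stands for the ":id" placeholder.
CTPOST_ENDPOINT = "http://api.crowdtangle.com/ctpost/{id}"


def ctpost_url(crowdtangle_id: str, token: str) -> str:
    """Build the lookup URL for one Crowdtangle ID.

    The `token` parameter name is an assumption of this sketch.
    """
    path = CTPOST_ENDPOINT.format(id=quote(str(crowdtangle_id), safe=""))
    return path + "?" + urlencode({"token": token})
```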

For full details on the collection methodology, please view http://arxiv.org/abs/2103.16446