Characterizing Pandemic Information on Social Media

Soumyadeep Basu
11 min read · Dec 16, 2020

Social media affects our lives to a great extent. The information we consume daily may be heavily influenced by the news broadcast on social media. One such platform is Twitter, where people turn to authentic news channels to consume news on the go. But this comes with a caveat: all news content should be taken with a pinch of salt, as it is becoming harder to differentiate fact from fiction. Moreover, against the backdrop of the divisive socio-political environment in the U.S., people are susceptible to false claims about important issues like COVID-19.

It is the responsibility of national organizations to convey scientific facts about the pandemic to the public, so that people can adjust their behavior and be aware of the harmful consequences of not adhering to guidelines on contagious diseases. The Centers for Disease Control and Prevention (CDC)'s move to promote COVID-19 safety through another popular social media app, TikTok, is a strong testament to this fact. Thus, it is necessary to understand how science is communicated through social media over time. This will help us develop strategies to ensure that verified scientific facts reach ordinary people effectively. Since masks have become an extremely heated point of contention during the COVID-19 outbreak, the scientific guidelines provided by the CDC should be communicated to everyone effectively, because misinformation about scientific topics can pose a threat to society, especially during a pandemic.

Thus, we have collected tweets around COVID-19 from both domain experts and ordinary users (non-domain experts). This will help us understand the sentiments around these tweets and identify the popular topics of discussion. More details are explained in the following sections.

Problem Statement:

As discussed above, the main aim of this work is to analyze the conversations taking place on Twitter with respect to the COVID-19 pandemic. We have tried to answer the following questions through this work:

  • What are the popular topics of discussion during the pandemic?
  • How is scientific knowledge communicated by experts?
  • What are the popular sentiments regarding the pandemic?

Now we would like to highlight some of the methods and techniques we have used in this work:

Popular Topics:

Popular topics can be described as the pivot of any conversation. Though we have collected tweets around COVID-19, identifying the popular topics helps us understand which topics advance the discussion of science on social media. If a certain topic is identified as popular, scientific organizations can frame their message around that topic to promote it.

Subjectivity:

Subjectivity can be defined as the quality of being influenced by personal feelings or opinions. Since we are working towards understanding scientific communication, analyzing expert users’ tweets is of utmost importance. Subjectivity analysis tells us how subjective a certain tweet is and helps us distinguish the tweets with facts from the tweets with opinions, almost analogous to the process of separating the wheat from the chaff.

Sentiment:

Understanding the sentiment behind the user’s tweets can help in understanding the overall sentiment related to a certain topic. Hence, we have performed sentiment analysis to better understand the sentiments related to the scientific topics and advance scientific communication.

Datasets:

Our main focus is to collect tweets from a group of fifty identified domain experts. Most of these experts are verified Twitter users and are doctors, professors, researchers, or well-known professionals. We plan to use this collected data as "ground truth" when we predict subjectivity/objectivity for COVID-19 data. We therefore used Twitter's API to collect the tweets of these expert users.
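For illustration, here is a minimal sketch of this collection step using the tweepy library (the post does not name the exact client; the credentials and handle below are placeholders):

```python
import tweepy

# Placeholder credentials -- replace with your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_user_tweets(screen_name, limit=3200):
    """Fetch up to ~3,200 of a user's most recent tweets (the API ceiling)."""
    tweets = []
    for status in tweepy.Cursor(api.user_timeline, screen_name=screen_name,
                                tweet_mode="extended", count=200).items(limit):
        tweets.append({"id": status.id,
                       "created_at": status.created_at,
                       "text": status.full_text})
    return tweets

# Hypothetical expert handle, for illustration only.
expert_tweets = fetch_user_tweets("example_expert_handle")
```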

Along with this, we used an existing dataset called TweetsCOV19. This dataset does not contain the tweet text column, which we required for further analysis. Hence, we fetched the tweet text for each row using the tweet IDs. For this purpose, we split the large data file into 600 smaller chunks and used the Twitter API to fetch the tweets. This step was time consuming due to the API's rate limits.
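A hedged sketch of this hydration step, again assuming tweepy 3.x (where bulk lookup is exposed as statuses_lookup; in tweepy 4.x the method is lookup_statuses):

```python
def hydrate_ids(api, tweet_ids, batch_size=100):
    """Look up tweet texts for a list of tweet IDs, 100 IDs per request."""
    id_to_text = {}
    for i in range(0, len(tweet_ids), batch_size):
        batch = tweet_ids[i:i + batch_size]
        # statuses_lookup silently drops deleted or protected tweets.
        for status in api.statuses_lookup(batch, tweet_mode="extended"):
            id_to_text[status.id] = status.full_text
    return id_to_text

# 'api' is the authenticated tweepy client from the earlier sketch.
# texts = hydrate_ids(api, ids_from_one_chunk)
```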

The data collected can be described in two parts as follows:

  • Expert tweets (tweets from the identified expert users): We collected the tweets of the expert users, roughly 3,200 tweets per user (due to API limits). The dataset contains roughly 14,000 expert tweets over the period January 2020 to September 2020.
  • Non-expert tweets (tweets from the TweetsCOV19 dataset): These tweets are from general users, not specifically domain experts. The dataset consists of 8,151,524 tweets (~8 million) in total, posted by 3,664,518 users (~3.7 million), and reflects the societal discourse about COVID-19 on Twitter from October 2019 until April 2020.

Now we will discuss the major steps of our implementation in the following sections:

Topic Modeling:

Understanding the popular topics of conversation is very important in order to understand the scientific communication through social media websites like Twitter. Hence, we have performed topic modeling in order to extract the popular topics in our dataset.

The main idea in topic modeling is to vectorize the given set of documents by term frequency or term frequency-inverse document frequency (TF-IDF), factor the resulting document-term matrix into document-topic and topic-word matrices, and optimize these factors using either probabilistic or matrix-factorization techniques. To identify the conversation topics, we performed topic modeling on the datasets using two popular methods:

  1. Latent Dirichlet Allocation (LDA)
  2. Non-negative matrix factorization (NMF)

Further, we implemented both of these methods on the expert users' tweets and labelled each tweet with its respective cluster. The number of topics was chosen manually based on the topics being non-overlapping.
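The post does not state which library was used; a minimal sketch of both models with scikit-learn, using a tiny illustrative corpus in place of the real cleaned tweets, could look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# Tiny illustrative corpus; the real input is the cleaned expert tweets.
tweets = [
    "masks reduce transmission in crowded indoor spaces",
    "wear a mask and keep social distance in schools",
    "join our live discussion on vaccine trial results",
    "new vaccine trial enrolled thirty thousand participants",
    "read this great thread on testing and contact tracing",
    "hospitals report rising cases and limited icu capacity",
]
n_topics = 2  # the project used 6 topics; 2 keeps this toy corpus well-posed

# LDA operates on raw term counts.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)

# NMF operates on TF-IDF weights.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(tweets)
nmf = NMF(n_components=n_topics, random_state=0).fit(tfidf)

def top_words(model, feature_names, n=5):
    """Print the n highest-weighted words for each topic."""
    for idx, topic in enumerate(model.components_):
        words = [feature_names[i] for i in topic.argsort()[-n:][::-1]]
        print(f"Topic {idx + 1}: {', '.join(words)}")

top_words(lda, count_vec.get_feature_names_out())
top_words(nmf, tfidf_vec.get_feature_names_out())
```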

Fig. 1: Comparison of the results generated from both the LDA and NMF methods

After careful analysis, we observed that the keywords of the topics obtained from the LDA analysis were more closely related to our current research, so we decided to go forward with the LDA model for our further analysis. Since we intend to understand the nature of scientific communication over time, we plotted the frequency of tweets for each topic by month.
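A brief sketch of how such a monthly breakdown can be produced with pandas, assuming each tweet already carries a timestamp and an assigned topic label (the tiny dataframe below is purely illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative layout: one row per expert tweet with timestamp and assigned topic.
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2020-01-15", "2020-01-20", "2020-02-03", "2020-02-17"]),
    "topic": [1, 4, 1, 6],
})
df["month"] = df["created_at"].dt.to_period("M")

# Count tweets per (month, topic) and plot one line per topic.
monthly = df.groupby(["month", "topic"]).size().unstack(fill_value=0)
monthly.plot(kind="line", figsize=(10, 5), title="Tweet frequency per topic by month")
plt.show()
```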

Fig. 2: Time series analysis of the expert topics using LDA

Now, in order to understand the non-expert tweets according to the identified topics, we have classified them using the six previously identified topics. This step was also suggested by our peers during the intermediate report review.
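A hedged sketch of how the non-expert tweets can be assigned to the six fitted topics, reusing the vectorizer and LDA model from the earlier topic-modeling sketch:

```python
# Reuse count_vec and lda fitted on the expert tweets (see the earlier sketch).
non_expert_tweets = [
    "schools should stay closed until cases drop",
    "great thread on how the vaccine trials were run",
]
doc_topic = lda.transform(count_vec.transform(non_expert_tweets))
predicted_topics = doc_topic.argmax(axis=1) + 1  # topic numbers start at 1
print(predicted_topics)
```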

Fig. 3: Topic wise frequency distribution of the tweets by the non-expert users

Subjectivity Analysis:

As discussed earlier, subjectivity can be defined as the quality of being influenced by personal feelings or opinions. For characterizing the degree of subjectivity of the tweets in both the datasets, we used a Python package named TextBlob.

The TextBlob package is a convenient package for performing many Natural Language Processing (NLP) tasks such as sentiment analysis, subjectivity analysis, noun phrase extraction, part-of-speech (PoS) tagging, and spelling correction. It not only handles negation but also takes modifier words like "very" and "almost" into account when predicting the polarity and subjectivity of a text. The subjectivity score ranges from 0.0 to 1.0, where 0.0 signifies that the text is objective and 1.0 signifies that it is highly subjective.
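For illustration, scoring a single (made-up) tweet with TextBlob looks roughly like this:

```python
from textblob import TextBlob

blob = TextBlob("I think the new guidelines are absolutely terrible.")  # made-up text
print(blob.sentiment.subjectivity)  # 0.0 = objective, 1.0 = highly subjective
print(blob.sentiment.polarity)      # -1.0 = very negative, 1.0 = very positive
```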

We have performed subjectivity analysis on both the expert and non-expert users' tweets.

Subjectivity Analysis for the expert users: This is particularly important as it helps us analyze our ground truth effectively. We scored each tweet with TextBlob and averaged the scores to obtain an average subjectivity for every user. We observed that the average subjectivity rarely goes much beyond 40%; in other words, the expert users are objective at least 60% of the time.
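A minimal sketch of this per-user aggregation with pandas and TextBlob, using a tiny illustrative dataframe in place of the real expert tweets:

```python
import pandas as pd
from textblob import TextBlob

# Tiny illustrative dataframe: one row per expert tweet.
expert_df = pd.DataFrame({
    "user": ["expert_a", "expert_a", "expert_b"],
    "text": [
        "Cases are rising in three states this week.",
        "I strongly believe this policy is a terrible idea.",
        "The trial enrolled thirty thousand participants.",
    ],
})

# Score each tweet, then average the scores per user.
expert_df["subjectivity"] = expert_df["text"].apply(
    lambda t: TextBlob(t).sentiment.subjectivity)
avg_subjectivity = expert_df.groupby("user")["subjectivity"].mean()
print(avg_subjectivity)
```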

Fig. 4: Overall subjectivity of each of the expert users

We also counted the number of tweets at each subjectivity score. Only 5.68% of the expert users' tweets were found to be highly subjective (i.e., a subjectivity score equal to 1). This implies that most of the expert users' tweets were objective and thus contain more fact than fiction.
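The same counts and the histogram in Fig. 5 can be reproduced along these lines, reusing the expert_df dataframe (with its per-tweet subjectivity column) from the sketch above:

```python
import matplotlib.pyplot as plt

# Share of tweets that are fully subjective (subjectivity score of exactly 1).
fully_subjective = (expert_df["subjectivity"] == 1.0).mean() * 100
print(f"{fully_subjective:.2f}% of expert tweets have a subjectivity score of 1")

expert_df["subjectivity"].plot(kind="hist", bins=20,
                               title="Expert tweets by subjectivity score")
plt.xlabel("Subjectivity score")
plt.show()
```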

Fig. 5: Histogram comparing the count of tweets across different subjectivity scores

We also analyzed the distribution of these highly subjective tweets across topics:

  • 21% belonged to Topic 5 (work, read, great),
  • 20% belonged to Topic 1 (year, thread, health), while
  • Topic 6 (join, news, tweet) had the fewest subjective tweets, i.e., 11% of the highly subjective tweets.

But no clear pattern emerged from the combination of the results from the topic modeling and subjectivity analysis.

Subjectivity Analysis for the non-expert users: We also calculated the subjectivity scores of the non-expert users' tweets to understand how they differ from those of the experts. We observed that only 4% of the non-expert users' tweets were highly subjective (subjectivity score equal to 1). Thus, the tweets by non-expert users in our dataset are also highly objective.

Fig. 6: Histogram comparing the count of tweets across different subjectivity scores

Out of those 4% of tweets:

  • Topic 4 (mask, time, school) had the highest share of subjective tweets at 28%, followed by
  • Topic 1 (year, thread, health) at 22%, while
  • Topic 2 (discuss, trump, live) had the fewest subjective tweets at 8%.

Again, no clear pattern emerged from the combination of the results from the topic modeling and subjectivity analysis of the non-expert users’ tweets.

Sentiment Analysis:

Sentiment analysis is used to understand and classify emotions in data. It is extremely useful in social media monitoring as it allows us to gain an overview of the wider public opinion on certain topics. In order to understand the sentiments related to the predicted topics and advance scientific communication, we performed sentiment analysis on our datasets using two well-known methods:

  1. BERT: Bidirectional Encoder Representations from Transformers is a language-representation model introduced by researchers at Google. BERT caused excitement in the machine learning community by achieving state-of-the-art results on a wide variety of NLP tasks.
  2. Flair: A simple natural language processing (NLP) library developed and open-sourced by Zalando Research. Flair builds directly on PyTorch, one of the leading deep learning frameworks, and ships with popular, state-of-the-art word embeddings.

We implemented both methods on the expert users' tweets in order to compare them and find out which model best suits our needs. We found that BERT predicted neutral far more often than positive or negative, while Flair produced very few neutral predictions. We randomly selected a few tweets and performed sanity checks in order to better understand the model predictions and choose the best model for our further analysis.
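A hedged sketch of such a sanity check on a single tweet is shown below. The exact BERT checkpoint used in the project is not specified, so the sketch falls back to the default Hugging Face sentiment pipeline purely for illustration, alongside Flair's pre-trained English sentiment model:

```python
from transformers import pipeline
from flair.data import Sentence
from flair.models import TextClassifier

tweet = "another child died"  # one of the sanity-check examples discussed below

# BERT-style classifier: default Hugging Face sentiment pipeline, used here only
# as a stand-in for whichever BERT checkpoint the project actually used.
bert_clf = pipeline("sentiment-analysis")
print(bert_clf(tweet))            # e.g. [{'label': 'NEGATIVE', 'score': 0.99}]

# Flair's pre-trained English sentiment model.
flair_clf = TextClassifier.load("en-sentiment")
sentence = Sentence(tweet)
flair_clf.predict(sentence)
print(sentence.labels)            # e.g. [NEGATIVE (0.99)]
```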

Fig. 7: Comparison of the sentiment classification by the Flair model over time
Fig. 8: Comparison of the sentiment classification by the BERT model over time

We also observed that Flair classified our tweets more accurately. For example, the tweet "another child died" was classified as Neutral by BERT, while Flair classified it as Negative. Similarly, the tweet "coronavirus could put hospital weeks even kill…" was classified as Neutral by BERT but, more accurately, as Negative by Flair. Hence, we selected Flair as our classifier of choice.

For the expert users, we found that there were slightly more positive tweets than negative, though the shares are very close at 51% and 48% respectively. In the case of the non-expert users, however, negative tweets dominated at 68% of the total, while positive tweets made up only 32%.

Fig. 9: Sentiment distribution of expert tweets
Fig. 10: Sentiment distribution of non-expert tweets

Limitation and Ethical Issues:

Bias in expert selection: We compiled a list of 50 experts in this field. However, we recognize that there may be an inherent bias in this selection, since it is difficult to find a set of users who are completely neutral to the surrounding politics and the overall negativity around COVID-19.

Issues with topic modeling: We extracted six topics related to COVID-19, initially from the analysis of the expert users' tweets. However, not all of the collected tweets fit well under these topics, which may hinder a complete understanding of the scenario.

Problems with Elasticsearch: We initially planned to upload all of the tweets collected from the TweetsCOV19 dataset to Elasticsearch so that we could use the linked Kibana dashboard to analyze them. However, with the low computing power available under the AWS Free Tier plan, the analysis was quite slow and we did not pursue it further.

Challenges in collecting response tweets: Along with the topic modeling, we aimed to collect the tweets posted in response to the experts' tweets. This would give us more context about the responses experts receive and their influence. However, we were only able to collect response tweets from the last 7 days, which did not serve our initial objective of checking whether the tweets are backed up by evidence.
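For reference, a hedged sketch of how such reply collection can be attempted with the standard search endpoint (which only covers roughly the last 7 days), again assuming tweepy 3.x and a hypothetical expert handle:

```python
import tweepy

def fetch_replies(api, screen_name, limit=500):
    """Collect recent tweets addressed to a user. The standard search API
    only indexes roughly the last 7 days of tweets."""
    replies = []
    for status in tweepy.Cursor(api.search, q=f"to:{screen_name}",
                                tweet_mode="extended").items(limit):
        if status.in_reply_to_screen_name == screen_name:
            replies.append(status.full_text)
    return replies

# Hypothetical handle; 'api' is the authenticated client from the earlier sketch.
replies = fetch_replies(api, "example_expert_handle")
```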

Conclusion:

It has been widely acknowledged that the ongoing pandemic is accompanied by various kinds of misinformation about the nature, cure, long-term impact, and social implications of COVID-19.
While social media makes it easy for people to access information, separating facts from misinformation is an arduous task for the average internet user. Our project is a step toward addressing this problem.

With the rise of social media platforms, accessing facts and information has become significantly easier for the general population. However, the same convenience also accelerates the spread of fiction.

In summary, we have implemented the following methods on both expert and non-expert users’ tweets:

  1. Topic Modeling
  2. Subjectivity Analysis
  3. Sentiment Analysis

After careful analysis of the results, we observed that most of the tweets in our dataset, from both the expert and non-expert users, are quite objective. Regarding sentiment, the expert users posted an almost equal mix of positive and negative tweets, while the non-expert users posted mostly negative tweets. We also identified the popular topics from these tweets, but found no correlation between the subjective tweets and the topics' keywords.

In conclusion, although the results clearly indicate that the tweets related to COVID-19 are fairly objective, these results may change if the size of the datasets changes significantly.

Thank you for reading!

Disclaimer and Credits:

All references have been hyper-linked inline in order to adhere to word limitations. Cover Credits: The Economic Times

The data collection process took place in the months of September and October 2020. This is a developing story; opinions and results may change.

Our work distribution can be found in this appendix.
