Twitter has become a toxic place. There, I said it. It is no longer the fun and happy place it used to be a few years back, certainly not in India. It is now full of trolls, rude and nasty people, and politicians and companies busy selling their products or spreading propaganda.
But I still love Twitter, partly because it is not Facebook (that’s a good enough reason). However, it pains me to see the negativity every time I visit. As a user, it appears to me that the Twitter team isn’t moving fast or hard enough to eliminate the problem of trolls and propaganda. So I decided to approach this problem on my own, doing what I do best – data analysis. In this post, I use Twitter data and R to analyze a very specific part of the problem – unnaturally trending hashtags and topics on Twitter.
So the question we’re trying to answer using data is: How can we identify and differentiate real / natural Twitter trends from the forced / propaganda / fake trends showing up in the ‘trending’ tab?
A real / natural trend is a response to any event or news, and this response generally starts gradually, peaks for a while and then fades over time. It’s like discussing a topic in real life – similar to how information is spread from a small group of people to a large one. But if you’re trying to force a trend or a hashtag, it is difficult to achieve a natural distribution pattern – obviously because people aren’t talking about it ‘naturally’. So a handful of people have to coordinate to consistently tweet and RT content over a period of time to get it trending. This difference between a natural and an unnatural trend can easily be identified using simple techniques of data analysis. Let us examine how.
I’ll be using the data provided by the Twitter API. The analysis will be performed in R and the code is available at the end of this article. I’ve divided the whole analysis into steps for easier understanding:
Part 0 – Initialize
Before anything else, you need to set up a connection to Twitter’s API in R. You can follow the tutorial here; it is very straightforward, so I won’t go into the details. We use the setup_twitter_oauth() function for this step.
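A minimal sketch of this step, assuming the twitteR package is installed and that the four credentials from your Twitter developer app are stored in environment variables. The variable names and the connect_twitter() wrapper are placeholders of mine, not part of twitteR.

```r
# Hypothetical wrapper around twitteR::setup_twitter_oauth().
# The environment variable names below are placeholders; use your own.
connect_twitter <- function() {
  keys <- Sys.getenv(c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                       "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET"))
  if (any(keys == "")) {
    stop("Set all four Twitter credentials as environment variables first.")
  }
  twitteR::setup_twitter_oauth(
    consumer_key    = keys[["TWITTER_CONSUMER_KEY"]],
    consumer_secret = keys[["TWITTER_CONSUMER_SECRET"]],
    access_token    = keys[["TWITTER_ACCESS_TOKEN"]],
    access_secret   = keys[["TWITTER_ACCESS_SECRET"]]
  )
}

# connect_twitter()  # run once per session, before any other API call
```

Reading the keys from the environment keeps them out of the script, which matters if you plan to share the code.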
Part 1 – Search Available Trends
The next step is to get a list of the currently trending topics for a region and country of our choice. We use the availableTrendLocations() function to get the full list of locations and their woeID. Enter this woeID in the getTrends() function to get the latest list of trends for your location. Note the exact name of the trend / hashtag you want to analyze from the first column.
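Roughly, the lookup could be sketched like this. It assumes an authenticated twitteR session; trends_for() is a hypothetical helper name, and the place used in the commented call is only an example.

```r
# Hypothetical helper: find the woeid for a place, then fetch its trends.
# Both twitteR calls need an authenticated session.
trends_for <- function(place, country) {
  # availableTrendLocations() returns a data frame: name, country, woeid
  locations <- twitteR::availableTrendLocations()
  hit <- locations[locations$name == place & locations$country == country, ]
  if (nrow(hit) == 0) {
    stop("No trend location found for: ", place)
  }
  # getTrends() returns a data frame whose first column is the trend name
  twitteR::getTrends(hit$woeid[1])
}

# trends <- trends_for("Bangalore", "India")  # example place, an assumption
# head(trends$name)
```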
For our analysis, we’ve chosen two hashtags – #MondayMotivation and #SidduChallengesYeddy. I have a hunch that people tweet about #MondayMotivation on their own every Monday, whereas #SidduChallengesYeddy (one political candidate challenging another over an upcoming state election) seems like a painfully obvious forced hashtag. Nevertheless, we’ll uncover insights about both through our code.
Part 2 – Define Our Analysis Function
The next step is to actually get the tweets for our trends. We use the searchTwitter() function and pass it the trend name, the number of tweets (including RTs) to fetch, a date range if you wish, and any other optional arguments. We’ll also need the original tweets (i.e. not RTs) for our analysis, so we use the strip_retweets() function to subset the full list of tweets. Lastly, for easier data manipulation, we convert both of these lists to data frames using twListToDF().
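As a rough sketch, the fetching step might look like the following. fetch_trend() is a hypothetical name, and the default of 1000 tweets is an arbitrary choice of mine; an authenticated twitteR session is assumed.

```r
# Hypothetical helper: fetch tweets for a trend and return two data frames,
# one with everything (tweets + RTs) and one with original tweets only.
fetch_trend <- function(term, n = 1000, since = NULL, until = NULL) {
  # since / until are optional "YYYY-MM-DD" strings
  all_tweets <- twitteR::searchTwitter(term, n = n, since = since, until = until)
  # Drop both native and manual ("RT @user ...") retweets
  originals <- twitteR::strip_retweets(all_tweets,
                                       strip_manual = TRUE, strip_mt = TRUE)
  list(
    all       = twitteR::twListToDF(all_tweets),
    originals = twitteR::twListToDF(originals)
  )
}

# tweets <- fetch_trend("#MondayMotivation")
# nrow(tweets$all); nrow(tweets$originals)
```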
Next, we plot a histogram of all tweets (including RTs) to try to understand the tweet frequency pattern of a particular trend. This visual insight is incredibly powerful and is, in a lot of cases, good enough to identify the forced and unnatural trends. A forced trend usually starts abruptly (probably due to coordinated mass tweeting) and fades off over time as opposed to a natural trend which starts gradually, peaks for a while, then fades off.
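One way to build such a plot, sketched in base R: bucket the tweet timestamps by hour and draw the counts. The helper names are mine, and the timestamps below are made up purely to show the mechanics. With real data you would pass the created column of the data frame returned by twListToDF().

```r
# Count tweets per hour from a vector of POSIXct timestamps.
# Hours with no tweets are kept as zero-count buckets.
tweets_per_hour <- function(created) {
  table(cut(created, breaks = "hour"))
}

# Draw the hourly counts as a bar chart and return them invisibly.
plot_tweet_frequency <- function(created) {
  counts <- tweets_per_hour(created)
  barplot(counts, las = 2, ylab = "Tweets per hour")
  invisible(counts)
}

# Made-up timestamps, only to illustrate the bucketing.
created <- as.POSIXct(c("2018-05-07 09:05:00", "2018-05-07 09:40:00",
                        "2018-05-07 10:15:00", "2018-05-07 12:30:00"),
                      tz = "UTC")
counts <- tweets_per_hour(created)
print(counts)

# plot_tweet_frequency(tweets_df$created)  # tweets_df: output of twListToDF()
```

An abrupt jump from zero to a tall first bar is the shape to look out for; a natural trend should ramp up over several buckets.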
In the next part, we take the data set of original tweets (tweets without RTs) and clean up the text, stripping it down to its essence. This means removing any non-alphanumeric characters, extra spaces, links and new lines from the tweets, and converting everything to lower case. Note: this works only for tweets in English; it doesn’t work for languages with special characters (content in other languages will be stripped away).
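A minimal sketch of such a clean-up in base R. The clean_tweet() name and the exact regular expressions are my own choices, and the sample tweet (including its link) is made up.

```r
# Reduce a tweet to lower-case letters, digits and single spaces.
clean_tweet <- function(text) {
  text <- tolower(text)
  text <- gsub("https?://[^[:space:]]+", " ", text)  # drop links
  text <- gsub("[^a-z0-9 ]", " ", text)              # drop everything else, incl. new lines
  text <- gsub("[[:space:]]+", " ", text)            # collapse runs of spaces
  trimws(text)
}

cleaned <- clean_tweet("Happy #MondayMotivation!\nRead this: https://t.co/abc123  NOW")
print(cleaned)
# "happy mondaymotivation read this now"
```

Links are removed before punctuation so that a URL does not leave fragments such as "https" or "t co" behind in the cleaned text.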
From my observation, forced and unnatural trends have a lot of duplicate or near-identical tweets, because the conversation isn’t happening naturally and a bunch of people may simply be copy-pasting and RT’ing tweets from a template given to them. We can identify this duplication by analyzing the data set of cleaned-up tweets created above.
So we create a frequency table (twfreq) of duplicates from the cleaned-up data set and remove any blank rows while we’re at it. This table lets us identify the duplicate posts and their frequency. Sort them by frequency and you’ll be able to see which posts have been posted multiple times from different accounts. Based on this, we can calculate a uniqueness score which tells us what fraction of the tweets are original. The formula we’re using to calculate uniqueness is (Number of Unique Tweets) / (Total Number of Tweets). The closer this score is to 1, the more natural a trend is likely to be.
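The frequency table and the score can be sketched in a few lines of base R. I am assuming here that "unique tweets" means distinct cleaned texts; the function names and the toy input are mine, not taken from real data.

```r
# Frequency table of cleaned tweets, most repeated first. Blank rows are dropped.
duplicate_table <- function(cleaned) {
  cleaned <- cleaned[!is.na(cleaned) & cleaned != ""]
  twfreq <- as.data.frame(table(text = cleaned), stringsAsFactors = FALSE)
  twfreq[order(-twfreq$Freq), ]
}

# Uniqueness = (number of distinct tweets) / (total number of tweets).
uniqueness_score <- function(cleaned) {
  cleaned <- cleaned[!is.na(cleaned) & cleaned != ""]
  if (length(cleaned) == 0) {
    return(NA_real_)
  }
  length(unique(cleaned)) / length(cleaned)
}

# Toy input: one text posted three times, one posted once, one blank row.
toy <- c("vote for x", "vote for x", "vote for x", "good morning", "")
twfreq <- duplicate_table(toy)
print(twfreq)
print(uniqueness_score(toy))
# 0.5
```

Here 2 distinct texts out of 4 non-blank tweets give a score of 0.5, which would point towards a heavily templated trend.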
Part 3 – Call Our Function And Get Results
In the last step we defined a function to identify and analyze fake and forced trends. This step is simply a call to that function with the argument as the trend or the search term you wish to analyze.
Results and Conclusion
Used together, the histogram, the table of duplicates and the uniqueness score helped us easily identify forced and fake Twitter trends. This analysis can be replicated for any other trend or search term you come across to quickly determine how real the trending topic is.
However, the unfortunate part is that it is equally easy for anyone with a handful of money (or a following) to get a particular topic trending and possibly spread propaganda. And Twitter doesn’t seem to be doing much to stop this manipulation. The same goes for trolling, harassment and abuse. The solution to this is data science. I’m no expert in data science, and definitely not as good as the people at Twitter. But if I can use data to identify propaganda so easily on Twitter, their engineers can do far more impactful things to stop this problem.
There are several limitations to this analysis method:
- We are limited by the Twitter API – there’s a limit on how many tweets can be extracted in a given time period. So for bigger trends, or trends spread over a long time period, you may need to extract your data in parts. This can consume a significant amount of time.
- We can process tweets only in English at the moment.
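For the first limitation, one possible way to extract data in parts is to split the period into day-sized windows and query each window separately. This is only a sketch: date_windows() and fetch_in_parts() are hypothetical helpers, the dates are examples, and it assumes that searchTwitter() treats until as an exclusive upper bound.

```r
# Split [from, to] into consecutive windows of `days` days each.
# Each window ends where the next begins; the last ends the day after `to`.
date_windows <- function(from, to, days = 1) {
  starts <- seq(as.Date(from), as.Date(to), by = days)
  ends   <- c(starts[-1], as.Date(to) + 1)
  data.frame(since = format(starts), until = format(ends),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: one searchTwitter() call per window.
# Needs an authenticated twitteR session; n per window is an arbitrary choice.
fetch_in_parts <- function(term, windows, n = 1000) {
  lapply(seq_len(nrow(windows)), function(i) {
    twitteR::searchTwitter(term, n = n,
                           since = windows$since[i],
                           until = windows$until[i],
                           retryOnRateLimit = 120)  # wait and retry when rate limited
  })
}

windows <- date_windows("2018-05-01", "2018-05-03")  # example dates
print(windows)

# parts <- fetch_in_parts("#MondayMotivation", windows)
```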