Reading Habit Analysis Using Pocket API And Python

I like to read. Like a LOT. But I’m not limited to just books. I read everything that comes my way – books, articles, Reddit threads, tweets and what not. Consuming information by audio (podcasts, audio books) or video is just not my thing. Text is how I like it – and to keep track of all the articles and posts I have to read (but can’t at the moment), I use a very popular app called Pocket. Whenever I come across any interesting article that needs to be saved for reading later, I just save it to my Pocket account. It’s a very handy app – you can save articles from your phone, within apps or from your browser. You can then go back to it later and read the articles in a distraction-free way, offline.

Being the data-curious person that I am, I thought, why not use data analysis to gain deeper insights on my internet reading habits using my Pocket data? So this is what this post is about – I explore trends on how frequently I add articles to my Pocket, how frequently I read them and what those articles are about. I use the Pocket API and Python language to do this analysis. Let’s go!

The Question

This section is usually for clearly defining the analysis question at hand, as seen in my previous posts. However, this time there’s not just one big question. There are a lot of small questions we’re asking and answering based on my Pocket reading habits. So instead of listing all the questions here, I’ll list them directly in the ‘Methodology’ section.

Just a glimpse of the questions we’ll answer using the Pocket API data:

  1. How many articles have I added to my Pocket till date?
  2. How many articles have I read / still need to read?
  3. How large is the gap between added and read articles?
  4. What topics do I add articles about?

… and many more! Let’s jump to the analysis.

Methodology

Let’s start coding now! Fire up your Python tool of choice – I’m using the Jupyter ipython notebook for this project.

Part 1 – Set up Pocket API connection

This is the only complicated part of the whole analysis. I’ll blame myself for the complications though, as I’m new to Python and this was my first time connecting to an API. It was quite a struggle understanding all the API-related terminology – but once you get the hang of it, there’s quite a rush (imagine the data analysis possibilities!). Anyway, to set up the connection, I would recommend that you read the following two articles, which explain the multi-step process very beautifully:

The articles above will help you understand how to make a connection to the Pocket API using Python code. It may seem intimidating if you’re a newbie like me but trust me, the struggle is worth it. Here are the steps along with my code for the whole process, explained with comments for each step.

Import all the necessary packages:

import requests
import pandas as pd
from pandas.io.json import json_normalize
import json
import datetime
import matplotlib.pyplot as plt

There are four major steps to connect to the Pocket API using Python:

  • Step 1: Create a new ‘Pocket application’ from the Pocket website. Once created, you’ll get your unique consumer_key.
  • Step 2: Now go to your Python code and paste the consumer_key into the requests.post() function. The response of requests.post() is stored in the pocket_api variable. Check that the response is correct by executing pocket_api.status_code – if it returns 200, the connection was made successfully. If not, try to find the error reason using pocket_api.headers['X-Error']. Finally, execute pocket_api.text to get your request_token.
  • Step 3: Now use the request_token obtained above to authenticate in your browser. Use the link given in STEP 3 in the code below and replace the text after "?request_token=" with your own request_token.
# STEP 1: Get a consumer_key by creating a new Pocket application
# Link: https://getpocket.com/developer/apps/new

# STEP 2: Get a request token
# Connect to the Pocket API
# pocket_api variable stores the http response
pocket_api = requests.post('https://getpocket.com/v3/oauth/request',
                           data = {'consumer_key':'12345-23ae05df52291ea13b135dff',
                                   'redirect_uri':'https://google.com'})

# Check the response: if 200, then it means all OK
pocket_api.status_code       

# Check error reason, if any
# print(pocket_api.headers['X-Error'])

# Here is your request_token
# This is a part of the http response stored in pocket_api.text
pocket_api.text

# STEP 3: Authenticate 
# Modify and paste the link below in the browser and authenticate
# Replace the text after "?request_token=" with the request_token generated above
# https://getpocket.com/auth/authorize?request_token=PASTE-YOUR-REQUEST-TOKEN-HERE&redirect_uri=https://getpocket.com/connected_applications

Once you have authenticated via the link above in your browser, return to the Python code:

  • Step 4: Use your consumer_key and the request_token generated earlier to call requests.post() again and check the status_code. If it is 200, execute pocket_auth.text and you’ll finally get your access_token.
# STEP 4: Generate an access_token
# After authenticating in the browser, return here
# Use your consumer_key and request_token below
pocket_auth = requests.post('https://getpocket.com/v3/oauth/authorize',
                            data = {'consumer_key':'12345-23ae05df52291ea13b135dff',
                                    'code':'a1dc2a39-abcd-af28-e235-25ddd4'})

# Check the response: if 200, then it means all OK
# pocket_auth.status_code

# Check error reason, if any
# print(pocket_auth.headers['X-Error'])

# Finally, here is your access_token
# We're done authenticating
pocket_auth.text

Finally, we can import our Pocket data from the API. Use your consumer_key and access_token to execute requests.post(). Based on our data needs, we can specify options for what kind of data, and how much of it, we wish to receive. For this analysis, I’ve set the options state: ‘all’ and detailType: ‘simple’, meaning I want both read and unread items from my Pocket list at a simple level of detail. I strongly recommend you check the official Pocket developer documentation on the retrieve process to better understand the available options. Lastly, execute pocket_add.text to get a JSON dump of your Pocket data.

# Get data from the API
# Reference: https://getpocket.com/developer/docs/v3/retrieve
pocket_add = requests.post('https://getpocket.com/v3/get',
                           data= {'consumer_key':'12345-23ae05df52291ea13b135dff',
                                  'access_token':'b07ff4be-abcd-4685-2d70-d47816',
                                  'state':'all',
                                  'detailType':'simple'})

# Check the response: if 200, then it means all OK
# pocket_add.status_code

# Here is your fetched JSON data
pocket_add.text

If everything goes well, you’ll get a JSON dump like the one shown below. It’s time to finally flex our data analysis muscles!

Pocket API Tutorial in Python - JSON Output

Our next step is to convert this JSON file into something more useful for analysis.

Part 2 – Prepare the analysis dataframe

Now we’ll convert our JSON data to a pandas dataframe. Load the data into a variable json_data using the json.loads() function. Then we’ll simply loop through the JSON to extract and append rows to the pandas dataframe.

# Prepare the dataframe: convert JSON to table format
json_data = json.loads(pocket_add.text)

df_temp = pd.DataFrame()
df = pd.DataFrame()
for key in json_data['list'].keys():
    df_temp = pd.DataFrame(json_data['list'][key], index=[0])
    df = pd.concat([df, df_temp])

df = df[['item_id','status','favorite','given_title','given_url','resolved_url','time_added','time_read','time_to_read','word_count']]
df.head(5)

Glance at the top 5 rows of our newly created dataset and inspect the different columns available. As you can see, we can do a very thorough analysis using this data.

Pocket API Tutorial in Python - Data Table

Hold on – before we jump into the analysis, we need to fix some datatypes and clean up some columns. We’ll also save a local copy of the data in case we want to quickly summarize it in Excel.

 
# Clean up the dataset
df.dtypes
df[['status','favorite','word_count']] = df[['status','favorite','word_count']].astype(int)
df['time_added'] = pd.to_datetime(df['time_added'],unit='s')
df['time_read'] = pd.to_datetime(df['time_read'],unit='s')
df['date_added'] = df['time_added'].dt.date
df['date_read'] = df['time_read'].dt.date

# Save the dataframe as CSV locally
df.to_csv('pocket_list.csv')

# Check the data types
df.dtypes

It’s now time to do what we’re meant to do – analyze data!

Part 3 – Answer questions using data

We start with basic questions, like how many items there are in our Pocket account, and go on to answer more complex ones, like how fast our unread article list is piling up. Answers to most of these questions are simple Python commands.

Question 1 – How many items are there in my Pocket?

# Answer questions using data

# How many items are there in my Pocket?
print(df['item_id'].count())

# What % of articles are read?
print((df['status'].sum()*100)/df['item_id'].count())

The first question that comes to my mind is about my overall Pocket usage. Simply counting the number of item_id rows gives us the total number of articles added to our Pocket account till date (not including deleted ones). If we wish to know how many of the added articles we’ve read (or archived), we simply sum the status column. Using these two metrics, we can also calculate the percentage of articles read till date.

For me, I’ve added 1236 articles to my Pocket account till date and read nearly 73% of those articles. That’s a lot of articles! If I dig deeper into my data, I can see that the first article added to my Pocket was on 5th September, 2014 (just celebrated my 4th year Pocket anniversary by reading more articles!). Fun fact: it was an article on why walking helps us think.
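As a side note, here is one quick way to dig up that first article from the dataframe we built above (a minimal sketch using the columns we kept earlier):

# Find the earliest article added to Pocket
first = df.sort_values('time_added').iloc[0]
print(first['date_added'], first['given_title'], first['given_url'])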

Question 2 – How long is the average article in my Pocket? (minutes)

# How long is the average article in my Pocket? (minutes)
df['time_to_read'].describe()

Next, I’m curious about the kind of articles I like to read – do I prefer long or short articles? There are two ways we can estimate this – by the approximate time to read or by the number of words. If we simply use the describe() function on the time_to_read variable, we get a summary of the estimated reading time of our articles.

For me, the average article is 11 minutes long, ranging from 3 mins to 206 mins (for the curious, this one. Oh look, it’s still unread). 75% of the articles added take less than 12 minutes to read.

Question 3 – How long is the average article in my Pocket? (word count)

# How long is the average article in my Pocket? (word count)
df['word_count'].describe()

Expanding on the last question, if we wish to estimate the reading time by the number of words in our articles, we can simply describe() the word_count variable. For me, the average article is 2152 words long, with the maximum being 45000+ words and the median being 1405 words.

Question 4 – What is the % of favorites?

# What is the % of favorites?
print((df['favorite'].sum()*100)/df['item_id'].count())

How often do I find an article so good that I favorite it to revisit later? For this, simply sum the favorite column and divide it by the total article count. For me, nearly 4.5% of the articles are favorites.

Question 5 – How many words have I read till date? (and equivalent books)

# How many words have I read till date?
print(df.loc[df['status'] == 1, 'word_count'].sum())

# How many books is this equivalent to?
print(df.loc[df['status'] == 1, 'word_count'].sum()/64000)

Of the 73% or nearly 913 articles read till date, how many words have I read in total? Easy, just sum the word_count where status is 1. I’ve read nearly 1,697,842 words using Pocket. Whew. I read that the average book is about 64,000 words in length. So my 1,697,842 words would be equivalent to nearly 26 books, or about 6.5 books per year!

Question 6 – How were the articles added and read over time?

# How were the articles added over time?
plot_added = df.groupby('date_added')['item_id'].count()
plot_added.describe()
# plot_added.head(10)

# How were the articles read over time?
plot_read = df.groupby('date_read')['status'].sum()
plot_read.describe()
#plot_read.head(10)

I’m interested in a visual representation of my Pocket habits. Specifically, how I add and read articles. To see this, we need to group the articles added and read by date_added and date_read respectively.

Pocket API Tutorial in Python - Articles added

I’ve plotted these graphs using Excel since I was facing problems with Python. However, you can extend the code to generate the graphs in Python itself – see the sketch further below.

Pocket API Tutorial in Python - Articles read

From the graphs it is clear that my Pocket usage has been very on-and-off. However, it has really picked up since 2017 (perhaps when I streamlined my reading habits and tools). 2018 has been a good year in terms of both articles added and read. My article addition activity usually coincides with my reading spree, as observed.
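For reference, here is a minimal matplotlib sketch of how the two charts above could be generated directly in Python. The added_per_day and read_per_day names are my own, and unread rows are filtered out of the read series because their time_read of 0 maps to 1970-01-01:

# Sketch: plot articles added and read per day directly in Python
added_per_day = df.groupby('date_added')['item_id'].count()
read_per_day = df[df['status'] == 1].groupby('date_read')['item_id'].count()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
ax1.plot(added_per_day.index, added_per_day.values)
ax1.set_title('Articles added per day')
ax2.plot(read_per_day.index, read_per_day.values)
ax2.set_title('Articles read per day')
plt.tight_layout()
plt.show()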

Question 7 – How large is the gap between read and unread articles?

Pocket API Tutorial in Python - Unread Gap

This is the question which actually gave me the idea of writing this article. It’s also a very important one regarding my procrastination habits and Pocket’s user base in general. Do I add articles to my Pocket faster than I’m able to read them? Does Pocket make it too easy for people to add articles to the app? (Spoiler: yes it does) What is my unread article backlog right now and how has it increased over time?

Above, I’ve plotted a graph which shows the cumulative number of articles added and read over time since 2014 (red and green lines). I’ve also plotted the cumulative backlog – the gap between the added and read lines (black line, secondary axis). Clearly, the backlog has only grown over time, despite my best efforts to bring it down. That’s because adding an article to the app requires nowhere near as much mental bandwidth, effort and time as actually reading it.

To plot this graph, I’ve manipulated the data in Excel. If you find a cleaner way to do these steps in Python, let me know – one possible approach is sketched below.
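Here is a minimal pandas sketch of one way to compute the cumulative lines and the backlog before plotting. The cum_added, cum_read and backlog names are my own, not taken from the original workbook:

# Sketch: cumulative articles added / read per day and the running backlog
cum_added = df.groupby('date_added')['item_id'].count().cumsum()
cum_read = df[df['status'] == 1].groupby('date_read')['item_id'].count().cumsum()

# Align both series on a common daily index before taking the difference
cum_added.index = pd.to_datetime(cum_added.index)
cum_read.index = pd.to_datetime(cum_read.index)
timeline = pd.date_range(cum_added.index.min(), cum_added.index.max())
cum_added = cum_added.reindex(timeline, method='ffill').fillna(0)
cum_read = cum_read.reindex(timeline, method='ffill').fillna(0)

backlog = cum_added - cum_read
backlog.plot(title='Unread backlog (added minus read) over time')
plt.show()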

Question 8 – What topics do I read about the most?

# Wordcloud of the topics I read about
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=300,
                      max_font_size=40, 
                      random_state=42
                      ).generate(' '.join(df['given_title'].astype(str)))  # join all titles into one text

print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
fig.savefig("Pocket Wordcloud.png", dpi=900)

Pocket API Tutorial in Python - Pocket Wordcloud

Lastly, I also want to know which topics I read about the most. What better way to summarize them than by building a word cloud? I’ve used the WordCloud function from the wordcloud package and built the cloud from the article titles. Here is an explanation of some of the words showing up:

  • Medium, Atlantic – websites whose articles I like to read, their titles usually begin with their name
  • WordPress – because I’m building my websites and learning more about WordPress
  • Creating, Life, Help – all kinds of self-help articles I like to read (but probably shouldn’t)
  • Analytics, Sales, Marketing – topics I’m trying to get better at

This word cloud is largely representative of what I read about. We can refine it further by removing website names and other stop words, as sketched below.
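For instance, here is a minimal sketch of how extra stop words (website names and other noise) might be added before regenerating the cloud – the extra words below are just examples:

# Add custom stop words (e.g. website names) on top of the defaults
custom_stopwords = set(STOPWORDS)
custom_stopwords.update(['medium', 'atlantic', 'wordpress', 'https', 'www'])

wordcloud = WordCloud(background_color='white',
                      stopwords=custom_stopwords,
                      max_words=300,
                      max_font_size=40,
                      random_state=42
                      ).generate(' '.join(df['given_title'].astype(str)))

plt.imshow(wordcloud)
plt.axis('off')
plt.show()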

Results and Conclusion

The conclusion, as I suspected, is that I read a lot. But now I know HOW much. Along with that – what all do I read about, when do I read, how much do I still need to read etc. One interesting insight is how my ‘to read’ list is constantly growing. Maybe Pocket should change their tagline from ‘read it later’ to ‘read it never’! (ok sorry)

Limitations

  • I faced no major challenges when doing this analysis. The Pocket API is lovely and the data is very straightforward to analyze. One thing to note is that deleted articles do not show up in the API results – hence it is not possible to analyze them.
  • The time_to_read variable may not be a very reliable indicator of reading time, since it depends on your reading speed. Also, not all rows have a populated value – I’m not sure why this happens.

Get The Code

The complete Python code for this analysis is available on my Github profile. Is there a way to make my code more efficient? Do let me know! I’m still trying to get better at Python and every little helps.

So that’s it. I’ve shown you a way to analyze your Pocket reading habits. If you’re planning to analyze your own Pocket account after getting inspired by this post, do tweet me @kumaagx on Twitter and I’ll re-tweet your post. If you found this analysis useful, please share this post and consider buying me a coffee. I’m working to produce one such in-depth analysis article every 2 weeks and your support helps!

Instagram Hashtag Analysis In Python

Not many people know this but apart from being a data analyst, I am an artist too. This means that I regularly create art and post it on my Instagram account. Making art, just like doing an analysis, takes a lot of time and effort. And it makes me sad when I’m not able to get enough social validation in the form of likes, comments or new followers on my posts.

So I keep trying out different methods to increase my following and post engagement. One of the methods I use is to include relevant Instagram hashtags in my posts. But the biggest struggle is finding the most relevant hashtags for a particular post. How do I know whether the hashtags I’m using are effective or not? So I decided to tackle this problem doing what I do best (apart from making art!) – I wrote some Python code to do my own Instagram hashtag analysis!

The Question

Hashtags are basically search terms on Instagram. Depending on their popularity, they can be ‘fast-moving’ or ‘slow-moving’. Popular hashtags like #art get used a lot and are fast moving – that means, the posting frequency and number of posts tagged is very high. On the other hand, tags like #BlackAndWhiteArt are relatively slow moving as they are niche and specific – everyone may be making art, but not everyone is making black and white art. So their posting frequency and number of posts tagged will be lower.

To be discovered via hashtags, we need to use hashtags that aren’t too popular or too niche. If the hashtag is too popular, our post will be lost in the deluge of posts. If it’s too niche, hardly anyone will see it and we won’t get discovered.

So the questions we’re trying to answer using data are:

  1. Where can I find and extract a list of hashtags to analyse?
  2. How can I identify which of my hashtags are popular / niche / relevant?
  3. How do I do all this in a fast and automated way?

In this analysis, my personal aim is to find art-related hashtags in the range of 100K to 500K posts.

Methodology

Let’s start coding now! Fire up your Python tool of choice – I’m using the Jupyter ipython notebook for this project.

Part 1 – Initialize

I was actually planning to do this project in R using the rvest package – until I discovered that rvest works well only on static sites. Instagram is a dynamic site – the content is loaded via JavaScript and cannot be accessed via rvest. Disappointed, I searched for a different way and came across the method of using Selenium + BeautifulSoup + ChromeDriver in Python.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import datetime

Selenium is a browser automation library for Python and BeautifulSoup is a library for parsing and scraping web pages. ChromeDriver is what enables us to open an independent Chrome browser window from within Python, load the Instagram website and then extract data from it. To use ChromeDriver, you need to download it (from this link), extract it and place the chromedriver.exe file in the same folder as your Python working directory.

Part 2 – Get the hashtag list

Now we need to get data from the Instagram website. Ideally I would have loved to use the API, especially after recently using Twitter’s API in my previous post. However Instagram’s API is such a disappointment that the only good thing about it is that I was forced to learn web scraping.

The first part of the problem is to get a list of Instagram tags for hashtag analysis. If you already have a list of tags, that’s great – we can start with that. Otherwise, I’ve written code to extract the tags from any relevant post on Instagram. For me, that means extracting hashtags from a post by an artist whose style is similar to mine. We simply enter the link of the target post and scrape the description and the comments section. Then we extract the hashtags from the text by identifying words that start with a ‘#’.

Data on websites is stored within HTML tags, and this can be seen by inspecting the page code (right-click > ‘Inspect’). The data for our Instagram post and comments sits under the selected <ul> tag hierarchy and, within this, inside the <span>, the <a> tags contain the hashtag text values.

Instagram Hashtag Analysis - Inspect Element

driver = webdriver.Chrome()

# Extract description of a post from Instagram link
driver.get('https://www.instagram.com/p/BiRnjDsFKzl/')
soup = BeautifulSoup(driver.page_source, "lxml")
desc = " "

for item in soup.findAll('a'):
    desc = desc + " " + str(item.string)

# Extract tag list from Instagram post description
taglist = desc.split()
taglist = [x for x in taglist if x.startswith('#')]
taglist = [x.strip('#') for x in taglist]   # strip the leading '#'

# (OR) Copy-paste your tag list manually here
#taglist = ['art', 'instaart', 'iblackwork']

print(taglist)

When we execute this code, the Chrome driver opens a test browser window, loads the post link we’ve provided in the driver.get() function, goes to the description element and extracts the text. It then strips this text down to find the words beginning with a ‘#’. If you don’t wish to extract hashtags from a post and want to enter your own list instead, you can write those hashtags directly into the taglist variable after un-commenting the line.

Part 3 – Loop over hashtags and extract information

Our hashtag list is ready. Now we need to loop over each tag and extract information from its individual page. There are two main data points we’re after – the number of posts in a hashtag and the posting frequency. We’ll load the hashtag pages one by one by navigating the Chrome window to the www.instagram.com/explore/tags/tagname page. Then our code does the rest.

# Define dataframe to store hashtag information
tag_df  = pd.DataFrame(columns = ['Hashtag', 'Number of Posts', 'Posting Freq (mins)'])

# Loop over each hashtag to extract information
for tag in taglist:
    
    driver.get('https://www.instagram.com/explore/tags/'+str(tag))
    soup = BeautifulSoup(driver.page_source,"lxml")
    
    # Extract current hashtag name
    tagname = tag
    # Extract total number of posts in this hashtag
    # NOTE: Class name may change in the website code
    # Get the latest class name by inspecting web code
    nposts = soup.find('span', {'class': 'g47SY'}).text
        
    # Extract all post links from 'explore tags' page
    # Needed to extract post frequency of recent posts
    myli = []
    for a in soup.find_all('a', href=True):
        myli.append(a['href'])

    # Keep links of only the 1st and 9th most recent posts
    newmyli = [x for x in myli if x.startswith('/p/')]
    del newmyli[:9]     # drop the first 9 links (the 'Top Posts' section)
    del newmyli[9:]     # keep the next 9 links (most recent posts)
    del newmyli[1:8]    # keep only the 1st and 9th of those

    timediff = []

    # Extract the posting time of 1st and 9th most recent post for a tag
    for j in range(len(newmyli)):
        driver.get('https://www.instagram.com'+str(newmyli[j]))
        soup = BeautifulSoup(driver.page_source,"lxml")

        for i in soup.findAll('time'):
            if i.has_attr('datetime'):
                timediff.append(i['datetime'])
                #print(i['datetime'])

    # Calculate the time difference between the two posts
    # and convert it to an average posting frequency (minutes per post)
    datetimeFormat = '%Y-%m-%dT%H:%M:%S.%fZ'
    diff = datetime.datetime.strptime(timediff[0], datetimeFormat)\
        - datetime.datetime.strptime(timediff[1], datetimeFormat)
    pfreq = int(diff.total_seconds()/(9*60))
    
    # Add hashtag info to dataframe
    tag_df.loc[len(tag_df)] = [tagname, nposts, pfreq]
        
driver.quit()

So first, we save the hashtag name and the number of posts into a variable. The number of posts is easily extracted from the <span> tag in the page.

Instagram Hashtag Analysis - Inspect Element 2

Next, we extract the data needed to calculate the posting frequency for a particular hashtag. We’ll calculate the average time difference between the 1st and the 9th most recent posts on the hashtag page. To do this, we get all the post URLs on the hashtag page but keep only the 1st and 9th most recent post links. We then visit the individual pages of these two posts to get their posting times. Finally, we calculate the average time difference and store it in the diff variable. The last step is to simply assign the variables to the dataframe columns. The loop repeats these steps for each hashtag. The code may take 5+ minutes to execute, depending on your connection speed and the number of hashtags.

Results and Conclusion

Our final output is the dataframe of hashtags, number of posts and posting frequency. You can export the dataframe to a CSV and continue your hashtag analysis in Excel if you wish.

# Check the final dataframe
print(tag_df)

# CSV output for hashtag analysis
tag_df.to_csv('hashtag_list.csv')

We can now easily compare the number of posts on each hashtag, the ratio of popular vs niche tags used and how frequently someone posts on each hashtag. We can compare this with our own Instagram posting strategy and make adjustments as needed.

Instagram Hashtag Analysis - Output

I compared a hashtag analysis of my own posts with the strategy of a popular account and found out that I was using way too many popular hashtags. I need to niche down a bit – using more tags in the 100K to 500K post range, with a slower posting frequency.

Limitations

  • The class names of the elements on the website seem to keep changing. I’m not sure why this happens, but you need to make sure that the element class name being scraped matches the name in the website code.
  • Scraping may or may not be legal based on the country / website. Be careful with the laws.
  • This is not a strategy guaranteed to increase your Instagram engagement. A lot of other factors, like content and time of posting, should also be considered.

Get The Code

The latest code is available on my Github profile. Do you think I can make the code more efficient? Do let me know! I’m still trying to get better at Python and every little helps.

So that’s it. We’ve now created our own Instagram hashtag research tool. If you found this analysis useful, please share this post and consider buying me a coffee. I’m working to produce one such in-depth analysis article every 2 weeks and your support helps!

Identifying Forced And Fake Twitter Trends Using R

Twitter has become a toxic place. There, I said it. It is no longer the fun and happy place it used to be a few years back, certainly not in India. It is now full of trolls, rude and nasty people, politicians and companies busy trying to sell their products or spreading propaganda.

But I still love Twitter. Partly because it is not Facebook (that’s a good enough reason). However, it pains me to see the negativity every time I visit it. As a user, it appears that the Twitter team isn’t moving fast and hard enough to eliminate the problem of trolls and propaganda. So I decided to approach this problem on my own, doing what I do best – data analysis. In this post, I use Twitter data to perform a basic analysis in R of a very specific part of the problem – unnaturally trending hashtags and topics on Twitter.

The Question

So the question we’re trying to answer using data is: How can we identify and differentiate real / natural Twitter trends from the forced / propaganda / fake trends showing up in the ‘trending’ tab?

A real / natural trend is a response to any event or news, and this response generally starts gradually, peaks for a while and then fades over time. It’s like discussing a topic in real life – similar to how information is spread from a small group of people to a large one. But if you’re trying to force a trend or a hashtag, it is difficult to achieve a natural distribution pattern – obviously because people aren’t talking about it ‘naturally’. So a handful of people have to coordinate to consistently tweet and RT content over a period of time to get it trending. This difference between a natural and an unnatural trend can easily be identified using simple techniques of data analysis. Let us examine how.

Methodology

I’ll be using the data provided by the Twitter API. The analysis will be performed in R and the code is available at the end of this article. I’ve divided the whole analysis into steps for easier understanding:

Part 0 – Initialize

Before anything else, you need to set up a connection to Twitter’s API in R. You can follow the tutorial here – it’s very straightforward, so I’m not going into the details of this step. We use the setup_twitter_oauth() function for it.
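For completeness, the call looks roughly like this – a minimal sketch with placeholder keys (you get the real ones by creating a Twitter app):

library(twitteR)

# Placeholder credentials from your Twitter app
setup_twitter_oauth(consumer_key    = "YOUR-CONSUMER-KEY",
                    consumer_secret = "YOUR-CONSUMER-SECRET",
                    access_token    = "YOUR-ACCESS-TOKEN",
                    access_secret   = "YOUR-ACCESS-SECRET")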

Part 1 – Search Available Trends

The next step is to get a list of the currently trending topics for a region and country of our choice. We use the availableTrendLocations() function to get the full list of locations and their woeID. Enter this woeID in the getTrends() function to get the latest list of trends for your location. Note the exact name of the trend / hashtag you want to analyze from the first column.
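As a rough sketch of this step (the woeid shown below is an example for India – pick yours from the locations table):

# Find the woeid for your region, then fetch its current trends
locations <- availableTrendLocations()
head(locations)                 # columns: name, country, woeid

trends <- getTrends(23424848)   # example: woeid for India
head(trends$name)               # exact names of trending topics / hashtags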

For our analysis, we’ve chosen two hashtags – #MondayMotivation and #SidduChallengesYeddy. I have a hunch that people tweet about #MondayMotivation on their own every Monday, whereas #SidduChallengesYeddy (one political candidate challenging another over an upcoming state election) looks like a painfully obvious forced hashtag. Nevertheless, we’ll uncover insights about both through our code.

Part 2 – Define Our Analysis Function

The next step is to actually get the tweets for our trends. We use the searchTwitter() function and pass arguments like the trend name, the number of tweets and RTs to fetch, date ranges if you wish, and other optional arguments. We also need the original tweets (i.e. excluding RTs) for our analysis, so we use the strip_retweets() function to subset the list of total tweets. Lastly, for easier data manipulation, we convert both these datasets to dataframes using twListToDF().

Next, we plot a histogram of all tweets (including RTs) to understand the tweet frequency pattern of a particular trend. This visual insight is incredibly powerful and is, in a lot of cases, good enough on its own to identify forced and unnatural trends. A forced trend usually starts abruptly (probably due to coordinated mass tweeting) and fades off over time, as opposed to a natural trend which starts gradually, peaks for a while, then fades off.

Identifying Fake Twitter Trends - SidduChallengesYeddy - Histogram
Tweet Pattern – #SidduChallengesYeddy – Histogram
Identifying Fake Twitter Trends - MondayMotivation - Histogram
Tweet Pattern – #MondayMotivation – Histogram

In the next part, we take the dataset of original tweets (tweets without RTs) and clean up / strip down the text to its essence. This means removing non-alphanumeric characters, extra spaces, links and new lines from the tweets and converting everything to lower case. Note: this works only for tweets in English; it doesn’t work on languages with special characters (content in other languages will be stripped away).

From my observation, forced and unnatural trends have a lot of duplicate / near-identical tweets, because the conversation isn’t happening naturally and a group of people may just be copy-pasting and RT’ing tweets from a template given to them. We can identify this duplication by analyzing the dataset of cleaned-up tweets created above.

Identifying Fake Twitter Trends - SidduChallengesYeddy - Duplicate Tweet Frequency
#SidduChallengesYeddy – Duplicate Tweet Frequency
Identifying Fake Twitter Trends - MondayMotivation - Duplicate Tweet Frequency
#MondayMotivation – Duplicate Tweet Frequency

So we create a frequency table (twfreq) of duplicates from the cleaned-up dataset and remove any blank rows while we’re at it. This table lets us identify the duplicate posts and their frequency. Sort it by frequency and you’ll be able to see which posts have been posted multiple times from different accounts. Based on this, we can also calculate a uniqueness score which tells us what fraction of tweets are original. The formula we’re using is (Number of Unique Tweets) / (Total Number of Tweets). The closer this score is to 1, the more natural a trend is.

Identifying Fake Twitter Trends - SidduChallengesYeddy - Output
#SidduChallengesYeddy – Output
Identifying Fake Twitter Trends - MondayMotivation - Output
#MondayMotivation – Output
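Putting Part 2 together, a minimal sketch of such an analysis function in R could look like the following. The analyse_trend name and the exact clean-up regexes are my own – the full version lives in the GitHub repo linked at the end:

# Sketch of the trend-analysis function described above
library(twitteR)

analyse_trend <- function(trend, n = 2000) {
  # Fetch tweets (including RTs) and a copy with RTs stripped out
  tweets    <- searchTwitter(trend, n = n)
  originals <- strip_retweets(tweets)
  tw_df     <- twListToDF(tweets)
  or_df     <- twListToDF(originals)

  # Tweet-frequency histogram: natural trends rise and fall gradually
  hist(tw_df$created, breaks = "hours",
       main = paste("Tweet pattern -", trend), xlab = "Time")

  # Clean up the original tweets: links, special characters, case, spaces
  clean <- tolower(or_df$text)
  clean <- gsub("http\\S+", "", clean)        # remove links
  clean <- gsub("[^a-z0-9 ]", "", clean)      # keep plain English text only
  clean <- gsub("\\s+", " ", trimws(clean))   # collapse extra spaces

  # Frequency table of duplicate tweets (blank rows dropped)
  twfreq <- sort(table(clean[clean != ""]), decreasing = TRUE)

  # Uniqueness score = unique tweets / total tweets (closer to 1 = more natural)
  uniqueness <- length(unique(clean)) / length(clean)

  list(frequency_table = head(twfreq, 10), uniqueness_score = uniqueness)
}

# Example call, as described in Part 3
# analyse_trend("#MondayMotivation", n = 1000)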

Part 3 – Call Our Function And Get Results

In the last step we defined a function to identify and analyze fake and forced trends. This step is simply a call to that function, with the trend or search term you wish to analyze as the argument.

Results and Conclusion

Using the histogram, the duplicate-tweet table and the uniqueness score together makes it easy to identify forced and fake Twitter trends. This analysis can easily be replicated for any other trend or search term you come across, to quickly determine how real the trending topic is.

However, the unfortunate part is that it is equally easy for anyone with a handful of money (or a following) to start trending a particular topic and possibly spread propaganda. And Twitter doesn’t seem to be doing much to stop this manipulation. The same goes for the trolling, harassment and abuse. The solution to this is data science. I’m no expert in data science, definitely not as good as the people at Twitter. But if I can use data to identify propaganda on Twitter so easily, their engineers can do far more impactful things to stop this problem.

Limitations

There are several limitations to this analysis method:

  1. We are limited by the Twitter API – there’s a limit on how many tweets can be extracted in a given time period. So for bigger trends, or trends spread over a long time period, you may need to extract your data in parts. This can consume a significant amount of time.
  2. We can process tweets only in English at the moment.

Get The Code

I’ve uploaded the code to my Github account. If you have any questions, feel free to contact me here.