Reading Habit Analysis Using Pocket API And Python

I like to read. Like a LOT. But I’m not limited to just books. I read everything that comes my way – books, articles, Reddit threads, tweets and what not. Consuming information by audio (podcasts, audio books) or video is just not my thing. Text is how I like it – and to keep track of all the articles and posts I have to read (but can’t at the moment), I use a very popular app called Pocket. Whenever I come across any interesting article that needs to be saved for reading later, I just save it to my Pocket account. It’s a very handy app – you can save articles from your phone, within apps or from your browser. You can then go back to it later and read the articles in a distraction-free way, offline.

Being the data-curious person that I am, I thought, why not use data analysis to gain deeper insights into my internet reading habits using my Pocket data? So this is what this post is about – I explore trends in how frequently I add articles to my Pocket, how frequently I read them and what those articles are about. I use the Pocket API and the Python language to do this analysis. Let’s go!

The Question

This section is usually for clearly defining the analysis question at hand, as seen in my previous posts. However, this time there’s not just one big question. There are a lot of small questions we’re asking and answering based on my Pocket reading habits. So instead of listing all the questions here, I’ll list them directly in the ‘Methodology’ section.

Just a glimpse of the questions we’ll answer using the Pocket API data:

  1. How many articles have I added to my Pocket till date?
  2. How many articles have I read / I need to read?
  3. How large is the gap between added and read articles?
  4. What topics do I add articles about?

… and many more! Let’s jump to the analysis.

Methodology

Let’s start coding now! Fire up your Python tool of choice – I’m using a Jupyter (IPython) notebook for this project.

Part 1 – Set up Pocket API connection

This is the only complicated part of the whole analysis. I’ll blame myself for the complications though, as I’m new to Python and connecting to an API for the first time in my life. It was quite a struggle understanding all the API-related terminology – but once you get the hang of it, there’s quite a rush (imagine the data analysis possibilities!). Anyway, to set up the connection, I would recommend that you read the following two articles which explain the multi-step process very beautifully:

The articles above will help you understand how to make a connection to the Pocket API using Python code. It may seem intimidating if you’re a newbie like me but trust me, the struggle is worth it. Here are the steps along with my code for the whole process, explained with comments for each step.

Import all the necessary packages:

import requests
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path was deprecated in pandas 1.0
import json
import datetime
import matplotlib.pyplot as plt

There are four major steps to connect to the Pocket API using Python:

  • Step 1: Create a new ‘Pocket application’ from the Pocket website. Once created, you’ll get your unique consumer_key.
  • Step 2: Now go to your Python code and paste the consumer_key into the requests.post() call. The response of requests.post() is stored in the pocket_api variable. Check the response by executing pocket_api.status_code – if it is 200, the connection was made successfully. If not, look up the error reason with pocket_api.headers['X-Error']. Finally, execute pocket_api.text to get your request_token.
  • Step 3: Now use the request_token obtained above and authenticate in your browser. Use the link given in STEP 3 in code below and replace the text after “?request_token=” with your own request_token.
# STEP 1: Get a consumer_key by creating a new Pocket application
# Link: https://getpocket.com/developer/apps/new

# STEP 2: Get a request token
# Connect to the Pocket API
# pocket_api variable stores the http response
pocket_api = requests.post('https://getpocket.com/v3/oauth/request',
                           data = {'consumer_key':'12345-23ae05df52291ea13b135dff',
                                   'redirect_uri':'https://google.com'})

# Check the response: if 200, then it means all OK
pocket_api.status_code       

# Check error reason, if any
# print(pocket_api.headers['X-Error'])

# Here is your request_token
# This is a part of the http response stored in pocket_api.text
pocket_api.text

# STEP 3: Authenticate 
# Modify and paste the link below in the browser and authenticate
# Replace text after "?request_token=" with the request_token generated above
# https://getpocket.com/auth/authorize?request_token=PASTE-YOUR-REQUEST-TOKEN-HERE&redirect_uri=https://getpocket.com/connected_applications
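
A small aside on STEP 2 and STEP 3: as far as I can tell, the Pocket endpoints answer with a form-encoded body by default (something like code=...), so pocket_api.text is not the bare token. The sketch below assumes that default format. It pulls the token out with the standard library and builds the STEP 3 link for you. The helper names and the sample body are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

def parse_pocket_body(body):
    """Turn a form-encoded body like 'code=abc' into a plain dict."""
    return {key: values[0] for key, values in parse_qs(body).items()}

def build_authorize_url(request_token, redirect_uri):
    """Build the browser link used in STEP 3."""
    query = urlencode({'request_token': request_token,
                       'redirect_uri': redirect_uri})
    return 'https://getpocket.com/auth/authorize?' + query

# Hypothetical response body; in the notebook this would be pocket_api.text
sample_body = 'code=a1dc2a39-abcd-af28-e235-25ddd4'
request_token = parse_pocket_body(sample_body)['code']
auth_url = build_authorize_url(request_token,
                               'https://getpocket.com/connected_applications')
print(auth_url)
```

The same parse_pocket_body() helper should also work on pocket_auth.text later on, where the key to look for would be access_token rather than code.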

Once you have authenticated the link above in your browser, return to the python code:

  • Step 4: Use your consumer_key and request_token generated earlier to call requests.post() again and check for status_code. If 200, then execute pocket_auth.text and you’ll finally get your access_token.
# STEP 4: Generate an access_token
# After authenticating in the browser, return here
# Use your consumer_key and request_token below
pocket_auth = requests.post('https://getpocket.com/v3/oauth/authorize',
                            data = {'consumer_key':'12345-23ae05df52291ea13b135dff',
                                    'code':'a1dc2a39-abcd-af28-e235-25ddd4'})

# Check the response: if 200, then it means all OK
# pocket_auth.status_code

# Check error reason, if any
# print(pocket_auth.headers['X-Error'])

# Finally, here is your access_token
# We're done authenticating
pocket_auth.text

Finally, we can import our Pocket data from the API. Use your consumer_key and access_token to execute requests.post(). Based on our data needs, we can specify some options on what kind and how much data we wish to receive. For this analysis, I’ve set the options to receive data of state: ‘all’ and detailType: ‘simple’. This means that I need both read & unread items from my Pocket list and the level of detail needed is simple. I strongly recommend you check the official Pocket developer documentation on the retrieve process to understand available options better. Lastly, execute pocket_add.text to get a JSON dump of your Pocket data.

# Get data from the API
# Reference: https://getpocket.com/developer/docs/v3/retrieve
pocket_add = requests.post('https://getpocket.com/v3/get',
                           data= {'consumer_key':'12345-23ae05df52291ea13b135dff',
                                  'access_token':'b07ff4be-abcd-4685-2d70-d47816',
                                  'state':'all',
                                  'detailType':'simple'})

# Check the response: if 200, then it means all OK
# pocket_add.status_code

# Here is your fetched JSON data
pocket_add.text

If everything goes well, you’ll get a JSON dump like the one shown below. It’s time to finally flex our data analysis muscle!

Pocket API Tutorial in Python - JSON Output

Our next step is to convert this JSON file into something more useful for analysis.

Part 2 – Prepare the analysis dataframe

Now we’ll convert our JSON file to a pandas data frame. Load the data into a variable json_data using the json.loads() function. Then we’ll simply loop through the items under the 'list' key of the JSON to extract and append rows to the pandas data frame.

# Prepare the dataframe: convert JSON to table format
json_data = json.loads(pocket_add.text)

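frames = []
for key in json_data['list'].keys():
    # Each item is a flat dict, so index=[0] turns it into a one-row frame
    frames.append(pd.DataFrame(json_data['list'][key], index=[0]))

# Concatenate once at the end and give every row its own index
df = pd.concat(frames, ignore_index=True)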

df = df[['item_id','status','favorite','given_title','given_url','resolved_url','time_added','time_read','time_to_read','word_count']]
df.head(5)
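
If the loop feels slow on a large Pocket list, the same table can probably be built in one call. This is only a sketch on a made-up two-item payload (the item ids, titles and the wanted column list are hypothetical), and it assumes every item is a flat dictionary like the ones returned with detailType 'simple'.

```python
import pandas as pd

# Hypothetical two-item payload shaped like the 'list' part of the response
sample = {'list': {
    '101': {'item_id': '101', 'status': '1', 'given_title': 'First',
            'word_count': '1200'},
    '102': {'item_id': '102', 'status': '0', 'given_title': 'Second'},
}}

wanted = ['item_id', 'status', 'given_title', 'word_count']

# One row per item: orient='index' makes each dict key a row
df_sample = pd.DataFrame.from_dict(sample['list'], orient='index')

# reindex (unlike df[wanted]) fills absent columns with NaN instead of raising
df_sample = df_sample.reindex(columns=wanted).reset_index(drop=True)
print(df_sample)
```

Using reindex() here is a deliberate choice: if some item lacks a field, the column simply shows NaN for that row instead of stopping the notebook with a KeyError.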

Glance at the top 5 rows of our newly created dataset and try to inspect the different columns and rows available. As you can see, we can do a very thorough analysis using this data.

Pocket API Tutorial in Python - Data Table

Hold on – before we jump into the analysis, we need to fix some datatypes and clean up some columns. We also need to save a local copy of the data in case we need to quickly summarize the data in Excel.

 
# Clean up the dataset
df.dtypes
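df[['status','favorite','word_count']] = df[['status','favorite','word_count']].astype(int)
# Timestamps arrive as strings of Unix seconds; cast to integers before converting
df['time_added'] = pd.to_datetime(df['time_added'].astype('int64'),unit='s')
df['time_read'] = pd.to_datetime(df['time_read'].astype('int64'),unit='s')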
df['date_added'] = df['time_added'].dt.date
df['date_read'] = df['time_read'].dt.date

# Save the dataframe as CSV locally
df.to_csv('pocket_list.csv')

# Check the data types
df.dtypes
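
One thing to watch for in the clean-up: if unread items come back with a time_read of 0 (that is my assumption, so please check it against your own data), the conversion above turns them into 1st January 1970. A minimal sketch of one way to mark them as missing instead, using a made-up two-row sample:

```python
import pandas as pd

# Hypothetical rows: the second item is unread, so time_read is '0'
sample = pd.DataFrame({'time_added': ['1409875200', '1536105600'],
                       'time_read': ['1409918400', '0']})

# Cast to integers first, convert, then blank out the "never read" rows
seconds = sample['time_read'].astype('int64')
sample['time_read'] = pd.to_datetime(seconds, unit='s').where(seconds > 0)
sample['date_read'] = sample['time_read'].dt.date
print(sample)
```

Rows with a missing date_read are then skipped automatically by groupby('date_read'), so a 1970 bucket never shows up in the read counts.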

It’s now time to do what we’re meant to do – analyze data!

Part 3 – Answer questions using data

We start with basic questions like how many items are there in our Pocket account and go on to answer complex questions like how fast is our unread article list piling up. Answers to most of these questions are simple python commands.

Question 1 – How many items are there in my Pocket?

# Answer questions using data

# How many items are there in my Pocket?
print(df['item_id'].count())

# What % of articles are read?
print((df['status'].sum()*100)/df['item_id'].count())

The first question that comes to my mind is about my overall Pocket usage. Simply counting the number of item_id rows gives us the total number of articles added to our Pocket account till date (not including the ones deleted). If we wish to understand how many of the added articles we’ve read (or archived), simply sum the status column, where 1 marks an archived item. Using these two metrics, we can also calculate the percentage of read articles till date.

For me, I’ve added 1236 articles to my Pocket account till date and read nearly 73% of those articles. That’s a lot of articles! If I dig deeper into my data, I can see that the first article added to my Pocket was on 5th September, 2014 (just celebrated my 4th year Pocket anniversary by reading more articles!). Fun fact: it was an article on why walking helps us think.

Question 2 – How long is the average article in my Pocket? (minutes)

# How long is the average article in my Pocket? (minutes)
df['time_to_read'].describe()

Next I’m curious about the kind of articles I like to read – do I like long or short articles? There are two ways we can estimate this – by approximating the time to read or by the number of words. If we simply use the describe() function on the time_to_read variable, we get a summary of the reading time estimate of our articles.

For me, the average article is 11 minutes long, ranging from 3 mins to 206 mins (for the curious, this one. Oh look, it’s still unread). 75% of the articles added take less than 12 minutes to read.

Question 3 – How long is the average article in my Pocket? (word count)

# How long is the average article in my Pocket? (word count)
df['word_count'].describe()

Expanding on the last question, if we wish to estimate the reading time by the number of words in our articles, we can simply describe() the word_count variable. For me, the average article is 2152 words long, with the maximum being 45000+ words and the median being 1405 words.

Question 4 – What is the % of favorites?

# What is the % of favorites?
print((df['favorite'].sum()*100)/df['item_id'].count())

How often do I find an article so good that I favorite it to revisit later? For this, simply sum the favorite column and divide it by the total article count. For me, nearly 4.5% of the articles are a favorite.

Question 5 – How many words have I read till date? (and equivalent books)

# How many words have I read till date?
print(df.loc[df['status'] == 1, 'word_count'].sum())

# How many books is this equivalent to?
print(df.loc[df['status'] == 1, 'word_count'].sum()/64000)

Of the 73% or nearly 913 articles read till date, how many words have I read in total? Easy, just sum the word_count where status is 1. I’ve read 1,697,842 words using Pocket. Whew. I read that the average book is about 64,000 words in length. So my 1,697,842 words would be equivalent to about 26.5 books, or roughly 6.6 books per year over four years!
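
The book estimate is a simple division. Spelled out with the numbers from this post (the four-year span is my own rounding, based on the first article being added in September 2014):

```python
words_read = 1_697_842      # total words across read articles
words_per_book = 64_000     # average book length quoted above
years_on_pocket = 4         # September 2014 to September 2018

books = words_read / words_per_book
print(round(books, 1))                    # 26.5
print(round(books / years_on_pocket, 1))  # 6.6
```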

Question 6 – How were the articles added and read over time?

# How were the articles added over time?
plot_added = df.groupby('date_added')['item_id'].count()
plot_added.describe()
# plot_added.head(10)

# How were the articles read over time?
plot_read = df.groupby('date_read')['status'].sum()
plot_read.describe()
#plot_read.head(10)

I’m interested in a visual representation of my Pocket habits. Specifically, how I add and read articles. To see this, we need to group the articles added and read by date_added and date_read respectively.

Pocket API Tutorial in Python - Articles added

I’ve plotted these graphs using Excel since I was facing problems with Python. However, you can extend the code to generate the graphs in Python itself.
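
For anyone who wants to stay in Python, here is a minimal sketch of how such a chart could be drawn with matplotlib. The function name and the three-day sample are made up; in the notebook you would pass plot_added or plot_read instead of the sample.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_daily_counts(counts, title):
    """Draw a bar chart of per-day counts indexed by date."""
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.bar(pd.to_datetime(counts.index), counts.values)
    ax.set_title(title)
    ax.set_xlabel('Date')
    ax.set_ylabel('Articles')
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return fig, ax

# Hypothetical stand-in for plot_added from the groupby above
sample_added = pd.Series([3, 1, 4],
                         index=pd.to_datetime(['2018-01-01', '2018-01-02',
                                               '2018-01-05']).date)
fig, ax = plot_daily_counts(sample_added, 'Articles added per day')
```

Call plt.show() afterwards to display the chart, or fig.savefig() to keep a copy.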

Pocket API Tutorial in Python - Articles read

From the graphs it is clear that my Pocket usage has been very on-and-off. However, it has really picked up since 2017 (perhaps when I streamlined my reading habits and tools). 2018 has been a good year in terms of both articles added and read. My article addition activity usually coincides with my reading spree, as observed.

Question 7 – How large is the gap between read and unread articles?

Pocket API Tutorial in Python - Unread Gap

This is the question which actually gave me the idea of writing this article. It’s also a very important one regarding my procrastination habits and Pocket’s user base in general. Do I add articles to my Pocket faster than I’m able to read them? Does Pocket make it too easy for people to add articles to the app? (Spoiler: yes it does) What is my unread article backlog right now and how has it increased over time?

Above, I’ve plotted a graph which shows the cumulative number of articles added and read over time since 2014 (red and green lines). I’ve also plotted the cumulative backlog (defined as the difference between the cumulative number of added and read articles) (black line, secondary axis). Clearly, the backlog has only increased over time, despite my very hard efforts to bring it down. This is clearly because adding articles to the app requires nowhere near as much mental bandwidth, effort and time as actually reading an article.

To plot this graph, I’ve manipulated the data in Excel. If you find a way to do these steps in python, let me know.
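
One possible way to do these steps in pandas is a pair of groupby counts followed by cumsum(). This is a sketch rather than the exact Excel logic: it takes the backlog as cumulative added minus cumulative read, counts only status 1 as read, and runs on a made-up four-article sample with the same column names as df.

```python
import pandas as pd

def cumulative_backlog(df):
    """Return cumulative added, read and backlog counts per day."""
    added = df.groupby('date_added')['item_id'].count()
    # Only archived items (status == 1) count as read
    read = df.loc[df['status'] == 1].groupby('date_read')['item_id'].count()
    # Align both series on the union of dates; days with no activity count as 0
    daily = pd.DataFrame({'added': added, 'read': read}).fillna(0).sort_index()
    totals = daily.cumsum()
    totals['backlog'] = totals['added'] - totals['read']
    return totals

# Hypothetical mini dataset with the same column names as df
sample = pd.DataFrame({
    'item_id': ['1', '2', '3', '4'],
    'status': [1, 0, 1, 0],
    'date_added': pd.to_datetime(['2018-01-01', '2018-01-01',
                                  '2018-01-02', '2018-01-03']).date,
    'date_read': pd.to_datetime(['2018-01-02', '1970-01-01',
                                 '2018-01-03', '1970-01-01']).date,
})
totals = cumulative_backlog(sample)
print(totals)
```

From there, totals[['added', 'read']].plot() together with totals['backlog'].plot(secondary_y=True) should give a chart close to the one above.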

Question 8 – What topics do I read about the most?

# Wordcloud of the topics I read about
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=300,
                      max_font_size=40, 
                      random_state=42
                      ).generate(' '.join(df['given_title'].dropna().astype(str)))  # join every title; str(Series) only gives a truncated preview

print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
fig.savefig("Pocket Wordcloud.png", dpi=900)  # save before show() so the saved figure is not blank
plt.show()

Pocket API Tutorial in Python - Pocket Wordcloud

Lastly, I also wish to know which topics I read about the most. What better way to summarize the topics than by building a word cloud? I’ve used the WordCloud class from the wordcloud package and plotted the article titles. Here is an explanation of some of the words showing up:

  • Medium, Atlantic – websites whose articles I like to read, their titles usually begin with their name
  • WordPress – because I’m building my websites and learning more about WordPress
  • Creating, Life, Help – all kinds of self-help articles I like to read (but probably shouldn’t)
  • Analytics, Sales, Marketing – topics I’m trying to get better at

This word cloud is largely representative of the things I read about. We can refine it further by removing names of websites and other stop words.
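
As a rough sketch of that refinement, the titles can be counted word by word while skipping an extra stop word list. The stop words and the two sample titles below are made up for illustration; you would pass df['given_title'].dropna() instead.

```python
import re
from collections import Counter

# Hypothetical extra stop words: site names and filler words in the titles
EXTRA_STOPWORDS = {'medium', 'atlantic', 'the', 'a', 'to', 'of', 'and',
                   'on', 'in'}

def title_word_counts(titles, stopwords=EXTRA_STOPWORDS):
    """Count words across titles, skipping stop words (case-insensitive)."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts

sample_titles = ['Why Walking Helps Us Think - The Atlantic',
                 'Walking to think on Medium']
counts = title_word_counts(sample_titles)
print(counts.most_common(3))
```

The resulting counts can be handed to WordCloud's generate_from_frequencies() method, or the extra words can simply be added to the stopwords set passed to WordCloud in the code above.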

Results and Conclusion

The conclusion, as I suspected, is that I read a lot. But now I know HOW much. Along with that – what all do I read about, when do I read, how much do I still need to read etc. One interesting insight is how my ‘to read’ list is constantly growing. Maybe Pocket should change their tagline from ‘read it later’ to ‘read it never’! (ok sorry)

Limitations

  • I faced no limitations or challenges when doing this analysis. The Pocket API is lovely and the data very straightforward to analyze. One thing to note is that deleted articles do not show up in the API fetch results – hence it may not be possible to analyze them.
  • ‘time_to_read’ variable and its data may not be a very reliable indicator of time to read. It may differ based on your reading speed. All the rows may not have a populated value – not sure why this happens.

Get The Code

This analysis’ complete Python code is available on my Github profile. Is there a way to make my code more efficient? Do let me know! I’m still trying to get better at Python and every little helps.

So that’s it. I’ve shown you a way to analyze your Pocket reading habits. If you’re planning to analyze your own Pocket account after getting inspired by this post, do tweet me @kumaagx on Twitter and I’ll re-tweet your post. If you found this analysis useful, please share this post and consider buying me a coffee. I’m working to produce one such in-depth analysis article every 2 weeks and your support helps!