Instagram Hashtag Analysis In Python

Not many people know this but apart from being a data analyst, I am an artist too. This means that I regularly create art and post it on my Instagram account. Making art, just like doing an analysis, takes a lot of time and effort. And it makes me sad when I’m not able to get enough social validation in the form of likes, comments or new followers on my posts.

So I keep trying out different methods to increase my following and post engagement. One of the methods I use is to include relevant Instagram hashtags in my posts. But the biggest struggle is finding the most relevant hashtags for a particular post. How do I know whether the hashtags I’m using are effective? So I decided to tackle this problem by doing what I do best (apart from making art!): writing some Python code for my own Instagram hashtag analysis!

The Question

Hashtags are basically search terms on Instagram. Depending on their popularity, they can be ‘fast-moving’ or ‘slow-moving’. Popular hashtags like #art get used a lot and are fast-moving: both the posting frequency and the number of tagged posts are very high. On the other hand, tags like #BlackAndWhiteArt are relatively slow-moving because they are niche and specific. Everyone may be making art, but not everyone is making black and white art, so the posting frequency and the number of tagged posts are lower.

To be discovered via hashtags, we need to use hashtags that aren’t too popular or too niche. If the hashtag is too popular, our post will be lost in the deluge of posts. If it’s too niche, hardly anyone will see it and we won’t get discovered.

So the questions we’re trying to answer using data are:

  1. Where can I find and extract a list of hashtags to analyse?
  2. How can I identify which of my hashtags are popular / niche / relevant?
  3. How do I do all this in a fast and automated way?

In this analysis, my personal aim is to find art-related hashtags in the range of 100K to 500K posts.
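
To make the ‘too popular / too niche’ idea concrete, here is a minimal sketch of how a hashtag could be labelled from its post count. The thresholds come from my 100K to 500K aim above; the function name, the labels and the sample counts are made up for illustration.

```python
# Thresholds taken from the stated aim of 100K to 500K posts.
# The names and labels below are illustrative, not Instagram terms.
NICHE_MAX = 100_000
POPULAR_MIN = 500_000


def classify_hashtag(post_count):
    """Label a hashtag by its total number of posts."""
    if post_count < NICHE_MAX:
        return "too niche"
    if post_count > POPULAR_MIN:
        return "too popular"
    return "target range"


# Made-up counts, only to show the three outcomes
for tag, count in [("example_small", 20_000),
                   ("example_mid", 250_000),
                   ("example_big", 9_000_000)]:
    print(tag, classify_hashtag(count))
```

Both boundary values (exactly 100K and exactly 500K) count as being inside the target range here; adjust the comparisons if you prefer otherwise.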

Methodology

Let’s start coding now! Fire up your Python tool of choice – I’m using a Jupyter Notebook for this project.

Part 1 – Initialize

I was actually planning to do this project in R using the rvest package. That was until I discovered that rvest works best on static sites. Instagram is a dynamic site – the content is loaded via JavaScript and cannot be accessed via rvest. Disappointed, I searched for a different way and came across the method of using Selenium + BeautifulSoup + ChromeDriver in Python.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import datetime

Selenium is a browser automation library, and BeautifulSoup is a library for parsing HTML, which is what we need for web scraping. ChromeDriver lets us open a separate Chrome browser window from within Python, load the Instagram website and extract data from it. To use ChromeDriver, download it (from this link), extract it and place the chromedriver.exe file in the same folder as your Python code (your working directory).
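
If you’re not sure whether ChromeDriver is in the right place, a quick check like the one below can help. This is only a sketch using the standard library: the helper name locate_chromedriver is my own, and it simply looks for a file called chromedriver.exe or chromedriver in a folder, then falls back to searching the system PATH.

```python
import shutil
from pathlib import Path


def locate_chromedriver(folder="."):
    """Return the path to a chromedriver binary, or None if not found.

    Looks in `folder` first (the working directory by default),
    then falls back to searching the system PATH.
    """
    for name in ("chromedriver.exe", "chromedriver"):
        candidate = Path(folder) / name
        if candidate.is_file():
            return str(candidate.resolve())
    return shutil.which("chromedriver")


print(locate_chromedriver() or "chromedriver not found")
```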

Part 2 – Get the hashtag list

Now we need to get data from the Instagram website. Ideally I would have loved to use the API, especially after recently using Twitter’s API in my previous post. However Instagram’s API is such a disappointment that the only good thing about it is that I was forced to learn web scraping.

The first part of the problem is to get a list of Instagram tags for hashtag analysis. If you already have a list of tags, that’s great – we can start with that. Otherwise, I’ve written some code to extract the tags from any relevant post on Instagram. For me, that means extracting hashtags from a post by an artist whose style is similar to mine. We simply enter the link of the target post and scrape the description and the comments section. Then we extract the hashtags from the text by identifying words that start with a ‘#’.

Data in websites is stored within HTML tags and this can be seen by inspecting the code (right-click > then choose ‘Inspect’). The data for our Instagram post and comments is stored under the selected <ul> tag hierarchy and within this, inside the <span>, the <a> tag contains the hashtag text values.
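
To see what ‘text inside the <a> tags’ means without opening a browser, here is a small self-contained sketch. It uses Python’s built-in html.parser instead of BeautifulSoup so it runs with no extra installs, and the HTML snippet is a simplified, made-up stand-in for the real Instagram markup.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text found inside every <a> tag."""

    def __init__(self):
        super().__init__()
        self._inside_anchor = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_anchor += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_anchor:
            self._inside_anchor -= 1

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>
        if self._inside_anchor:
            self.texts[-1] += data


# Simplified, made-up markup: a <ul> with <span> and <a> tags inside
SAMPLE_HTML = """
<ul>
  <li><span>New sketch! <a href="/explore/tags/art/">#art</a>
  <a href="/explore/tags/ink/">#ink</a></span></li>
  <li><span><a href="/someuser/">someuser</a> lovely work</span></li>
</ul>
"""

collector = AnchorTextCollector()
collector.feed(SAMPLE_HTML)
print(collector.texts)  # ['#art', '#ink', 'someuser']
```

Note that usernames are links too, which is why the link text still has to be filtered for words starting with ‘#’.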

[Image: Instagram Hashtag Analysis - Inspect Element]

driver = webdriver.Chrome()

# Extract description of a post from Instagram link
driver.get('https://www.instagram.com/p/BiRnjDsFKzl/')
soup = BeautifulSoup(driver.page_source,"lxml")
desc = " "

for item in soup.findAll('a'):
    desc= desc + " " + str(item.string)

# Extract tag list from Instagram post description
# Keep words that start with '#', then strip the '#' characters
taglist = [x.strip('#') for x in desc.split() if x.startswith('#')]

# (OR) Copy-paste your tag list manually here
#taglist = ['art', 'instaart', 'iblackwork']

print(taglist)

When we execute this code, ChromeDriver opens a test browser window and loads the post link we passed to driver.get. The code then collects the text of the links in the description and comments, and keeps only the words beginning with a ‘#’. If you’d rather enter your own list than extract hashtags from a post, un-comment the taglist line and write your hashtags there.
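
The hashtag extraction step can also be tried on its own with plain text. The sketch below is an alternative that uses a regular expression rather than split(), so trailing punctuation such as a comma is dropped and repeated tags are kept only once. The function name and the sample caption are made up for illustration.

```python
import re


def extract_hashtags(text):
    """Return the hashtags in `text` without the leading '#'.

    Tags are returned in order of first appearance, without duplicates.
    """
    tags = []
    for tag in re.findall(r"#(\w+)", text):
        if tag not in tags:
            tags.append(tag)
    return tags


print(extract_hashtags("New piece! #art #ink, #art #BlackAndWhiteArt."))
# ['art', 'ink', 'BlackAndWhiteArt']
```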

Part 3 – Loop over hashtags and extract information

Our hashtag list is ready. Now we need to loop over each tag and extract information from its individual page. There are two main data points we’re after – the number of posts in a hashtag and the posting frequency. We’ll load each hashtag page in turn by navigating the Chrome window to the www.instagram.com/explore/tags/tagname page. Then our code does the rest.
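
Building the page address for each tag is simple string work, but tags with accents or stray spaces can trip it up. Here is a small sketch of a helper (the name tag_page_url is my own) that cleans the tag and percent-encodes it before adding it to the explore URL.

```python
from urllib.parse import quote

BASE_URL = "https://www.instagram.com/explore/tags/"


def tag_page_url(tag):
    """Build the explore page URL for a hashtag.

    Leading '#' characters and surrounding spaces are removed, and
    non-ASCII characters are percent-encoded.
    """
    cleaned = tag.strip().lstrip("#")
    return BASE_URL + quote(cleaned, safe="")


print(tag_page_url("#art"))
# https://www.instagram.com/explore/tags/art
```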

# Define dataframe to store hashtag information
tag_df  = pd.DataFrame(columns = ['Hashtag', 'Number of Posts', 'Posting Freq (mins)'])

# Loop over each hashtag to extract information
for tag in taglist:
    
    driver.get('https://www.instagram.com/explore/tags/'+str(tag))
    soup = BeautifulSoup(driver.page_source,"lxml")
    
    # Extract current hashtag name
    tagname = tag
    # Extract total number of posts in this hashtag
    # NOTE: Class name may change in the website code
    # Get the latest class name by inspecting web code
    nposts = soup.find('span', {'class': 'g47SY'}).text
        
    # Extract all post links from 'explore tags' page
    # Needed to extract post frequency of recent posts
    myli = []
    for a in soup.find_all('a', href=True):
        myli.append(a['href'])

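    # Keep link of only 1st and 9th most recent post
    newmyli = [x for x in myli if x.startswith('/p/')]
    del newmyli[:9]   # skip the first 9 post links (presumably the 'Top posts' grid)
    del newmyli[9:]   # keep the next 9 links (the most recent posts)
    del newmyli[1:8]  # keep only the 1st and 9th of those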

    timediff = []

    # Extract the posting time of 1st and 9th most recent post for a tag
    for j in range(len(newmyli)):
        driver.get('https://www.instagram.com'+str(newmyli[j]))
        soup = BeautifulSoup(driver.page_source,"lxml")

        for i in soup.findAll('time'):
            if i.has_attr('datetime'):
                timediff.append(i['datetime'])
                #print(i['datetime'])

    # Calculate time difference between posts
    # For obtaining posting frequency
    datetimeFormat = '%Y-%m-%dT%H:%M:%S.%fZ'
    diff = datetime.datetime.strptime(timediff[0], datetimeFormat)\
        - datetime.datetime.strptime(timediff[1], datetimeFormat)
    # 9 posts are separated by 8 gaps, so divide by 8 for minutes per post
    pfreq = int(diff.total_seconds()/(8*60))
    
    # Add hashtag info to dataframe
    tag_df.loc[len(tag_df)] = [tagname, nposts, pfreq]
        
driver.quit()

So first, we save the hashtag name and the number of posts into variables. The number of posts is easily extracted from the <span> tag in the page, where it is stored as text.
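
Since the number of posts comes out of the page as text, it needs converting before it can be compared with a range like 100K to 500K. Here is a minimal sketch, assuming the text is a plain number with comma separators such as ‘1,234,567’. That format is my assumption, so the helper raises an error for anything else instead of guessing.

```python
def parse_post_count(text):
    """Convert a displayed count such as '1,234,567' to an int.

    Assumes digits with optional comma separators; any other
    format raises ValueError.
    """
    digits = text.replace(",", "").strip()
    if not digits.isdigit():
        raise ValueError(f"unexpected post count format: {text!r}")
    return int(digits)


print(parse_post_count("1,234,567"))  # 1234567
```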

[Image: Instagram Hashtag Analysis - Inspect Element 2]

Next, we extract the data necessary for calculating the posting frequency for a particular hashtag. We’ll calculate the average time difference between the 1st and the 9th most recent post in the hashtag page. To do this, we get all the post URLs in the hashtag page but keep only the 1st and 9th most recent post links. We then visit the individual post pages of these two posts to get the posting time. Finally we take the time difference between the two posts (the diff variable), convert it to an average number of minutes per post and store it in the pfreq variable. The last step is to simply add the variables to the dataframe as a new row. The loop repeats the above steps for each hashtag. The code may take 5+ minutes to execute, based on your connection speed and number of hashtags.
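
Here is the posting frequency arithmetic as a small worked example that runs without a browser. The two timestamps are made up, in the same format as the datetime attribute on a post page. Nine posts are separated by eight gaps, so this sketch divides the total time by the number of posts minus one; the function name is my own.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def minutes_between_posts(newest, oldest, n_posts=9):
    """Average minutes between consecutive posts.

    `newest` and `oldest` are timestamps of the first and last of
    `n_posts` consecutive posts, which are n_posts - 1 gaps apart.
    """
    gap = (datetime.strptime(newest, TIME_FORMAT)
           - datetime.strptime(oldest, TIME_FORMAT))
    return gap.total_seconds() / 60 / (n_posts - 1)


# Made-up timestamps 40 minutes apart: 40 / 8 gaps = 5 minutes per post
print(minutes_between_posts("2018-05-01T10:40:00.000Z",
                            "2018-05-01T10:00:00.000Z"))  # 5.0
```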

Results and Conclusion

Our final output is the dataframe of hashtags, number of posts and posting frequency. You can export the dataframe to a CSV and continue your hashtag analysis in Excel if you wish.

# Check the final dataframe
print(tag_df)

# CSV output for hashtag analysis
tag_df.to_csv('hashtag_list.csv')

We can now easily compare the number of posts on each hashtag, the ratio of popular vs niche tags used and how frequently someone posts on each hashtag. We can compare this with our own Instagram posting strategy and make adjustments as needed.

[Image: Instagram Hashtag Analysis - Output]

I compared the hashtags on my own posts with those of a popular account and found that I was using far too many popular hashtags. I need to niche out a bit: more tags in the range of 100K to 500K posts, with a slower posting frequency.

Limitations

  • The class names of the elements in the website seem to keep changing. I’m not sure why this happens, but you need to make sure that the class name in the code matches the one in the current website code.
  • Scraping may or may not be legal, depending on the country and the website. Check the applicable laws before you scrape.
  • This strategy is not guaranteed to increase your Instagram engagement. Many other factors, like content and time of posting, should also be considered.

Get The Code

The latest code is available on my GitHub profile. Do you think I can make the code more efficient? Do let me know! I’m still trying to get better at Python and every little bit helps.

So that’s it. We’ve now created our own Instagram hashtag research tool. If you found this analysis useful, please share this post and consider buying me a coffee. I’m working to produce one such in-depth analysis article every 2 weeks and your support helps!