Use Python To Scrape & Visualize Likes On Your LinkedIn Posts
Blog: Think Data Analytics Blog
LinkedIn doesn’t offer content analytics beyond the stats for each individual post. What if I want to see the performance of my posts (likes, views, comments) over the past month? I wrote a Python program that visualizes whether my number of likes went up or down over the past year, and here is what I got:
Below is a video showcasing what the program does in general. If you wish to jump right to the Python script, see this repository on my GitHub; beginners may read this article for a step-by-step guide.
Can I scrape and visualize content performance on my own?
Although I share this program with the community, it is, unfortunately, just a piece of code that you cannot run without a programming environment. But you can message me on LinkedIn and I will send you your results.
List of Techniques Utilized:
Selenium to navigate to your LinkedIn
Selenium is a portable framework for testing web applications that supports multiple programming languages, including Python. For our purposes, it is a Python library whose methods let us navigate Google Chrome and other browsers (Note: I used Chrome). Selenium also lets you simulate scrolling down the page, which we need in order to get information on more than just the last 5 posts.
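As a minimal sketch of that scrolling step, assuming you have chromedriver installed for Selenium to drive Chrome (the activity-page URL and the number of scrolls below are illustrative):

```python
# Sketch: open a LinkedIn activity page and scroll so older posts lazy-load.
import time

SCROLL_SCRIPT = "window.scrollTo(0, document.body.scrollHeight);"

def scroll_to_bottom(driver, pauses=5, delay=2.0):
    """Scroll down `pauses` times, waiting `delay` seconds each time
    so the page has a chance to render more posts."""
    for _ in range(pauses):
        driver.execute_script(SCROLL_SCRIPT)
        time.sleep(delay)

if __name__ == "__main__":
    # Requires `pip install selenium` and chromedriver on your PATH.
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.get("https://www.linkedin.com/in/your-profile/recent-activity/")
    scroll_to_bottom(driver)
```

Each pause matters: LinkedIn only loads more posts after the previous batch has rendered, so scrolling without a delay would stop too early.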
Beautiful Soup to scrape HTML contents & collect data
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Open your LinkedIn and right-click on any element of the page. Then, click “Inspect”. You will see the HTML contents of the webpage. Those contents (tags) contain information about likes, views, and comments for each post.
Here is an example of what we are looking for (likes):
Thanks to Beautiful Soup, there is a way to automate the collection of such data. We specify the tags that correspond to the like, view, or comment containers of the HTML “soup” and pull those pieces of HTML. But that’s not all.
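A minimal sketch of that extraction step; the class name below is illustrative, so inspect your own page to find the tag and class LinkedIn currently uses for its reaction counters:

```python
# Sketch: pull like counts out of page HTML with Beautiful Soup.
from bs4 import BeautifulSoup

# Stand-in HTML with an assumed class name; real markup will differ.
html = """
<ul>
  <li><span class="social-counts-reactions__count">643</span></li>
  <li><span class="social-counts-reactions__count">1,204</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
likes_raw = [span.get_text(strip=True)
             for span in soup.find_all("span", class_="social-counts-reactions__count")]
print(likes_raw)  # raw strings; note the comma is still in "1,204"
```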
We need to recognize the right numbers. For example, we need to recognize not only ‘643’ likes within the HTML soup contents, but also ‘1,643’. The comma is a problem here. We need to pull numbers with commas as well, then strip the comma and save the number as an integer. Regular expressions to the rescue:
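One way to sketch that with Python’s `re` module (the helper name is mine, not from the original script):

```python
# Sketch: match numbers with or without thousands separators,
# strip the commas, and convert to integers.
import re

def parse_counts(text):
    """Find numbers like '643' or '1,643' in `text` and return them as ints."""
    return [int(match.replace(",", "")) for match in re.findall(r"\d[\d,]*", text)]

print(parse_counts("643 likes and 1,643 views"))  # [643, 1643]
```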
Remove outliers & replace them with medians
Let’s say you normally receive, on average, 20 likes per post. One time, you got 1000 likes. If we keep this data point with the 1000 likes, our final graph will look ugly and the general trend will be affected by this random success. What we need is a stable trend that we can then analyze with a graph. We define an outlier as a data point that is more than 3 standard deviations away from the mean (average) value.
So we get rid of the outliers for each list: likes, views, and comments. Then, we replace each outlier with a median value (because median is less impacted by the outliers) of the entire list. The code looks like this:
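A minimal version of that step might look like this (the sample numbers are illustrative; note the outlier test needs enough data points, since in a very short list a single extreme value inflates the standard deviation so much it never exceeds 3 standard deviations):

```python
# Sketch: replace any point more than 3 standard deviations
# from the mean with the median of the list.
import statistics

def replace_outliers(values):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    med = statistics.median(values)
    return [med if abs(v - mean) > 3 * std else v for v in values]

likes = [20, 18, 25, 22, 19, 21, 24, 17, 23, 20, 18, 1000]
print(replace_outliers(likes))  # the 1000 becomes the median, 20.5
```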
Visualize data and save graphs
Not only do we have to visualize our data, but we also add a trend line (regression line) to our graphs and identify whether the slope is positive or negative.
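A sketch of that step with NumPy for the least-squares fit and Matplotlib for the graph (the like counts are sample data, and the plot is skipped gracefully if Matplotlib isn’t installed):

```python
# Sketch: fit a linear trend to like counts and report the slope's sign.
import numpy as np

likes = [18, 22, 19, 25, 24, 28, 27, 31, 30, 35]  # sample data, oldest first
x = np.arange(len(likes))
slope, intercept = np.polyfit(x, likes, 1)  # degree-1 fit = regression line
trend = "growing" if slope > 0 else "falling"
print(f"slope = {slope:.2f} -> your likes are {trend}")

try:
    import matplotlib
    matplotlib.use("Agg")  # render off-screen so no window is needed
    import matplotlib.pyplot as plt

    plt.scatter(x, likes, label="posts")
    plt.plot(x, slope * x + intercept, label=f"trend (slope {slope:.2f})")
    plt.xlabel("post number")
    plt.ylabel("likes")
    plt.legend()
    plt.savefig("likes_trend.png")
except ImportError:
    pass  # matplotlib not installed; the slope is still computed above
```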
Save datasets as CSV files
For most professional visualization projects, we need a separate piece of software, such as Tableau. For that, we need to export our data in order to work with it in Tableau.
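The export can be sketched with the standard-library `csv` module (the filename and sample values are illustrative):

```python
# Sketch: write the collected lists to a CSV file that Tableau can read.
import csv

likes = [20, 18, 25]
views = [310, 295, 400]
comments = [2, 1, 4]

with open("linkedin_stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["post", "likes", "views", "comments"])  # header row
    for i, row in enumerate(zip(likes, views, comments), start=1):
        writer.writerow([i, *row])
```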
Slope: The slope’s sign tells you whether your content performance is growing (positive) or falling (negative). If we are talking about views, a slope of 32.93 means you are likely to get about 33 more views than on your previous post.
Error: This refers to the Normalized Root Mean Square Error, the normalized standard deviation of the residuals, sometimes called the Standard Error of the Estimate. (Note: this is not R squared.) In simple words, the larger it is, the worse. It represents the average distance that the observed values fall from the regression line; conveniently, it tells you how wrong the regression model is on average, in the units of the response variable. Finally, it is “normalized” because NRMSE relates the RMSE to the observed range of the variable.
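The metric described above can be computed in a few lines; this is a sketch with sample numbers, normalizing the RMSE by the observed range as the text describes:

```python
# Sketch: RMSE of the residuals, normalized by the range of the observations.
import math

def nrmse(observed, predicted):
    residuals = [o - p for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return rmse / (max(observed) - min(observed))  # normalize by range

obs = [18, 22, 19, 25, 24]   # actual like counts (sample data)
pred = [19, 20, 22, 23, 25]  # values on the regression line (sample data)
print(nrmse(obs, pred))
```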
I personally use the statistical “rule of 30”: any dataset with fewer than 30 observations is too small to derive conclusions from. Hence, I wouldn’t use this code if I had made fewer than 30 posts.
For myself, I observed that around 25 posts ago (6 months ago) I started to receive more reactions, which made my trend line clearly positive over the past year. A good analyst would then ask, “Why is that?” I went back 6 months in my feed and noticed that I had almost stopped making and posting videos on LinkedIn, switching to stories and plain text instead.
Does that mean video posts perform worse on LinkedIn? No. To find an explanation, I would have to conduct another round of data collection to determine whether one variable (likes/views) depends on another (format). And that’s another story.