Blog Posts

How to Query Your Pandas Dataframe

Blog: Think Data Analytics Blog

Introduction

Whether you are transitioning from a data engineer/data analyst or wanting to become a more efficient data scientist, querying your dataframe can prove to be quite a useful method of returning specific rows that you want. It is important to note that there is a specific query function for pandas, appropriately named, query. However, I will instead be discussing the other ways that you can mimic querying, filtering, and merging your data. We will present common scenarios or questions that you would ask to your data, and rather than SQL, we will do it with Python. In the paragraphs below, I will outline some simple ways of querying rows for your pandas dataframe with the Python programming language.

Multiple Conditions


Sample data. Screenshot from Author [2].

As data scientists or data analysts, we want to return specific rows of data. One of these scenarios is where you want to apply multiple conditions, all in the same line of code. In order to display my example, I have created some fake sample data of a first and last name, as well as their respective gender and birthdate. This data is displayed above in the screenshot.

The example multiple conditions will essentially answer a specific question, just like when you use SQL. The question is, what percent of our data is Male gender OR a person who was born between 2010 and 2021.

Here is the code that will solve that question (there are a few ways to answer this question, but here is my specific way of doing it):

print(“Percent of data who are Males OR were born between 2010 and 2021:”,
 100*round(df[(df[‘Gender’] == ‘M’) | (df[‘Birthdate’] >= ‘2010–01–01’) & 
 (df[‘Birthdate’] <= ‘2021–01–01’)][‘Gender’].count()/df.shape
 [0],4), “%”)

To better visualize this code, I have also included this screenshot of that same code from above, along with the output/result. You can also apply these conditions to return the actual rows instead of getting the fraction or percent of rows out of the total rows.

Conditions code. Screenshot by Author [3].

Here is the order of commands we performed:

As you can see, this code is similar to something you would see in SQL. I personally think it is easier in pandas because it can be less code, while also being able to visually see all the code in one easy spot, without having to scroll up and down (but this format is just my preference).

Merging On Multiple, Specific Columns


Merged dataframe result. Screenshot by Author [4].

We have probably seen how to merge dataframes together in other tutorials, so I wanted to add a unique approach that I have not really seen out there, which is merging on multiple, specific columns. In this scenario, we want to join two dataframes where two fields are shared between them. You could tell that if there are even more columns, this method could be even more useful.

We have our first dataframe, which is df, then we are merging our columns on a second dataframe, df2. Here is that code to achieve our expected result:

merged_df = df.merge(df2, how=’inner’, 
 left_on=cols, 
 right_on=cols
 )

To better visualize this merging and code, I have presented the screenshot below. You see what the second dataframe looks like below, with the First and Last names, just like they are in the first dataframe, but with a new column, Numeric. Then, we have out specific columns that we wanted to merge on, while returning columns GenderBirthdate, and the new Numeric column as well. The columns are a list of columns, which is named cols.

Merging dataframe. Screenshot by Author [5].

As you can see, this way of merging dataframes is a simple way to achieve the same results that you would get from a SQL query.

Summary

 
 
In this tutorial, we saw two common questions or queries that you would perform in SQL, but instead, have performed them with pandas dataframes in Python.

To summarize, here are the two scenarios we worked with:

  1. Returning the percent of rows out of the total dataset from multiple conditions
  2. Merging on multiple, specific columns to return a final dataframe with a new column

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these methods. Why or why not? These can certainly be clarified even further, but I hope I was able to shed some light on some of the ways you can use pandas and Python instead of SQL. Thank you for reading!

Please feel free to check out my profile, Matt Przybyla, and other articles, as well as reach out to me on LinkedIn.

I am not affiliated with these companies.

 
References

[1] Photo by Bruce Hong on Unsplash, (2018)

[2] M. Przybyla, sample data screenshot, (2021)

[3] M. Przybyla, conditions code screenshot, (2021)

[4] M. Przybyla, merged dataframe result screenshot, (2021)

[5] M. Przybyla, merging dataframe screenshot, (2021)

Original Source

The post How to Query Your Pandas Dataframe appeared first on Big Data, Data Analytics, IOT, Software Testing, Blockchain, Data Lake – Submit Your Guest Post.

Leave a Comment

Get the BPI Web Feed

Using the HTML code below, you can display this Business Process Incubator page content with the current filter and sorting inside your web site for FREE.

Copy/Paste this code in your website html code:

<iframe src="https://www.businessprocessincubator.com/content/how-to-query-your-pandas-dataframe/?feed=html" frameborder="0" scrolling="auto" width="100%" height="700">

Customizing your BPI Web Feed

You can click on the Get the BPI Web Feed link on any of our page to create the best possible feed for your site. Here are a few tips to customize your BPI Web Feed.

Customizing the Content Filter
On any page, you can add filter criteria using the MORE FILTERS interface:

Customizing the Content Filter

Customizing the Content Sorting
Clicking on the sorting options will also change the way your BPI Web Feed will be ordered on your site:

Get the BPI Web Feed

Some integration examples

BPMN.org

XPDL.org

×