What You Must Know About Pandas: A Hands-On Scenario


If you had asked me five years ago what the best way to handle and manipulate data was, I would probably have answered: a database or Excel. A meeting with my friend Elad.P (I disguised his real name for modesty reasons) set me on a journey into the world of data science. I quickly concluded that, in the ongoing debate over how to delineate the roles, responsibilities, and capabilities of data engineers and data scientists, there is one common overlap: data handling skills. Whether you are a data engineer or a data scientist, the job cannot be done without them. In other words: you must know Pandas.

I will demonstrate several important data preprocessing abilities using a students-and-grades example. Of course, this is just an example: the data set is too small to be meaningful and may not make sense, but the concepts of handling data with Pandas are what matter.
I will then try to show the power of Pandas with a simple use case, in which a data set is taken and enriched through a Pandas pipeline:

  • Pandas as an object (Concept)
  • Create new fields
  • Enrich the data
  • Group by
  • Rename columns
  • Data merging
  • Order columns
  • Questioning the data

The (conceptual) ability to pass data around like any other object, process it, and hand it back ready for model preparation is simply staggering.

Let's start with general settings. I like to raise the number of rows and columns that are displayed (for example, pd.set_option('display.max_columns', 500)), because Jupyter, which I use, truncates the view by default.

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
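# Global parameters used later in the walkthrough.
intSizeColToCalc = 8   # end position (exclusive) of the score columns to average
intRankForBonus = 80   # average above which a student gets the larger bonus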

I also set two global parameters to be used later. The first step is to build a Data Frame that contains eight fields. The first is the ID of the school, the second is the student's name, and the rest are the names of the subjects. The resulting Data Frame covers a number of study subjects and several educational institutions.

# Build the example frame: school ID, student name, and six subject scores.
# np.random.randint excludes the upper bound, so 101 allows a score of 100.
# 'Name' is a regular column because later steps select it by that name.
df = pd.DataFrame({
    'School Id': np.random.randint(1, 5, size=6),
    'Name': ['OrenA', 'Uzish', 'YanivP', 'Aviel', 'Eran Lev Anank', 'Asraf'],
    'English': np.random.randint(50, 100, size=6),
    'History': np.random.randint(50, 101, size=6),
    'Mathematics': np.random.randint(50, 101, size=6),
    'Sports': np.random.randint(50, 101, size=6),
    'Music': np.random.randint(50, 100, size=6),
    'Biology': np.random.randint(50, 100, size=6),
})
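Because the scores are random, every run produces different numbers. If you want to follow along with stable values, a seeded generator is one option. The following is only a sketch: the seed value 42 and the names rng, names, subjects, and df_seeded are my own assumptions, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) so the "random" scores repeat between runs.
rng = np.random.default_rng(42)

names = ['OrenA', 'Uzish', 'YanivP', 'Aviel', 'Eran Lev Anank', 'Asraf']
subjects = ['English', 'History', 'Mathematics', 'Sports', 'Music', 'Biology']

# Generator.integers excludes the upper bound by default, so 101 allows a score of 100.
scores = rng.integers(50, 101, size=(len(names), len(subjects)))

# One row per student, one column per subject, then the two identifying columns in front.
df_seeded = pd.DataFrame(scores, columns=subjects)
df_seeded.insert(0, 'Name', names)
df_seeded.insert(0, 'School Id', rng.integers(1, 5, size=len(names)))

print(df_seeded.shape)  # (6, 8)
```

The frame has the same eight fields as the one in the article, so the rest of the steps should apply to it unchanged.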

I then display the data (the scores are randomly generated, so your values will differ):

display(df)

The data structure will be presented to us:

We pass the whole Data Frame to a function that calculates the average score per row and returns the frame with an additional column holding that average.

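def get_calc_row_mean(sPs):
    # Average the score columns in positions 2 up to intSizeColToCalc for each row.
    # mean(axis=1) already works row by row, so no explicit loop is needed.
    score_cols = list(sPs.columns[2:intSizeColToCalc])
    sPs['Mean_Student_Rank'] = sPs[score_cols].mean(axis=1).round(1)
    return sPs

sPs = get_calc_row_mean(df)
display(sPs)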
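Slicing columns by position depends on the exact column order of the frame. A slightly more defensive variant selects the subject columns by name. This is a minimal sketch on a tiny hand-made frame; demo and subject_cols are hypothetical names used only for illustration.

```python
import pandas as pd

# Small hand-made frame with fixed scores so the result is easy to check.
demo = pd.DataFrame({
    'School Id': [1, 1, 2],
    'Name': ['A', 'B', 'C'],
    'English': [80, 90, 70],
    'History': [60, 75, 85],
    'Mathematics': [70, 90, 100],
})

# Name the score columns explicitly instead of relying on their position.
subject_cols = ['English', 'History', 'Mathematics']

# axis=1 averages across the selected columns for each row.
demo['Mean_Student_Rank'] = demo[subject_cols].mean(axis=1).round(1)

print(demo[['Name', 'Mean_Student_Rank']])
```

Naming the columns keeps the calculation correct even if a column is added or moved later.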

The output of display(sPs) should look like this:

Using the same principle, we add a function that gives every student a bonus: 10 points for students whose average is above 80 and 5 points for everyone else (as you will note, we think everyone deserves a bonus). The threshold comes from the global parameter intRankForBonus:

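def get_calc_row_bonus(sPs):
    # Decide the bonus row by row, then assign the whole column once.
    # (Assigning the column inside the loop would overwrite every row each time.)
    bonuses = []
    for index, row in sPs.iterrows():
        if row['Mean_Student_Rank'] > intRankForBonus:
            bonuses.append(round(row['Mean_Student_Rank'] + 10, 1))
        else:
            bonuses.append(round(row['Mean_Student_Rank'] + 5, 1))
    sPs['bonus'] = bonuses
    return sPs

sPs = get_calc_row_bonus(sPs)
display(sPs.sort_values(by='bonus', ascending=False))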

Pay attention to the loop and to how the new value is placed in the column. Working through iterrows() is a good way to understand how per-row logic is applied, although for large frames a vectorized expression is usually faster.
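For comparison, the same bonus rule can be written without a loop. The sketch below uses np.where on a small hand-made frame; RANK_FOR_BONUS, demo, and extra are hypothetical names, with the constant mirroring intRankForBonus.

```python
import numpy as np
import pandas as pd

RANK_FOR_BONUS = 80  # assumption: mirrors intRankForBonus from the article

demo = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Mean_Student_Rank': [70.0, 85.5, 80.0],
})

# np.where evaluates the condition for every row at once:
# 10 extra points above the threshold, 5 extra points otherwise.
extra = np.where(demo['Mean_Student_Rank'] > RANK_FOR_BONUS, 10, 5)
demo['bonus'] = (demo['Mean_Student_Rank'] + extra).round(1)

print(demo)
```

Note that a student with an average of exactly 80 gets the smaller bonus, because the comparison is strictly greater than the threshold.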

Now we build another Data Frame, based on the previous one, that holds the average score of each subject per school.

def calc_school_mean_by_profession(sPs):
    # Average the numeric columns per school (numeric_only skips text columns such as names).
    sPsGroup = sPs.groupby('School Id').mean(numeric_only=True).round(1)
    # Mark each subject column as a school-level average.
    sPsGroup = sPsGroup.rename(columns={
        "English": "English_avg", "History": "History_avg",
        "Mathematics": "Mathematics_avg", "Sports": "Sports_avg",
        "Music": "Music_avg", "Biology": "Biology_avg"})
    # The per-student columns are not needed at school level.
    sPsGroup = sPsGroup.drop(columns=['Mean_Student_Rank', 'bonus'])
    return sPsGroup

sPsGroup = calc_school_mean_by_profession(sPs)
display(sPsGroup)
display(sPs)

Notice the method I chose: a group-by aggregates the Data Frame into one row per school, so I store the result in a new Data Frame, rename its columns, and merge it with the original one. There are other good ways to do this.
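One of those other options, sketched here under my own assumptions rather than as the author's method, is groupby(...).transform('mean'). It returns one value per original row, so the school averages can be attached without a separate merge. The names demo, subject_cols, and school_avg are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    'School Id': [1, 1, 2],
    'Name': ['A', 'B', 'C'],
    'English': [80, 90, 70],
    'History': [60, 70, 85],
})

subject_cols = ['English', 'History']

# transform('mean') keeps the original row count and index,
# repeating each school's average on every row of that school.
school_avg = demo.groupby('School Id')[subject_cols].transform('mean').round(1)

# join aligns on the index, so each student gets the averages of their own school.
demo = demo.join(school_avg.add_suffix('_avg'))

print(demo)
```

The group-by plus merge route keeps the school-level table around as its own object, which is handy if you also want to inspect it; transform is shorter when you only need the enriched rows.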

The two Data Frames displayed earlier, sPsGroup and sPs, are the ones we merge into one.

To complete the data preparation, we add to each student the subject averages of that student's school.

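# Attach each school's subject averages to its students; 'School Id' is the shared key.
# reset_index() turns the 'School Id' index of sPsGroup back into a regular column.
sPs = sPs.merge(sPsGroup.reset_index(), on='School Id')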
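When merging a per-student table with a per-school table, it can help to let pandas check the relationship for you. The sketch below uses the validate and indicator parameters of DataFrame.merge on small hand-made frames; students, school_avg, and merged are hypothetical names.

```python
import pandas as pd

students = pd.DataFrame({
    'School Id': [1, 1, 2],
    'Name': ['A', 'B', 'C'],
    'History': [60, 70, 85],
})
school_avg = pd.DataFrame({
    'School Id': [1, 2],
    'History_avg': [65.0, 85.0],
})

merged = students.merge(
    school_avg,
    on='School Id',
    how='left',               # keep every student, even if a school has no averages
    validate='many_to_one',   # raises MergeError if a school appears twice on the right
    indicator=True,           # adds a _merge column: 'both', 'left_only' or 'right_only'
)

print(merged)
```

If any row shows left_only in the _merge column, that student's school is missing from the averages table, which is worth checking before the data goes to a model.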

Suppose we want to move on to the model and want our prediction column to be last. The data could be left as it is, but for the sake of convenience and cleanliness I prefer to rearrange the columns. This is how it is done:

# Select the columns in the desired order; the prediction column comes last.
column_order = [
    'School Id',
    'Name',
    'English',
    'History',
    'Mathematics',
    'Sports',
    'Music',
    'Biology',
    'bonus',
    'English_avg',
    'History_avg',
    'Mathematics_avg',
    'Sports_avg',
    'Music_avg',
    'Biology_avg',
    'Mean_Student_Rank',
]
sPs = sPs[column_order]
display(sPs)
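If the only goal is to put the prediction column last, listing every column is not strictly necessary. The helper below is a hypothetical sketch (move_column_last is not a pandas function) that keeps the other columns in their existing order.

```python
import pandas as pd

def move_column_last(frame, column):
    """Return a new frame with `column` moved to the last position."""
    others = [c for c in frame.columns if c != column]
    return frame[others + [column]]

demo = pd.DataFrame({
    'Mean_Student_Rank': [70.0],
    'Name': ['A'],
    'bonus': [75.0],
})

demo = move_column_last(demo, 'Mean_Student_Rank')
print(list(demo.columns))  # ['Name', 'bonus', 'Mean_Student_Rank']
```

The explicit list used in the article is still the better choice when you want full control over the position of every column.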

The new column order of sPs is as follows:

Now you can query the data a bit, perform more manipulations, and make sure the result respects principle #1: the way your data is arranged reflects how your model will perform.

display(sPs.loc[sPs['History'] > 67.0, ['Name', 'History', 'History_avg']])

In this case, as a history buff, I want to know which students (by name) scored above 67 in History, with their school's History average shown alongside for comparison. The displayed result lists each matching student's name, History score, and school average.
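The same kind of question can also be phrased with DataFrame.query, which some people find easier to read. This is a sketch on a small hand-made frame; demo, threshold, above_threshold, and above_school_avg are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'History': [60, 70, 85],
    'History_avg': [65.0, 65.0, 85.0],
})

threshold = 67.0

# @threshold refers to the Python variable defined above.
above_threshold = demo.query('History > @threshold')[['Name', 'History', 'History_avg']]

# Columns can be compared with each other: students who beat their school's average.
above_school_avg = demo.query('History > History_avg')[['Name', 'History', 'History_avg']]

print(above_threshold)
print(above_school_avg)
```

Comparing a student's score with the school average is exactly the kind of question the enrichment steps make possible.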

To conclude: information is the most important thing in the world of data engineering and data science. I cannot imagine someone working in this field without understanding this domain. That means knowing the data, and knowing how to handle it and prepare it for the model. Pandas is the main framework that allows you to play with data and shape it.

Husband | Father | Data Scientist with passion | Blogger | Bassist | Dreaming & try to apply | Copy-paste Python programmer

Oren Atia
