You've seen an example of one way to manipulate data but pandas has many more. How many more? Put it this way, if you can imagine it, chances are, pandas can do it.
Let's start with string methods. Because pandas is based on Python, however you can manipulate strings in Python, you can do the same in pandas.
You can access the string value of a column using .str. Knowing this, how do you think you'd set a column to lowercase?import pandas as pd car_details = pd.DataFrame({ "Make" : pd.Series(["Toyota", "Toyota", "Nissan","Honda", "Toyota"]), "Colour": pd.Series(["White", "Blue", "White","Blue", "White"]), "Odometer (KM)": pd.Series([150043, 32549, 213095, 45698, 60000]), "Doors" : pd.Series([4, 3, 4, 4, 4]), "Price" : pd.Series(["$4,000.00", "$7,000.00", "$3,500.00","$7,500.00", "$6,250.00"]) }) print(car_details) |
Make Colour Odometer (KM) Doors Price 0 Toyota White 150043 4 $4,000.00 1 Toyota Blue 32549 3 $7,000.00 2 Nissan White 213095 4 $3,500.00 3 Honda Blue 45698 4 $7,500.00 4 Toyota White 60000 4 $6,250.00 |
Lower the Make column
print(car_details["Make"].str.lower()) |
0 toyota 1 toyota 2 nissan 3 honda 4 toyota Name: Make, dtype: object |
Notice how it doesn't change the values of the original car_sales DataFrame unless we set it equal to.
print(car_details) |
Make Colour Odometer (KM) Doors Price 0 Toyota White 150043 4 $4,000.00 1 Toyota Blue 32549 3 $7,000.00 2 Nissan White 213095 4 $3,500.00 3 Honda Blue 45698 4 $7,500.00 4 Toyota White 60000 4 $6,250.00 |
let do it now
car_details["Make"] = car_details["Make"].str.lower() print(car_details) |
Make Colour Odometer (KM) Doors Price 0 toyota White 150043 4 $4,000.00 1 toyota Blue 32549 3 $7,000.00 2 nissan White 213095 4 $3,500.00 3 honda Blue 45698 4 $7,500.00 4 toyota White 60000 4 $6,250.00 |
Reassigning the column changes it in the original DataFrame. This trend occurs throughout all kinds of data manipulation with pandas.
Some functions have a parameter called inplace which means a DataFrame is updated in place without having to reassign it.
Let's see what it looks like in combination with .fillna(), a function which fills missing data. But the thing is, our table isn't missing any data.
In practice, it's likely you'll work with datasets which aren't complete. What this means is you'll have to decide whether how to fill the missing data or remove the rows which have data missing.
Let's check out what a version of our car_sales DataFrame might look like with missing values.
import pandas as pd with pd.ExcelFile("E:\\Input_File.xlsx") as reader: Car_Detail_df = pd.read_excel(reader) print(Car_Detail_df) |
Make Color Odometer (KM) Doors Price Date Launched 0 Toyota White NaN 4 4000 28/02/2000 1 Toyota NaN 32549.0 3 7000 28/02/2000 2 Nissan White NaN 4 3500 22/09/2005 3 Honda NaN 45698.0 4 7500 22/09/2005 4 Toyota White 60000.0 4 6250 28/02/2000 |
Missing values are shown by NaN in pandas. This can be considered the equivalent of None in Python.
Let's use the .fillna() function to fill the Odometer column with the average of the other values in the same column. Fill Odometer column missing values with mean.
We'll do it with and without inplace.
Car_Detail_df["Odometer (KM)"].fillna(Car_Detail_df["Odometer (KM)"].mean(), inplace=False) # inplace is set to False by default |
Now let's check the original car_sales_missing DataFrame.
print(Car_Detail_df) |
Make Color Odometer (KM) Doors Price Date Launched 0 Toyota White NaN 4 4000 28/02/2000 1 Toyota NaN 32549.0 3 7000 28/02/2000 2 Nissan White NaN 4 3500 22/09/2005 3 Honda NaN 45698.0 4 7500 22/09/2005 4 Toyota White 60000.0 4 6250 28/02/2000 |
Because inplace is set to False (default), there's still missing values in the "Odometer" column. Let's try setting inplace to True.
Car_Detail_df["Odometer (KM)"].fillna(Car_Detail_df["Odometer (KM)"].mean(), inplace=True) # inplace is set to True print(Car_Detail_df) |
Make Color Odometer (KM) Doors Price Date Launched 0 Toyota White 46082.333333 4 4000 28/02/2000 1 Toyota NaN 32549.000000 3 7000 28/02/2000 2 Nissan White 46082.333333 4 3500 22/09/2005 3 Honda NaN 45698.000000 4 7500 22/09/2005 4 Toyota White 60000.000000 4 6250 28/02/2000 |
The missing values in the Odometer column have been filled with the mean value of the same column.
In practice, you might not want to fill a column's missing values with the mean, but this example was to show the difference between inplace=False (default) and inplace=True.
Whichever you choose to use will depend on how you structure your code. All you have to remember is inplace=False returns a copy of the DataFrame you're working with. This is helpful if you want to make a duplicate of your current DataFrame and save it to another variable.
We've filled some values but there's still missing values in car_sales_missing. Let's say you wanted to remove any rows which had missing data and only work with rows which had complete coverage.
The following two lines do the same thing
Car_Detail_df.dropna(inplace=True) Car_Detail_df = Car_Detail_df.dropna() print(Car_Detail_df) |
Now if check again, the rows with missing values are gone and the index numbers have been updated.
Make Color Odometer (KM) Doors Price Date Launched 0 Toyota White 46082.333333 4 4000 28/02/2000 2 Nissan White 46082.333333 4 3500 22/09/2005 4 Toyota White 60000.000000 4 6250 28/02/2000 |
Instead of removing or filling data, what if you wanted to create it?
For example, creating a column called Seats for number of seats.
pandas allows for simple extra column creation on DataFrame's. Three common ways are adding a Series, Python list or by using existing columns.
seats_column = pd.Series([5, 5, 5], index=[0,2,4]) Car_Detail_df["Seats"] = seats_column print(Car_Detail_df) |
Make Color Odometer (KM) Doors Price Date Launched Seats 0 Toyota White 46082.333333 4 4000 28/02/2000 5 2 Nissan White 46082.333333 4 3500 22/09/2005 5 4 Toyota White 60000.000000 4 6250 28/02/2000 5 |
You can also make a column by directly combining the values of other columns. Such as, price per kilometre on the Odometer.
Car_Detail_df["Price per KM"] = Car_Detail_df["Price"] / Car_Detail_df["Odometer (KM)"] print(Car_Detail_df) |
Make Color Odometer (KM) ... Date Launched Seats Price per KM 0 Toyota White 46082.333333 ... 28/02/2000 5 0.086801 2 Nissan White 46082.333333 ... 22/09/2005 5 0.075951 4 Toyota White 60000.000000 ... 28/02/2000 5 0.104167 [3 rows x 8 columns] |
As for column creation, you can also create a new column setting all values to a one standard value.
Car_Detail_df["Number of wheels"] = 4 print(Car_Detail_df) |
Make Color Odometer (KM) ... Seats Price per KM Number of wheels 0 Toyota White 46082.333333 ... 5 0.086801 4 2 Nissan White 46082.333333 ... 5 0.075951 4 4 Toyota White 60000.000000 ... 5 0.104167 4 [3 rows x 9 columns] |
Now you've created some columns, you decide to show your colleague what you've done. When they ask about the Price per KM column, you tell them you're not really sure why it's there. You decide you better remove it to prevent confusion.
You can remove a column using .drop('COLUMN_NAME', axis=1). Why axis=1? Because that's the axis columns live on. Rows live on axis=0. Now, drop the Price per KM column from the dataframe.
Car_Detail_df = Car_Detail_df.drop("Price per KM", axis=1) print(Car_Detail_df) |
Make Color Odometer (KM) ... Date Launched Seats Number of wheels 0 Toyota White 46082.333333 ... 28/02/2000 5 4 2 Nissan White 46082.333333 ... 22/09/2005 5 4 4 Toyota White 60000.000000 ... 28/02/2000 5 4 |
Let's say you wanted to shuffle the order of your DataFrame so you could split it into train, validation and test sets. And even though the order of your samples was random, you wanted to make sure.
To do so you could use .sample(frac=1).
.sample() randomly samples different rows from a DataFrame. The frac parameter dictates the fraction, where 1 = 100% of rows, 0.5 = 50% of rows, 0.01 = 1% of rows.
car_sales_sampled = Car_Detail_df.sample(frac=1) print(car_sales_sampled) |
Make Color Odometer (KM) ... Date Launched Seats Number of wheels 4 Toyota White 60000.000000 ... 28/02/2000 5 4 2 Nissan White 46082.333333 ... 22/09/2005 5 4 0 Toyota White 46082.333333 ... 28/02/2000 5 4 |
Notice how the rows remain intact but their order is mixed (check the indexes).
.sample(frac=X) is also helpful when you're working with a large DataFrame. Say you had 2,000,000 rows.
Running tests, analysis and machine learning algorithms on 2,000,000 rows could take a long time. And since being a data scientist or machine learning engineer is about reducing the time between experiments, you might begin with a sample of rows first.
For example, you could use 40k_rows = 2_mil_rows.sample(frac=0.05) to work on 40,000 rows from a DataFrame called 2_mil_rows containing 2,000,000 rows.
What if you wanted to get the indexes back in order?. You could do so using .reset_index().
car_sales_sampled.reset_index(inplace=True) print(car_sales_sampled) |
index Make Color ... Date Launched Seats Number of wheels 0 0 Toyota White ... 28/02/2000 5 4 1 2 Nissan White ... 22/09/2005 5 4 2 4 Toyota White ... 28/02/2000 5 4 [3 rows x 9 columns] |
Calling .reset_index() on a DataFrame resets the index numbers to their defaults. It also creates a new Index column by default which contains the previous index values.
Finally, what if you wanted to apply a function to a column. Such as, converting the Odometer column from kilometers to miles.
You can do so using the .apply() function and passing it a lambda function. We know there's about 1.6 kilometers in a mile, so if you divide the value in the Odometer column by 1.6, it should convert it to miles.
print(car_sales_sampled["Odometer (KM)"]) car_sales_sampled["Odometer (KM)"] = car_sales_sampled["Odometer (KM)"].apply(lambda x: x / 1.6) print(car_sales_sampled["Odometer (KM)"]) |
print(car_sales_sampled["Odometer (KM)"]) car_sales_sampled["Odometer (KM)"] = car_sales_sampled["Odometer (KM)"].apply(lambda x: x / 1.6) print(car_sales_sampled["Odometer (KM)"]) |
Now let's check our final car_sales_sampled DataFrame.
print(car_sales_sampled) |
index Make Color ... Date Launched Seats Number of wheels 0 0 Toyota White ... 28/02/2000 5 4 1 4 Toyota White ... 28/02/2000 5 4 2 2 Nissan White ... 22/09/2005 5 4 [3 rows x 9 columns] |
Since pandas is such a large library, it would be impossible to cover it all in one go.
The following are some resources you might want to look into for more.Data Manipulation with Pandas (section of Python Data Science Handbook by Jake VanderPlas) - a very hands-on approach to many of the main functions in the pandas library.
10-minute introduction to pandas.
Pandas getting started tutorial
Pandas essential basic functionality
If you have any doubts or queries related to this chapter, get them clarified from our Python Team experts on ibmmainframer Community!