Describing a DataFrame
One of the first things you'll want to do after importing data into a pandas DataFrame is explore it. Pandas has many built-in attributes and methods that let you quickly get information about a DataFrame.
Let's explore some of them using the car_details DataFrame.
import pandas as pd
car_details = pd.DataFrame({
    "Make": pd.Series(["Toyota", "Toyota", "Nissan", "Honda", "Toyota"]),
    "Colour": pd.Series(["White", "Blue", "White", "Blue", "White"]),
    "Odometer (KM)": pd.Series([150043, 32549, 213095, 45698, 60000]),
    "Doors": pd.Series([4, 3, 4, 4, 4]),
    "Price": pd.Series(["$4,000.00", "$7,000.00", "$3,500.00", "$7,500.00", "$6,250.00"]),
})
print(car_details)
Output:
     Make Colour  Odometer (KM)  Doors      Price
0  Toyota  White         150043      4  $4,000.00
1  Toyota   Blue          32549      3  $7,000.00
2  Nissan  White         213095      4  $3,500.00
3   Honda   Blue          45698      4  $7,500.00
4  Toyota  White          60000      4  $6,250.00
.dtypes shows us what datatype each column contains.
print(car_details.dtypes)
Output:
Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price            object
dtype: object
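The Price column is reported as object because its values are text containing "$" and ",". As a minimal sketch (one possible approach, not part of the original walkthrough), the strings could be cleaned and converted to floats so that numeric methods work on them. The variable names prices and prices_numeric are illustrative.

```python
import pandas as pd

# The same price strings used in car_details["Price"].
prices = pd.Series(["$4,000.00", "$7,000.00", "$3,500.00", "$7,500.00", "$6,250.00"])
print(prices.dtype)  # a text dtype, not a number

# Strip "$" and "," with a regular expression, then convert to float.
prices_numeric = prices.str.replace(r"[$,]", "", regex=True).astype(float)
print(prices_numeric.dtype)   # float64
print(prices_numeric.mean())  # 5650.0
```

After a conversion like this, the column would show up as float64 in .dtypes and be included in numeric summaries.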
.describe() gives you a quick statistical overview of the numerical columns.
print(car_details.describe())
Output:
       Odometer (KM)     Doors
count       5.000000  5.000000
mean   100277.000000  3.800000
std     78090.879483  0.447214
min     32549.000000  3.000000
25%     45698.000000  4.000000
50%     60000.000000  4.000000
75%    150043.000000  4.000000
max    213095.000000  4.000000
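By default .describe() skips text columns when numeric ones are present. As a rough sketch, calling it on a selection that contains only text columns gives a different summary (count, unique, top, freq). The trimmed-down DataFrame and the name text_summary below are assumptions for illustration.

```python
import pandas as pd

# A reduced copy of car_details, enough to show the idea.
car_details = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Honda", "Toyota"],
    "Colour": ["White", "Blue", "White", "Blue", "White"],
    "Doors": [4, 3, 4, 4, 4],
})

# With only text columns selected, describe() reports
# count, unique, top (most common value) and freq (how often top occurs).
text_summary = car_details[["Make", "Colour"]].describe()
print(text_summary)
```

Here "Toyota" and "White" are the most common values, each appearing 3 times.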
.info() shows a handful of useful details about a DataFrame, such as:
How many entries (rows) there are
Whether there are missing values (if a column's non-null count is less than the number of entries, it has missing values)
The datatype of each column
car_details.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Make           5 non-null      object
 1   Colour         5 non-null      object
 2   Odometer (KM)  5 non-null      int64
 3   Doors          5 non-null      int64
 4   Price          5 non-null      object
dtypes: int64(2), object(3)
memory usage: 328.0+ bytes
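The car data here has no missing values, so the non-null counts all equal the number of entries. To see the difference, here is a small sketch using hypothetical data (cars_with_gap is made up for this example) with one missing odometer reading, plus .isna().sum() as a direct way to count missing values per column.

```python
import pandas as pd

# Hypothetical data: the second car has no odometer reading.
cars_with_gap = pd.DataFrame({
    "Make": ["Toyota", "Honda", "Nissan"],
    "Odometer (KM)": [150043, None, 213095],
})

# info() prints its report directly; Odometer (KM) shows 2 non-null out of 3 entries.
cars_with_gap.info()

# Count the missing values in each column.
missing_per_column = cars_with_gap.isna().sum()
print(missing_per_column)
```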
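You can also call various statistical and mathematical methods such as .mean() or .sum() directly on a DataFrame or Series.
Calling .mean() on a DataFrame
# numeric_only=True skips the text columns (Make, Colour, Price); recent pandas versions raise a TypeError without it
print(car_details.mean(numeric_only=True))
Output: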
Odometer (KM)    100277.0
Doors                 3.8
dtype: float64
Calling .mean() on a Series
car_prices = pd.Series([3000, 3500, 11250])
print(car_prices.mean())
Output:
5916.666666666667
Calling .sum() on a DataFrame
print(car_details.sum())
Output:
Make                             ToyotaToyotaNissanHondaToyota
Colour                                 WhiteBlueWhiteBlueWhite
Odometer (KM)                                           501385
Doors                                                       19
Price            $4,000.00$7,000.00$3,500.00$7,500.00$6,250.00
dtype: object
Calling .sum() on a Series
car_prices = pd.Series([3000, 3500, 11250])
print(car_prices.sum())
Output:
17750
Calling these on a whole DataFrame may not be as helpful as targeting an individual column. But it's helpful to know they're there.
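As a short sketch of targeting an individual column, selecting a column with square brackets returns a Series, so .mean() and .sum() give a single number. The variable names average_km and total_doors are illustrative.

```python
import pandas as pd

# A reduced copy of car_details with the numeric columns of interest.
car_details = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Honda", "Toyota"],
    "Odometer (KM)": [150043, 32549, 213095, 45698, 60000],
    "Doors": [4, 3, 4, 4, 4],
})

# Selecting one column returns a Series, so each result is a single value.
average_km = car_details["Odometer (KM)"].mean()
total_doors = car_details["Doors"].sum()
print(average_km)   # 100277.0
print(total_doors)  # 19
```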
.columns will show you all the columns of a DataFrame.
print(car_details.columns)
Output:
Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')
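The result of .columns is a pandas Index rather than a plain list. A minimal sketch of two common uses, assuming a two-column DataFrame made up for brevity: converting the names to a list and checking whether a column exists before using it.

```python
import pandas as pd

# A two-column stand-in for car_details.
car_details = pd.DataFrame({
    "Make": ["Toyota", "Honda"],
    "Odometer (KM)": [150043, 45698],
})

# Convert the Index of column names to a plain Python list.
column_names = list(car_details.columns)
print(column_names)  # ['Make', 'Odometer (KM)']

# Check whether a column exists before trying to use it.
has_price = "Price" in car_details.columns
print(has_price)  # False
```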
.index will show you the values in a DataFrame's index (the column on the far left).
print(car_details.index)
Output:
RangeIndex(start=0, stop=5, step=1)
Show the length of a DataFrame
print(len(car_details))
Output:
5
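The length of a DataFrame is its number of rows. As a small sketch, the .shape attribute gives both dimensions at once as a (rows, columns) tuple. The two-column DataFrame below is a cut-down stand-in for car_details.

```python
import pandas as pd

# A cut-down stand-in for car_details: 5 rows, 2 columns.
car_details = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Nissan", "Honda", "Toyota"],
    "Doors": [4, 3, 4, 4, 4],
})

n_rows = len(car_details)                      # number of rows
rows_from_shape, n_cols = car_details.shape    # (rows, columns)
print(n_rows, n_cols)  # 5 2
```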