Pandas is an open source library which helps you analyse and manipulate data.
Pandas provides a simple-to-use but very capable set of functions you can apply to your data.
Pandas is one of the most popular Python libraries for data analysis. It offers highly optimized performance, with its back-end source code written in Python, Cython, and C.
It's integrated with many other data science and machine learning tools that use Python, so an understanding of it will be helpful throughout your journey.
One of the main use cases you'll come across is using pandas to transform your data in a way which makes it usable with machine learning algorithms.
To get started using pandas, the first step is to import it.
The most common way (and the one you should use) is to import pandas under the abbreviation pd, which is the conventional alias for the pandas package.
If you see the letters pd anywhere in pandas code, they almost certainly refer to the pandas library.
import pandas as pd
Pandas has two main data structures: Series and DataFrame.
1. Series - 1-dimensional column of data.
2. DataFrame - 2-dimensional table of data with rows and columns.
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
A pandas Series can be created using the following constructor:
pd.Series(data, index, dtype, copy)
Here, data can be many different things:
1. a Python dict
2. an ndarray
3. a scalar value (like 5)
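The three kinds of input listed above can be tried out with a minimal sketch like the one below. The variable names and the sample marks are made up for illustration only.

```python
import numpy as np
import pandas as pd

# 1. From a Python dict: the keys become the index labels.
marks_from_dict = pd.Series({"Michael": 82, "John": 75, "Sachin": 91})

# 2. From an ndarray: pandas assigns a default integer index (0, 1, 2, ...).
marks_from_array = pd.Series(np.array([82, 75, 91]))

# 3. From a scalar: the value is repeated once for every label in the index.
marks_from_scalar = pd.Series(5, index=["a", "b", "c"])

print(marks_from_dict)
print(marks_from_array)
print(marks_from_scalar)
```

Note that a scalar needs an explicit index if you want more than one element, because pandas uses the index to decide how many times to repeat the value.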
# Creating a series of student names
StudentName = pd.Series(["Michael", "John", "Sachin"])
print(StudentName)

0    Michael
1       John
2     Sachin
dtype: object

# Creating a series of ages
StudentAge = pd.Series([30, 28, 35])
print(StudentAge)

0    30
1    28
2    35
dtype: int64
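The examples above rely on the default integer index. As a rough sketch of what the index and dtype arguments of the constructor do, the snippet below labels each age with a student name and stores the values as floats. The variable name student_age is hypothetical and chosen only for this illustration.

```python
import pandas as pd

# index supplies custom labels; dtype sets the element type of the values.
student_age = pd.Series(
    [30, 28, 35],
    index=["Michael", "John", "Sachin"],
    dtype="float64",
)

print(student_age["John"])         # look up a value by its label
print(student_age.index.tolist())  # the labels that make up the index
print(student_age.dtype)           # the element type of the Series
```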
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet or a SQL table.
A pandas DataFrame can be created using the following constructor:
pd.DataFrame(data, index, columns, dtype, copy)
DataFrame accepts many different kinds of input:
1. Dict of 1D ndarrays, lists, dicts, or Series
2. 2-D numpy.ndarray
3. Structured or record ndarray
4. A Series
5. Another DataFrame
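A few of the input kinds listed above can be sketched as follows. This is only an illustration: the subject names, the marks, and the variable names are invented, and structured arrays are left out to keep it short.

```python
import numpy as np
import pandas as pd

# Dict of lists: each key becomes a column name.
df_from_dict = pd.DataFrame(
    {"StudentName": ["Michael", "John", "Sachin"], "StudentAge": [30, 28, 35]}
)

# 2-D ndarray: row and column labels are passed through index and columns.
scores = np.array([[82, 75], [64, 91], [77, 88]])
df_from_array = pd.DataFrame(
    scores,
    index=["Michael", "John", "Sachin"],
    columns=["Maths", "Science"],
)

# A single Series: it becomes a one-column DataFrame named after the Series.
df_from_series = pd.DataFrame(pd.Series([30, 28, 35], name="StudentAge"))

# Another DataFrame: the constructor builds a new DataFrame from it.
df_from_df = pd.DataFrame(df_from_dict)

print(df_from_dict)
print(df_from_array)
print(df_from_series)
```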
Let's use our two Series as the values.
# Creating a DataFrame of student names and ages
student_detail = pd.DataFrame({"StudentName": StudentName, "StudentAge": StudentAge})
print(student_detail)

  StudentName  StudentAge
0     Michael          30
1        John          28
2      Sachin          35
You can see that the keys of the dictionary became the column headings and the values of the two Series became the values in the DataFrame.
It's important to note, many different types of data could go into the DataFrame.
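As a small sketch of that point, the DataFrame below mixes strings, integers, floats and booleans, one type per column. The AverageMark and Enrolled columns and their values are invented for this example.

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        "StudentName": ["Michael", "John", "Sachin"],  # strings
        "StudentAge": [30, 28, 35],                    # integers
        "AverageMark": [82.5, 75.0, 91.25],            # floating point values
        "Enrolled": [True, False, True],               # booleans
    }
)

# dtypes reports the data type pandas chose for each column.
print(mixed.dtypes)
```

Each column keeps its own type, which is why checking dtypes is a useful first step before passing the data to other tools.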