Pandas: Another Library Widely Used in Machine Learning
Pandas: Step By Step Guide By Sagar Jaybhay
Pandas is an open source library that provides high-performance, easy-to-use data structures and data analysis tools. It makes data science work easier and more effective. In a Jupyter notebook, pressing Shift + Tab with the cursor on a function shows information about that function.
Link:- https://pandas.pydata.org/
Which problem does pandas solve?
Python is a good programming language for data munging and preparation, but it has lagged behind in some areas of data analysis; pandas fills this gap. With pandas you can carry out the whole data analysis workflow in Python without switching to another language such as R.
To get started, first import the pandas library.
import pandas as pd
pd.__version__
This checks which version is currently installed on your machine. If it is not the latest version, you can upgrade with the command below; to install a specific release instead, pin it, for example pandas==0.23.0.
python3 -m pip install --upgrade pandas
pd.show_versions()  # prints the versions of pandas and its dependencies
DataFrames
A DataFrame is a 2-dimensional labeled structure, similar to a spreadsheet. Its size is mutable, and its columns can hold different data types (heterogeneous). We use Olympic medal data for the examples.
Data download link: https://docs.google.com/spreadsheet/ccc?key=0AonYZs4MzlZbdHlfd0F1QlAxYjgtOW53ZXNOZ0JzNVE
The data link is published on the page below.
https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data
City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal |
Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |
Athens | 1896 | Aquatics | Swimming | CHOROPHAS, Efstathios | GRE | Men | 1200m freestyle | M | Bronze |
Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 1200m freestyle | M | Gold |
Athens | 1896 | Aquatics | Swimming | ANDREOU, Joannis | GRE | Men | 1200m freestyle | M | Silver |
Athens | 1896 | Aquatics | Swimming | CHOROPHAS, Efstathios | GRE | Men | 400m freestyle | M | Bronze |
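To make "mutable size" and "heterogeneous" concrete, here is a minimal sketch that builds a small DataFrame by hand from three of the rows shown above. The variable name sample and the extra Is_gold column are only for illustration; they are not part of the original dataset.

```python
import pandas as pd

# Three rows copied from the sample data shown above.
rows = [
    ("Athens", 1896, "Aquatics", "Swimming", "HAJOS, Alfred", "HUN", "Men", "100m freestyle", "M", "Gold"),
    ("Athens", 1896, "Aquatics", "Swimming", "HERSCHMANN, Otto", "AUT", "Men", "100m freestyle", "M", "Silver"),
    ("Athens", 1896, "Aquatics", "Swimming", "DRIVAS, Dimitrios", "GRE", "Men", "100m freestyle for sailors", "M", "Bronze"),
]
columns = ["City", "Edition", "Sport", "Discipline", "Athlete",
           "NOC", "Gender", "Event", "Event_gender", "Medal"]

sample = pd.DataFrame(rows, columns=columns)

# Heterogeneous: Edition holds integers, the other columns hold strings.
print(sample.dtypes)

# Size-mutable: adding a column (hypothetical, for illustration) grows the frame.
sample["Is_gold"] = sample["Medal"] == "Gold"
print(sample.shape)
```

Each column keeps its own dtype, which is why a single DataFrame can mix numbers and text.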
To read a CSV file and display part of it in the notebook, use the code below.
# raw string, so the backslashes in the Windows path are not treated as escape characters
head=pd.read_csv(r'C:\data\olympicData.csv',skiprows=0)
head.head()
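If you evaluate only the DataFrame name in a Jupyter notebook, pandas shows a truncated view once the frame exceeds the display limit (60 rows by default): older pandas versions display the first 30 and last 30 rows, while pandas 0.25 and later display the first 5 and last 5.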
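If you do not have the CSV file at hand, the same read_csv call can be tried against an in-memory string. This is only a sketch: the csv_text below is a three-row stand-in built from the sample rows in this article, and it assumes the real file is comma-separated with the athlete names quoted.

```python
import io
import pandas as pd

# Stand-in for the CSV file: three rows from the article's sample data.
# Athlete names contain commas, so they are quoted.
csv_text = """City,Edition,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
Athens,1896,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
Athens,1896,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
Athens,1896,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows as a new DataFrame.
print(sample.head(2))
```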
Series
A Series is a one-dimensional array of indexed data. It can hold any kind of data, such as integers, strings, or Python objects. Its axis labels are collectively called the index. Every column of a DataFrame is a Series, and a single row selected from a DataFrame is also returned as a Series.

To select two columns, pass a list of column names as in the code below; the result is a DataFrame, not a Series.
head[['City','Athlete']]
To check the types, use the code below.
type(head)  # o/p pandas.core.frame.DataFrame
type(head.City)  # o/p pandas.core.series.Series
type(head[['City','Athlete']])  # o/p pandas.core.frame.DataFrame
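The difference between selecting a Series and selecting a DataFrame can be checked on a tiny frame. This is a sketch using a hand-built two-row sample; the variable names are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    "City": ["Athens", "Athens"],
    "Athlete": ["HAJOS, Alfred", "HERSCHMANN, Otto"],
    "Medal": ["Gold", "Silver"],
})

city = sample["City"]                # single label -> Series
pair = sample[["City", "Athlete"]]   # list of labels -> DataFrame
first_row = sample.iloc[0]           # one row by position -> Series

print(type(city).__name__)
print(type(pair).__name__)
print(type(first_row).__name__)

# A row Series is indexed by the column names of the DataFrame.
print(list(first_row.index))
```

Note the double brackets in the second selection: the inner pair is an ordinary Python list of column names.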
Data Input
pd.read_csv()
pd.read_excel()
pd.read_html()
These three functions are commonly used in pandas to load data into a DataFrame. Note that read_html() returns a list of DataFrames, one per table found.
- head.shape: returns the shape of the DataFrame as a tuple of (rows, columns). In our case there are 29216 rows and 10 columns.
head.shape  # o/p (29216, 10) (rows, columns)
- head() method: if you don't specify the number of rows, you get the first 5 rows by default.
head.head()
- tail() method: likewise, if you don't specify the number of rows, you get the last 5 rows.
head.tail()
- info() method: prints a summary of the DataFrame, including the index range, the column names, the non-null count and dtype of each column, and the memory usage.
head.info()

- value_counts(): called on a Series (or an Index), it returns the count of each unique value. By default the result is sorted in descending order of count; pass ascending=True to reverse the order.
head.Gender.value_counts(ascending=True)
- sort_values(): on a Series, this sorts the values in ascending order by default. On a DataFrame, use the by argument to name the column or columns to sort by.
head.sort_values(by=['City','Athlete'])
The result is a new DataFrame whose rows are sorted by the columns passed to sort_values().
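The inspection methods listed above can be tried together on a small frame. The following is a minimal sketch on a hand-built four-row sample (not the full 29216-row dataset); the names sample, noc_counts and by_medal are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    "City": ["Athens", "Athens", "Athens", "Athens"],
    "Athlete": ["HAJOS, Alfred", "HERSCHMANN, Otto",
                "DRIVAS, Dimitrios", "ANDREOU, Joannis"],
    "NOC": ["HUN", "AUT", "GRE", "GRE"],
    "Medal": ["Gold", "Silver", "Bronze", "Silver"],
})

print(sample.shape)      # (rows, columns) tuple
print(sample.head(2))    # first 2 rows
print(sample.tail(1))    # last row
sample.info()            # prints the summary itself and returns None

# Count of each unique value, most frequent first by default.
noc_counts = sample["NOC"].value_counts()
print(noc_counts)

# Sort by Medal, then by Athlete within each medal; returns a new DataFrame.
by_medal = sample.sort_values(by=["Medal", "Athlete"])
print(by_medal["Athlete"].tolist())
```

sort_values() leaves the original frame untouched, so sample still has its rows in the original order afterwards.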