Introduction to Python libraries - Pandas, Matplotlib.
Data structures in Pandas - 1. Series and 2. Data Frames.
1. Series: Creation of Series from –
Head and Tail functions; Selection, Indexing and Slicing.
2. Data Frames: creation - from dictionary of Series, list of dictionaries, Text/CSV files, display; iteration; Operations on rows and columns: add(insert/append), select, delete(drop column and row), rename;
Head and Tail functions; Indexing using Labels, Boolean Indexing;
Joining, Merging and Concatenation.
Importing/Exporting Data between CSV files and Data Frames.
Topic 3 -- Data handling using Pandas – II
Descriptive Statistics: max, min, count, sum, mean, median, mode, quartile, Standard deviation, variance.
DataFrame operations: Aggregation, group by,
Sorting, Deleting and Renaming Index, Pivoting.
Handling missing values – dropping and filling.
Importing/Exporting Data between MySQL database and Pandas.
Topic 4 -- Data Visualization
Purpose of plotting;
Drawing and saving following types of plots using Matplotlib –
1. line plot 2. bar graph 3. histogram
4. pie chart 5. frequency polygon 6. box plot and 7. scatter plot.
Customizing plots:
color, style (dashed, dotted), width; adding label, title, and legend in plots.
Learning objectives of this blog -
* What is Python, and what are the views of its developer, Guido van Rossum
* RAD projects and Python
* What are libraries, Python libraries, and their purpose
* Introduction to Pandas
Text highlighted in blue is to be written in the register.
Let us start with the definition: Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.
·Interpreter – is a translator which runs the program/code statement by statement instead of translating the whole program to machine code beforehand. You might have noticed that when you code in Python, an error is reported at the line where it occurs, and you fix that error before execution can move on to the next line.
·Object-oriented programming – is a paradigm in which a program is organised around objects. An object is an instance of a class (many classes come predefined in Python's library packages), and it bundles data together with the functions (methods) that work on that data. Once we create an instance of a class, we can call whichever of its methods our program needs.
· High-level programming – any programming which can be done with an easy set-up, is independent of platform specifications, and is friendlier to use (writing, understanding, support and execution).
·Dynamic semantics – semantics is the meaning of the statements in a program. Dynamic semantics means that this meaning is worked out while the program runs: a variable comes into existence only when a value is assigned to it at run-time, the same variable can hold values of different types at different times, and objects are bound to their properties and functions during execution.
Python’s high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.
Data structures – are containers which hold data in particular patterns (linear, non-linear, heterogeneous, homogeneous, tree-like, etc.) so that relationships between the data items can be established and operations can be performed on them to obtain a desired result.
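Dynamic typing – A variable is not declared with a data type in a Python program; its type is taken from the value assigned to it at run-time. Compare the C and Python programs below.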
/* Coding in C language to find the sum of two given numbers */
#include <stdio.h>
int main( )
{
int num1=2, num2=5, sum;
sum=num1+num2;
printf("%i", sum);
return 0;
}
# Coding in Python to find the sum of two given numbers
num1 = 2; num2 = 5
total = num1 + num2   # the name 'total' avoids hiding Python's built-in sum()
print(total)
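Dynamic binding - binding means connecting a name, such as a function or method call, with the actual object or function it refers to (objects being instances of classes). In Python this connection is made at run-time, when the statement is executed.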
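The two ideas above can be tried out with a small sketch. The names `value` and `describe` are made up for this illustration and are not part of the syllabus.

```python
# Dynamic typing: 'value' is never declared with a type;
# its type follows whatever is assigned to it.
value = 2
first_type = type(value).__name__      # 'int'
value = "two"
second_type = type(value).__name__     # 'str'
print(first_type, second_type)

# Dynamic binding: which count() method runs is decided at run-time,
# from the type of the object that is passed in.
def describe(item):
    return item.count(1)

list_result = describe([1, 2, 1])      # uses list.count
tuple_result = describe((1, 1, 1))     # uses tuple.count
print(list_result, tuple_result)
```

In C the first part would not compile, because `value` would have to be declared as either an `int` or a character array.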
Rapid Application Development -
Steps in Rapid Application Development
1.Define the requirements
2.Prototype
3.Receive Feedback
4.Finalize Software
1. Define the Requirements
At the very beginning, rapid application development sets itself apart from traditional software development models. It doesn’t require you to sit with end users and get a detailed list of specifications; instead, it asks for a broad requirement.
2. Prototype
This is where the actual development takes place. Instead of following a strict set of requirements, developers create prototypes with different features and functions as fast as they can. These prototypes are then shown to the clients who decide what they like and what they don’t.
3. Receive Feedback
In this stage, feedback on what’s good, what’s not, what works, and what doesn’t is shared. Feedback isn’t limited to just pure functionality, but also visuals and interfaces.
4. Finalize Software
Here, features, functions, aesthetics, and interface of the software are finalized with the client. Stability, usability, and maintainability are of paramount importance before delivering to the client.
Scripting Language - A scripting language is a computer language whose programs (scripts) are a series of commands in a file that are interpreted and executed without being compiled first. It adds new functions to applications and glues complex systems together.
Glue Language - extension ("glue") modules are required because Python cannot call C/C++ functions directly; the glue extensions handle conversion between Python data types and C/C++ data types, perform error checking, and translate error return values into Python exceptions.
Q What is the purpose of this glue…?
To develop an application we may need to combine desirable qualities: the speed of C and Java (internally faster because they use compilers as translators) with the ease of use of Python (highly user-friendly because of its dynamic semantics, but internally slower because it uses an interpreter as translator). It turns out that executing C/Java code from Python is not that hard, so it became common practice to run fast C/Java code through Python. The "through Python" part is why it is called a "glue" language.
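A small way to see the idea, assuming the standard CPython interpreter, where built-in functions such as sum() are themselves written in C. The function name `python_loop_sum` is made up for this sketch. Both versions give the same answer, but the second one hands the loop over to compiled C code through a single Python call.

```python
def python_loop_sum(numbers):
    # Every step of this loop is executed by the Python interpreter.
    total = 0
    for n in numbers:
        total += n
    return total

numbers = list(range(1, 1001))

loop_total = python_loop_sum(numbers)
builtin_total = sum(numbers)   # the looping happens inside C code

print(loop_total, builtin_total)
```

Libraries such as NumPy and Pandas follow the same pattern on a larger scale: we write Python, and the heavy work runs in compiled code.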
Summary
Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
1. Pandas
Library in Computer Languages – A library is a collection of packages which contain pre-defined modules / classes / subclasses of a similar purpose, along with their built-in functions, which a programmer may use in her code as the task requires. (Just like we have dictionaries in our spoken languages to refer to.) Most programming languages have a standard library.
Python’s standard library is very extensive, offering a wide range of built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers.
Pandas
The name is derived from the term "Panel Data"
Example –
Stud_id | Age | Height | Class | Eyesight |
A101 | 15 | 160 | 11 | 6/6 |
A101 | 14 | 158 | 10 | 6/6 |
A101 | 13 | 158 | 9 | 6/6 |
A102 | 13 | 162 | 9 | 2/3 |
A102 | 14 | 164 | 10 | 2/3 |
A102 | 15 | 164 | 11 | 2/3 |
In the above example a dataset with a panel structure is shown.
Individual characteristics ( age, height, eyesight) are collected for different students in their different classes. Here the two students (A101, A102) are observed in each class 9, 10 & 11.
The above is a Balanced Data Panel.
Stud_id | Age | Height | Class | Eyesight |
A201 | 11 | 154 | 7 | 6/6 |
A201 | 9 | 148 | 5 | 6/6 |
A201 | 10 | 150 | 6 | 6/6 |
A202 | 9 | 146 | 5 | 6/6 |
A203 | 10 | 164 | 6 | 2/3 |
A202 | 8 | 164 | 4 | 2/3 |
In the above example a dataset with a panel structure is shown, where three students are observed but not in a balanced way.
The students are observed over different periods of time, and the number of periods is not the same for all three.
This is an example of an Unbalanced Data Panel.
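The difference between the two panels can also be checked in code. The sketch below uses only the Stud_id and Class columns from the tables above. The helper name `is_balanced` is made up for this illustration, and it treats a panel as balanced when every student appears in every class present in the data.

```python
import pandas as pd

def is_balanced(panel):
    # Count the distinct classes in which each student was observed.
    per_student = panel.groupby("Stud_id")["Class"].nunique()
    # Balanced: every student is observed in all the classes in the data.
    return bool((per_student == panel["Class"].nunique()).all())

balanced = pd.DataFrame({
    "Stud_id": ["A101", "A101", "A101", "A102", "A102", "A102"],
    "Class":   [11, 10, 9, 9, 10, 11],
})

unbalanced = pd.DataFrame({
    "Stud_id": ["A201", "A201", "A201", "A202", "A203", "A202"],
    "Class":   [7, 5, 6, 5, 6, 4],
})

print(is_balanced(balanced))
print(is_balanced(unbalanced))
```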
2. Series
We can build, edit and compare data in pandas through –
1. Series
2. DataFrames
Series is a data structure of Pandas which holds a one-dimensional (1-D) homogeneous array; its size is immutable but its values are mutable.
Series: Creation of Series from – ndarray (NumPy array), dictionary, scalar value
Creation of Series from – 1. ndarray (NumPy array)
NumPy (Numerical Python)
NumPy - Data which needs to be calculated and manipulated is first stored in its simplest form, called an array, and then operations are performed on it to get the desired result.
Let’s create an ndarray (Numpy Array) !!
Open the site of Jupyter.org
Scroll Down the page to The Jupyter Notebook section
and click the orange button with caption – “Try it in your browser”
Now File--> New Notebook --> Python 3
1-D array in Numpy can be created in 2 ways- i) through numpy.array(obj)
ii) through numpy.fromstring(objs)
i. with array(object_name) - is a method / function of the numpy module which converts the object specified in the argument to an ndarray.
The object_name can be any sequence-like data structure which holds data in it, like a list or a tuple. (A dictionary is not converted element by element; pass its keys or values as a list instead.)
Step 1 -- > List
Can consist of elements belonging to different data types
No need to explicitly import a module for declaration
Cannot directly handle arithmetic operations
MyList = [ 1 , 2 , 3 , 4 , 5 ]
print("Check the list :", MyList)
Step 2 -- > Convert the list into ndarray.
Array
Only consists of elements belonging to the same data type
Need to explicitly import a module for declaration
Can directly handle arithmetic operations
** Since array( ) belongs to the numpy package, numpy should be imported in the program.
import statement has four parts –
1. 'import' keyword
2. Module name – here 'numpy'
3. 'as' keyword
4. object-name – a short alias for the module, chosen by the user, so that the user can reach the submodules / classes / built-in functions of that module through it. (Parts 3 and 4 are optional.)
import numpy
MyList= [1, 2, 3, 4, 5]
print(numpy.array(MyList))
OR
arr1=numpy.array(MyList)
print(arr1)
Or import the NumPy library -
import numpy as <object_name>   # object_name is a user-defined name and must follow the identifier naming rules.
Eg -- > import numpy as np
MyList= [1, 2, 3, 4, 5]
** Why do we need to convert a list to an array? Why can't we use a list directly?
A list can hold values of any type, but arithmetic operators do not work on its elements one by one: + joins two lists and * repeats a list. To perform mathematical operations on every number at once, we convert the list into an array.
import numpy
MyList = [1, 2, 3, 4, 5]
arr1=numpy.array(MyList)
print("The ndarray from the list object is :", arr1, "\n")
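To see why the conversion matters, here is a short comparison of the same operator applied to a list and to an ndarray. The variable names `repeated`, `doubled` and `shifted` are made up for this sketch.

```python
import numpy as np

MyList = [1, 2, 3, 4, 5]
arr1 = np.array(MyList)

repeated = MyList * 2    # the list is repeated, not doubled
doubled = arr1 * 2       # every element is multiplied by 2
shifted = arr1 + 10      # 10 is added to every element

print(repeated)
print(doubled)
print(shifted)
```

Writing `MyList + 10` would stop with a TypeError, because a list can only be joined with another list.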
A Pandas Series is a labeled (indexed) array that holds data.
Series(data [, index]) - is the constructor of the Series class in the Pandas library (so always remember to import pandas to use it).
It converts the data (ndarray / list / dictionary / scalar value) specified in its arguments into a series. index lets us assign our own data labels to the elements / items of the series.
import pandas
series1=pandas.Series(arr1)
print("The Series from ndarray is :")
print(series1) OR series1
The Series will be created now. Notice the data structure appearance.
O/P --> The ndarray from the list object is : [1 2 3 4 5]
The Series from ndarray is :
0 1
1 2
2 3
3 4
4 5
dtype: int64
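The optional index argument of Series( ) was not used in the program above. A minimal sketch of it follows; the labels "a" to "e" are chosen only for illustration.

```python
import numpy as np
import pandas as pd

arr1 = np.array([1, 2, 3, 4, 5])

# Give each element our own label instead of the default 0, 1, 2, ...
series1 = pd.Series(arr1, index=["a", "b", "c", "d", "e"])
print(series1)

by_label = series1["c"]        # access through the label
by_position = series1.iloc[2]  # access through the position
print(by_label, by_position)
```

Both ways of access reach the same element here, because "c" is the label given to the third position.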
The ndarray is created, now create a series from this ndarray!!
import numpy
import pandas
list1=['Naman','Abhishek','Prakhar']
print("Check the list :",list1)
arr1=numpy.array(list1)
print()
print("Check the ndarray created from list :",arr1)
series1=pandas.Series(arr1)
print()
print("Check the Series created from ndarray :")
print(series1)
O/P -->
Check the list : ['Naman', 'Abhishek', 'Prakhar']
Check the ndarray created from list : ['Naman' 'Abhishek' 'Prakhar']
Check the Series created from ndarray :
0 Naman
1 Abhishek
2 Prakhar
dtype: object
0, 1 and 2 are the data labels assigned by the pandas for identifying each element uniquely. (Can it be called index address too??)
Indexes are of two types: positional index and labelled index. A positional index takes an integer value that corresponds to the element's position in the series, starting from 0, whereas a labelled index takes any user-defined label as index.
*** can we change ? What if we Change?? Why to change??
dtype - data type of the series. Pandas works it out from the type of data/element/value stored: int64 for the whole numbers and object for the names in the examples above. ** How to find the dtype??? Can we change ???
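The questions above can be explored with a short sketch. It reads the dtype attribute, uses astype( ) to get a copy with another data type, and assigns new labels to the index. The names `as_float` and `relabelled` are made up for this illustration.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5])
print(series1.dtype)                # the dtype pandas worked out from the values

as_float = series1.astype(float)    # a new Series holding the same values as floats
print(as_float.dtype)

relabelled = series1.copy()
relabelled.index = ["a", "b", "c", "d", "e"]   # replace the default 0..4 labels
print(relabelled)
```

The original series1 is left unchanged in both cases, since astype( ) and copy( ) each return a new Series.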
Series is a 1-D array which appears in vertical manner.
Creation of a Series from 2. Scalar Value
A scalar value is a single value of one data type, such as the integer 100. In the example below a list of such values is passed, and because every element is an integer the whole series gets an integer dtype.
Ex -
import pandas
scalarvalue=[100, 200, 300, 400, 500]
series2=pandas.Series(scalarvalue)
print("The Series from the Scalar Value is :")
print(series2)
O/P-->
The Series from the Scalar Value is :
0 100
1 200
2 300
3 400
4 500
dtype: int64
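Series( ) also accepts one single scalar together with an index, and it accepts a dictionary, which the syllabus lists as well. The sketch below shows both; the labels and the ages in the dictionary are invented for illustration.

```python
import pandas as pd

# One scalar value is repeated for every label in the index.
from_scalar = pd.Series(100, index=["a", "b", "c"])
print(from_scalar)

# Dictionary keys become the labels, dictionary values become the data.
from_dict = pd.Series({"Naman": 15, "Abhishek": 14, "Prakhar": 13})
print(from_dict)
```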
ii. with fromstring(string_data, [ dtype,] sep) - this method / function is used to create an array from string data.
dtype - is the keyword used to define the data type of the array; the default data type is float.
sep - is the separator keyword which tells which character separates the numbers in the string;
values assigned to the separator can be a comma, a period, a blank space.
Eg -- > import numpy as np
print(np.fromstring('1234'))
Observe the output for each set of arguments.
In the code In[21] :
when fromstring( ) is used without the 'sep' (separator) argument, the string is read as raw binary data rather than as text,
and the output is a ValueError, which means the size of the data passed as an argument does not fit the required data length (the 4 characters of '1234' cannot fill even one 8-byte float).
Imagine if we tried to put a Great Dane (dog) into a Chihuahua’s kennel. This would be a problem with the value of the dog, because although they are both of type ‘dog’, a Chihuahua’s kennel would not be able to accept a dog the size of a Great Dane.
So here, the string size is smaller than the size required.
In the code In[20] :
When the second argument of the method fromstring( ) is the 'sep' keyword with the value ',' (comma), the output shows the string as a single number ending with a decimal point within the array (the default dtype is float).
In the code In[3] :
When the second argument of the method fromstring( ) is the 'sep' keyword with the value ' ' (blank space), the output shows the string elements as separate numbers within the array, split at the blank spaces.
In the code In[19] :
When the second argument of the method fromstring( ) is the 'sep' keyword with the value '.' (dot), the output again shows the string as a single number ending with a decimal point within the array.
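The notebook cells referred to above are not reproduced here, so the following is a small sketch of fromstring( ) with the sep argument, using strings chosen for illustration.

```python
import numpy as np

# Numbers separated by blank spaces, read as integers.
spaced = np.fromstring("1 2 3 4", dtype=int, sep=" ")

# Numbers separated by commas; the default dtype is float.
commas = np.fromstring("1,2,3,4", sep=",")

# No comma in the string, so the whole string is one number.
single = np.fromstring("1234", sep=",")

print(spaced)
print(commas)
print(single)
```

The last line shows why '1234' came out as one value with a decimal point: with no separator in the string, the whole string is a single float.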
i. array(object_name) - is a method / function of the numpy module which converts the object specified in the argument to an ndarray.
The object_name can be any sequence-like data structure which holds data in it, like a list or a tuple.
Case 1:
*
** in the above example 3 lists are converted into one array of 3 rows and 5 columns by the
list1=[1,2,3,4]
list2=[11,12,13]
** in the above example the 3 lists are converted into a nested list and not in array.
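Since the original programs for these two cases are not shown, here is a sketch of both with list values chosen for illustration. Lists of equal length become a 2-D array; lists of unequal length cannot form a grid, so with dtype=object NumPy only keeps them as a 1-D array of list objects.

```python
import numpy as np

list1 = [1, 2, 3, 4, 5]
list2 = [11, 12, 13, 14, 15]
list3 = [21, 22, 23, 24, 25]

# Equal lengths: three lists become 3 rows and 5 columns.
grid = np.array([list1, list2, list3])
print(grid.shape)

# Unequal lengths: the result is a 1-D array holding two list objects.
ragged = np.array([[1, 2, 3, 4], [11, 12, 13]], dtype=object)
print(ragged.shape)
```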
ii. empty( [rows,columns], dtype=data_type) - is a method / function of the numpy module which creates an array without initialising its values, so it contains whatever arbitrary values were already in memory.
[rows, columns] - to specify the total number of rows and columns of the array
dtype - is used to specify which type of data is to be generated; by default the data type is float.
Example -- numpy.empty( [ 3, 2 ], dtype=int )
** In the above program empty( ) has generated a 3x2 matrix of arbitrary values, shown as integers. Kindly remember that these values may be different each time the program is executed.
** When dtype is not given, the output consists of arbitrary numbers of type float (long exponential type numbers).
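A minimal sketch of empty( ) follows. Because the starting values are arbitrary, the sketch looks only at the shape and data type, and then fills the array with a real value (7, picked for illustration) before using it.

```python
import numpy as np

int_grid = np.empty([3, 2], dtype=int)   # 3 rows, 2 columns, arbitrary integers
float_grid = np.empty([3, 2])            # dtype not given, so float is used

print(int_grid.shape, int_grid.dtype)
print(float_grid.shape, float_grid.dtype)

# Always assign real values before relying on the contents.
int_grid[:] = 7
print(int_grid)
```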
iii. numpy.zeros( [rows, columns], dtype=data_type) - this method / function is used to create an array of the specified rows and columns, filled with zeros of the specified data type.
[rows, columns] - to specify the total number of rows and columns of the array
dtype - is used to specify which type of data is to be generated; by default the data type is float.
Example --
In the above example 5 columns and 1 row have been generated for the 2-D array, all with the value '0' and of type integer (which means without the decimal dot).
In the above example 3 columns and 2 rows have been generated for the 2-D array, all with the value '0' and of type float (which means each zero value is suffixed with the decimal dot).
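The programs for the two examples are not shown above, so the calls below are a likely reconstruction rather than the exact original code.

```python
import numpy as np

# 1 row and 5 columns of integer zeros.
int_zeros = np.zeros([1, 5], dtype=int)

# 2 rows and 3 columns; dtype not given, so the zeros are floats.
float_zeros = np.zeros([2, 3])

print(int_zeros)
print(float_zeros)
```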
Stay Safe Stay Healthy!!!