Implementation of a predictive model

THIS PAGE IS UNDER DEVELOPMENT

  1. Defining Business Objectives: data exploration
  2. Preparing the Data for Modeling: cleaning and preparing the data so it can be explored.
  3. Sampling Your Data: the data is split into two sets, a training dataset and a test dataset. The model is built using the training dataset.
  4. Processing the Model
  5. Validating the Model: the test dataset is used to verify the accuracy of the model's output.
  6. Deploying the model
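Steps 3 to 5 can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (it is not the airline extract), the 80/20 split ratio is arbitrary, and a least-squares straight line stands in for whatever model is eventually chosen.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, not the airline extract):
# arrival delay is the departure delay plus about 5 minutes of noise.
rng = np.random.default_rng(0)
dep_delay = rng.integers(-5, 60, size=200).astype(float)
arr_delay = dep_delay + rng.normal(loc=5.0, scale=3.0, size=200)
data = pd.DataFrame({"DepDelay": dep_delay, "ArrDelay": arr_delay})

# Step 3: sample 80% of the rows for training; the remaining rows form the test set.
train = data.sample(frac=0.8, random_state=0)
test = data.drop(train.index)

# Step 4: fit ArrDelay = slope * DepDelay + intercept on the training set only.
slope, intercept = np.polyfit(
    train["DepDelay"].to_numpy(), train["ArrDelay"].to_numpy(), deg=1
)

# Step 5: validate on the held-out test set using the mean absolute error.
predicted = slope * test["DepDelay"] + intercept
mean_abs_error = (predicted - test["ArrDelay"]).abs().mean()

print(f"train rows: {len(train)}, test rows: {len(test)}")
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={mean_abs_error:.2f}")
```

Keeping the test rows out of the fitting step is what makes the error a fair estimate of how the model behaves on unseen data.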

Data Exploration using Pandas

A dataframe is similar to an Excel worksheet: a table of rows and named columns.
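As a small illustration of the worksheet analogy, the sketch below builds a dataframe by hand from the three sample rows printed further down this page, then picks out a column, a row, and a filtered subset. The variable names are illustrative only.

```python
import pandas as pd

# The three sample rows of airLineExtract.csv printed further down this page.
flights = pd.DataFrame({
    "Year": [1987, 1987, 1987],
    "Month": [10, 10, 10],
    "DayofMonth": [14, 15, 17],
    "DayOfWeek": [3, 4, 6],
    "ArrDelay": [23, 14, 29],
    "DepDelay": [11, -1, 11],
})

print(flights.shape)                      # (number of rows, number of columns)

dep = flights["DepDelay"]                 # one column, like a worksheet column
second_row = flights.iloc[1]              # one row, selected by position
late = flights[flights["DepDelay"] > 0]   # only the rows that departed late

print(dep.tolist())
print(second_row["ArrDelay"])
print(len(late))
```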

Let's load a dataset into a dataframe and apply some basic operations.

We will use a file named airLineExtract.csv (CSV stands for Comma-Separated Values).

To install the libraries on Debian as the root user:


apt-get install python3-pip 

pip install numpy

pip install pandas

pip install matplotlib

import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt


dataframe = pd.read_csv("./airLineExtract.csv") #load the dataset into a dataframe using Pandas
print("First 3 rows of dataset: ")
print(dataframe.head(3))
   
print("DESCRIPTION of the dataset: ")  
print(dataframe.describe())

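print("ZOOM on a DESCRIPTION field of the dataset: ")
dataframe['Month'].value_counts()     # frequency of each Month value (wrap in print() to display it from a script)
dataframe['DepDelay'].value_counts()  # frequency of each departure-delay value
dataframe.boxplot(column='DepDelay', by='DayOfWeek')  # one box of departure delays per day of the week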
First 3 rows of dataset: 
   Year  Month  DayofMonth  DayOfWeek  ArrDelay  DepDelay
0  1987     10          14          3        23        11
1  1987     10          15          4        14        -1
2  1987     10          17          6        29        11
DESCRIPTION of the dataset: 
       Year  Month  DayofMonth   DayOfWeek    ArrDelay    DepDelay
count   200    200  200.000000  200.000000  196.000000  196.000000
mean   1987     10   15.585000    3.885000   15.653061    9.040816
std       0      0    8.807989    1.920996   15.778338   13.741883
min    1987     10    1.000000    1.000000   -7.000000   -2.000000
25%    1987     10    8.000000    2.000000    4.750000    0.000000
50%    1987     10   15.500000    4.000000   13.000000    3.500000
75%    1987     10   23.000000    5.000000   22.250000   13.250000
max    1987     10   31.000000    7.000000   88.000000   87.000000
ZOOM on a DESCRIPTION field of the dataset: 
plot of chunk Figure-3
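The numbers behind a box plot grouped by day of the week can also be read as a table. The sketch below uses a small made-up sample (not the airline extract) and combines groupby() with describe(); the 25%, 50%, and 75% columns correspond to the edges and middle line of each box.

```python
import numpy as np
import pandas as pd

# Made-up sample (an assumption, not the airline extract), with one missing delay.
sample = pd.DataFrame({
    "DayOfWeek": [1, 1, 1, 2, 2, 2, 3, 3],
    "DepDelay": [0.0, 4.0, 8.0, -2.0, 10.0, np.nan, 3.0, 5.0],
})

# One row of summary statistics per day of the week; missing values are skipped.
summary = sample.groupby("DayOfWeek")["DepDelay"].describe()
print(summary[["count", "25%", "50%", "75%"]])
```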


dataframe['DepDelay'].hist(bins=50)  # histogram of departure delays with 50 bins
dataframe.isnull().sum()             # number of missing values in each column
Year          0
Month         0
DayofMonth    0
DayOfWeek     0
ArrDelay      4
DepDelay      4
dtype: int64
plot of chunk Figure-4
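To see what a histogram counts, the bins can be computed directly with NumPy. This is a sketch on made-up delay values (not the airline extract). pandas drops missing values before plotting a histogram, while np.histogram needs them removed first, hence the explicit dropna().

```python
import numpy as np
import pandas as pd

# Made-up departure delays (an assumption), including one missing value.
delays = pd.Series([0.0, 1.0, 1.0, 3.0, 12.0, 30.0, np.nan, 87.0])

# Remove missing values, then count how many delays fall into each of 5 equal-width bins.
clean = delays.dropna()
counts, edges = np.histogram(clean.to_numpy(), bins=5)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n}")
```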


The describe() function provides the count, mean, standard deviation (std), min, quartiles, and max of each numerical variable.

From the output of describe(), we can get information about the data:

The count row reveals missing data for each field: ArrDelay and DepDelay have a count of 196 out of 200 rows, so 4 values are missing in each. dataframe['DepDelay'].value_counts() then shows how often each delay value occurs.
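The link between the count row and missing data can be checked on a small example. The sketch below uses a made-up dataframe (not the airline extract). The number of missing values per column is the total row count minus the non-missing count, which matches isnull().sum().

```python
import numpy as np
import pandas as pd

# Made-up sample (an assumption) with two missing departure delays.
sample = pd.DataFrame({
    "DayOfWeek": [3, 4, 6, 7, 1],
    "DepDelay": [11.0, -1.0, np.nan, 11.0, np.nan],
})

total_rows = len(sample)
non_missing = sample.count()          # same numbers as the "count" row of describe()
missing = total_rows - non_missing    # missing values per column

print(sample.describe().loc["count"])
print(missing)

# value_counts() ignores missing values unless dropna=False is passed.
value_freq = sample["DepDelay"].value_counts(dropna=False)
print(value_freq)
```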

TEST

4 + 3
## 7

import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
plot of chunk Figure-6


import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y1 = np.sin(x)
y2 = np.sin(3 * x)
plt.fill(x, y1, 'b', x, y2, 'r', alpha=0.3)
# If you are working outside inline IPython (for example, running a plain script), you need to add:
plt.show()


y = [3, 5, 9, 2, 6, 4, 7, 8, 1, 5]  # a list of numbers
plt.plot(y)                         # draw the graph
plt.savefig("fig1.png")             # save it to fig1.png

plot of chunk Figure-7

![fig1](./fig1.png)
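When the code runs without a display (a script on a server, or a report being built), the figure can be written to a file without ever opening a window. The sketch below assumes the non-interactive "Agg" backend is acceptable and writes to a temporary directory; the file name fig1_example.png is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, no window needed
import matplotlib.pyplot as plt

y = [3, 5, 9, 2, 6, 4, 7, 8, 1, 5]  # the same list of numbers as above

fig, ax = plt.subplots()
ax.plot(y)
ax.set_xlabel("index")
ax.set_ylabel("value")

# Hypothetical output location, chosen so the example does not clutter the working directory.
out_path = os.path.join(tempfile.gettempdir(), "fig1_example.png")
fig.savefig(out_path)
plt.close(fig)  # release the figure's memory once it is saved

print(os.path.exists(out_path))
```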

plt.plot([1, 2, 3])

plot of chunk Figure-8

### Matplotlib

{python, include=F}
import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
plt.savefig("figure/fig.png")

{r showfig, include=TRUE, echo=F, results='asis'}
knitr::include_graphics("figure/fig.png")