PANDAS LIBRARY FOR PYTHON

Pandas is an open-source library that provides high-performance data structures and data analysis tools for Python. It lets Python go beyond its traditional strength, data munging and preparation, into data analysis itself.

Combined with the IPython toolkit and other libraries, Pandas provides a high-performance environment for data analysis.

To add statistical modeling functionality to Python, you can import the statsmodels and scikit-learn libraries: Pandas itself is limited to linear and panel regression.

Data structure creation

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating point numbers, Python objects, etc.).

Syntax for creating a Series: s = pandas.Series(data, index=index)

Where data can be a Python dictionary, an ndarray, or a scalar value.

Creating Series

Series creation from a Python DICTIONARY

Let's create a dictionary and build a Series from it:

import numpy as np
import pandas as pd

# Avoid naming the variable "dict", which would shadow the built-in type
data = {'key1': 1.0, 'key2': 4.0, 'key3': 9.0}
s = pd.Series(data)
print(s)
key1    1.0
key2    4.0
key3    9.0
dtype: float64

The dictionary keys become the Series index and the corresponding dictionary values become the Series data.

Starting from the same dictionary, let's pass an explicit index that reorders the keys and adds a new one:

import numpy as np
import pandas as pd

# Avoid naming the variable "dict", which would shadow the built-in type
data = {'key1': 1.0, 'key2': 4.0, 'key3': 9.0}
s = pd.Series(data, index=['key2', 'key3', 'new_key', 'key1'])
print(s)
key2       4.0
key3       9.0
new_key    NaN
key1       1.0
dtype: float64

The Series is displayed in the order given by the index argument.

If a label is specified in the index when creating the Series and the dictionary holds no value for it, the corresponding entry is NaN.
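The NaN entries described above can be located or dropped after the fact. The following is a minimal sketch reusing the dictionary from the example; the variable names data and missing are chosen here for illustration.

```python
import pandas as pd

data = {'key1': 1.0, 'key2': 4.0, 'key3': 9.0}
s = pd.Series(data, index=['key2', 'key3', 'new_key', 'key1'])

# Boolean Series: True where the value is NaN
missing = s.isna()
print(missing)

# Same Series without the NaN entry
print(s.dropna())
```

Only 'new_key' is flagged as missing, since it is the only label absent from the dictionary.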

Series creation from an NDARRAY

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

import numpy as np
import pandas as pd

print("Series creation from an NDARRAY : ")
ndarray = np.array([1.0, 2.0, 4.0, 9.0])  # an actual NumPy array, not a plain list
s = pd.Series(ndarray, index=['col1', 'col2', 'col3', 'col4'])
print(s)
Series creation from an NDARRAY : 
col1    1.0
col2    2.0
col3    4.0
col4    9.0
dtype: float64
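The two index rules stated above (default index when none is passed, matching length otherwise) can be checked directly. This is a small sketch; the names values and length_mismatch_rejected are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 4.0, 9.0])

# No index passed: pandas builds the default integer index 0..len(values)-1
s = pd.Series(values)
print(list(s.index))

# An index whose length differs from the data is rejected
length_mismatch_rejected = False
try:
    pd.Series(values, index=['col1', 'col2'])
except ValueError as exc:
    length_mismatch_rejected = True
    print("ValueError:", exc)
```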

Series creation from a scalar value:

import numpy as np
import pandas as pd

print("Series creation from a scalar value : ")
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(s)
Series creation from a scalar value : 
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Array behaviour of series:

import numpy as np
import pandas as pd

print("Series creation from an ndarray : ")
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

print("s.index: ")
print(s.index)

print("s.iloc[0]: ")
print(s.iloc[0])  # first element by position; plain s[0] is deprecated for a labeled index

print("s[:3]: ")
print(s[:3])

print("s.median(): ")
print(s.median())

print("s>s.median(): ")
print(s[s>s.median()])
Series creation from an ndarray : 
s.index: 
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s.iloc[0]: 
-1.3474545302
s[:3]: 
a   -1.347455
b    1.039866
c   -1.491646
dtype: float64
s.median(): 
-0.940016271728
s>s.median(): 
b    1.039866
e   -0.422849
dtype: float64
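Array behaviour also covers elementwise arithmetic and NumPy functions applied to a whole Series. The sketch below uses fixed values instead of random ones so that the results are reproducible; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

doubled = s * 2          # elementwise arithmetic, labels are preserved
roots = np.sqrt(s)       # NumPy functions accept a Series
above = s[s > s.median()]  # boolean mask, as in the example above

print(doubled)
print(roots)
print(above)
```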

Dictionary behaviour of series:

print(s['a'])
print('e' in s)
print('f' in s)
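A self-contained sketch of the dictionary-style access, using fixed values so it can be run on its own. It also shows Series.get, which returns a default instead of raising KeyError for a missing label.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

print(s['a'])           # label lookup, like a dictionary key
print('c' in s)         # membership is tested against the index labels
print('z' in s)

# A missing label raises KeyError with [], but get() returns a default
print(s.get('z'))       # default is None
print(s.get('z', 0.0))  # explicit default
```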

Creating Dataframes
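The examples later in this document build dataframes from a dictionary of lists, where each key becomes a column name. A minimal sketch of that pattern, assuming equal-length lists (the name df is illustrative):

```python
import pandas as pd

# Each dictionary key becomes a column; rows get the default index 0, 1, 2
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

print(df)
print(df.shape)
print(list(df.columns))
```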

Dataframe Manipulation

Renaming columns in a dataframe:
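One common way to do this is DataFrame.rename with a mapping from old to new names. The sketch below is an assumption about what this section intended; the new names 'first' and 'second' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# rename returns a new dataframe; the original is left unchanged
renamed = df.rename(columns={'col1': 'first', 'col2': 'second'})

print(list(renamed.columns))
print(list(df.columns))
```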

To reorder columns of a dataframe whose columns are 'col1', 'col2', 'col3' for example:
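A simple approach, sketched here as an assumption about the intended example, is to select the columns with a list in the desired order. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

# Selecting with a list of names returns the columns in that order
reordered = df[['col3', 'col1', 'col2']]

print(reordered)
```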

Juxtaposition of dataframes:

To concatenate 2 dataframes with the same columns, one below the other: pandas.concat([dataframe1, dataframe2]) returns the row-wise concatenation (the older dataframe1.append(dataframe2) was removed in pandas 2.0):

import numpy as np
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
dataframe2 = pd.DataFrame({'col1': [10, 20], 'col2': [7, 8]})
# pd.concat stacks the rows of dataframe2 below those of dataframe1
dataframe = pd.concat([dataframe1, dataframe2])

print(dataframe)
   col1  col2
0     1     4
1     2     5
2     3     6
0    10     7
1    20     8
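Note that the result above keeps the original row labels, so 0 and 1 appear twice. If a fresh 0..n-1 index is preferred, pandas.concat accepts ignore_index=True. A short sketch with the same data (the names stacked and renumbered are illustrative):

```python
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
dataframe2 = pd.DataFrame({'col1': [10, 20], 'col2': [7, 8]})

stacked = pd.concat([dataframe1, dataframe2])                        # keeps labels 0, 1, 2, 0, 1
renumbered = pd.concat([dataframe1, dataframe2], ignore_index=True)  # relabels 0..4

print(list(stacked.index))
print(list(renumbered.index))
```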

If the columns are not the same, no error is raised: the resulting dataframe has NaN values wherever a column is missing from one of the inputs.

import numpy as np
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
dataframe2 = pd.DataFrame({'col1': [10, 20], 'col3': [7, 8]})
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same row-wise result
dataframe = pd.concat([dataframe1, dataframe2])

print(dataframe)
   col1  col2  col3
0     1   4.0   NaN
1     2   5.0   NaN
2     3   6.0   NaN
0    10   NaN   7.0
1    20   NaN   8.0

To concatenate 2 dataframes with the same number of rows, next to each other: pandas.concat([dataframe1, dataframe2], axis=1) returns the column-wise concatenation:

import numpy as np
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
dataframe2 = pd.DataFrame({'col1': [10, 20, 30], 'col2': [7, 8, 9]})
# A dataframe only has axes 0 (rows) and 1 (columns); axis=2 raises an error
dataframe = pd.concat([dataframe1, dataframe2], axis=1)

print(dataframe)
   col1  col2  col1  col2
0     1     4    10     7
1     2     5    20     8
2     3     6    30     9
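The result above contains two columns named col1 and two named col2, which makes later selection ambiguous. One option, sketched here, is the keys argument of pandas.concat, which adds an outer column level; the labels 'left' and 'right' are arbitrary names chosen for this illustration.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
dataframe2 = pd.DataFrame({'col1': [10, 20, 30], 'col2': [7, 8, 9]})

# keys adds an outer level to the column labels, one entry per input dataframe
side_by_side = pd.concat([dataframe1, dataframe2], axis=1, keys=['left', 'right'])

print(side_by_side)
print(side_by_side[('right', 'col1')])
```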

Joining dataframes:

import numpy as np
import pandas as pd
     
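dataframe1 = pd.DataFrame({'key': ['col', 'col', 'col3', 'col4'], 'val1': [1, 2, 3, 4]})
dataframe2 = pd.DataFrame({'key': ['col3', 'col', 'col', 'col5'], 'val2': [50, 60, 70, 80]})
# Each call overwrites the previous result, so only the last one (how='right') is printed.
# Row order and number formatting in the output can vary with the pandas version.
dataframe = pd.merge(dataframe1, dataframe2, on='key')               # inner join (default): keys present in both
dataframe = pd.merge(dataframe1, dataframe2, on='key', how='outer')  # keys from either dataframe
dataframe = pd.merge(dataframe1, dataframe2, on='key', how='left')   # all keys from dataframe1
dataframe = pd.merge(dataframe1, dataframe2, on='key', how='right')  # all keys from dataframe2
print(dataframe)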
    key  val1  val2
0   col     1    60
1   col     2    60
2   col     1    70
3   col     2    70
4  col3     3    50
5  col5   NaN    80
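Since the example above prints only the last merge, the following sketch compares the four join types on the same data by counting the rows each one returns. The row_counts dictionary is an illustrative helper, not part of pandas.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'key': ['col', 'col', 'col3', 'col4'], 'val1': [1, 2, 3, 4]})
dataframe2 = pd.DataFrame({'key': ['col3', 'col', 'col', 'col5'], 'val2': [50, 60, 70, 80]})

row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(dataframe1, dataframe2, on='key', how=how)
    row_counts[how] = len(merged)
    # Show which keys survive each kind of join
    print(how, len(merged), sorted(merged['key'].unique()))
```

The key 'col' appears twice in each dataframe, so it contributes 2 x 2 = 4 rows to every join; 'col4' survives only in the left and outer joins, and 'col5' only in the right and outer joins.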