> This page is still under development...

Data Science and Big Data: Introduction

According to Wikipedia, Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

Whether data qualifies as "Big Data" or not, we can use Data Science to support data-driven decision making and make better decisions.

What is Big Data?

Big Data refers to volumes of data so large that traditional storage and processing methods reach their limits and no longer fit.

The definition of Big Data rests on three notions (Gartner Group), the "3 Vs": Volume, Velocity, and Variety.

The volume of data collected from connected objects (various sensors), social networks (LinkedIn, Facebook), smart objects, search engines (Google), scientific research centres (CERN), banks, etc. is colossal. Connected objects are proliferating and will interact more and more with many aspects of everyday life. The amount of collected data is growing exponentially, leading to the use of ever larger measurement units such as the terabyte, petabyte, and exabyte.

Measurement Units

SYMBOL  VALUE  NAME
kB      10^3   Kilobyte
MB      10^6   Megabyte
GB      10^9   Gigabyte
TB      10^12  Terabyte
PB      10^15  Petabyte
EB      10^18  Exabyte
ZB      10^21  Zettabyte
YB      10^24  Yottabyte
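As a rough illustration of these magnitudes, the following minimal Python sketch (the function name and sample value are ours, not from any particular library) converts a raw byte count into the decimal units above:

    def human_readable(num_bytes):
        # Walk up the decimal (powers of 1000) unit ladder.
        units = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]
        for unit in units:
            if num_bytes < 1000:
                return f"{num_bytes:.1f} {unit}"
            num_bytes /= 1000
        return f"{num_bytes:.1f} YB"  # anything above 10^24 stays in yottabytes

    print(human_readable(3_200_000_000_000))  # 3.2 TB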

How to deal with Big Data storage and processing

The most popular database management systems (DBMSs) since the 1980s have all supported the relational model (invented in 1970 by Edgar F. Codd), as represented by the SQL language (Structured Query Language).

Well-known DBMSs include MySQL, PostgreSQL, Microsoft SQL Server, Oracle, Sybase, SAP HANA, and IBM DB2.

Those DBMSs are efficient up to data volumes of tens or hundreds of terabytes, and a relational database management system is hosted on a single server.
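To make the relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module (the tables and data are invented for illustration). Two tables are linked by a shared key and combined with an SQL join:

    import sqlite3

    # An in-memory relational database with two tables linked by a key.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
    conn.execute("INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob')")
    conn.execute("INSERT INTO orders VALUES (1, 1, 9.90), (2, 1, 24.50), (3, 2, 5.00)")

    # A join combines rows from both tables through the shared key.
    rows = conn.execute(
        "SELECT c.name, SUM(o.amount) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
    ).fetchall()
    print(rows)  # e.g. [('Alice', 34.4), ('Bob', 5.0)]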

In order to store larger volumes of data, research and experimentation efforts in distributed systems began in earnest in the 1970s and continued through the 1990s, with focused interest peaking in the late 1980s. A number of distributed operating systems were introduced during this period; however, very few of these implementations achieved even modest commercial success.

Nevertheless, to carry out computations, the necessary data had to be transferred over the network from the data site to the computation site. In traditional distributed systems, processing speed therefore depends on network bandwidth. Further difficulties were encountered when servers failed.

The concept of moving computation to the data emerged to overcome this problem: computations are executed on the site where the data resides, so only the instructions to execute and the computation results are transferred over the network. The time saving is substantial. This idea was implemented by Doug Cutting and Mike Cafarella in 2005 in a framework called Hadoop. The Hadoop Distributed File System (HDFS) is one of Hadoop's basic modules and is used for distributed storage.

Hadoop was developed as an open source software framework for the storage and large-scale processing of data sets on clusters. It can store data volumes up to several petabytes and offers flexibility in adding or removing servers from a cluster. Server failures are handled by the system: file contents are replicated on multiple DataNodes for reliability.

An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data.

MapReduce

MapReduce is a generic framework to handle big data by:

  1. splitting the data into subsets
  2. processing the subsets separately on different machines (the "map" step)
  3. putting all the pieces back together (the "reduce" step)

Hadoop is an implementation of MapReduce.
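To make the idea concrete, here is a minimal, single-machine sketch of MapReduce-style word counting in plain Python (the function names and sample documents are ours; in real Hadoop the map and reduce calls run in parallel on many nodes, and the grouping step is handled by the framework):

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # 1. split the input and emit (key, value) pairs
        return [(word.lower(), 1) for word in document.split()]

    def shuffle_phase(pairs):
        # 2. group the emitted values by key
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # 3. put the pieces back together: sum the counts per word
        return {word: sum(counts) for word, counts in groups.items()}

    documents = ["big data is big", "data science uses big data"]
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    print(reduce_phase(shuffle_phase(pairs)))
    # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}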

NoSQL

A distributed database system can no longer afford joins between tables as in a conventional relational DBMS (SQL); the relational model is no longer suited to distributed storage. The term NoSQL (meaning "Not Only SQL") describes this new information storage paradigm: database and data management systems that support new, more efficient ways to access data.
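As a contrast with the join example above, here is a minimal sketch of the key-value/document style of storage that many NoSQL systems use, with a plain Python dict standing in for a real store (the keys and records are invented for illustration). Related data is embedded in one record and fetched by key in a single lookup, which distributes naturally across servers:

    # A plain dict standing in for a key-value/document store.
    store = {
        "customer:1": {"name": "Alice", "orders": [9.90, 24.50]},
        "customer:2": {"name": "Bob", "orders": [5.00]},
    }

    record = store["customer:1"]                  # one key lookup, no join
    print(record["name"], sum(record["orders"]))  # Alice 34.4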

The R and Python languages for processing big data

R is an open source statistical programming language and environment. The R project was conceived in the 1990s. R is an interpreted language; it is limited to in-memory data processing but offers a wide variety of statistical and graphical techniques, and it is popular for its strong visualisation capabilities. Modern environments have extended R to overcome its in-memory limitations, either through dedicated libraries (e.g. bigmemory) or by integrating R into a distributed architecture (e.g. RHadoop).

Python is a scripting programming language conceived in the late 1980s. It is great at handling text and unstructured data, and it is becoming more and more popular in data science. Data science libraries have transformed Python from a general-purpose programming language into a powerful and robust tool for data analysis and visualization.
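As a small illustration of that point, here is a minimal sketch of a typical exploratory step, assuming the pandas library is installed (the data is invented):

    import pandas as pd

    # A tiny table of invented sales figures.
    df = pd.DataFrame({
        "city": ["Paris", "Lyon", "Paris", "Lille"],
        "sales": [120.0, 80.0, 60.0, 40.0],
    })

    # Group, aggregate, and sort -- a typical exploratory step.
    print(df.groupby("city")["sales"].sum().sort_values(ascending=False))
    # city
    # Paris    180.0
    # Lyon      80.0
    # Lille     40.0
    # Name: sales, dtype: float64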