This page is still under development...

Dealing with large data sets with Linux commands


  1. Available disk space on your system to check if it is possible to download all the data.
  2. Import the data (e.g., wget)
  3. Always work on a a copy if you have enough space available
  4. Compute the number of files if you have more than one
  5. Size of each file (e.g., du -ch *) and type of file (e.g., zip or tar.gz)
  6. Check the total size of the uncompressed files before uncompressing them (e.g., gzip -l file.gz)
  7. Uncompress the files if enough available space
  8. Describe the content of each file (e.g., rectangular table) by editing header lines (e.g., head or less).
  9. When apropriate, merge all files in a single one
  10. Check integrity of the data (e.g., missing value, expected type for each column such as integer, string, date, ...) using statistical techniques (e.g., distribution tables, graphs)
  11. Edit files to correct errors (e.g., replace missing values with NA)

We are going to analyse large data sets working with file, whose size is at least 50% larger than available RAM (but no more than tens or hundreds of GB) using different languages: Linux command tools, R and Python.

By default, R loads files into memory (RAM). If a file is bigger than the available RAM, it cannot load it. Also, even if a file can fits into memory, when you start working on it, you may start creating copies of it (or creating large matrices of output for some analyses) which may quickly overload the RAM.

Required environment

Linux tools

You should get familiar with these tools (e.g., use man cut): refer to the Linux useful tools section

What are these files about ?

Our application will be based on csv files extracted from "The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed."

The aim of the data expo was to focus on different aspects of the data... answering questions as :

The site suggested other ideas to explore data:

"You are also welcome to work with interesting subsets: you might want to compare flight patterns before and after 9/11, or between the pair of cities that you fly between most often, or all flights to and from a major airport like Chicago (ORD). Smaller subsets may also help you to match up the data to other interesting datasets."

At first, We are going to load and unzip csv files, it may take up to 45 min depending on your internet speed.

Linux Users

To check your filesystem available space : df -h

As /tmp content is usually cleaned up at every boot, you should consider saving the `Data/ folder in your home directory (if you have enough space, if so just replace /tmp with /home)

cd /tmp ; mkdir Data ; cd Data


Then you can list the downloaded files and check the total size of these files.

ls -sh

total 1,6G
 13M 1987.csv.bz2   48M 1992.csv.bz2   74M 1997.csv.bz2   73M 2002.csv.bz2  116M 2007.csv.bz2
 48M 1988.csv.bz2   48M 1993.csv.bz2   74M 1998.csv.bz2   91M 2003.csv.bz2  109M 2008.csv.bz2
 47M 1989.csv.bz2   49M 1994.csv.bz2   76M 1999.csv.bz2  106M 2004.csv.bz2
 50M 1990.csv.bz2   72M 1995.csv.bz2   79M 2000.csv.bz2  108M 2005.csv.bz2
 48M 1991.csv.bz2   73M 1996.csv.bz2   80M 2001.csv.bz2  110M 2006.csv.bz2

It is now a good idea to compute the size of these files once uncompressed. If these files were *.gz files we could have used gzip -l file.gz to do so. Unfortunately, bzip2 doesn't provide an option to display the size of the uncompressed file before actually uncompressing it. Nevertheless, we can do as follows:

sizeinbytes="$(bzcat {1987..2008}.csv.bz2 | wc -c)"

Size in Bytes: echo $sizeinbytes


Size in MB: echo "$sizeinbytes /1024/1024" | bc


Size in GB: echo "$sizeinbytes /1024/1024/1024" | bc


The advantage is that no disk space was used to compute these sizes. The disadvantage is that it takes time (and that it will take again the same time to really uncompress the files).

Since we have enough free space, we uncompress the files.

bzip2 -d *.bz2

and wait ... quite a long time.

Then we can list the uncompressed files and check the total size.

wc -c *

  127162942 1987.csv
  501039472 1988.csv
  486518821 1989.csv
  509194687 1990.csv
  491210093 1991.csv
  492313731 1992.csv
  490753652 1993.csv
  501558665 1994.csv
  530751568 1995.csv
  533922363 1996.csv
  540347861 1997.csv
  538432875 1998.csv
  552926022 1999.csv
  570151613 2000.csv
  600411462 2001.csv
  530507013 2002.csv
  626745242 2003.csv
  669879113 2004.csv
  671027265 2005.csv
  672068096 2006.csv
  702878193 2007.csv
  689413344 2008.csv
12029214093 total

This corresponds to the previous output.

For more details about the size of each file as well as the total:

ls -sh (or du -ch *)

total 12G

122M 1987.csv 469M 1991.csv 507M 1995.csv 528M 1999.csv 598M 2003.csv 671M 2007.csv 478M 1988.csv 470M 1992.csv 510M 1996.csv 544M 2000.csv 639M 2004.csv 658M 2008.csv 464M 1989.csv 469M 1993.csv 516M 1997.csv 573M 2001.csv 640M 2005.csv 486M 1990.csv 479M 1994.csv 514M 1998.csv 506M 2002.csv 641M 2006.csv

If you want to count the number of files you just unzipped:

ls * | wc -l


We have 22 files.

Let's have a look to the content of these files.

head 1987.csv


If we want only the header:

head -n 1 1988.csv

This should only display the first line.


We can see that this header (i.e., names of the variables) is present in all files, for example in the second:

head -n 1 1989.csv

At this point, we want to merge all files in one, so we will have to remove the header from all files but one. We investigated several commands to perform this operation and kept the fastest one. Details are given below using the time function.

Testing Linux tools for data scrubbing

Removing the header line

We are going to remove the header from this file and time this action with different Linux tools.

The time command is used to know how long the command takes to run.

The time option -p makes sure the result respects the POSIX standart (The Portable Operating System Interface is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems).

We are going to test different commands on a copy of the 1987.csv file.

Note that the result of the time command differ depending on processors speeds, disks performances and many factors (refer to the time command )

With awk

We first try with the awk command:

cp -f 1987.csv 1987-copy.csv
time -p awk 'NR != 1 {print $0}' 1987-copy.csv | sponge 1987-copy.csv

Redirection operator (> or >>) can not be used to redirect the content of 1987-copy.csv to itself, because these operators have a higher priority than the command and create/truncate the file before the command is even invoked. To avoid that, you should use appropriate tools such as tee, sponge, editors or any other tool which can write results to the file (e.g. sed -i or sort file -o file).

The time the awk command took to run on datalyptus server is as follows (the result certainly differ from your result):

real 10.88
user 0.66
sys 2.51

user and sys refer to CPU time used only by the process independantly from other processes which shares the CPU. real refers to the execution time including the execution time of the other processes.

user+sys = 3.17sec

If we try this different option awk 'NR>1' 1987-copy.csv | sponge 1987-copy.csv it also works

cp -f 1987.csv 1987-copy.csv
time -p awk 'NR>1 {print $0}' 1987-copy.csv | sponge 1987-copy.csv

The time (in seconds) the awk command took to run on datalyptus server is as follows (the result certainly differ from your result):

real 10.60
user 0.68
sys 2.45

user+sys = 3.13sec

We can note that to run the awk command with the testing pattern made of the '>' comparison sign or the inequality sign do not significantly change the run time.

With sed

sed -i '1d' 1987-copy.csv

cp -f 1987.csv 1987-copy.csv
time -p sed -i '1d' 1987-copy.csv

The result is as follows:

real 29.16
user 1.79
sys 4.84

user+sys = 6.63sec

with ex editor

To supress the first line of the file and redirect input to the same original file you can use appropriate in-place editors as Ex editor (part of Vim)

ex -c ':1d' -c ':wq' 1987-copy.csv


cp -f 1987.csv 1987-copy.csv
time -p ex -c ':1d' -c ':wq' 1987-copy.csv

The result is as follows:

real 15.66
user 2.56
sys 0.88

user+sys = 3.44sec

With the tail command:

time -p tail -n +2 1987-copy.csv | sponge 1987-copy.csv

real 6.21
user 0.05
sys 1.73

user+sys = 1.78sec

With the more command
real 6.69
user 0.08
sys 1.74

user+sys = 1.82sec

The more and tail command are the faster ways to remove the file header.

It would be better to get a more accurate benchmark of the tools to run each commands one hundred times and make an average of the execution time for each tool (as the CPU run time is different at each execution).

Merging two files

With | and sponge

Let’s make a copy of two files:

cp -f 1987.csv 1987-copy.csv
cp -f 1988.csv 1988-copy.csv
wc -l 1988-copy.csv
wc -l 1987-copy.csv

The number of lines for each file is:

1311827 1987-copy.csv
5202097 1988-copy.csv

Now let’s merge these two files into 1988-copy.csv and time it:

time -p cat 1987-copy.csv 1988-copy.csv | sponge 1988-copy.csv

real 40.35
user 0.03
sys 10.46

And concludes the Total CPU time = user + sys = 10.49sec

We can check the number of lines for the resulted file with the wc command:

wc -l 1988-copy.csv

6513924 1988-copy.csv

And check it worked as the total number of lines 6513924 is the sum of the number of lines of the two merged files (1311827 + 5202097).

With the redirection sign >>

Now let’s merge the same files with the redirection sign >> and time it again:

cp -f 1987.csv 1987-copy.csv
cp -f 1988.csv 1988-copy.csv
time -p cat 1988-copy.csv >> 1987-copy.csv
real 10.07
user 0.01
sys 2.08

And concludes the Total CPU time = user + sys = 2.09sec which is much faster than the 10.49sec we previously found.

We can also check the number of lines for the resulted file with the wc command:

wc -l 1987-copy.csv

6513924 1987-copy.csv

** We can conclude the method using the cat command and the redirection sign >> is much faster in merging the two files. This can be explained by the fact the >> sign will redirect the result of the cat command to the end of the second file. This way you do not have to copy both contents but just the first file one.**

Merging and removing header from all files

Now we are going to concatenate all files into airline.csv removing the header lines (except from 1987.csv). This takes around 2mns with command line tools:

nbfiles="$(ls {1987..2008}.csv | wc -l)" 
nblinesbefore="$(cat {1987..2008}.csv | wc -l)"
mv 1987.csv airline.csv
# The header line is removed from all the files except from the 1987.csv and 
# the content of the processed file is redirected to airline.csv
for res in {1988..2008}; do
 tail -n +2 ${res}.csv >> airline.csv
 # We delete each file once processed:
 rm ${res}.csv
# Since the first line of each file was removed except the first one, the total number of lines in the resulting file is given by:
echo $nbTotalLines |bc

We can check that the total number of lines is as expected for airline.csv:

wc -l airline.csv

123534970 airline.csv

Everything is fine!

Now we have one big CSV (comma separated values) file named airline.csv.

You need to refer to the website for a description of these variables: each columns is described as follows:

col Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes

Checking file and data integrity

It is now time to check for the integrity of both the csv file and our data. The file is too large to visually notice missing data or errors. Firstly we are going to look for any missing separators.

Checking the integrity of the field separators from the csv file:

Knowing we should have 29 variables for each line (record), we are going to check we have 29 fields per line(record):

date;awk 'BEGIN{FS=","} NR>1 && NF!=29{print "MISSING VALUE IN line #",FNR,"(",NR,"scanned records)\n\n ",$0,"\n"}' airline.csv;date
mardi 11 avril 2017, 11:19:01 (UTC+1000)
mardi 11 avril 2017, 11:25:00 (UTC+1000)

As nothing but dates were printed out, we can conlude the csv file was correctly created and no missing separators were encountered.

Checking data integrity

When for a given variable, the number of possible values is small and known in advance, we can create a table of counts of these possible values. This will allow us to check the integrity of the data (e.g., how many unespected values, how many missing values,...). From the description of the variable, we should do this for variables 1 (Year), 2 (Month), 3 (DayOfMonth, 4 (DayOfWeek), 9 (UniqueCarrier), 10 (FlightNum), 11 (TailNum), 17 (Origin), 18 (Dest), 22 (Cancelled), 23 (CancellationCode) and 24 (Diverted).

More generally, a table of counts of the other variables can also highlight some information. We are going to create a table of counts for all variables (29 columns) and store each table in a different file (count1.txt for the first var, count2.txt for the second var, ...):

As running a code on a large file may take a long time you may use the exit instruction combined ẁith NR conditions to test your awk script on a few lines first. For example, to process the first six lines of airline.csv and create a table of counts (count of the value, value) stored in count*.txt:

for var in {1..29}; do  awk 'BEGIN{ FS=","}NR>1 {var_count[$'$var']++}NR>6{for(value in var_count)print var_count[value], value;exit}' airline.csv>count${var}.txt; done; 

If everything is fine, we can process the command on the entire file:

date;for var in {1..29};  do  awk 'BEGIN{FS=","}NR>1 {var_count[$'$var']++}END{for(value in var_count)print var_count[value], value}' airline.csv>count${var}.txt; done;date;

For the variable 23, where the number of possible values should be only 5 (NA=Non Applicable, A = carrier, B = weather, C = NAS, D = security) we can have a look to the tmp23.txt

cat count23.txt | sort -n 601 D 149079 C 267054 B 317972 A 38955823 83844440 NA

We can notice that there are 38955824 empty values that should be set as NA.

R analysis

First, launch the memory monitor (e.g., gnome-system-monitor ).

Then Launch R

X <- read.csv("airline.csv", header = TRUE)

As you can see, this (most probably) fails.

Now, we use the bigmemory R package to read the file.

Most parts of what follows comes from slides of a talk given by John W. Emerson “Jay” and Michael J Jane from Yale University, at the UseR! 2009 conference. They used the data set airline.csv which unfortunately is not well formed (more on this later).

To download bigmemory library: install.packages("bigmemory", repos="")

# Takes around 16mns:
X <- read.big.matrix("airline.csv",
header = TRUE, type = "integer",
backingfile = "airline.bin",
descriptorfile = "airline.desc", 
extraCols = "age")

­­­>Unfortunately, all character codes from the files (i.e., columns 9, 17, 18 and 23) have been replaced by NA values. This is because we used the integer type (matrix parameter). Using the char type will not help (since this is only a C way of storing values on just 1 byte, so with only 256 different values). This might change in the future but for the moment we need to to replace all character entries by an integer numeric code.

Data Formating for the bigmemory library

To use bigmemory you need to replace all string codes from columns 9 (unique carrier code), 17 (origin IATA airport code), 18 (destination IATA airport code) and 23 (CancellationCode) with an integer code.

To do so, we will just replace the first column of the count.txt files with the line number which will stand for the integer code and keep the real code in the second column (string code) and store the result in code.txt.

In the 17th column representing the Origine variable (origin IATA airport code), we will set the integer 1 to stand for AA and 2 for XE... and store the result in code17.txt

date ; for i in {9,17,18,23}; do cut -d' ' -f2 count${i}.txt | cat -n | sed 's/ \+ //g' | sed 's/\t/,/g>code${i}.txt ; done; date

We can display the content of code9.txt to check everything is fine:

head code17.txt
   1    AA
   2    XE
   3    DL
   4    OO
   5    MQ
   6    PA
   7    TW
   8    B6
   9    FL
  10    AQ
  11    CO
  12    ML
  13    EA
  14    TZ
  15    AS
  16    YV
  17    HA
  18    UA
  19    PS
  20    OH
  21    9E
  22    NW
  23    HP
  24    WN
  25    F9
  26    DH
  27    PI
  28    EV
  29    US

**We are going to replace empty values ,, by ,NA, in the airline.csv file. We replace also all character entries by their corresponding codes (as given in the count*.txt files).**

In the awk command:

The following code Takes about 23mns to run:

date ; awk -F, '(FILENAME == ARGV[1]){a[$2]=$1;next}(FILENAME == ARGV[2]){b[$2]=$1;next}(FILENAME == ARGV[3]){c[$2]=$1;next}(FILENAME == ARGV[4]){d[$2]=$1;next}((FNR > 1) && FILENAME == ARGV[5]){OFS=",";$9=a[$9];$17=b[$17];$18=c[$18];$23=d[$23];gsub (",,", ",NA,", $0);}1' code9.txt code17.txt code18.txt code23.txt airline.csv>airline2.csv;date

Launch R

# Takes around 16mns:
X <- read.big.matrix("airline.csv",
header = TRUE, type = "integer",
backingfile = "airline.bin",
descriptorfile = "airline.desc", 
extraCols = "age")

Subsequent sessions can connect to the backing instantaneously, and we can interact with it (e.g., compute some statistics):

xdesc <- dget("airline.desc")
# The following command will be executed in a very short time compare to the previous read.big.matrix(...)
x <- attach.big.matrix(xdesc)

system.time(a <- x[,1])
max(x[,]) # This one will not work! -> Error in : Too many indices (>2^31-1) for extraction.
range(x[,1], na.rm = TRUE)
tail(x, 1)

Can we get all flights from JFK to SFO? Sure!

a <- read.table("count17.txt", sep = ",")
JFK <- a$V1[a$V2 == "JFK"]
SFO <- a$V1[a$V2 == "SFO"]
y <- x[x[, "Origin"] == JFK & x[, "Dest"] == SFO,]

var1 <- read.table("count1.txt", sep = " ")
barplot(var1$V1, names = var1$V2)

var26 <- read.table("count26.txt", sep = " ")
var26 <- var26[with(var26, order(V2, V1)), ]
var26 <- na.omit(var26)[-1,]
breaks26 <- hist(var26$V2, plot = FALSE)$breaks/10
count26 <- rep(NA, length(breaks26) - 1)
for (i in 1:(length(breaks26) - 1)) {
    count26[i] <- sum(var26$V1[(breaks26[i] < var26$V2) & (var26$V2 <= breaks26[i + 1])])
myhist26 <- list(breaks = breaks26, counts = count26)
class(myhist26) <- "histogram"


# The column range for the first column
colmean(x, 1, na.rm = TRUE)
# The first column is cached a second operation
# on the column is fast.
colrange(x, 1, na.rm = TRUE)

When is the best hour of the day to fly to minimize delays? A simple computation done in parallel on 3 cores.

gnome-system-monitor ```

registerDoMC(cores = 3)
probs <- c(0.9, 0.99, 0.999, 0.9999)
desc <- describe(x)

# delays by hour of day.
anshourofday <- foreach (i = seq(0, colmax(x, "CRSDepTime") - 1, by=60),
.combine = cbind)%dopar%
x <- attach.big.matrix(desc)
ind <- mwhich(x, "CRSDepTime", c(i, i + 60),
comps = c('ge', 'lt'))
m <- cbind(probs, quantile(x[ind, "DepDelay"],
probs = probs, na.rm = TRUE))
colnames(m) <- c("Probabilites", "Quantiles")

Try to understand this code!

CRSDepTime : scheduled departure time (local, hhmm). DepDelay : departure delay, in minutes.

What has been done just above is not correct!. Indeed, the variable CRSDepTime should be in hhmm format in the file airline.csv but this is not the case. Moreover, they assumed that this variable is in minutes, which is wrong!

What should be done?

Get back to the original website to download the original data: We will not do it since it would take too much time. Also what could be interesting would be to write scripts to download the files automatically, without having to click on the website.

Do older planes suffer more delays? Maybe. A computationally intensive example done in parallel.

uniqTailNum <- na.omit(unique(x[, 11]))
uniqTailNum.len <- length(uniqTailNum)
# 166 different planes whose TailNum is known
planeStart <- big.matrix(nrow = uniqTailNum.len,
ncol = 1, shared = TRUE)
psDesc <- describe(planeStart)
foreach(i=1:uniqTailNum.len) %dopar%
x <- attach.big.matrix(desc)
planeStart <- attach.big.matrix(psDesc)

# The first year plane i can be found:
yearInds <- mwhich(x, "TailNum", uniqTailNum[i],
comps = 'eq')
minYear <- min( x[yearInds, "Year"], na.rm = TRUE )
# First month in minYear where the plane can be found:
minMonth <- min( x[yearInds, "Month"], na.rm = TRUE )
planeStart[i, 1] <- 12 * minYear + minMonth

BadTailNum <- mwhich(x, 11, NA, 'eq')
x[BadTailNum, 30] <- NA
MoreRecentDate <- max(x[,1]) * 12 + max(x[,2])
system.time(foreach(i=1:uniqTailNum.len) %dopar%
x <- attach.big.matrix(desc)
planeStart <- attach.big.matrix(psDesc)
tmpInds <- mwhich(x, 11, uniqTailNum[i], 'eq')
x[tmpInds, 30] <- as.integer(MoreRecentDate -
planeStart[i, 1])
Column "age" (i.e., 30) of x is now filled.

blm1 <- biglm.big.matrix(ArrDelay  ̃ age, data = x)
( out <- summary(blm1) )
blm2 <- biglm.big.matrix(ArrDelay  ̃ age + Year, data = x)

Revolution Analytics’ RevoScaleR package overcomes the memory limitation of R by providing a binary file (extension .xdf ) format that is optimized for processing blocks of data at a time. The xdf file format stores data in a way that facilitates the operation of external memory algorithms. RevolScaleR provides functions for importing data from different kinds of sources into xdf files, manipulating data in xdf files and performing statistical analyses directly on data in these files.

This package is available in Revolution R Enterprise which is non-free (owned by Microsoft now...):

Analyze airline.csv with RevoScaleR :

Reste à faire: Corriger les erreurs

Pour les autres colonnes: Box plots ou histogrammes en R Le jeu de données étant trop volumineux, on importera une colonne à la fois

Utilisation de BigMemory:

Old stuff

Count the number of flights for each flight number in 2008 and save it to 2008- flights.csv:

cut -f9,10 -d, 2008.csv | sort | uniq -c > 2008-flights.csv


Sort by the 10th column (flightnum) (Source sort -t, -k 10,10 2008.csv

Filter rows using awk (Source

To show flights from Des Moines to Chicago O’hare:

awk -F, '$17 == "DSM" && $18 == "ORD"' 2008.csv

Filter columns using cut (source

Select only columns 9 (carrier) and 10 (flight num):

cut -f9,10 -d, 2008.csv