Like the total hopefully gets across, I’m not using the command line to do advanced data science. I’m doing first aid, not surgery. There are some basic questions you can answer by using the command line instead of having to write Python or R or import into a database. What I’m writing here isn’t the most efficient use of command line utilities, but it will get the job done, even for files a few gigabytes in size.
For my work, I’ll often have a CSV or a semi-structured log file on a server somewhere that I want to “query” quickly without having to download the file to my laptop to analyze. Servers that run code have constraints in that they aren’t going to have fancy Python packages or other more advanced tools. What they’re going to have are the basic core utilities included on a linux install.
SQL equivalent: when doing a select on potentially a lot of data, you’ll probably do a limit 10 on the query so that you don’t return all rows. Using head limits the output to 10 lines which makes interactive development easier.
What can we see here?
Row-wise, we have 2 lines of headers and then the data. For columns, the first few columns are a lot of geographic identifies the Census uses, then a friendly name, then population values from 2010 through 2016.
Single out the columns we want
cut is a utility that we can use to get all this data into a manageable state.
SQL equivalent: going from select * to selecting only the columns you actually want
We pass cut a few different parameters:
-d is the delimiter, since this a CSV we use a comma
-f takes column numbers, which columns we want. Note that cut doesn’t understand quotes, so while using a real CSV parser this would be 2 columns, we have to select 3 to get all the fields we want.
Formatting the data a little more
Two things with our output that I don’t like. One is we still have the two header rows and the other are the quotes around the place name. I’d like to see the place, the state, and the population as 3 distinct columns.
Switching to tail and starting at line 3 will get rid of the header columns for us.
Going back to using cut to get only the columns we want:
We can use sed to do a basic search and replace to get rid of the quotes and leave us with a plain csv file.
First question: Should we get rid of the “city”?
This data is sorted by population size, so the largest values are what we’re seeing at the top of the file. This city value seems to get repeated a lot, is it the same for all the data we have?
SQL equivalent: grep -v is the same as where != on a line by line basis. So grep -v city will return lines that don’t have “city” in them.
So there are some entries that don’t have city in them, so let’s keep that in.
“Saving” our work
We have a pretty sizeable command built up now. Let’s write this new format to a file to save our work.
At the end of our command (without head limiting it), we put the > character and a file name which redirects the output to the file.
Answering some questions
We have the data in a good format now, let’s answer some questions.
How many records do we have?
Since a CSV is one record per line, we just count the number of lines.
761 cities in our list.
Useless use of cat warning: Some people will be a jerk about writing a command like the one above, since wc can take the file directly, it could just be wc -l populations.csv. If that was the only command I was writing then sure. But playing with data is an interactive endeavor, changing the commands and arguments around many times. It’s a lot faster to edit the end of the command rather than jumping to the beginning of the command to edit it. So while our use of cat here is “useless”, that argument is missing the forest for the trees.
How many places in PA?
I live in Pennsylvania, so let’s see my home state.
SQL equivalent: grep is like a where = on a line by line basis.
How many places for each state?
This will take a few steps.
First, let’s only get the state field.
Now, we can use sort and uniq to group the values together.
uniq has to be paired with sort because uniq only works on sorted input.
Using uniq is similar to using a SQL group by. What the above command is doing is sort of like:
Sorting the results by state and then by population
This is one of my favorite tools. sort is often used on the entire line of input, but can be made to use individual fields of the line.
Let’s break the arguments down:
-t - This is the field delimiter, what our columns are separated by.
-k2,2 - This is the first “key” to sort on. We’re starting at the second column and ending at the second column, so this sorts by state.
-k3,3nr - This is the second key to sort on, the third column which is population. We add n to sort numerically not alphabetically and r to sort largest to smallest.
The command will output the entire file sorted, but we can look at some individual examples to see the results.