Author: Steve Miller
I recently came across an interesting account by a practicing data scientist of how he munged 25 TB of data. What caught my eye at first was the article’s title: “Using AWK and R to parse 25tb”. I’m a big R user now and made a living with AWK 30 years ago as a budding data analyst. I also empathized with the author’s recounting of his painful but steady education in working with that volume of data: “I didn’t fail a thousand times, I just discovered a thousand ways not to parse lots of data into an easily query-able format.” Been there, done that.
After reading the article, I was intrigued by AWK all over again after all these years. A Unix-based munging predecessor of perl and python, AWK is particularly adept at working with delimited text files, automatically splitting each record into fields identified as $1, $2, etc. My use of AWK generally revolved around selecting columns (projecting) and rows (filtering) from text files, in turn piping the results to other scripts for additional processing. I found that AWK did these simple tasks very well but didn’t scale for more demanding data programming; I remember well that trouble lurked when I attempted to contort AWK to do something it wasn’t intended to do. And indeed, I pretty much abandoned AWK when the more comprehensive perl emerged in the late 80s. In retrospect, I’m not sure that was the best course. Optimal might have been to continue using AWK for the simpler file projection and filtering work, saving perl (and then python) for more complex tasks.
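For readers whose AWK is as rusty as mine was, a one-liner illustrates the projection/filtering idiom. The file name, delimiter, and field positions below are hypothetical, not taken from the article:

    # Filter: keep records whose third field is "CA".
    # Project: print only fields 1 and 5 of each matching record.
    gawk -F',' '$3 == "CA" { print $1, $5 }' input.csv

Because gawk splits each record on the -F delimiter automatically, no explicit parsing code is needed, and the output can be piped straight into the next script in a pipeline.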
So I just had to reacquaint myself with AWK, and downloaded the GNU version, gawk. I then divined several quick tasks on a pretty large data source to test the language. The data for analysis consist of four large files of census information totaling over 14 GB, which in sum comprise 15.8M records and 286 attributes. I use AWK to project/filter the input data, and then pipe the results to python or R for analytic processing. AWK does some pretty heavy albeit simple processing. In my tests, both R and python/pandas could have handled AWK’s tasks as well, but it’s not hard to imagine a pipeline that requires up-front projection/filtering.
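A minimal sketch of the pipeline shape follows; the file name, script name, and field positions are placeholders, not the actual census columns or scripts used below:

    # gawk projects three of the 286 attributes and filters records,
    # streaming the reduced data to python so pandas never has to
    # load the full 14 GB. NR == 1 passes the header row through.
    gawk -F',' 'NR == 1 || $7 == "1" { print $1 "," $2 "," $12 }' census.csv |
      python analyze.py

Swapping the python script for an Rscript invocation at the end of the pipe gives the R variant of the same pipeline.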
Unlike other blogs I’ve written using Jupyter Notebook, this one does not execute in a python or R kernel; rather, the notebook simply displays the AWK, python, and R scripts and their outputs.
The technology used below is Windows 10, JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, R 3.6.0, Cygwin 3.0.7, and GNU Awk (gawk) 5.0.1. All gawk, python, and R scripts are simply components in pipelines generated from bash shell command lines in Cygwin windows.
Read the entire blog here.