How I Learned Data Munging


Any project involving data requires a specific format. Visualization libraries such as ggplot2 or matplotlib expect data shaped a certain way, and any modeling or prediction work has its own requirements too. Most projects take several iterations of plotting or analysis, so data munging is a skill that you'll use a LOT.
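
For example, ggplot2 generally wants "long" data with one observation per row, while spreadsheets usually arrive "wide". Here's a minimal sketch of that kind of reshape; the sales table and column names are made up for illustration:

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one column per region
sales <- data.frame(
  month = c("Jan", "Feb", "Mar"),
  east  = c(120, 150, 170),
  west  = c(90, 110, 95)
)

# Stack the region columns into key/value pairs so ggplot2 can map them
sales_long <- pivot_longer(sales, cols = c(east, west),
                           names_to = "region", values_to = "units")

ggplot(sales_long, aes(x = month, y = units, fill = region)) +
  geom_col(position = "dodge")
```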

I used to struggle with this quite a bit. My philosophy on data transformations is simple: just get it done. Use whatever tools you're comfortable with to finish as fast as possible (without sacrificing quality). In my opinion, this is part of the "work" of analysis. If you can't format the data, everything else is at a standstill.

My toolbox is likely different from yours. I have a lot of experience with Excel, so I used that for years. Data cleaning is a chore in Excel; it's a very manual process. Then I got wise and started using scripts and other tools. Now I can usually get what I want out of a dataset within an hour. I use R for all of my projects. I also use PowerShell quite often (AWK or sed when on my Mac), Excel, sometimes Python, and OpenRefine if I need to extend the data with API calls. Lately I've started learning Perl as well.
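
A "get it done" cleaning pass in R often looks something like this sketch (the file and column names are invented):

```r
library(dplyr)

raw <- read.csv("survey_raw.csv", stringsAsFactors = FALSE)

clean <- raw %>%
  rename_with(tolower) %>%                      # consistent column names
  mutate(response = trimws(response)) %>%       # strip stray whitespace
  filter(!is.na(age), age > 0, age < 120) %>%   # drop obviously bad rows
  distinct()                                    # remove duplicate records

write.csv(clean, "survey_clean.csv", row.names = FALSE)
```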

I tend to work with CSV files most of the time unless the data is really large (though I haven’t had trouble with file sizes up to 400 MB). This format is easy to work with in any tool, and it also makes it easy to switch between tools.
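
For the larger CSVs, something like data.table's fread() keeps things quick; this is just a sketch with invented file and column names:

```r
library(data.table)

dt <- fread("transactions.csv")                     # fast read of a few hundred MB
fwrite(dt[amount > 0], "transactions_positive.csv") # filter, then hand off to the next tool
```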

Once I have the data in the format needed, I do a couple of things. First, I back up all of the code used to clean it. This way I have a library of tricks available to me later, and it's easy to apply my methods elsewhere as appropriate. Second, I make notes of what I did and why. This has helped me quickly recognize which methods are called for in a given situation.
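
In practice that "library of tricks" is mostly small, documented functions that get reused across projects. A made-up example of the kind of thing that ends up in it:

```r
# Convert messy currency strings like " $1,234.50 " to numeric.
# Why: exported reports often arrive with symbols and thousands separators.
# Anything with no digits at all comes back as NA (with a coercion warning).
parse_currency <- function(x) {
  as.numeric(gsub("[^0-9.-]", "", x))
}

parse_currency(c("$1,234.50", " 87 ", "N/A"))
#> [1] 1234.5   87.0     NA
```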

When I started down this path, I thought it was somewhat lazy because I wasn't sticking to a specific tool. Honestly, every tool I listed above is powerful enough to handle most types of data transformation if you're familiar enough with it and the packages/libraries available. Along the way, though, an interesting thing happened. By keeping notes and referring to previously used code, I started asking better questions about how to manipulate data. Can this be done faster? Do I really need 400k rows to answer a question? Would it simplify this plot if I used weekly data instead of daily?
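
That last question is usually a quick aggregation. A sketch of rolling daily counts up to weekly, assuming a data frame with date and visits columns (both invented here):

```r
library(dplyr)
library(lubridate)

daily <- data.frame(
  date   = seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by = "day"),
  visits = rpois(90, lambda = 200)   # fake traffic numbers
)

weekly <- daily %>%
  mutate(week = floor_date(date, unit = "week")) %>%  # bucket each date into its week
  group_by(week) %>%
  summarise(visits = sum(visits))
```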

This path also led me to explore (and then learn) regex. That simplified my work quite a bit. Then I found that XPath is fantastic and can be used almost anywhere.
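
Two quick illustrations in R, with the patterns and URL invented for the example:

```r
# Regex: pull a four-digit year out of messy labels
labels <- c("Report FY2014 (final)", "draft-2015_v2")
regmatches(labels, regexpr("[0-9]{4}", labels))
#> [1] "2014" "2015"

# XPath: grab every link on a page with the xml2 package
library(xml2)
page  <- read_html("http://example.com")
links <- xml_find_all(page, "//a")   # every anchor element
hrefs <- xml_attr(links, "href")     # their href attributes
```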

Now I’m at the point where I can realistically use a single tool for a project, but I still prefer to use a variety of tools. I find that it’s faster to start with a scripting tool like PowerShell, then switch over to R for analysis or plotting.

I want to learn more about Perl. I also want to go further with OpenRefine.