Any project involving data requires a specific format. Visualization libraries such as ggplot2 or matplotlib work with specific types of data. Any modeling or prediction is going to require a specific format. Most of the time a project requires several iterations of plotting or analysis, so data munging is a skill that you’ll use a LOT.
Last week I posted several scripts useful for cleaning up URL data with PowerShell. If you work in search marketing, one of the next logical steps is to gather data on these URLs, for example as part of a content audit of your website. There are (at least) a hundred ways to scrape content from the web. One of the easiest methods that I’ve found is within Google Drive, using the IMPORTXML() function and a couple of XPath queries.
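In a spreadsheet, the formula takes a URL and an XPath query, along the lines of =IMPORTXML("https://example.com", "//title"). To get a feel for what a query like that returns, here is a rough Python sketch of the same idea. It is not what Google does internally: the sample page, the function names, and the queries are made up for illustration, and the standard library's ElementTree supports only a subset of XPath and needs well-formed markup, so real-world HTML would likely call for a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for the content of a fetched URL.
SAMPLE_PAGE = """<html>
  <head><title>Example Product Page</title></head>
  <body>
    <h1>Blue Widget</h1>
    <a href="/widgets/red">Red Widget</a>
    <a href="/widgets/green">Green Widget</a>
  </body>
</html>"""


def xpath_text(markup, query):
    """Return the stripped text of every element matching the query."""
    root = ET.fromstring(markup)
    return [(el.text or "").strip() for el in root.findall(query)]


def xpath_attr(markup, query, attr):
    """Return the named attribute of every matching element that has it."""
    root = ET.fromstring(markup)
    return [el.get(attr) for el in root.findall(query) if el.get(attr) is not None]


# ElementTree queries are relative to the root element, hence the leading ".".
titles = xpath_text(SAMPLE_PAGE, ".//title")
headings = xpath_text(SAMPLE_PAGE, ".//h1")
links = xpath_attr(SAMPLE_PAGE, ".//a", "href")

print(titles, headings, links)
```

Each query plays the role of one IMPORTXML() cell: one pulls the page title, one the main heading, and one the link targets, which covers most of what a basic content audit needs.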
As an analyst working within the marketing world, most of the data that I see includes URLs. In fact, much of the data is focused on URLs. I tend to scrub URLs a lot, and have collected a handful of scripts to help me. Hopefully this will help you as well. I’ve included references where appropriate.
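To give a sense of what "scrubbing" can mean in practice, here is a small Python sketch. It is not one of the scripts mentioned above; the function name, the choice of cleanup rules, and the list of tracking prefixes are all assumptions picked for illustration. It lowercases the host, drops the fragment, removes a trailing slash, and strips query parameters that look like campaign tags.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query-parameter prefixes to treat as tracking noise.
TRACKING_PREFIXES = ("utm_",)


def scrub_url(url):
    """Return a normalized copy of url (illustrative rules, not a standard)."""
    parts = urlsplit(url.strip())
    # Keep only query parameters that do not start with a tracking prefix.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Drop a trailing slash, but keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # An empty string in the last slot discards the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )


print(scrub_url("HTTP://Example.com/Blog/?utm_source=news&id=7#top"))
```

Whether to keep the path's case or the trailing slash depends on how your site treats them, so adjust the rules to match your own data before relying on the output.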