Cleaning URL Data in PowerShell


As an analyst working in the marketing world, most of the data I see includes URLs. In fact, much of it is focused on URLs. I tend to scrub URLs a lot, and I've collected a handful of scripts to help me. Hopefully they'll help you as well. I've included references where appropriate.

Extract Domain from URL

If you’re doing any analysis or visualization on the data it can be helpful to include the domain as a separate data point. This comes from a StackExchange post, Extract Domains from List of URLs using Powershell.
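A minimal sketch of that approach, casting each line to the .NET System.Uri type and reading its Host property (the file names here are my own placeholders):

```powershell
# Read a list of URLs and emit just the domain of each.
# The [System.Uri] cast expects a scheme (http:// or https://) on every line.
Get-Content .\urls.txt |
    ForEach-Object { ([System.Uri]$_).Host } |
    Set-Content .\domains.txt
```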

Delete Everything after X in URL

It's pretty common for URLs to include parameters (utm_campaign, utm_source, etc.). These are great for data accuracy within Google Analytics, but not so good when you're using another tool. I've pieced this together from these two articles.

This specific code will delete everything after the question mark (?), but you can set it to ampersand (&) or anything else that suits your needs (including a regex string if appropriate).
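A sketch of that, assuming a urls.txt input file, using PowerShell's -replace operator with a regex anchored at the first question mark:

```powershell
# Drop the query string: everything from the first ? to the end of the line.
# Swap '\?' for '&' (or any other regex) to suit your needs.
(Get-Content .\urls.txt) |
    ForEach-Object { $_ -replace '\?.*$', '' } |
    Set-Content .\urls-no-params.txt
```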

Delete Duplicate Rows

Once you remove all of the URL parameters, the next step is to delete the duplicates. Note that you'll only want to use this script if your file contains nothing but URLs. If it has data associated with each URL, such as visits, clicks, conversions, etc., you'll want to use something like dplyr (within R) or pivot tables to summarize the data by row. Otherwise you'll lose data. I found this on secretGeek.
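One way to do this (a sketch, with placeholder file names) is to sort the file and keep a single copy of each line:

```powershell
# Sort the lines and emit one copy of each.
# Note: the comparison is case-insensitive by default.
Get-Content .\urls.txt |
    Sort-Object -Unique |
    Set-Content .\urls-deduped.txt
```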

Delete Row Based on Partial String Match

You may want to work only with subdomains, or only with blog articles. Sometimes I'll scrape an entire site, then work with one section of the site at a time. This is super helpful when working with sites that have > 10k URLs. I found this on SuperUser, Deleting entire lines in a text file based on a partial string match with Windows Powershell.
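The gist of it, sketched with a placeholder pattern and file names:

```powershell
# Keep only the lines that do NOT contain the substring.
# '/category/' is a placeholder; use whatever string or regex fits your data.
(Get-Content .\urls.txt) |
    Where-Object { $_ -notmatch '/category/' } |
    Set-Content .\urls-filtered.txt
```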

Delete Rows That Don’t Include a Partial String

Similarly, if you only want to include blog content, you can exclude every URL that doesn't include /blog/ (or whatever else you can identify).
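This is the same filter with the condition flipped (again, file names are placeholders):

```powershell
# Keep only the lines that DO contain the substring.
(Get-Content .\urls.txt) |
    Where-Object { $_ -match '/blog/' } |
    Set-Content .\blog-urls.txt
```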

Use Regular Expressions

Note that you can use regex within PowerShell. In most of these scripts I'm stating explicitly what I'd like to match, but you could easily extend them with your own regex string. I've used this to exclude older blog content, find specific subsections of sites, and more. If you get stuck, find your favorite online regex tester (there are dozens of them) and experiment until you find something that works. Your mileage may vary, but this is a nice way to extend the capabilities of everything I've posted here.
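For example, -match and -notmatch accept full .NET regular expressions, so a date-based filter is a one-line change (the URL structure below is an assumption, not from any real site):

```powershell
# Example: keep only blog posts whose path includes a year from 2015 to 2019.
(Get-Content .\urls.txt) |
    Where-Object { $_ -match '/blog/201[5-9]/' } |
    Set-Content .\recent-posts.txt
```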

Conclusion

Last week I was doing some competitive analysis for clients. I scraped 18 websites with about 4M distinct URLs. With these scripts it took me about an hour to pare those down to the ~40k URLs relevant to the research I was conducting. I’ve also used them in reviewing the performance of paid advertising campaigns, A/B testing and elsewhere. Hopefully they’ll help you in your work, too.
