Last week I posted several scripts useful for cleaning up URL data with PowerShell. If you work in search marketing, one of the next logical steps is to gather data on those URLs; for example, when doing a content audit on your website. There are (at least) a hundred ways to scrape content from the web. One of the easiest methods that I’ve found is within Google Drive, using the
IMPORTXML() function and a couple of XPath queries.
Before opening up Google Drive, create a list of URLs that you want to gather data on. This could be the output of a sitemap, a list of blog URLs, or any other list of URLs that’s important to you. Create a spreadsheet within Google Drive and copy/paste your URL list into the document. This will be column A within your document.
We’ll place page titles in column B. The code for this is:
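Assuming your first URL sits in cell A1, a formula along these lines pulls the page title with an XPath query:

```
=IMPORTXML(A1, "//title")
```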
This will go in column C:
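For the meta description, a formula of this shape works (again assuming the URL is in A1; the XPath grabs the `content` attribute of the description tag):

```
=IMPORTXML(A1, "//meta[@name='description']/@content")
```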
Once you have columns B & C set for a single row, highlight them; when you hover over the bottom-right corner of the highlighted cells you’ll see a large
+ sign. Drag that to the bottom of your column of URLs. This will fill your columns with the page title & meta description formulas. Wait a bit (depending on how many URLs you’re checking, this could take up to a few minutes) and you’ll have your data.
There are a few things that can trigger an
N/A to appear, so if you see that you’ll need to do a little digging to find out why. If you get page titles but the descriptions are
N/A, the most likely cause is that the site doesn’t have them at all. If you get
N/As in both columns all the way down, the site likely has a robots.txt file that excludes bots, which means you’ll need to find another way to collect this data. I’d suggest looking at Beautiful Soup and possibly Mechanize. There are plenty of tutorials on how to scrape with Mechanize & Beautiful Soup.
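As a rough sketch of the Beautiful Soup route (requires `pip install beautifulsoup4`; the inline HTML here stands in for a page you’d fetch yourself, e.g. with urllib or Requests):

```python
from bs4 import BeautifulSoup

# Stand-in for HTML you fetched from one of your URLs.
html = """
<html><head>
<title>Example Page</title>
<meta name="description" content="An example description.">
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Page title: text of the <title> element, if present.
title = soup.title.string if soup.title else None

# Meta description: the content attribute of <meta name="description">.
meta = soup.find("meta", attrs={"name": "description"})
description = meta["content"] if meta else None

print(title)
print(description)
```

From there it’s a loop over your URL list, writing each title/description pair out to a CSV.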
All of the page titles & meta descriptions should display just fine and be easily readable within Google Drive. If you export your titles & descriptions and open them in Excel, you’ll likely see some strange characters; this is because not all web content uses standard encoding. My recommendation is to stay within Google Drive, where you won’t have to deal with encoding at all.
That Was Easy
I’d been researching the easiest way to do this for a few weeks; I have about a dozen lists of URLs that I needed this data for. In fact, I’d written a Python script and was working my way through a small list of bugs. No more: this is faster, easier, and anyone can do it.