2014 has been a great year. I’ve been writing programs in various languages for the better part of 3 decades, though I would never have defined myself as a ‘programmer’. In 2013 I recognized a handful of projects that would simplify some of my work, so earlier this year I picked a couple and completed them. The result is that I now have MUCH more confidence in my ability to write code, and I’m pursuing more complicated projects to continue developing my skill set. Here’s a list of some of the things that I’ve completed.
In July I started taking classes in Coursera’s Data Science specialization, taught by 3 professors at Johns Hopkins University. Until now I’ve been taking 2 courses at a time; I just finished numbers 5 & 6. Before signing up for this certificate, I had read a number of negative reviews of the program on Reddit and in the Coursera forums of other MOOCs I’d taken. That’s all bunk; this is a great program, and I’m glad to have found it.
So far I’ve completed:
- The Data Scientist’s Toolbox
- R Programming
- Getting & Cleaning Data
- Exploratory Data Analysis
- Reproducible Research
- Statistical Inference
The first 5 of those courses were pretty easy for me. I have a background in programming and a couple of years of experience with R. I did pick up a couple of tips along the way, and the courses also introduced me to a couple of great packages that I’d been intending to learn anyway.
The courses consist of video lectures that explain the concepts. You can download the accompanying presentation or even the video itself. There’s about an hour of videos for each week. For each idea they present, they introduce the R code for working with the problem. Each week there is a quiz, and there is a course project as well (some courses have 2 projects).
The videos are well done: some show the slide being discussed, some show the instructor talking, and some show the desktop (or more specifically, RStudio).
The quizzes usually require some thought and a review of slides or notes, but honestly they aren’t difficult. You have 3 chances, so anyone who’s paying attention should be able to ace them. I usually spend about an hour on each quiz.
The projects are the real meat of the course. They provide a dataset and ask a few questions. You have to manipulate the data, find the answers and write a small report. The grading criteria are available up front, and they’re all yes/no questions, so you know exactly what you’re aiming for. You usually post your work on GitHub or RPubs (which means you have to learn git & knitr), and the reports are usually 2-4 pages. Since you’re mostly showing blocks of R code and a plot or two, hitting that length hasn’t been difficult at all.
I’m just now getting to the point in the program that I’m most interested in. The Statistical Inference class was a good review for me, and a bit of a challenge (in a good way). I’m pretty eager to get into regression models and machine learning. From this point forward I’m only going to take one class at a time.
Will This Certificate Get Me a Job?
I’ve been asked this question a few times since starting the course. I’m not in a position to hire any data scientists, so take this with a grain of salt, but if you understand the concepts in the course I believe you’ll be prepared to take on real analysis projects.
I can say that enrolling in this program has created a couple of opportunities for me. When I told my team that I signed up for this certificate, my manager shared ~2GB of data with me and said he needed to understand it. I cleaned it and provided some summary statistics and a few plots. That simple project has opened the door to others; now I’m working on predictive modeling and building real-time dashboards for my organization. In other words, this program has opened the door to work that I really enjoy and to valuable experience.
I’m glad that I signed up for the program, and I’d encourage anyone who’s interested in data analysis to sign up as well. You can get the training for free, or you can pay $49 per course ($490 for all) and get an authenticated certificate. That’s well worth it in my opinion.
Re: students’ preparedness for internships. Interesting that they all know Matlab. I have some strong opinions about that, but I will save them for the next paragraph. Software engineer candidates who list Matlab as their primary language when they show up at Google for an interview will be treated with suspicion, so they should stress C++ and Python instead. Good that the stats people know R; it seems to be the industry standard.
Matlab rant coming now: you need to stop teaching your students Matlab now. Matlab is a broken, outdated language that is proprietary and has extortionate pricing policies for licenses outside education. The language has been completely superseded by modern languages in the numerical computing space, such as the numerical extensions to Python (numpy etc), and Julia. Matlab only still exists for two reasons: one is the large amounts of legacy code at big defense contracting firms that is too expensive to rewrite, and the other is academic institutions who get sucked in by the free or cheap software licenses and keep exposing their students to it despite it being a relic that deserves to die. The language had some very big mistakes baked into it when it was first designed back in 1985, which the company is afraid to fix because of the legacy code base issue; computer scientists look at it with despair, to be honest, because it would have earned a B- in a language design class even back then. Optimizing compilation will never work properly with Matlab, for instance, because of one of the mistakes in the language. It does a few things very fast (matrix multiplication and solving Ax=b) but it is painfully slow at many other things (function calls, most critically) and its design promotes terrible software practices. On top of all this, it’s not even free: it’s a very expensive language to use unless you’re an academic. (By comparison, R is also a broken language, in some similar ways, but it at least has the saving grace of being open source.) If students turn up at Google (or any other software company that isn’t a Matlab shop – mostly just defense contractors these days, and one hedge fund I know of) listing Matlab as their language and they want a software engineer job, they will be treated as “might be able to program, but probably not”.
Not very gentle. I’m not surprised at the position; while I’ve never worked for Google, they’re a company very focused on programmers. Matlab isn’t a general-purpose language, and it wasn’t designed by a computer scientist. I am somewhat surprised at the vitriol, though. Whatever its faults as a language, it is useful in engineering. I used it a fair amount in college, and while it wasn’t my favorite, it handled structural engineering problems easily. I’m also surprised that anyone wanting to work for Google would list it as a primary language on their resume.
I use Piwik to monitor a couple of sites. If you aren’t familiar with it, it’s an alternative to Google Analytics that you host yourself. I installed it as an experiment and fell in love with it. The thing is, most of my sites are hosted in a shared environment, yet I’m a bit obsessive-compulsive about page load times. I host all of my static content (images, CSS & JS) on a CDN.
When you install Piwik, it prompts you to load the piwik.js file from the same domain where Piwik is running. Nothing wrong with that, but if you’re using a shared webhost like me, that can add seconds to your page load times. Unacceptable. The fix is a one-line change to the g.src variable in the tracking code: point it at a copy of piwik.js hosted on your CDN instead.
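For illustration, here’s a sketch of the standard Piwik tracking snippet with that one change made. The domain names and site ID are placeholders; check the snippet your own Piwik installation generates for the exact tracker URL and ID:

```javascript
var _paq = _paq || [];
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
  // piwik.php must still point at the server actually running Piwik
  var u = "//analytics.example.com/";            // placeholder: your Piwik host
  _paq.push(['setTrackerUrl', u + 'piwik.php']);
  _paq.push(['setSiteId', 1]);                   // placeholder: your site ID
  var d = document,
      g = d.createElement('script'),
      s = d.getElementsByTagName('script')[0];
  g.type = 'text/javascript';
  g.async = true;
  g.defer = true;
  // The one-line change: load the static piwik.js from the CDN
  // instead of the default g.src = u + 'piwik.js' on the shared host.
  g.src = "//cdn.example.com/js/piwik.js";       // placeholder: your CDN
  s.parentNode.insertBefore(g, s);
})();
```

Since piwik.js is a static file, you just copy it from your Piwik installation to the CDN; the actual tracking requests still go to piwik.php on your Piwik server.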
I just read that several large internet companies are joining the fight against the FCC’s net neutrality proposal. This is good news for America; I’m not a fan of the proposal. I did some research on what I, as an individual, could do.
- Write your congressman. They were elected to serve you. The FCC is an independent agency, which means it can’t be bossed around by the House, but it never hurts to get elected officials involved.
Any project involving data requires it in a specific format. Visualization libraries such as ggplot2 or matplotlib expect specific types of data, and any modeling or prediction is going to require a specific format as well. Most projects take several iterations of plotting and analysis, so data munging is a skill that you’ll use a LOT.
Last week I posted several scripts useful for cleaning up URL data with PowerShell. If you work in search marketing, one of the next logical steps is to gather data on those URLs, for example when doing a content audit on your website. There are (at least) a hundred ways to scrape content from the web. One of the easiest methods that I’ve found is within Google Drive, using the IMPORTXML() function and a couple of XPath queries.
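As a sketch of the idea (the URL and XPath expressions here are just illustrations, not from the original post), formulas like these in a Google Sheets cell will pull a page’s title, meta description, or headings straight into the spreadsheet:

```
=IMPORTXML("https://example.com/page", "//title")
=IMPORTXML("https://example.com/page", "//meta[@name='description']/@content")
=IMPORTXML("https://example.com/page", "//h1")
```

Point the first argument at a cell containing the URL and you can fill the formula down a whole column of URLs from your audit.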
As an analyst working within the marketing world, most of the data that I see includes URLs. In fact, much of the data is focused on URLs. I tend to scrub URLs a lot, and have collected a handful of scripts to help me. Hopefully this will help you as well. I’ve included references where appropriate.
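To give a flavor of what this kind of scrubbing looks like, here is a minimal PowerShell sketch (not one of the original scripts, and the file names are placeholders) for one common task: stripping query strings and fragments from a list of URLs and de-duplicating the result:

```powershell
# Read raw URLs, drop the ?query and #fragment parts, then de-duplicate.
# [System.Uri]::GetLeftPart with UriPartial::Path keeps scheme + host + path.
Get-Content urls.txt |
    ForEach-Object { ([System.Uri]$_).GetLeftPart([System.UriPartial]::Path) } |
    Sort-Object -Unique |
    Set-Content clean-urls.txt
```

Leaning on .NET’s System.Uri class means you don’t have to hand-roll regexes for the edge cases in URL syntax.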