Tag Archives: r

The Year of Code

2014 has been a great year. I’ve been writing programs in various languages for the better part of three decades, though I would never have described myself as a ‘programmer’. In 2013 I identified a handful of projects that would simplify some of my work, so earlier this year I picked a couple and completed them. As a result I now have MUCH more confidence in my ability to write code, and I’m pursuing more complicated projects to keep developing my skill set. Here’s a list of some of the things that I’ve completed.


Review of Coursera’s Data Science Specialization

In July I started taking classes in Coursera’s Data Science specialization, taught by 3 professors at Johns Hopkins University. Until now I’ve been taking 2 courses at a time, and I just finished numbers 5 & 6. Before signing up for this certificate, I had read a number of negative reviews of the program on Reddit and in the Coursera forums of other MOOCs that I’d taken. That’s all bunk. This is a great program and I’m glad to have found it.

So far I’ve completed:

  1. The Data Scientist’s Toolbox
  2. R Programming
  3. Getting & Cleaning Data
  4. Exploratory Data Analysis
  5. Reproducible Research
  6. Statistical Inference

The first 5 of those courses were pretty easy for me. I have a background in programming and a couple of years of experience with R. I did pick up a couple of tips along the way, and the course has also introduced me to a couple of great packages that I’d been intending to learn about anyway.

The courses consist of video lectures where they explain concepts. You can download the accompanying presentation or even the video itself. There’s about an hour of videos for each week. For each idea that they’re presenting, they introduce the R code for working with the problem. Each week there is a quiz, and there is a course project as well (some courses have 2 projects).

The videos are well done. Some show the slide being discussed, some show the instructor talking, and some show the desktop (or more specifically, RStudio).

The quizzes usually require some thought and a review of slides or notes, but honestly they aren’t difficult. You have 3 chances, so anyone that’s paying attention should be able to ace these. I usually spend about an hour on each quiz.

The projects are the real meat of the course. They provide a dataset and ask a few questions. You have to manipulate the data, find the answer and write a small report. The grading criteria are available, and they’re all yes/no questions, so you know exactly what you’re aiming for. You usually post your work on GitHub or RPubs (which ensures that you know how to use git & knitr), and the reports are usually 2-4 pages. You’re showing blocks of R code and a plot or two, so the length of the reports hasn’t been a problem at all.

I’m just now getting to the point in the program that I’m most interested in. The Statistical Inference class was a good review for me, and a bit of a challenge (in a good way). I’m pretty eager to get into regression models and machine learning. From this point forward I’m only going to take one class at a time.

Will This Certificate Get Me a Job?

I’ve been asked this question a few times since starting the course. I’m not in a position to hire any data scientists so take this with a grain of salt, but if you understand the concepts in the course I believe you’ll be prepared to take on real analysis projects.

I can say that enrolling in this program has created a couple of opportunities for me. When I told my team that I had signed up for this certificate, my manager shared ~2GB of data with me and said he needed to understand it. I cleaned it and provided some summary statistics and a few plots. That simple project has led to others, and now I’m working on predictive modeling and creating real-time dashboards for my organization. In other words, this program has opened the door to work that I really enjoy and to valuable experience.

I’m glad that I signed up for the program, and I’d encourage anyone that’s interested in data analysis to sign up as well. You can get this training for free, or you can pay $49 per course ($490 for all) and get an authenticated certificate. That’s well worth it in my opinion.

Scathing Matlab Review from a Google Employee

A screen capture of this email was posted to the datascience subreddit earlier today.

Re: students’ preparedness for internships. Interesting that they all know Matlab. I have some strong opinions about that, but I will save them until the next paragraph. Software engineer candidates who list Matlab as their primary language when they show up at Google for an interview will be treated with suspicion, so they should stress the C++ and Python instead. Good that the stats people know R, it seems to be the industry standard.

Matlab rant coming now: you need to stop teaching your students Matlab now. Matlab is a broken, outdated language that is proprietary and has extortionate pricing policies for licenses outside education. The language has been completely superceded by modern languages in the numerical computing space, such as the numerical extensions to Python (numpy etc), and Julia. Matlab only still exists for two reasons: one is the large amounts of legacy code at big defense contracting firms that is too expensive to rewrite, and the other is academic institutions who get sucked in by the free or cheap software licenses and keep exposing their students to it despite it being a relic that deserves to die. The language had some very big mistakes baked into it when it was first designed back in 1985, which the company is afraid to fix because of the legacy code base issue; computer scientists look at it with despair, to be honest, because it would have earned a B- in a language design class even back then. Optimizing compilation will never work properly with Matlab, for instance, because of one of the mistakes in the language. It does a few things very fast (matrix multiplication and solving Ax=b) but it is painfully slow at many other things (function calls, most critically) and its design promotes terrible software practices. On top of all this, it’s not even free: it’s a very expensive language to use unless you’re an academic. (By comparison, R is also a broken language, in some similar ways, but it at least has the saving grace of being open source.) If students turn up at Google (or any other software company that isn’t a Matlab shop – mostly just defense contractors these days, and one hedge fund I know of) listing Matlab as their language and they want a software engineer job, they will be treated as “might be able to program, but probably not”.

Not very gentle. I’m not surprised at the position. While I’ve never worked for Google, they’re a company very focused on programmers. Matlab isn’t a general purpose language, and it wasn’t designed by a computer scientist. I am somewhat surprised at the vitriol though. Whatever its flaws as a language, it is useful in engineering. I used it a fair amount in college, and while it wasn’t my favorite, it handled structural engineering problems easily. I’m also surprised that anyone wanting to work for Google would list it as a primary language on their resume.

How I Learned Data Munging

Any project involving data requires that data to be in a specific format. Visualization libraries such as ggplot2 or matplotlib work with specific types of data. Any modeling or prediction is going to require a specific format too. Most of the time a project requires several iterations of plotting or analysis, so data munging is a skill that you’ll use a LOT.
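To make "a specific format" concrete, here is a minimal sketch of one very common munging step: reshaping "wide" data (one column per year) into "long" data (one row per observation), which is the shape many plotting libraries expect. The data, the column names and the `wide_to_long` helper are all made up for illustration, and it uses only the Python standard library rather than any particular data package.

```python
# Hypothetical wide-format data: one row per city, one column per year.
wide_rows = [
    {"city": "Austin", "2012": 10, "2013": 12},
    {"city": "Boston", "2012": 7, "2013": 9},
]


def wide_to_long(rows, id_key, var_name="variable", value_name="value"):
    """Turn every non-id column into its own row (wide -> long)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the id column is repeated on each output row instead
            long_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return long_rows


# Illustrative call: "year" and "count" are names chosen for this example.
long_rows = wide_to_long(wide_rows, id_key="city", var_name="year", value_name="count")
for r in long_rows:
    print(r)
```

Each city now appears once per year, so a plotting call can map year to the x-axis and count to the y-axis without any further reshaping. In practice a dedicated package would do this for you; the loop just shows what the reshaping amounts to.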
