Wednesday, October 26, 2016

Data Science 101: Part 1

The idea for this project arose because one of the most common questions I am asked is: “how do I obtain a position as a data scientist?” It is not just the regularity of this question that got my attention, but also the diverse backgrounds of the people asking it. Through these conversations, it has become apparent that there is a huge amount of misinformation out there, which has left people confused about what they need to do to break into this field.

I decided, therefore, that I would investigate this subject to cut through the BS and provide a useful resource for anyone looking to move into commercial data science – whether you are just starting out, or already possess all the necessary skills but have no industry experience. And so I set out with the aim of answering two very broad questions:
  • What skills are required for data science, and how should you go about picking these up?
  • From a job market perspective, what steps can you take to maximise your chances of gaining employment in data science? 

Before we start, it's important to know that data science is an incredibly broad and vaguely defined field. Even so, the skill set it involves is reasonably bounded, if still broad. Here is a chart showing the skill set required for a data scientist.




Step 0. Fulfilling the prerequisites

Before you start to learn about the tools and techniques used in data science, you need to get your basics right. Basics in data science means maths, stats and machine learning.

  1. Math
    1. Start with Khan Academy's linear algebra course. You can skip this if you find it to be too basic.
    2. Follow it up with MIT’s Linear Algebra course. Use Gilbert Strang’s textbook to go through it faster. Make sure that you understand matrices properly. The basic idea of learning linear algebra is to get you familiar with matrix computation. Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum.
  2. Stats
    1. Start with Udacity’s Intro to stats.
    2. Use OpenIntro’s textbook to brush up. (Optional step, can skip if you want.)
  3. Machine Learning
    1. Start Andrew Ng’s CS229. This is the course for ML. It doesn't involve hardcore coding, just some maths and basic MATLAB skills. Even someone with no coding experience can follow it.
    2. Use Practical ML by Johns Hopkins to brush up.
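
To make the point about matrix decomposition concrete, here is a minimal sketch of one such algorithm, LU decomposition, in plain Python. It is an illustration rather than anything taken from the courses above: the function names `lu_decompose` and `matmul` and the 2x2 example matrix are made up for this post, and the code does no pivoting, so it only works when no zero pivot turns up.

```python
def lu_decompose(a):
    """Doolittle LU decomposition without pivoting.

    Returns (lower, upper) such that lower times upper equals a,
    where lower has ones on its diagonal.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lower[i][i] = 1.0
        # Fill row i of the upper-triangular factor.
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        if upper[i][i] == 0:
            # A real implementation would swap rows (pivot) here.
            raise ValueError("zero pivot; this sketch does no row swapping")
        # Fill column i of the lower-triangular factor.
        for j in range(i + 1, n):
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]
    return lower, upper


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


A = [[4.0, 3.0], [6.0, 3.0]]  # hypothetical example matrix
L, U = lu_decompose(A)
print(L)             # lower-triangular factor
print(U)             # upper-triangular factor
print(matmul(L, U))  # multiplying the factors gives back A
```

In practice you would call a library routine instead of writing this yourself, but working through a small case by hand is a good check that the linear algebra has sunk in.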

Step 1. Baby steps
  1. Basic coding: This is one of the three pillars on which the field of data science stands, and you will need to be very good at it to excel in DS.
    1. For DS, Python is the language to learn. There is no running away from this. 
    2. Set up your system right away. Use Anaconda, available here, to set up a Python IDE as well as IPython notebooks.
  2. Data Structures & Algorithms: Once you are familiar with the environment and the basic syntax of Python, it's time to learn the basic algorithms and data structures in Python.
    1. Learn data structures and algorithms using Interactive Python.
    2. Once you're done with the above course, you need to practice what you've learnt, otherwise you'll forget it.
      1. Get an account on HackerRank and solve their Algorithms section using Python.
      2. Then solve HackerRank's Data Structures section using Python. These two steps are very important.
  3. Databases: As a data scientist, you will need to learn two database solutions: one SQL-based (e.g., MySQL) and one NoSQL (e.g., MongoDB).
    1. SQL: The W3Schools tutorial is a very good place to learn the basics.
    2. MongoDB: Tutorials Point has a good resource on MongoDB. Complete it.
  4. Learn R: R is a statistical computing language, and the language most preferred by data scientists.
    1. Install R.
    2. Install RStudio.
    3. Install swirl and use it to learn R. It is one of the best resources for learning R.
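
If you want to try the SQL basics from the databases step before installing a server, one option is Python's built-in `sqlite3` module. Note that this uses SQLite rather than MySQL, so treat it as a rough sketch of the core ideas (create a table, insert rows, aggregate with GROUP BY). The `orders` table and its rows are invented purely for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table used only for this example.
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")

rows = [("alice", 30), ("bob", 20), ("alice", 15)]
# Parameter placeholders (?) keep values separate from the SQL text.
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# Count orders and total the amounts for each customer.
query = (
    "SELECT customer, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY customer ORDER BY customer"
)
summary = conn.execute(query).fetchall()
conn.close()

for customer, n_orders, total in summary:
    print(customer, n_orders, total)
```

The same SELECT and GROUP BY statements carry over to MySQL with little or no change, which is why a small local sandbox like this is a reasonable place to practice.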
At this point, you know almost all the tools currently required to continue with DS. From here onwards, DS becomes an art. You have the tools; how you use them is up to you.

As an aside, from here you can go in two directions. If you are disciplined enough, you can take up Harvard's CS109 course, available here. But there's a catch: you need to finish the course within two months, because all the problem sets are designed around that schedule.

                                                              OR

You can continue to work through this manually, taking small steps according to your pace. 

NOTE: Part 2 of this guide will be published in two days.