Interset’s internal mission statement starts with “We catch bad guys with math”! Interset places a lot of emphasis on a principled, analytics-first approach to threat detection, and I’m very proud of our Data Science team, as well as the culture of principled statistics and education they foster here.
As a result, I am occasionally asked how someone could get started in data science. This is a great question, as some of the best data scientists in the world, and at Interset, do not have a traditional data science background (whatever that means). In fact, they are more likely to have moved into data science from some other quantitative or technical field.
I thought I’d share my suggestions here, in case they help others who have an interest in learning about data science and its applications. Each section has a number of suggestions, sorted in order from most accessible to most advanced.
Courses and Videos
This is my favorite video on AI, its history, and deep learning: AI, Deep Learning, and Machine Learning: A Primer (Andreessen Horowitz). It’s only an hour, and worth the watch if you haven’t seen it before. Frank Chen does a great job of describing the history of AI and its limitations, and also why it has such fantastic promise today. Frank’s follow-up video is also great: The Promise of AI.
This is the best Coursera course on machine learning: Machine Learning. The course is taught by Andrew Ng, famous for his work at Stanford, Google, and Baidu, and considered one of the fathers of deep learning. The tooling is a bit dated, but Ng does a fantastic job of laying out the foundational math, with exercises, in a very accessible way.
A deep-dive course into neural networks and deep learning: Neural Networks for Machine Learning. If you really want to dive into neural networks (deep learning), you want to learn from the master, Professor Geoffrey Hinton. He was my professor at the University of Toronto, and he taught me neural networks in the first place! Hinton has always been a great lecturer, and he is also considered one of the fathers of neural networks and deep learning.
- Approachable, easy tour of big-data analytics applied to many domains: The Signal and the Noise
- Very readable introduction to the data-science thinking process: Doing Data Science: Straight Talk from the Frontline
- Pragmatic introduction to data science and R: Practical Data Science with R
Foundational textbook on statistical learning: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Required only if you really want a textbook with math. If you do, this is the one to get in my opinion. You can actually download the PDF of an early edition of Hastie, Tibshirani, and Friedman’s book for free. As an aside, Hastie and Tibshirani have a free online course on statistical learning, hosted by Stanford. I have no personal experience with this course, but it looks fantastic.
The great news is that there are a lot of freely available tools out there to experiment right away with machine learning and data science. My favorite environment remains R. It’s what most academic researchers use, and it’s what we ourselves use at Interset.
If you’re going to do anything in R, you’ll want to become familiar with Hadley Wickham’s R packages, known affectionately as the Hadleyverse. Here’s a guide: The Hitchhiker’s Guide to the Hadleyverse.
Of course, you’ll want to download RStudio itself. There are other IDEs, but this is the best one.
Your First Data-Science Pet Project
Here’s something very important to me: the best, fastest way to go from a data scientist who has only “book knowledge” to an experienced, effective data scientist is to solve a problem you care about. You need to have felt the pain of dealing with large volumes of messy data and wrestled through long nights of feature engineering, trying to separate the good columns from the bad and spending days staring at a sea of numbers and text to find the hidden signal.
The challenge could be anything from solving or better understanding a problem that impacts you or your family to figuring out the best way for your favorite sports team to win the next competition. If you can find a challenge you’d like to solve using data science, and spend time trying to squeeze every ounce of value from a data set, you will learn more from that exercise than from any other I can think of.
There are a lot of data sets out there, some cleaner than others. Here’s a very small subset to inspire some thought: