If you are new to data science, this title is not intended to insult you. It is my second post on the theme of a popular interview question that goes something like: “explain [insert technical topic] to me as though I were a five-year-old.”
Turns out, hitting the five-year-old comprehension level is pretty tough. So, while this article may not be perfectly clear to a kindergartener, it should be clear to someone with little to no background in data science (and if it isn’t by the end, please let me know in the comments).
I will start out by explaining what machine learning is, along with the different types of machine learning, and then I will jump into explaining common models. I won’t go into any of the math, but I am considering doing that in another article in the future. Enjoy!
Machine learning is when you load lots of data into a computer program and choose a model to “fit” the data, which allows the computer (without your help) to come up with predictions. The computer builds the model using algorithms, which can range from a simple equation (like the equation of a line) to a very complex system of logic and math that gets the computer to the best predictions.
Machine learning is aptly named, because once you choose the model to use and tune it (a.k.a. improve it through adjustments), the machine will use the model to learn the patterns in your data. Then, you can input new conditions (observations) and it will predict the outcome!
Supervised learning is a type of machine learning where the data you put into the model is “labeled.” Labeled simply means that the outcome of the observation (a.k.a. the row of data) is known. For example, if your model is trying to predict whether your friends will go golfing or not, you might have variables like the temperature, the day of the week, etc. If your data is labeled, you would also have a variable that has a value of 1 if your friends actually went golfing or 0 if they did not.
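To make "labeled" concrete, here is a small sketch of what the golfing data might look like in Python. The column names (`temperature_f`, `day`, `went_golfing`) and the values are made up for illustration; the point is only that every row carries its known outcome.

```python
# Hypothetical labeled observations: each row records the conditions
# and the known outcome (went_golfing: 1 = yes, 0 = no).
labeled_rows = [
    {"temperature_f": 75, "day": "Sat", "went_golfing": 1},
    {"temperature_f": 48, "day": "Tue", "went_golfing": 0},
    {"temperature_f": 82, "day": "Sun", "went_golfing": 1},
    {"temperature_f": 55, "day": "Wed", "went_golfing": 0},
]

# Split each row into inputs (features) and the known answer (label).
features = [(row["temperature_f"], row["day"]) for row in labeled_rows]
labels = [row["went_golfing"] for row in labeled_rows]

print(features[0], "->", labels[0])  # (75, 'Sat') -> 1
```

A supervised model would be given both `features` and `labels` so it can learn how one relates to the other.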
As you may have guessed, unsupervised learning is the opposite of supervised learning when it comes to labeled data: the data is not labeled. With unsupervised learning, you do not know whether your friends went golfing or not. It is up to the computer to find patterns via a model, such as groups of similar observations, and use them to guess what happened or predict what will happen.
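As a rough illustration of finding patterns without labels, the sketch below groups a list of temperatures into two clusters. It is a tiny, simplified version of a clustering method called k-means, written only for this example; the function name `two_means` and the temperature values are assumptions, and notice that no "went golfing" column appears anywhere.

```python
def two_means(values, iterations=10):
    """Group numbers into two clusters using a tiny k-means loop."""
    low, high = min(values), max(values)  # starting cluster centers
    for _ in range(iterations):
        # Assign each value to whichever center is closer.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each center to the average of its group.
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


# Hypothetical unlabeled data: temperatures only, no outcome column.
temperatures = [45, 48, 52, 55, 74, 78, 80, 83]
cooler_days, warmer_days = two_means(temperatures)
print(cooler_days)
print(warmer_days)
```

The computer was never told which days are "cool" or "warm"; it found the two groups on its own. Whether those groups line up with golfing days is something you would still have to interpret yourself.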
Logistic Regression
Logistic regression is used when you have a classification problem. This means that your target variable (a.k.a. the variable you are interested in predicting) is made up of categories. These categories could be yes/no, or something like a number between 1 and 10 representing customer satisfaction.
The logistic regression model uses an equation to fit an S-shaped curve to your data and then uses this curve to predict the outcome of a new observation.
In the graphic above, the new observation would get a prediction of 0 because it falls on the left side of the curve. If you look at the data this curve is based on, it makes sense because, in the “predict a value of 0” region of the graph, the majority of the data points have a y-value of 0.
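For readers who want to see the idea in code, here is a minimal sketch of logistic regression with one x-variable, written in plain Python. The data (hours of sunshine versus whether friends went golfing) is invented, and the fitting method used here (simple gradient descent with a hand-picked learning rate and number of passes) is one common choice, not necessarily what any particular library does.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0 to 1 (the S-shaped curve)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=20000):
    """Fit a slope and intercept by repeatedly nudging them to reduce error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_slope = grad_intercept = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(slope * x + intercept) - y
            grad_slope += error * x
            grad_intercept += error
        slope -= learning_rate * grad_slope / n
        intercept -= learning_rate * grad_intercept / n
    return slope, intercept


def predict(x, slope, intercept):
    """Predict 1 if the curve says the chance is at least 50%, else 0."""
    return 1 if sigmoid(slope * x + intercept) >= 0.5 else 0


# Hypothetical labeled data: hours of sunshine and whether friends golfed.
hours_of_sun = [1, 2, 3, 4, 5, 6, 7, 8]
went_golfing = [0, 0, 0, 1, 0, 1, 1, 1]

slope, intercept = fit_logistic(hours_of_sun, went_golfing)
print(predict(2, slope, intercept))  # low-sunshine day
print(predict(7, slope, intercept))  # high-sunshine day
```

A new observation on the low end of the x-axis lands in the "predict 0" region of the curve, and one on the high end lands in the "predict 1" region, which mirrors the reasoning described for the graphic.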
Linear Regression
Linear regression is often one of the first machine learning models that people learn. This is because its algorithm (i.e., the equation behind the scenes) is relatively easy to understand when using just one x-variable: it is just making a best-fit line, a concept many people first meet in school math class. This best-fit line is then used to make predictions about new data points (see illustration).
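Here is a short sketch of the best-fit line idea in plain Python, using the standard least-squares formulas for the slope and intercept. The data points are made up for illustration, and `fit_line` is just a name chosen for this example.

```python
def fit_line(xs, ys):
    """Return the slope and intercept of the least-squares best-fit line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how much y changes, on average, for each one-unit change in x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    # Intercept: chosen so the line passes through the average point.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)

# Use the line to predict y for a new x-value the model has not seen.
new_x = 6
predicted_y = slope * new_x + intercept
print(predicted_y)  # about 5.8
```

Once the slope and intercept are known, predicting for a new data point is just plugging its x-value into the equation of the line.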