🤖 GUIDE: MACHINE LEARNING ALGORITHMS

BY: RYAN ZERNACH

**SUMMARY** — What flavors of machine learning algorithms are there? What are the differences between them? Which instances are most appropriate for each algorithm?

ABC’s — FOUNDATIONAL KNOWLEDGE

A) FLAVORS OF MACHINE LEARNING ALGORITHMS

THERE ARE MANY DIFFERENT FLAVORS OF MACHINE LEARNING ALGORITHMS…

Like a magic spell, most machine learning algorithms can be called with only a few lines of code. However, **each algorithm has different mathematical computations happening behind the scenes.** Therefore, it’s important to know which machine learning algorithm would be best, which depends on the features being used, the values you’re trying to predict, and the amount of data you have.
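To make the "few lines of code" point concrete, here is a minimal sketch using scikit-learn. The choice of the built-in iris dataset and of a decision tree is only an assumption for illustration; any of the algorithms listed later in this guide could be swapped in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset: 150 flowers, 4 numeric features, 3 species.
X, y = load_iris(return_X_y=True)

# Create the model, fit it, and predict: three lines of actual modeling code.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)
predictions = model.predict(X[:5])

print(predictions)
```

The three modeling lines stay almost the same for every scikit-learn estimator; what changes is the math that runs inside `fit` and `predict`.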

We’ll dig deeper into when it’s best to use each algorithm, but first, here’s a diagram to illustrate the vast array of “flavors” of machine learning algorithms.

B) REGRESSION VS. CLASSIFICATION

WHAT ARE THE DIFFERENCES BETWEEN REGRESSION & CLASSIFICATION MACHINE LEARNING ALGORITHMS?

When trying to predict a value that is numerically continuous, such as the value of a home, one would use a regression model. Regression answers questions like “How much?” or “How many?”

However, when trying to classify a row into one of n categories, such as identifying which object appears in an image, you’d need to use a classification model.
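Here is a small sketch of that difference, assuming scikit-learn. The home sizes, prices, study hours, and pass/fail labels are all made-up numbers for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous number (made-up home prices in $1000s).
sqft = np.array([[1000], [1500], [2000], [2500]])
price = np.array([200.0, 300.0, 400.0, 500.0])
reg = LinearRegression().fit(sqft, price)
predicted_price = reg.predict([[1800]])[0]  # a number on a continuous scale

# Classification: predict one of n categories (here n = 2: fail=0, pass=1).
hours_studied = np.array([[0], [1], [2], [3], [7], [8], [9], [10]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = LogisticRegression().fit(hours_studied, passed)
predicted_class = clf.predict([[9.5]])[0]  # one of the known labels

print(predicted_price, predicted_class)
```

Despite its name, `LogisticRegression` is a classification model: it outputs a category, not a continuous value.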

C) SUPERVISED VS. UNSUPERVISED

WHAT ARE THE DIFFERENCES BETWEEN SUPERVISED & UNSUPERVISED MACHINE LEARNING ALGORITHMS?

In instances of **supervised** machine learning, we have prior knowledge of what the output values for our samples should be.

On the other hand, **unsupervised** machine learning does not have labeled outputs, so its goal is to infer structure within a set of data points.
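The same six points can be handled both ways. In this sketch (toy data and label names are assumptions for illustration), the supervised model is given the answers, while the unsupervised model only sees the points and has to find the groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Six points forming two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Supervised: we know the correct output for every sample.
y = np.array(["small", "small", "small", "large", "large", "large"])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
supervised_guess = knn.predict([[7.5, 8.5]])[0]

# Unsupervised: no labels are passed in; the model infers structure.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print(supervised_guess, cluster_ids)
```

Note that the cluster ids from `KMeans` are arbitrary integers. They say which points belong together, not what the groups mean.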

123’s — MACHINE LEARNING ALGORITHMS

THIS IS A LIST OF NOT EVEN (10) MACHINE LEARNING ALGORITHMS.

DATAROBOT AUTOMATICALLY TRAINS 60+ ALGORITHMS WITH THE CLICK OF A BUTTON.

LEARN MORE ABOUT MY DATAROBOT EXPERIENCE HERE.

1) LINEAR REGRESSION ALGORITHM

**from sklearn.linear_model import LinearRegression**

Specifications?

1) Data is relatively linear

2) Instances have several attributes

3) Attributes are conditionally dependent

Applications?

1) Evaluate business trends to make estimates or forecasts

2)

3)

Colab Notebooks

2) RIDGE REGRESSION ALGORITHM

**from sklearn.linear_model import Ridge**

Specifications?

1) TBD

Applications?

1) TBD

Colab Notebooks

3) LOGISTIC REGRESSION ALGORITHM

**from sklearn.linear_model import LogisticRegression**

Specifications?

1) TBD

Applications?

1) TBD

4) DECISION TREE ALGORITHMS

**from sklearn.tree import DecisionTreeClassifier**

**from sklearn.tree import DecisionTreeRegressor**

Specifications?

1) TBD

Applications?

1) Supervised Categorization

2) Document Categorization

3) Classifying News Articles

5) NAIVE BAYES ALGORITHM

**from sklearn.naive_bayes import GaussianNB**

Specifications?

1) TBD

Applications?

1) Sentiment Analysis

2) Document Categorization

3) Classifying News Articles

4) Email Spam Filtering

6) K-MEANS CLUSTERING ALGORITHM

**from sklearn.cluster import KMeans**

Specifications?

1) TBD

Applications?

1) Unsupervised Categorization

2) Document Categorization

3) Classifying News Articles

7) RANDOM FOREST ALGORITHMS

**from sklearn.ensemble import RandomForestClassifier**

**from sklearn.ensemble import RandomForestRegressor**

Specifications?

In a random forest, we grow multiple decision trees in one model. To classify a new object from its attributes, each tree gives a classification, and we say that the tree votes for that class. For classification, the forest chooses the class with the most votes across all of its trees; for regression, it takes the average of the outputs of the different trees.

In general, a random forest builds multiple trees and combines them to get a more accurate result. Each tree is trained on a random sample of the data, and at each node it considers only a random subset of the features, then picks the best split from that subset. Because every tree sees a different random slice of the problem, their errors tend to cancel out when combined, which results in a better model.

- The random forest algorithm can be used for both classification and regression tasks.
- It often provides higher accuracy than a single decision tree.
- Some random forest implementations can handle missing values and still maintain accuracy when a large proportion of the data is missing.
- Adding more trees does not in itself make the model overfit, because averaging over many trees reduces variance.
- It can handle large datasets with high dimensionality.
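The "many trees, one combined answer" idea can be checked directly in scikit-learn. This is a sketch under a few assumptions: the iris dataset, 25 trees, and the three sample rows are arbitrary choices for illustration. One detail worth knowing is that scikit-learn's classifier averages each tree's class probabilities rather than counting hard votes, which is a close cousin of the voting described above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features="sqrt": each split only looks at a random subset of features.
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)

# Three rows to inspect, one from each species.
sample = X[[0, 75, 149]]

# Ask every individual tree for its class probabilities...
per_tree = np.array([tree.predict_proba(sample) for tree in forest.estimators_])

# ...then combine them by averaging and pick the most likely class.
averaged = per_tree.mean(axis=0)
manual_prediction = forest.classes_[averaged.argmax(axis=1)]

# The forest's own prediction is the same combination.
forest_prediction = forest.predict(sample)
print(manual_prediction, forest_prediction)
```

For `RandomForestRegressor`, the combination step is simply the mean of the individual trees' numeric predictions.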

Applications?

1) Supervised Categorization

2) Document Categorization

3) Classifying News Articles

8) K-NEAREST NEIGHBORS ALGORITHM

**from sklearn.neighbors import KNeighborsClassifier**

**from sklearn.neighbors import KNeighborsRegressor**

**from sklearn.neighbors import NearestNeighbors**

Specifications?

1) Moderate/large training dataset

2) Instances have several attributes

3) Attributes are conditionally dependent
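To show how the imports above relate, here is a small sketch on made-up one-dimensional data (the points, labels, and query value are assumptions for illustration). `NearestNeighbors` only looks up the closest training points, while `KNeighborsClassifier` goes one step further and lets those neighbors decide the label.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Six training points on a line, in two groups.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Unsupervised lookup: which 3 training points are closest to 9.0?
nn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = nn.kneighbors([[9.0]])

# Supervised prediction: the 3 nearest neighbors decide the label.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
label = knn.predict([[9.0]])[0]

print(indices, distances, label)
```

`KNeighborsRegressor` follows the same pattern, except that it averages the neighbors' numeric targets instead of taking a majority label.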

Applications?

1) Unsupervised Categorization

2) Document Categorization

3) Classifying News Articles

9) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

**from sklearn.cluster import DBSCAN**

Specifications?

Given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).

- DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
- DBSCAN can find arbitrarily shaped clusters.
- It can even find a cluster completely surrounded by (but not connected to) a different cluster.
- Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
- DBSCAN has a notion of noise, and is robust to outliers.
- DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
- However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.
- DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R* tree.
- The parameters minPts and ε can be set by a domain expert, if the data is well understood.
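A few of the properties above can be seen on a tiny example. This is a sketch with made-up points and hand-picked parameter values (both are assumptions for illustration). In scikit-learn, ε is called `eps` and minPts is called `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tightly packed groups plus one point sitting alone.
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # dense group A
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # dense group B
    [10.0, 0.0],                                      # isolated point
])

# No number of clusters is given, only the two density parameters.
db = DBSCAN(eps=0.5, min_samples=3).fit(points)
labels = db.labels_

# scikit-learn marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

The isolated point is not forced into either cluster, which is the "notion of noise" that k-means lacks.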

Applications?

1) Biotech wearable signal processing/classifying

2) EEG neuroelectrical transmissions

3) Noise reduction/filtering technologies

Click here to view DBSCAN’s CS-Unit-1-Build-Week Implementation in Google Colab