Five Data Science Algorithms
Learn about Five Data Science Algorithms
Data Science is powering new technologies and solving complex problems, and at every stage that progress depends on algorithms, chosen and applied carefully. Data Science has its own set of algorithms that let us work efficiently and effectively. This article aims to deliver a basic understanding of five Data Science algorithms.
Five Data Science Algorithms:
- Linear Regression
- Logistic Regression
- Naive Bayes
- K-Means Clustering
- Random Forest
If you have started your journey towards Data Science, or are already working with Machine Learning, you might have come across the term Linear Regression. Linear Regression is a method for predicting the value of a dependent variable from an independent variable. The equation used is:
y = b0 + b1x
- y is the dependent variable.
- x is the independent variable whose values are used for predicting the dependent variable.
- b0 and b1 are constants in which b0 is the Y-intercept and b1 is the slope.
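The constants b0 and b1 can be estimated from data with ordinary least squares. Here is a minimal sketch in pure Python; the sample points are made up for illustration and lie exactly on the line y = 2 + 3x:

```python
# Simple linear regression: fit y = b0 + b1*x by ordinary least squares.
def fit_linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b1 = covariance(x, y) / variance(x)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    # Intercept b0 = mean(y) - b1 * mean(x)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Illustrative data: points on y = 2 + 3x
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
b0, b1 = fit_linear_regression(xs, ys)
print(b0, b1)  # → 2.0 3.0
```

Once b0 and b1 are known, predicting y for a new x is just `b0 + b1 * x`.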
Logistic Regression is used for estimating the probability of a binary outcome; its output is a value between 0 and 1 rather than a continuous prediction. The equation for logistic regression is:
P(x) = e^(b0+b1x) / (1 + e^(b0+b1x))
Where b0 and b1 are coefficients and the goal of Logistic Regression is to find the value of these coefficients.
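Given coefficient values, the equation above can be evaluated directly. A small sketch, with coefficients chosen arbitrarily just to show the shape of the function:

```python
import math

def logistic(x, b0, b1):
    # P(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# With b0 = 0 and b1 = 1, the output is exactly 0.5 at x = 0,
# approaches 1 for large positive x, and 0 for large negative x.
print(logistic(0, 0, 1))            # → 0.5
print(round(logistic(5, 0, 1), 3))  # → 0.993
```

Fitting b0 and b1 from labelled data (e.g. by gradient descent on the log-loss) is what the actual training step of Logistic Regression does.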
The Naive Bayes algorithm calculates the probability that an event will occur given that a related event has already occurred. Bayes' theorem is represented by:
P(A|B) = P(B|A) P(A) / P(B)
Where A and B are two events:
- P(A|B) is the posterior probability i.e. the probability of A given that B has already occurred.
- P(B|A) is the likelihood i.e. the probability of B given that A has already occurred.
- P(A) is the class prior probability.
- P(B) is the predictor prior probability.
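The theorem can be applied with a direct calculation. The numbers below are hypothetical, made up only to illustrate the formula: imagine A = "email is spam" and B = "email contains the word 'offer'":

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Hypothetical probabilities for illustration only
p_a = 0.2          # P(spam): class prior
p_b = 0.25         # P('offer'): predictor prior
p_b_given_a = 0.5  # P('offer' | spam): likelihood
print(bayes_posterior(p_b_given_a, p_a, p_b))  # → 0.4
```

So seeing the word "offer" would raise the estimated spam probability from the prior of 0.2 to a posterior of 0.4. A full Naive Bayes classifier multiplies such likelihoods across all features, assuming they are independent.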
K-means clustering is a type of unsupervised Machine Learning algorithm. The steps followed are:
- First, select the value of k which is equal to the number of clusters.
- Then assign the random center values to each of these k clusters.
- Then start searching for the nearest data points to the cluster centers by using the Euclidean distance formula.
- Then, recompute each cluster center as the mean of the data points assigned to that cluster.
- After that search for the nearest data points to the newly created centers and assign them to their closest clusters.
- Keep repeating the above steps until there is no change in the data points assigned to the k clusters.
Note: the Euclidean distance between two points is given by
D = √((x1-x2)^2 + (y1-y2)^2)
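The steps above can be sketched in pure Python. The six sample points are made up, forming two well-separated blobs so the algorithm has an obvious answer to converge to:

```python
import math
import random

def euclidean(p, q):
    # D = sqrt((x1 - x2)^2 + (y1 - y2)^2), generalized to any dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iters=100):
    # Steps 1-2: pick k of the points as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster)
                                         for c in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        # Steps 5-6: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated 2-D blobs
random.seed(0)  # for reproducibility
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))
```

With this data the centers converge to the means of the two blobs, (4/3, 4/3) and (25/3, 25/3), regardless of which points are chosen as initial centers.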
Random Forest overcomes the overfitting problem of individual decision trees and can solve both classification and regression problems. It works on the principle of ensemble learning: the model counts the votes cast by the predictions of many different decision trees, and the prediction with the largest number of votes becomes the prediction of the model.
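The voting step for classification can be shown in isolation. The tree predictions below are made up for illustration, standing in for the outputs of five already-trained trees on one sample:

```python
from collections import Counter

def majority_vote(predictions):
    # The class predicted by the most trees wins.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five decision trees for one sample
tree_predictions = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(tree_predictions))  # → cat
```

For regression, the forest averages the trees' numeric predictions instead of voting. Each tree is trained on a random bootstrap sample of the data with a random subset of features, which is what makes the ensemble resistant to overfitting.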
The above was a basic overview of a few algorithms commonly used in data science projects and applications. Understanding the theory and then applying it in practical code is the best way to learn data science. Hopefully this article helped you understand these algorithms at a foundational level.