
### Machine Learning Can Reach Heights With These Algorithms

By Jyoti Nigania | Oct 30, 2018 | 14748 Views

Machine Learning Algorithms are those that can learn from data and improve from experience, without human intervention. Learning tasks may include learning the function that maps the input to the output, or learning the hidden structure in unlabeled data. In instance-based learning, a class label is produced for a new instance by comparing the new instance (row) to instances from the training data, which were stored in memory; instance-based learning does not create an abstraction from specific instances.

Types of Machine Learning Algorithms
There are 3 types of ML algorithms:
• Supervised Machine Learning Algorithms
• Unsupervised Machine Learning Algorithms
• Reinforcement Learning

Supervised Learning Algorithms:
Supervised learning can be explained as follows: use labeled training data to learn the mapping function from the input variables (X) to the output variable (Y).
Y = f(X)
Supervised learning problems can be of two types:
a. Classification: To predict the outcome of a given sample where the output variable is in the form of categories. Examples include labels such as male and female, sick and healthy.
b. Regression: To predict the outcome of a given sample where the output variable is in the form of real values. Examples include real-valued labels denoting the amount of rainfall, the height of a person.

Supervised Learning Algorithms:
1. Linear Regression
In ML, we have a set of input variables (x) that are used to determine the output variable (y). A relationship exists between the input variables and the output variable, and the goal of ML is to quantify this relationship.
In linear regression, the relationship between the input variables (x) and the output variable (y) is expressed as an equation of the form y = a + bx. The goal of linear regression is thus to find the values of the coefficients a and b. Here, a is the intercept and b is the slope of the line. Figure: Linear regression is represented as a line of the form y = a + bx.
Figure 1 shows the plotted x and y values for a dataset. The goal is to fit a line that is nearest to most of the points. This reduces the distance ('error') between the y value of a data point and the line.
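As a sketch of the idea, the coefficients a and b can be computed in closed form by least squares. The `fit_line` helper and the toy data below are illustrative, not from the article:

```python
# Least-squares fit of y = a + bx.
# b = covariance(x, y) / variance(x); a = mean(y) - b * mean(x)
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Toy data that lies exactly on y = 1 + 2x
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```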
2. Logistic Regression
Linear regression predictions are continuous values (e.g. rainfall in cm); logistic regression predictions are discrete values (e.g. whether a student passed or failed) after applying a transformation function.
Logistic regression is best suited for binary classification (datasets where y = 0 or 1, where 1 denotes the default class). Example: in predicting whether an event will occur or not, the event occurring is classified as 1; in predicting whether a person will be sick or not, the sick instances are denoted as 1. It is named after the transformation function used in it, called the logistic function h(x) = 1/(1 + e^-x), which is an S-shaped curve.
In logistic regression, the output is in the form of probabilities of the default class (unlike linear regression, where the output is produced directly). As it is a probability, the output lies in the range 0-1. The output (y-value) is generated by transforming the x-value using the logistic function h(x) = 1/(1 + e^-x). A threshold is then applied to force this probability into a binary classification. Figure: Logistic regression to determine if a tumor is malignant or benign; classified as malignant if the probability h(x) >= 0.5.

In the figure, to determine whether a tumor is malignant or not, the default variable is y = 1 (tumor = malignant); the x variable could be a measurement of the tumor, such as the size of the tumor. As shown in the figure, the logistic function transforms the x-values of the various instances of the dataset into the range of 0 to 1. If the probability crosses the threshold of 0.5 (shown by the horizontal line), the tumor is classified as malignant.
The logistic regression equation P(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x)) can be transformed into ln(p(x) / (1 - p(x))) = b0 + b1x.
The goal of logistic regression is to use the training data to find the values of coefficients b0 and b1 such that it will minimize the error between the predicted outcome and the actual outcome. These coefficients are estimated using the technique of Maximum Likelihood Estimation.
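The article estimates b0 and b1 by Maximum Likelihood Estimation; as a rough sketch, the same coefficients can be approached with a few gradient-ascent steps on the log-likelihood. The toy tumour-size data, learning rate, and step count below are my own choices:

```python
import math

# Logistic function h(x) = 1 / (1 + e^-(b0 + b1*x))
def h(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Stochastic gradient ascent on the log-likelihood (approximates the MLE).
def fit_logistic(xs, ys, steps=5000, lr=0.1):
    b0 = b1 = 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            err = y - h(x, b0, b1)  # gradient of the per-sample log-likelihood
            b0 += lr * err
            b1 += lr * err * x
    return b0, b1

# Tumour sizes with labels 0 (benign) / 1 (malignant)
b0, b1 = fit_logistic([1, 2, 3, 6, 7, 8], [0, 0, 0, 1, 1, 1])
print(h(2, b0, b1) < 0.5, h(7, b0, b1) >= 0.5)  # small -> benign, large -> malignant
```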
3. CART
Classification and Regression Trees (CART) is one implementation of decision trees; others include ID3 and C4.5.
The non-terminal nodes are the root node and the internal node. The terminal nodes are the leaf nodes. Each non-terminal node represents a single input variable (x) and a splitting point on that variable; the leaf nodes represent the output variable (y). The model is used as follows to make predictions: walk the splits of the tree to arrive at a leaf node and output the value present at the leaf node.
The decision tree in Figure 3 classifies whether a person will buy a sports car or a minivan depending on their age and marital status. If the person is over 30 years old and is not married, we walk the tree as follows: 'over 30 years?' -> yes -> 'married?' -> no. Hence, the model outputs a sports car. Figure: Parts of a decision tree
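The tree walk can be sketched as follows; the node structure and the 'minivan for married over-30s' branch are my own reading of the figure, not stated verbatim in the article:

```python
# A tiny hand-built decision tree mirroring the sports-car/minivan example.
# Non-terminal nodes hold a question (a split); leaves hold the output.
tree = {
    "question": lambda p: p["age"] > 30,     # 'over 30 years?'
    "yes": {
        "question": lambda p: p["married"],  # 'married?'
        "yes": "minivan",
        "no": "sports car",
    },
    "no": "sports car",
}

def predict(node, person):
    # Walk the splits until we reach a leaf (a plain string), then output it.
    while isinstance(node, dict):
        node = node["yes"] if node["question"](person) else node["no"]
    return node

print(predict(tree, {"age": 35, "married": False}))  # sports car
```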
4. Naive Bayes
To calculate the probability that an event will occur, given that another event has already occurred, we use Bayes' Theorem. To calculate the probability of an outcome given the value of some variable, that is, to calculate the probability of a hypothesis(h) being true, given our prior knowledge(d), we use Bayes' Theorem as follows:
P(h|d)= (P(d|h) P(h)) / P(d)
where:
• P(h|d) = Posterior probability. The probability of hypothesis h being true, given the data d. Under the naive independence assumption, P(h|d) = P(d1|h) P(d2|h) ... P(dn|h) P(h) / P(d)
• P(d|h) = Likelihood. The probability of data d given that the hypothesis h was true.
• P(h) = Class prior probability. The probability of hypothesis h being true (irrespective of the data)
• P(d) = Predictor prior probability. Probability of the data (irrespective of the hypothesis)
This algorithm is called 'naive' because it assumes that all the variables are independent of each other, which is a naive assumption to make in real-world examples.
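As a minimal sketch of Bayes' Theorem applied to counted frequencies (with a single feature, so the naive product reduces to one likelihood term), using a toy weather dataset of my own:

```python
from collections import Counter

# Toy data: (weather, play?) pairs
data = [("sunny", "yes"), ("sunny", "yes"), ("rain", "no"),
        ("sunny", "no"), ("rain", "no"), ("rain", "yes")]

def posterior(d, h):
    labels = Counter(label for _, label in data)
    p_h = labels[h] / len(data)                                       # prior P(h)
    p_d_h = sum(1 for x, y in data if x == d and y == h) / labels[h]  # likelihood P(d|h)
    p_d = sum(1 for x, _ in data if x == d) / len(data)               # evidence P(d)
    return p_d_h * p_h / p_d                                          # Bayes' theorem

print(posterior("sunny", "yes"))  # P(play=yes | sunny) = 2/3
```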

5. KNN
The k-nearest neighbours algorithm uses the entire dataset as the training set, rather than splitting the dataset into a training set and test set.
When an outcome is required for a new data instance, the KNN algorithm goes through the entire dataset to find the k instances nearest (most similar) to the new instance. It then outputs the mean of their outcomes for a regression problem, or the mode (most frequent class) for a classification problem. The value of k is user-specified.
The similarity between instances is calculated using measures such as Euclidean distance and Hamming distance.
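The classification case can be sketched with Euclidean distance; the training points and labels below are an illustrative toy set:

```python
import math
from collections import Counter

# KNN classification: sort the whole training set by distance to the new
# instance, take the k nearest, and output the mode of their labels.
def knn_predict(train, new_point, k=3):
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.5)))  # A
```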

Unsupervised Learning Algorithms:
Unsupervised learning problems possess only the input variables (X) but no corresponding output variables. Unsupervised learning uses unlabeled training data to model the underlying structure of the data.
Unsupervised learning problems can be of the following types:
• Association: To discover the probability of the co-occurrence of items in a collection. It is extensively used in market-basket analysis. Example: If a customer purchases bread, he is 80% likely to also purchase eggs.
• Clustering: To group samples such that objects within the same cluster are more similar to each other than to the objects from another cluster.
• Dimensionality Reduction: True to its name, Dimensionality Reduction means reducing the number of variables of a dataset while ensuring that important information is still conveyed. Dimensionality Reduction can be done using Feature Extraction methods and Feature Selection methods. Feature Selection selects a subset of the original variables. Feature Extraction performs data transformation from a high-dimensional space to a low-dimensional space. Example: PCA algorithm is a Feature Extraction approach.

Unsupervised Learning Algorithms:
1. Apriori:
The Apriori algorithm is used in a transactional database to mine frequent item sets and then generate association rules. It is popularly used in market basket analysis, where one checks for combinations of products that frequently co-occur in the database. In general, we write the association rule 'if a person purchases item X, then he purchases item Y' as: X -> Y.
Example: if a person purchases milk and sugar, then he is likely to purchase coffee powder. This could be written in the form of an association rule as: {milk, sugar} -> coffee powder. Association rules are generated after crossing the threshold for support and confidence. Figure: Formulae for support, confidence and lift for the association rule X->Y.
The Support measure helps prune the number of candidate item sets to be considered during frequent item set generation. This support measure is guided by the Apriori principle. The Apriori principle states that if an item set is frequent, then all of its subsets must also be frequent.
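The support, confidence and lift measures from the figure can be sketched directly over a transaction list (the transactions below are my own toy example):

```python
# Support, confidence and lift for the association rule X -> Y.
transactions = [
    {"milk", "sugar", "coffee"},
    {"milk", "sugar", "coffee"},
    {"milk", "bread"},
    {"sugar", "coffee"},
    {"milk", "sugar"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    return confidence(x, y) / support(y)

X, Y = {"milk", "sugar"}, {"coffee"}
print(support(X | Y), confidence(X, Y), lift(X, Y))
```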

2. K-means:
K-means is an iterative algorithm that groups similar data into clusters. It calculates the centroids of k clusters and assigns each data point to the cluster whose centroid is closest to it. Figure: Steps of the K-means algorithm

Step 1: k-means initialization:
a) Choose a value of k. Here, let us take k = 3. b) Randomly assign each data point to any of the 3 clusters. c) Compute the cluster centroid for each of the clusters. The red, blue and green stars denote the centroids for each of the 3 clusters.

Step 2: Associating each observation to a cluster:
Reassign each point to the closest cluster centroid. Here, the upper 5 points got assigned to the cluster with the blue color centroid. Follow the same procedure to assign points to the clusters containing the red and green color centroid.

Step 3: Recalculating the centroids:
Calculate the centroids for the new clusters. The old centroids are shown by gray stars while the new centroids are the red, green and blue stars.

Step 4: Iterate, then exit if unchanged.
Repeat steps 2-3 until there is no switching of points from one cluster to another. Once there is no switching for 2 consecutive steps, exit the k-means algorithm.
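Steps 2-4 can be sketched in one dimension; to keep the run deterministic, the initial centroids below are chosen by hand rather than randomly (step 1):

```python
# Minimal 1-D k-means following the steps above.
def kmeans(points, centroids):
    while True:
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        # Step 4: exit once no point switches clusters (centroids unchanged).
        if new == centroids:
            return centroids, clusters
        centroids = new

cents, clus = kmeans([1, 2, 3, 10, 11, 12], centroids=[1.0, 12.0])
print(cents)  # [2.0, 11.0]
```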
3. PCA
Principal Component Analysis (PCA) is used to make data easy to explore and visualize by reducing the number of variables. This is done by capturing the maximum variance in the data in a new coordinate system with axes called 'principal components'. Each component is a linear combination of the original variables, and the components are orthogonal to one another. Orthogonality between components indicates that the correlation between these components is zero.
The first principal component captures the direction of the maximum variability in the data. The second principal component captures the remaining variance in the data but is uncorrelated with the first component. Similarly, all successive principal components (PC3, PC4 and so on) capture the remaining variance while being uncorrelated with the previous components. Figure: The 3 original variables (genes) are reduced to 2 new variables termed principal components
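A generic sketch of PCA via eigen-decomposition of the covariance matrix (the small 2-variable dataset is my own, not the figure's gene data):

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                     # centre each variable
    cov = np.cov(Xc, rowvar=False)              # covariance matrix
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]              # sort by variance, descending
    components = vecs[:, order[:n_components]]  # orthogonal directions
    return Xc @ components                      # project onto the PCs

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
scores = pca(X, n_components=1)  # 2 original variables reduced to 1 component
print(scores.shape)  # (6, 1)
```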

Ensemble learning techniques:
Ensemble means combining the results of multiple learners (classifiers) for improved results, by voting or averaging. Voting is used for classification and averaging for regression. The idea is that ensembles of learners perform better than single learners. There are 3 types of ensemble algorithms: Bagging, Boosting and Stacking.
4. Bagging with Random Forests:
Random Forest (multiple learners) is an improvement over bagged decision trees (a single learner).

Bagging: The first step in bagging is to create multiple models with datasets created using the Bootstrap Sampling method. In Bootstrap Sampling, each generated training set is composed of random subsamples of the original dataset. Each of these training sets is the same size as the original dataset, but some records repeat multiple times and some records do not appear at all. The entire original dataset is then used as the test set. Thus, if the size of the original dataset is N, the size of each generated training set is also N, with the number of unique records being about 2N/3, and the size of the test set is also N.

The second step in bagging is to create multiple models by using the same algorithm on the different generated training sets. In this case, let us discuss Random Forest. Unlike a decision tree, where each node is split on the best feature that minimizes error, in random forests, we choose a random selection of features for constructing the best split. The reason for randomness is: even with bagging, when decision trees choose a best feature to split on, they end up with similar structure and correlated predictions. The number of features to be searched at each split point is specified as a parameter to the random forest algorithm.
Thus, in bagging with Random Forest, each tree is constructed using a random sample of records and each split is constructed using a random sample of predictors.
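The bootstrap-sampling step above can be sketched on its own (the dataset and seed are my own toy choices; a full forest would then train one tree per sample on a random subset of features):

```python
import random

# Bootstrap sampling: each generated training set has the same size N as the
# original dataset, drawn with replacement, so some records repeat and some
# never appear.
def bootstrap_sample(dataset, rng):
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(100))
sample = bootstrap_sample(data, rng)
# len(sample) == N; the number of unique records is roughly 2N/3.
print(len(sample), len(set(sample)))
```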