Nand Kishor is the Product Manager of House of Bots. After finishing his studies in computer science, he ideated and re-launched the Real Estate Business Intelligence Tool, where he created one of the leading business intelligence tools for property price analysis in 2012. He also writes, researches, and shares knowledge about Artificial Intelligence (AI), Machine Learning (ML), Data Science, Big Data, the Python language, etc.

### Top 10 Data Mining Algorithms, Explained

- Patient has a history of cancer
- Patient is expressing a gene highly correlated with cancer patients
- Patient has tumors
- Patient's tumor size is greater than 5cm

- First, C4.5 uses information gain when generating the decision tree.
- Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning results in many improvements.
- Third, C4.5 can work with both continuous and discrete data. My understanding is that it does this by specifying ranges or thresholds for continuous data, thus turning continuous data into discrete data.
- Finally, C4.5 has its own way of dealing with incomplete data.
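
To make the first point concrete, here is a minimal sketch of how information gain could be computed for a discrete feature. The patient records and the names `entropy` and `information_gain` are hypothetical, chosen only for illustration. C4.5 itself goes further and normalizes this value into a gain ratio, which is left out here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one discrete feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy patients (not from the article).
rows = [
    {"has_tumor": True, "history": True},
    {"has_tumor": True, "history": False},
    {"has_tumor": False, "history": True},
    {"has_tumor": False, "history": False},
]
labels = ["cancer", "cancer", "healthy", "healthy"]

gain_tumor = information_gain(rows, labels, "has_tumor")
gain_history = information_gain(rows, labels, "history")
```

In this toy data, `has_tumor` separates the classes perfectly while `history` tells us nothing, so a tree builder that ranks features by gain would split on `has_tumor` first.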

1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
2. Every patient will be closest to 1 of these k centroids. They hopefully won't all be closest to the same one, so they'll form a cluster around their nearest centroid.
3. What we have are k clusters, and each patient is now a member of a cluster.
4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
5. This center becomes the new centroid for the cluster.
6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
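
The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the patient vectors are made up, and seeding the centroids with the first k points is an assumption made here for reproducibility (real implementations usually pick starting centroids randomly or with a smarter heuristic).

```python
import math

def kmeans(points, k, max_iters=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumption: seed centroids with the first k points so runs are repeatable.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iters):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Convergence: cluster memberships stopped changing.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical two-feature patient vectors.
patients = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(patients, k=2)
```

On this data the second patient starts out in the wrong cluster and switches membership once the centroids move, which is exactly the reassignment the steps describe.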

- The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
- The second is your support, or the number of transactions containing the itemset divided by the total number of transactions. An itemset that meets the minimum support is called a frequent itemset.
- The third is your confidence, or the conditional probability of some item given that you have certain other items in your itemset. For example, given chips in your itemset, there is a 67% confidence of also having soda in the itemset.
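
Support and confidence are simple ratios, as the following sketch shows. The four transactions are invented for illustration, and the function names are hypothetical. The data was chosen so that the chips-to-soda confidence comes out at roughly the 67% used in the example above.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"chips", "soda"},
    {"chips", "soda", "dip"},
    {"chips", "dip"},
    {"soda", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

chips_soda_support = support({"chips", "soda"}, transactions)
chips_to_soda = confidence({"chips"}, {"soda"}, transactions)
```

Chips appear in three baskets and two of those also contain soda, so the confidence is 2/3.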

- Join. Scan the whole database for how frequent 1-itemsets are.
- Prune. Those itemsets that satisfy the support and confidence move on to the next round for 2-itemsets.
- Repeat. This is repeated for each itemset level until we reach our previously defined size.
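
A simplified sketch of the join, prune, and repeat loop might look like the following. It makes two assumptions worth flagging: it prunes on support only (confidence is left to a later rule-generation step), and it skips the usual check that every subset of a candidate is itself frequent. The baskets and the function name `apriori` are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support, max_size):
    """Return {itemset: support} for frequent itemsets of up to max_size items."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Join (level 1): every single item is a candidate 1-itemset.
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while current and size <= max_size:
        # Prune: keep only the candidates that meet the minimum support.
        scored = {c: support(c) for c in current}
        survivors = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(survivors)
        # Repeat: join survivors into candidates that are one item larger.
        size += 1
        current = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"chips", "soda"},
    {"chips", "soda", "dip"},
    {"chips", "dip"},
    {"soda", "bread"},
]
frequent = apriori(baskets, min_support=0.5, max_size=2)
```

With a minimum support of 0.5, bread is dropped in the first round, so no 2-itemset containing bread is ever counted. That early elimination is what keeps the search manageable.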

- The mean
- The variance

- E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
- M-step: Update the model parameters based on the cluster assignments from the E-step.
- Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
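
As a rough sketch of these steps, the code below fits a mixture of two one-dimensional Gaussians, with the mean and variance of each cluster as the model parameters. The data, the starting guesses, and the fixed iteration count (used here in place of a proper convergence check) are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=50):
    # Assumption: crude starting guesses taken from the extremes of the data.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([value / total for value in p])
        # M-step: re-estimate the parameters from those soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsed cluster
            weights[k] = nk / len(data)
    return means, variances, weights

# Hypothetical measurements drawn from two well-separated groups.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights = em_two_gaussians(data)
```

Unlike k-means, each point here contributes to both clusters in proportion to its assignment probability, which is why EM is often described as a soft clustering method.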

- First, EM is fast in the early iterations, but slow in the later iterations.
- Second, EM doesn't always find the optimal parameters and gets stuck in local optima rather than global optima.

- Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.
- Twitter developed WTF (Who-to-Follow), a personalized PageRank recommendation engine that suggests who to follow.
- Bin Jiang, from The Hong Kong Polytechnic University, used a variant of PageRank to predict human movement rates based on topographical metrics in London.

- C4.5 builds a decision tree classification model during training.
- SVM builds a hyperplane classification model during training.
- AdaBoost builds an ensemble classification model during training.

- First, it looks at the k closest labeled training data points, in other words, the k-nearest neighbors.
- Second, using the neighbors' classes, kNN gets a better idea of how the new data should be classified.

- Using Hamming distance as a metric for the "closeness" of 2 text strings.
- Transforming discrete data into binary features.
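
Both ideas fit in a few lines. The sketch below is illustrative only; the function names `hamming_distance` and `one_hot` and the sample values are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def one_hot(value, categories):
    """Turn one discrete value into a list of binary features."""
    return [1 if value == category else 0 for category in categories]

distance = hamming_distance("karolin", "kathrin")
encoded = one_hot("green", ["red", "green", "blue"])
```

Once discrete values are encoded as binary features, an ordinary numeric distance metric can be applied to them.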

- Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.
- Take a similar vote, except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance: for example, if the neighbor is 5 units away, weight its vote 1/5. As the neighbor gets farther away, the reciprocal distance gets smaller and smaller... exactly what we want!
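
The two voting schemes can give different answers on the same neighbors, as this small sketch shows. The training points are invented, the function names are hypothetical, and the tiny epsilon added to the distance is an assumption made here to avoid dividing by zero when a neighbor sits exactly on the query point.

```python
import math
from collections import Counter, defaultdict

def k_nearest(train, query, k):
    """Return the k closest (distance, label) pairs from (vector, label) data."""
    pairs = [(math.dist(vector, query), label) for vector, label in train]
    return sorted(pairs, key=lambda pair: pair[0])[:k]

def majority_vote(neighbors):
    """Each neighbor gets one vote, regardless of distance."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def weighted_vote(neighbors):
    """Each neighbor's vote is weighted by its reciprocal distance."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + 1e-9)  # epsilon is an assumption
    return max(scores, key=scores.get)

# Hypothetical one-dimensional training data.
train = [((1.0,), "A"), ((4.0,), "B"), ((5.0,), "B")]
neighbors = k_nearest(train, (2.0,), k=3)
simple = majority_vote(neighbors)
weighted = weighted_vote(neighbors)
```

Here the simple vote picks B because two of the three neighbors are B, while the weighted vote picks A because the single A neighbor is much closer (weight 1 versus 1/2 + 1/3).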

- kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.
- Noisy data can throw off kNN classifications.
- Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important.
- Since data processing is deferred, kNN generally has greater storage requirements than eager classifiers.
- Selecting a good distance metric is crucial to kNN's accuracy.
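
To illustrate the feature scaling point, here is a minimal min-max scaling sketch that squeezes every feature into the range 0 to 1 before distances are computed. Min-max scaling is only one possible choice (an assumption here, since the text does not name a method), and the height and income values are made up.

```python
def min_max_scale(rows):
    """Rescale each column of a list of numeric tuples to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical (height in cm, income) rows: income would dominate raw distances.
rows = [(150.0, 40000.0), (160.0, 90000.0), (170.0, 65000.0)]
scaled = min_max_scale(rows)
```

After scaling, a difference of 10 cm in height and a difference of 25,000 in income both count as 0.5, so neither feature swamps the other.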

- If height increases, weight likely increases.
- If cholesterol level increases, weight likely increases.
- If cholesterol level increases, pulse likely increases as well.

- The fraction's numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.
- The fraction's denominator is the probability of Feature 1 multiplied by the probability of Feature 2.
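
Written out, the fraction described by these two bullets would look like this (a reconstruction from the description above, using the same class and feature names):

$$
P(\text{Class A} \mid \text{Feature 1}, \text{Feature 2}) =
\frac{P(\text{Feature 1} \mid \text{Class A}) \cdot P(\text{Feature 2} \mid \text{Class A}) \cdot P(\text{Class A})}
     {P(\text{Feature 1}) \cdot P(\text{Feature 2})}
$$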

- We have a training dataset of 1,000 fruits.
- The fruit can be a Banana, Orange or Other (these are the classes).
- The fruit can be Long, Sweet or Yellow (these are the features).

- Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
- Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
- Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
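
Using the counts above, a short sketch can score a fruit that is long, sweet, and yellow (this particular query is an assumption chosen for illustration). Only the numerator is computed for each class, because the denominator is the same for every class and does not change which one wins. The names `counts` and `numerator` are hypothetical.

```python
# Counts taken from the training data described above.
counts = {
    "Banana": {"total": 500, "Long": 400, "Sweet": 350, "Yellow": 450},
    "Orange": {"total": 300, "Long": 0, "Sweet": 150, "Yellow": 300},
    "Other": {"total": 200, "Long": 100, "Sweet": 150, "Yellow": 50},
}
n_fruits = sum(c["total"] for c in counts.values())

def numerator(cls, features):
    """P(class) multiplied by P(feature | class) for each observed feature."""
    c = counts[cls]
    score = c["total"] / n_fruits  # prior probability of the class
    for feature in features:
        score *= c[feature] / c["total"]  # likelihood of the feature given the class
    return score

features = ["Long", "Sweet", "Yellow"]
scores = {cls: numerator(cls, features) for cls in counts}
best = max(scores, key=scores.get)
```

The banana score is 0.5 × 0.8 × 0.7 × 0.9 = 0.252, the orange score is 0 because no orange in the training data is long, and the score for other fruit is 0.2 × 0.5 × 0.75 × 0.25 = 0.01875, so the fruit is classified as a banana.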

C4.5 | CART |
---|---|
Uses information gain to segment data during decision tree generation. | Uses Gini impurity (not to be confused with Gini coefficient). A good discussion of the differences between the impurity and coefficient is available on Stack Overflow. |
Uses a single-pass pruning process to mitigate over-fitting. | Uses the cost-complexity method of pruning. Starting at the bottom of the tree, CART evaluates the misclassification cost with the node vs. without the node. If the cost doesn't meet a threshold, it is pruned away. |
The decision nodes can have 2 or more branches. | The decision nodes have exactly 2 branches. |
Probabilistically distributes missing values to children. | Uses surrogates to distribute the missing values to children. |
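
For reference, the Gini impurity that CART uses in the first row of the comparison can be computed as in the sketch below. The function name and the sample labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Chance of mislabeling a random item if labels were assigned by class frequency."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

mixed = gini_impurity(["cancer", "cancer", "healthy", "healthy"])
pure = gini_impurity(["cancer", "cancer", "cancer", "cancer"])
```

A node holding a single class has an impurity of 0, and an evenly mixed two-class node has an impurity of 0.5, so CART prefers splits whose child nodes score closer to 0.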