Learn 10 Statistical Techniques to Become Data Scientists Master
- Identify the risk factors for prostate cancer.
- Classify a recorded phoneme based on a log-periodogram.
- Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements.
- Customize an email spam detection system.
- Identify the numbers in a handwritten zip code.
- Classify a tissue sample into one of several cancer classes.
- Establish the relationship between salary and demographic variables in population survey data.
- Machine learning arose as a subfield of Artificial Intelligence.
- Statistical learning arose as a subfield of Statistics.
- Machine learning has a greater emphasis on large-scale applications and prediction accuracy.
- Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
- But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization."
- Machine learning has the upper hand in Marketing!
- What will be my monthly spending for next year?
- Which factor (monthly income or number of trips per month) is more important in deciding my monthly spending?
- How are monthly income and trips per month correlated with monthly spending?
- How does the probability of getting lung cancer (Yes vs No) change for every additional pound of overweight and for every pack of cigarettes smoked per day?
- Do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (Yes vs No)?
- Linear Discriminant Analysis computes "discriminant scores" for each observation to decide which class of the response variable it belongs to. These scores are obtained by finding linear combinations of the independent variables. It assumes that the observations within each class are drawn from a multivariate Gaussian distribution and that the covariance of the predictor variables is common across all k levels of the response variable Y.
- Quadratic Discriminant Analysis provides an alternative approach. Like LDA, QDA assumes that the observations from each class of Y are drawn from a Gaussian distribution. However, unlike LDA, QDA assumes that each class has its own covariance matrix. In other words, the predictor variables are not assumed to have a common covariance matrix across the k levels of Y.
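
The discriminant scores described above can be sketched for the simplest case of a single predictor. This is a minimal illustration, not a full multivariate implementation: the function names and the toy data are hypothetical, and the pooled variance plays the role of the covariance shared by all classes.

```python
import math

def lda_fit_1d(xs, ys):
    """Estimate class means, priors, and one pooled variance shared by all classes."""
    classes = sorted(set(ys))
    n = len(xs)
    means, priors = {}, {}
    pooled_ss = 0.0
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        means[k] = sum(xk) / len(xk)
        priors[k] = len(xk) / n
        pooled_ss += sum((x - means[k]) ** 2 for x in xk)
    variance = pooled_ss / (n - len(classes))
    return means, priors, variance

def lda_scores(x, means, priors, variance):
    """Linear discriminant score of every class at the point x."""
    return {
        k: x * means[k] / variance - means[k] ** 2 / (2 * variance) + math.log(priors[k])
        for k in means
    }

def lda_predict(x, means, priors, variance):
    """Assign x to the class with the largest discriminant score."""
    scores = lda_scores(x, means, priors, variance)
    return max(scores, key=scores.get)

# Hypothetical toy data: one predictor, two classes.
xs = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
ys = ["a", "a", "a", "b", "b", "b"]
model = lda_fit_1d(xs, ys)
print(lda_predict(2.9, *model), lda_predict(3.1, *model))
```

With equal priors and a shared variance, the decision boundary sits at the midpoint of the two class means (3.0 here). A QDA variant would estimate a separate variance per class, which makes the score quadratic in x.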
- Bootstrapping is a technique that helps in many situations, such as validating the performance of a predictive model, building ensemble methods, and estimating the bias and variance of a model. It works by sampling with replacement from the original data and taking the "not chosen" data points as test cases. We can repeat this several times and use the average score as an estimate of our model's performance.
- On the other hand, cross-validation is a technique for validating model performance, and it is done by splitting the training data into k parts. We take k - 1 parts as our training set and use the "held out" part as our test set. We repeat this k times, holding out a different part each time. Finally, we take the average of the k scores as our performance estimate.
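
Both resampling schemes come down to how the row indices are split. The sketch below shows one way to do it with the standard library only. The "model" is a deliberately trivial placeholder that predicts the training mean, and all function names are hypothetical.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; the indices never drawn form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def k_fold_splits(n, k):
    """Yield (train, test) index lists; every index is held out exactly once.
    Indices are not shuffled here; in practice one would usually shuffle first."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

def mse_of_mean_model(ys, train, test):
    """Placeholder model: predict the training mean, score by test MSE."""
    prediction = sum(ys[i] for i in train) / len(train)
    return sum((ys[i] - prediction) ** 2 for i in test) / len(test)

def cv_estimate(ys, k):
    """Average the k held-out scores."""
    scores = [mse_of_mean_model(ys, tr, te) for tr, te in k_fold_splits(len(ys), k)]
    return sum(scores) / len(scores)

def bootstrap_estimate(ys, rounds, seed=0):
    """Average the out-of-sample scores over several bootstrap rounds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        train, test = bootstrap_split(len(ys), rng)
        if test:  # skip the rare round in which every point was drawn
            scores.append(mse_of_mean_model(ys, train, test))
    return sum(scores) / len(scores)

# Hypothetical response values.
ys = [2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0, 9.0, 1.0, 10.0]
print(cv_estimate(ys, 5), bootstrap_estimate(ys, 50))
```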
- Best-Subset Selection: Here we fit a separate OLS regression for each possible combination of the p predictors and then look at the resulting model fits. The algorithm is broken up into 2 stages: (1) For each k = 1, ..., p, fit all models that contain exactly k predictors and keep the best one of each size (the one with the lowest RSS), (2) Select a single model from these candidates using cross-validated prediction error. It is important to use testing or validation error, and not training error, to assess model fit, because training RSS never increases and training R² never decreases as more variables are added. The best approach is to cross-validate and choose the model with the lowest estimated test error.
- Forward Stepwise Selection considers a much smaller set of models than best-subset selection. It begins with a model containing no predictors, then adds predictors to the model one at a time until all of the predictors are in the model. At each step, the variable added is the one that gives the greatest additional improvement to the fit, and cross-validated prediction error is used to stop once no further variable improves the model.
- Backward Stepwise Selection begins with all p predictors in the model, then iteratively removes the least useful predictor one at a time.
- Hybrid Methods follow the forward stepwise approach; however, after adding each new variable, the method may also remove variables that do not contribute to the model fit.
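
As a rough illustration of the forward stepwise idea, the sketch below adds, at each step, the predictor that lowers the training RSS the most and returns the order of entry. It is a simplified sketch: the helper names and data are hypothetical, the OLS fit uses the normal equations (so it assumes the predictors are not perfectly collinear), and the cross-validated stopping rule is left out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Training RSS of an OLS fit of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def forward_stepwise(predictors, y):
    """Return predictor names in the order they enter the model."""
    remaining = set(predictors)
    chosen = []
    while remaining:
        best = min(sorted(remaining),
                   key=lambda name: rss([predictors[n] for n in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical data: y depends mostly on x1.
predictors = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0], "x2": [2.0, 1.0, 4.0, 3.0, 6.0]}
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(forward_stepwise(predictors, y))
```

To pick the final model size, one would score each model along this path on held-out data rather than on training RSS.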
- Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Ridge regression, like OLS, seeks coefficient estimates that reduce RSS; however, it also adds a shrinkage penalty that is small when the coefficients are close to zero. This penalty has the effect of shrinking the coefficient estimates towards zero. Without going into the math, it is useful to know that ridge regression shrinks the features with the smallest column space variance the most. As in principal component analysis, ridge regression projects the data onto the principal component directions and then shrinks the coefficients of the low-variance components more than those of the high-variance components, which correspond to the smallest and largest principal components, respectively.
- Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty term will set many of them close to zero, but never exactly to zero. This isn't generally a problem for prediction accuracy, but it can make the model more difficult to interpret. Lasso overcomes this disadvantage and is capable of forcing some of the coefficients to zero, provided that s is small enough. Since s = 1 results in regular OLS regression, as s approaches 0 the coefficients shrink towards zero. Thus, Lasso regression also performs variable selection.
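
The difference between the two penalties is easiest to see in a deliberately simplified setting: a design matrix equal to the identity and no intercept, so each coefficient is estimated from a single observation y. The sketch below is written in the penalty form with a tuning parameter `lam` (an assumption of this example; a larger `lam` corresponds to a smaller budget s), and the function names are hypothetical.

```python
def ridge_shrink(y, lam):
    """Minimizer of (y - b)^2 + lam * b^2: proportional shrinkage, never exactly zero."""
    return y / (1.0 + lam)

def lasso_shrink(y, lam):
    """Minimizer of (y - b)^2 + lam * |b|: soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if y > threshold:
        return y - threshold
    if y < -threshold:
        return y + threshold
    return 0.0

# Hypothetical least squares estimates, shrunk with lam = 2.
for y in (3.0, 0.5, -0.8):
    print(y, ridge_shrink(y, 2.0), lasso_shrink(y, 2.0))
```

Ridge scales every estimate by the same factor, while lasso moves each one towards zero by a fixed amount and sets the small ones exactly to zero, which is why it also performs variable selection.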
- One can describe Principal Components Regression as an approach for deriving a low-dimensional set of features from a large set of variables. The first principal component direction of the data is the direction along which the observations vary the most. In other words, the first PC is a line that fits as close as possible to the data. One can fit up to p distinct principal components. The second PC is a linear combination of the variables that is uncorrelated with the first PC and has the largest variance subject to this constraint. The idea is that the principal components capture the most variance in the data using linear combinations of the data in subsequently orthogonal directions. In this way, we can also combine the effects of correlated variables to get more information out of the available data, whereas in regular least squares we would have to discard one of the correlated variables.
- The PCR method that we described above involves identifying linear combinations of X that best represent the predictors. These combinations (directions) are identified in an unsupervised way since the response Y is not used to help determine the principal component directions. That is, the response Y does not supervise the identification of the principal components, thus there is no guarantee that the directions that best explain the predictors are also the best for predicting the response (even though that is often assumed). Partial least squares (PLS) is a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new smaller set of features that are linear combinations of the original features, then fits a linear model via least squares to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify the new features.
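
The first principal component direction can be approximated with power iteration on the sample covariance matrix. Power iteration is a technique chosen here for brevity, not something the text prescribes; the function names and data are hypothetical, and the sketch assumes the data are not constant and that the starting vector is not orthogonal to the leading direction.

```python
import math

def first_principal_component(rows, iterations=200):
    """Approximate the unit direction of largest variance by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] * p  # assumed starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, means

def scores(rows, direction, means):
    """Project each centered observation onto the direction (the new feature)."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]

# Hypothetical data lying exactly on the line x2 = x1 / 2.
rows = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
direction, means = first_principal_component(rows)
print(direction, scores(rows, direction, means))
```

In PCR, the scores would then be used as the predictor in an ordinary least squares fit of the response; the response plays no part in finding the direction.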
- A function on the real numbers is called a step function if it can be written as a finite linear combination of indicator functions of intervals. Informally speaking, a step function is a piecewise constant function having only finitely many pieces.
- A piecewise function is a function which is defined by multiple sub-functions, each sub-function applying to a certain interval of the main function's domain. Piecewise is actually a way of expressing the function, rather than a characteristic of the function itself, but with additional qualification, it can describe the nature of the function. For example, a piecewise polynomial function is a function that is a polynomial on each of its sub-domains, but possibly a different one on each.
- A spline is a special function defined piecewise by polynomials. In computer graphics, spline refers to a piecewise polynomial parametric curve. Splines are popular curves because of the simplicity of their construction, their ease and accuracy of evaluation, and their capacity to approximate complex shapes through curve fitting and interactive curve design.
- A generalized additive model is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
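
The step function and spline definitions above translate almost directly into code. The sketch below builds a step function as a finite linear combination of interval indicators, and a linear spline using a truncated power basis. The half-open interval convention, the function names, and the example coefficients are assumptions made for this illustration.

```python
def make_step_function(pieces):
    """pieces: list of (left, right, coefficient) for half-open intervals [left, right).
    The result is the sum of coefficient * indicator(left <= x < right)."""
    def f(x):
        return sum(c for left, right, c in pieces if left <= x < right)
    return f

def linear_spline(x, intercept, slope, knots):
    """Continuous piecewise-linear function.
    knots: list of (knot, coefficient); each term adds coefficient * max(x - knot, 0),
    which changes the slope at the knot without introducing a jump."""
    return intercept + slope * x + sum(c * max(x - k, 0.0) for k, c in knots)

# Hypothetical example: constant 2 on [0, 1), constant 5 on [1, 3), zero elsewhere.
step = make_step_function([(0.0, 1.0, 2.0), (1.0, 3.0, 5.0)])
print(step(0.5), step(1.0), step(3.0))

# Slope 1 up to the knot at x = 2, slope 0 afterwards.
print(linear_spline(1.0, 0.0, 1.0, [(2.0, -1.0)]),
      linear_spline(3.0, 0.0, 1.0, [(2.0, -1.0)]))
```

Higher-degree splines follow the same pattern with powers of the truncated terms; fitting the coefficients to data is an ordinary least squares problem in these basis functions.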
- Bagging is a way to decrease the variance of your prediction by generating additional training data from your original dataset, using combinations with repetitions to produce multisets of the same cardinality/size as your original data. By increasing the size of your training set this way you can't improve the model's predictive force, but you can decrease the variance, narrowly tuning the prediction to the expected outcome.
- Boosting is an approach that calculates the output using several different models and then averages the results using a weighted average approach. By combining the advantages and pitfalls of these approaches through the choice of weighting formula, you can come up with a good predictive force for a wider range of input data, using different narrowly tuned models.
- The random forest algorithm is actually very similar to bagging. Also here, you draw random bootstrap samples of your training set. However, in addition to the bootstrap samples, you also draw a random subset of features for training the individual trees; in bagging, you give each tree the full set of features. Due to the random feature selection, you make the trees more independent of each other compared to regular bagging, which often results in better predictive performance (due to better variance-bias trade-offs) and it's also faster because each tree learns only from a subset of features.
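
A minimal sketch of bagging, assuming a toy base learner: each model is trained on a bootstrap sample of the same size as the original data, and the predictions are averaged. The 1-nearest-neighbor learner stands in for a decision tree purely to keep the example short, and all names and data are hypothetical. The last helper shows the extra step a random forest adds, drawing a random subset of feature indices for each tree.

```python
import random

def nearest_neighbor_predict(train, x):
    """Toy base learner: return the y of the training point whose x is closest."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bagged_predict(data, x, n_models=25, seed=0):
    """Average the base learner's predictions over bootstrap samples of the data."""
    rng = random.Random(seed)
    n = len(data)
    predictions = []
    for _ in range(n_models):
        # Sample with replacement; same cardinality as the original data.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        predictions.append(nearest_neighbor_predict(sample, x))
    return sum(predictions) / n_models

def random_feature_subset(n_features, subset_size, rng):
    """Random forest step: the feature indices one tree is allowed to use."""
    return sorted(rng.sample(range(n_features), subset_size))

# Hypothetical (x, y) pairs.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.3), (4.0, 3.8), (5.0, 5.1)]
print(bagged_predict(data, 2.6))
print(random_feature_subset(10, 3, random.Random(0)))
```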
- Principal Component Analysis helps in producing a low-dimensional representation of the dataset by identifying a set of linear combinations of features which have maximum variance and are mutually uncorrelated. This linear dimensionality reduction technique can be helpful in understanding the latent interactions between the variables in an unsupervised setting.
- k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster.
- Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree.
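
k-means can be sketched with the standard alternating procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing changes. The version below takes the starting centroids from the caller to stay deterministic; the function name and data are hypothetical.

```python
def k_means(points, centroids, iterations=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to the centroid at smallest squared distance.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so in practice the procedure is usually run from several random starts and the partition with the smallest within-cluster distance is kept.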