## Category Archives: Big Data

In the previous article, we talked about AdaBoost, which combines the output of weak learners into a weighted sum that represents the final output of the boosted classifier. If you know little about AdaBoost or additive models, we highly recommend reading that article first.

Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.

### Boosting Tree

A boosting tree is based on an additive model, which can be represented as follows:

$$f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)$$

where $T(x; \theta_m)$ stands for a decision tree,…
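As a rough illustration of the stage-wise idea, the sketch below fits a boosting tree for regression under squared loss, where each new tree is fitted to the residuals left by the current model. It is a minimal sketch, not the article's own implementation: the trees are one-split stumps on a single feature, and the names `fit_stump`, `boost`, `predict`, `squared_loss` and the toy data are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds):
    """Build the additive model one stump at a time (squared loss)."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(n_rounds):
        # Under squared loss, the next tree is fitted to the current residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps


def predict(stumps, x):
    """Evaluate the additive model: the sum of all stump outputs."""
    return sum(predict_stump(s, x) for s in stumps)


def squared_loss(stumps, xs, ys):
    return sum((y - predict(stumps, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

one_round = boost(xs, ys, 1)
six_rounds = boost(xs, ys, 6)
```

Comparing `squared_loss(one_round, xs, ys)` with `squared_loss(six_rounds, xs, ys)` shows the training loss shrinking as more trees are added to the sum.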

AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the Gödel Prize in 2003 for their work. The output of the other learning algorithms (weak learners) is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers, although in some problems it can be less susceptible to overfitting than other learning algorithms.

### Bagging vs. Boosting

An overly complex model (such as an unpruned decision tree) has high variance but low bias, whereas an overly simple model (a weak learner such as a decision stump) has high bias but low variance. To minimize…
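The reweighting described above can be sketched in a few lines. This is a minimal sketch assuming labels in {-1, +1}, a single numeric feature, and decision stumps as the weak learners; the function names and the toy data are hypothetical and chosen only for this example.

```python
import math


def stump_predict(threshold, polarity, x):
    # polarity +1 predicts +1 left of the threshold; polarity -1 is the reverse.
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: no single stump separates it.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, 3)
```

No single stump classifies this toy set correctly, but the weighted vote of three stumps does, which is the point of combining weak learners.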

## Random Forest

Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. Because a random forest builds many trees, each using a subset of the available input variables and their values, it inherently contains some underlying decision trees that omit the noise-generating variables/features. In the end, when it is time to generate a prediction, a vote among all the underlying trees takes place and the majority prediction wins. Ensembles are a divide-and-conquer approach used…
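The bootstrap, feature-subset and voting steps described above might look roughly like this. It is a deliberately simplified sketch: each "tree" is a single-split stump and the feature subset is drawn once per tree, whereas full random forest implementations usually grow deeper trees and resample features at every split. All names and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """Best single-split tree restricted to the given feature indices."""
    fallback = majority(labels)
    best = (None, None, fallback, fallback)  # constant leaf if no split helps
    best_error = sum(1 for y in labels if y != fallback)
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            error = (sum(1 for y in left if y != left_label)
                     + sum(1 for y in right if y != right_label))
            if error < best_error:
                best_error = error
                best = (f, threshold, left_label, right_label)
    return best


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if row[f] <= threshold else right_label


def fit_forest(rows, labels, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        features = rng.sample(range(len(rows[0])), n_features)  # feature subset
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample],
                                features))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical toy data: features 0 and 1 are informative, feature 2 is noise.
rows = ([(i / 10, 1 - i / 10, i % 2) for i in range(10)]
        + [(9 + i / 10, 10 - i / 10, i % 2) for i in range(10)])
labels = [0] * 10 + [1] * 10
forest = fit_forest(rows, labels, n_trees=25, n_features=2)
```

Because each tree sees only two of the three features, some trees never see the noise feature at all, and the vote smooths over the rest.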

## Random Projection

### Introduction

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points that lie in Euclidean space. Random projection methods are known for their power, simplicity, and low error rates compared with other methods. According to experimental results, random projection preserves distances well, but empirical results are sparse.

Consider the following problem: we have a set of $n$ points in a high-dimensional Euclidean space $\mathbf{R}^d$. We want to project the points onto a space of low dimension $\mathbf{R}^k$ in such a way that the pairwise distances of the points are approximately the same as before. Formally, we are looking for a map $f:\mathbf{R}^d\rightarrow\mathbf{R}^k$ such that for any pair of original points $u, v$, $\|f(u)-f(v)\|$…
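One common way to build such a map is to multiply each point by a random Gaussian matrix scaled by $1/\sqrt{k}$. The sketch below assumes that construction (the excerpt does not say which one the article uses) and compares one pairwise distance before and after projection. The dimensions, seeds and function names are hypothetical.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """k x d matrix with i.i.d. N(0, 1) entries scaled by 1 / sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, point):
    """Apply the linear map f(x) = R x."""
    return [sum(a * x for a, x in zip(row, point)) for row in matrix]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d, k = 1000, 200
rng = random.Random(1)
u = [rng.uniform(-1, 1) for _ in range(d)]
v = [rng.uniform(-1, 1) for _ in range(d)]

R = random_projection_matrix(d, k)
# Ratio of projected distance to original distance; expected to be close to 1.
ratio = distance(project(R, u), project(R, v)) / distance(u, v)
```

With this scaling the squared projected distance equals the squared original distance in expectation, and the ratio concentrates around 1 more tightly as `k` grows.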

## Decision Tree

### What's a decision tree?

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules. …
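The structure described above maps naturally onto nested dictionaries: an internal node holds the attribute it tests and one child per outcome, and a leaf is just a class label. The tree, attribute names and labels below are made up for illustration.

```python
def classify(node, sample):
    """Follow the branch for each test outcome until a leaf label is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node


def rules(node, path=()):
    """List every root-to-leaf path as (conditions, label)."""
    if not isinstance(node, dict):
        return [(path, node)]
    found = []
    for outcome, child in node["branches"].items():
        condition = (node["attribute"], outcome)
        found.extend(rules(child, path + (condition,)))
    return found


# Hypothetical tree: internal nodes test an attribute, leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
    },
}

label = classify(tree, {"outlook": "sunny", "windy": False})
all_rules = rules(tree)
```

Each entry in `all_rules` is one classification rule: the list of tests along a path, followed by the label at its leaf.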

Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop 1 popularized MapReduce programming for batch jobs and demonstrated the potential value of large-scale, distributed processing. MapReduce, as implemented in Hadoop 1, can be I/O intensive, is not suitable for interactive analysis, and offers constrained support for graph, machine learning, and other memory-intensive algorithms. Hadoop developers rewrote major…
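For readers unfamiliar with the MapReduce programming model mentioned above, the following sketch mimics its three phases (map, shuffle, reduce) for a word count, in a single Python process. It does not use any Hadoop API; the function names and input strings are assumptions for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data on clusters"])
```

In a real cluster the map and reduce calls run on different machines and the shuffle writes intermediate data to disk and moves it over the network, which is one reason the model can be I/O intensive.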