gbm {gbm}
Description
Fits generalized boosted regression models.
Usage
gbm(formula = formula(data),
    distribution = "bernoulli",
    data = list(),
    weights,
    var.monotone = NULL,
    n.trees = 100,
    interaction.depth = 1,
    n.minobsinnode = 10,
    shrinkage = 0.001,
    bag.fraction = 0.5,
    train.fraction = 1.0,
    cv.folds = 0,
    keep.data = TRUE,
    verbose = "CV",
    class.stratify.cv = NULL,
    n.cores = NULL)
gbm.fit(x, y,
        offset = NULL,
        misc = NULL,
        distribution = "bernoulli",
        w = NULL,
        var.monotone = NULL,
        n.trees = 100,
        interaction.depth = 1,
        n.minobsinnode = 10,
        shrinkage = 0.001,
        bag.fraction = 0.5,
        nTrain = NULL,
        train.fraction = NULL,
        keep.data = TRUE,
        verbose = TRUE,
        var.names = NULL,
        response.name = "y",
        group = NULL)
gbm.more(object,
         n.new.trees = 100,
         data = NULL,
         weights = NULL,
         offset = NULL,
         verbose = NULL)
Arguments
- formula
- a symbolic description of the model to be fit. The formula may include an offset term (e.g. y~offset(n)+x). If keep.data=FALSE in the initial call to gbm then it is the user's responsibility to resupply the offset to gbm.more.
- distribution
- either a character string specifying the name of the distribution to use or a list with a component name specifying the distribution and any additional parameters needed. If not specified, gbm will try to guess: if the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed; otherwise, if the response has class "Surv", coxph is assumed; otherwise, gaussian is assumed.
Currently available options are "gaussian" (squared error), "laplace" (absolute loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (huberized hinge loss for 0-1 outcomes), "multinomial" (classification when there are more than 2 classes), "adaboost" (the AdaBoost exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right censored observations), "quantile", or "pairwise" (ranking measure using the LambdaMART algorithm).
If quantile regression is specified, distribution must be a list of the form list(name="quantile",alpha=0.25) where alpha is the quantile to estimate. The current version's quantile regression method does not handle non-constant weights and will stop.
If "tdist" is specified, the default degrees of freedom is 4 and this can be controlled by specifying distribution=list(name="tdist", df=DF) where DF is your chosen degrees of freedom.
If "pairwise" regression is specified, distribution must be a list of the form list(name="pairwise",group=...,metric=...,max.rank=...) (metric and max.rank are optional, see below). group is a character vector with the column names of data that jointly indicate the group an instance belongs to (typically a query in Information Retrieval applications). For training, only pairs of instances from the same group and with different target labels can be considered. metric is the IR measure to use, one of
- conc: Fraction of concordant pairs; for binary labels, this is equivalent to the Area under the ROC Curve
- mrr: Mean reciprocal rank of the highest-ranked positive instance
- map: Mean average precision, a generalization of mrr to multiple positive instances
- ndcg: Normalized discounted cumulative gain. The score is the weighted sum (DCG) of the user-supplied target values, with weights 1/log(rank+1), normalized to the maximum achievable value. This is the default if the user did not specify a metric.
ndcg and conc allow arbitrary target values, while binary targets {0,1} are expected for map and mrr. For ndcg and mrr, a cut-off can be chosen using a positive integer parameter max.rank. If left unspecified, all ranks are taken into account.
Note that splitting of instances into training and validation sets follows group boundaries and therefore only approximates the specified train.fraction ratio (the same applies to cross-validation folds). Internally, queries are randomly shuffled before training, to avoid bias.
Weights can be used in conjunction with pairwise metrics; however, it is assumed that they are constant for instances from the same group.
For details and background on the algorithm, see e.g. Burges (2010).
- data
- an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which gbm is called. If keep.data=TRUE in the initial call to gbm then gbm stores a copy with the object. If keep.data=FALSE then it is the user's responsibility to resupply the same dataset in subsequent calls to gbm.more.
- weights
- an optional vector of weights to be used in the fitting process. The weights must be positive but do not need to be normalized. If keep.data=FALSE in the initial call to gbm then it is the user's responsibility to resupply the weights to gbm.more.
- var.monotone
- an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.
- n.trees
- the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.
- cv.folds
- Number of cross-validation folds to perform. If cv.folds>1 then gbm, in addition to the usual fit, will perform a cross-validation and calculate an estimate of generalization error, returned in cv.error.
- interaction.depth
- The maximum depth of variable interactions. 1 implies an additive model, 2 implies a model with up to 2-way interactions, etc.
- n.minobsinnode
- minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight.
- shrinkage
- a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.
- bag.fraction
- the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction<1 then running the same model twice will result in similar but different fits. gbm uses the R random number generator, so set.seed can ensure that the model can be reconstructed. Preferably, the user can save the returned gbm.object using save.
- train.fraction
- The first train.fraction * nrow(data) observations are used to fit the gbm and the remainder are used for computing out-of-sample estimates of the loss function.
- nTrain
- An integer representing the number of cases on which to train. This is the preferred way of specification for gbm.fit; the option train.fraction in gbm.fit is deprecated and only maintained for backward compatibility. These two parameters are mutually exclusive. If both are unspecified, all data is used for training.
- keep.data
- a logical variable indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster at the cost of storing an extra copy of the dataset.
- object
- a gbm object created from an initial call to gbm.
- n.new.trees
- the number of additional trees to add to object.
- verbose
- If TRUE, gbm will print out progress and performance indicators. If this option is left unspecified for gbm.more then it uses verbose from object.
- class.stratify.cv
- whether or not the cross-validation should be stratified by class. Defaults to TRUE for distribution="multinomial" and is only implemented for multinomial and bernoulli. The purpose of stratifying the cross-validation is to help avoid situations in which training sets do not contain all classes.
- x, y
- For gbm.fit: x is a data frame or data matrix containing the predictor variables and y is the vector of outcomes. The number of rows in x must be the same as the length of y.
- offset
- a vector of values for the offset
- misc
- For gbm.fit: misc is an R object that is simply passed on to the gbm engine. It can be used for additional data for the specific distribution. Currently it is only used for passing the censoring indicator for the Cox proportional hazards model.
- w
- For gbm.fit: w is a vector of weights of the same length as y.
- var.names
- For gbm.fit: A vector of strings of length equal to the number of columns of x containing the names of the predictor variables.
- response.name
- For gbm.fit: A character string label for the response variable.
- group
- group used when distribution = 'pairwise'.
- n.cores
- The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds off to different cores. If n.cores is not specified by the user, it is guessed using the detectCores function in the parallel package. Note that the documentation for detectCores makes clear that it is not failsafe and could return a spurious number of available cores.
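As a rough illustration of the distribution argument described above, the following base R sketch writes out the list forms for the quantile, t-distribution and pairwise losses and mirrors the documented guessing rule. The helper guess_distribution and the column name "query" are hypothetical and are not part of gbm; the package's internal logic may differ in detail.

```r
# List forms of the distribution argument, as described in the Arguments section.
quantile_spec <- list(name = "quantile", alpha = 0.25)  # estimate the 25% quantile
tdist_spec    <- list(name = "tdist", df = 6)           # override the default df of 4
pairwise_spec <- list(name = "pairwise",
                      group = "query",   # hypothetical grouping column in data
                      metric = "ndcg",
                      max.rank = 5)

# Hypothetical helper mirroring the documented guessing rule (not gbm's own code).
guess_distribution <- function(y) {
  if (length(unique(y)) == 2) {
    "bernoulli"      # exactly two unique response values
  } else if (is.factor(y)) {
    "multinomial"    # factor with more than two values
  } else if (inherits(y, "Surv")) {
    "coxph"          # survival response
  } else {
    "gaussian"       # everything else
  }
}

guess_distribution(c(0, 1, 1, 0))
guess_distribution(factor(c("a", "b", "c")))
guess_distribution(c(1.5, 2.7, 3.1))
```

Passing the distribution explicitly avoids relying on the guess, for example when a numeric response happens to contain only two distinct values.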
Details
See the gbm vignette (../doc/gbm.pdf) for technical details.
This package implements the generalized boosted modeling framework. Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001).
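The greedy, stagewise idea can be sketched in a few lines of base R for squared-error loss. This is only an illustration under simplifying assumptions (a single numeric predictor, one-split stumps, no subsampling); fit_stump and predict_stump are hypothetical helpers and do not reflect how the gbm engine is implemented.

```r
# Fit a one-split regression stump to residuals r on a numeric predictor x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between distinct values
  best <- NULL
  for (cut in cuts) {
    left <- x <= cut
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(cut = cut, left = mean(r[left]),
                   right = mean(r[!left]), sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

shrinkage <- 0.1
n_trees <- 100
f <- rep(mean(y), length(y))     # start from the best constant fit
train_error <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  r <- y - f                     # negative gradient of squared-error loss
  stump <- fit_stump(x, r)       # basis function that best fits the residuals
  f <- f + shrinkage * predict_stump(stump, x)  # shrunken update
  train_error[m] <- mean((y - f)^2)
}
```

Each iteration adds one shrunken basis function, so train_error never increases; a held-out set or cross-validation is still needed to decide how many iterations to keep.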
In addition to many of the features documented in the Gradient Boosting Machine, gbm offers additional features including the out-of-bag estimator for the optimal number of iterations, the ability to store and manipulate the resulting gbm object, and a variety of other loss functions that had not previously had associated boosting algorithms, including the Cox partial likelihood for censored data, the Poisson likelihood for count outcomes, and a gradient boosting implementation to minimize the AdaBoost exponential loss function.
gbm.fit provides the link between R and the C++ gbm engine. gbm is a front-end to gbm.fit that uses the familiar R modeling formulas. However, model.frame is very slow if there are many predictor variables. Power users with many variables should use gbm.fit; for general practice gbm is preferable.
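The difference between the two entry points can be sketched as follows. The data frame d and its columns are made up for illustration, the example does not measure the speed difference, and the model fits run only if the gbm package is installed.

```r
set.seed(1)
N <- 200
d <- data.frame(y  = rnorm(N),
                x1 = runif(N),
                x2 = runif(N),
                x3 = factor(sample(letters[1:3], N, replace = TRUE)))

# Route taken by a formula front-end: build a model frame, then split it.
mf <- model.frame(y ~ x1 + x2 + x3, data = d)
y_from_formula <- model.response(mf)
x_from_formula <- mf[, -1, drop = FALSE]

# Direct route for gbm.fit: pick the predictor columns and the outcome yourself.
x <- d[, c("x1", "x2", "x3")]
y <- d$y

# With no missing values, both routes yield the same predictors and outcome.
same_x <- all(mapply(function(a, b) isTRUE(all.equal(a, b)), x, x_from_formula))
same_y <- isTRUE(all.equal(y, unname(y_from_formula)))

# Fit through both interfaces only if gbm is available.
if (requireNamespace("gbm", quietly = TRUE)) {
  fit_formula <- gbm::gbm(y ~ x1 + x2 + x3, data = d,
                          distribution = "gaussian",
                          n.trees = 50, shrinkage = 0.1, verbose = FALSE)
  fit_direct <- gbm::gbm.fit(x = x, y = y,
                             distribution = "gaussian",
                             n.trees = 50, shrinkage = 0.1, verbose = FALSE)
}
```

With gbm.fit the caller is responsible for column selection and ordering, which is the price paid for skipping model.frame.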
Values
gbm, gbm.fit, and gbm.more return a gbm.object.
References
Y. Freund and R.E. Schapire (1997) “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1):119-139.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
J.H. Friedman, T. Hastie, R. Tibshirani (2000). “Additive Logistic Regression: a Statistical View of Boosting,” Annals of Statistics 28(2):337-374.
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
B. Kriegler (2007). http://statistics.ucla.edu/theses/uclastat-dissertation-2007:2Cost-Sensi... Stochastic Gradient Boosting Within a Quantitative Regression Framework. PhD dissertation, UCLA Statistics.
C. Burges (2010). “From RankNet to LambdaRank to LambdaMART: An Overview,” Microsoft Research Technical Report MSR-TR-2010-82.
Greg Ridgeway's site: http://sites.google.com/site/gregridgeway
The MART website: http://www-stat.stanford.edu/~jhf/R-MART.html
See Also
gbm.object, gbm.perf, plot.gbm, predict.gbm, summary.gbm, pretty.gbm.tree.
Examples
library(gbm)

# A least squares regression example
# create some data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,     # formula
    data=data,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution="gaussian",     # see the help for other choices
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum number of observations in each terminal node
    cv.folds = 3,                # do 3-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE,               # don't print out progress
    n.cores=1)                   # use only a single core (detecting #cores is
                                 # error-prone, so avoided here)

# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1,method="OOB")
print(best.iter)

# check performance using a 50% heldout test set
best.iter <- gbm.perf(gbm1,method="test")
print(best.iter)

# check performance using 3-fold cross-validation
best.iter <- gbm.perf(gbm1,method="cv")
print(best.iter)

# plot variable influence
summary(gbm1,n.trees=1)         # based on the first tree
summary(gbm1,n.trees=best.iter) # based on the estimated best number of trees

# compactly print the first and last trees for curiosity
print(pretty.gbm.tree(gbm1,1))
print(pretty.gbm.tree(gbm1,gbm1$n.trees))

# make some new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
# use the same level order as in the training data
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)

data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale (logit,log,etc.)
f.predict <- predict(gbm1,data2,best.iter)

# least squares error
print(sum((data2$Y-f.predict)^2))

# create marginal plots
# plot variable X1,X2,X3 after "best" iterations
par(mfrow=c(1,3))
plot(gbm1,1,best.iter)
plot(gbm1,2,best.iter)
plot(gbm1,3,best.iter)
par(mfrow=c(1,1))
# contour plot of variables 1 and 2 after "best" iterations
plot(gbm1,1:2,best.iter)
# lattice plot of variables 2 and 3
plot(gbm1,2:3,best.iter)
# lattice plot of variables 3 and 4
plot(gbm1,3:4,best.iter)

# 3-way plots
plot(gbm1,c(1,2,6),best.iter,cont=20)
plot(gbm1,1:3,best.iter)
plot(gbm1,2:4,best.iter)
plot(gbm1,3:5,best.iter)

# do another 100 iterations
gbm2 <- gbm.more(gbm1,100,
                 verbose=FALSE) # stop printing detailed progress
Documentation reproduced from package gbm, version 2.1. License: GPL (>= 2) | file LICENSE
