earth {earth}
Description
Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".
See the package vignette “../doc/earth-notes.pdfNotes on the earth package”.
Usage
## S3 method for class 'formula':
earth((formula = stop("no 'formula' arg"),
data, weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
na.action = na.fail, glm = NULL, trace = 0,
keepxy = FALSE, ncross=1, nfold=0, stratify=TRUE, ...))
## S3 method for class 'default':
earth((x = stop("no 'x' arg"), y = stop("no 'y' arg"),
weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
na.action = na.fail, glm = NULL, trace = 0,
keepxy = FALSE, ncross=1, nfold=0, stratify=TRUE, ...))
## S3 method for class 'fit':
earth((x = stop("no 'x' arg"), y = stop("no 'y' arg"),
weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
na.action = na.fail, glm = NULL, trace = 0,
nk = min(200, max(20, 2 * ncol(x))) + 1,
degree = 1,
penalty = if(degree > 1) 3 else 2, thresh = 0.001,
minspan = 0, newvar.penalty = 0, fast.k = 20, fast.beta = 1,
linpreds = FALSE, allowed = NULL,
pmethod = c("backward", "none", "exhaustive", "forward", "seqrep"),
nprune = NULL,
Object = NULL, Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
Force.xtx.prune = FALSE, Use.beta.cache = TRUE, Exhaustive.tol = 1e-10, ...))
Arguments
To start off, look at the arguments formula, data, x, y, nk, degree, and trace. Many users will find that those arguments are all they need, plus in some cases keepxy, nprune, penalty, and minspan. For GLM models, use the glm argument. For cross validation, use the nfold argument.
- formula
- Model formula.
- data
- Data frame for
formula. - x
- Matrix or dataframe containing the independent variables.
- y
- Vector containing the response variable, or, in the case of multiple responses, a matrix or dataframe whose columns are the values for each response.
- subset
- Index vector specifying which cases to use, i.e., which rows in
xto use. Default is NULL, meaning all. - weights
- Weight vector (not yet supported).
- wp
- Vector of response weights. Default is NULL, meaning no response weights. If specified,
wpmust have an element for each column ofy(afterfactorsiny, if any, have been expanded). - scale.y
Scaleyin the forward pass for better numeric stability. Scaling here means subtract the mean and divide by the standard deviation. Default isNCOL(y)==1, i.e., scaleyunlessyhas multiple columns.- na.action
- NA action. Default is
na.fail, and onlyna.failis supported. - glm
- NULL (default) or a list of arguments to pass on to
glm. See the documentation ofglmfor a description of these arguments (but not allglm's arguments are supported,earthwill give an error message if you use a glm argument it does not support).
See also “Generalized linear models” in the package vignette. Example:
earth(y~x, glm=list(family=binomial)) - trace
- Trace
earth's execution. Default is 0. Values:
0 no tracing
0.5 cross validation
1 overview
2 forward pass
3 pruning
4 model mats, pruning details
5 internal details of operation - keepxy
- Set to TRUE to retain the following in the returned value:
xandy(ordata),subset, andweights. Default is FALSE. The functionupdate.earthand friends will use these if present instead of searching for them in the environment at the timeupdate.earthis invoked. This argument also affects the amount of data kept when thenfoldargument is used (seecv.listin the “Value” section below).The following arguments are for the forward pass.
- nk
- Maximum number of model terms before pruning, i.e., the maximum number of terms created by the forward pass. Includes the intercept.
The actual number of terms created by the forward pass will often be less thannk. See “Termination conditions for the forward pass” in the package vignette.
The default is semi-automatically calculated from the number of predictors but may need adjusting. Usetrace=1or more to see why the forward pass terminated --- if it is "Reached max number of terms" then you could consider increasingnk. - degree
- Maximum degree of interaction (Friedman's mi). Default is 1, meaning build an additive model (i.e., no interaction terms).
- penalty
- Generalized Cross Validation (GCV) penalty per knot. Default is
if(degree>1) 3 else 2. A value of 0 penalizes only terms, not knots. The value -1 is treated specially to mean no penalty, so GCV=RSS/n. Simulation studies have suggested values in the range of about 2 to 4. In practice, for big datasets larger values can be useful to force a smaller model. The FAQ section in the package vignette has some information on GCVs. - thresh
- Forward stepping threshold. Default is 0.001. This is one of the arguments used to decide when forward stepping should terminate: the forward pass terminates if adding a term changes RSq by less than
thresh. See also “Termination conditions for the forward pass” in the package vignette. - minspan
- Minimum distance between knots. The
minspanargument is intended to increase resistance to runs of noise in the input data. (Note: predictor value extremes are ineligible for knots regardless of theminspansetting, as per the MARS paper equation 45.)
Useminspan=1to consider all x values (which is good if the data are not noisy).
The default isminspan=0. This value 0 is treated specially and means calculate theminspaninternally as per Friedman's MARS paper section 3.8 with alpha = 0.05. Settrace>=2to see the calculated value.
minspan=-1is also treated specially. It is for back compatibility, and means calculate minspan using the (incorrect) method used in versions ofearthprior to 2.4-0. Usingminspan=-1instead of the defaultminspan=0will usually build a model with a very similar GCV, although with slightly different knots and terms. - newvar.penalty
- Penalty for adding a new variable in the forward pass (Friedman's gamma, equation 74 in the MARS paper). Default is 0, meaning no penalty for adding a new variable. Useful non-zero values typically range from about 0.01 to 0.2 --- you will need to experiment.
With the defaultnewvar.penalty=0, if two variables have nearly the same effect (e.g. they are collinear), at any step in the forward passearthmay arbitrarily select one or the other. Both variables can appear in the final model, complicating model interpretation. On the other hand with a non-zeronewvar.penalty, the forward pass will be reluctant to add a new variable --- it will rather try to use a variable already in the model, if that does not affect RSq too much. The resulting final model may be easier to interpret, if you are lucky. There will often be a small performance hit (a worse GCV). - fast.k
- Maximum number of parent terms considered at each step of the forward pass. Friedman invented this parameter to speed up the forward pass (see the Fast MARS paper section 3.0). Default is 20. Values of 0 is treated specially (as being equivalent to infinity), meaning no Fast MARS. Typical values, apart from 0, are 20, 10, or 5. In general, with a lower
fast.k(say 5),earthis faster; with a higherfast.k, or withfast.kdisabled (set to 0),earthbuilds a better model. However it is not unusual because of random variation to get a slightly better model with a lowerfast.k. - fast.beta
- Fast MARS aging coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results.
- linpreds
- Index vector specifying which predictors should enter linearly, as in
lm. This does not say that a predictor must enter the model; only that if it enters, it enters linearly.
The default is FALSE, meaning all predictors enter in the standard MARS fashion, i.e., in hinge functions.
A predictor's index inlinpredsis the column number in the input matrixxafter factors have been expanded. Examples are given in the package vignette. - allowed
- Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed.
During the forward pass,earthcalls theallowedfunction before considering a term for inclusion; the term can go into the model only if theallowedfunction returns TRUE. See the package vignette for details.The following arguments are for cross validation.
- ncross
- Only applies if
nfold>1. Number of cross-validations. Each cross-validation hasnfoldfolds. Default 1. - nfold
- Number of cross-validation folds. Default is 0, no cross validation. If greater than 1,
earthfirst builds a standard model as usual with all the data. It then buildsnfoldcross-validated models, measuring R-Squared on the out-of-fold (left out) data each time. The final cross validation R-Squared (cv.rsq) is the mean of these fold R-Squareds.
The above process of buildingnfoldmodels is repeatedncrosstimes (by default, once).
Further statistics are calculated ifkeepxy=TRUEor if a binomial or poisson model (theglmargument). For details, see “Cross validation” in the package vignette. - stratify
- Only applies if
nfold>1. Default is TRUE. Stratify the cross-validation samples so that an approximately equal number of cases with a non-zero response occur in each cross validation subset. So if the responseyis logical, theTRUEs will be spread evenly across folds. If the response is a factor, there will be an approximately equal number of each factor level in each fold (because a multilevel factor response gets expanded to columns of zeros and ones, see “Factors” in the package vignette). We say “approximately equal” because the number of occurrences of a factor level may not be exactly divisible by the number of folds.The following arguments are for the pruning pass.
- pmethod
- Pruning method. One of:
backward none exhaustive forward seqrep.
Default is"backward".
Usenoneto retain all the terms created by the forward pass.
Ifyhas multiple columns, then onlybackwardornoneis allowed.
Pruning can take a while ifexhaustiveis chosen and the model is big (more than about 30 terms). The current version of theleapspackage used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line usetaskkill). - nprune
- Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to enforce an upper bound on the model size (that is less than
nk), or to reduce exhaustive search time withpmethod="exhaustive".The following arguments are for internal or advanced use.
- Object
- Earth object to be updated, for use by
update.earth. - Get.crit
- Criterion function for model selection during pruning. By default a function that returns the GCV. See “The pruning pass” in the package vignette.
- Eval.model.subsets
- Function to evaluate model subsets --- see notes in source code.
- Force.xtx.prune
- Default is FALSE. This argument pertains to subset evaluation in the pruning pass. By default, if
yhas a single column thenearthcalls theleapsroutines; ifyhas multiple columns thenearthcallsEvalSubsetsUsingXtx. Theleapsroutines are more accurate but do not support multiple responses (leapsis based on the QR decomposition andEvalSubsetsUsingXtxis based on the inverse of X'X). SettingForce.xtx.prune=TRUEforces use ofEvalSubsetsUsingXtx, even ifyhas a single column. - Use.beta.cache
- Default is TRUE. Using the “beta cache” takes more memory but is faster (by 20% and often much more for large models). The beta cache uses
nk * nk * ncol(x) * sizeof(double)bytes. SetUse.beta.cache=FALSEto save memory. (The beta cache is an innovation in this implementation of MARS and does not appear in Friedman's papers. It is not related to thefast.betaargument. Certain regression coefficients in the forward pass can be saved and re-used, thus saving recalculation time.) - Exhaustive.tol
- Default
1e-10. Applies only whenpmethod="exhaustive". If the reciprocal of the condition number ofbxis less thanExhaustive.tol,earthforcespmethod="backward". See “XHAUST returned error code -999” in the package vignette. - ...
- Dots are passed on to
earth.fit.
Values
An object of class "earth" which is a list with the components listed below. Term refers to a term created during the forward pass (each line of the output from format.earth is a term). Term number 1 is always the intercept.
- rss
- Residual sum-of-squares (RSS) of the model (summed over all responses if
yhas multiple columns). - rsq
1-rss/tss. R-Squared of the model (calculated over all responses). A measure of how well the model fits the training data. Note thattssis the total sum-of-squares,sum((y - mean(y))^2).- gcv
- Generalized Cross Validation (GCV) of the model (summed over all responses) The GCV is calculated using the
penaltyargument. For details of the GCV calculation, see equation 30 in Friedman's MARS paper andearth:::get.gcv. - grsq
1-gcv/gcv.null. An estimate of the predictive power of the model (calculated over all responses,gcv.nullis the GCV of an intercept-only model). See “CanGRSqbe negative?” in the package vignette for some discussion.- bx
- Matrix of basis functions applied to
x. Each column corresponds to a selected term. Each row corresponds to a row in in the input matrixx, after takingsubset. Seemodel.matrix.earthfor an example ofbxhandling. Examplebx:(Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(... [1,] 1 0.0 4.6 0 [2,] 1 0.0 4.3 0 [3,] 1 0.0 4.1 0 ...
- dirs
- Matrix with one row per MARS term, and with with ij-th element equal to
if predictor j is not in term i
-1if an expression of the formh(const - xj)is in term i
1if an expression of the formh(xj - const)is in term i
2if predictor j enters term i linearly (thelinpredsargument).This matrix includes all terms generated by the forward.pass, including those not in
selected.terms. Note that here the terms may not all be in pairs, because although the forward pass add terms as hinged pairs (so both sides of the hinge are available as building blocks for further terms), it also deletes linearly dependent terms before handing control to the pruning pass. Exampledirs:Girth Height (Intercept) 0 0 #intercept h(Girth-12.9) 1 0 #2nd term uses Girth h(12.9-Girth) -1 0 #3rd term uses Girth h(Girth-12.9)*h(Height-76) 1 1 #4th term uses Girth and Height ...
- cuts
- Matrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward.pass, including those not in
selected.terms. Note for programmers: the precedent is to usedirsfor term names etc. and to only usecutswhere cut information needed. Examplecuts:Girth Height (Intercept) 0 0 #intercept, no cuts h(Girth-12.9) 12.9 0 #2nd term has cut at 12.9 h(12.9-Girth) 12.9 0 #3rd term has cut at 12.9 h(Girth-12.9)*h(Height-76) 12.9 76 #4th term has two cuts ...
- selected.terms
- Vector of term numbers in the selected model. Can be used as a row index vector into
cutsanddirs. The first elementselected.terms[1]is always 1, the intercept. - prune.terms
- A matrix specifying which terms appear in which pruning pass subsets. The row index of
prune.termsis the model size. (The model size is the number of terms in the model. The intercept is counted as a term.) Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thusprune.termsis a lower triangular matrix, with dimensionsnprune x nprune. The model selected by the pruning pass is at row numberlength(selected.terms). Exampleprune.terms:[1,] 1 0 0 0 0 0 0 #intercept-only model [2,] 1 2 0 0 0 0 0 #best 2 term model uses terms 1,2 [3,] 1 2 4 0 0 0 0 #best 3 term model uses terms 1,2,4 [4,] 1 2 6 9 0 0 0 #and so on ...
- rss.per.response
- A vector of the RSS for each response. Length is the number of responses, i.e.,
ncol(y)after factors inyhave been expanded. Thersscomponent above is equal tosum(rss.per.response). - rsq.per.response
- A vector of the R-Squared for each response. Length is the number of responses.
- gcv.per.response
- A vector of the GCV for each response. Length is the number of responses. The
gcvcomponent above is equal tosum(gcv.per.response). - grsq.per.response
- A vector of the GRSq for each response. Length is the number of responses.
- rss.per.subset
- A vector of the RSS (measured on the training data) for each model subset generated by the pruning pass. Length is
nprune. For multiple responses, the RSS is summed over all responses for each subset. Therssabove is
rss.per.subset[length(selected.terms)]. The RSS of an intercept only-model isrss.per.subset[1]. - gcv.per.subset
- A vector of the GCV for each model in
prune.terms. Length isnprune. For multiple responses, the GCV is summed over all responses for each subset. Thegcvabove isgcv.per.subset[length(selected.terms)]. The GCV of an intercept-only model isgcv.per.subset[1]. - fitted.values
- Fitted values. A matrix with dimensions
nrow(y)xncol(y)after factors inyhave been expanded. - residuals
- Residuals. A matrix with dimensions
nrow(y)xncol(y)after factors inyhave been expanded. - coefficients
- Regression coefficients. A matrix with dimensions
length(selected.terms)xncol(y)after factors inyhave been expanded. Each column holds the least squares coefficients from regressing that column ofyonbx. The first row holds the intercept coefficient(s). - penalty
- The GCV penalty used during pruning. A copy of
earth'spenaltyargument. - call
- The call used to invoke
earth. - terms
- Model frame terms. This component exists only if the model was built using
earth.formula. - namesx
- Column names of
x, generated internally byearthwhen necessary so each column ofxhas a name. Used, for example, bypredict.earthto name columns if necessary. - namesx.org
- Original column names of
x. - levels
- Levels of
yifyis afactor
c(FALSE,TRUE)ifyislogical
Else NULL - wp
- Copy of the
wpargument toearth.The following fields appear only if
earth's argumentkeepxyis TRUE. - x
- y
- data
- subset
- weights
- Copies of the corresponding arguments to
earth. Only exist ifkeepxy=TRUE.The following fields appear only if
earth'sglmargument is used. - glm.list
- List of GLM models. Each element is the value returned by
earth's internal call toglmfor each response.
Thus if there is a single response (or a single binomial pair, see “Binomial pairs” in the package vignette) this will be a one element list and you access the GLM model withmy.earth.model$glm.list[[1]]. - glm.coefficients
- GLM regression coefficients. Analogous to the
coefficientsfield described above but for the GLM model(s). A matrix with dimensionslength(selected.terms)xncol(y)after factors inyhave been expanded. Each column holds the coefficients from the GLM regression of that column ofyonbx. This duplicates, for convenience, information buried inglm.list. - glm.bpairs
- NULL unless there are paired binomial columns. A logical vector, derived internally by
earth, or a copy thebpairsspecified by the user in theglmlist. See “Binomial pairs” in the package vignette.The following fields appear only if the
nfoldargument is greater than 1. - cv.list
- List of
earthmodels, one model for each fold (ncross * nfoldmodels). To save memory, lengthy fields in the fold models are removed unless you usekeepxy=TRUE. The “lengthy fields” are$bx,$fitted.values, and$residuals. The fold models have two extra fields,icross(the cross-validation index,1:ncross) andifold(the fold index,1:nfold). - cv.nterms
- Vector of length
ncross * nfold + 1. Number of MARS terms in the model generated at each cross-validation fold, with the final element being the mean of these. - cv.nvars
- Vector of length
ncross * nfold + 1. Number of predictors in the model generated at each cross-validation fold, with the final element being the mean of these. - cv.groups
- Specifies which cases went into which folds. Matrix with two columns and number of rows equal to the the number of cases
nrow(x)Elements of the first column specify the cross-validation number,1:ncross. Elements of the second column specify the fold number,1:nfold. - cv.rsq.tab
- Matrix with
ncross * nfold + 1rows andnresponse+1columns, wherenresponseis the number of responses, i.e.,ncol(y)after factors inyhave been expanded. The firstnresponseelements of a row are thecv.rsq's on the out-of-fold data for each response of the model generated at that row's fold. (Acv.rsqis calculated from predictions on the out-of-fold data using the best model built from the in-fold data; where “best” means the model was selected using the in-fold GCV.) The final column holds the row mean (a weighted mean ifwpif specified). The final row holds the column means. The values in this final row is the meancv.rsqprinted bysummary.earth.Example for a single response model (where the
meancolumn is redundant but included for uniformity with multiple response models):y mean fold1 0.909 0.909 fold2 0.869 0.869 fold3 0.952 0.952 fold4 0.157 0.157 fold5 0.961 0.961 mean 0.769 0.769
Example for a multiple response model:
y1 y2 y3 mean fold1 0.915 0.951 0.944 0.937 fold2 0.962 0.970 0.970 0.968 fold3 0.914 0.940 0.942 0.932 fold4 0.907 0.929 0.925 0.920 fold5 0.947 0.987 0.979 0.971 mean 0.929 0.955 0.952 0.946
- cv.oof.rsq.tab
- Matrix with
ncross * nfold + 1rows andmax.ntermscolumns, Only calculated and kept ifkeepxy=TRUE. Each element holds an out-of-fold RSq (oof.rsq), calculated from predictions from the out-of-fold observations using the model built with the in-fold data. The final row is the mean over all folds. - cv.infold.rsq.tab
- Like
cv.oof.rsq.tabbut from predictions made on the in-fold observations. - cv.class.rate.tab
- Like
cv.rsq.tabbut is the classification rate at each fold i.e. the fraction of classes correctly predicted. Models with discrete response only. Calculated withthresh=.5for binary responses. For responses with more than two levels, the final row is the overall classification rate. The other rows are the classification rates for each level (the level versus versus not-the-level), which are usually higher than the overall classification rate (predicting the level versus not-the-level is easier than correctly predicting one of many levels). - cv.maxerr.tab
- Like
cv.rsq.tabbut is theMaxErrat each fold. This is the signed max absolute value at each fold. Results are aggregated for the final column and final row using the signed max absolute value. The signed max absolute value is defined as the maximum of the absolute difference between the predicted and observed response values, multiplied by-1if the sign of that difference is negative. - cv.auc.tab
- Like
cv.rsq.tabbut is theAUCat each fold. Binomial models only. - cv.cor.tab
- Like
cv.rsq.tabbut is thecorat each fold. Poisson models only. - cv.deviance.tab
- Like
cv.rsq.tabbut is theMeanDevat each fold. Binomial models only. - cv.calib.int.tab
- Like
cv.rsq.tabbut is theCalibIntat each fold. Binomial models only. - cv.calib.slope.tab
- Like
cv.rsq.tabbut is theCalibSlopeat each fold. Binomial models only.
References
The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. The Wikipedia article is recommended for an elementary introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x, and check out the book's errata.) Friedman and Silverman is recommended background reading for the MARS paper. Earth's pruning pass uses code from the leaps package which is based on techniques in Miller.
Faraway (2005) Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1--141 http://www.salfordsystems.com/doc/MARS.pdf
Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www.milbo.users.sonic.net/earth/Friedman-FastMars.pdf, http://www-stat.stanford.edu/research/index.html
Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z
Hastie, Tibshirani, and Friedman (2009) The Elements of Statistical Learning (2nd ed.) http://www-stat.stanford.edu/~hastie/pub.htm
Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 http://www-stat.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression http://www.cmis.csiro.au/Alan_Miller/index.html
Wikipedia article on MARS http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines
Note
Please see the package vignette “../doc/earth-notes.pdfNotes on the earth package”. The vignette can also be downloaded from http://www.milbo.org/doc/earth-notes.pdf.
See Also
Start with summary.earth, plot.earth, evimp, and plotmo.
Please see the package vignette “../doc/earth-notes.pdfNotes on the earth package”. The vignette can also be downloaded from http://www.milbo.org/doc/earth-notes.pdf.
Documentation reproduced from package earth, version 3.2-6. License: GPL-3
