# earth {earth}

### Description

Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".

See the package vignette “Notes on the earth package” (../doc/earth-notes.pdf).

### Usage

    ## S3 method for class 'formula':
    earth(formula = stop("no 'formula' argument"), data = NULL,
          weights = NULL, wp = NULL, subset = NULL, na.action = na.fail,
          pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
          keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
          ncross=1, nfold=0, stratify=TRUE,
          varmod.method = "none", varmod.exponent = 1, varmod.conv = 1,
          varmod.clamp = .1, varmod.minspan = -3,
          Scale.y = (NCOL(y)==1), ...)

    ## S3 method for class 'default':
    earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
          weights = NULL, wp = NULL, subset = NULL, na.action = na.fail,
          pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
          keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
          ncross=1, nfold=0, stratify=TRUE,
          varmod.method = "none", varmod.exponent = 1, varmod.conv = 1,
          varmod.clamp = .1, varmod.minspan = -3,
          Scale.y = (NCOL(y)==1), ...)

    ## S3 method for class 'fit':
    earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
          weights = NULL, wp = NULL, subset = NULL, na.action = na.fail,
          pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
          keepxy = FALSE, trace = 0, glm = NULL, degree = 1,
          penalty = if(degree > 1) 3 else 2,
          nk = min(200, max(20, 2 * ncol(x))) + 1,
          thresh = 0.001, minspan = 0, endspan = 0, newvar.penalty = 0,
          fast.k = 20, fast.beta = 1, linpreds = FALSE, allowed = NULL,
          nprune = NULL, Object = NULL, Scale.y = (NCOL(y)==1),
          Adjust.endspan = 2, Force.weights = FALSE, Use.beta.cache = TRUE,
          Force.xtx.prune = FALSE, Get.leverages = NROW(x) < 1e5,
          Exhaustive.tol = 1e-10, ...)

### Arguments

To start off, look at the arguments `formula`, `data`, `x`, `y`, `nk`, `degree`, and `trace`.

If the response is binary or a factor, consider using the `glm` argument.

For cross validation, use the `nfold` argument.

For prediction intervals, use the `varmod.method` argument.

Most users will find that the above arguments are all they need, plus in some cases `keepxy` and `nprune`. Unless you are a knowledgeable user, it's best not to subvert the standard algorithm by toying with tuning parameters such as `thresh`, `penalty`, and `endspan`.
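As a minimal sketch of these starting arguments, the following fits the same model through the formula interface and through the `x`, `y` interface, then allows interactions with `degree=2`. It assumes the `earth` package is installed and uses R's built-in `trees` data; the object names are chosen here for illustration only.

```r
library(earth)

# Formula interface: model Volume from the other columns of trees
mod.formula <- earth(Volume ~ ., data = trees)

# Equivalent x, y interface
mod.xy <- earth(x = trees[, c("Girth", "Height")], y = trees$Volume)

# Allow two-way interactions and print an overview while fitting
mod.deg2 <- earth(Volume ~ ., data = trees, degree = 2, trace = 1)

print(summary(mod.formula))
```

The two interfaces should produce the same fit; the formula interface additionally stores the model frame terms in the returned object.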

- formula
- Model formula.
- data
- Data frame for `formula`.
- x
- Matrix or dataframe containing the independent variables.
- y
- Vector containing the response variable, or, in the case of multiple responses, a matrix or dataframe whose columns are the values for each response.
- subset
- Index vector specifying which cases to use, i.e., which rows in `x` to use. Default is NULL, meaning all.
- weights
- Case weights. Default is NULL, meaning no case weights. If specified, `weights` must have length equal to `nrow(x)` before applying `subset`. Zero weights are converted to a very small nonzero value.
- wp
- Response weights. Default is NULL, meaning no response weights. If specified, `wp` must have an element for each column of `y` (after factors in `y`, if any, have been expanded). Zero weights are converted to a very small nonzero value.
- na.action
- NA action. Default is `na.fail`, and only `na.fail` is supported.
- keepxy
- Default is `FALSE`. Set to `TRUE` to retain the following in the returned value: `x` and `y` (or `data`), `subset`, and `weights`. The function `update.earth` and friends will use these if present instead of searching for them in the environment at the time `update.earth` is invoked.

When the `nfold` argument is used with `keepxy=TRUE`, `earth` keeps more data and calls `predict.earth` multiple times to generate `cv.oof.rsq.tab` and `cv.infold.rsq.tab` (see the `cv.` components in the “*Values*” section below). It therefore makes cross-validation significantly slower.
- trace
- Trace `earth`'s execution. Default is `0`. Values:

  - `.3` variance model (the `varmod.method` arg)
  - `.5` cross validation (the `nfold` arg)
  - `1` overview
  - `2` forward pass
  - `3` pruning
  - `4` model mats summary, pruning details
  - `5` full model mats, internal details of operation
- glm
- NULL (default) or a list of arguments to pass on to `glm`. See the documentation of `glm` for a description of these arguments. See “*Generalized linear models*” in the vignette. Example:

`earth(survived~., data=etitanic, degree=2, glm=list(family=binomial))`
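The `glm` example above can be expanded into a short runnable sketch. It assumes the `earth` package and its bundled `etitanic` data are available; `mod.glm`, `prob`, and `glm.fit` are names introduced here for illustration.

```r
library(earth)
data(etitanic)  # Titanic passenger data shipped with earth

# earth builds the basis functions; glm then fits a binomial model on them
mod.glm <- earth(survived ~ ., data = etitanic, degree = 2,
                 glm = list(family = binomial))

# Predicted survival probabilities on the training data
prob <- predict(mod.glm, type = "response")

# The underlying glm object for the single response
glm.fit <- mod.glm$glm.list[[1]]
print(class(glm.fit))
```

With a single response, `glm.list` has one element, so `glm.list[[1]]` is the model that standard `glm` tools can be applied to.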

**The following arguments are for the forward pass.**

- degree
- Maximum degree of interaction (Friedman's mi). Default is `1`, meaning build an additive model (i.e., no interaction terms).
- penalty
- Generalized Cross Validation (GCV) penalty per knot. Default is `if(degree>1) 3 else 2`. Simulation studies suggest values in the range of about `2` to `4`. The FAQ section in the vignette has some information on GCVs.

Special values (for use by knowledgeable users): The value `-1` means no penalty, so GCV = RSS/n.
- nk
- Maximum number of model terms before pruning, i.e., the maximum number of terms created by the forward pass. Includes the intercept.

The actual number of terms created by the forward pass will often be less than `nk` because of other stopping conditions. See “*Termination conditions for the forward pass*” in the vignette.

The default is semi-automatically calculated from the number of predictors but may need adjusting.
- thresh
- Forward stepping threshold. Default is `0.001`. This is one of the arguments used to decide when forward stepping should terminate: the forward pass terminates if adding a term changes RSq by less than `thresh`. See “*Termination conditions for the forward pass*” in the vignette.
- minspan
- Minimum number of observations between knots. (This increases resistance to runs of correlated noise in the input data.)

The default `minspan=0` is treated specially and means calculate the `minspan` internally, as per Friedman's MARS paper section 3.8 with alpha = 0.05. Set `trace>=2` to see the calculated value.

Use `minspan=1` and `endspan=1` to consider all x values.

Negative values of `minspan` specify the maximum number of knots per predictor. These will be equally spaced. For example, `minspan=-3` allows three evenly spaced knots for each predictor. As always, knots that fall in the endzones specified by `endspan` will be ignored.
- endspan
- Minimum number of observations before the first and after the final knot.

The default `endspan=0` is treated specially and means calculate the `endspan` internally, as per the MARS paper equation 45 with alpha = 0.05. Set `trace>=2` to see the calculated value.

Be wary of reducing `endspan`, especially if you plan to make predictions beyond or near the limits of the training data. Overfitting near the edges of training data is much more likely with a small `endspan`. The model's `RSq` and `GRSq` won't indicate when this overfitting is occurring. (A `plotmo` plot can help: look for sharp hinges at the edges of the data.) See also the `Adjust.endspan` argument.
- newvar.penalty
- Penalty for adding a new variable in the forward pass (Friedman's gamma, equation 74 in the MARS paper). Default is `0`. Non-zero values typically range from `0.01` to `0.2` and sometimes higher; you will need to experiment.

A word of explanation. With the default `newvar.penalty=0`, if two variables have nearly the same effect (e.g. they are collinear), at any step in the forward pass `earth` will arbitrarily select one or the other (depending on noise in the sample). Both variables can appear in the final model, complicating model interpretation. On the other hand, with a non-zero `newvar.penalty` the forward pass will be reluctant to add a new variable: it will rather try to use a variable already in the model, if that does not affect RSq too much. The resulting final model may be easier to interpret, if you are lucky. There will often be a small performance hit (a worse GCV).
- fast.k
- Maximum number of parent terms considered at each step of the forward pass. (This speeds up the forward pass. See the Fast MARS paper section 3.0.)

Default is `20`. Typical values are `20`, `10`, or `5`.

In general, with a lower `fast.k` (say `5`), `earth` is faster; with a higher `fast.k`, or with `fast.k` disabled, `earth` builds a better model. However, because of random variation this general rule often doesn't apply.
- fast.beta
- Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is `1`.
- linpreds
- Index vector specifying which predictors should enter linearly, as in `lm`. The default is `FALSE`, meaning all predictors enter in the standard MARS fashion, i.e., in hinge functions.

This does not say that a predictor *must* enter the model; only that if it enters, it enters linearly. See “*The `linpreds` argument*” in the vignette.

A predictor's index in `linpreds` is the column number in the input matrix `x` (after factors have been expanded).

`linpreds=TRUE` makes all predictors enter linearly (the `TRUE` gets recycled).

`linpreds` may also be a character vector e.g. `linpreds=c("wind", "vis")`. Note: `grep` is used for matching. Thus `"wind"` will match all variables that have `"wind"` in their names. Use `"^wind$"` to match only the variable named `"wind"`.
- allowed
- Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed.

During the forward pass, `earth` calls the `allowed` function before considering a term for inclusion; the term can go into the model only if the `allowed` function returns `TRUE`. See “*The allowed argument*” in the vignette.

**The following arguments are for the pruning pass.**

- pmethod
- Pruning method. One of: `"backward"`, `"none"`, `"exhaustive"`, `"forward"`, `"seqrep"`, `"cv"`.

Default is `"backward"`.

**New in version 4.4.0:** Specify `pmethod="cv"` to use cross-validation to select the number of terms. This selects the number of terms that gives the maximum mean out-of-fold RSq on the fold models. Requires the `nfold` argument.

Use `"none"` to retain all the terms created by the forward pass.

If `y` has multiple columns, then only `"backward"` or `"none"` is allowed.

Pruning can take a while if `"exhaustive"` is chosen and the model is big (more than about 30 terms). The current version of the `leaps` package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line use `taskkill`).
- nprune
- Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to enforce an upper bound on the model size (that is less than `nk`), or to reduce exhaustive search time with `pmethod="exhaustive"`.

**The following arguments are for cross validation.**

- ncross
- Only applies if `nfold>1`. Number of cross-validations. Each cross-validation has `nfold` folds. Default `1`.
- nfold
- Number of cross-validation folds. Default is `0`, meaning no cross-validation. If `nfold` is greater than `1`, `earth` first builds a standard model as usual with all the data. It then builds `nfold` cross-validated models, measuring R-Squared on the out-of-fold (left out) data each time. The final cross validation R-Squared (`CVRSq`) is the mean of these out-of-fold R-Squareds.

The above process of building `nfold` models is repeated `ncross` times (by default, once). Use `trace=.5` to trace cross-validation.

Further statistics are calculated if `keepxy=TRUE` or if the model is a binomial or poisson model (specified with the `glm` argument). See “*Cross validation*” in the vignette.
- stratify
- Only applies if `nfold>1`. Default is `TRUE`. Stratify the cross-validation samples so that an approximately equal number of cases with a non-zero response occur in each cross validation subset. So if the response `y` is logical, the `TRUE`s will be spread evenly across folds. And if the response is a multilevel factor, there will be an approximately equal number of each factor level in each fold (because a multilevel factor response gets expanded to columns of zeros and ones, see “*Factors*” in the vignette). We say “approximately equal” because the number of occurrences of a factor level may not be exactly divisible by the number of folds.

**The following arguments are for variance models (new in version 4.0.0).**

- varmod.method
- Construct a variance model. For details, see `varmod` and the vignette “Variance models in earth” (../doc/earth-varmod.pdf). Use `trace=.3` to trace construction of the variance model.

This argument requires `nfold` and `ncross`. (We suggest at least `ncross=30` here to properly calculate the variance of the errors, although you can use a smaller value, say `3`, for debugging.)

The `varmod.method` argument should be one of

  - `"none"` Default. Don't build a variance model.
  - `"const"` Assume homoscedastic errors.
  - `"lm"` Use `lm` to estimate standard deviation as a function of the predicted response.
  - `"rlm"` Use `rlm`.
  - `"earth"` Use `earth`.
  - `"gam"` Use `gam`. This will use either `gam` or the `mgcv` package, whichever is loaded.
  - `"power"` Estimate standard deviation as `intercept + coef * predicted.response^exponent`, where `intercept`, `coef`, and `exponent` will be estimated by `nls`. This is equivalent to `varmod.method="lm"` except that `exponent` is automatically estimated instead of being held at the value set by the `varmod.exponent` argument.
  - `"power0"` Same as `"power"` but no intercept (offset) term.
  - `"x.lm"`, `"x.rlm"`, `"x.earth"`, `"x.gam"` Like the similarly named options above, but estimate standard deviation by regressing on the predictors `x` (instead of the predicted response). A current implementation restriction is that `"x.gam"` allows only models with one predictor (`x` must have only one column).
- varmod.exponent
- Power transform applied to the rhs before regressing the absolute residuals with the specified `varmod.method`. Default is `1`.

For example, with `varmod.method="lm"`, if you expect the standard deviation to increase linearly with the mean response, use `varmod.exponent=1`. If you expect the standard deviation to increase with the square root of the mean response, use `varmod.exponent=.5` (where negative response values will be treated as
- varmod.conv
- Convergence criterion for the Iteratively Reweighted Least Squares used when creating the variance model.

Iterations stop when the mean value of the coefficients of the residual model changes by less than `varmod.conv` percent. Default is `1` percent.

Negative values force the specified number of iterations, e.g. `varmod.conv=-2` means iterate twice.

Positive values are ignored for `varmod.method="const"` and also currently ignored for `varmod.method="earth"` (these are iterated just once, the same as using `varmod.conv=-1`).
- varmod.clamp
- The estimated standard deviation of the main model errors is forced to be at least a small positive value, which we call `min.sd`. This prevents negative or absurdly small estimated standard deviations. Clamping takes place in `predict.varmod`, which is called by `predict.earth` when estimating prediction intervals. The value of `min.sd` is determined when building the variance model as `min.sd = varmod.clamp * mean(sd(training.residuals))`. The default `varmod.clamp` is `0.1`.
- varmod.minspan
- Only applies when `varmod.method="earth"` or `"x.earth"`. This is the `minspan` used in the internal call to `earth` when creating the variance model (not the main `earth` model).

Default is `-3`, i.e., three evenly spaced knots per predictor. Residuals tend to be very noisy, and allowing only this small number of knots helps prevent overfitting.

**The following arguments are for internal or advanced use.**

- Object
- Earth object to be updated, for use by `update.earth`.
- Scale.y
- Scale `y` in the forward pass for better numeric stability. Scaling here means subtract the mean and divide by the standard deviation. Default is `NCOL(y)==1`, i.e., scale `y` unless `y` has multiple columns.
- Adjust.endspan
- **New in version 4.2.0.** In interaction terms, `endspan` gets multiplied by this value. This reduces the possibility of an overfitted interaction term supported by just a few cases on the boundary of the predictor space (as sometimes seen in our simulation studies).

The default is `2`. Use `Adjust.endspan=1` for compatibility with old versions of `earth`.
- Force.weights
- Default is `FALSE`. For testing the `weights` argument. Force use of the code for handling weights in the `earth` code, even if `weights=NULL` or all the weights are the same. This will not necessarily generate an identical model, primarily because the non-weighted code requires some tests for numerical stability that can sometimes affect knot selection.
- Use.beta.cache
- Default is `TRUE`. Using the “beta cache” takes a little more memory but is faster (by 20% and often much more for large models). The beta cache uses `nk * nk * ncol(x) * sizeof(double)` bytes. (The beta cache is an innovation in this implementation of MARS and does not appear in Friedman's papers. It is not related to the `fast.beta` argument. Certain regression coefficients in the forward pass can be saved and re-used, thus saving recalculation time.)
- Force.xtx.prune
- Default is `FALSE`. This argument pertains to subset evaluation in the pruning pass. By default, if `y` has a single column then `earth` calls the `leaps` routines; if `y` has multiple columns then `earth` calls `EvalSubsetsUsingXtx`. The `leaps` routines are numerically more stable but do not support multiple responses (`leaps` is based on the QR decomposition and `EvalSubsetsUsingXtx` is based on the inverse of X'X). Setting `Force.xtx.prune=TRUE` forces use of `EvalSubsetsUsingXtx`, even if `y` has a single column.
- Get.leverages
- New in version 4.4.0. Default is `TRUE` unless the model has more than 100 thousand cases. The leverages are the diagonal hat values for the linear regression of `y` on `bx`. The leverages are needed only for certain model checks, for example when `plotres` is called with `versus=4`.

Details: This argument was introduced to reduce peak memory usage. When `n >> p`, memory use peaks when `earth` is calculating the leverages.
- Exhaustive.tol
- Default `1e-10`. Applies only when `pmethod="exhaustive"`. If the reciprocal of the condition number of `bx` is less than `Exhaustive.tol`, `earth` forces `pmethod="backward"`. See “*XHAUST returned error code -999*” in the vignette.
- ...
- Dots are passed on to `earth.fit`.
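The `allowed` and `linpreds` arguments described above can be sketched as follows. This assumes the `earth` package and its bundled `ozone1` data, and that `humidity` is the third column of the input matrix (the code checks this). The function name `no.humidity.interactions` is hypothetical; the three-argument signature `(degree, pred, parents)` follows the form described in the vignette section “*The allowed argument*”.

```r
library(earth)
data(ozone1)

# Disallow humidity in terms of degree > 1.
# pred is the column index of the candidate predictor; parents[j] is
# non-zero if predictor j is already in the candidate's parent term.
no.humidity.interactions <- function(degree, pred, parents) {
  if (degree > 1 && (pred == 3 || parents[3] != 0))
    return(FALSE)
  TRUE
}

mod.allowed <- earth(O3 ~ ., data = ozone1, degree = 2,
                     allowed = no.humidity.interactions)
stopifnot(colnames(mod.allowed$dirs)[3] == "humidity")

# Force wind and vis to enter linearly if they enter at all.
# Anchored regular expressions avoid accidental grep matches.
mod.lin <- earth(O3 ~ ., data = ozone1, linpreds = c("^wind$", "^vis$"))
print(mod.lin$dirs[, c("wind", "vis")])
```

In `mod.lin$dirs`, the `wind` and `vis` columns should contain only `0` (predictor not in the term) or `2` (predictor enters linearly).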

### Values

An object of class `"earth"` which is a list with the components listed below. *Term* refers to a term created during the forward pass (each line of the output from `format.earth` is a term). Term number 1 is always the intercept.
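To make the notion of a term concrete, here is a small base-R sketch of the hinge function `h` that appears in term names such as `h(Girth-12.9)`. The helper `h`, the sample values, and the knot location are illustrative assumptions, not code from the package.

```r
# Hinge function: h(z) = max(0, z), applied elementwise
h <- function(z) pmax(0, z)

girth <- c(8.3, 8.6, 8.8, 14.0, 17.3)  # a few illustrative Girth values
knot  <- 12.9                          # illustrative knot location

# Basis matrix with an intercept and a mirrored pair of hinge terms,
# laid out like the bx component described below
bx.sketch <- cbind("(Intercept)"   = 1,
                   "h(Girth-12.9)" = h(girth - knot),
                   "h(12.9-Girth)" = h(knot - girth))
print(bx.sketch)
```

Each column is one term evaluated at every observation; a fitted model is a linear combination of such columns.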

`rss`

- Residual sum-of-squares (RSS) of the model (summed over all responses, if `y` has multiple columns).

`rsq`

- `1-rss/tss`. R-Squared of the model (calculated over all responses, and calculated using the `weights` argument if it was supplied). A measure of how well the model fits the training data. Note that `tss` is the total sum-of-squares, `sum((y - mean(y))^2)`.

`gcv`

- Generalized Cross Validation (GCV) of the model (summed over all responses). The GCV is calculated using the `penalty` argument. For details of the GCV calculation, see equation 30 in Friedman's MARS paper and `earth:::get.gcv`.

`grsq`

- `1-gcv/gcv.null`. An estimate of the predictive power of the model (calculated over all responses, and calculated using the `weights` argument if it was supplied). `gcv.null` is the GCV of an intercept-only model. See “*Can `GRSq` be negative?*” in the vignette.

`bx`

- Matrix of basis functions applied to `x`. Each column corresponds to a selected term. Each row corresponds to a row in the input matrix `x`, after taking `subset`. See `model.matrix.earth` for an example of `bx` handling. Example `bx`:

           (Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
      [1,]           1           0.0           4.6                   0
      [2,]           1           0.0           4.3                   0
      [3,]           1           0.0           4.1                   0
      ...

`dirs`

- Matrix with one row per MARS term, and with ij-th element equal to

  - `-1` if an expression of the form `h(const - xj)` is in term i
  - `1` if an expression of the form `h(xj - const)` is in term i
  - `2` if predictor j should enter term i linearly (either because specified by the `linpreds` argument or because earth discovered that a knot was unnecessary).

This matrix includes all terms generated by the forward pass, including those not in `selected.terms`. Note that here the terms may not all be in pairs, because although the forward pass adds terms as hinged pairs (so both sides of the hinge are available as building blocks for further terms), it also deletes linearly dependent terms before handing control to the pruning pass. Example `dirs`:

                                 Girth Height
      (Intercept)                    0      0  #intercept
      h(12.9-Girth)                 -1      0  #2nd term uses Girth
      h(Girth-12.9)                  1      0  #3rd term uses Girth
      h(Girth-12.9)*h(Height-76)     1      1  #4th term uses Girth and Height
      ...

`cuts`

- Matrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward pass, including those not in `selected.terms`. Note for programmers: the precedent is to use `dirs` for term names etc. and to only use `cuts` where cut information is needed. Example `cuts`:

                                 Girth Height
      (Intercept)                    0      0  #intercept, no cuts
      h(12.9-Girth)               12.9      0  #2nd term has cut at 12.9
      h(Girth-12.9)               12.9      0  #3rd term has cut at 12.9
      h(Girth-12.9)*h(Height-76)  12.9     76  #4th term has two cuts
      ...

`prune.terms`

- A matrix specifying which terms appear in which pruning pass subsets. The row index of `prune.terms` is the model size. (The model size is the number of terms in the model. The intercept is counted as a term.) Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thus `prune.terms` is a lower triangular matrix, with dimensions `nprune x nprune`. The model selected by the pruning pass is at row number `length(selected.terms)`. Example `prune.terms`:

      [1,]    1    0    0    0    0    0    0  #intercept-only model
      [2,]    1    2    0    0    0    0    0  #best 2 term model uses terms 1,2
      [3,]    1    2    4    0    0    0    0  #best 3 term model uses terms 1,2,4
      [4,]    1    2    6    9    0    0    0  #and so on
      ...

`selected.terms`

- Vector of term numbers in the selected model. Can be used as a row index vector into `cuts` and `dirs`. The first element `selected.terms[1]` is always 1, the intercept.

`fitted.values`

- Fitted values. A matrix with dimensions `nrow(y)` x `ncol(y)` after factors in `y` have been expanded.

`residuals`

- Residuals. A matrix with dimensions `nrow(y)` x `ncol(y)` after factors in `y` have been expanded.

`coefficients`

- Regression coefficients. A matrix with dimensions `length(selected.terms)` x `ncol(y)` after factors in `y` have been expanded. Each column holds the least squares coefficients from regressing that column of `y` on `bx`. The first row holds the intercept coefficient(s).

`rss.per.response`

- A vector of the RSS for each response. Length is the number of responses, i.e., `ncol(y)` after factors in `y` have been expanded. The `rss` component above is equal to `sum(rss.per.response)`.

`rsq.per.response`

- A vector of the R-Squared for each response (where R-Squared is calculated using the `weights` argument if it was supplied). Length is the number of responses.

`gcv.per.response`

- A vector of the GCV for each response. Length is the number of responses. The `gcv` component above is equal to `sum(gcv.per.response)`.

`grsq.per.response`

- A vector of the GRSq for each response (calculated using the `weights` argument if it was supplied). Length is the number of responses.

`rss.per.subset`

- A vector of the RSS for each model subset generated by the pruning pass. Length is `nprune`. For multiple responses, the RSS is summed over all responses for each subset. The `rss` above is `rss.per.subset[length(selected.terms)]`. The RSS of an intercept-only model is `rss.per.subset[1]`.

`gcv.per.subset`

- A vector of the GCV for each model in `prune.terms`. Length is `nprune`. For multiple responses, the GCV is summed over all responses for each subset. The `gcv` above is `gcv.per.subset[length(selected.terms)]`. The GCV of an intercept-only model is `gcv.per.subset[1]`.

`leverages`

- Diagonal of the hat matrix (from the linear regression of the response on `bx`).

`penalty`, `nk`, `thresh`

- Copies of the corresponding arguments to `earth`.

`pmethod`, `nprune`

- Copies of the corresponding arguments to `earth`.

`weights`, `wp`

- Copies of the corresponding arguments to `earth`.

`termcond`

- Reason the forward pass terminated (an integer).

`call`

- The call used to invoke `earth`.

`terms`

- Model frame terms. This component exists only if the model was built using `earth.formula`.

`namesx`

- Column names of `x`, generated internally by `earth` when necessary so each column of `x` has a name. Used, for example, by `predict.earth` to name columns if necessary.

`namesx.org`

- Original column names of `x`.

`levels`

- Levels of `y` if `y` is a `factor`; `c(FALSE,TRUE)` if `y` is `logical`; else NULL.

**The following fields appear only if `earth`'s argument `keepxy` is `TRUE`.**

`x`, `y`, `data`, `subset`

- Copies of the corresponding arguments to `earth`. Only exist if `keepxy=TRUE`.

**The following fields appear only if `earth`'s `glm` argument is used.**

`glm.list`

- List of GLM models. Each element is the value returned by `earth`'s internal call to `glm` for each response.

Thus if there is a single response (or a single binomial pair, see “*Binomial pairs*” in the vignette) this will be a one element list and you access the GLM model with `earth.mod$glm.list[[1]]`.

`glm.coefficients`

- GLM regression coefficients. Analogous to the `coefficients` field described above but for the GLM model(s). A matrix with dimensions `length(selected.terms)` x `ncol(y)` after factors in `y` have been expanded. Each column holds the coefficients from the GLM regression of that column of `y` on `bx`. This duplicates, for convenience, information buried in `glm.list`.

`glm.bpairs`

- NULL unless there are paired binomial columns. A logical vector, derived internally by `earth`, or a copy of the `bpairs` specified by the user in the `glm` list. See “*Binomial pairs*” in the vignette.

**The following fields appear only if the `nfold` argument is greater than 1.**

`cv.list`

- List of `earth` models, one model for each fold (`ncross * nfold` models).

The fold models have two extra fields, `icross` (an integer from `1` to `ncross`) and `ifold` (an integer from `1` to `nfold`).

To save memory, lengthy fields in the fold models are removed unless you use `keepxy=TRUE`. The “lengthy fields” are `$bx`, `$fitted.values`, and `$residuals`.

`cv.nterms`

- Vector of length `ncross * nfold + 1`. Number of MARS terms in the model generated at each cross-validation fold, with the final element being the mean of these.

`cv.nvars`

- Vector of length `ncross * nfold + 1`. Number of predictors in the model generated at each cross-validation fold, with the final element being the mean of these.

`cv.groups`

- Specifies which cases went into which folds. Matrix with two columns and number of rows equal to the number of cases `nrow(x)`. Elements of the first column specify the cross-validation number, `1:ncross`. Elements of the second column specify the fold number, `1:nfold`.

`cv.rsq.tab`

- Matrix with `ncross * nfold + 1` rows and `nresponse+1` columns, where `nresponse` is the number of responses, i.e., `ncol(y)` after factors in `y` have been expanded. The first `nresponse` elements of a row are the `cv.rsq`'s on the out-of-fold data for each response of the model generated at that row's fold. (A `cv.rsq` is calculated from predictions on the out-of-fold data using the best model built from the in-fold data, where “best” means the model was selected using the in-fold GCV. The R-Squareds are calculated using the `weights` argument if it was supplied.) The final column holds the row mean (a weighted mean if `wp` is specified). The final row holds the column means. The values in this final row are the mean `cv.rsq` printed by `summary.earth`.

Example for a single response model (where the `mean` column is redundant but included for uniformity with multiple response models):

                y  mean
      fold1 0.909 0.909
      fold2 0.869 0.869
      fold3 0.952 0.952
      fold4 0.157 0.157
      fold5 0.961 0.961
      mean  0.769 0.769

Example for a multiple response model:

               y1    y2    y3  mean
      fold1 0.915 0.951 0.944 0.937
      fold2 0.962 0.970 0.970 0.968
      fold3 0.914 0.940 0.942 0.932
      fold4 0.907 0.929 0.925 0.920
      fold5 0.947 0.987 0.979 0.971
      mean  0.929 0.955 0.952 0.946

`cv.class.rate.tab`

- Like `cv.rsq.tab` but is the classification rate at each fold, i.e., the fraction of classes correctly predicted. Models with discrete response only. Calculated with `thresh=.5` for binary responses. For responses with more than two levels, the final row is the overall classification rate. The other rows are the classification rates for each level (the level versus not-the-level), which are usually higher than the overall classification rate (predicting the level versus not-the-level is easier than correctly predicting one of many levels). The `weights` argument is ignored for all cross-validation stats except R-Squareds.

`cv.maxerr.tab`

- Like `cv.rsq.tab` but is the `MaxErr` at each fold. This is the signed max absolute value at each fold. Results are aggregated for the final column and final row using the signed max absolute value. The *signed max absolute value* is defined as the maximum of the absolute difference between the predicted and observed response values, multiplied by `-1` if the sign of that difference is negative.

`cv.auc.tab`

- Like `cv.rsq.tab` but is the `AUC` at each fold. Binomial models only.

`cv.cor.tab`

- Like `cv.rsq.tab` but is the `cor` at each fold. Poisson models only.

`cv.deviance.tab`

- Like `cv.rsq.tab` but is the `MeanDev` at each fold. Binomial models only.

`cv.calib.int.tab`

- Like `cv.rsq.tab` but is the `CalibInt` at each fold. Binomial models only.

`cv.calib.slope.tab`

- Like `cv.rsq.tab` but is the `CalibSlope` at each fold. Binomial models only.

`cv.oof.rsq.tab`

- Generated only if `keepxy=TRUE` or `pmethod="cv"`.

A matrix with `ncross * nfold + 1` rows and `max.nterms` columns. Each element holds an out-of-fold RSq (`oof.rsq`), calculated from predictions from the out-of-fold observations using the model built with the in-fold data. The final row is the mean over all folds. The R-Squareds are calculated using the `weights` argument if it was supplied.

`cv.infold.rsq.tab`

- Generated only if `keepxy=TRUE`. Like `cv.oof.rsq.tab` but from predictions made on the in-fold observations.

`cv.oof.fit.tab`

- Generated only if the `varmod.method` argument is used. Predicted values on the out-of-fold data. Dataframe with `nrow(data)` rows and `ncross` columns.

**The following field appears only if the `varmod.method` argument is specified.**

`varmod`

- An object of class `"varmod"`. See the `varmod` help page for a description. Only appears if the `varmod.method` argument is used.
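The relationships between several of the components above can be checked directly. The following is a sketch on R's built-in `trees` data, assuming the `earth` package is installed; the variable names are illustrative.

```r
library(earth)
mod <- earth(Volume ~ ., data = trees)

# Fitted values are the basis matrix times the coefficients
fit.manual <- mod$bx %*% mod$coefficients

# rsq is 1 - rss/tss, with tss the total sum-of-squares
tss <- sum((trees$Volume - mean(trees$Volume))^2)
rsq.manual <- 1 - mod$rss / tss

# selected.terms is a row index into cuts and dirs
print(mod$cuts[mod$selected.terms, , drop = FALSE])
print(mod$dirs[mod$selected.terms, , drop = FALSE])

# rss of the selected model, looked up by model size
n.selected <- length(mod$selected.terms)
print(mod$rss.per.subset[n.selected])
```

No case weights are used here, so the unweighted `tss` formula applies.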

### References

The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. The Wikipedia article is recommended for an elementary introduction. Faraway takes a hands-on approach, using the `ozone` data to compare `mda::mars` with other techniques. (If you use Faraway's examples with `earth` instead of `mars`, use `$bx` instead of `$x`, and check out the book's errata.) Friedman and Silverman is recommended background reading for the MARS paper. Earth's pruning pass uses code from the `leaps` package which is based on techniques in Miller.

Faraway (2005) *Extending the Linear Model with R* http://www.maths.bath.ac.uk/~jjf23

Friedman (1991) *Multivariate Adaptive Regression Splines (with discussion)* Annals of Statistics 19/1, 1--141 https://statistics.stanford.edu/research/multivariate-adaptive-regression-splines

Friedman (1993) *Fast MARS* Stanford University Department of Statistics, Technical Report 110 http://www.milbo.users.sonic.net/earth/Friedman-FastMars.pdf, https://statistics.stanford.edu/research/fast-mars

Friedman and Silverman (1989) *Flexible Parsimonious Smoothing and Additive Modeling* Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z

Hastie, Tibshirani, and Friedman (2009) *The Elements of Statistical Learning (2nd ed.)* http://www-stat.stanford.edu/~hastie/pub.htm

Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) *Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish* Freshwater Biology, 50, 2034-2052 http://www-stat.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html

Miller, Alan (1990, 2nd ed. 2002) *Subset Selection in Regression* http://wp.csiro.au/alanmiller/index.html

Wikipedia article on MARS http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

### See Also

Start with `summary.earth`, `plot.earth`, `evimp`, and `plotmo`.
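A brief sketch of two of these functions, assuming the `earth` package and its bundled `ozone1` data (plotting functions are omitted here because they need a graphics device):

```r
library(earth)
data(ozone1)

mod <- earth(O3 ~ ., data = ozone1, degree = 2)

print(summary(mod))  # selected terms, coefficients, GCV, RSS, GRSq, RSq

ev <- evimp(mod)     # estimated variable importance
print(ev)
```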

Please see the main package vignette “Notes on the earth package” (../doc/earth-notes.pdf). The vignette can also be downloaded from http://www.milbo.org/doc/earth-notes.pdf.

The vignette “Variance models in earth” (../doc/earth-varmod.pdf) is also included with the package. It describes how to build variance models and generate prediction intervals for `earth` models.
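As a hedged sketch of that workflow, the following builds a variance model with `varmod.method="lm"` and requests prediction intervals. It assumes the `earth` package and its `ozone1` data, and that `predict.earth` accepts `interval="pint"` and returns columns `fit`, `lwr`, and `upr` as described in the variance-model vignette; the seed and the new `temp` values are arbitrary.

```r
library(earth)
data(ozone1)

set.seed(1)  # cross-validation folds are chosen randomly

# varmod.method requires nfold and ncross; ncross = 30 follows the
# suggestion in the varmod.method argument description
mod.var <- earth(O3 ~ temp, data = ozone1,
                 nfold = 10, ncross = 30, varmod.method = "lm")

# Prediction intervals for three new temperatures
new.data <- data.frame(temp = c(50, 70, 90))
pints <- predict(mod.var, newdata = new.data, interval = "pint")
print(pints)
```

The interval width may vary with the predicted response, because `"lm"` models the standard deviation as a function of the predicted value.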

### Examples
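The following sketch illustrates cross-validation with `nfold` and `ncross`, and term selection with `pmethod="cv"`. It assumes the `earth` package (version 4.4.0 or later for `pmethod="cv"`) and its `ozone1` data; the seed and fold counts are arbitrary choices.

```r
library(earth)
data(ozone1)

set.seed(2)  # cross-validation folds are chosen randomly

# 5-fold cross-validation, repeated twice
mod.cv <- earth(O3 ~ ., data = ozone1, degree = 2, nfold = 5, ncross = 2)

# One row per fold model plus a final row of means
print(mod.cv$cv.rsq.tab)
print(summary(mod.cv))

# Select the number of terms by out-of-fold RSq instead of GCV
mod.pcv <- earth(O3 ~ ., data = ozone1, degree = 2, nfold = 5,
                 pmethod = "cv")
print(length(mod.pcv$selected.terms))
```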

Documentation reproduced from package earth, version 4.4.4. License: GPL-3