# Regularizers

The goal of training a neural network model is to minimize the loss function by making adjustments to the model parameters. In most practical applications, the loss is not known a priori, but an estimate of it is computed using a set of data (the “training data”) that has been gathered from the problem being modeled.

Many machine learning models that have a large number of parameters relative to
the size of the training dataset exhibit a phenomenon called *overfitting*: the
model may learn to predict the training data with no measurable error, but it
makes many mistakes when it is applied to a new dataset. In such a case,
the model has essentially memorized the training data at the cost of not being
able to *generalize* to new and unseen, yet similar, datasets. The risk of
overfitting usually increases with the size of the model and decreases with the
size of the training dataset.

A heuristic that can help prevent models from overfitting on small datasets is based on the observation that “good” parameter values in most models are typically small: large parameter values often indicate overfitting.

One way to encourage a model to use small parameter values is to assume that the parameter values are sampled from some prior distribution, rather than assuming that all parameter values in the model are equally likely. In this way of thinking about parameters, we can manipulate the prior distribution of the parameter values to express our knowledge as modelers of the problem at hand.

In `theanets`, regularization hyperparameters are provided when you train your
model:

```
net = theanets.Classifier(layers=[784, 1000, 784])
net.train(..., hidden_l1=0.1)
```

Here we’ve specified that our model has a single, overcomplete hidden layer, and
then when we train it, we specify that the activity of the hidden units in the
network will be penalized under an \(L_1\) norm with a coefficient of 0.1. The
rest of this section details the built-in regularizers that are available in
`theanets`.

## Decay

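Using “weight decay,” we assume that parameters in a model are drawn from a zero-mean, isotropic Gaussian distribution with a modeler-specified standard deviation. In terms of loss functions, this equates to adding a term to the loss function that computes the squared \(L_2\) norm of the parameter values in the model:

\[ \mathcal{L}(\cdot) = \dots + \lambda \| \theta \|_2^2 \]

where \(\theta\) denotes the model parameters and \(\lambda\) is the strength of the regularizer. If the loss \(\mathcal{L}(\cdot)\) represents some approximation to the negative log-posterior distribution of the model parameters given the data

\[ \mathcal{L}(\cdot) \approx -\log p(\theta \mid x) = \dots + \lambda \| \theta \|_2^2 \]

then the term with the \(L_2\) norm on the parameters is like the negative log of an unscaled Gaussian distribution whose variance is inversely proportional to \(\lambda\).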

This type of regularization is specified using the `weight_l2` keyword
argument during training:

```
net.train(..., weight_l2=1e-4)
```

The value of the argument is the strength of the regularizer in the loss for the model. Larger values create more pressure for small model weights.
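The name “weight decay” reflects what this penalty does to a gradient step: each weight shrinks toward zero in proportion to its own size. The following is a minimal pure-Python sketch of that effect, not the `theanets` implementation. The function names, the learning rate, and the exact scaling of the penalty (strength times the sum of squared weights) are assumptions made for illustration.

```python
def l2_penalty(weights, strength):
    """Return strength * sum of squared weights (assumed scaling)."""
    return strength * sum(w * w for w in weights)


def decay_step(weights, strength, learning_rate):
    """Take one gradient step on the L2 penalty alone.

    The gradient of strength * w**2 with respect to w is 2 * strength * w,
    so every weight is multiplied by (1 - 2 * learning_rate * strength).
    """
    return [w - learning_rate * 2 * strength * w for w in weights]


weights = [1.0, -2.0, 0.5]
penalty = l2_penalty(weights, strength=0.5)
shrunk = decay_step(weights, strength=0.5, learning_rate=0.1)
```

Every entry of `shrunk` keeps its sign but has a smaller magnitude than the corresponding entry of `weights`, and larger weights lose more in absolute terms.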

## Sparsity

Sparse models have been shown to capture regularities seen in the mammalian visual cortex. In addition, sparse models in machine learning are often more performant than “dense” models (i.e., models without restriction on the hidden representation). Furthermore, sparse models tend to yield latent representations that are easier for humans to interpret than dense models.

There are two main types of sparsity regularizers provided with `theanets`:
parameter sparsity and representation sparsity.

The first type of sparse regularizer is just like weight decay, but instead of assuming that weights are drawn from a Gaussian distribution, here we assume that weights in the model are drawn from a distribution with a taller peak at zero and heavier tails, like a Laplace distribution. In terms of loss functions, this regularizer adds a term with an \(L_1\) norm, scaled by a strength \(\lambda\), to the loss:

\[ \mathcal{L}(\cdot) = \dots + \lambda \| \theta \|_1 \]

If the loss \(\mathcal{L}(\cdot)\) represents some approximation to the negative log-posterior distribution of the model parameters given the data

\[ \mathcal{L}(\cdot) \approx -\log p(\theta \mid x) = \dots + \lambda \| \theta \|_1 \]

then this term is like the negative log of an unscaled Laplace distribution. In practice, this
regularizer encourages many of the model *parameters* to be zero.

In `theanets`, this sparse parameter regularization is specified using the
`weight_l1` keyword argument during training:

```
net.train(..., weight_l1=1e-4)
```

The value of the argument is the strength of the regularizer in the loss for the model. The larger the regularization parameter, the more pressure for zero-valued weights.
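One way to see why an \(L_1\) penalty favors exact zeros is that its gradient has the same magnitude for every nonzero weight, so small weights are pushed all the way to zero rather than merely shrunk. The sketch below illustrates this with a soft-thresholding step, a standard proximal update for an \(L_1\) penalty. It is an illustration only and is not taken from `theanets`; the function names and the threshold value are assumptions.

```python
import math


def l1_penalty(weights, strength):
    """Return strength * sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def soft_threshold(weights, threshold):
    """Shrink each weight toward zero by `threshold`, clipping at zero."""
    return [math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights]


weights = [0.5, -0.05, 0.02, -0.8]
penalty = l1_penalty(weights, strength=0.1)
sparse = soft_threshold(weights, threshold=0.1)
```

Weights whose magnitude is below the threshold become zero, while the larger weights are reduced by a fixed amount. An \(L_2\) penalty, by contrast, scales every weight by the same factor and so rarely produces exact zeros.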

The second type of sparsity regularization puts pressure on the model to develop
hidden *representations* that are mostly zero-valued. In this type of
regularization, the model weights are penalized indirectly, since the hidden
representation (i.e., the values of the hidden layer neurons in the network) is
a function of both the model weights and the input data. In terms of loss
functions, this regularizer adds a term to the loss that penalizes the
\(L_1\) norm of the hidden layer activations

\[ \mathcal{L}(\cdot) = \dots + \lambda \sum_i \| f_i(x) \|_1 \]

where \(f_i(x)\) represents the neuron activations of hidden layer \(i\) and \(\lambda\) is the strength of the regularizer.

Sparse hidden activations have shown much promise in computational neural
networks. In `theanets` this type of regularization is specified using the
`hidden_l1` keyword argument during training:

```
net.train(..., hidden_l1=0.1)
```

The value of the argument is the strength of the regularizer in the loss for the model. Large values create more pressure for hidden representations that use mostly zeros.
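As a rough illustration of what this penalty measures, the sketch below computes the activations of one small rectified hidden layer and sums their absolute values. It is written in plain Python rather than with `theanets`; the layer function, the weight values, and the choice of a rectified activation are all assumptions made for the example.

```python
def relu_layer(inputs, unit_weights, biases):
    """Compute rectified activations; unit_weights[j] holds unit j's incoming weights."""
    return [
        max(0.0, sum(x * w for x, w in zip(inputs, weights)) + b)
        for weights, b in zip(unit_weights, biases)
    ]


def hidden_l1_penalty(hidden_activations, strength):
    """Return strength * sum of absolute activations over all hidden layers."""
    return strength * sum(abs(h) for layer in hidden_activations for h in layer)


x = [1.0, 2.0]
hidden = relu_layer(x, unit_weights=[[0.5, -1.0], [1.0, 0.25]], biases=[0.0, 0.0])
penalty = hidden_l1_penalty([hidden], strength=0.1)
```

Because the penalty depends on the activations, it changes with both the weights and the input `x`, which is why the weights are only penalized indirectly.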

## Noise

Another way of regularizing a model to prevent overfitting is to inject noise
into the data or the representations during training. While noise could always
be injected into the training batches manually, `theanets` provides two types
of noise regularizers: additive Gaussian noise and multiplicative dropout
(binary) noise.

In one method, zero-mean Gaussian noise is added to the input data or hidden
representations. These are specified during training using the `input_noise`
and `hidden_noise` keyword arguments, respectively:

```
net.train(..., input_noise=0.1)
net.train(..., hidden_noise=0.1)
```

The value of the argument specifies the standard deviation of the noise.
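A minimal sketch of additive Gaussian noise, using only the Python standard library, is shown below. It is meant to illustrate the role of the standard deviation and is not the `theanets` implementation; the function name and the fixed seed are assumptions.

```python
import random


def add_gaussian_noise(values, std, rng):
    """Return a copy of `values` with independent N(0, std**2) noise added."""
    return [v + rng.gauss(0.0, std) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [0.0] * 10000
noisy = add_gaussian_noise(clean, std=0.1, rng=rng)

# The empirical standard deviation of the noise should be close to 0.1.
mean = sum(noisy) / len(noisy)
empirical_std = (sum((n - mean) ** 2 for n in noisy) / len(noisy)) ** 0.5
```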

In the other noise regularization method, some of the inputs or hidden
activations are randomly set to zero during training (this is sometimes called
“dropout” or “multiplicative masking noise”). This type of noise is specified
using the `input_dropout` and `hidden_dropout` keyword arguments, respectively:

```
net.train(..., input_dropout=0.3)
net.train(..., hidden_dropout=0.3)
```

The value of the argument specifies the fraction of values in each input or hidden activation that are randomly set to zero.
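The masking itself can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the `theanets` implementation: the function name is hypothetical, and the sketch does not rescale the surviving values, which some dropout implementations do.

```python
import random


def dropout(values, rate, rng):
    """Zero each value independently with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.3, rng=rng)

# Roughly 30% of the values should now be zero.
fraction_zero = dropped.count(0.0) / len(dropped)
```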

Instead of adding additional terms like the other regularizers, the noise regularizers can be seen as modifying the original loss for a model. For instance, consider an autoencoder model with two hidden layers:

```
net = theanets.Autoencoder([
    100,
    dict(size=50, name='a'),
    dict(size=80, name='b'),
    dict(size=100, name='o')])
```

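The loss for this model, without regularization, can be written as:

\[ \mathcal{L}(X, \theta_a, \theta_b, \theta_o) = \frac{1}{N} \sum_{i=1}^N \left\| \sigma_b(\sigma_a(x_i \theta_a) \theta_b) \theta_o - x_i \right\|_2^2 \]

where we’ve ignored the bias terms, \(x_i\) is the \(i\)-th of the \(N\) training examples in \(X\), and \(\theta_a\), \(\theta_b\), and \(\theta_o\) are the parameters for layers a, b, and o, respectively. Also, \(\sigma_a\) and \(\sigma_b\) are the activation functions for their respective hidden layers.

If we train this model using input and hidden noise:

```
net.train(..., input_noise=q, hidden_noise=r)
```

then the loss becomes:

\[ \mathcal{L}(X, \theta_a, \theta_b, \theta_o) = \frac{1}{N} \sum_{i=1}^N \left\| \left( \sigma_b\left( \left( \sigma_a((x_i + \epsilon_q) \theta_a) + \epsilon_r \right) \theta_b \right) + \epsilon_r \right) \theta_o - x_i \right\|_2^2 \]

where \(\epsilon_q\) is white Gaussian noise drawn from \(\mathcal{N}(0, q^2 I)\) and \(\epsilon_r\) is white Gaussian noise drawn separately for each hidden layer from \(\mathcal{N}(0, r^2 I)\). The additive noise pushes the data and the representations off of their respective manifolds, but the loss is computed with respect to the uncorrupted input. This is thought to encourage the model to develop representations that push towards the true manifold of the data.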
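The key point is that the model sees corrupted values but is scored against the clean input. The sketch below shows this for input noise only, in plain Python. The function names are hypothetical, and `identity_model` is a stand-in for a trained network rather than anything provided by `theanets`.

```python
import random


def noisy_reconstruction_loss(x, model, noise_std, rng):
    """Mean squared error between model(corrupted x) and the *clean* x."""
    corrupted = [v + rng.gauss(0.0, noise_std) for v in x]
    reconstruction = model(corrupted)
    return sum((r - v) ** 2 for r, v in zip(reconstruction, x)) / len(x)


def identity_model(values):
    """Hypothetical stand-in for a network: returns its input unchanged."""
    return list(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [0.2, -0.4, 1.0, 0.7]
loss_without_noise = noisy_reconstruction_loss(x, identity_model, 0.0, rng)
loss_with_noise = noisy_reconstruction_loss(x, identity_model, 0.5, rng)
```

A model that simply copies its input has zero loss without noise but a positive loss once noise is added, so under this objective copying is no longer an optimal strategy.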

## Predefined Regularizers

This module contains implementations of common regularizers.

In `theanets` regularizers are thought of as additional terms that get
combined with the `Loss` for a model at
optimization time. Regularizer terms in the loss are usually used to “encourage”
a model to behave in a particular way. For example, the pattern and arrangement
of learned features can be changed by including a sparsity (L1-norm) regularizer
on the hidden unit activations, or units can randomly be dropped out (set to
zero) while running the model.

| Class | Description |
| --- | --- |
| `Regularizer([pattern, weight])` | A regularizer for a neural network model. |
| `HiddenL1([pattern, weight])` | Penalize the activation of hidden layers under an L1 norm. |
| `WeightL1([pattern, weight])` | Decay the weights in a model using an L1 norm penalty. |
| `WeightL2([pattern, weight])` | Decay the weights in a model using an L2 norm penalty. |
| `Contractive([pattern, weight, wrt])` | Penalize the derivative of hidden layers with respect to their inputs. |
| `BernoulliDropout([pattern, weight, rng])` | Randomly set activations of a layer output to zero. |
| `GaussianNoise([pattern, weight, rng])` | Add isotropic Gaussian noise to one or more graph outputs. |
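The idea of regularizers as weighted terms added to a loss can be sketched without any library code. The example below is plain Python with hypothetical names and made-up values; it does not use the `theanets` classes listed above and only illustrates how penalty terms and their weights might combine with a base loss.

```python
def total_loss(base_loss, weighted_terms):
    """Add each regularizer term, scaled by its weight, to the base loss."""
    return base_loss + sum(weight * term for weight, term in weighted_terms)


weights = [0.5, -1.5]
hidden = [0.0, 2.0, 0.0]

l2_term = sum(w * w for w in weights)         # squared L2 norm of the weights
l1_hidden_term = sum(abs(h) for h in hidden)  # L1 norm of the hidden activations

loss = total_loss(1.0, [(1e-2, l2_term), (0.1, l1_hidden_term)])
```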

## Custom Regularizers

To create a custom regularizer in `theanets`, you need to create a custom
subclass of the `theanets.Regularizer` class, and then provide this regularizer
when you run your model.

To illustrate, let’s suppose you created a linear autoencoder model whose hidden layer is larger than the dimensionality of your data:

```
net = theanets.Autoencoder([4, (8, 'linear'), (4, 'tied')])
```

Then, at least in theory, you risk learning an uninteresting “identity” model in which some hidden units are never used, and the ones that are used have weights that form an identity matrix. To prevent this from happening, you can impose a sparsity penalty when you train your model:

```
net.train(..., hidden_l1=0.001)
```

But then you might run into a situation where the sparsity penalty drives some of the hidden units in the model to zero, to “save” loss during training. Zero-valued features are probably not so interesting, so we can introduce another penalty to prevent feature weights from going to zero:

```
import theanets

class WeightInverse(theanets.Regularizer):
    def loss(self, layers, outputs):
        # Sum 1 / (squared column length) over every 2-D parameter (weight matrix).
        return sum((1 / (p * p).sum(axis=0)).sum()
                   for l in layers for p in l.params
                   if p.ndim == 2)

net = theanets.Autoencoder([4, (8, 'linear'), (4, 'tied')])
net.train(..., hidden_l1=0.001, weightinverse=0.001)
```

This code adds a new regularizer that penalizes the inverse of the squared length of each weight vector (each column of a weight matrix) in the model’s layers. Here we detect weight matrices by only including parameters with 2 dimensions.
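To get a feel for how this penalty behaves, the same quantity can be computed by hand for a small matrix. The sketch below is plain Python operating on nested lists rather than Theano variables; the function name and the example matrices are assumptions made for illustration.

```python
def weight_inverse_penalty(matrix):
    """Sum, over columns, of 1 / (squared length of the column)."""
    n_cols = len(matrix[0])
    squared_lengths = [sum(row[j] ** 2 for row in matrix) for j in range(n_cols)]
    return sum(1.0 / s for s in squared_lengths)


healthy = [[1.0, 0.0],
           [1.0, 2.0]]
shrinking = [[0.1, 0.0],
             [0.1, 2.0]]

penalty_healthy = weight_inverse_penalty(healthy)      # 1/2 + 1/4
penalty_shrinking = weight_inverse_penalty(shrinking)  # about 1/0.02 + 1/4
```

The penalty grows sharply as any one column approaches zero length, which is what counteracts the pull of the sparsity penalty. Note that a column that is exactly zero would cause a division by zero in this sketch.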