<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deep Learning | Tristan Ballard</title><link>https://tristan-ballard.netlify.app/tag/deep-learning/</link><atom:link href="https://tristan-ballard.netlify.app/tag/deep-learning/index.xml" rel="self" type="application/rss+xml"/><description>Deep Learning</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 21 Jun 2021 13:00:00 +0000</lastBuildDate><image><url>https://tristan-ballard.netlify.app/media/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url><title>Deep Learning</title><link>https://tristan-ballard.netlify.app/tag/deep-learning/</link></image><item><title>Hyperparameter tuning with R's tfruns package</title><link>https://tristan-ballard.netlify.app/post/tfruns/</link><pubDate>Mon, 21 Jun 2021 13:00:00 +0000</pubDate><guid>https://tristan-ballard.netlify.app/post/tfruns/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Hyperparameter tuning, the process of choosing parameter values within a model, is an essential step for training AI/ML models. For neural networks, the choice of hyperparameter values such as dropout rates for each layer, type of optimizer, and number of nodes can make a huge impact on performance. Unfortunately, there are &amp;lsquo;rules of thumb&amp;rsquo; for these at best, so it is often better to let the data inform what values to choose.&lt;/p>
&lt;p>In the case of two hyperparameters, one could systematically search over a grid of candidate values (top left), selecting the pair of values that leads to the best performing model. Or, to remove a layer of subjectivity, a random search (top right) may be better. Regardless, even with two hyperparameters it is easy to see how the number of potential combinations grows very large, a challenge for neural networks that are expensive to retrain. In reality, even simple networks can have a few dozen hyperparameters, and some of the more complex networks have tens of thousands to millions of hyperparameters. The curse of dimensionality makes searching across this variable space for strong candidates very difficult.&lt;/p>
&lt;p>To tackle the challenge of optimizing hyperparameter search, many algorithms more advanced than grid or random search have been proposed and implemented, such as &lt;a href="https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f" target="_blank" rel="noopener">Bayesian optimization&lt;/a>, which uses past evaluations of hyperparameter values to update the prior and inform prospective hyperparameter choices. Google even has an AutoML suite service for hyperparameter tuning.&lt;/p>
&lt;p>Here I will explore how to manage all of the runs generated from a hyperparameter search using R&amp;rsquo;s &lt;em>tfruns&lt;/em> package. It can be used for manual search, grid search, and random search quite flexibly.&lt;/p>
&lt;h2 id="tensorflows-_tfruns_-package">TensorFlow&amp;rsquo;s &lt;em>tfruns&lt;/em> package&lt;/h2>
&lt;p>I came across the &lt;a href="https://tensorflow.rstudio.com/tools/tfruns/overview/" target="_blank" rel="noopener">&lt;em>tfruns&lt;/em> package&lt;/a> recently while training the RiverHeatNet river temperature model and found it incredibly handy. The package, created by the TensorFlow community, is designed to help:&lt;/p>
&lt;ul>
&lt;li>Systematically store hyperparameter values, performance metrics, and source code of each run&lt;/li>
&lt;li>Identify the best performing model across a set of training runs&lt;/li>
&lt;li>Visualize and summarize training run performance&lt;/li>
&lt;/ul>
&lt;p>If you have a network coded in Keras already, there are only a few tweaks necessary to make it compatible.&lt;/p>
&lt;h2 id="step-0-select-hyperparameters">Step 0: Select hyperparameters&lt;/h2>
&lt;p>In a perfect world you could search across all of the hyperparameters of your model at a fine level of granularity. However, most of us do not have free access to supercomputer resources (a perk of grad school I will miss).&lt;/p>
&lt;p>Even if you &lt;em>could&lt;/em> train models with unlimited resources, that doesn&amp;rsquo;t mean you &lt;em>should&lt;/em>. The &lt;a href="https://news.mit.edu/2020/shrinking-deep-learning-carbon-footprint-0807" target="_blank" rel="noopener">carbon emissions of fitting neural networks&lt;/a> are non-negligible, increasing the need for efficient hyperparameter search algorithms.&lt;/p>
&lt;p>Instead, you will have to isolate the most influential hyperparameters and reasonable ranges of values to explore. The learning rate, for example, is well known to be a critical hyperparameter, but the difference between having 100 layers and 101 layers is probably quite small. There are a few rules of thumb to consider for hyperparameter values:&lt;/p>
&lt;ul>
&lt;li>For learning rates, do a search over 10&lt;sup>-5&lt;/sup> to 10 in log-space (see the sketch after this list).&lt;/li>
&lt;li>Tapering the number of neurons in successive layers (a pyramid shape) used to be commonplace, but it is less common now; it is easier to simply use the same number of neurons in each layer.&lt;/li>
&lt;li>It is typically better to have a network with multiple layers, say 5 layers with 32 neurons each, than a single layer with many neurons, say 1 layer with 160 neurons, because then each subsequent layer can capture higher-level structures in the data.&lt;/li>
&lt;li>Picking a model with more layers and neurons than you need, then using regularization and early stopping to prevent overfitting, can work well.&lt;/li>
&lt;/ul>
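&lt;p>To make the learning rate rule of thumb concrete, here is a minimal sketch of drawing candidate learning rates uniformly in log-space rather than in raw space; the number of candidates (20) is an arbitrary placeholder.&lt;/p>
&lt;pre>&lt;code class="language-r">## Draw 20 candidate learning rates uniformly in log10-space between 1e-5 and 10
lr_candidates &amp;lt;- 10^runif(20, min = -5, max = 1)
sort(lr_candidates)
&lt;/code>&lt;/pre>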
&lt;h2 id="step-1-set-default-values-for-hyperparameters">Step 1: Set default values for hyperparameters&lt;/h2>
&lt;p>I am going to use the example model from the previous RiverHeatNet post.&lt;/p>
&lt;p>Here I demonstrate a search over just 9 different hyperparameters. Four of them are the number of nodes in different layers, and four are dropout rates for particular layers. The last hyperparameter is associated with the learning rate. Rather than keeping the learning rate constant, I reduce it by a certain factor when performance begins to stagnate; that reduction factor is the last hyperparameter. This is referred to as learning rate annealing, starting with a high learning rate and then reducing it over time, and it can considerably speed up convergence.&lt;/p>
&lt;p>The default values here can be thought of as placeholders.&lt;/p>
&lt;pre>&lt;code class="language-r"># Hyperparameter flags ---------------------------------------
## Default values
FLAGS &amp;lt;- flags(
# nodes
flag_numeric(&amp;quot;nodes1&amp;quot;, 8),
flag_numeric(&amp;quot;nodes2&amp;quot;, 64),
flag_numeric(&amp;quot;nodes3&amp;quot;, 64),
flag_numeric(&amp;quot;nodes4&amp;quot;, 64),
# dropout
flag_numeric(&amp;quot;dropout1&amp;quot;, 0.2),
flag_numeric(&amp;quot;dropout2&amp;quot;, 0.2),
flag_numeric(&amp;quot;dropout3&amp;quot;, 0.2),
flag_numeric(&amp;quot;dropout4&amp;quot;, 0.2),
# learning parameters
flag_numeric(&amp;quot;lr_annealing&amp;quot;, 0.1)
)
&lt;/code>&lt;/pre>
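&lt;p>One way the &lt;em>lr_annealing&lt;/em> flag could be wired into training is through Keras&amp;rsquo;s reduce-on-plateau callback, which multiplies the learning rate by a factor whenever the monitored metric stagnates. A minimal sketch, where the patience value is a placeholder rather than the value I actually used:&lt;/p>
&lt;pre>&lt;code class="language-r">## Reduce the learning rate by FLAGS$lr_annealing when validation loss stagnates
lr_callback &amp;lt;- callback_reduce_lr_on_plateau(
  monitor = 'val_loss',
  factor = FLAGS$lr_annealing, # annealing factor from the flag above
  patience = 5 # epochs without improvement before reducing (placeholder)
)
## Later passed to fit() via callbacks = list(lr_callback)
&lt;/code>&lt;/pre>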
&lt;h2 id="step-2-prepare-model">Step 2: Prepare model&lt;/h2>
&lt;p>Simply put, wherever you have a hyperparameter value hard-coded, replace it with a &lt;em>FLAGS&lt;/em> argument.&lt;/p>
&lt;p>For example,&lt;/p>
&lt;pre>&lt;code class="language-r">airTLayer &amp;lt;- airTInput %&amp;gt;%
layer_lstm(units = 7, dropout = 0.3)
airTlocalLayer &amp;lt;- airTlocalInput %&amp;gt;%
layer_lstm(units = 7, dropout = 0.3)
&lt;/code>&lt;/pre>
&lt;p>becomes:&lt;/p>
&lt;pre>&lt;code class="language-r">airTLayer &amp;lt;- airTInput %&amp;gt;%
layer_lstm(units = 7, dropout = FLAGS$dropout1)
airTlocalLayer &amp;lt;- airTlocalInput %&amp;gt;%
layer_lstm(units = 7, dropout = FLAGS$dropout1)
&lt;/code>&lt;/pre>
&lt;p>The name &lt;em>dropout1&lt;/em> was defined in Step 1. A handy feature of this is that you can force different layers to have the same hyperparameter values. For example, I force the dropout rates for my LSTM layers to be identical to reduce the dimensionality.&lt;/p>
&lt;p>My input &amp;lsquo;layer&amp;rsquo; associated with time-invariant covariates now looks like this:&lt;/p>
&lt;pre>&lt;code class="language-r">siteAttrLayer &amp;lt;- siteAttrInput %&amp;gt;%
layer_dense(units = FLAGS$nodes1) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = FLAGS$dropout2) %&amp;gt;%
layer_dense(units = 8) %&amp;gt;% ## I want this to remain at 8, the number of input attributes
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = FLAGS$dropout2)
&lt;/code>&lt;/pre>
&lt;p>I then concatenate the three layers and feed them into another set of layers, with hyperparameters for dropout and the number of units.&lt;/p>
&lt;pre>&lt;code class="language-r">concatenated &amp;lt;- layer_concatenate(list(airTLayer, airTlocalLayer, siteAttrLayer))
#-------------------------------------------------------------
## Define layers following concatenation
waterTOutput &amp;lt;- concatenated %&amp;gt;%
layer_dense(units = FLAGS$nodes2) %&amp;gt;% ## Prior to 3/22/21 used 62,62,10,1 units for layers
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = FLAGS$dropout3) %&amp;gt;%
layer_dense(units = FLAGS$nodes3) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = FLAGS$dropout4) %&amp;gt;%
layer_dense(units = FLAGS$nodes4) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_dense(units = 1) %&amp;gt;%
layer_activation(activation = &amp;quot;linear&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h2 id="step-3-specify-candidate-hyperparameter-values">Step 3: Specify candidate hyperparameter values&lt;/h2>
&lt;p>Next, we create a separate script that reads in our model script (fitNNet.R) and fits the model for a range of potential hyperparameter inputs.&lt;/p>
&lt;p>For simplicity I show how to do a grid search. For example, I tell it to consider only 8 nodes and 16 nodes for the &lt;em>nodes1&lt;/em> argument. To do a random search, you would simply need to replace those two values with a reasonable random number generator.&lt;/p>
&lt;pre>&lt;code class="language-r">library(dplyr)
library(keras)
library(tfruns)
set.seed(23523)
runs &amp;lt;- tuning_run(&amp;quot;fitNNet.R&amp;quot;,
runs_dir = 'runs',
flags = list(
nodes1 = c(8, 16),
nodes2 = c(16, 32, 64),
nodes3 = c(16, 32, 64),
nodes4 = c(16, 32, 64),
dropout1 = c(0.0, 0.2),
dropout2 = c(0.1, 0.3),
dropout3 = c(0.1, 0.3),
dropout4 = c(0.1, 0.3),
lr_annealing = c(0.1, 0.05)
),
sample = 0.10
)
&lt;/code>&lt;/pre>
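&lt;p>To switch to a random search, the fixed candidate vectors can be replaced with randomly drawn ones. A hedged sketch of what that could look like; the ranges, counts, and which flags are varied here are illustrative rather than the values I used:&lt;/p>
&lt;pre>&lt;code class="language-r">## Illustrative random-search variant: draw candidate values instead of fixing a grid
## (remaining flags would be drawn the same way, or left at their defaults)
runs &amp;lt;- tuning_run('fitNNet.R',
  runs_dir = 'runs',
  flags = list(
    nodes1 = sample(c(8, 16, 32, 64, 128), 3),
    dropout1 = round(runif(3, min = 0, max = 0.3), 2),
    lr_annealing = round(runif(3, min = 0.05, max = 0.5), 2)
  ),
  sample = 0.10
)
&lt;/code>&lt;/pre>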
&lt;p>There are a few considerations here. First, the more options you provide, the greater the space of potential values. With only 2-3 candidate values for each hyperparameter, there are already 2&lt;sup>5&lt;/sup> x 3&lt;sup>3&lt;/sup> = 864 possibilities to search over. For a small dataset or simple model, this may be manageable. In my case, a single iteration of the model takes over 30 minutes on my group&amp;rsquo;s supercomputer with the task parallelized across 24 nodes. &lt;strong>It would take me 18 days on the supercomputer to search all possible values.&lt;/strong>&lt;/p>
&lt;p>To solve this problem, you can randomly search over the different combinations of candidate values. The &lt;em>sample = 0.10&lt;/em> argument tells it to randomly sample 10%, or roughly 86 models, to run rather than all possible combinations. I ended up training 104 models in total.&lt;/p>
&lt;h2 id="step-4-compare-runs">Step 4: Compare runs&lt;/h2>
&lt;p>Comparing runs is made quite easy with the package, another key advantage. The following command lists all of the runs, their loss values on the training and validation data, and associated values for each of the hyperparameters. The loss here is mean-squared error, so I list them with the smallest loss for the validation set first and increasing from there.&lt;/p>
&lt;pre>&lt;code class="language-r">ls_runs(order = metric_val_loss, decreasing = F, runs_dir = '.')
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-r"> run_dir metric_val_loss metric_loss flag_nodes1 flag_nodes2
83 ./2021-04-14T03-19-02Z 0.0658 0.0755 16 64
27 ./2021-04-16T00-06-20Z 0.0737 0.0786 16 32
90 ./2021-04-14T01-37-36Z 0.0747 0.0764 16 64
47 ./2021-04-15T23-07-05Z 0.0777 0.0886 8 64
76 ./2021-04-14T05-07-44Z 0.0794 0.0895 16 64
&lt;/code>&lt;/pre>
&lt;p>In the first row we see the &amp;lsquo;best&amp;rsquo; model, as indicated by loss on the validation set, corresponds to the 83rd run. It has a mean-squared error of 0.0658 and sets 16 nodes for the first flag, 64 for the second flag, and so on (additional columns are hidden here).&lt;/p>
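&lt;p>To carry the winner forward, the same &lt;em>ls_runs()&lt;/em> table can be used to pull out the best run&amp;rsquo;s flag values and re-run the training script with them via &lt;em>training_run()&lt;/em>. A minimal sketch, assuming the &lt;em>flag_&lt;/em> column names shown above:&lt;/p>
&lt;pre>&lt;code class="language-r">## Grab the flags of the best run (lowest validation loss) and retrain with them
best &amp;lt;- ls_runs(order = metric_val_loss, decreasing = F, runs_dir = '.')[1, ]
training_run('fitNNet.R',
  flags = list(
    nodes1 = best$flag_nodes1,
    nodes2 = best$flag_nodes2,
    dropout1 = best$flag_dropout1
    ## ...and so on for the remaining flags
  )
)
&lt;/code>&lt;/pre>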
&lt;p>To compare two model runs and get some simple visualizations:&lt;/p>
&lt;pre>&lt;code class="language-r">compare_runs(runs = c('./2021-04-14T03-19-02Z', './2021-04-16T00-06-20Z'))
&lt;/code>&lt;/pre>
&lt;figure id="figure-visualization-dashboard-for-comparing-model-runs">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./dashboard.png" alt="Visualization dashboard for comparing model runs." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Visualization dashboard for comparing model runs.
&lt;/figcaption>&lt;/figure>
&lt;p>This launches a page similar to TensorBoard, providing basics about the runs and their loss metrics. The GitHub-esque track changes shows that &lt;em>nodes1&lt;/em> has the same value between runs, but &lt;em>nodes2&lt;/em> has 64 for the first run and 32 for the second run.&lt;/p>
&lt;p>The dropout rates are interestingly all the same between the two runs&amp;ndash;and low&amp;ndash;suggesting the model requires relatively little dropout. If all of the top models show low levels of dropout, I might fix the dropout rates and instead shift focus to tuning the learning rate or the number of nodes. There are also many more modeling options to consider, like the optimizer choice (a sketch of which follows below).&lt;/p>
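&lt;p>Adding the optimizer to the search would only require a string flag. A hypothetical sketch, not part of my actual search, of how that could look inside fitNNet.R:&lt;/p>
&lt;pre>&lt;code class="language-r">## Hypothetical: add flag_string('optimizer', 'adam') to the flags() call, then
opt &amp;lt;- switch(FLAGS$optimizer,
  adam = optimizer_adam(),
  rmsprop = optimizer_rmsprop(),
  sgd = optimizer_sgd())
model %&amp;gt;% compile(optimizer = opt, loss = 'mse')
## and in the tuning script: optimizer = c('adam', 'rmsprop', 'sgd')
&lt;/code>&lt;/pre>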
&lt;p>What else can we tell from the model runs? Below I plot the distribution of validation losses across the n=104 runs, each representing a different combination of hyperparameter values.&lt;/p>
&lt;pre>&lt;code class="language-r">hist(ls_runs(order = metric_val_loss, decreasing = F, runs_dir = '.')$metric_val_loss,
breaks = 9,
las = 1,
xlab = 'Validation Loss',
main = 'Validation Loss Across\nDifferent Hyperparameter Combos',
col = 'plum2')
&lt;/code>&lt;/pre>
&lt;figure id="figure-a-random-blind-guess-at-the-hyperparameter-values-would-land-around-a-loss-value-of-012-compared-with-our-best-model-at-00658-a-roughly-50-improvement">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./lossHistogram.png" alt="A random &amp;#39;blind&amp;#39; guess at the hyperparameter values would land around a loss value of 0.12, compared with our best model at 0.0658, a roughly 50% improvement." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
A random &amp;lsquo;blind&amp;rsquo; guess at the hyperparameter values would land around a loss value of 0.12, compared with our best model at 0.0658, a roughly 50% improvement.
&lt;/figcaption>&lt;/figure>
&lt;p>The losses are roughly normally distributed with a mean of 0.12. This tells us that a typical &amp;lsquo;blind&amp;rsquo; guess at the hyperparameter values would land around a loss value of 0.12. Our best model&amp;rsquo;s loss is nearly half of that, &lt;strong>suggesting the hyperparameter tuning roughly halved the validation loss relative to a typical blind guess!&lt;/strong> I would be curious to see in practice if distributions are typically normal, and in the case when they aren&amp;rsquo;t normal, what that might tell us.&lt;/p>
&lt;h2 id="tips">Tips&lt;/h2>
&lt;ol>
&lt;li>Run the model with default values first to get a sense of the expected computation costs.&lt;/li>
&lt;li>Randomly sample hyperparameter value combinations rather than searching all possible combinations. When looking at the results, you may find that certain hyperparameters gravitate towards the same value. Fix those values, then perform a more exhaustive follow-up search on the remaining hyperparameters.&lt;/li>
&lt;li>Similar to #2, the random search may tell you that, for instance, a value between 0 and 0.2 tends to be better for a dropout rate. Then you can narrow your search over just that span for follow-up searches. Repeating this process to narrow the windows for each hyperparameter is essentially a manual, coarse-grained approximation of Bayesian optimization.&lt;/li>
&lt;li>By comparing the best and average performers from your search, you can see what the opportunity gap is for hyperparameter tuning. If it is negligible for your end goal, you can focus your efforts elsewhere.&lt;/li>
&lt;/ol></description></item><item><title>RiverHeatNet: Building a river temperature neural network with Keras functional API</title><link>https://tristan-ballard.netlify.app/post/kerasfunctionalapi/</link><pubDate>Mon, 21 Jun 2021 12:00:00 +0000</pubDate><guid>https://tristan-ballard.netlify.app/post/kerasfunctionalapi/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>I came upon the Keras functional API after setting out to develop a model for predicting river temperatures and their response to climate change. The dataset I collected consists of over 4.1 million temperature measurements across 1,210 rivers. Input variables like air temperature and precipitation are of the same time series structure as the output, but I also have information about each of the rivers like elevation and ecosystem type that I know are relevant and would like to incorporate.&lt;/p>
&lt;p>&lt;strong>The modeling challenge is that the predicted variable, water temperature, is a time series, while the input variables are a mixture of time series and static variables.&lt;/strong>&lt;/p>
&lt;p>There are a few options for tackling this variable mismatch:&lt;/p>
&lt;ol>
&lt;li>Use an ANN and ignore the time series aspect. &lt;em>Pros:&lt;/em> Fast and easy. &lt;em>Cons:&lt;/em> Lower accuracy since it leaves out so much useful information.&lt;/li>
&lt;li>Use a recurrent neural network with the static variables converted to time series, repeating the same values at every time step. &lt;em>Pros:&lt;/em> Incorporates time series information. &lt;em>Cons:&lt;/em> Woefully computationally inefficient.&lt;/li>
&lt;li>Combine the best aspects of #1 and #2.&lt;/li>
&lt;/ol>
&lt;p>Every neural network developed for river temperatures I have come across in the literature has gone with Option #1, sidestepping the modeling challenge by ignoring the time series structure of the data. This leaves a great opportunity to improve upon existing models. By the end of this post, you will see how I developed a model that flexibly incorporates both time series inputs and static inputs to predict a time series output using the Keras functional API.&lt;/p>
&lt;figure id="figure-i-trained-and-evaluated-the-model-on-920-rivers-black-each-having-a-time-series-of-daily-river-temperature-over-many-years-along-with-associated-time-series-and-time-invariant-covariates-the-pink-and-blue-sights-were-both-used-to-test-the-model-performance">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./riverLocations.png" alt="I trained and evaluated the model on 920 rivers (black), each having a time series of daily river temperature over many years along with associated time-series and time-invariant covariates. The pink and blue sights were both used to test the model performance." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
I trained and evaluated the model on 920 rivers (black), each having a time series of daily river temperature over many years along with associated time-series and time-invariant covariates. The pink and blue sites were both used to test the model performance.
&lt;/figcaption>&lt;/figure>
&lt;h2 id="keras-functional-api">Keras Functional API&lt;/h2>
&lt;p>Thanks to the Keras functional API, it is remarkably easy to combine recurrent neural network layers with standard, fully connected ANN layers.&lt;/p>
&lt;p>The functional API can build all of the same models as the sequential API but has much greater flexibility to incorporate different inputs. The functional API can also be used to build models with multiple outputs such as combined classification and regression tasks (e.g. this image is a cat predicted to be 4.3 years old). I experimented with multiple outputs and the functional API when building a network for wildfires, combining a binary detection task (fire, no fire) and a regression task (if there is fire, what is its magnitude?).&lt;/p>
&lt;h2 id="model-architecture">Model Architecture&lt;/h2>
&lt;h4 id="lstm-layers">LSTM Layers&lt;/h4>
&lt;p>I feed all input variables that are time series (two distinct air temperature time series and a precipitation time series I later dropped) into separate long short-term memory (LSTM) cells. LSTMs, first proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, are a type of recurrent neural network commonly used in time series applications. LSTMs are capable of taking the input, storing it for as long as needed, and extracting its value later. As a result, they have been very successful in tasks with long-term patterns like speech recognition and long texts. A commonly used alternative is the gated recurrent unit (GRU), which can easily be swapped in with the API (see the sketch at the end of this section).&lt;/p>
&lt;pre>&lt;code class="language-r">airTLayer &amp;lt;- airTInput %&amp;gt;%
layer_lstm(units = 7, dropout = 0.15, recurrent_dropout = 0.15)
&lt;/code>&lt;/pre>
&lt;p>The code above gives an example LSTM layer for one of the input time series, the local air temperature. I likewise created separate layers for the basin-wide average air temperature and precipitation.&lt;/p>
&lt;p>The &lt;em>units&lt;/em> argument determines the dimensionality of the output space. I set it to 7 days. The input data are organized such that the &amp;lsquo;lookback&amp;rsquo; period is 7 days as well, meaning that the model can only see data from the previous week when trying to fit today&amp;rsquo;s value. I chose the lookback period based on a literature review, where we have little reason to believe river temperatures from more than a week prior will give useful information about today&amp;rsquo;s temperature. The output dimension of 7 days is a tuneable hyperparameter and does not need to match the lookback period.&lt;/p>
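&lt;p>The time series input layers themselves are not shown in this post; assuming the 7-day lookback with a single feature per series, they might look something like the following sketch (the names match those used when compiling the model below, but the shapes are my assumption).&lt;/p>
&lt;pre>&lt;code class="language-r">## Assumed input layers for the time series covariates: 7 lookback days x 1 feature each
airTInput &amp;lt;- layer_input(shape = c(7, 1), name = 'airT')
airTlocalInput &amp;lt;- layer_input(shape = c(7, 1), name = 'airTlocal')
precipInput &amp;lt;- layer_input(shape = c(7, 1), name = 'precip')
&lt;/code>&lt;/pre>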
&lt;p>The two &lt;em>dropout&lt;/em> specifications impose a moderate regularization effect to help mitigate overfitting. How does it work? Simply put, during each forward or backward pass of the algorithm, 15% of the nodes are randomly ignored. This 15% level is also a tuneable hyperparameter, and for this dataset I found that 15% was actually too high.&lt;/p>
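&lt;p>As noted above, swapping in a GRU is a one-line change; a minimal sketch of the same layer with a GRU instead of an LSTM:&lt;/p>
&lt;pre>&lt;code class="language-r">## GRU variant of the same layer, a drop-in alternative to the LSTM
airTLayer &amp;lt;- airTInput %&amp;gt;%
  layer_gru(units = 7, dropout = 0.15, recurrent_dropout = 0.15)
&lt;/code>&lt;/pre>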
&lt;h4 id="fully-connected-layers">Fully connected layers&lt;/h4>
&lt;p>I feed the 8 time-invariant basin attributes, such as elevation at a given sampling location, into a standard, fully connected ANN with two layers. Since I only have 8 basin attributes to feed in, the layers do not have many nodes.&lt;/p>
&lt;pre>&lt;code class="language-r">## Basin attribute layer
nAttr = 8
siteAttrInput &amp;lt;- layer_input(shape = nAttr, name = 'siteAttr')
siteAttrLayer &amp;lt;- siteAttrInput %&amp;gt;%
layer_dense(units = 16) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = 0.1) %&amp;gt;%
layer_dense(units = 8) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = 0.3)
&lt;/code>&lt;/pre>
&lt;p>The code above sets up the basin attribute layer, feeding it into two layers. For each layer I use the leaky ReLU activation function, a typical default choice, along with batch normalization, which helps prevent overfitting and accelerate training. I also use dropout in each layer, with the dropout rates selected via grid search (more on that later).&lt;/p>
&lt;h4 id="merge-lstm-and-fully-connected-layers">Merge LSTM and fully connected layers&lt;/h4>
&lt;p>Next, I combine the different LSTM and basin attribute layers into an additional set of three fully connected layers. The output is the current day&amp;rsquo;s water temperature at a given river. Keras makes this very straightforward.&lt;/p>
&lt;pre>&lt;code class="language-r">## Merge Input layers
concatenated &amp;lt;- layer_concatenate(list(airTLayer, airTlocalLayer, precipLayer, siteAttrLayer))
&lt;/code>&lt;/pre>
&lt;p>Just like that, the code above merges the different layers, which I then feed into follow-on fully connected layers:&lt;/p>
&lt;pre>&lt;code class="language-r">## Define layers following concatenation
waterTOutput &amp;lt;- concatenated %&amp;gt;%
layer_dense(units = 64) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = 0.5) %&amp;gt;%
layer_dense(units = 64) %&amp;gt;%
layer_activation_leaky_relu() %&amp;gt;%
layer_batch_normalization() %&amp;gt;%
layer_dropout(rate = 0.5) %&amp;gt;%
layer_dense(units = 32) %&amp;gt;%
layer_activation(activation = &amp;quot;linear&amp;quot;) %&amp;gt;%
layer_dense(units = 1) %&amp;gt;%
layer_activation(activation = &amp;quot;linear&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Why so few layers?&lt;/em> Compared with the top networks designed for ImageNet classification, which can have hundreds of layers, this is indeed a simple and &amp;lsquo;shallow&amp;rsquo; model. It is not for lack of data either, which would necessitate a simple model. Instead, I chose a handful of layers because the underlying task is relatively simple, controlled by physical relationships (hotter air = hotter water). The number of layers should correspond loosely with how complex you imagine the task is. Identifying faces, for instance, is far more complex a task than predicting temperature.&lt;/p>
&lt;p>At this point we are pretty much done 💃 🕺!&lt;/p>
&lt;pre>&lt;code class="language-r">## Compile model and add final specs
model &amp;lt;- keras_model(list(airTInput, airTlocalInput, precipInput, siteAttrInput), waterTOutput)
model %&amp;gt;% compile(
optimizer = optimizer_adam(),
loss = 'mse'
)
&lt;/code>&lt;/pre>
&lt;p>The code above compiles the model, telling it what to expect as inputs and outputs, along with specifying the type of optimizer for training and the loss metric (mean-squared error).&lt;/p>
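&lt;p>Fitting then follows the usual Keras pattern, with the four inputs passed as a list in the same order as in &lt;em>keras_model()&lt;/em>. A minimal sketch where the training arrays, epochs, and batch size are placeholders not shown in this post:&lt;/p>
&lt;pre>&lt;code class="language-r">## Placeholder training call: the x_* arrays and waterT_train are assumed, not shown here
history &amp;lt;- model %&amp;gt;% fit(
  list(x_airT, x_airTlocal, x_precip, x_siteAttr),
  waterT_train,
  epochs = 50,
  batch_size = 256,
  validation_split = 0.2
)
&lt;/code>&lt;/pre>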
&lt;h4 id="hyperparameter-tuning">Hyperparameter tuning&lt;/h4>
&lt;p>The choice of hyperparameter values, such as dropout rates for each layer, type of optimizer, and number of nodes, can make a huge impact on model performance. Unfortunately, there are &amp;lsquo;rules of thumb&amp;rsquo; for these at best, so it is often better to let the data inform what to choose. I performed a grid search across potential values to make my final choices, aided by the &lt;em>tfruns&lt;/em> package created by the TensorFlow community, which I&amp;rsquo;ll save for a different post.&lt;/p>
&lt;h4 id="model-evaluation">Model evaluation&lt;/h4>
&lt;p>I set up an over-the-top set of steps for evaluating model performance, including three distinct test sets serving different purposes. But I won&amp;rsquo;t go into that here&amp;ndash;it is all documented in an upcoming river heatwave manuscript. The quick version is that &lt;em>R&lt;sup>2&lt;/sup>&lt;/em> values were very good, on the order of 90% for test data consisting of rivers not used in training.&lt;/p>
&lt;h2 id="final-thoughts">Final thoughts&lt;/h2>
&lt;p>I wrote this post in part because so many of the online tutorials I have seen use the Keras sequential API. However, in most of my work, I have eventually needed to add more customization and make the switch to the functional API, to the point where I now use the functional API as my default.&lt;/p>
&lt;p>The code above is all in R. I have used the Keras functional API in both R and Python, and the syntax is nearly identical. This makes deployment, migration between languages, and finding debugging solutions online a breeze.&lt;/p></description></item><item><title>FireSRnet: Geoscience-driven super-resolution of future fire risk from climate change</title><link>https://tristan-ballard.netlify.app/post/firesrnet/</link><pubDate>Mon, 21 Jun 2021 10:00:00 +0000</pubDate><guid>https://tristan-ballard.netlify.app/post/firesrnet/</guid><description>&lt;p>At Sust Global, we developed FireSRnet, a novel super-resolution (SR) architecture operating on a 3-channel geospatial dataset incorporating NASA satellite fire data 🛰, local temperature🌡️, and local land cover burnability🌲.&lt;/p>
&lt;p>We compared FireSRnet performance at 2x, 4x, and 8x SR against a benchmark interpolation technique and validated model results with the recent fires in California and Australia.&lt;/p>
&lt;p>Then, we showcased how FireSRnet can leverage CMIP6 climate model simulations of burned area and temperature to enable more precise forward-looking estimates of fire exposure 🔥.&lt;/p>
&lt;ul>
&lt;li>👉 &lt;a href="https://www.sustglobal.com/blog/firesrnet" target="_blank" rel="noopener">&lt;strong>Check out the blog post I wrote up on this project!&lt;/strong>&lt;/a>&lt;/li>
&lt;li>🗣️ &lt;a href="https://slideslive.com/38942189/firesrnet-geosciencedriven-superresolution-of-future-fire-risk-from-climate-change" target="_blank" rel="noopener">&lt;strong>Check out the talk I gave at NeurIPS!&lt;/strong>&lt;/a>&lt;/li>
&lt;li>📃 &lt;a href="https://arxiv.org/abs/2011.12353" target="_blank" rel="noopener">&lt;strong>Check out the paper on arXiv!&lt;/strong>&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>I presented this work at the 2020 NeurIPS workshop on Tackling Climate Change with Machine Learning and was selected to give a spotlight talk.&lt;/p></description></item><item><title>FireSRnet: Geoscience-Driven Super-Resolution of Future Fire Risk from Climate Change</title><link>https://tristan-ballard.netlify.app/publication/firesrnet/2020-01-01_ballard2020/</link><pubDate>Tue, 24 Nov 2020 00:00:00 +0000</pubDate><guid>https://tristan-ballard.netlify.app/publication/firesrnet/2020-01-01_ballard2020/</guid><description/></item></channel></rss>