This is the final post in a four-part introduction to time-series forecasting with torch. These posts have been the story of a quest for multiple-step prediction, and by now, we’ve seen three different approaches: forecasting in a loop, incorporating a multi-layer perceptron (MLP), and sequence-to-sequence models. Here’s a quick recap.
- As one should when setting out on an adventurous journey, we started with an in-depth study of the tools at our disposal: recurrent neural networks (RNNs). We trained a model to predict the very next observation in line, and then thought of a clever hack: How about we use this for multi-step prediction, feeding back individual predictions in a loop? The result, it turned out, was quite acceptable.
- Then, the adventure really started. We built our first model “natively” for multi-step prediction, relieving the RNN of a bit of its workload and involving a second player, a tiny-ish MLP. Now, it was the MLP’s task to project RNN output to several time points in the future. Although results were pretty satisfactory, we didn’t stop there.
- Instead, we applied to numerical time series a technique commonly used in natural language processing (NLP): sequence-to-sequence (seq2seq) prediction. While forecast performance was not much different from the previous case, we found the technique to be more intuitively appealing, since it reflects the causal relationship between successive forecasts.
Today we’ll enrich the seq2seq approach by adding a new component: the attention module. Originally introduced around 2014, attention mechanisms have gained enormous traction, so much so that a recent paper title starts out “Attention is Not All You Need”.
The idea is the following.
In the classic encoder-decoder setup, the decoder gets “primed” with an encoder summary just a single time: the time it starts its forecasting loop. From then on, it’s on its own. With attention, however, it gets to see the complete sequence of encoder outputs again every time it forecasts a new value. What’s more, every time, it gets to zoom in on those outputs that seem relevant for the current prediction step.
This is a particularly useful strategy in translation: In generating the next word, a model will need to know what part of the source sentence to focus on. How much the technique helps with numerical sequences, in contrast, will likely depend on the features of the series in question.
As before, we work with vic_elec, but this time, we partly deviate from the way we used to employ it. With the original, half-hourly dataset, training the current model takes a long time, longer than readers will want to wait when experimenting. So instead, we aggregate observations by day. In order to have enough data, we train on years 2012 and 2013, reserving 2014 for validation as well as post-training inspection.
We’ll attempt to forecast demand up to fourteen days ahead. How long, then, should the input sequences be? This is a matter of experimentation, all the more so now that we’re adding in the attention mechanism. (I suspect that it might not handle very long sequences so well.)
We go with fourteen days for input length, too, but that may not necessarily be the best choice for this series.
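The code in this post relies on a few objects that need to exist beforehand: vic_elec_daily, elec_train, elec_valid, elec_test, train_mean, and train_sd. Here is a minimal sketch of how they could be prepared. It assumes that daily demand is obtained by summing the half-hourly values (a daily mean would serve equally well) and that the test set covers January to April 2014, matching the filter used for plotting later on. Treat it as one plausible setup, not necessarily the exact preprocessing behind the reported results.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)
library(lubridate)

# aggregate half-hourly demand to one value per day
# (assumption: daily sum; the text only says "aggregate by day")
vic_elec_daily <- vic_elec %>%
  select(Time, Demand) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(Demand = sum(Demand))

# training data: 2012 and 2013, as a one-column matrix
elec_train <- vic_elec_daily %>%
  filter(year(Date) %in% c(2012, 2013)) %>%
  pull(Demand) %>%
  as.matrix()

# validation data: all of 2014
elec_valid <- vic_elec_daily %>%
  filter(year(Date) == 2014) %>%
  pull(Demand) %>%
  as.matrix()

# test data (assumption): first four months of 2014
elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4) %>%
  pull(Demand) %>%
  as.matrix()

# statistics used for standardization, computed on the training set only
train_mean <- mean(elec_train)
train_sd <- sd(elec_train)
```

Computing the mean and standard deviation on the training years only keeps information from 2014 out of the standardization step.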
n_timesteps <- 7 * 2
n_forecast <- 7 * 2

elec_dataset <- dataset(
  name = "elec_dataset",
  initialize = function(x, n_timesteps, sample_frac = 1) {
    self$n_timesteps <- n_timesteps
    # standardize using training-set statistics
    self$x <- torch_tensor((x - train_mean) / train_sd)
    n <- length(self$x) - self$n_timesteps - 1
    # optionally work with a random subset of the possible start positions
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
  },
  .getitem = function(i) {
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
  },
  .length = function() {
    length(self$starts)
  }
)

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
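To see which observations end up in x and y, here is a base-R sketch of the index arithmetic used in .getitem(). The toy series of 20 values and the window length of 5 are made up for illustration, and the names toy_series, toy_timesteps, and get_window are hypothetical. With lag set to 1, the target window is the input window shifted one step ahead.

```r
# toy stand-ins for the real series and window length (illustrative values only)
toy_series <- 1:20
toy_timesteps <- 5
lag <- 1

# same index arithmetic as in .getitem()
get_window <- function(start) {
  end <- start + toy_timesteps - 1
  list(
    x = toy_series[start:end],
    y = toy_series[(start + lag):(end + lag)]
  )
}

# number of valid start positions, computed as in initialize()
n <- length(toy_series) - toy_timesteps - 1

first <- get_window(1)
last <- get_window(n)

first$x  # 1 2 3 4 5
first$y  # 2 3 4 5 6
last$y   # 15 16 17 18 19
```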
Model-wise, we again encounter the three modules familiar from the previous post: encoder, decoder, and top-level seq2seq module. However, there is an additional component: the attention module, used by the decoder to obtain attention weights.
Encoder
The encoder still works the same way. It wraps an RNN, and returns the outputs for all time steps together with the final state.
encoder_module <- nn_module(
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    self$type <- type
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
  },
  forward = function(x) {
    # return outputs for all timesteps, as well as last-timestep states for all layers
    x %>% self$rnn()
  }
)
Attention module
In basic seq2seq, whenever it had to generate a new value, the decoder took into account two things: its prior state, and the previous output generated. In an attention-enriched setup, the decoder additionally receives the complete output from the encoder. In deciding what subset of that output should matter, it gets help from a new agent, the attention module.
This, then, is the attention module’s raison d’être: Given the current decoder state as well as the complete encoder outputs, obtain a weighting of those outputs indicative of how relevant they are to what the decoder is currently up to. This procedure results in the so-called attention weights: for each time step in the encoding, a normalized score that quantifies its importance.
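As a small illustration of what “normalized” means here, the following base-R sketch turns a vector of made-up raw scores, one per encoder time step, into attention weights by applying a softmax. The softmax() helper and the score values are purely illustrative.

```r
# numerically stable softmax (illustrative helper)
softmax <- function(z) {
  e <- exp(z - max(z))  # subtracting the maximum avoids overflow
  e / sum(e)
}

# made-up raw scores for four encoder time steps
raw_scores <- c(0.1, 2.0, -1.0, 0.5)
attention_weights <- softmax(raw_scores)

sum(attention_weights)        # the weights sum to 1
which.max(attention_weights)  # the time step that receives the most weight
```

Since all weights are positive and sum to one, they can be read as the share of “attention” each encoder time step receives.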
Attention may be implemented in a number of different ways. Here, we show two implementation options, one additive, and one multiplicative.
Additive attention
In additive attention, encoder outputs and decoder state are commonly either added or concatenated (we choose to do the latter, below). The resulting tensor is run through a linear layer, and a softmax is applied for normalization.
attention_module_additive <- nn_module(
  initialize = function(hidden_dim, attention_size) {
    self$attention <- nn_linear(2 * hidden_dim, attention_size)
  },
  forward = function(state, encoder_outputs) {
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)

    # multiplex state to allow for concatenation (dimensions 1 and 2 must agree)
    seq_len <- dim(encoder_outputs)[2]
    # resulting shape: (bs, timesteps, hidden_dim)
    state_rep <- state$permute(c(2, 1, 3))$repeat_interleave(seq_len, 2)

    # concatenate along feature dimension
    concat <- torch_cat(list(state_rep, encoder_outputs), dim = 3)

    # run through linear layer with tanh
    # resulting shape: (bs, timesteps, attention_size)
    scores <- self$attention(concat) %>%
      torch_tanh()

    # sum over attention dimension and normalize
    # resulting shape: (bs, timesteps)
    attention_weights <- scores %>%
      torch_sum(dim = 3) %>%
      nnf_softmax(dim = 2)

    # a normalized score for every source token
    attention_weights
  }
)
Multiplicative attention
In multiplicative attention, scores are obtained by computing dot products between the decoder state and all of the encoder outputs. Here too, a softmax is then used for normalization.
attention_module_multiplicative <- nn_module(
  initialize = function() {
    NULL
  },
  forward = function(state, encoder_outputs) {
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)

    # allow for matrix multiplication with encoder_outputs
    state <- state$permute(c(2, 3, 1))

    # prepare for scaling by number of features
    d <- torch_tensor(dim(encoder_outputs)[3], dtype = torch_float())

    # scaled dot products between state and outputs
    # resulting shape: (bs, timesteps, 1)
    scores <- torch_bmm(encoder_outputs, state) %>%
      torch_div(torch_sqrt(d))

    # normalize
    # resulting shape: (bs, timesteps)
    attention_weights <- scores$squeeze(3) %>%
      nnf_softmax(dim = 2)

    # a normalized score for every source token
    attention_weights
  }
)
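The shape bookkeeping in the multiplicative variant can be checked in isolation. The sketch below repeats the same tensor operations on random inputs, outside of any module. Batch size 4, 14 time steps, and a hidden size of 32 are arbitrary values chosen for illustration.

```r
library(torch)

# arbitrary sizes, for illustration only
bs <- 4
timesteps <- 14
hidden_dim <- 32

encoder_outputs <- torch_randn(bs, timesteps, hidden_dim)
state <- torch_randn(1, bs, hidden_dim)

# (1, bs, hidden_dim) -> (bs, hidden_dim, 1)
state_t <- state$permute(c(2, 3, 1))

# (bs, timesteps, hidden_dim) x (bs, hidden_dim, 1) -> (bs, timesteps, 1),
# scaled by the square root of the number of features
scores <- torch_bmm(encoder_outputs, state_t) / sqrt(hidden_dim)

# drop the trailing dimension and normalize over time steps
attention_weights <- nnf_softmax(scores$squeeze(3), dim = 2)

dim(attention_weights)                       # 4 14
as.numeric(attention_weights$sum(dim = 2))   # each row sums to (approximately) 1
```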
Decoder
Once attention weights have been computed, their actual application is handled by the decoder. Concretely, the method in question, weighted_encoder_outputs(), computes a product of weights and encoder outputs, making sure that each output has the appropriate impact.
The rest of the action then happens in forward(). A concatenation of weighted encoder outputs (often called “context”) and current input is run through an RNN. Then, an ensemble of RNN output, context, and input is passed to an MLP. Finally, both RNN state and current prediction are returned.
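To make the “product of weights and encoder outputs” concrete, the following sketch computes a context tensor with torch_bmm() and compares it, for the first batch item, with a weighted sum computed by hand. All sizes and the random attention weights are made up for illustration.

```r
library(torch)

# arbitrary sizes, for illustration only
bs <- 2
timesteps <- 3
hidden_dim <- 4

encoder_outputs <- torch_randn(bs, timesteps, hidden_dim)
# made-up attention weights, normalized over time steps
attention_weights <- nnf_softmax(torch_randn(bs, timesteps), dim = 2)

# (bs, 1, timesteps) x (bs, timesteps, hidden_dim) -> (bs, 1, hidden_dim)
context <- torch_bmm(attention_weights$unsqueeze(2), encoder_outputs)

# the same quantity for the first batch item, computed by hand:
# a weighted sum of that item's encoder outputs over time steps
w <- as.numeric(attention_weights[1, ])
e <- as_array(encoder_outputs[1, , ])
manual <- as.numeric(w %*% e)

dim(context)                                    # 2 1 4
max(abs(as.numeric(context[1, 1, ]) - manual))  # close to zero
```

In other words, the context is a single vector per batch item, a mixture of encoder outputs in which the time steps deemed most relevant contribute most.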
decoder_module <- nn_module(
  initialize = function(type, input_size, hidden_size, attention_type, attention_size = 8, num_layers = 1) {
    self$type <- type
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    self$linear <- nn_linear(2 * hidden_size + 1, 1)
    self$attention <- if (attention_type == "multiplicative") attention_module_multiplicative()
      else attention_module_additive(hidden_size, attention_size)
  },
  weighted_encoder_outputs = function(state, encoder_outputs) {
    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    # resulting shape: (bs, timesteps)
    attention_weights <- self$attention(state, encoder_outputs)

    # resulting shape: (bs, 1, seq_len)
    attention_weights <- attention_weights$unsqueeze(2)

    # resulting shape: (bs, 1, hidden_size)
    weighted_encoder_outputs <- torch_bmm(attention_weights, encoder_outputs)

    weighted_encoder_outputs
  },
  forward = function(x, state, encoder_outputs) {
    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)

    # resulting shape: (bs, 1, hidden_size)
    context <- self$weighted_encoder_outputs(state, encoder_outputs)

    # concatenate input and context
    # NOTE: this repeating is done to compensate for the absence of an embedding module
    # that, in NLP, would give x a higher proportion in the concatenation
    x_rep <- x$repeat_interleave(dim(context)[3], 3)
    rnn_input <- torch_cat(list(x_rep, context), dim = 3)

    # resulting shapes: (bs, 1, hidden_size) and (1, bs, hidden_size)
    rnn_out <- self$rnn(rnn_input, state)
    rnn_output <- rnn_out[[1]]
    next_hidden <- rnn_out[[2]]

    mlp_input <- torch_cat(list(rnn_output$squeeze(2), context$squeeze(2), x$squeeze(2)), dim = 2)
    output <- self$linear(mlp_input)

    # shapes: (bs, 1) and (1, bs, hidden_size)
    list(output, next_hidden)
  }
)
seq2seq module
The seq2seq module is basically unchanged (apart from the fact that now, it allows for attention module configuration). For a detailed explanation of what happens here, please consult the previous post.
seq2seq_module <- nn_module(
  initialize = function(type, input_size, hidden_size, attention_type, attention_size, n_forecast,
                        num_layers = 1, encoder_dropout = 0) {
    self$encoder <- encoder_module(type = type, input_size = input_size, hidden_size = hidden_size,
                                   num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = 2 * hidden_size, hidden_size = hidden_size,
                                   attention_type = attention_type, attention_size = attention_size, num_layers)
    self$n_forecast <- n_forecast
  },
  forward = function(x, y, teacher_forcing_ratio) {
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)
    encoded <- self$encoder(x)
    encoder_outputs <- encoded[[1]]
    hidden <- encoded[[2]]

    # list of (batch_size, 1), (1, batch_size, hidden_size)
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden, encoder_outputs)
    # (batch_size, 1)
    pred <- out[[1]]
    # (1, batch_size, hidden_size)
    state <- out[[2]]
    outputs[ , 1] <- pred$squeeze(2)

    for (t in 2:self$n_forecast) {
      # with probability teacher_forcing_ratio, feed the ground truth instead of the last prediction
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state, encoder_outputs)
      pred <- out[[1]]
      state <- out[[2]]
      outputs[ , t] <- pred$squeeze(2)
    }
    outputs
  }
)
When instantiating the top-level model, we now have an additional choice: that between additive and multiplicative attention. In the “accuracy” sense of performance, my tests did not show any differences. However, the multiplicative variant is a lot faster.
net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, attention_type = "multiplicative",
                      attention_size = 8, n_forecast = n_forecast)
Just like last time, in model training, we get to choose the degree of teacher forcing. In the training code, we go with a fraction of 0.0, that is, no forcing at all.
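What a given teacher forcing ratio means in practice can be sketched in a few lines of base R. The helper name use_teacher_forcing is hypothetical, and the sketch assumes the decision is made by comparing a uniform random draw against the ratio with a strict less-than, once per decoding step.

```r
# decide, for one decoding step, whether to feed the ground truth (TRUE)
# or the model's own previous prediction (FALSE)
use_teacher_forcing <- function(teacher_forcing_ratio) {
  runif(1) < teacher_forcing_ratio
}

set.seed(1)
decisions_none <- replicate(1000, use_teacher_forcing(0.0))
decisions_half <- replicate(1000, use_teacher_forcing(0.5))

mean(decisions_none)  # 0: the ground truth is never fed in
mean(decisions_half)  # roughly one half of the steps use the ground truth
```

With a ratio of 0.0, the decoder always continues from its own predictions, which is also the situation it faces at forecasting time.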
optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 1000

train_batch <- function(b, teacher_forcing_ratio) {
  optimizer$zero_grad()
  output <- net(b$x, b$y, teacher_forcing_ratio)
  target <- b$y

  loss <- nnf_mse_loss(output, target[ , 1:(dim(output)[2])])
  loss$backward()
  optimizer$step()

  loss$item()
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  output <- net(b$x, b$y, teacher_forcing_ratio)
  target <- b$y

  loss <- nnf_mse_loss(output, target[ , 1:(dim(output)[2])])

  loss$item()
}

for (epoch in 1:num_epochs) {
  net$train()
  train_loss <- c()

  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.0)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))

  net$eval()
  valid_loss <- c()

  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })

  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
# Epoch 1, training: loss: 0.83752
# Epoch 1, validation: loss: 0.83167
# Epoch 2, training: loss: 0.72803
# Epoch 2, validation: loss: 0.80804
# ...
# ...
# Epoch 99, training: loss: 0.10385
# Epoch 99, validation: loss: 0.21259
# Epoch 100, training: loss: 0.10396
# Epoch 100, validation: loss: 0.20975
For visual inspection, we pick a few forecasts from the test set.
net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

vic_elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4)

coro::loop(for (b in test_dl) {
  output <- net(b$x, b$y, teacher_forcing_ratio = 0)
  preds <- as.numeric(output)

  test_preds[[i]] <- preds
  i <<- i + 1
})

# pad each selected forecast with NAs so that it lines up with the dates it refers to
test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_test) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[21]]
test_pred2 <- c(rep(NA, n_timesteps + 20), test_pred2, rep(NA, nrow(vic_elec_test) - 20 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[41]]
test_pred3 <- c(rep(NA, n_timesteps + 40), test_pred3, rep(NA, nrow(vic_elec_test) - 40 - n_timesteps - n_forecast))

test_pred4 <- test_preds[[61]]
test_pred4 <- c(rep(NA, n_timesteps + 60), test_pred4, rep(NA, nrow(vic_elec_test) - 60 - n_timesteps - n_forecast))

test_pred5 <- test_preds[[81]]
test_pred5 <- c(rep(NA, n_timesteps + 80), test_pred5, rep(NA, nrow(vic_elec_test) - 80 - n_timesteps - n_forecast))

# undo the standardization and reshape for plotting
preds_ts <- vic_elec_test %>%
  select(Demand, Date) %>%
  add_column(
    ex_1 = test_pred1 * train_sd + train_mean,
    ex_2 = test_pred2 * train_sd + train_mean,
    ex_3 = test_pred3 * train_sd + train_mean,
    ex_4 = test_pred4 * train_sd + train_mean,
    ex_5 = test_pred5 * train_sd + train_mean) %>%
  pivot_longer(-Date) %>%
  update_tsibble(key = name)

preds_ts %>%
  autoplot() +
  scale_color_hue(h = c(80, 300), l = 70) +
  theme_minimal()

Figure 1: A sample of two-weeks-ahead predictions for the test set, 2014.
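The five padding steps in the plotting code all follow the same pattern: a block of NAs covering the input window plus an offset, then the forecast, then NAs up to the end of the test period. As a sketch, this could be wrapped in a small helper. The name pad_forecast and its arguments are hypothetical, and the numbers in the example are toy values.

```r
# hypothetical helper: place a forecast inside an NA-padded vector of length n_total
pad_forecast <- function(preds, offset, n_timesteps, n_total) {
  n_before <- n_timesteps + offset
  n_after <- n_total - n_before - length(preds)
  c(rep(NA, n_before), preds, rep(NA, n_after))
}

# toy example: a 3-step forecast, input window of 4, offset of 2, 12 dates in total
padded <- pad_forecast(preds = c(0.1, 0.2, 0.3), offset = 2, n_timesteps = 4, n_total = 12)

padded
length(padded)           # 12
which(!is.na(padded))    # 7 8 9
```

With such a helper, each padded series would reduce to a single call, for example with an offset of 20 for the forecast stored at position 21.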
We can’t directly compare performance here to that of previous models in our series, as we’ve pragmatically redefined the task. The main goal, however, has been to introduce the concept of attention, and specifically, how to implement the technique manually. This is something that, once you’ve understood the concept, you may never have to do in practice. Instead, you would likely make use of existing tools that come with torch (multi-head attention and transformer modules), tools we may introduce in a future “season” of this series.
Thanks for reading!
Photo by David Clode on Unsplash