Today, we continue our exploration of multi-step time-series forecasting with torch. This post is the third in a series.
- First, we covered the basics of recurrent neural networks (RNNs), and trained a model to predict the very next value in a sequence. We also found we could forecast quite a few steps ahead by feeding back individual predictions in a loop.
- Next, we built a model "natively" for multi-step prediction. A small multi-layer perceptron (MLP) was used to project RNN output to several time points in the future.
Of both approaches, the latter was the more successful. But conceptually, it has an unsatisfying touch to it: When the MLP extrapolates and generates output for, say, ten consecutive points in time, there is no causal relation between those. (Imagine a weather forecast for ten days that never got updated.)
Now, we'd like to try something more intuitively appealing. The input is a sequence; the output is a sequence. In natural language processing (NLP), this type of task is very common: It's exactly the kind of situation we see with machine translation or summarization.
Quite fittingly, the types of models employed to these ends are named sequence-to-sequence models (often abbreviated seq2seq). In a nutshell, they split up the task into two parts: an encoding and a decoding part. The former is done just once per input-target pair. The latter is done in a loop, as in our first try. But the decoder has more information at its disposal: At each iteration, its processing is based on the previous prediction as well as on previous state. That previous state will be the encoder's when a loop is started, and its own ever thereafter.
Before discussing the model in detail, we need to adapt our data input mechanism.
We continue working with vic_elec, provided by tsibbledata.
Again, the dataset definition in the current post looks a bit different from the way it did before; it's the shape of the target that differs. This time, y equals x, shifted to the left by one.
The reason we do this is owed to the way we are going to train the network. With seq2seq, people often use a technique called "teacher forcing" where, instead of feeding back its own prediction into the decoder module, you pass it the value it should have predicted. To be clear, this is done during training only, and to a configurable degree.
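To make the shifted target concrete, here is a tiny illustration in plain R (a toy example with made-up values, not part of the actual pipeline):

x <- c(10, 20, 30, 40, 50)
lag <- 1
# y is x shifted left by one: at each position, y holds the value
# that follows in x (the final position has no successor)
y <- x[(1 + lag):(length(x) + lag)]
y
# [1] 20 30 40 50 NA

In the dataset definition below, the windows are chosen so that the shifted slice always stays inside the series, so no NA padding arises. During training, teacher forcing will, at decoding step t, feed the decoder y[t - 1], the value it should have predicted at the previous step, instead of its own prior forecast.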
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)
library(zeallot)
n_timesteps <- 7 * 24 * 2
n_forecast <- n_timesteps

vic_elec_get_year <- function(year, month = NULL) {
  vic_elec %>%
    filter(year(Date) == year, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    select(Demand)
}

elec_train <- vic_elec_get_year(2012) %>% as.matrix()
elec_valid <- vic_elec_get_year(2013) %>% as.matrix()
elec_test <- vic_elec_get_year(2014, 1) %>% as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

elec_dataset <- dataset(
  name = "elec_dataset",

  initialize = function(x, n_timesteps, sample_frac = 1) {
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)

    n <- length(self$x) - self$n_timesteps - 1

    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
  },

  .getitem = function(i) {
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1

    list(
      x = self$x[start:end],
      y = self$x[(start + lag):(end + lag)]$squeeze(2)
    )
  },

  .length = function() {
    length(self$starts)
  }
)
Dataset as well as dataloader instantiations then can proceed as before.
batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
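As a quick sanity check (an aside, not part of the original code), we can pull a single batch from train_dl and verify the shapes implied by the dataset definition:

# inspect one batch: x should be (batch_size, n_timesteps, 1),
# y should be (batch_size, n_timesteps)
b <- train_dl %>% dataloader_make_iter() %>% dataloader_next()
dim(b$x) # 32 336 1
dim(b$y) # 32 336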
Technically, the model consists of three modules: the aforementioned encoder and decoder, and the seq2seq module that orchestrates them.
Encoder
The encoder takes its input and runs it through an RNN. Of the two things returned by a recurrent neural network, outputs and state, so far we have only been using output. This time, we do the opposite: We throw away the outputs, and only return the state.
If the RNN in question is a GRU (and assuming that of the outputs, we take just the final time step, which is what we have been doing throughout), there really is no difference: The final state equals the final output. If it's an LSTM, however, there is a second kind of state, the "cell state". In that case, returning the state instead of the final output will carry more information.
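For illustration (this snippet is an aside, not part of the model code), here is a minimal check of that difference, assuming a single layer and batch_first = TRUE:

# for an LSTM, position 2 of the return value is a list of two tensors
# (hidden state and cell state); for a GRU, it is a single tensor
lstm <- nn_lstm(input_size = 1, hidden_size = 4, batch_first = TRUE)
res <- lstm(torch_randn(2, 8, 1))
length(res[[2]])   # 2
dim(res[[2]][[1]]) # 1 2 4, i.e. (num_layers, batch_size, hidden_size)

gru <- nn_gru(input_size = 1, hidden_size = 4, batch_first = TRUE)
dim(gru(torch_randn(2, 8, 1))[[2]]) # 1 2 4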
encoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {

    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }

  },

  forward = function(x) {

    x <- self$rnn(x)

    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x

  }

)
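To see what the encoder delivers, we can run it on random data (a quick illustration with arbitrary shapes, not part of the original post): a batch of shape (batch_size, n_timesteps, 1) is mapped to a state of shape (num_layers, batch_size, hidden_size).

# illustrative only: GRU encoder state for a random batch of four sequences
enc <- encoder_module("gru", input_size = 1, hidden_size = 32)
h <- enc(torch_randn(4, n_timesteps, 1))
dim(h) # 1 4 32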
Decoder
In the decoder, just like in the encoder, the main component is an RNN. In contrast to previously-shown architectures, though, it does not just return a prediction. It also reports back the RNN's final state.
decoder_module <- nn_module(

  initialize = function(type, input_size, hidden_size, num_layers = 1) {

    self$type <- type

    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }

    self$linear <- nn_linear(hidden_size, 1)

  },

  forward = function(x, state) {

    # input to forward:
    # x is (batch_size, 1, 1)
    # state is (1, batch_size, hidden_size)
    x <- self$rnn(x, state)

    # break apart RNN return values
    # output is (batch_size, 1, hidden_size)
    # next_hidden is (1, batch_size, hidden_size)
    c(output, next_hidden) %<-% x

    output <- output$squeeze(2)
    output <- self$linear(output)

    list(output, next_hidden)

  }

)
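Analogously, a single decoding step can be illustrated like this (again just a sketch with a zero-initialized state, not taken from the post): the decoder receives one time step of shape (batch_size, 1, 1) plus a state, and returns a prediction of shape (batch_size, 1) together with the updated state.

# illustrative only: one decoding step for a batch of four
dec <- decoder_module("gru", input_size = 1, hidden_size = 32)
step <- dec(torch_randn(4, 1, 1), torch_zeros(1, 4, 32))
dim(step[[1]]) # 4 1
dim(step[[2]]) # 1 4 32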
seq2seq module
seq2seq is where the action happens. The plan is to encode once, then call the decoder in a loop.
If you look back to decoder forward(), you see that it takes two arguments: x and state.
Depending on the context, x corresponds to one of three things: final input, preceding prediction, or prior ground truth.
- The very first time the decoder is called on an input sequence, x maps to the final input value. This is different from a task like machine translation, where you would pass in a start token. With time series, though, we'd like to continue where the actual measurements stop.
- In further calls, we want the decoder to continue from its most recent prediction. It is only logical, thus, to pass back the preceding forecast.
- That said, in NLP a technique called "teacher forcing" is commonly used to speed up training. With teacher forcing, instead of the forecast we pass the actual ground truth, the thing the decoder should have predicted. We do that only in a configurable fraction of cases, and, naturally, only while training. The rationale behind this technique is that without this kind of re-calibration, consecutive prediction errors can quickly erase any remaining signal.
state, too, is polyvalent. But here, there are just two possibilities: encoder state and decoder state.
- The first time the decoder is called, it is "seeded" with the final state from the encoder. Note how this is the only time we make use of the encoding.
- From then on, the decoder's own previous state will be passed. Remember how it returns two values, forecast and state?
seq2seq_module <- nn_module(

  initialize = function(type, input_size, hidden_size, n_forecast, num_layers = 1, encoder_dropout = 0) {

    self$encoder <- encoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = input_size,
                                   hidden_size = hidden_size, num_layers)
    self$n_forecast <- n_forecast

  },

  forward = function(x, y, teacher_forcing_ratio) {

    # prepare empty output
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)

    # encode current input sequence
    hidden <- self$encoder(x)

    # prime decoder with final input value and hidden state from the encoder
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden)

    # decompose into predictions and decoder state
    # pred is (batch_size, 1)
    # state is (1, batch_size, hidden_size)
    c(pred, state) %<-% out

    # store first prediction
    outputs[ , 1] <- pred$squeeze(2)

    # iterate to generate remaining forecasts
    for (t in 2:self$n_forecast) {

      # call decoder on either ground truth or previous prediction, plus previous decoder state
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state)

      # again, decompose decoder return values
      c(pred, state) %<-% out
      # and store current prediction
      outputs[ , t] <- pred$squeeze(2)
    }

    outputs
  }

)

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, n_forecast = n_forecast)

# training RNNs on the GPU currently prints a warning that may clutter
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)
The training procedure is basically unchanged. We do, however, need to make a decision about teacher_forcing_ratio, the proportion of input sequences we want to perform re-calibration on. In valid_batch(), this should always be 0, while in train_batch(), it is up to us (or rather, experimentation). Here, we set it to 0.3.
optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 50

train_batch <- function(b, teacher_forcing_ratio) {

  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()

  loss$item()

}

valid_batch <- function(b, teacher_forcing_ratio = 0) {

  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)

  loss <- nnf_mse_loss(output, target)

  loss$item()

}

for (epoch in 1:num_epochs) {

  net$train()
  train_loss <- c()

  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })

  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))

  net$eval()
  valid_loss <- c()

  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })

  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
Epoch 1, training: loss: 0.37961
Epoch 1, validation: loss: 1.10699
Epoch 2, training: loss: 0.19355
Epoch 2, validation: loss: 1.26462
# ...
# ...
Epoch 49, training: loss: 0.03233
Epoch 49, validation: loss: 0.62286
Epoch 50, training: loss: 0.03091
Epoch 50, validation: loss: 0.54457
It is interesting to compare performances for different settings of teacher_forcing_ratio. With a setting of 0.5, training loss decreases a lot more slowly; the opposite is seen with a setting of 0. Validation loss, however, is not affected significantly.
The code to inspect test-set forecasts is unchanged.
net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

coro::loop(for (b in test_dl) {

  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)

  test_preds[[i]] <- preds
  i <- i + 1

})

vic_elec_jan_2014 <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1)

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[408]]
test_pred2 <- c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[817]]
test_pred3 <- c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)

preds_ts <- vic_elec_jan_2014 %>%
  select(Demand) %>%
  add_column(
    mlp_ex_1 = test_pred1 * train_sd + train_mean,
    mlp_ex_2 = test_pred2 * train_sd + train_mean,
    mlp_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = name)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()
Figure 1: One-week-ahead predictions for January, 2014.
Comparing this to the forecast obtained from last time's RNN-MLP combo, we don't see much of a difference. Is that surprising? To me it is. If asked to speculate about the reason, I would probably say this: In all of the architectures we have used so far, the main carrier of information has been the final hidden state of the RNN (one and only RNN in the two previous setups, encoder RNN in this one). It will be interesting to see what happens in the last part of this series, when we augment the encoder-decoder architecture by attention.
Thanks for reading!
Photo by Suzuha Kozuki on Unsplash