Introductory time-series forecasting with torch

April 26, 2025

120

That is the primary publish in a collection introducing time-series forecasting with torch. It does assume some prior expertise with torch and/or deep studying. However so far as time collection are involved, it begins proper from the start, utilizing recurrent neural networks (GRU or LSTM) to foretell how one thing develops in time.

On this publish, we construct a community that makes use of a sequence of observations to foretell a worth for the very subsequent time limit. What if we’d prefer to forecast a sequence of values, akin to, say, every week or a month of measurements?

One factor we may do is feed again into the system the beforehand forecasted worth; that is one thing we’ll attempt on the finish of this publish. Subsequent posts will discover different choices, a few of them involving considerably extra complicated architectures. It is going to be fascinating to check their performances; however the important objective is to introduce some torch “recipes” that you would be able to apply to your individual information.

We begin by analyzing the dataset used. It’s a low-dimensional, however fairly polyvalent and sophisticated one.

The vic_elec dataset, obtainable by package deal tsibbledata, offers three years of half-hourly electrical energy demand for Victoria, Australia, augmented by same-resolution temperature info and a each day vacation indicator.

Rows: 52,608
Columns: 5
$ Time         2012-01-01 00:00:00, 2012-01-01 00:30:00, 2012-01-01 01:00:00,…
$ Demand       4382.825, 4263.366, 4048.966, 3877.563, 4036.230, 3865.597, 369…
$ Temperature  21.40, 21.05, 20.70, 20.55, 20.40, 20.25, 20.10, 19.60, 19.10, …
$ Date         2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-01, 20…
$ Vacation      TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…

Relying on what subset of variables is used, and whether or not and the way information is temporally aggregated, these information could serve for example quite a lot of totally different strategies. For instance, within the third version of Forecasting: Rules and Observe each day averages are used to show quadratic regression with ARMA errors. On this first introductory publish although, in addition to in most of its successors, we’ll try and forecast Demand with out counting on further info, and we preserve the unique decision.

To get an impression of how electrical energy demand varies over totally different timescales. Let’s examine information for 2 months that properly illustrate the U-shaped relationship between temperature and demand: January, 2014 and July, 2014.

First, right here is July.

vic_elec_2014   vic_elec %>%
  filter(yr(Date) == 2014) %>%
  choose(-c(Date, Vacation)) %>%
  mutate(Demand = scale(Demand), Temperature = scale(Temperature)) %>%
  pivot_longer(-Time, names_to = "variable") %>%
  update_tsibble(key = variable)

vic_elec_2014 %>% filter(month(Time) == 7) %>% 
  autoplot() + 
  scale_colour_manual(values = c("#08c5d1", "#00353f")) +
  theme_minimal()

Temperature and electricity demand (normalized). Victoria, Australia, 07/2014.

Determine 1: Temperature and electrical energy demand (normalized). Victoria, Australia, 07/2014.

It’s winter; temperature fluctuates under common, whereas electrical energy demand is above common (heating). There’s sturdy variation over the course of the day; we see troughs within the demand curve akin to ridges within the temperature graph, and vice versa. Whereas diurnal variation dominates, there is also variation over the times of the week. Between weeks although, we don’t see a lot distinction.

Evaluate this with the information for January:

vic_elec_2014 %>% filter(month(Time) == 1) %>% 
  autoplot() + 
  scale_colour_manual(values = c("#08c5d1", "#00353f")) +
  theme_minimal()

Temperature and electricity demand (normalized). Victoria, Australia, 01/2014.

Determine 2: Temperature and electrical energy demand (normalized). Victoria, Australia, 01/2014.

We nonetheless see the sturdy circadian variation. We nonetheless see some day-of-week variation. However now it’s excessive temperatures that trigger elevated demand (cooling). Additionally, there are two intervals of unusually excessive temperatures, accompanied by distinctive demand. We anticipate that in a univariate forecast, not making an allowance for temperature, this can be onerous – and even, inconceivable – to forecast.

Let’s see a concise portrait of how Demand behaves utilizing feasts::STL(). First, right here is the decomposition for July:

vic_elec_2014   vic_elec %>%
  filter(yr(Date) == 2014) %>%
  choose(-c(Date, Vacation))

cmp  vic_elec_2014 %>% filter(month(Time) == 7) %>%
  mannequin(STL(Demand)) %>% 
  elements()

cmp %>% autoplot()

STL decomposition of electricity demand. Victoria, Australia, 07/2014.

Determine 3: STL decomposition of electrical energy demand. Victoria, Australia, 07/2014.

And right here, for January:

STL decomposition of electricity demand. Victoria, Australia, 01/2014.

Determine 4: STL decomposition of electrical energy demand. Victoria, Australia, 01/2014.

Each properly illustrate the sturdy circadian and weekly seasonalities (with diurnal variation considerably stronger in January). If we glance carefully, we will even see how the development element is extra influential in January than in July. This once more hints at a lot stronger difficulties predicting the January than the July developments.

Now that we have now an thought what awaits us, let’s start by making a torch dataset.

Here’s what we intend to do. We wish to begin our journey into forecasting through the use of a sequence of observations to foretell their instant successor. In different phrases, the enter (x) for every batch merchandise is a vector, whereas the goal (y) is a single worth. The size of the enter sequence, x, is parameterized as n_timesteps, the variety of consecutive observations to extrapolate from.

The dataset will replicate this in its .getitem() methodology. When requested for the observations at index i, it can return tensors like so:

checklist(
      x = self$x[start:end],
      y = self$x[end+1]
)

the place begin:finish is a vector of indices, of size n_timesteps, and finish+1 is a single index.

Now, if the dataset simply iterated over its enter so as, advancing the index one after the other, these strains may merely learn

checklist(
      x = self$x[i:(i + self$n_timesteps - 1)],
      y = self$x[self$n_timesteps + i]
)

Since many sequences within the information are comparable, we will scale back coaching time by making use of a fraction of the information in each epoch. This may be achieved by (optionally) passing a sample_frac smaller than 1. In initialize(), a random set of begin indices is ready; .getitem() then simply does what it usually does: search for the (x,y) pair at a given index.

Right here is the entire dataset code:

elec_dataset  dataset(
  identify = "elec_dataset",
  
  initialize = perform(x, n_timesteps, sample_frac = 1) {

    self$n_timesteps  n_timesteps
    self$x  torch_tensor((x - train_mean) / train_sd)
    
    n  size(self$x) - self$n_timesteps 
    
    self$begins  type(pattern.int(
      n = n,
      measurement = n * sample_frac
    ))

  },
  
  .getitem = perform(i) {
    
    begin  self$begins[i]
    finish  begin + self$n_timesteps - 1
    
    checklist(
      x = self$x[start:end],
      y = self$x[end + 1]
    )

  },
  
  .size = perform() {
    size(self$begins) 
  }
)

You might have observed that we normalize the information by globally outlined train_mean and train_sd. We but should calculate these.

The way in which we break up the information is simple. We use the entire of 2012 for coaching, and all of 2013 for validation. For testing, we take the “troublesome” month of January, 2014. You’re invited to check testing outcomes for July that very same yr, and evaluate performances.

vic_elec_get_year  perform(yr, month = NULL) {
  vic_elec %>%
    filter(yr(Date) == yr, month(Date) == if (is.null(month)) month(Date) else month) %>%
    as_tibble() %>%
    choose(Demand)
}

elec_train  vic_elec_get_year(2012) %>% as.matrix()
elec_valid  vic_elec_get_year(2013) %>% as.matrix()
elec_test  vic_elec_get_year(2014, 1) %>% as.matrix() # or 2014, 7, alternatively

train_mean  imply(elec_train)
train_sd  sd(elec_train)

Now, to instantiate a dataset, we nonetheless want to select sequence size. From prior inspection, every week looks like a good selection.

n_timesteps  7 * 24 * 2 # days * hours * half-hours

Now we will go forward and create a dataset for the coaching information. Let’s say we’ll make use of fifty% of the information in every epoch:

train_ds  elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
size(train_ds)

Fast test: Are the shapes right?

$x
torch_tensor
-0.4141
-0.5541
[...]       ### strains eliminated by me
 0.8204
 0.9399
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{336,1} ]

$y
torch_tensor
-0.6771
[ CPUFloatType{1} ]

Sure: That is what we wished to see. The enter sequence has n_timesteps values within the first dimension, and a single one within the second, akin to the one function current, Demand. As supposed, the prediction tensor holds a single worth, corresponding– as we all know – to n_timesteps+1.

That takes care of a single input-output pair. As ordinary, batching is organized for by torch’s dataloader class. We instantiate one for the coaching information, and instantly once more confirm the result:

batch_size  32
train_dl  train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)
size(train_dl)

b  train_dl %>% dataloader_make_iter() %>% dataloader_next()
b

$x
torch_tensor
(1,.,.) = 
  0.4805
  0.3125
[...]       ### strains eliminated by me
 -1.1756
 -0.9981
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{32,336,1} ]

$y
torch_tensor
 0.1890
 0.5405
[...]       ### strains eliminated by me
 2.4015
 0.7891
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{32,1} ]

We see the added batch dimension in entrance, leading to general form (batch_size, n_timesteps, num_features). That is the format anticipated by the mannequin, or extra exactly, by its preliminary RNN layer.

Earlier than we go on, let’s rapidly create datasets and dataloaders for validation and take a look at information, as properly.

valid_ds  elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl  valid_ds %>% dataloader(batch_size = batch_size)

test_ds  elec_dataset(elec_test, n_timesteps)
test_dl  test_ds %>% dataloader(batch_size = 1)

The mannequin consists of an RNN – of sort GRU or LSTM, as per the consumer’s selection – and an output layer. The RNN does many of the work; the single-neuron linear layer that outputs the prediction compresses its vector enter to a single worth.

Right here, first, is the mannequin definition.

mannequin  nn_module(
  
  initialize = perform(sort, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$sort  sort
    self$num_layers  num_layers
    
    self$rnn  if (self$sort == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
    self$output  nn_linear(hidden_size, 1)
    
  },
  
  ahead = perform(x) {
    
    # checklist of [output, hidden]
    # we use the output, which is of measurement (batch_size, n_timesteps, hidden_size)
    x  self$rnn(x)[[1]]
    
    # from the output, we solely need the ultimate timestep
    # form now's (batch_size, hidden_size)
    x  x[ , dim(x)[2], ]
    
    # feed this to a single output neuron
    # last form then is (batch_size, 1)
    x %>% self$output() 
  }
  
)

Most significantly, that is what occurs in ahead().

The RNN returns a listing. The checklist holds two tensors, an output, and a synopsis of hidden states. We discard the state tensor, and preserve the output solely. The excellence between state and output, or slightly, the best way it’s mirrored in what a torch RNN returns, deserves to be inspected extra carefully. We’ll try this in a second.
Of the output tensor, we’re involved in solely the ultimate time-step, although.
Solely this one, thus, is handed to the output layer.
Lastly, the mentioned output layer’s output is returned.

Now, a bit extra on states vs. outputs. Think about Fig. 1, from Goodfellow, Bengio, and Courville (2016).

Let’s fake there are three time steps solely, akin to (t-1), (t), and (t+1). The enter sequence, accordingly, consists of (x_{t-1}), (x_{t}), and (x_{t+1}).

At every (t), a hidden state is generated, and so is an output. Usually, if our objective is to foretell (y_{t+2}), that’s, the very subsequent remark, we wish to keep in mind the entire enter sequence. Put in another way, we wish to have run by the entire equipment of state updates. The logical factor to do would thus be to decide on (o_{t+1}), for both direct return from ahead() or for additional processing.

Certainly, return (o_{t+1}) is what a Keras LSTM or GRU would do by default. Not so its torch counterparts. In torch, the output tensor includes all of (o). That is why, in step two above, we choose the only time step we’re involved in – specifically, the final one.

In later posts, we’ll make use of greater than the final time step. Generally, we’ll use the sequence of hidden states (the (h)s) as an alternative of the outputs (the (o)s). So you might really feel like asking, what if we used (h_{t+1}) right here as an alternative of (o_{t+1})? The reply is: With a GRU, this is able to not make a distinction, as these two are equivalent. With LSTM although, it might, as LSTM retains a second, specifically, the “cell,” state.

On to initialize(). For ease of experimentation, we instantiate both a GRU or an LSTM based mostly on consumer enter. Two issues are price noting:

We go batch_first = TRUE when creating the RNNs. That is required with torch RNNs after we wish to constantly have batch objects stacked within the first dimension. And we do need that; it’s arguably much less complicated than a change of dimension semantics for one sub-type of module.
num_layers can be utilized to construct a stacked RNN, akin to what you’d get in Keras when chaining two GRUs/LSTMs (the primary one created with return_sequences = TRUE). This parameter, too, we’ve included for fast experimentation.

Let’s instantiate a mannequin for coaching. It is going to be a single-layer GRU with thirty-two models.

# coaching RNNs on the GPU at the moment prints a warning that will muddle 
# the console
# see https://github.com/mlverse/torch/points/461
# alternatively, use 
# system 
system  torch_device(if (cuda_is_available()) "cuda" else "cpu")

internet  mannequin("gru", 1, 32)
internet  internet$to(system = system)

In any case these RNN specifics, the coaching course of is totally normal.

optimizer  optim_adam(internet$parameters, lr = 0.001)

num_epochs  30

train_batch  perform(b) {
  
  optimizer$zero_grad()
  output  internet(b$x$to(system = system))
  goal  b$y$to(system = system)
  
  loss  nnf_mse_loss(output, goal)
  loss$backward()
  optimizer$step()
  
  loss$merchandise()
}

valid_batch  perform(b) {
  
  output  internet(b$x$to(system = system))
  goal  b$y$to(system = system)
  
  loss  nnf_mse_loss(output, goal)
  loss$merchandise()
  
}

for (epoch in 1:num_epochs) {
  
  internet$practice()
  train_loss  c()
  
  coro::loop(for (b in train_dl) {
    loss train_batch(b)
    train_loss  c(train_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, coaching: loss: %3.5f n", epoch, imply(train_loss)))
  
  internet$eval()
  valid_loss  c()
  
  coro::loop(for (b in valid_dl) {
    loss  valid_batch(b)
    valid_loss  c(valid_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, validation: loss: %3.5f n", epoch, imply(valid_loss)))
}

Epoch 1, coaching: loss: 0.21908 

Epoch 1, validation: loss: 0.05125 

Epoch 2, coaching: loss: 0.03245 

Epoch 2, validation: loss: 0.03391 

Epoch 3, coaching: loss: 0.02346 

Epoch 3, validation: loss: 0.02321 

Epoch 4, coaching: loss: 0.01823 

Epoch 4, validation: loss: 0.01838 

Epoch 5, coaching: loss: 0.01522 

Epoch 5, validation: loss: 0.01560 

Epoch 6, coaching: loss: 0.01315 

Epoch 6, validation: loss: 0.01374 

Epoch 7, coaching: loss: 0.01205 

Epoch 7, validation: loss: 0.01200 

Epoch 8, coaching: loss: 0.01155 

Epoch 8, validation: loss: 0.01157 

Epoch 9, coaching: loss: 0.01118 

Epoch 9, validation: loss: 0.01096 

Epoch 10, coaching: loss: 0.01070 

Epoch 10, validation: loss: 0.01132 

Epoch 11, coaching: loss: 0.01003 

Epoch 11, validation: loss: 0.01150 

Epoch 12, coaching: loss: 0.00943 

Epoch 12, validation: loss: 0.01106 

Epoch 13, coaching: loss: 0.00922 

Epoch 13, validation: loss: 0.01069 

Epoch 14, coaching: loss: 0.00862 

Epoch 14, validation: loss: 0.01125 

Epoch 15, coaching: loss: 0.00842 

Epoch 15, validation: loss: 0.01095 

Epoch 16, coaching: loss: 0.00820 

Epoch 16, validation: loss: 0.00975 

Epoch 17, coaching: loss: 0.00802 

Epoch 17, validation: loss: 0.01120 

Epoch 18, coaching: loss: 0.00781 

Epoch 18, validation: loss: 0.00990 

Epoch 19, coaching: loss: 0.00757 

Epoch 19, validation: loss: 0.01017 

Epoch 20, coaching: loss: 0.00735 

Epoch 20, validation: loss: 0.00932 

Epoch 21, coaching: loss: 0.00723 

Epoch 21, validation: loss: 0.00901 

Epoch 22, coaching: loss: 0.00708 

Epoch 22, validation: loss: 0.00890 

Epoch 23, coaching: loss: 0.00676 

Epoch 23, validation: loss: 0.00914 

Epoch 24, coaching: loss: 0.00666 

Epoch 24, validation: loss: 0.00922 

Epoch 25, coaching: loss: 0.00644 

Epoch 25, validation: loss: 0.00869 

Epoch 26, coaching: loss: 0.00620 

Epoch 26, validation: loss: 0.00902 

Epoch 27, coaching: loss: 0.00588 

Epoch 27, validation: loss: 0.00896 

Epoch 28, coaching: loss: 0.00563 

Epoch 28, validation: loss: 0.00886 

Epoch 29, coaching: loss: 0.00547 

Epoch 29, validation: loss: 0.00895 

Epoch 30, coaching: loss: 0.00523 

Epoch 30, validation: loss: 0.00935

Loss decreases rapidly, and we don’t appear to be overfitting on the validation set.

Numbers are fairly summary, although. So, we’ll use the take a look at set to see how the forecast really seems to be.

Right here is the forecast for January, 2014, thirty minutes at a time.

internet$eval()

preds  rep(NA, n_timesteps)

coro::loop(for (b in test_dl) {
  output  internet(b$x$to(system = system))
  preds  c(preds, output %>% as.numeric())
})

vic_elec_jan_2014   vic_elec %>%
  filter(yr(Date) == 2014, month(Date) == 1) %>%
  choose(Demand)

preds_ts  vic_elec_jan_2014 %>%
  add_column(forecast = preds * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = identify)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f")) +
  theme_minimal()

Determine 6: One-step-ahead predictions for January, 2014.

Total, the forecast is superb, however it’s fascinating to see how the forecast “regularizes” probably the most excessive peaks. This type of “regression to the imply” can be seen far more strongly in later setups, after we attempt to forecast additional into the long run.

Can we use our present structure for multi-step prediction? We are able to.

One factor we will do is feed again the present prediction, that’s, append it to the enter sequence as quickly as it’s obtainable. Successfully thus, for every batch merchandise, we acquire a sequence of predictions in a loop.

We’ll attempt to forecast 336 time steps, that’s, an entire week.

n_forecast  2 * 24 * 7

test_preds  vector(mode = "checklist", size = size(test_dl))

i  1

coro::loop(for (b in test_dl) {
  
  enter  b$x
  output  internet(enter$to(system = system))
  preds  as.numeric(output)
  
  for(j in 2:n_forecast) {
    enter  torch_cat(checklist(enter[ , 2:length(input), ], output$view(c(1, 1, 1))), dim = 2)
    output  internet(enter$to(system = system))
    preds  c(preds, as.numeric(output))
  }
  
  test_preds[[i]]  preds
  i  i + 1
  
})

For visualization, let’s decide three non-overlapping sequences.

test_pred1  test_preds[[1]]
test_pred1  c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_jan_2014) - n_timesteps - n_forecast))

test_pred2  test_preds[[408]]
test_pred2  c(rep(NA, n_timesteps + 407), test_pred2, rep(NA, nrow(vic_elec_jan_2014) - 407 - n_timesteps - n_forecast))

test_pred3  test_preds[[817]]
test_pred3  c(rep(NA, nrow(vic_elec_jan_2014) - n_forecast), test_pred3)


preds_ts  vic_elec %>%
  filter(yr(Date) == 2014, month(Date) == 1) %>%
  choose(Demand) %>%
  add_column(
    iterative_ex_1 = test_pred1 * train_sd + train_mean,
    iterative_ex_2 = test_pred2 * train_sd + train_mean,
    iterative_ex_3 = test_pred3 * train_sd + train_mean) %>%
  pivot_longer(-Time) %>%
  update_tsibble(key = identify)

preds_ts %>%
  autoplot() +
  scale_colour_manual(values = c("#08c5d1", "#00353f", "#ffbf66", "#d46f4d")) +
  theme_minimal()

Determine 7: Multi-step predictions for January, 2014, obtained in a loop.

Even with this very fundamental forecasting approach, the diurnal rhythm is preserved, albeit in a strongly smoothed type. There even is an obvious day-of-week periodicity within the forecast. We do see, nevertheless, very sturdy regression to the imply, even in loop cases the place the community was “primed” with a better enter sequence.

Hopefully this publish offered a helpful introduction to time collection forecasting with torch. Evidently, we picked a difficult time collection – difficult, that’s, for a minimum of two causes:

To accurately issue within the development, exterior info is required: exterior info in type of a temperature forecast, which, “in actuality,” could be simply obtainable.
Along with the extremely necessary development element, the information are characterised by a number of ranges of seasonality.

Of those, the latter is much less of an issue for the strategies we’re working with right here. If we discovered that some degree of seasonality went undetected, we may attempt to adapt the present configuration in quite a few uncomplicated methods:

Use an LSTM as an alternative of a GRU. In idea, LSTM ought to higher have the ability to seize further lower-frequency elements as a consequence of its secondary storage, the cell state.
Stack a number of layers of GRU/LSTM. In idea, this could permit for studying a hierarchy of temporal options, analogously to what we see in a convolutional neural community.

To deal with the previous impediment, larger adjustments to the structure could be wanted. We could try to do this in a later, “bonus,” publish. However within the upcoming installments, we’ll first dive into often-used strategies for sequence prediction, additionally porting to numerical time collection issues which might be generally completed in pure language processing.

Thanks for studying!

Photograph by Nick Dunn on Unsplash

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Studying. MIT Press.

Previous articleAV-Comparatives Crowns McAfee as 2024’s Chief in On-line Safety and Velocity

Next articleshow – Utilizing 2009 imac with Excessive Sierra 10.13.6 for window 11 laptop computer utilizing

Introductory time-series forecasting with torch

An Implementation to Construct Dynamic AI Techniques with the Mannequin Context Protocol (MCP) for Actual-Time Useful resource and Instrument Integration

Microsoft AI Proposes BitNet Distillation (BitDistill): A Light-weight Pipeline that Delivers as much as 10x Reminiscence Financial savings and about 2.65x CPU Speedup

Weak-for-Robust (W4S): A Novel Reinforcement Studying Algorithm that Trains a weak Meta Agent to Design Agentic Workflows with Stronger LLMs

LEAVE A REPLY Cancel reply

Most Popular

One Large Leap for Chatbot-kind

Contained in the Thoughts of an Educator

HackyPi 2.0 Is Right here! – Hackster.io

AI, Digital Twins and the Way forward for Infrastructure

Recent Comments

ABOUT US

POPULAR POSTS

One Large Leap for Chatbot-kind

Contained in the Thoughts of an Educator

HackyPi 2.0 Is Right here! – Hackster.io

POPULAR CATEGORY