
Simple audio classification with torch


This article translates Daniel Falbel's 'Simple Audio Classification' article from tensorflow/keras to torch/torchaudio. The main goal is to introduce torchaudio and illustrate its contributions to the torch ecosystem. Here we focus on a popular dataset, the audio loader and the spectrogram transformer. An interesting side product is the parallel between torch and tensorflow, showing sometimes the differences, sometimes the similarities between them.

Downloading and Importing

torchaudio has the speechcommand_dataset built in. It filters out background_noise by default and lets us choose between versions v0.01 and v0.02.

# set an existing folder here to cache the dataset
DATASETS_PATH <- "~/datasets/"

# 1.4GB download
df <- speechcommand_dataset(
  root = DATASETS_PATH, 
  url = "speech_commands_v0.01",
  download = TRUE
)

# expect folder: _background_noise_
df$EXCEPT_FOLDER
# [1] "_background_noise_"

# number of audio files
length(df)
# [1] 64721

# a sample
sample <- df[1]

sample$waveform[, 1:10]
torch_tensor
0.0001 *
 0.9155  0.3052  1.8311  1.8311 -0.3052  0.3052  2.4414  0.9155 -0.9155 -0.6104
[ CPUFloatType{1,10} ]
sample$sample_rate
# 16000
sample$label
# bed

plot(sample$waveform[1], type = "l", col = "royalblue", main = sample$label)

Figure 1: A sample waveform for a 'bed'.

Classes
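The 30 labels live in df$classes, the same field the training loop below uses to decode predictions:

df$classes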

 [1] "mattress"    "hen"   "cat"    "canine"    "down"   "eight"  "5"  
 [8] "4"   "go"     "comfortable"  "home"  "left"   "marvin" "9"  
[15] "no"     "off"    "on"     "one"    "proper"  "seven"  "sheila"
[22] "six"    "cease"   "three"  "tree"   "two"    "up"     "wow"   
[29] "sure"    "zero"  

Generator Dataloader

torch::dataloader has the same task as data_generator defined in the original article. It is responsible for preparing batches – including shuffling, padding, one-hot encoding, etc. – and for taking care of parallelism and device I/O orchestration.

In torch we do this by passing the train/test subset to torch::dataloader and encapsulating all the batch setup logic inside a collate_fn() function.

At this point, dataloader(train_subset) would not work because the samples are not padded, so we need to build our own collate_fn() with the padding strategy.

I suggest using the following approach when implementing the collate_fn():

  1. begin with collate_fn <- function(batch) browser().
  2. instantiate dataloader with the collate_fn().
  3. create an environment by calling enumerate(dataloader) so you can ask it to retrieve a batch from the dataloader.
  4. run environment[[1]][[1]]. Now you should be sent inside collate_fn() with access to the batch input object.
  5. build the logic.
collate_fn <- function(batch) {
  browser()
}

ds_train <- dataloader(
  train_subset, 
  batch_size = 32, 
  shuffle = TRUE, 
  collate_fn = collate_fn
)

ds_train_env <- enumerate(ds_train)
ds_train_env[[1]][[1]]

The final collate_fn() pads the waveform to length 16001 and then stacks everything up together. At this point there are no spectrograms yet; we are going to make the spectrogram transformation part of the model architecture.

pad_sequence <- function(batch) {
    # Make all tensors in a batch the same length by padding with zeros
    batch <- sapply(batch, function(x) (x$t()))
    batch <- torch::nn_utils_rnn_pad_sequence(batch, batch_first = TRUE, padding_value = 0.)
    return(batch$permute(c(1, 3, 2)))
  }

# Final collate_fn
collate_fn <- function(batch) {
 # Input structure:
 # list of 32 lists: list(waveform, sample_rate, label, speaker_id, utterance_number)
 # Transpose it
 batch <- purrr::transpose(batch)
 tensors <- batch$waveform
 targets <- batch$label_index

 # Group the list of tensors into a batched tensor
 tensors <- pad_sequence(tensors)
 
 # target encoding
 targets <- torch::torch_stack(targets)

 list(tensors = tensors, targets = targets) # (64, 1, 16001)
}

The batch structure is:

  • batch[[1]]: waveforms – tensor with dimension (32, 1, 16001)
  • batch[[2]]: targets – tensor with dimension (32, 1)
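As a quick sanity check (my own sketch, not from the original post), we can re-instantiate the dataloader with the final collate_fn() and look at the first batch's dimensions, reusing the enumerate() idiom from above:

# one batch through the final collate_fn()
ds_check <- dataloader(train_subset, batch_size = 32, shuffle = TRUE, collate_fn = collate_fn)
batch <- enumerate(ds_check)[[1]]
dim(batch[[1]]) # 32 1 16001 (waveforms)
dim(batch[[2]]) # (32, 1) per the structure above (targets)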

Also, torchaudio comes with 3 loaders, av_loader, tuner_loader, and audiofile_loader – more to come. set_audio_backend() is used to set one of them as the audio loader. Their performance differs based on audio format (mp3 or wav). There is no perfect world yet: tuner_loader is best for mp3, audiofile_loader is best for wav, but neither of them has the option of partially loading a sample from an audio file without bringing all the data into memory first.

For a given audio backend, we need to pass it to each worker through the worker_init_fn() argument.

ds_train <- dataloader(
  train_subset, 
  batch_size = 128, 
  shuffle = TRUE, 
  collate_fn = collate_fn,
  num_workers = 16,
  worker_init_fn = function(.) {torchaudio::set_audio_backend("audiofile_loader")},
  worker_globals = c("pad_sequence") # pad_sequence is needed for collate_fn
)

ds_test <- dataloader(
  test_subset, 
  batch_size = 64, 
  shuffle = FALSE, 
  collate_fn = collate_fn,
  num_workers = 8,
  worker_globals = c("pad_sequence") # pad_sequence is needed for collate_fn
)

Model definition

Instead of keras::keras_model_sequential(), we are going to define a torch::nn_module(). As referenced by the original article, the model is based on this architecture for MNIST from this tutorial, and I'll call it 'DanielNN'.

dan_nn <- torch::nn_module(
  "DanielNN",
  
  initialize = function(
    window_size_ms = 30, 
    window_stride_ms = 10
  ) {
    
    # spectrogram spec
    window_size <- as.integer(16000*window_size_ms/1000)      # 480 samples
    stride <- as.integer(16000*window_stride_ms/1000)         # 160 samples
    fft_size <- as.integer(2^trunc(log(window_size, 2) + 1))  # next power of two: 512
    n_chunks <- length(seq(0, 16000, stride))                 # 101 frames
    
    self$spectrogram <- torchaudio::transform_spectrogram(
      n_fft = fft_size, 
      win_length = window_size, 
      hop_length = stride, 
      normalized = TRUE, 
      power = 2
    )
    
    # convs 2D
    self$conv1 <- torch::nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = c(3,3))
    self$conv2 <- torch::nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = c(3,3))
    self$conv3 <- torch::nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = c(3,3))
    self$conv4 <- torch::nn_conv2d(in_channels = 128, out_channels = 256, kernel_size = c(3,3))
    
    # denses
    self$dense1 <- torch::nn_linear(in_features = 14336, out_features = 128) # 256 * 14 * 4 after the conv/pool stack
    self$dense2 <- torch::nn_linear(in_features = 128, out_features = 30)
  },
  
  forward = function(x) {
    x %>% # (64, 1, 16001)
      self$spectrogram() %>% # (64, 1, 257, 101)
      torch::torch_add(0.01) %>%
      torch::torch_log() %>%
      self$conv1() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv2() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv3() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv4() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      torch::nnf_dropout(p = 0.25) %>%
      torch::torch_flatten(start_dim = 2) %>%
      
      self$dense1() %>%
      torch::nnf_relu() %>%
      torch::nnf_dropout(p = 0.5) %>%
      self$dense2() 
  }
)
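The spectrogram defaults work out to a 480-sample (30 ms) window, a 160-sample (10 ms) hop, and an FFT size of 512 (the next power of two above 480), which gives 512/2 + 1 = 257 frequency bins and 101 frames for a 16001-sample waveform. A minimal shape check (my own sketch, not part of the original post):

spec <- torchaudio::transform_spectrogram(
  n_fft = 512, win_length = 480, hop_length = 160,
  normalized = TRUE, power = 2
)
x <- torch::torch_randn(1, 1, 16001) # one padded waveform
dim(spec(x)) # 1 1 257 101, matching the (64, 1, 257, 101) comment above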

model <- dan_nn()


device <- torch::torch_device(if(torch::cuda_is_available()) "cuda" else "cpu")
model$to(device = device)

print(model)
An `nn_module` containing 2,226,846 parameters.

── Modules ──────────────────────────────────────────────────────
● spectrogram:  #0 parameters
● conv1:  #320 parameters
● conv2:  #18,496 parameters
● conv3:  #73,856 parameters
● conv4:  #295,168 parameters
● dense1:  #1,835,136 parameters
● dense2:  #3,870 parameters

Model fitting

Unlike in tensorflow, there is no model %>% compile(...) step in torch, so we are going to set the loss criterion, optimizer strategy and evaluation metrics explicitly in the training loop.

loss_criterion <- torch::nn_cross_entropy_loss()
optimizer <- torch::optim_adadelta(model$parameters, rho = 0.95, eps = 1e-7)
metrics <- list(acc = yardstick::accuracy_vec)
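yardstick::accuracy_vec(truth, estimate) takes two factors with identical levels and returns the share of matches; a tiny standalone illustration (my own, with made-up labels):

lv <- c("bed", "cat", "dog")
yardstick::accuracy_vec(
  factor(c("bed", "cat"), levels = lv), # truth
  factor(c("bed", "dog"), levels = lv)  # estimate
)
# [1] 0.5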

Training loop

library(glue)
library(progress)

pred_to_r <- function(x) {
  classes <- factor(df$classes)
  classes[as.numeric(x$to(device = "cpu"))]
}
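pred_to_r() converts the 1-based class indices coming out of argmax() back into label factors; for instance (hypothetical input tensor):

pred_to_r(torch::torch_tensor(c(1, 30)))
# [1] bed  zero
# (a factor with the 30 class levels)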

set_progress_bar <- function(total) {
  progress_bar$new(
    total = total, clear = FALSE, width = 70,
    format = ":current/:total [:bar] - :elapsed - loss: :loss - acc: :acc"
  )
}
epochs <- 20
losses <- c()
accs <- c()

for(epoch in seq_len(epochs)) {
  pb <- set_progress_bar(length(ds_train))
  pb$message(glue("Epoch {epoch}/{epochs}"))
  coro::loop(for(batch in ds_train) {
    optimizer$zero_grad()
    predictions <- model(batch[[1]]$to(device = device))
    targets <- batch[[2]]$to(device = device)
    loss <- loss_criterion(predictions, targets)
    loss$backward()
    optimizer$step()
    
    # eval reports
    prediction_r <- pred_to_r(predictions$argmax(dim = 2))
    targets_r <- pred_to_r(targets)
    acc <- metrics$acc(targets_r, prediction_r)
    accs <- c(accs, acc)
    loss_r <- as.numeric(loss$item())
    losses <- c(losses, loss_r)
    
    pb$tick(tokens = list(loss = round(mean(losses), 4), acc = round(mean(accs), 4)))
  })
}



# test
predictions_r <- c()
targets_r <- c()
coro::loop(for(batch_test in ds_test) {
  predictions <- model(batch_test[[1]]$to(device = device))
  targets <- batch_test[[2]]$to(device = device)
  predictions_r <- c(predictions_r, pred_to_r(predictions$argmax(dim = 2)))
  targets_r <- c(targets_r, pred_to_r(targets))
})
val_acc <- metrics$acc(factor(targets_r, levels = 1:30), factor(predictions_r, levels = 1:30))
cat(glue("val_acc: {val_acc}\n\n"))
Epoch 1/20                                                            
[W SpectralOps.cpp:590] Warning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (function operator())
354/354 [=========================] -  1m - loss: 2.6102 - acc: 0.2333
Epoch 2/20                                                            
354/354 [=========================] -  1m - loss: 1.9779 - acc: 0.4138
Epoch 3/20                                                            
354/354 [============================] -  1m - loss: 1.62 - acc: 0.519
Epoch 4/20                                                            
354/354 [=========================] -  1m - loss: 1.3926 - acc: 0.5859
Epoch 5/20                                                            
354/354 [==========================] -  1m - loss: 1.2334 - acc: 0.633
Epoch 6/20                                                            
354/354 [=========================] -  1m - loss: 1.1135 - acc: 0.6685
Epoch 7/20                                                            
354/354 [=========================] -  1m - loss: 1.0199 - acc: 0.6961
Epoch 8/20                                                            
354/354 [=========================] -  1m - loss: 0.9444 - acc: 0.7181
Epoch 9/20                                                            
354/354 [=========================] -  1m - loss: 0.8816 - acc: 0.7365
Epoch 10/20                                                           
354/354 [=========================] -  1m - loss: 0.8278 - acc: 0.7524
Epoch 11/20                                                           
354/354 [=========================] -  1m - loss: 0.7818 - acc: 0.7659
Epoch 12/20                                                           
354/354 [=========================] -  1m - loss: 0.7413 - acc: 0.7778
Epoch 13/20                                                           
354/354 [=========================] -  1m - loss: 0.7064 - acc: 0.7881
Epoch 14/20                                                           
354/354 [=========================] -  1m - loss: 0.6751 - acc: 0.7974
Epoch 15/20                                                           
354/354 [=========================] -  1m - loss: 0.6469 - acc: 0.8058
Epoch 16/20                                                           
354/354 [=========================] -  1m - loss: 0.6216 - acc: 0.8133
Epoch 17/20                                                           
354/354 [=========================] -  1m - loss: 0.5985 - acc: 0.8202
Epoch 18/20                                                           
354/354 [=========================] -  1m - loss: 0.5774 - acc: 0.8263
Epoch 19/20                                                           
354/354 [==========================] -  1m - loss: 0.5582 - acc: 0.832
Epoch 20/20                                                           
354/354 [=========================] -  1m - loss: 0.5403 - acc: 0.8374
val_acc: 0.876705979296493

Making predictions

We already have all the predictions calculated for test_subset; let's recreate the alluvial plot from the original article.

library(dplyr)
library(alluvial)
df_validation <- data.frame(
  pred_class = df$classes[predictions_r],
  class = df$classes[targets_r]
)
x <- df_validation %>%
  mutate(correct = pred_class == class) %>%
  count(pred_class, class, correct)

alluvial(
  x %>% select(class, pred_class),
  freq = x$n,
  col = ifelse(x$correct, "lightblue", "red"),
  border = ifelse(x$correct, "lightblue", "red"),
  alpha = 0.6,
  hide = x$n < 20
)

Figure 2: Model performance: true labels <--> predicted labels.

Model accuracy is 87.7%, significantly worse than the tensorflow version from the original post. However, all conclusions from the original post still hold.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from …".

Citation

For attribution, please cite this work as

Damiani (2021, Feb. 4). Posit AI Blog: Simple audio classification with torch. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/

BibTeX citation

@misc{athossimpleaudioclassification,
  author = {Damiani, Athos},
  title = {Posit AI Blog: Simple audio classification with torch},
  url = {https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/},
  year = {2021}
}
