
Getting started with TensorFlow Probability from R


With the abundance of great libraries for statistical computing in R, why would you be interested in TensorFlow Probability (TFP, for short)? Well – let's take a look at a list of its components:

  • Distributions and bijectors (bijectors are reversible, composable maps)
  • Probabilistic modeling (Edward2 and probabilistic network layers)
  • Probabilistic inference (via MCMC or variational inference)

Now imagine all of these working seamlessly with the TensorFlow framework – core, Keras, contributed modules – and also running distributed and on GPU. The field of potential applications is vast – and far too diverse to cover as a whole in an introductory blog post.

Instead, our aim here is to provide a first introduction to TFP, focusing on direct applicability to, and interoperability with, deep learning.
We'll quickly show how to get started with one of its basic building blocks: distributions. Then, we'll build a variational autoencoder similar to the one in Representation learning with MMD-VAE. This time though, we'll use TFP to sample from the prior and approximate posterior distributions.

We regard this post as a "proof of concept" for using TFP with Keras – from R – and plan to follow up with more elaborate examples from the area of semi-supervised representation learning.

To install TFP together with TensorFlow, simply append tensorflow-probability to the default list of extra packages:

library(tensorflow)
install_tensorflow(
  extra_packages = c("keras", "tensorflow-hub", "tensorflow-probability"),
  version = "1.12"
)

Now to use TFP, all we need to do is import it and create some useful handles.
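Concretely, these are the same two handles we will create again below, once eager execution is enabled:

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions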

And here we go, sampling from a standard normal distribution.

n <- tfd$Normal(loc = 0, scale = 1)
n$sample(6L)
tf.Tensor(
"Normal_1/sample/Reshape:0", shape=(6,), dtype=float32
)

Now that's nice, but it's 2019; we don't want to have to create a session to evaluate those tensors anymore. In the variational autoencoder example below, we'll see how TFP and TF eager execution are a perfect match, so why not start using it now.

To use eager execution, we have to execute the following lines in a fresh (R) session:
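For the TF 1.x releases this post is based on, these lines typically amount to something like the following (an assumed setup – adapt to your own environment):

library(keras)
use_implementation("tensorflow")   # use the TensorFlow implementation of Keras
library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")  # turn on eager execution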

… and import TFP, same as above.

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions

Now let's quickly take a look at TFP distributions.

Using distributions

Here's that standard normal again.

n <- tfd$Normal(loc = 0, scale = 1)

Things commonly done with a distribution include sampling:

# just as in low-level tensorflow, we need to append L to indicate integer arguments
n$sample(6L)
tf.Tensor(
[-0.34403768 -0.14122334 -1.3832929   1.618252    1.364448   -1.1299014 ],
shape=(6,),
dtype=float32
)

As well as computing the log probability. Here we do that simultaneously for three values.
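The call producing the output below would be something like the following – for the standard normal, these are the log densities at -1, 0 and 1:

n$log_prob(c(-1, 0, 1))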

tf.Tensor(
[-1.4189385 -0.9189385 -1.4189385], shape=(3,), dtype=float32
)

We can do the same things with lots of other distributions, e.g., the Bernoulli:

b <- tfd$Bernoulli(0.9)
b$sample(10L)
tf.Tensor(
[1 1 1 0 1 1 0 1 0 1], shape=(10,), dtype=int32
)
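Computing log probabilities works just the same. Since 0.9 was passed positionally, and thus as a logit (not a probability), the output below is consistent with evaluating draws of 0, 1, 0 and 0 – something like:

b$log_prob(c(0L, 1L, 0L, 0L))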
tf.Tensor(
[-1.2411538 -0.3411539 -1.2411538 -1.2411538], shape=(4,), dtype=float32
)

Note that in the last chunk, we are asking for the log probabilities of four independent draws.

Batch shapes and event shapes

In TFP, we can do the following.

ns <- tfd$Normal(
  loc = c(1, 10, -200),
  scale = c(0.1, 0.1, 1)
)
ns
tfp.distributions.Normal(
"Normal/", batch_shape=(3,), event_shape=(), dtype=float32
)

Contrary to what it might look like, this is not a multivariate normal. As indicated by batch_shape=(3,), this is a "batch" of independent univariate distributions. The fact that these are univariate shows in event_shape=(): each of them lives in one-dimensional event space.

If instead we create a single, two-dimensional multivariate normal:

n <- tfd$MultivariateNormalDiag(loc = c(0, 10), scale_diag = c(1, 4))
n
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(), event_shape=(2,), dtype=float32
)

we see batch_shape=(), event_shape=(2,), as expected.

Of course, we can combine both, creating batches of multivariate distributions:

nd_batch <- tfd$MultivariateNormalFullCovariance(
  loc = list(c(0., 0.), c(1., 1.), c(2., 2.)),
  covariance_matrix = list(
    matrix(c(1, .1, .1, 1), ncol = 2),
    matrix(c(1, .3, .3, 1), ncol = 2),
    matrix(c(1, .5, .5, 1), ncol = 2))
)

This example defines a batch of three two-dimensional multivariate normal distributions.
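Sampling makes these shapes concrete: the requested sample shape gets prepended to batch shape and event shape. This quick check is not part of the original code:

nd_batch$sample(2L)  # a tensor of shape (2, 3, 2): 2 samples x 3 batch members x 2 event dimensions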

Converting between batch shapes and event shapes

Strange as it may sound, situations arise where we want to transform distribution shapes between these kinds – in fact, we'll see such a case very soon.

tfd$Independent is used to convert dimensions in batch_shape to dimensions in event_shape.

Here is a batch of three independent Bernoulli distributions.

bs <- tfd$Bernoulli(probs = c(.3, .5, .7))
bs
tfp.distributions.Bernoulli(
"Bernoulli/", batch_shape=(3,), event_shape=(), dtype=int32
)

We can convert this to a virtual "three-dimensional" Bernoulli like this:

b <- tfd$Independent(bs, reinterpreted_batch_ndims = 1L)
b
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(), event_shape=(3,), dtype=int32
)

Here, reinterpreted_batch_ndims tells TFP how many of the batch dimensions are being used for the event space, starting to count from the right of the shape list.
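The practical difference shows up in log_prob: the batch of Bernoullis returns one value per batch member, while the Independent version sums over the event dimensions and returns a single value per draw. A quick illustration (not from the original code):

bs$log_prob(c(1, 1, 1))  # three values, one per distribution in the batch
b$log_prob(c(1, 1, 1))   # a single value for the joint, "three-dimensional" draw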

With this basic understanding of TFP distributions, we're ready to see them used in a VAE.

We'll take the (not so) deep convolutional architecture from Representation learning with MMD-VAE and use distributions for sampling and computing probabilities. Optionally, our new VAE will be able to learn the prior distribution.

Concretely, the following exposition will consist of three parts.
First, we present common code applicable to both a VAE with a static prior, and one that learns the parameters of the prior distribution.
Then, we have the training loop for the first (static-prior) VAE. Finally, we discuss the training loop and additional model involved in the second (prior-learning) VAE.

Presenting both versions one after the other leads to code duplication, but avoids scattering confusing if-else branches throughout the code.

The second VAE is available as part of the Keras examples, so you don't have to copy out code snippets. The code there also contains additional functionality not discussed and replicated here, such as saving model weights.

So, let's start with the common part.

At the risk of repeating ourselves, here again are the preparatory steps (including a few additional library loads).
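Concretely, for the eager, TF 1.x setup used in this post, the preparation looks roughly as follows. The exact set of additional libraries is an assumption, inferred from the code below:

library(keras)
use_implementation("tensorflow")
library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions

library(tfdatasets)  # for streaming the training data
library(glue)        # for formatting the per-epoch loss output

num_epochs <- 40     # assumed: the results shown below are from 40 epochs of training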

Dataset

For a change from MNIST and Fashion-MNIST, we'll use the brand-new Kuzushiji-MNIST (Clanuwat et al. 2018).

From: Deep Learning for Classical Japanese Literature (Clanuwat et al. 2018)
np <- import("numpy")

kuzushiji <- np$load("kmnist-train-imgs.npz")
kuzushiji <- kuzushiji$get("arr_0")

train_images <- kuzushiji %>%
  k_expand_dims() %>%
  k_cast(dtype = "float32")

train_images <- train_images %>% `/`(255)

As in that other post, we stream the data via tfdatasets:

buffer_size <- 60000  # assumed: the size of the Kuzushiji-MNIST training set
batch_size <- 256     # matches the batch shape seen in the encoder output below
batches_per_epoch <- buffer_size / batch_size
train_dataset <- tensor_slices_dataset(train_images) %>%
  dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)

Now let's see what changes in the encoder and decoder models.

Encoder

The encoder differs from what we had without TFP in that it does not return the approximate posterior means and variances directly as tensors. Instead, it returns a batch of multivariate normal distributions:

# you might want to change this depending on the dataset
latent_dim <- 2

encoder_model <- function(name = NULL) {

  keras_model_custom(name = name, function(self) {
  
    self$conv1 <-
      layer_conv_2d(
        filters = 32,
        kernel_size = 3,
        strides = 2,
        activation = "relu"
      )
    self$conv2 <-
      layer_conv_2d(
        filters = 64,
        kernel_size = 3,
        strides = 2,
        activation = "relu"
      )
    self$flatten <- layer_flatten()
    self$dense <- layer_dense(units = 2 * latent_dim)
    
    function (x, mask = NULL) {
      x <- x %>%
        self$conv1() %>%
        self$conv2() %>%
        self$flatten() %>%
        self$dense()
        
      tfd$MultivariateNormalDiag(
        loc = x[, 1:latent_dim],
        scale_diag = tf$nn$softplus(x[, (latent_dim + 1):(2 * latent_dim)] + 1e-5)
      )
    }
  })
}

Let's try this out.

encoder <- encoder_model()

iter <- make_iterator_one_shot(train_dataset)
x <- iterator_get_next(iter)

approx_posterior <- encoder(x)
approx_posterior
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(256,), event_shape=(2,), dtype=float32
)
approx_posterior_sample <- approx_posterior$sample()
approx_posterior_sample
tf.Tensor(
[[ 5.77791929e-01 -1.64988488e-02]
 [ 7.93901443e-01 -1.00042784e+00]
 [-1.56279251e-01 -4.06365871e-01]
 ...
 ...
 [-6.47531569e-01  2.10889503e-02]], shape=(256, 2), dtype=float32)

We don't know about you, but we still enjoy the ease of inspecting values with eager execution – a lot.

Now, on to the decoder, which also returns a distribution instead of a tensor.

Decoder

In the decoder, we see why transformations between batch shape and event shape are useful.
The output of self$deconv3 is four-dimensional. What we need is an on-off-probability for every pixel.
Formerly, this was accomplished by feeding the tensor into a dense layer and applying a sigmoid activation.
Here, we use tfd$Independent to effectively transform the tensor into a probability distribution over three-dimensional images (width, height, channel(s)).

decoder_model <- function(name = NULL) {
  
  keras_model_custom(name = name, function(self) {
    
    self$dense <- layer_dense(units = 7 * 7 * 32, activation = "relu")
    self$reshape <- layer_reshape(target_shape = c(7, 7, 32))
    self$deconv1 <-
      layer_conv_2d_transpose(
        filters = 64,
        kernel_size = 3,
        strides = 2,
        padding = "same",
        activation = "relu"
      )
    self$deconv2 <-
      layer_conv_2d_transpose(
        filters = 32,
        kernel_size = 3,
        strides = 2,
        padding = "same",
        activation = "relu"
      )
    self$deconv3 <-
      layer_conv_2d_transpose(
        filters = 1,
        kernel_size = 3,
        strides = 1,
        padding = "same"
      )
    
    function (x, mask = NULL) {
      x <- x %>%
        self$dense() %>%
        self$reshape() %>%
        self$deconv1() %>%
        self$deconv2() %>%
        self$deconv3()
      
      tfd$Independent(tfd$Bernoulli(logits = x),
                      reinterpreted_batch_ndims = 3L)
      
    }
  })
}

Let's try this out too.

decoder <- decoder_model()
decoder_likelihood <- decoder(approx_posterior_sample)
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(256,), event_shape=(28, 28, 1), dtype=int32
)

This distribution will be used to generate the "reconstructions," as well as to determine the log likelihood of the original samples.

KL loss and optimizer

Both VAEs discussed below will need an optimizer …

optimizer <- tf$train$AdamOptimizer(1e-4)

… and both will delegate to compute_kl_loss to calculate the KL part of the loss.

This helper function simply subtracts the log likelihood of the samples under the prior from their log likelihood under the approximate posterior.

compute_kl_loss <- function(
  latent_prior,
  approx_posterior,
  approx_posterior_sample) {
  
  kl_div <- approx_posterior$log_prob(approx_posterior_sample) -
    latent_prior$log_prob(approx_posterior_sample)
  avg_kl_div <- tf$reduce_mean(kl_div)
  avg_kl_div
}
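As an aside (not used in this post), for pairs of distributions with a registered closed-form KL divergence, TFP can also compute it analytically. The Monte Carlo form above has the advantage of working with the learnable mixture prior introduced below, for which no closed form exists:

# analytic alternative, applicable to the static multivariate normal prior only
kl_div_exact <- tfd$kl_divergence(approx_posterior, latent_prior)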

Now that we've looked at the common parts, we first discuss how to train a VAE with a static prior.

In this VAE, we use TFP to create the usual isotropic Gaussian prior.
We then directly sample from this distribution in the training loop.

latent_prior <- tfd$MultivariateNormalDiag(
  loc = tf$zeros(list(latent_dim)),
  scale_identity_multiplier = 1
)

And here is the complete training loop. We'll point out the important TFP-related steps below.

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)
  
  total_loss <- 0
  total_loss_nll <- 0
  total_loss_kl <- 0
  
  until_out_of_range({
    x <- iterator_get_next(iter)
    
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      approx_posterior <- encoder(x)
      approx_posterior_sample <- approx_posterior$sample()
      decoder_likelihood <- decoder(approx_posterior_sample)
      
      nll <- -decoder_likelihood$log_prob(x)
      avg_nll <- tf$reduce_mean(nll)
      
      kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

      loss <- kl_loss + avg_nll
    })
    
    total_loss <- total_loss + loss
    total_loss_nll <- total_loss_nll + avg_nll
    total_loss_kl <- total_loss_kl + kl_loss
    
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)
    
    optimizer$apply_gradients(purrr::transpose(list(
      encoder_gradients, encoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      decoder_gradients, decoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
 
  })
  
  cat(
    glue(
      "Losses (epoch): {epoch}:",
      "  {(as.numeric(total_loss_nll)/batches_per_epoch) %>% round(4)} nll",
      "  {(as.numeric(total_loss_kl)/batches_per_epoch) %>% round(4)} kl",
      "  {(as.numeric(total_loss)/batches_per_epoch) %>% round(4)} total"
    ),
    "\n"
  )
}

Above, playing around with the encoder and the decoder, we've already seen how

approx_posterior <- encoder(x)

gives us a distribution we can sample from. We use it to obtain samples from the approximate posterior:

approx_posterior_sample <- approx_posterior$sample()

These samples we take and feed to the decoder, which gives us on-off-likelihoods for image pixels.

decoder_likelihood <- decoder(approx_posterior_sample)

Now the loss consists of the usual ELBO components: reconstruction loss and KL divergence.
The reconstruction loss we directly obtain from TFP, using the learned decoder distribution to assess the likelihood of the original input.

nll <- -decoder_likelihood$log_prob(x)
avg_nll <- tf$reduce_mean(nll)

The KL loss we get from compute_kl_loss, the helper function we saw above:

kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

We add both and arrive at the overall VAE loss:
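In the training loop, this is simply:

loss <- kl_loss + avg_nll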

Apart from these changes due to using TFP, the training process is just normal backprop, the way it looks using eager execution.

Now let's see how, instead of using the standard isotropic Gaussian, we could learn a mixture of Gaussians.
The choice of the number of mixture components here is pretty arbitrary. Just as with latent_dim, you might want to experiment and find out what works best for your dataset.

mixture_components <- 16

learnable_prior_model <- function(name = NULL, latent_dim, mixture_components) {
  
  keras_model_custom(name = name, function(self) {
    
    self$loc <-
      tf$get_variable(
        name = "loc",
        shape = list(mixture_components, latent_dim),
        dtype = tf$float32
      )
    self$raw_scale_diag <- tf$get_variable(
      name = "raw_scale_diag",
      shape = c(mixture_components, latent_dim),
      dtype = tf$float32
    )
    self$mixture_logits <-
      tf$get_variable(
        name = "mixture_logits",
        shape = c(mixture_components),
        dtype = tf$float32
      )
      
    function (x, mask = NULL) {
      tfd$MixtureSameFamily(
        components_distribution = tfd$MultivariateNormalDiag(
          loc = self$loc,
          scale_diag = tf$nn$softplus(self$raw_scale_diag)
        ),
        mixture_distribution = tfd$Categorical(logits = self$mixture_logits)
      )
    }
  })
}

In TFP terminology, components_distribution is the underlying distribution type, and mixture_distribution holds the probabilities that individual components are chosen.

Note how self$loc, self$raw_scale_diag and self$mixture_logits are TensorFlow Variables and thus persistent and updatable by backprop.

Now we create the model.

latent_prior_model <- learnable_prior_model(
  latent_dim = latent_dim,
  mixture_components = mixture_components
)

How do we obtain a latent prior distribution we can sample from? A bit unusually, this model will be called without an input:

latent_prior <- latent_prior_model(NULL)
latent_prior
tfp.distributions.MixtureSameFamily(
"MixtureSameFamily/", batch_shape=(), event_shape=(2,), dtype=float32
)
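Just as with the static prior, we can sample from it right away – a quick sanity check that is not part of the original code:

latent_prior$sample(2L)  # a tensor of shape (2, 2): 2 samples from the 2-dimensional mixture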

Here now is the complete training loop. Note how we have a third model to backprop through.

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)
  
  total_loss <- 0
  total_loss_nll <- 0
  total_loss_kl <- 0
  
  until_out_of_range({
    x <- iterator_get_next(iter)
    
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      approx_posterior <- encoder(x)
      
      approx_posterior_sample <- approx_posterior$sample()
      decoder_likelihood <- decoder(approx_posterior_sample)
      
      nll <- -decoder_likelihood$log_prob(x)
      avg_nll <- tf$reduce_mean(nll)
      
      latent_prior <- latent_prior_model(NULL)
      
      kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

      loss <- kl_loss + avg_nll
    })
    
    total_loss <- total_loss + loss
    total_loss_nll <- total_loss_nll + avg_nll
    total_loss_kl <- total_loss_kl + kl_loss
    
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)
    prior_gradients <-
      tape$gradient(loss, latent_prior_model$variables)
    
    optimizer$apply_gradients(purrr::transpose(list(
      encoder_gradients, encoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      decoder_gradients, decoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      prior_gradients, latent_prior_model$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    
  })
  
  checkpoint$save(file_prefix = checkpoint_prefix)
  
  cat(
    glue(
      "Losses (epoch): {epoch}:",
      "  {(as.numeric(total_loss_nll)/batches_per_epoch) %>% round(4)} nll",
      "  {(as.numeric(total_loss_kl)/batches_per_epoch) %>% round(4)} kl",
      "  {(as.numeric(total_loss)/batches_per_epoch) %>% round(4)} total"
    ),
    "\n"
  )
}

And that's it! For us, both VAEs yielded similar results, and we did not experience great differences when experimenting with latent dimensionality and the number of mixture components. But again, we wouldn't want to generalize to other datasets, architectures, etc.

Speaking of results, how do they look? Here we see letters generated after 40 epochs of training. On the left are random letters; on the right, the usual VAE grid display of latent space.

Hopefully, we've succeeded in showing that TensorFlow Probability, eager execution, and Keras make for an attractive combination! If you relate the total amount of code required to the complexity of the task, as well as the depth of the concepts involved, this should appear as a fairly concise implementation.

In the near future, we plan to follow up with more involved applications of TensorFlow Probability, mostly from the area of representation learning. Stay tuned!

Clanuwat, Tarin, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. 2018. "Deep Learning for Classical Japanese Literature." December 3, 2018. https://arxiv.org/abs/cs.CV/1812.01718.
