HomeArtificial IntelligencePosit AI Weblog: torch for optimization

Posit AI Weblog: torch for optimization


To date, all torch use instances we’ve mentioned right here have been in deep studying. Nonetheless, its automated differentiation characteristic is helpful in different areas. One distinguished instance is numerical optimization: We will use torch to seek out the minimal of a operate.

The truth is, operate minimization is precisely what occurs in coaching a neural community. However there, the operate in query usually is way too advanced to even think about discovering its minima analytically. Numerical optimization goals at increase the instruments to deal with simply this complexity. To that finish, nevertheless, it begins from capabilities which can be far much less deeply composed. As a substitute, they’re hand-crafted to pose particular challenges.

This submit is a primary introduction to numerical optimization with torch. Central takeaways are the existence and usefulness of its L-BFGS optimizer, in addition to the influence of operating L-BFGS with line search. As a enjoyable add-on, we present an instance of constrained optimization, the place a constraint is enforced by way of a quadratic penalty operate.

To heat up, we take a detour, minimizing a operate ā€œourselvesā€ utilizing nothing however tensors. This can become related later, although, as the general course of will nonetheless be the identical. All adjustments will likely be associated to integration of optimizers and their capabilities.

Operate minimization, DYI strategy

To see how we will decrease a operate ā€œby handā€, let’s strive the long-lasting Rosenbrock operate. This can be a operate with two variables:

[
f(x_1, x_2) = (a – x_1)^2 + b * (x_2 – x_1^2)^2
]

, with (a) and (b) configurable parameters typically set to 1 and 5, respectively.

In R:

library(torch)

a  1
b  5

rosenbrock  operate(x) {
  x1  x[1]
  x2  x[2]
  (a - x1)^2 + b * (x2 - x1^2)^2
}

Its minimal is positioned at (1,1), inside a slender valley surrounded by breakneck-steep cliffs:


Rosenbrock function.

Determine 1: Rosenbrock operate.

Our purpose and technique are as follows.

We wish to discover the values (x_1) and (x_2) for which the operate attains its minimal. We’ve to start out someplace; and from wherever that will get us on the graph we observe the adverse of the gradient ā€œdownwardsā€, descending into areas of consecutively smaller operate worth.

Concretely, in each iteration, we take the present ((x1,x2)) level, compute the operate worth in addition to the gradient, and subtract some fraction of the latter to reach at a brand new ((x1,x2)) candidate. This course of goes on till we both attain the minimal – the gradient is zero – or enchancment is under a selected threshold.

Right here is the corresponding code. For no particular causes, we begin at (-1,1) . The educational charge (the fraction of the gradient to subtract) wants some experimentation. (Strive 0.1 and 0.001 to see its influence.)

num_iterations  1000

# fraction of the gradient to subtract 
lr  0.01

# operate enter (x1,x2)
# that is the tensor w.r.t. which we'll have torch compute the gradient
x_star  torch_tensor(c(-1, 1), requires_grad = TRUE)

for (i in 1:num_iterations) {

  if (i %% 100 == 0) cat("Iteration: ", i, "n")

  # name operate
  worth  rosenbrock(x_star)
  if (i %% 100 == 0) cat("Worth is: ", as.numeric(worth), "n")

  # compute gradient of worth w.r.t. params
  worth$backward()
  if (i %% 100 == 0) cat("Gradient is: ", as.matrix(x_star$grad), "nn")

  # guide replace
  with_no_grad({
    x_star$sub_(lr * x_star$grad)
    x_star$grad$zero_()
  })
}
Iteration:  100 
Worth is:  0.3502924 
Gradient is:  -0.667685 -0.5771312 

Iteration:  200 
Worth is:  0.07398106 
Gradient is:  -0.1603189 -0.2532476 

...
...

Iteration:  900 
Worth is:  0.0001532408 
Gradient is:  -0.004811743 -0.009894371 

Iteration:  1000 
Worth is:  6.962555e-05 
Gradient is:  -0.003222887 -0.006653666 

Whereas this works, it actually serves as an instance the precept. With torch offering a bunch of confirmed optimization algorithms, there is no such thing as a want for us to manually compute the candidate (mathbf{x}) values.

Operate minimization with torch optimizers

As a substitute, we let a torch optimizer replace the candidate (mathbf{x}) for us. Habitually, our first strive is Adam.

Adam

With Adam, optimization proceeds so much quicker. Reality be advised, although, selecting a superb studying charge nonetheless takes non-negligeable experimentation. (Strive the default studying charge, 0.001, for comparability.)

num_iterations  100

x_star  torch_tensor(c(-1, 1), requires_grad = TRUE)

lr  1
optimizer  optim_adam(x_star, lr)

for (i in 1:num_iterations) {
  
  if (i %% 10 == 0) cat("Iteration: ", i, "n")
  
  optimizer$zero_grad()
  worth  rosenbrock(x_star)
  if (i %% 10 == 0) cat("Worth is: ", as.numeric(worth), "n")
  
  worth$backward()
  optimizer$step()
  
  if (i %% 10 == 0) cat("Gradient is: ", as.matrix(x_star$grad), "nn")
  
}
Iteration:  10 
Worth is:  0.8559565 
Gradient is:  -1.732036 -0.5898831 

Iteration:  20 
Worth is:  0.1282992 
Gradient is:  -3.22681 1.577383 

...
...

Iteration:  90 
Worth is:  4.003079e-05 
Gradient is:  -0.05383469 0.02346456 

Iteration:  100 
Worth is:  6.937736e-05 
Gradient is:  -0.003240437 -0.006630421 

It took us a few hundred iterations to reach at a good worth. This can be a lot quicker than the guide strategy above, however nonetheless quite a bit. Fortunately, additional enhancements are attainable.

L-BFGS

Among the many many torch optimizers generally utilized in deep studying (Adam, AdamW, RMSprop …), there may be one ā€œoutsiderā€, significantly better identified in basic numerical optimization than in neural-networks area: L-BFGS, a.ok.a. Restricted-memory BFGS, a memory-optimized implementation of the Broyden–Fletcher–Goldfarb–Shanno optimization algorithm (BFGS).

BFGS is probably essentially the most broadly used among the many so-called Quasi-Newton, second-order optimization algorithms. Versus the household of first-order algorithms that, in deciding on a descent course, make use of gradient info solely, second-order algorithms moreover take curvature info into consideration. To that finish, actual Newton strategies really compute the Hessian (a pricey operation), whereas Quasi-Newton strategies keep away from that value and, as an alternative, resort to iterative approximation.

Trying on the contours of the Rosenbrock operate, with its extended, slender valley, it isn’t troublesome to think about that curvature info may make a distinction. And, as you’ll see in a second, it actually does. Earlier than although, one be aware on the code. When utilizing L-BFGS, it’s essential to wrap each operate name and gradient analysis in a closure (calc_loss(), within the under snippet), for them to be callable a number of occasions per iteration. You may persuade your self that the closure is, in actual fact, entered repeatedly, by inspecting this code snippet’s chatty output:

num_iterations  3

x_star  torch_tensor(c(-1, 1), requires_grad = TRUE)

optimizer  optim_lbfgs(x_star)

calc_loss  operate() {

  optimizer$zero_grad()

  worth  rosenbrock(x_star)
  cat("Worth is: ", as.numeric(worth), "n")

  worth$backward()
  cat("Gradient is: ", as.matrix(x_star$grad), "nn")
  worth

}

for (i in 1:num_iterations) {
  cat("Iteration: ", i, "n")
  optimizer$step(calc_loss)
}
Iteration:  1 
Worth is:  4 
Gradient is:  -4 0 

Worth is:  6 
Gradient is:  -2 10 

...
...

Worth is:  0.04880721 
Gradient is:  -0.262119 -0.1132655 

Worth is:  0.0302862 
Gradient is:  1.293824 -0.7403332 

Iteration:  2 
Worth is:  0.01697086 
Gradient is:  0.3468466 -0.3173429 

Worth is:  0.01124081 
Gradient is:  0.2420997 -0.2347881 

...
...

Worth is:  1.111701e-09 
Gradient is:  0.0002865837 -0.0001251698 

Worth is:  4.547474e-12 
Gradient is:  -1.907349e-05 9.536743e-06 

Iteration:  3 
Worth is:  4.547474e-12 
Gradient is:  -1.907349e-05 9.536743e-06 

Though we ran the algorithm for 3 iterations, the optimum worth actually is reached after two. Seeing how properly this labored, we strive L-BFGS on a harder operate, named flower, for fairly self-evident causes.

(But) extra enjoyable with L-BFGS

Right here is the flower operate. Mathematically, its minimal is close to (0,0), however technically the operate itself is undefined at (0,0), because the atan2 used within the operate will not be outlined there.

a  1
b  1
c  4

flower  operate(x) {
  a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}

Flower function.

Determine 2: Flower operate.

We run the identical code as above, ranging from (20,20) this time.

num_iterations  3

x_star  torch_tensor(c(20, 0), requires_grad = TRUE)

optimizer  optim_lbfgs(x_star)

calc_loss  operate() {

  optimizer$zero_grad()

  worth  flower(x_star)
  cat("Worth is: ", as.numeric(worth), "n")

  worth$backward()
  cat("Gradient is: ", as.matrix(x_star$grad), "n")
  
  cat("X is: ", as.matrix(x_star), "nn")
  
  worth

}

for (i in 1:num_iterations) {
  cat("Iteration: ", i, "n")
  optimizer$step(calc_loss)
}
Iteration:  1 
Worth is:  28.28427 
Gradient is:  0.8071069 0.6071068 
X is:  20 20 

...
...

Worth is:  19.33546 
Gradient is:  0.8100872 0.6188223 
X is:  12.957 14.68274 

...
...

Worth is:  18.29546 
Gradient is:  0.8096464 0.622064 
X is:  12.14691 14.06392 

...
...

Worth is:  9.853705 
Gradient is:  0.7546976 0.7025688 
X is:  5.763702 8.895616 

Worth is:  2635.866 
Gradient is:  -0.7407354 -0.6717985 
X is:  -1949.697 -1773.551 

Iteration:  2 
Worth is:  1333.113 
Gradient is:  -0.7413024 -0.6711776 
X is:  -985.4553 -897.5367 

Worth is:  30.16862 
Gradient is:  -0.7903821 -0.6266789 
X is:  -21.02814 -21.72296 

Worth is:  1281.39 
Gradient is:  0.7544561 0.6563575 
X is:  964.0121 843.7817 

Worth is:  628.1306 
Gradient is:  0.7616636 0.6480014 
X is:  475.7051 409.7372 

Worth is:  4965690 
Gradient is:  -0.7493951 -0.662123 
X is:  -3721262 -3287901 

Worth is:  2482306 
Gradient is:  -0.7503822 -0.6610042 
X is:  -1862675 -1640817 

Worth is:  8.61863e+11 
Gradient is:  0.7486113 0.6630091 
X is:  645200412672 571423064064 

Worth is:  430929412096 
Gradient is:  0.7487153 0.6628917 
X is:  322643460096 285659529216 

Worth is:  Inf 
Gradient is:  0 0 
X is:  -2.826342e+19 -2.503904e+19 

Iteration:  3 
Worth is:  Inf 
Gradient is:  0 0 
X is:  -2.826342e+19 -2.503904e+19 

This has been much less of successful. At first, loss decreases properly, however immediately, the estimate dramatically overshoots, and retains bouncing between adverse and constructive outer area ever after.

Fortunately, there’s something we will do.

Taken in isolation, what a Quasi-Newton technique like L-BFGS does is decide the most effective descent course. Nonetheless, as we simply noticed, a superb course will not be sufficient. With the flower operate, wherever we’re, the optimum path results in catastrophe if we keep on it lengthy sufficient. Thus, we want an algorithm that fastidiously evaluates not solely the place to go, but additionally, how far.

For that reason, L-BFGS implementations generally incorporate line search, that’s, a algorithm indicating whether or not a proposed step size is an efficient one, or ought to be improved upon.

Particularly, torch’s L-BFGS optimizer implements the Sturdy Wolfe circumstances. We re-run the above code, altering simply two strains. Most significantly, the one the place the optimizer is instantiated:

optimizer  optim_lbfgs(x_star, line_search_fn = "strong_wolfe")

And secondly, this time I discovered that after the third iteration, loss continued to lower for some time, so I let it run for 5 iterations. Right here is the output:

Iteration:  1 
...
...

Worth is:  -0.8838741 
Gradient is:  3.742207 7.521572 
X is:  0.09035123 -0.03220009 

Worth is:  -0.928809 
Gradient is:  1.464702 0.9466625 
X is:  0.06564617 -0.026706 

Iteration:  2 
...
...

Worth is:  -0.9991404 
Gradient is:  39.28394 93.40318 
X is:  0.0006493925 -0.0002656128 

Worth is:  -0.9992246 
Gradient is:  6.372203 12.79636 
X is:  0.0007130796 -0.0002947929 

Iteration:  3 
...
...

Worth is:  -0.9997789 
Gradient is:  3.565234 5.995832 
X is:  0.0002042478 -8.457939e-05 

Worth is:  -0.9998025 
Gradient is:  -4.614189 -13.74602 
X is:  0.0001822711 -7.553725e-05 

Iteration:  4 
...
...

Worth is:  -0.9999917 
Gradient is:  -382.3041 -921.4625 
X is:  -6.320081e-06 2.614706e-06 

Worth is:  -0.9999923 
Gradient is:  -134.0946 -321.2681 
X is:  -6.921942e-06 2.865841e-06 

Iteration:  5 
...
...

Worth is:  -0.9999999 
Gradient is:  -3446.911 -8320.007 
X is:  -7.267168e-08 3.009783e-08 

Worth is:  -0.9999999 
Gradient is:  -3419.361 -8253.501 
X is:  -7.404627e-08 3.066708e-08 

It’s nonetheless not good, however so much higher.

Lastly, let’s go one step additional. Can we use torch for constrained optimization?

Quadratic penalty for constrained optimization

In constrained optimization, we nonetheless seek for a minimal, however that minimal can’t reside simply anyplace: Its location has to meet some variety of extra circumstances. In optimization lingo, it must be possible.

As an instance, we stick with the flower operate, however add on a constraint: (mathbf{x}) has to lie outdoors a circle of radius (sqrt(2)), centered on the origin. Formally, this yields the inequality constraint

[
2 – {x_1}^2 – {x_2}^2

A way to minimize flower and yet, at the same time, honor the constraint is to use a penalty function. With penalty methods, the value to be minimized is a sum of two things: the target function’s output and a penalty reflecting potential constraint violation. Use of a quadratic penalty, for example, results in adding a multiple of the square of the constraint function’s output:

# x^2 + y^2 >= 2
# 2 - x^2 - y^2 
constraint  function(x) 2 - torch_square(torch_norm(x))

# quadratic penalty
penalty  function(x) torch_square(torch_max(constraint(x), other = 0))

A priori, we can’t know how big that multiple has to be to enforce the constraint. Therefore, optimization proceeds iteratively. We start with a small multiplier, (1), say, and increase it for as long as the constraint is still violated:

penalty_method  function(f, p, x, k_max, rho = 1, gamma = 2, num_iterations = 1) {

  for (k in 1:k_max) {
    cat("Starting step: ", k, ", rho = ", rho, "n")

    minimize(f, p, x, rho, num_iterations)

    cat("Value: ",  as.numeric(f(x)), "n")
    cat("X: ",  as.matrix(x), "n")
    
    current_penalty  as.numeric(p(x))
    cat("Penalty: ", current_penalty, "n")
    if (current_penalty == 0) break
    
    rho  rho * gamma
  }

}

minimize(), called from penalty_method(), follows the usual proceedings, but now it minimizes the sum of the target and up-weighted penalty function outputs:

minimize  function(f, p, x, rho, num_iterations) {

  calc_loss  function() {
    optimizer$zero_grad()
    value  f(x) + rho * p(x)
    value$backward()
    value
  }

  for (i in 1:num_iterations) {
    cat("Iteration: ", i, "n")
    optimizer$step(calc_loss)
  }

}

This time, we start from a low-target-loss, but unfeasible value. With yet another change to default L-BFGS (namely, a decrease in tolerance), we see the algorithm exiting successfully after twenty-two iterations, at the point (0.5411692,1.306563).

x_star  torch_tensor(c(0.5, 0.5), requires_grad = TRUE)

optimizer  optim_lbfgs(x_star, line_search_fn = "strong_wolfe", tolerance_change = 1e-20)

penalty_method(flower, penalty, x_star, k_max = 30)
Starting step:  1 , rho =  1 
Iteration:  1 
Value:  0.3469974 
X:  0.5154735 1.244463 
Penalty:  0.03444662 

Starting step:  2 , rho =  2 
Iteration:  1 
Value:  0.3818618 
X:  0.5288152 1.276674 
Penalty:  0.008182613 

Starting step:  3 , rho =  4 
Iteration:  1 
Value:  0.3983252 
X:  0.5351116 1.291886 
Penalty:  0.001996888 

...
...

Starting step:  20 , rho =  524288 
Iteration:  1 
Value:  0.4142133 
X:  0.5411959 1.306563 
Penalty:  3.552714e-13 

Starting step:  21 , rho =  1048576 
Iteration:  1 
Value:  0.4142134 
X:  0.5411956 1.306563 
Penalty:  1.278977e-13 

Starting step:  22 , rho =  2097152 
Iteration:  1 
Value:  0.4142135 
X:  0.5411962 1.306563 
Penalty:  0 

Conclusion

Summing up, we’ve gotten a first impression of the effectiveness of torch’s L-BFGS optimizer, especially when used with Strong-Wolfe line search. In fact, in numerical optimization – as opposed to deep learning, where computational speed is much more of an issue – there is hardly ever a reason to not use L-BFGS with line search.

We’ve then caught a glimpse of how to do constrained optimization, a task that arises in many real-world applications. In that regard, this post feels a lot more like a beginning than a stock-taking. There is a lot to explore, from general method fit – when is L-BFGS well suited to a problem? – via computational efficacy to applicability to different species of neural networks. Needless to say, if this inspires you to run your own experiments, and/or if you use L-BFGS in your own projects, we’d love to hear your feedback!

Thanks for reading!

Appendix

Rosenbrock function plotting code

library(tidyverse)

a  1
b  5

rosenbrock  function(x) {
  x1  x[1]
  x2  x[2]
  (a - x1)^2 + b * (x2 - x1^2)^2
}

df  expand_grid(x1 = seq(-2, 2, by = 0.01), x2 = seq(-2, 2, by = 0.01)) %>%
  rowwise() %>%
  mutate(x3 = rosenbrock(c(x1, x2))) %>%
  ungroup()

ggplot(information = df,
       aes(x = x1,
           y = x2,
           z = x3)) +
  geom_contour_filled(breaks = as.numeric(torch_logspace(-3, 3, steps = 50)),
                      present.legend = FALSE) +
  theme_minimal() +
  scale_fill_viridis_d(course = -1) +
  theme(facet.ratio = 1)

Flower operate plotting code

a  1
b  1
c  4

flower  operate(x) {
  a * torch_norm(x) + b * torch_sin(c * torch_atan2(x[2], x[1]))
}

df  expand_grid(x = seq(-3, 3, by = 0.05), y = seq(-3, 3, by = 0.05)) %>%
  rowwise() %>%
  mutate(z = flower(torch_tensor(c(x, y))) %>% as.numeric()) %>%
  ungroup()

ggplot(information = df,
       aes(x = x,
           y = y,
           z = z)) +
  geom_contour_filled(present.legend = FALSE) +
  theme_minimal() +
  scale_fill_viridis_d(course = -1) +
  theme(facet.ratio = 1)

Picture by Michael Trimble on Unsplash

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments