LLaMA in R with Keras and TensorFlow

OpenAI’s chatGPT has awakened a collective awareness of what Large
Language Models (LLMs) are capable of. With that awakening comes a daily
march of LLM news: new products, new features, new models, new
capabilities, (and new worries). It seems we’re in the early stages of a
Cambrian explosion of LLMs and LLM powered tools; it’s not yet clear how
LLMs will impact and influence our professional and personal lives, but
it seems clear that they will, ultimately.

Since LLMs are here to stay, it’s worthwhile to take some time to
understand how these models work from a first-principles perspective.
Starting with the mechanics can help foster durable intuitions that will
inform our usage of these models now and in the future. (Especially if
the future is one where LLMs are a staple of the data scientist’s
toolbox, as common as an lm() function call).

And what better way is there to learn than by doing? So with that
preamble, in this post we’ll walk through an implementation of an LLM,
LLaMA (Touvron et al. 2023)
specifically, in TensorFlow and Keras, with the goal being to develop
understanding first, capability second.

Why LLaMA? With the sheer volume of LLM related content and news out
there, it can seem daunting to know where to get started. Almost weekly
it seems there is a new model announced. Browsing some hubs of LLM
activity (HuggingFace, TFHub, reddit, HackerNews) muddies the waters
even more. How to pick a specific model?

Of the many LLM-related news items in the past months, one that stands
head-and-shoulders above the crowd is the release of
LLaMA,
a modern, foundational LLM made available to the public by Meta AI in
February 2023. On common benchmarks, LLaMA outperforms OpenAI’s GPT-3,
while being substantially smaller (though still large).

LLaMA is a great starting place because it is a simple and modern
architecture, has excellent performance on benchmarks, and is open. The
model architecture has had just a few new ideas incorporated into it since
the original Transformer architecture first described in
“Attention Is All You Need”,
published from Google (Vaswani et al. 2017). Four different sizes of
LLaMA have been released: 7 billion and 13 billion parameter models
trained on 1 trillion tokens, and 33 billion and 65 billion parameter
models trained on 1.4 trillion tokens. This is an enormous amount of
training data these models have seen: the largest 65B model has been
trained on approximately the “Chinchilla
compute-optimum” (Hoffmann et al. 2022)
number of tokens, while the smaller LLaMAs are substantially
beyond that optimum. In this blog post we’ll focus on the smallest, 7B
parameter LLaMA model, which you can comfortably load locally and run on
CPU with only 64Gb of RAM.

While not strictly necessary, to follow along locally, you’ll probably
want to acquire the pre-trained LLaMA weights one way or another. Note,
the weights do come with their own license.

So, without further ado, let’s get started.

Setup

First, we’ll want to install the required R and Python packages, and
configure a virtual environment:

remotes::install_github(c("rstudio/reticulate",
                          "rstudio/tensorflow",
                          "rstudio/keras"))
# reticulate::install_python("3.10:latest")
reticulate::virtualenv_create("./.venv", version = "3.10:latest")
tensorflow::install_tensorflow(envname = "./.venv", version = "release",
                               extra_packages = "tensorflow-text")

library(purrr)
library(envir)

library(tensorflow)
library(tfautograph)
library(keras)

use_virtualenv("./.venv")
options(tensorflow.extract.warn_tensors_passed_asis = FALSE)

attach_eval({
  import_from(glue, glue)
  import_from(jsonlite, read_json)
  import_from(withr, with_dir, with_options)
  import_from(keras$layers, Dense)
  np <- reticulate::import("numpy", convert = FALSE)

  # 0-based integer sequence of length `x`
  seq_len0 <- function(x) seq.int(from = 0L, length.out = x)
})

# Convert the torch checkpoint to one .npy file per weight tensor
# (this only needs to be run once).
# reticulate::py_install("torch", pip = TRUE)
torch <- reticulate::import("torch", convert = FALSE)
with_dir("~/github/facebookresearch/llama/weights/LLaMA/7B", {
  pretrained_weights <- torch$load("consolidated.00.pth",
                                   map_location = "cpu")
  for (name in names(pretrained_weights)) {
    filename <- sprintf("%s.npy", name)
    array <- pretrained_weights[[name]]$numpy()
    np$save(filename, array)
    message(glue(
      "wrote: '{basename(filename)}' with shape: {array$shape}"))
  }
})

# Helper to avoid retyping the full path to the weights;
# `filename` is a glue template evaluated in the caller's frame.
weights_path <- function(filename) normalizePath(file.path(
  "~/github/facebookresearch/llama/weights/LLaMA/",
  glue(filename, .envir = parent.frame())), mustWork = TRUE)

# Model configuration parameters for the 7B model
params <- read_json(weights_path("7B/params.json"))
str(params)

List of 6
 $ dim        : int 4096
 $ multiple_of: int 256
 $ n_heads    : int 32
 $ n_layers   : int 32
 $ norm_eps   : num 1e-06
 $ vocab_size : int -1

Tokenizer

The first component of LLaMA is the tokenizer, which converts text into a
sequence of integers. LLaMA uses the SentencePiece tokenizer from
Google. SentencePiece is available as a TensorFlow graph operation
through
tf_text.SentencepieceTokenizer,
and also as a Keras layer in
keras_nlp.tokenizers.SentencepieceTokenizer.
By choice of a coin flip, we’ll use the lower-level tf_text interface.

tf_text <- reticulate::import("tensorflow_text")
tokenizer_path <- weights_path("tokenizer.model")
tokenizer <- tf_text$SentencepieceTokenizer(
  tf$io$gfile$GFile(tokenizer_path, "rb")$read(),
  add_bos = TRUE, add_eos = FALSE
)

prompt <- "The best way to attract bees"
tokenizer$tokenize(prompt)

tf.Tensor([    1   450  1900   982   304 13978   367   267], shape=(8), dtype=int32)

prompt |> tokenizer$tokenize() |> tokenizer$detokenize()

tf.Tensor(b'The best way to attract bees', shape=(), dtype=string)

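Let’s define a show_tokens() helper function and play with the
tokenizer a little.

show_tokens <- function(what) {
  # accept either a string to tokenize or a vector of token ids
  if (is.character(what))
    token_ids <- what |> tokenizer$tokenize() |> as.integer()
  else
    token_ids <- as.integer(what)

  # detokenize each id on its own to see the text it stands for
  tokens <- token_ids |>
    map_chr(function(id) {
      id |>
        as_tensor(shape = c(1)) |>
        tokenizer$detokenize() |>
        as.character()
    })

  names(tokens) <- token_ids
  tokens
}

show_tokens(prompt)

        1       450      1900       982       304     13978       367       267
       ""     "The"    "best"     "way"      "to" "attract"      "be"      "es"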

Note that “bees” is two tokens. Not every token corresponds to a word.
For example, one non-word token we can reliably expect to show up in a
tokenizer trained on a corpus of English text is “ing.” However, when the
“ing” token shows up will not always follow your intuitions, because
common words get their own token id, even if they can be decomposed into
multiple tokens.

show_tokens("ing")

    1  2348
   "" "ing"

show_tokens("working")

        1      1985
       "" "working"

show_tokens("flexing")

     1   8525    292
    "" "flex"  "ing"

show_tokens("wonking")

     1   2113   9292
    ""  "won" "king"

Another thing to note about the tokenizer is that each token sequence
starts with token id 1. This is a special beginning-of-sequence
token that we requested be added when we loaded the tokenizer with
add_bos = TRUE. There are two other such special tokens that we will
encounter later: an end-of-sequence special token with id 2, and an
unknown-token with id 0.

as.character(tokenizer$id_to_string(0L))

[1] "<unk>"

as.character(tokenizer$id_to_string(1L))

[1] "<s>"

as.character(tokenizer$id_to_string(2L))

[1] "</s>"

show_tokens(c(1, 0, 2))

    1     0     2
   "" " ⁇ "    ""

Overall, there are 32,000 tokens.

as.integer(tokenizer$vocab_size())

[1] 32000

One last observation is that the more frequently encountered tokens are
assigned lower ids.

show_tokens(seq(50, len = 10))

 50  51  52  53  54  55  56  57  58  59
"/" "0" "1" "2" "3" "4" "5" "6" "7" "8"

show_tokens(seq(100, len = 10))

100 101 102 103 104 105 106 107 108 109
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

show_tokens(seq(1000, len = 10))

   1000    1001    1002    1003    1004    1005    1006    1007    1008    1009
  "ied"    "ER"  "stat"   "fig"    "me"   "von" "inter"  "roid"  "ater" "their"

show_tokens(seq(10000, len = 10))

   10000    10001    10002    10003    10004    10005    10006    10007
   "ång"  "citep"    "Ill"   "rank" "sender"   "beim"    "рак" "compat"
   10008    10009
"occurs"  "diese"

show_tokens(seq(20000, len = 10))

    20000     20001     20002     20003     20004     20005     20006     20007
  "admit" "Comment"     "стя"    "Vien"      "ці"  "permut"     "cgi"    "crít"
    20008     20009
"Console"    "ctic"

show_tokens(seq(to = as.integer(tokenizer$vocab_size()) - 1, len = 10))

31990 31991 31992 31993 31994 31995 31996 31997 31998 31999
  "ὀ"  "げ"  "べ"  "边"  "还"  "黃"  "왕"  "收"  "弘"  "给"

Moving on, the next step after tokenization is embedding. An embedding
layer is effectively a dictionary lookup that converts an integer (token
id) to a 1-d float array. For this we can use the standard keras
Embedding layer.

tok_embeddings <- keras$layers$Embedding(
  input_dim = tokenizer$vocab_size(),
  output_dim = params$dim,
  embeddings_initializer =
    \(...) np$load(weights_path("7B/tok_embeddings.weight.npy"))
)

tok_embeddings(3L) |> str()

prompt |> # "The best way to attract bees"
  tokenizer$tokenize() |>
  tok_embeddings() |>
  str()
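
To make the “dictionary lookup” description concrete, here is a minimal base R sketch. It is not the Keras layer itself: the matrix toy_embeddings, the helper embed_tokens(), and the sizes (a 6-token vocabulary with 4 features) are made up for illustration. Token ids are 0-based, as in the tokenizer output, so we add 1 to index R’s 1-based rows.

```r
# Toy embedding table: one row per token id, one column per feature.
# (Hypothetical sizes; LLaMA 7B uses 32000 rows and 4096 columns.)
vocab_size <- 6L
n_features <- 4L
toy_embeddings <- matrix(seq_len(vocab_size * n_features) / 10,
                         nrow = vocab_size, ncol = n_features, byrow = TRUE)

# Look up the embedding vectors for a sequence of 0-based token ids.
embed_tokens <- function(token_ids, table = toy_embeddings) {
  table[token_ids + 1L, , drop = FALSE]
}

embedded <- embed_tokens(c(1L, 3L, 3L))
dim(embedded)  # one row per token, one column per feature
```

Repeated token ids map to identical rows, which is all an embedding layer does before training adjusts the table.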

`TransformerBlock`

Once it is tokenized and embedded, the input then passes through the bulk
of the model, a sequence of repeating TransformerBlock layers. The 7B
model has 32 of these TransformerBlock layers, while the 65B model has
80 of them.

weights_path("7B/params.json")  |> read_json() |> _$n_layers

[1] 32

weights_path("65B/params.json") |> read_json() |> _$n_layers

[1] 80

Here is what the transformer block looks like:

TransformerBlock(keras$layers$Layer) %py_class% {
  initialize <- function(attn_head_size, attn_n_heads,
                         norm_eps = k_epsilon(), ...,
                         block_id = NULL) {
    super$initialize(...)

    self$attention <- Attention(attn_head_size, attn_n_heads,
                                block_id = block_id)

    self$feed_forward <- FeedForward(
      hidden_dim = 4 * attn_head_size * attn_n_heads,
      block_id = block_id)

    self$attention_norm <- RMSNorm(eps = norm_eps,
                                   block_id = block_id,
                                   feeds_into = "attention")
    self$feed_forward_norm <- RMSNorm(eps = norm_eps,
                                      block_id = block_id,
                                      feeds_into = "ffn")
  }

  call <- function(x) {

    # norm and attention
    x2 <- x %>%
      self$attention_norm() %>%
      self$attention()

    x <- x + x2 # add residual

    # norm and swiglu
    x2 <- x %>%
      self$feed_forward_norm() %>%
      self$feed_forward()

    x <- x + x2 # residual again

    x
  }
}

While there is not a lot of code, there are a lot of ideas packed in
there. This block forms the main trunk of the model, so it’s worth
taking the time to go through it slowly.

We implement the TransformerBlock as a subclassed
keras.layers.Layer. This gives us some niceties like the ability to
compose with other Keras layers, but these are mostly irrelevant to the
purpose of this blog post; we could just as easily implement this as,
for example, a vanilla R6 class. Our TransformerBlock class has two
methods: initialize, called when we first create the block, and
call, called when we run the forward pass of the block.

In initialize, we create 4 layers: an Attention layer, a
FeedForward layer, and 2 RMSNorm layers. We’ll take a close look at
each of these soon, but even before we do so, we can see how they fit
together by looking at the TransformerBlock$call() method.

The call method has a few simple ideas. In no particular order, the
first one to observe is the composition pattern of adding residuals.

x2 <- x |> ...
x <- x + x2 # add residual x to x2

This is a common pattern that helps with model training, and especially
with the vanishing gradient
problem. It’s
a skip-connection in the otherwise linear sequence of matrix
transformations. It reinjects information (during the forward pass), and
gradients (during back propagation), back into the trunk. You can think
of these residual connections as freeing the learnable layers in-between
(the ... in the pseudo code) from the burden of having to
“pass-through” or “preserve” information in x, allowing the weights to
instead focus on learning transformations that are, (in corporatese
vernacular), value-adding.

The next composition pattern to note is the repeating usage of a
normalization layer:

x2 <- x |> norm() |> ...
x <- x + x2

There are many kinds of normalization layers, but to slightly
over-generalize, they can all be thought of as a stabilizer that helps
with training. Like their deep-learning cousins the regularizers, their
main function is to keep values passing through in a sensible range, in
the ball park of (-1, 1), typically. We’ll take a closer look at
RMSNorm soon.

Stripped of two tricks that are mostly there to help the model train,
residuals and normalization, the core of the TransformerBlock is just
this:

x |> attention() |> feed_forward()

In a moment we’ll see that feed_forward is a slightly fancier
variation of a conventional sequence of Dense layers. Before we get
there we can safely skip ahead to distill the following intuition: a
TransformerBlock is basically an Attention layer followed by a few
(fancy) dense layers, with some simple composition patterns (tricks)
that help with training. Attention is the heart of the model: it’s the
most interesting, and also the most involved.

With the framing in place, let’s go through and take a closer look at
RMSNorm, FeedForward, and then with the foundation in place, we’ll
turn our attention to Attention.

`RMSNorm`

RMSNorm(keras$layers$Layer) %py_class% {
  initialize <-
    function(eps = 1e-6, ..., block_id = NULL, feeds_into = NULL) {
      super$initialize(...)
      self$eps <- eps
      self$block_id <- block_id
      self$feeds_into <- feeds_into
    }

  build <- function(input_shape) {
    # input_shape == (batch_size, seqlen, params$dim)
    # self$w will broadcast over batch_size and seqlen dims.
    # w_shape == (1, 1, params$dim)
    w_shape <- rep(1L, length(input_shape))
    w_shape[length(input_shape)] <- as.integer(input_shape) |> tail(1L)

    # define a local function that will load
    # the pretrained-weights if we supplied `block_id` and `feeds_into`
    import_from({self}, block_id, feeds_into)
    initializer <- if (is.null(block_id))
      "ones"
      else if (block_id >= 0) {
        \(...) weights_path("7B/layers.{block_id}.{feeds_into}_norm.weight.npy") |>
               np$load() |> np$expand_dims(0:1)
      } else if (block_id == -1)
        # load weights for the final output normalization layer, which is not
        # part of a TransformerBlock
        \(...) weights_path("7B/norm.weight.npy") |>
               np$load() |> np$expand_dims(0:1)

    self$w <- self$add_weight(shape = w_shape,
                              initializer = initializer,
                              trainable = TRUE)
  }

  rrms <- function(x) {
    # reciprocal root mean square along the last axis
    x %>% # (batch_size, seqlen, n_features)
      tf$math$square() %>%
      tf$reduce_mean(axis = -1L, keepdims = TRUE) %>% # (batch_size, seqlen, 1)
      tf$math$add(self$eps) %>% # for numerical stability
      tf$math$rsqrt()
  }

  call <- function(x) {
    x * self$rrms(x) * self$w
  }
}

RMSNorm() has a single trainable tensor w. In the forward pass, each
value in the input is multiplied by the reciprocal-root-mean-square of
all the values in the feature axis and by w. Certainly a mouthful, but
just a simple sequence of arithmetic transformations in the end,
designed for the express purpose of adjusting the range of values
passing through.

Let’s kick the tires on it:

norm <- RMSNorm()
m <- matrix(c(0, 1,
              2, 3), nrow = 2)
norm(m)

tf.Tensor(
[[0.         1.4142132 ]
 [0.44721353 1.3416406 ]], shape=(2, 2), dtype=float32)

norm(m*10)

tf.Tensor(
[[0.         1.4142137 ]
 [0.44721362 1.3416408 ]], shape=(2, 2), dtype=float32)

norm(m*100)

tf.Tensor(
[[0.        1.4142137]
 [0.4472136 1.3416408]], shape=(2, 2), dtype=float32)
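
The same arithmetic can be checked without TensorFlow. The following is a minimal base R sketch for 2-d inputs only; rms_norm() is a hypothetical helper name, and w defaults to ones, mirroring the "ones" initializer used when no pretrained weights are loaded.

```r
# RMSNorm over the last (feature) axis of a 2-d matrix, in base R.
rms_norm <- function(x, w = rep(1, ncol(x)), eps = 1e-6) {
  rrms <- 1 / sqrt(rowMeans(x^2) + eps)  # one scale factor per row
  sweep(x * rrms, 2, w, `*`)             # scale rows, then apply w per feature
}

m <- matrix(c(0, 1,
              2, 3), nrow = 2)
round(rms_norm(m), 4)
round(rms_norm(m * 100), 4)  # nearly identical: the output range is stabilized
```

Scaling the input by 100 leaves the output essentially unchanged, which is the “adjusting the range of values” behavior described above.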

`FeedForward`

Next up is FeedForward()

FeedForward(keras$layers$Layer) %py_class% {

  initialize <- function(hidden_dim, multiple_of = 256L,
                         ..., block_id = NULL) {
    super$initialize()

    # shrink to 2/3 of the requested size, then round up
    # to the nearest multiple of `multiple_of`
    if(!is.null(multiple_of)) {
      hidden_dim <- hidden_dim %>%
        { as.integer( . * (2/3)) } %>%
        { (. + multiple_of - 1) %/% multiple_of } %>%
        { . * multiple_of }
    }

    self$hidden_dim <- hidden_dim
    self$block_id <- block_id
  }

  build <- function(input_shape) {
    output_dim <- input_shape |> as.integer() |> tail(1)

    if(is.null(self$block_id))
      load_weight <- \(...) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{self$block_id}.feed_forward.{name}.weight.npy"))$`T`

    self$w1 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w1"))
    self$w2 <- Dense(output_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w2"))
    self$w3 <- Dense(self$hidden_dim, use_bias = FALSE,
                     kernel_initializer = load_weight("w3"))

    super$build(input_shape)
  }

  call <- function(x) {
    import_from({self}, w1, w2, w3)
    import_from(tf$nn, silu)

    x %>%
      { silu(w1(.)) * w3(.) } %>% # SwiGLU
      w2()
  }

}

FeedForward consists of three Dense layers, composed in call() to form
a “SwiGLU” unit. The publication by Shazeer (2020)
of SwiGLU and other variations on GLU is an exemplar of the types
of explorations and improvements around the Transformer architecture
since its initial publication in
2017; a steady accretion of
enhancements that has brought us to today. The FeedForward$call() is
just a single SwiGLU followed by a linear projection. In its essence,
it’s a clever composition of three (learned) linear projections, an
element-wise multiplication, and a silu()
activation
function.

Perhaps the most surprising observation to make here is the relative
dearth of activation functions, or even non-linearities, not just in
FeedForward, but overall. The silu() in this feedforward, the
reciprocal-root-mean-square in RMSNorm(), and a softmax() in
Attention() are the only non-linear transformations in the whole
sequence of TransformerBlocks. Everything else is a linear
transformation!
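
To see how little is going on, here is a minimal base R sketch of the two pieces of FeedForward: the hidden-size arithmetic from initialize() and the SwiGLU composition from call(). The helper names (silu, swiglu_feed_forward, round_hidden_dim) and the small random weight matrices are stand-ins chosen for illustration; plain matrices play the role of the three bias-free Dense layers.

```r
# silu (a.k.a. swish): x * sigmoid(x)
silu <- function(x) x / (1 + exp(-x))

# SwiGLU feed-forward on a (seqlen, dim) matrix, with plain weight matrices
# standing in for the three bias-free Dense layers.
swiglu_feed_forward <- function(x, w1, w2, w3) {
  (silu(x %*% w1) * (x %*% w3)) %*% w2
}

# The hidden size arithmetic from initialize(): take 2/3 of the requested
# size, then round up to the nearest multiple of `multiple_of`.
round_hidden_dim <- function(hidden_dim, multiple_of = 256L) {
  hidden_dim <- as.integer(hidden_dim * (2 / 3))
  ((hidden_dim + multiple_of - 1L) %/% multiple_of) * multiple_of
}
round_hidden_dim(4L * 4096L)  # 11008 for the 7B model (dim = 4096)

set.seed(1)
x  <- matrix(rnorm(3 * 8), nrow = 3)     # (seqlen = 3, dim = 8)
w1 <- matrix(rnorm(8 * 16), nrow = 8)    # (dim, hidden)
w3 <- matrix(rnorm(8 * 16), nrow = 8)    # (dim, hidden)
w2 <- matrix(rnorm(16 * 8), nrow = 16)   # (hidden, dim)
dim(swiglu_feed_forward(x, w1, w2, w3))  # same shape as the input
```

The only non-linear step is silu(); the rest is three matrix multiplications and one element-wise product.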

`Attention`

Finally, let’s turn our attention to Attention().

Attention(keras$layers$Layer) %py_class% {
  initialize <- function(head_size, n_heads,
                         ..., block_id = NULL) {
    super$initialize(...)

    self$head_size <- head_size
    self$n_heads <- n_heads

    if (is.null(block_id))
      load_weight <- function(name) NULL
    else
      load_weight <- \(name) \(...) np$load(weights_path(
        "7B/layers.{block_id}.attention.{name}.weight.npy"))$`T`

    Dense <- function(name) keras$layers$Dense(
      units = n_heads * head_size,
      use_bias = FALSE,
      kernel_initializer = load_weight(name)
    )

    self$wq <- Dense("wq")
    self$wk <- Dense("wk")
    self$wv <- Dense("wv")
    self$wo <- Dense("wo")
  }

  call <- function(x) {
    c(batch_size, seqlen, n_features) %<-% tf$unstack(tf$shape(x))

    # 1. project (linear transform) x into
    #    query, key, and value tensors
    # 2. reshape q k v, splitting out the last dim (n_features)
    #    into n_heads independent subspaces,
    #    each with size head_size.
    #    (n_features == head_size * n_heads)
    split_heads_shape <- c(batch_size, seqlen,
                           self$n_heads, self$head_size)
    q <- x |> self$wq() |> tf$reshape(split_heads_shape)
    k <- x |> self$wk() |> tf$reshape(split_heads_shape)
    v <- x |> self$wv() |> tf$reshape(split_heads_shape)

    # embed positional information in query and key
    # (bsz, seqlen, n_heads, head_size)
    q %<>% apply_rotary_embedding()
    k %<>% apply_rotary_embedding()

    # reshape:
    #   move heads out of the last 2 axes,
    #   so later matmuls are performed across the subspaces (heads)
    #   between (seqlen, head_size) axes
    v <- tf$transpose(v, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    q <- tf$transpose(q, c(0L, 2L, 1L, 3L)) # (bsz, n_heads, seqlen, head_size)
    k <- tf$transpose(k, c(0L, 2L, 3L, 1L)) # (bsz, n_heads, head_size, seqlen)

    # calculate and normalize attention scores
    scores <- q %*% k                       # (bsz, n_heads, seqlen, seqlen)
    scores <- scores / sqrt(self$head_size) # scale

    # apply causal mask, so the model can't "look ahead" during training
    mask <- make_mask(seqlen, dtype = scores$dtype)
    scores %<>% { . + mask }

    scores <- tf$nn$softmax(scores, axis = -1L)

    # adjust values tensor with attention scores
                      # scores (bsz, n_heads, seqlen, seqlen)
                      # v      (bsz, n_heads, seqlen, head_size)
    output <- scores %*% v  # (bsz, n_heads, seqlen, head_size)

    # combine heads back into a single features dim,
    # so Attention output_shape==input_shape
    output <- output |>
      tf$transpose(c(0L, 2L, 1L, 3L)) |> # (bsz, seqlen, n_heads, head_size)
      tf$reshape(tf$shape(x))            # (bsz, seqlen, n_heads * head_size)

    # one more trainable linear projection for good luck
    output <- self$wo(output) # (bsz, seqlen, n_heads * head_size)

    output
  }
}

Our implementation of attention in LLaMA is not dissimilar to the one
described in the original Transformers
paper (and available as a keras
builtin under keras$layers$MultiHeadAttention()). The core novelty is
the addition of the apply_rotary_embedding() function, which we’ll
describe shortly. The additional novelty is balanced by the simplicity
from the fact that the layer is performing self-attention: we don’t need
to pass in different query, key, and value tensors (or reason about what
that means), since the same input serves all three roles. Note that the
conventional MultiHeadAttention() layer is covered quite thoroughly in
the 2nd Edition of Deep Learning with R,
including a full implementation of attention in base R.

To develop an understanding of the mechanics in a layer like this, it’s
helpful to temporarily unsee some of the minutia that can act as a fog
obscuring the essence of the operation. In this instance, if we
temporarily strip out the transpose()s and reshape()s (as clever and
vital as they are), this is what’s left:

call <- function(x) {
  # split input into three learned linear projections
  q <- x |> self$wq()
  k <- x |> self$wk()
  v <- x |> self$wv()

  # rotate q,k to inject position information.
  # cross q,k to calculate an attention score for each token pair.
  # (scaling, masking, and softmax of the scores omitted here)
  scores <- rotate(q) %*% rotate(k)

  # adjust the 3rd projection with the attention scores
  output <- scores %*% v

  self$wo(output) # one more learned linear projection
}

Returning to the transpose()s and reshape()s, you can observe that
their purpose is to make it so that the attention calculations are
performed across n_heads independent subspaces, rather than in a
single larger space. The same reasoning drives this decision as that
driving usage of depthwise-separable convolutions in image models.
Empirically, for a fixed compute budget, factoring features into
independent subspaces performs better than doing the same core
operations in a single larger feature space. As with all things, there
is a balance to strike between n_heads (the number of subspaces) and
head_dim (the size of each subspace). The LLaMA authors have struck
the balance like this at the various model sizes:

lapply(c("7B", "13B", "30B", "65B"), \(size) {
  p <- read_json(weights_path("{size}/params.json"))
  with(p, list(llama_size = size,
               n_heads = n_heads,
               head_dim = dim %/% n_heads))
}) |> dplyr::bind_rows()

# A tibble: 4 × 3
  llama_size n_heads head_dim
  <chr>        <int>    <int>
1 7B              32      128
2 13B             40      128
3 30B             52      128
4 65B             64      128

Next, let’s turn our attention to the causal attention mask.

make_mask <- function(seqlen, dtype = k_floatx()) {
  x <- tf$range(seqlen)
  # -Inf wherever the row (query) position is before the column (key) position
  mask <- tf$where(x[, tf$newaxis] < x[tf$newaxis, ],
                   tf$constant(-Inf, dtype = dtype),
                   tf$constant(0, dtype = dtype))

  # broadcast over batch and heads dim
  mask[tf$newaxis, tf$newaxis, , ] # (1, 1, seqlen, seqlen)
}

The mask is a strictly upper triangular matrix filled with -Inf
values. Adding the mask to the attention scores prevents the model from
being able to “look ahead” and see the attention score for a token
pairing it hasn’t seen yet at a particular position in the sequence.
This need for a mask is best thought of as a vestige from training,
an apparatus that the model needed to learn with and now can’t function
without. During training, gradients are calculated for predictions from
all token positions in a sequence, including predictions for tokens
where the correct answer is right there, as the very next token in the
same sequence. The mask prevents the model from being able to cheat and
look ahead into the future, something it won’t be able to do once we’re
running it for inference.

make_mask(seqlen = 5L)

tf.Tensor(
[[[[  0. -inf -inf -inf -inf]
   [  0.   0. -inf -inf -inf]
   [  0.   0.   0. -inf -inf]
   [  0.   0.   0.   0. -inf]
   [  0.   0.   0.   0.   0.]]]], shape=(1, 1, 5, 5), dtype=float32)
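
To see what the mask does to the attention weights, here is a minimal base R sketch of single-head causal attention, without batching, multiple heads, or rotary embeddings. The helper names (softmax_rows, make_mask_r, causal_attention) and the random inputs are made up for illustration.

```r
# Row-wise softmax; exp(-Inf) is 0, so masked positions get zero weight.
softmax_rows <- function(x) {
  e <- exp(x - apply(x, 1, max))
  e / rowSums(e)
}

# Strictly upper triangular mask of -Inf, like make_mask() but in base R.
make_mask_r <- function(seqlen) {
  mask <- matrix(0, seqlen, seqlen)
  mask[upper.tri(mask)] <- -Inf
  mask
}

# Single-head causal self-attention on (seqlen, head_size) matrices.
causal_attention <- function(q, k, v) {
  scores <- (q %*% t(k)) / sqrt(ncol(q))
  weights <- softmax_rows(scores + make_mask_r(nrow(q)))
  list(weights = weights, output = weights %*% v)
}

set.seed(42)
q <- matrix(rnorm(5 * 4), nrow = 5)
k <- matrix(rnorm(5 * 4), nrow = 5)
v <- matrix(rnorm(5 * 4), nrow = 5)
res <- causal_attention(q, k, v)
round(res$weights, 2)  # upper triangle is all zeros; each row sums to 1
```

After the softmax, every -Inf entry becomes a weight of exactly zero, so the output at each position is a weighted average of values at that position and earlier ones only.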

Rotary Position Embedding

Next, let’s turn our attention to apply_rotary_embedding(). This core
innovation was published by Su et al. (2022) in the paper titled
“RoFormer: Enhanced Transformer with Rotary Position Embedding”.

Some context:

The bare Attention() mechanism doesn’t leave any possibility for a
token’s position in a sequence to affect the attention scores, since
only token-pairs are scored. Attention treats its input like a
bag-of-tokens.

The position of a token in a sequence is clearly important, and the
attention layer should have access to that information.

The absolute position of a token in a sequence is less important
than the relative position between tokens. (Especially so for long
sequences).

Which leads us into the complex plane. If we imagine the features as
complex numbers, we can rotate them, and we can calculate angles between
them. From the RoFormer paper:

“Specifically, incorporating the relative position embedding is
straightforward: simply rotate the affine-transformed word embedding
vector by amount of angle multiples of its position index and thus
interprets the intuition behind Rotary Position Embedding”

Expanding slightly: the rotation matrix is designed so that
subsequently, after rotating our q and k token sequence embedding
the same way, the angle between token features is a function of the
relative distance between those tokens in the token sequence. The
relative angle between two tokens is invariant to the absolute
position of those tokens in the full sequence.

In short, the rotation injects positional information. The meaning or
interpretability of that positional information, or how it is meant to
be used, or even extracted from the result of q %*% k, is left to the
model to learn.

Here is the code:

apply_rotary_embedding <- function(x) {
  c(., seqlen, ., head_size) %<-%
    tf$unstack(tf$shape(x))

  rotation_matrix <- compute_rotation_matrix(seqlen, head_size)

  x %>%
    view_as_complex() %>%
    { . * rotation_matrix } %>%
    view_as_real()

}

compute_rotation_matrix <-
  function(seqlen, feature_dim, theta = 10000) {
    # `feature_dim` here is going to be attention$head_size
    # `seqlen` is going to match the token sequence length.

    t <- tf$range(seqlen, dtype = tf$float32)
    freqs <- tf$range(start = 0, limit = 1, delta = 1 / (feature_dim %/% 2),
                      dtype = tf$float32)
    tf_assert(tf$size(freqs) == feature_dim %/% 2)
    freqs <- 1.0 / (theta ^ freqs)

    # outer product; (seqlen, head_size/2)
    freqs <- tf$einsum('a,b->ab', t, freqs)

    rot_mat <- tf$complex(tf$cos(freqs), tf$sin(freqs))

    # the positional embedding will be broadcast across batch and heads dim
    rot_mat[tf$newaxis, , tf$newaxis, ] #(1, seqlen, 1, headdim/2)
  }

view_as_complex <- function(x) {
  # adjacent pairs of floats become (real, imaginary)
  tf$complex(x[all_dims(), `::2`],
             x[all_dims(), `2::2`])
}

view_as_real <- function(x) {
  # xs = (..., f);  xs2 = (..., f*2)
  xs <- tf$shape(x)
  xs2 <- tf$concat(list(xs[1:(length(xs)-1)],
                        xs[length(xs), drop = FALSE] * 2L),
                   axis = 0L)

  x2 <- tf$stack(list(Re(x), Im(x)), axis = -1L)

  # (..., f, 2) -> (..., f*2)
  tf$reshape(x2, xs2)
}

As you can see, to imagine the embedding features as existing in the
complex plane, we merely treat adjacent pairs of floats in the
underlying array as the real and imaginary part of a complex number. We
rotate the embeddings in the complex plane, then go back to imagining
the features as existing in the real plane. Again, the job of
interpreting the meaning of the features after rotation is left to the
model to learn.

We can quickly confirm that the rotary embeddings only rotate features
and don’t scale them:

near <- function (x, y, tol = 1e-6) abs(x - y) < tol
all(near(1, Mod(compute_rotation_matrix(2048L, 128L))))

tf.Tensor(True, shape=(), dtype=bool)
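
The claim that rotated scores depend only on relative position can also be checked with base R complex numbers. This is a minimal sketch under simplifying assumptions: a single token pair, a made-up head size of 8, and hypothetical helpers rope_freqs(), rotate(), and score(). The frequencies follow the same formula as compute_rotation_matrix().

```r
# Rotation frequencies per feature pair, as in compute_rotation_matrix().
rope_freqs <- function(feature_dim, theta = 10000) {
  half <- feature_dim %/% 2
  1 / theta^((seq_len(half) - 1) / half)
}

# Rotate a complex feature vector `x` to sequence position `pos`.
rotate <- function(x, pos, freqs) x * exp(1i * pos * freqs)

# Dot product of the underlying real feature vectors.
score <- function(q, k) sum(Re(q * Conj(k)))

set.seed(1)
freqs <- rope_freqs(8)
q <- complex(real = rnorm(4), imaginary = rnorm(4))
k <- complex(real = rnorm(4), imaginary = rnorm(4))

# Same relative offset (3), different absolute positions.
s1 <- score(rotate(q, 5, freqs),   rotate(k, 2, freqs))
s2 <- score(rotate(q, 105, freqs), rotate(k, 102, freqs))
all.equal(s1, s2)

# Rotation leaves magnitudes untouched.
all.equal(Mod(rotate(q, 7, freqs)), Mod(q))
```

Multiplying q by exp(i·m·θ) and the conjugate of k by exp(-i·n·θ) leaves a factor of exp(i·(m − n)·θ), so only the offset m − n survives in the score.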

There is one more trick to observe before moving on: because of some of
the mathematical properties of the rotation matrix, it’s possible to
avoid doing a full complex multiply operation and still arrive at the
same result. Also, since the rotation matrix never changes, it makes
sense to only compute it once and cache it, like so:

precomputed_rotation_matrix <- compute_rotation_matrix(
  seqlen = 2048L, # LLaMA max seqlen
  feature_dim = with(params, dim %/% n_heads)  # head_size
)

apply_rotary_embedding_faster <- function(x) {

  # (x1, x2) -> (-x2, x1) for every adjacent pair of features
  rotate_every_two <- function(x) {
    x1 <- x[all_dims(), `::2`]
    x2 <- x[all_dims(), `2::2`]
    x_ <- tf$stack(list(-x2, x1), axis = -1L)
    tf$reshape(x_, tf$shape(x))
  }

  repeat_each_twice <- function(x) {
    tf$`repeat`(x, 2L, axis = -1L)
  }

  seqlen <- tf$shape(x)[2]
  rot <- precomputed_rotation_matrix[, NA:seqlen, , ]

  cos <- Re(rot) |> repeat_each_twice()
  sin <- Im(rot) |> repeat_each_twice()

  (x * cos) + (rotate_every_two(x) * sin)
}

rand <- tf$random$uniform(shape(3, 8, params$n_heads, 128))
all(apply_rotary_embedding(rand) ==
    apply_rotary_embedding_faster(rand))

tf.Tensor(True, shape=(), dtype=bool)

apply_rotary_embedding <- apply_rotary_embedding_faster
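
The shortcut works because multiplying (a + bi) by (cos + i·sin) gives (a·cos − b·sin) + i·(b·cos + a·sin), which needs only real arithmetic. A minimal base R sketch of that equivalence on a single feature vector follows; interleave(), rotate_complex(), and rotate_real() are hypothetical helper names, and the angles are computed for a made-up position 3 and head size 8.

```r
# Interleave two equal-length vectors: (a1, b1, a2, b2, ...).
interleave <- function(a, b) as.vector(rbind(a, b))

# Reference: treat adjacent pairs as complex numbers and multiply.
rotate_complex <- function(x, angles) {
  odd <- seq(1, length(x), by = 2)
  z <- complex(real = x[odd], imaginary = x[odd + 1]) * exp(1i * angles)
  interleave(Re(z), Im(z))
}

# Shortcut: real arithmetic only, same idea as apply_rotary_embedding_faster().
rotate_real <- function(x, angles) {
  odd <- seq(1, length(x), by = 2)
  rotated <- interleave(-x[odd + 1], x[odd])  # (-x2, x1) for every pair
  x * rep(cos(angles), each = 2) + rotated * rep(sin(angles), each = 2)
}

set.seed(1)
x <- rnorm(8)
angles <- 3 / 10000^((0:3) / 4)  # position 3, head_size 8
all.equal(rotate_complex(x, angles), rotate_real(x, angles))
```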

Finally, note that the rotary positional embeddings are applied within
each Attention layer. This is different from the original Transformer
implementation, where a positional embedding was only added once at the
head of the model. Similar to residual connections, you can think of the
presence of these repeated injections of positional information as
relieving the remaining trainable layers from the burden of allocating
some of their weights to the task of “passing through” or “preserving”
the positional information for later layers.

Positional embeddings are a rich subject that also comes up in other
deep learning architectures, like denoising diffusion
(Falbel and Keydana 2023),
so time spent understanding them better is time well
spent. For the purposes of this blog post we’ve covered the points
needed and we’ll move on to tying all pieces together. To go deeper and
develop a more mathematically informed understanding of RoPE, two
excellent starting points are:

The original paper by Su et al. (2022)
This blog post by
Biderman et al. (2021)

Tying it all together

With Tokenizer, Embedding, TransformerBlock (RMSNorm,
Attention, FeedForward, and apply_rotary_embedding) all covered,
it’s time to tie all the pieces together into a Transformer model. We
could do this using %py_class% like with the other layers above, but
it’s just as easy to move over to using the Keras functional API at this
point.

layer_transformer_block <- create_layer_wrapper(TransformerBlock)
layer_rms_norm <- create_layer_wrapper(RMSNorm)

# input to the model will be output from the tokenizer
input <- layer_input(shape(NA)) #, dtype = "int32")

x <- input |>
  tok_embeddings()  # instantiated earlier in the blog-post

# stack the transformer blocks, one per layer
for(block_id in seq_len0(params$n_layers)) {
  x <- x |>
    layer_transformer_block(attn_head_size = params$dim %/% params$n_heads,
                            attn_n_heads = params$n_heads,
                            norm_eps = params$norm_eps,
                            block_id = block_id)
}

# final output projection into logits of output tokens
x <- x |>
  layer_rms_norm(block_id = -1, eps = params$norm_eps) |>
  layer_dense(
    tokenizer$vocab_size(), use_bias = FALSE,
    kernel_initializer = \(...) np$load(weights_path("7B/output.weight.npy"))$`T`
  )

# slice out the logits for the last token
with_options(c(tensorflow.extract.warn_negatives_pythonic = FALSE), {
  output <- x[, -1, ]
})

llama <- keras_model(input, output) %>%
  compile(jit_compile = TRUE)

next_token_probs <- prompt %>%
  tokenizer$tokenize() %>%
  llama()

next_token_probs

tf.Tensor(
[[-2.4503722e+00 -3.4463339e+00  1.3200411e+01 ...  4.8804146e-01
  -1.3277926e+00  9.9985600e-03]], shape=(1, 32000), dtype=float32)

Sampling strategies for selecting a token from the token logits are a
rich topic (also covered in the Deep Learning with
R book), but this blog post is long enough
already. So for now, let’s just take the argmax().

sampler <- \(logits) tf$argmax(logits, axis = -1L, output_type = "int32")

(next_token <- sampler(next_token_probs))

tf.Tensor([304], shape=(1), dtype=int32)

tokenizer$detokenize(next_token) |> as.character()

[1] "to"
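For readers curious what lies beyond argmax, here is a minimal base R sketch of three common ways to pick a token from a vector of logits: greedy, temperature sampling, and top-k sampling. It works on plain numeric vectors, not TensorFlow tensors, and the function names (`softmax()`, `sample_greedy()`, `sample_temperature()`, `sample_top_k()`) are illustrative assumptions, not part of the model code in this post.

```r
# Illustrative only: token selection on a plain numeric vector of logits.
# Returned values are 1-based R positions; real token ids (which start at 0)
# would need to be mapped accordingly.

softmax <- function(logits) {
  z <- exp(logits - max(logits))  # subtract the max for numerical stability
  z / sum(z)
}

# Greedy: always pick the position of the largest logit (what argmax does).
sample_greedy <- function(logits) which.max(logits)

# Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
sample_temperature <- function(logits, temperature = 1) {
  probs <- softmax(logits / temperature)
  sample.int(length(logits), size = 1, prob = probs)
}

# Top-k: sample only among the k highest-scoring positions.
sample_top_k <- function(logits, k = 3, temperature = 1) {
  top <- order(logits, decreasing = TRUE)[seq_len(k)]
  probs <- softmax(logits[top] / temperature)
  top[sample.int(k, size = 1, prob = probs)]
}

logits <- c(0.1, 2.0, -1.0, 1.5)
sample_greedy(logits)            # position of the largest logit
set.seed(1)
sample_temperature(logits, 0.7)  # random draw, biased towards large logits
sample_top_k(logits, k = 2)      # random draw among the two largest logits
```

Greedy decoding is deterministic, which is convenient for a walkthrough; the stochastic variants trade that determinism for more varied output.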

Let’s run it for a few tokens and let LLaMA finish the sentence:

prompt_tokens <- tokenizer$tokenize("The best way to attract bees")

for (i in 1:20) {

  next_token_probs <- prompt_tokens |> llama()
  next_token <- sampler(next_token_probs)

  # append the new token to the running sequence
  prompt_tokens %<>% { tf$concat(c(., next_token), axis = -1L) }

  # end of sentence
  if (as.logical(next_token == tokenizer$string_to_id(".")))
    break
}

prompt_tokens |>
  tokenizer$detokenize() |>
  as.character() |>
  strwrap(60) |> writeLines()

The best way to attract bees to your garden is to plant a
variety of flowers that bloom at different times.
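The shape of that generation loop can be tried without the model weights. The sketch below swaps `llama()` for a hypothetical stand-in, `toy_model()`, backed by a small hand-written table of bigram logits; the vocabulary, the table values, and the `generate_greedy()` helper are all made up for illustration.

```r
# Illustrative only: a toy "model" that scores the next token from the
# last token alone, using a hand-written table of logits.
vocab <- c("bees", "like", "flowers", ".")
bigram_logits <- matrix(
  c(0, 3, 1, 0,    # after "bees"
    0, 0, 3, 1,    # after "like"
    1, 0, 0, 3,    # after "flowers"
    0, 0, 0, 0),   # after "."
  nrow = 4, byrow = TRUE, dimnames = list(vocab, vocab)
)

# Stand-in for llama(): returns a named vector of next-token logits.
toy_model <- function(tokens) bigram_logits[tokens[length(tokens)], ]

# Greedy autoregressive loop: score, pick the best token, append, repeat,
# stopping early once the stop token is produced.
generate_greedy <- function(tokens, max_new = 20, stop_token = ".") {
  for (i in seq_len(max_new)) {
    next_token <- names(which.max(toy_model(tokens)))
    tokens <- c(tokens, next_token)
    if (next_token == stop_token) break
  }
  tokens
}

generate_greedy("bees")
# [1] "bees"    "like"    "flowers" "."
```

As in the real loop, the full token sequence is passed back into the model on every step, even though this toy model only looks at the last token.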

Wrapping up

In this blog post we’ve walked through the LLaMA architecture
implemented in R TensorFlow, including how to load pretrained weights,
and then run the model to generate a sentence. Note, much of the code in
this blog post is tailored for didactic purposes. While the
implementation of the LLaMA architecture covered in this blog post is
appropriate for training, there are a few modifications you’ll want to
make before doing a lot of text generation. These include things like:

- In the Attention layer, caching the k and v tensors. Then,
  after the first forward pass with the initial prompt, only feeding
  the model the one new token from the sampler(), rather than
  feeding the model all the tokens of the full prompt on each forward
  pass.
- Only generating the causal mask make_mask() and rotary_matrix
  slices once per forward pass, instead of within each Attention
  call.
- Updating the TransformerBlock to be cache-aware and to pass
  through the appropriate arguments to Attention().
- Wrapping all the additional book-keeping logic in a custom
  TransformerDecoder() class.

The changes required to implement these optimizations for inference
balloon the code size and are mostly about book-keeping, so we won’t go
through them in this blog post. However, you can find a fuller
implementation of LLaMA in R TensorFlow, including a cache-aware
generate() method that only feeds the model one token at a time during
the main inference loop (and compiles to XLA!),
here.
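To make the caching idea concrete, here is a minimal single-head sketch in base R using plain matrices, with no rotary embeddings or learned projections. It checks that attending one token at a time against cached keys and values gives the same result as full causal attention over the whole sequence. The names `attention_full()` and `attention_cached()` and the list-based cache are illustrative assumptions, not the API of the fuller implementation.

```r
# Illustrative only: single-head attention on plain matrices
# (rows = sequence positions, columns = features).

softmax_rows <- function(m) {
  m <- m - apply(m, 1, max)  # subtract each row's max for stability
  e <- exp(m)
  e / rowSums(e)
}

# Full causal self-attention: every position is recomputed on each call.
attention_full <- function(q, k, v) {
  scores <- (q %*% t(k)) / sqrt(ncol(q))
  scores[upper.tri(scores)] <- -Inf  # causal mask: no attending to the future
  softmax_rows(scores) %*% v
}

# Cached attention: only the new token's query is processed; its key and
# value are appended to the cache and reused on later steps.
attention_cached <- function(q_new, cache, k_new, v_new) {
  cache$k <- rbind(cache$k, k_new)
  cache$v <- rbind(cache$v, v_new)
  scores <- (q_new %*% t(cache$k)) / sqrt(ncol(q_new))
  list(out = softmax_rows(scores) %*% cache$v, cache = cache)
}

set.seed(1)
n <- 5; d <- 4
q <- matrix(rnorm(n * d), n, d)
k <- matrix(rnorm(n * d), n, d)
v <- matrix(rnorm(n * d), n, d)

full <- attention_full(q, k, v)

cache <- list(k = NULL, v = NULL)
incremental <- matrix(NA_real_, n, d)
for (i in seq_len(n)) {
  step <- attention_cached(q[i, , drop = FALSE], cache,
                           k[i, , drop = FALSE], v[i, , drop = FALSE])
  incremental[i, ] <- step$out
  cache <- step$cache
}

all.equal(full, incremental)
```

No mask is needed in the cached version because the cache only ever contains past and current positions, so each step does work proportional to the sequence length instead of its square.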

That’s all for now. Thanks for reading and happy travels to all
exploring this exciting LLM terrain!

Photo by Sébastien Goldberg on Unsplash

Biderman, Stella, Sid Black, Charles Foster, Leo Gao, Eric Hallahan, Horace He, Ben Wang, and Phil Wang. 2021. “Rotary Embeddings: A Relative Revolution.” blog.eleuther.ai/rotary-embeddings/.

Falbel, Daniel, and Sigrid Keydana. 2023. “Posit AI Blog: De-Noising Diffusion with Torch.” https://blogs.rstudio.com/tensorflow/posts/2023-04-13-denoising-diffusion/.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” https://arxiv.org/abs/2203.15556.

Shazeer, Noam. 2020. “GLU Variants Improve Transformer.” https://arxiv.org/abs/2002.05202.

Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2022. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” https://arxiv.org/abs/2104.09864.

Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” https://doi.org/10.48550/ARXIV.2302.13971.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.
