State-of-the-art NLP models from R

May 3, 2025


Introduction

The Transformers repository from “Hugging Face” contains a lot of ready-to-use, state-of-the-art models, which are straightforward to download and fine-tune with TensorFlow & Keras.

For this purpose, users usually need to get:

The model itself (e.g. BERT, ALBERT, RoBERTa, GPT-2, etc.)
The tokenizer object
The weights of the model

In this post, we will work on a classic binary classification task and train our dataset on 3 models: GPT-2, RoBERTa, and Electra.

However, readers should know that one can work with transformers on a variety of downstream tasks, such as:

feature extraction
sentiment analysis
text classification
question answering
summarization
translation and many more.

Prerequisites

Our first job is to install the transformers package via reticulate.

reticulate::py_install('transformers', pip = TRUE)

Then, as usual, load standard ‘Keras’, ‘TensorFlow’ >= 2.0 and some classic libraries from R.
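A minimal sketch of that setup. The package list is an assumption inferred from the functions used later in this post (`tfdatasets` for `tensor_slices_dataset`, `dplyr` for the `%>%` pipe), and binding the Python module to the name `transformer` is assumed because that is the name the later snippets call:

```r
# Packages the rest of the post appears to rely on (assumed list)
required = c("keras", "tensorflow", "tfdatasets", "dplyr", "reticulate")
installed = vapply(required, requireNamespace, logical(1), quietly = TRUE)

if (all(installed) && reticulate::py_module_available("transformers")) {
  library(keras)
  library(tensorflow)
  library(tfdatasets)
  library(dplyr)  # provides the %>% pipe
  # handle to the Python module, used as transformer$RobertaTokenizer etc.
  transformer = reticulate::import("transformers")
} else {
  message("Setup incomplete. R packages not found: ",
          paste(required[!installed], collapse = ", "))
}
```

The guard only keeps the snippet from failing on a machine where something is missing; in an interactive session the plain `library()` calls are enough.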

Note that if running TensorFlow on GPU one could specify the following parameters in order to avoid memory issues.

physical_devices = tf$config$list_physical_devices('GPU')
tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)

tf$keras$backend$set_floatx('float32')

Template

We already mentioned that to train on a specific model, users should download the model, its tokenizer object and weights. For example, to get a RoBERTa model one has to do the following:

# get Tokenizer
transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)

# get Model with weights
transformer$TFRobertaModel$from_pretrained('roberta-base')

Data preparation

A dataset for binary classification is provided in the text2vec package. Let’s load the dataset and take a sample for fast model training.
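Assuming the dataset meant here is `movie_review` (shipped with text2vec, with `review` and `sentiment` columns), loading it could look like the sketch below. The sample size of 2000 and the seed are arbitrary choices for illustration; the renamed columns `comment_text` and `target` match the names the later code reads.

```r
if (requireNamespace("text2vec", quietly = TRUE)) {
  # load the labelled movie reviews bundled with text2vec
  data("movie_review", package = "text2vec")

  set.seed(1)  # arbitrary seed, only to make the sample repeatable
  rows = sample.int(nrow(movie_review), 2000)

  # keep the text and the 0/1 label, renamed to what the later code expects
  df = movie_review[rows, c("review", "sentiment")]
  names(df) = c("comment_text", "target")
  str(df)
}
```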

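Split our data into 2 parts:

# draw 80% of the row indices at random for training
idx_train = sample.int(nrow(df), size = floor(nrow(df)*0.8))

train = df[idx_train,]
# negative indexing keeps the remaining 20% for testing
test = df[-idx_train,]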
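To check that an 80/20 split of this kind is disjoint and complete, here is a small self-contained sketch on a toy data frame. The toy rows and the seed are made up for illustration only:

```r
set.seed(42)  # arbitrary seed for a repeatable illustration

# toy stand-in with the same two columns the post uses
toy = data.frame(comment_text = paste("text", 1:10),
                 target = rep(0:1, 5),
                 stringsAsFactors = FALSE)

n_train = floor(nrow(toy) * 0.8)
idx = sample.int(nrow(toy), size = n_train)

toy_train = toy[idx, ]
toy_test = toy[-idx, ]   # every row not drawn for training

# the two parts share no rows and together cover the whole table
length(intersect(rownames(toy_train), rownames(toy_test)))
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```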

Data input for Keras

Until now, we have just covered data import and the train-test split. To feed input to the network we have to turn our raw text into indices via the imported tokenizer, and then adapt the model to do binary classification by adding a dense layer with a single unit at the end.
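The idea of turning text into fixed-length index vectors can be illustrated without downloading anything. This is a toy sketch: the whitespace tokenizer, the tiny vocabulary, and the pad id of 0 are all assumptions for illustration, whereas the real tokenizers use subword vocabularies.

```r
max_len = 5L
texts = c("the film was great",
          "a slow and rather dull film overall")

# toy vocabulary: every distinct word gets an integer id
vocab = unique(unlist(strsplit(texts, " ")))

# hypothetical encoder: look up ids, truncate, then right-pad with 0
encode = function(txt, max_length) {
  ids = match(strsplit(txt, " ")[[1]], vocab)
  ids = head(ids, max_length)
  c(ids, rep(0L, max_length - length(ids)))
}

# one row per text, one column per token position
input_ids = t(vapply(texts, encode, integer(max_len),
                     max_length = max_len, USE.NAMES = FALSE))
input_ids
```

Every row has exactly `max_len` columns, which is what lets the rows be stacked into a single integer matrix for the input layer.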

However, we want to train our data on 3 models: GPT-2, RoBERTa, and Electra. We need to write a loop for that.
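Looping over models means picking a class by name at run time. The sketch below shows two ways to do that in base R on a stand-in object: building the call as a string and evaluating it, or looking the element up with `[[`. The list `transformer_mock` and its functions are hypothetical placeholders, not part of the transformers library.

```r
# stand-in for the Python module: a plain list of "classes" (hypothetical)
transformer_mock = list(
  GPT2Tokenizer = list(
    from_pretrained = function(name) paste("GPT-2 tokenizer for", name)),
  RobertaTokenizer = list(
    from_pretrained = function(name) paste("RoBERTa tokenizer for", name))
)

# (model class, tokenizer class, checkpoint name)
spec = c("TFRobertaModel", "RobertaTokenizer", "roberta-base")

# 1) build the call as text, then parse and evaluate it
call_string = sprintf("transformer_mock$%s$from_pretrained('%s')",
                      spec[2], spec[3])
via_string = eval(parse(text = call_string))

# 2) look the class up by name, no code generation involved
via_lookup = transformer_mock[[spec[2]]]$from_pretrained(spec[3])

identical(via_string, via_lookup)
```

Both routes reach the same function; the lookup form avoids evaluating generated code, while the string form mirrors how the call would be typed by hand.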

Note: one model generally requires 500-700 MB

# list of 3 models: (model class, tokenizer class, pretrained checkpoint)
ai_m = list(
  c('TFGPT2Model',       'GPT2Tokenizer',       'gpt2'),
  c('TFRobertaModel',    'RobertaTokenizer',    'roberta-base'),
  c('TFElectraModel',    'ElectraTokenizer',    'google/electra-small-generator')
)

# parameters
max_len = 50L
epochs = 2
batch_size = 10

# create a list for model results
gather_history = list()

for (i in seq_along(ai_m)) {
  
  # tokenizer
  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
                         do_lower_case=TRUE)") %>% 
    rlang::parse_expr() %>% eval()
  
  # model
  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>% 
    rlang::parse_expr() %>% eval()
  
  # inputs
  text = list()
  # outputs
  label = list()
  
  # encode every row of `data` and stack the results into matrices
  data_prep = function(data) {
    for (i in seq_len(nrow(data))) {
      
      txt = tokenizer$encode(data[['comment_text']][i],max_length = max_len, 
                             truncation=TRUE) %>% 
        t() %>% 
        as.matrix() %>% list()
      lbl = data[['target']][i] %>% t()
      
      text = text %>% append(txt)
      label = label %>% append(lbl)
    }
    list(do.call(plyr::rbind.fill.matrix,text), do.call(plyr::rbind.fill.matrix,label))
  }
  
  train_ = data_prep(train)
  test_ = data_prep(test)
  
  # slice dataset
  tf_train = tensor_slices_dataset(list(train_[[1]],train_[[2]])) %>% 
    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>% 
    dataset_shuffle(128) %>% dataset_repeat(epochs) %>% 
    dataset_prefetch(tf$data$experimental$AUTOTUNE)
  
  tf_test = tensor_slices_dataset(list(test_[[1]],test_[[2]])) %>% 
    dataset_batch(batch_size = batch_size)
  
  # create an input layer
  input = layer_input(shape=c(max_len), dtype='int32')
  # average the token embeddings, then pass them through a small dense layer
  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis=1L) %>% 
    layer_dense(64,activation = 'relu')
  # create an output layer for binary classification
  output = hidden_mean %>% layer_dense(units=1, activation='sigmoid')
  model = keras_model(inputs=input, outputs = output)
  
  # compile with AUC score
  model %>% compile(optimizer= tf$keras$optimizers$Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
                    loss = tf$losses$BinaryCrossentropy(from_logits=FALSE),
                    metrics = tf$metrics$AUC())
  
  print(glue::glue('{ai_m[[i]][1]}'))
  # train the model
  history = model %>% keras::fit(tf_train, epochs=epochs, #steps_per_epoch=len/batch_size,
                validation_data=tf_test)
  # store the training history under the model's class name
  gather_history[[i]] = history
  names(gather_history)[i] = ai_m[[i]][1]
}
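The classification head in the loop averages the transformer's per-token outputs before the dense layers. The base R sketch below reproduces just that pooling step and a single sigmoid unit on random numbers. The shapes, weights, and seed are arbitrary assumptions, and the intermediate 64-unit ReLU layer is left out for brevity.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration
batch = 2; seq_len = 4; hidden = 3

# random stand-in for a last hidden state: [batch, seq_len, hidden]
hidden_states = array(rnorm(batch * seq_len * hidden),
                      dim = c(batch, seq_len, hidden))

# mean over the token axis gives one vector per example: [batch, hidden]
pooled = apply(hidden_states, c(1, 3), mean)

# single-unit sigmoid head with arbitrary, untrained weights
w = rep(0.1, hidden)
b = 0
prob = 1 / (1 + exp(-(pooled %*% w + b)))

dim(pooled)
dim(prob)
```

Averaging over tokens collapses a variable amount of per-token information into a fixed-size vector, which is what allows a plain dense layer to produce one probability per text.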

Reproduce in a Notebook

Extract results to see the benchmarks:
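One way to do this, as a minimal sketch: the hypothetical helper `extract_metrics` assumes `gather_history` is the named list filled by the training loop and that each element exposes a `metrics` list with one value per epoch, as Keras history objects do.

```r
# Hypothetical helper: flatten a named list of training histories into
# one data frame with a row per (model, metric) and a column per epoch.
extract_metrics = function(histories) {
  rows = lapply(names(histories), function(model_name) {
    metrics = histories[[model_name]][["metrics"]]
    values = do.call(rbind, metrics)   # metrics in rows, epochs in columns
    metric_names = rownames(values)
    rownames(values) = NULL
    colnames(values) = paste0("epoch_", seq_len(ncol(values)))
    data.frame(model = model_name, metric = metric_names, values,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# run only when the training loop has already produced its results
if (exists("gather_history")) print(extract_metrics(gather_history))
```

Keras may suffix repeated metric names (for example `auc_1`), so the `metric` column can need light cleaning before models are compared side by side.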

Both the RoBERTa and Electra models show some additional improvements after 2 epochs of training, which cannot be said of GPT-2. In this case, it is clear that it can be enough to train a state-of-the-art model even for a single epoch.

Conclusion

In this post, we showed how to use state-of-the-art NLP models from R.
To understand how to apply them to more complex tasks, it is highly recommended to review the transformers tutorial.

We encourage readers to try out these models and share their results below in the comments section!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/henry090/transformers, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Abdullayev (2020, July 30). Posit AI Blog: State-of-the-art NLP models from R. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/

BibTeX citation

@misc{abdullayev2020state-of-the-art,
  author = {Abdullayev, Turgut},
  title = {Posit AI Blog: State-of-the-art NLP models from R},
  url = {https://blogs.rstudio.com/tensorflow/posts/2020-07-30-state-of-the-art-nlp-models-from-r/},
  year = {2020}
}
