Posit AI Blog: Introducing the text package

April 21, 2025

AI-based language analysis has recently gone through a “paradigm shift” (Bommasani et al., 2021, p. 1), thanks in part to a new technique referred to as the transformer language model (Vaswani et al., 2017; Liu et al., 2019). Companies including Google, Meta, and OpenAI have released such models, among them BERT, RoBERTa, and GPT, which have achieved unprecedentedly large improvements across most language tasks, such as web search and sentiment analysis. While these language models are accessible in Python, and for typical AI tasks through HuggingFace, the R package text makes HuggingFace and state-of-the-art transformer language models accessible as social scientific pipelines in R.

Introduction

We developed the text package (Kjell, Giorgi & Schwartz, 2022) with two objectives in mind:
To serve as a modular solution for downloading and using transformer language models. This, for example, includes transforming text to word embeddings as well as accessing common language model tasks such as text classification, sentiment analysis, text generation, question answering, translation and so on.
To provide an end-to-end solution that is designed for human-level analyses, including pipelines for state-of-the-art AI techniques tailored for predicting characteristics of the person that produced the language or eliciting insights about linguistic correlates of psychological attributes.

This blog post shows how to install the text package, transform text to state-of-the-art contextual word embeddings, use language analysis tasks, and visualize words in word embedding space.

Installation and setting up a Python environment

The text package sets up a Python environment to get access to the HuggingFace language models. The first time after installing the text package, you need to run two functions: textrpp_install() and textrpp_initialize().

# Install text from CRAN
install.packages("text")
library(text)

# Install the Python packages that text requires in a conda environment (with defaults)
textrpp_install()

# Initialize the installed conda environment
# save_profile = TRUE saves the settings so that you do not have to run textrpp_initialize() again after restarting R
textrpp_initialize(save_profile = TRUE)

See the extended installation guide for more information.

Transform text to word embeddings

The textEmbed() function is used to transform text to word embeddings (numeric representations of text). The model argument lets you set which language model to use from HuggingFace; if you have not used the model before, it will automatically download the model and the necessary files.

# Transform the text data to BERT word embeddings
# Note: To run faster, try something smaller: model = 'distilroberta-base'.
word_embeddings <- textEmbed(texts = "Hello, how are you doing?",
                             model = 'bert-base-uncased')
word_embeddings
comment(word_embeddings)

The word embeddings can now be used for downstream tasks such as training models to predict related numeric variables (e.g., see the textTrain() and textPredict() functions).
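To make the idea of predicting a numeric variable from embeddings concrete, here is a minimal base-R sketch. It is only an illustration under stated assumptions: the embeddings and outcome are simulated, fit_ridge() is a hypothetical helper written for this example, and none of this is meant to describe how textTrain() works internally.

```r
# Simulated stand-in for text-level embeddings: 60 texts, 8 dimensions
# (assumption: random numbers, not real language model output)
set.seed(1)
n <- 60
d <- 8
embeddings <- matrix(rnorm(n * d), nrow = n, ncol = d)

# Simulated numeric outcome that depends linearly on the embeddings
true_weights <- seq(-1, 1, length.out = d)
outcome <- as.vector(embeddings %*% true_weights) + rnorm(n, sd = 0.1)

# Hold out the last 20 texts for evaluation
train_idx <- 1:40
test_idx <- 41:60

# Hypothetical helper: closed-form ridge regression, (X'X + lambda * I)^-1 X'y
fit_ridge <- function(x, y, lambda = 1) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

weights <- fit_ridge(embeddings[train_idx, ], outcome[train_idx])
predicted <- as.vector(embeddings[test_idx, ] %*% weights)

# Correlation between predicted and observed scores on the held-out texts
held_out_r <- cor(predicted, outcome[test_idx])
held_out_r
```

Evaluating on held-out texts rather than on the training texts gives a fairer picture of how well the embeddings predict the outcome.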

(To get token-level and individual layer output, see the textEmbedRawLayers() function.)
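As a rough illustration of how token-level output relates to a single embedding per text, the sketch below averages made-up token vectors and compares two texts with cosine similarity. The numbers and the cosine_similarity() helper are assumptions for this example; real embeddings have hundreds of dimensions and the package may aggregate differently depending on its settings.

```r
# Hypothetical token embeddings: one row per token, one column per dimension
tokens_a <- rbind(c(1, 0, 2), c(3, 0, 0))  # text A has two tokens
tokens_b <- rbind(c(0, 1, 1), c(2, 1, 1))  # text B has two tokens

# Mean-aggregate the token embeddings into one vector per text
text_a <- colMeans(tokens_a)  # (2, 0, 1)
text_b <- colMeans(tokens_b)  # (1, 1, 1)

# Hypothetical helper: cosine similarity between two numeric vectors
cosine_similarity <- function(u, v) {
  sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

cosine_similarity(text_a, text_b)  # 3 / sqrt(15), about 0.775
```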

There are many transformer language models at HuggingFace that can be used for various language model tasks such as text classification, sentiment analysis, text generation, question answering, translation and so on. The text package comprises user-friendly functions to access these.

classifications <- textClassify("Hello, how are you doing?")
classifications
comment(classifications)

generated_text <- textGeneration("The meaning of life is")
generated_text

For more examples of available language model tasks, see for example textSum(), textQA(), textTranslate(), and textZeroShot() under Language Analysis Tasks.

Visualizing words in the text package is achieved in two steps: first with a function to pre-process the data, and second with a function to plot the words, including adjusting visual characteristics such as color and font size.
To demonstrate these two functions we use example data included in the text package: Language_based_assessment_data_3_100. We show how to create a two-dimensional figure with words that individuals have used to describe their harmony in life, plotted according to two different well-being questionnaires: the harmony in life scale and the satisfaction with life scale. So, the x-axis shows words that are related to low versus high harmony in life scale scores, and the y-axis shows words related to low versus high satisfaction with life scale scores.

# Embed the example data, averaging token embeddings into word-type embeddings
word_embeddings_bert <- textEmbed(Language_based_assessment_data_3_100,
                                  aggregation_from_tokens_to_word_types = "mean",
                                  keep_token_embeddings = FALSE)

# Pre-process the data for plotting
df_for_plotting <- textProjection(Language_based_assessment_data_3_100$harmonywords,
                                  word_embeddings_bert$text$harmonywords,
                                  word_embeddings_bert$word_types,
                                  Language_based_assessment_data_3_100$hilstotal,
                                  Language_based_assessment_data_3_100$swlstotal
)

# Plot the data
plot_projection <- textProjectionPlot(
  word_data = df_for_plotting,
  y_axes = TRUE,
  p_alpha = 0.05,
  title_top = "Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  p_adjust_method = "bonferroni",
  points_without_words_size = 0.4,
  points_without_words_alpha = 0.4
)
plot_projection$final_plot

Figure: Supervised Bicentroid Projection of Harmony in life words
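For intuition about what a projection onto an axis between two group centroids can look like, here is a simplified base-R sketch. It is an assumption-laden illustration suggested by the name "bicentroid projection": the word vectors, the low/high grouping, and the centering step are all made up for this example, and the actual computation in textProjection() may differ.

```r
# Hypothetical 2-D word embeddings (real embeddings have hundreds of dimensions)
word_vectors <- rbind(
  calm     = c(2, 1),
  peace    = c(3, 1),
  stress   = c(-2, 0),
  conflict = c(-3, 1)
)

# Assumed grouping: words mostly used by low vs. high scorers on a scale
low_words  <- c("stress", "conflict")
high_words <- c("calm", "peace")

# Unit-length direction from the low-group centroid to the high-group centroid
low_centroid  <- colMeans(word_vectors[low_words, , drop = FALSE])
high_centroid <- colMeans(word_vectors[high_words, , drop = FALSE])
direction <- high_centroid - low_centroid
direction <- direction / sqrt(sum(direction^2))

# Center words on the midpoint between the centroids, then take the dot product
midpoint <- (low_centroid + high_centroid) / 2
centered <- sweep(word_vectors, 2, midpoint)
projection <- as.vector(centered %*% direction)
names(projection) <- rownames(word_vectors)
projection
```

In this toy setup, words from the high group land on the positive side of the axis and words from the low group on the negative side, which is the kind of left-to-right ordering the x-axis of the figure conveys.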

This post demonstrates how to carry out state-of-the-art text analysis in R using the text package. The package aims to make it easy to access and use transformer language models from HuggingFace to analyze natural language. We look forward to your feedback and contributions toward making such models available for social scientific and other applications more typical of R users.

Bommasani et al. (2021). On the opportunities and risks of foundation models.
Kjell et al. (2022). The text package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.
Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Vaswani et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/OscarKjell/ai-blog, unless otherwise noted. Figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Kjell, et al. (2022, Oct. 4). Posit AI Blog: Introducing the text package. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/

BibTeX citation

@misc{kjell2022introducing,
  author = {Kjell, Oscar and Giorgi, Salvatore and Schwartz, H Andrew},
  title = {Posit AI Blog: Introducing the text package},
  url = {https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/},
  year = {2022}
}
