
A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text


In this coding implementation, we will build a Regression Language Model (RLM), a model that predicts continuous numerical values directly from text sequences. Instead of classifying or generating text, we focus on training a transformer-based architecture that learns quantitative relationships hidden within natural language descriptions. We start by generating synthetic text-to-number data, tokenizing it efficiently, and then training a lightweight Transformer encoder to map linguistic cues to real-valued targets. By the end, we not only understand how RLMs can be implemented from scratch but also visualize their learning behavior and test their generalization on unseen examples. Check out the FULL CODES here.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re


torch.manual_seed(42)
np.random.seed(42)

# Select the compute device once; the training and inference code below moves tensors to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


print("🚀 Regression Language Model (RLM) Tutorial")
print("=" * 60)

We begin by importing the essential libraries, such as PyTorch, NumPy, and Matplotlib, to build and visualize our Regression Language Model. We set random seeds for reproducibility and select the compute device, guaranteeing consistent results each time the tutorial is run. Check out the FULL CODES here.

def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data"""

    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]

    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))

    return data

We create a synthetic dataset that pairs natural language sentences with corresponding numerical values. By using varied templates such as temperatures, ratings, and percentages, we ensure the model learns diverse text–number relationships. This controlled setup helps us simulate realistic regression tasks without relying on external data. Check out the FULL CODES here.
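As a quick check (an illustrative snippet, not part of the original script), we can print a few generated pairs to see how each template scales its raw number into the target range:

# Preview three synthetic (text, target) pairs
preview = generate_synthetic_data(3)
for text, target in preview:
    print(f"{text!r} -> target {target:.3f}")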

class SimpleTokenizer:
    def __init__(self):
        # Index 0 is reserved for padding, index 1 for unknown words
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Build vocabulary from texts"""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))

        word_counts = Counter(words)
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Convert text to token indices"""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]

        # Pad with index 0 or truncate so every sequence has length max_len
        if len(indices) < max_len:
            indices = indices + [0] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices

We design a simple tokenizer to convert raw text into numerical tokens that the model can process. It builds a vocabulary from all unique words and maps each to an index, handling unknown words and padding automatically. This step ensures our textual inputs are transformed into consistent, machine-readable sequences for training. Check out the FULL CODES here.
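A minimal sanity check, assuming the class above: fit on a tiny corpus and encode one sentence into a fixed-length list of indices (unseen words map to the unknown index 1, and the rest is padded with 0).

# Fit on a tiny corpus and encode a sentence
demo_tokenizer = SimpleTokenizer()
demo_tokenizer.fit(["The temperature is 25.5 degrees"])
print(demo_tokenizer.encode("The temperature is 99.9 degrees", max_len=10))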

class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)


class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                 dropout=0.1, max_len=20):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)

        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape

        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)

        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed

        padding_mask = (x == 0)

        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)

        # Mean-pool over non-padded positions only
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        pooled = summed / mask_expanded.sum(dim=1)

        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)

        return output

We package our text–number pairs into a PyTorch Dataset, where we tokenize each sentence and return tensors ready for batching. We then build a Transformer-based RLM: token and positional embeddings flow through a multi-layer encoder, we mean-pool the non-padded tokens, and feed the result to a small MLP head for regression. In effect, we let the encoder learn numerical cues from language, while the head maps them to a single continuous value. Check out the FULL CODES here.
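Before training, a quick shape check (a hypothetical snippet, not in the original walkthrough) confirms that a batch of token indices produces one continuous prediction per example:

# Forward a random batch of non-padding token ids through an untrained model
demo_model = RegressionLanguageModel(vocab_size=50)
demo_tokens = torch.randint(1, 50, (4, 20))  # batch of 4 sequences, length 20
with torch.no_grad():
    demo_out = demo_model(demo_tokens)
print(demo_out.shape)  # expected: torch.Size([4, 1])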

def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses, val_losses = [], []

    print(f"\n📊 Training on {device}")
    print("-" * 60)

    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)

            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()

        val_loss /= len(val_loader)
        val_losses.append(val_loss)

        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

    return train_losses, val_losses

We train the model using Adam and MSE loss on a GPU, if available, iterating over mini-batches to backpropagate and update the weights. We switch to evaluation mode for validation at the end of each epoch, track training and validation losses, and print progress so we can see the learning dynamics in real time. Check out the FULL CODES here.
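If we also want an error metric in the original units, a small helper like the sketch below (an optional add-on, not part of the original training loop) computes the mean absolute error over a data loader; after training it can be called as evaluate_mae(model, val_loader).

def evaluate_mae(model, loader):
    """Optional helper: mean absolute error over a DataLoader (uses the global `device`)."""
    model.eval()
    total_err, n = 0.0, 0
    with torch.no_grad():
        for tokens, targets in loader:
            tokens, targets = tokens.to(device), targets.to(device)
            preds = model(tokens)
            total_err += (preds - targets).abs().sum().item()
            n += targets.numel()
    return total_err / n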

print("n📝 Producing artificial information...")
information = generate_synthetic_data(2000)
split_idx = int(0.8 * len(information))
train_data, val_data = information[:split_idx], information[split_idx:]
print(f"Practice samples: {len(train_data)}, Val samples: {len(val_data)}")


print("n🔤 Constructing tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.match([text for text, _ in train_data])
print(f"Vocabulary dimension: {tokenizer.vocab_size}")


train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)


print("n🏗️ Constructing Regression Language Mannequin...")
mannequin = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
print(f"Mannequin parameters: {sum(p.numel() for p in mannequin.parameters()):,}")


train_losses, val_losses = train_rlm(model, train_loader, val_loader)


plt.figure(figsize=(10, 4))
plt.plot(train_losses, label="Train Loss", linewidth=2)
plt.plot(val_losses, label="Val Loss", linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('RLM Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


print("n🎯 Testing Predictions:")
print("-" * 60)
test_examples = [
   "The temperature is 25.5 degrees",
   "I rate this 8.0 out of ten",
   "The price is 45.0 dollars",
   "75.0 percent complete"
]


with torch.no_grad():
    for text in test_examples:
        tokens = torch.tensor([tokenizer.encode(text)]).to(device)
        prediction = model(tokens).item()
        print(f"Input: {text}")
        print(f"Predicted value: {prediction:.4f}\n")


print("✅ RLM Tutorial Full!")

We generate and split the synthetic data, fit our tokenizer, wrap everything in PyTorch datasets and loaders, and build the Transformer-based RLM. We train the model, visualize the loss curves to verify learning, and then run a few natural-language test prompts to see the predicted continuous values. With that, we complete the end-to-end RLM pipeline.
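To reuse the trained model later, a minimal checkpointing sketch (file name and dictionary keys are placeholders, not part of the original tutorial) saves the weights together with the tokenizer vocabulary:

# Save model weights and the tokenizer vocabulary so predictions can be reproduced later
torch.save({"model_state": model.state_dict(), "word2idx": tokenizer.word2idx}, "rlm_checkpoint.pt")

# To reload: rebuild the model with the same vocab size, then restore the weights
# checkpoint = torch.load("rlm_checkpoint.pt")
# model.load_state_dict(checkpoint["model_state"])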

In conclusion, we successfully designed, trained, and evaluated a Regression Language Model capable of predicting continuous values from textual inputs. We observe how combining positional embeddings, Transformer encoders, and a simple regression head enables the model to capture the numerical semantics embedded in language. By generating synthetic data, visualizing training progress, and testing predictions, we demonstrate how RLMs bridge the gap between language understanding and numerical reasoning.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
