

Image by Author | Canva
We use large language models in many of our daily tasks. These models have been trained on billions of online documents and diverse datasets, making them capable of understanding, comprehending, and responding in human-like language. However, not all LLMs are built the same way. While the core idea remains similar, they differ in their underlying architectures, and these differences have a significant impact on their capabilities. For example, as seen across various benchmarks, DeepSeek excels at reasoning tasks, Claude performs well in coding, and ChatGPT stands out in creative writing.
In this article, I'll walk you through 7 popular LLM architectures to give you a clear overview, all in just as many minutes. So, let's get started.
1. BERT
Paper Link: https://arxiv.org/pdf/1810.04805
Developed by Google in 2018, BERT marked a significant shift in natural language understanding by introducing deep bidirectional attention in language modeling. Unlike earlier models that read text in a left-to-right or right-to-left manner, BERT uses a transformer encoder to consider both directions simultaneously. It is trained on two tasks: masked language modeling (predicting randomly masked words) and next-sentence prediction (determining whether one sentence logically follows another). Architecturally, BERT comes in two sizes: BERT Base (12 layers, 110M parameters) and BERT Large (24 layers, 340M parameters). Its structure relies solely on encoder stacks and includes special tokens like [CLS] to represent the entire sentence and [SEP] to separate two sentences. You can fine-tune it for tasks like sentiment analysis, question answering (such as SQuAD), and more. It was the first of its kind to truly capture the full meaning of sentences.
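To see the masked-language-modeling objective in action, here is a minimal sketch using the Hugging Face transformers library (assuming it is installed and the bert-base-uncased checkpoint can be downloaded); this is an illustration, not part of BERT itself.

```python
# Minimal sketch: masked language modeling with a pre-trained BERT
# (assumes the Hugging Face `transformers` library is installed).
from transformers import pipeline

# "fill-mask" loads a BERT-style encoder and predicts the [MASK] token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the whole sentence bidirectionally before scoring candidates.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Because the encoder sees both the left and right context of the mask, the top candidates are usually sensible completions of the whole sentence rather than just continuations of its prefix.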
2. GPT
Paper Link (GPT-4): https://arxiv.org/pdf/2303.08774
The GPT (Generative Pre-trained Transformer) family was introduced by OpenAI. The series began with GPT-1 in 2018 and evolved to GPT-4 by 2023, with GPT-4o, released in May 2024, showcasing multimodal capabilities and handling both text and images. The models are pre-trained on very large text corpora with a standard next-token prediction language modeling objective: at each step, the model predicts the next word in a sequence given all previous words. After this unsupervised pre-training stage, the same model can be fine-tuned on specific tasks or used in a zero-/few-shot manner with minimal additional parameters. The decoder-only design means GPT attends only to previous tokens, unlike BERT's bidirectional encoder. What was notable at introduction was the sheer scale and capability of GPT: as each successive generation (GPT-2, GPT-3) grew larger, the models demonstrated very fluent text generation and few-shot learning abilities, establishing the "pre-train and prompt/fine-tune" paradigm for large language models. However, the models are proprietary, with access typically provided via APIs, and their exact architectures, especially for recent versions, are not fully disclosed.
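As a rough illustration of this next-token objective (not OpenAI's actual code), the PyTorch sketch below trains a toy decoder-style block: the targets are the inputs shifted by one position, and a causal mask stops each position from attending to future tokens. The sizes are made up, and nn.TransformerEncoderLayer with a causal mask stands in for a proper decoder block.

```python
# Minimal sketch of the next-token prediction objective used by GPT-style
# decoder-only models (illustrative only; assumes PyTorch is installed).
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 8  # toy sizes for the example

# One Transformer block plus an embedding and an output head.
embed = nn.Embedding(vocab_size, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # pretend token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # predict the next token

# Causal mask: position i may only attend to positions <= i,
# which is what makes the block behave like a decoder.
causal_mask = nn.Transformer.generate_square_subsequent_mask(inputs.size(1))

hidden = block(embed(inputs), src_mask=causal_mask)
logits = head(hidden)                                # (1, seq_len-1, vocab)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss.item())
```

At generation time the same model is run repeatedly: sample a token from the logits, append it to the context, and predict again.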
3. LLaMA
LLaMA 4 Blog Link: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Paper Link (LLaMA 3): https://arxiv.org/abs/2407.21783
LLaMA, developed by Meta AI and first released in February 2023, is a family of open-source decoder-only transformer models. It ranges from 7 billion to 70 billion parameters, with the latest version, Llama 4, released in April 2025. Like GPT, LLaMA uses a decoder-only Transformer architecture (each model is an autoregressive Transformer) but with some architectural tweaks. For example, the original LLaMA models used the SwiGLU activation instead of GeLU, rotary positional embeddings (RoPE) instead of fixed ones, and RMSNorm in place of layer normalization. The LLaMA family was released in multiple sizes, from 7B up to 65B parameters in LLaMA 1 and later even larger in LLaMA 3, to make large-scale models more accessible. Notably, despite relatively modest parameter counts, these models performed competitively with much larger contemporaries: Meta reported that LLaMA's 13B model outperformed OpenAI's 175B GPT-3 on many benchmarks, and its 65B model was competitive with contemporaries like Google's PaLM and DeepMind's Chinchilla. LLaMA's open (though research-restricted) release spawned extensive community use; its key novelty was combining efficient training at scale with more open access to model weights.
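Those tweaks are easy to express in code. Below is a rough PyTorch sketch of RMSNorm and a SwiGLU feed-forward block in the spirit of the LLaMA papers; the hidden sizes are arbitrary, and this is not Meta's actual implementation.

```python
# Rough sketch of two LLaMA-style building blocks (not Meta's code):
# RMSNorm (no mean-centering, unlike LayerNorm) and a SwiGLU feed-forward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the root-mean-square of the features, then a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU-gated linear unit instead of a plain GeLU MLP.
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)                      # (batch, tokens, dim)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))
print(y.shape)                                   # torch.Size([2, 16, 512])
```

In the full model, each decoder layer wraps its attention and feed-forward sublayers with RMSNorm, and RoPE rotates the query/key vectors inside attention instead of adding fixed position embeddings.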
4. PaLM
PaLM 2 Technical Report: https://arxiv.org/abs/2305.10403
Paper Link (PaLM): https://arxiv.org/pdf/2204.02311
PaLM (Pathways Language Model) is a series of large language models developed by Google Research. The original PaLM (announced in 2022) was a 540-billion-parameter, decoder-only Transformer and is part of Google's Pathways system. It was trained on a high-quality corpus of 780 billion tokens across thousands of TPU v4 chips in Google's infrastructure, employing parallelism to achieve high hardware utilization. The model also uses multi-query attention to reduce memory bandwidth requirements during inference. PaLM is known for its few-shot learning capabilities, performing well on new tasks with minimal examples thanks to its huge and diverse training data, which includes webpages, books, Wikipedia, news, GitHub code, and social media conversations. PaLM 2, announced in May 2023, further improved multilingual, reasoning, and coding capabilities, powering applications like Google Bard and Workspace AI features.
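To make the multi-query attention idea concrete, here is a simplified PyTorch sketch (not Google's implementation): every query head gets its own projection, but all heads share a single key and value projection, which shrinks the key/value cache that has to be read at inference time.

```python
# Simplified sketch of multi-query attention (MQA), as used in PaLM:
# many query heads, but one shared key head and one shared value head.
# Illustrative only; shapes and sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)            # per-head queries
        self.k_proj = nn.Linear(dim, self.head_dim, bias=False)  # single shared key head
        self.v_proj = nn.Linear(dim, self.head_dim, bias=False)  # single shared value head
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).unsqueeze(1)   # (b, 1, t, head_dim), broadcast over heads
        v = self.v_proj(x).unsqueeze(1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)

x = torch.randn(1, 10, 256)
print(MultiQueryAttention(256, 8)(x).shape)  # torch.Size([1, 10, 256])
```

With only one key and one value head, the cached K/V tensors are a fraction of the size they would be in standard multi-head attention, which is exactly the memory-bandwidth saving the paragraph above refers to.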
5. Gemini
Gemini 2.5 Blog: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
Paper Link (Gemini 1.5): https://arxiv.org/abs/2403.05530
Paper Link (Gemini): https://arxiv.org/abs/2312.11805
Gemini is Google's next-generation LLM family (from Google DeepMind and Google Research), introduced in late 2023. Gemini models are natively multimodal, meaning they are designed from the ground up to handle text, images, audio, video, and even code in a single model. Like PaLM and GPT, Gemini is based on the Transformer, but its key features include massive scale, support for extremely long contexts, and (in Gemini 1.5) a Mixture-of-Experts (MoE) architecture for efficiency. For example, Gemini 1.5 ("Pro") uses sparsely activated expert layers (hundreds of expert sub-networks, with only a few active per input) to boost capacity without a proportional compute cost. The Gemini 2.5 series, released in March 2025, built upon this foundation with even deeper "thinking" capabilities. In June 2025, Google released Gemini 2.5 Flash and Pro as stable models and previewed Flash-Lite, its most cost-efficient and fastest version yet, optimized for high-throughput tasks while still supporting the million-token context window and tool integrations like search and code execution. The Gemini family comes in multiple sizes (Ultra, Pro, Nano) so it can run anywhere from cloud servers down to mobile devices. The combination of multimodal pretraining and MoE-based scaling makes Gemini a flexible, highly capable foundation model.
6. Mistral
Paper Link (Mistral 7B): https://arxiv.org/abs/2310.06825
Mistral is a French AI startup that released its first LLMs in 2023. Its flagship model, Mistral 7B (September 2023), is a 7.3-billion-parameter Transformer-based decoder model. Architecturally, Mistral 7B is similar to a GPT-style model but includes optimizations for inference: it uses grouped-query attention (GQA) to speed up self-attention and sliding-window attention to handle longer contexts more efficiently. In terms of performance, Mistral 7B outperformed Meta's Llama 2 13B and even gave strong results against 34B models, while being much smaller. Mistral AI released the model under an Apache 2.0 license, making it freely available for use. Its next major release was Mixtral 8×7B, a sparse Mixture-of-Experts (MoE) model featuring eight 7B-parameter expert networks per layer. This design helped Mixtral match or beat GPT-3.5 and LLaMA 2 70B on tasks like mathematics, coding, and multilingual benchmarks. In May 2025, Mistral released Mistral Medium 3, a proprietary mid-sized model aimed at enterprises. It delivers over 90% of the score of pricier models like Claude 3.7 Sonnet on standard benchmarks while reducing per-token cost dramatically (roughly $0.40 versus $3.00 for Sonnet on input). It supports multimodal tasks (text + images) and professional reasoning, and is available through an API or for on-prem deployment on as few as four GPUs. However, unlike earlier models, Medium 3 is closed-source, prompting community criticism that Mistral is moving away from its open-source ethos. Shortly after, in June 2025, Mistral released Magistral, its first model dedicated to explicit reasoning. The small version is open under Apache 2.0, while Magistral Medium is enterprise-only. Magistral Medium scored 73.6% on AIME 2024, with the small version scoring 70.7%, demonstrating strong math and logic skills across multiple languages.
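As a rough illustration of the sliding-window idea (not Mistral's actual code), the sketch below builds a causal attention mask in which each token can only look back over a fixed window of recent positions; the window size and sequence length here are arbitrary.

```python
# Rough sketch of a sliding-window causal attention mask, in the spirit of
# Mistral 7B: each token attends only to itself and the previous W-1 tokens.
# Illustrative only; the real model applies this inside every attention layer.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = attention allowed, False = masked out.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # no looking ahead
    recent = j > i - window                  # only the last `window` positions
    return causal & recent

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Each row has at most 3 ones, so attention cost and cache size per token
# stay bounded even as the context grows; stacked layers still let
# information propagate beyond the window.
```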
7. DeepSeek
Paper Link (DeepSeek-R1): https://arxiv.org/abs/2501.12948
DeepSeek is a Chinese AI company (a spin-off of High-Flyer AI, founded in 2023) that develops large LLMs. Its recent models (like DeepSeek-V3 and DeepSeek-R1) employ a highly sparsely activated Mixture-of-Experts Transformer architecture. In DeepSeek V3/R1, each Transformer layer has hundreds of expert sub-networks, but only a few are activated per token: instead of running every part of the model at once, the model routes each input to the handful of experts it needs (for example, 9 out of 257). This allows DeepSeek to have an enormous total model size (over 670 billion parameters) while only activating about 37 billion during each response, making it much faster and cheaper to run than a dense model of comparable size. Like other modern LLMs, it uses SwiGLU activations, rotary embeddings (RoPE), and advanced optimizations (including experimental FP8 precision during training) to make it more efficient. This aggressive MoE design lets DeepSeek achieve very high capability (comparable to much larger dense models) at lower compute cost. DeepSeek's models (released under open licenses) attracted attention for rivaling leading models like GPT-4 in multilingual generation and reasoning, all while significantly reducing training and inference resource requirements.
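The routing described above can be sketched in a few lines of PyTorch. The toy layer below (not DeepSeek's code) keeps one always-on shared expert plus a pool of routed experts, and for each token a small gating network activates only the top-k routed experts; the expert counts and expert networks are scaled way down for readability.

```python
# Toy sketch of a sparse Mixture-of-Experts layer in the spirit of
# DeepSeek-V3/R1: one shared expert is always used, and a router picks
# the top-k of the remaining experts per token. Sizes are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.shared = nn.Linear(dim, dim)                 # always-active expert
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)           # gating network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). The router scores every expert, keeps only top-k.
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)          # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize

        out = self.shared(x)                                    # shared expert path
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                chosen = idx[:, slot] == e          # tokens routed to expert e
                if chosen.any():
                    out[chosen] += weights[chosen, slot].unsqueeze(-1) * expert(x[chosen])
        return out

tokens = torch.randn(4, 16)          # 4 tokens, 16-dim hidden states
print(SparseMoE(16)(tokens).shape)   # torch.Size([4, 16])
```

All experts contribute parameters to the total model size, but each token only pays the compute cost of the shared expert plus its top-k choices, which is how a 670B+ model can run with roughly 37B active parameters per token.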
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.