Commit f1b1e8e
Parent(s): c946726

Add Julian 600M paper (EN+FR) with LaTeX sources

Files changed:
- .gitattributes            +1 -0
- JulianKrg_600M_paper.pdf  +3 -0
- README.md                 +119 -0
- julian_paper.tex          +984 -0
- julian_paper_fr.pdf       +3 -0
- julian_paper_fr.tex       +1028 -0
.gitattributes
CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
JulianKrg_600M_paper.pdf
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6b8fecfe13439aff4677a7d891d3ded5036c6c66e02995f9242d206a587568f4
+size 209504
README.md
ADDED

@@ -0,0 +1,119 @@
---
license: apache-2.0
language:
- en
- fr
tags:
- research-paper
- language-model
- julian
- jax
- tpu
- llama
- bilingual
- sft-analysis
- scaling-laws
---

# Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX

**Paper** by Julian Kerignard | February 2026

## Abstract

We present **Julian**, a family of decoder-only language models ranging from 100M to 600M parameters, trained entirely from scratch on up to **39 billion tokens** of bilingual English-French data (70%/30%) using JAX/Flax on Google Cloud TPU v4-32. Our largest model, Julian-600M, employs a modern transformer architecture with Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm, following the design principles of LLaMA.

Despite being trained on significantly fewer tokens than comparable models, Julian-600M achieves **53.5% on HellaSwag**, outperforming OPT-1.3B (41.5%), which has over twice the parameters and was trained on 8x more data. We further analyze supervised fine-tuning (SFT) dynamics, revealing a critical disconnect between training loss reduction and downstream task performance.

## Files

| File | Description |
|------|-------------|
| [`JulianKrg_600M_paper.pdf`](JulianKrg_600M_paper.pdf) | Full paper (English) |
| [`julian_paper_fr.pdf`](julian_paper_fr.pdf) | Full paper (French) |
| [`julian_paper.tex`](julian_paper.tex) | LaTeX source (English) |
| [`julian_paper_fr.tex`](julian_paper_fr.tex) | LaTeX source (French) |

## Key Results

### Pretraining Performance

| Model | Params | Tokens | HellaSwag | PIQA | LAMBADA |
|-------|--------|--------|-----------|------|---------|
| **Julian-600M** | **600M** | **39B** | **53.5%** | **66.8%** | **37.3%** |
| OPT-1.3B | 1.3B | 300B | 41.5% | 71.7% | 58.0% |
| GPT-2 XL | 1.5B | ~40B | 50.9% | 70.8% | 51.2% |
| Pythia-1B | 1B | 300B | 37.6% | 69.2% | 56.6% |
| BLOOM-560M | 560M | 350B | 37.1% | 64.5% | 36.5% |

Julian-600M outperforms OPT-1.3B on HellaSwag with **2x fewer parameters** and **7.7x fewer tokens**.

### Critical SFT Finding

Our paper provides a detailed analysis of supervised fine-tuning dynamics on 2.47M instruction-response pairs:

| Configuration | Steps | Epochs | Loss | HellaSwag | PIQA | WinoGrande |
|---------------|-------|--------|------|-----------|------|------------|
| Base model | - | - | - | 53.5% | 66.8% | 53.8% |
| SFT-30K | 30K | 0.66 | 1.86 | 53.2% | 66.5% | 53.8% |
| SFT-100K | 100K | 2.2 | 1.69 | 53.2% | 66.5% | 52.8% |

**Key insight**: Training loss decreases 9% between SFT-30K and SFT-100K, but benchmark performance stagnates or degrades. This reveals that **training loss is not a reliable proxy for SFT quality** — the model memorizes instruction patterns rather than improving generalization. We recommend limiting SFT to <1 epoch for datasets >1M examples and using held-out benchmarks as stopping criteria.

## Architecture

```
Decoder-only Transformer (600M parameters)
├── Layers: 18
├── Hidden: 1280
├── Heads: 16 (head_dim = 80)
├── FFN: 5120 (SwiGLU)
├── Vocab: 50,000 (SentencePiece)
├── Context: 2048 tokens
├── Position: RoPE (θ = 10000)
└── Norm: RMSNorm (pre-norm)
```

## Paper Contents

1. **Introduction** — Motivation and contributions
2. **Related Work** — Comparison with Pythia, OPT, LLaMA, TinyLlama
3. **Model Architecture** — Detailed design choices (RoPE, SwiGLU, RMSNorm)
4. **Training Infrastructure** — Multi-host TPU training with JAX, data pipeline, checkpointing
5. **Data** — Collection, cleaning, tokenization (Wikipedia, FineWeb-Edu, OSCAR, The Stack)
6. **Pretraining** — Hyperparameters, loss curves, scaling analysis
7. **Supervised Fine-Tuning** — SFT methodology, ChatML format, training dynamics
8. **Evaluation** — Benchmark results across 7 tasks with detailed comparisons
9. **SFT Analysis** — Critical findings on loss vs. benchmark divergence
10. **Conclusion** — Practical recommendations for efficient LLM training

## Models

All model weights are openly available:

| Model | Link |
|-------|------|
| Julian-600M Base (39B tokens) | [JulianKrgd/julian-600m-40b](https://huggingface.co/JulianKrgd/julian-600m-40b) |
| Julian-600M Instruct SFT-30K | [JulianKrgd/julian-600m-40b-instruct-sft30k](https://huggingface.co/JulianKrgd/julian-600m-40b-instruct-sft30k) |
| Julian-600M Instruct SFT-100K | [JulianKrgd/julian-600m-40b-instruct-sft100k](https://huggingface.co/JulianKrgd/julian-600m-40b-instruct-sft100k) |
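
The weights can be fetched directly from the Hub. A minimal sketch using `huggingface_hub` (assuming you only need the raw checkpoint files; swap in the SFT repo ids as needed):

```python
# Minimal sketch: download the Julian-600M base checkpoint from the Hub.
# Requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="JulianKrgd/julian-600m-40b")
print("Checkpoint files downloaded to:", local_dir)
```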

## Citation

```bibtex
@misc{kerignard2026julian,
  author = {Julian Kerignard},
  title  = {Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX},
  year   = {2026},
  url    = {https://huggingface.co/JulianKrgd/julian-600m-paper}
}
```

## License

Apache 2.0 — All paper content, LaTeX sources, and associated materials.

## Acknowledgments

- **Google TPU Research Cloud** for compute access
- **Hugging Face** for model hosting and open-source tools
- **JAX/Flax teams** for the ML framework
julian_paper.tex
ADDED

@@ -0,0 +1,984 @@
\documentclass[11pt,a4paper]{article}

% ============================================================================
% Packages
% ============================================================================
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{geometry}
\geometry{margin=1in}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{url}
\urlstyle{same}
\usepackage{natbib}
\usepackage{xcolor}
\usepackage{array}
\usepackage{float}
\usepackage{enumitem}
\usepackage{fancyvrb}
\usepackage{pgfplots}
\pgfplotsset{compat=1.18}

\hypersetup{
    colorlinks=true,
    linkcolor=blue!60!black,
    citecolor=blue!60!black,
    urlcolor=blue!60!black
}

% ============================================================================
% Title
% ============================================================================
\title{
    \textbf{Julian: Efficient Training of a Bilingual 600M Parameter \\
    Language Model on TPU with JAX}
}

\author{
    Julian Kerignard \\
    Independent Research \\
    \texttt{github.com/JulianKrgd} \\
    \texttt{huggingface.co/JulianKrgd}
}

\date{February 2026}

\begin{document}
\maketitle

% ============================================================================
% Abstract
% ============================================================================
\begin{abstract}
We present \textbf{Julian}\footnote{Models available on HuggingFace: \url{https://huggingface.co/JulianKrgd}}, a family of decoder-only language models ranging from 100M to 600M parameters, trained entirely from scratch on up to 39 billion tokens of bilingual English-French data using JAX/Flax on Google Cloud TPUs. Our largest model, Julian-600M, employs a modern transformer architecture with Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm, following the design principles of LLaMA. Despite being trained on significantly fewer tokens than comparable models, Julian-600M achieves 53.5\% normalized accuracy on HellaSwag, outperforming OPT-1.3B (41.5\%), which has over twice the parameters and was trained on 8$\times$ more data. We further fine-tune Julian-600M using supervised fine-tuning (SFT) on 2.47 million instruction-response pairs formatted with the ChatML template, producing instruction-following variants at 30K and 100K training steps. We provide a detailed account of our training infrastructure, data pipeline, and the challenges of multi-host TPU training with JAX. All model weights are released openly under the Apache 2.0 license on HuggingFace.
\end{abstract}

% ============================================================================
% 1. Introduction
% ============================================================================
\section{Introduction}

The rapid advancement of large language models (LLMs) has demonstrated remarkable capabilities in natural language understanding and generation \citep{brown2020language, chowdhery2023palm, touvron2023llama}. However, the training of such models typically requires enormous computational resources, often inaccessible to independent researchers and smaller organizations.

Recent work has shown that smaller language models, when trained with appropriate data and techniques, can achieve competitive performance on many benchmarks \citep{biderman2023pythia, zhang2022opt}. The Chinchilla scaling laws \citep{hoffmann2022training} further suggest that many models are undertrained relative to their size, and that optimal performance requires a careful balance between model size and training data volume.

In this work, we present \textbf{Julian}, a family of bilingual (English-French) language models trained from scratch using JAX/Flax on Google Cloud TPU v4-32 pods. Our contributions are:

\begin{enumerate}[leftmargin=*]
\item \textbf{Efficient training}: We train a 600M parameter model on 39B tokens that outperforms OPT-1.3B on HellaSwag despite using 2$\times$ fewer parameters and 8$\times$ fewer training tokens.
\item \textbf{Bilingual capability}: To the best of our knowledge, Julian is among the few openly released small language models trained from scratch on a mixture of English and French data (70\%/30\% ratio).
\item \textbf{Complete pipeline}: We describe the full training pipeline including data collection, tokenizer training, pre-training, supervised fine-tuning, and evaluation, providing a practical guide for training LLMs on TPU infrastructure.
\item \textbf{Open release}: All model weights, tokenizer, and training code are released under the Apache 2.0 license.
\end{enumerate}

% ============================================================================
% 2. Related Work
% ============================================================================
\section{Related Work}

\paragraph{Scaling Laws.}
\citet{kaplan2020scaling} established neural scaling laws showing power-law relationships between model size, dataset size, compute budget, and loss. \citet{hoffmann2022training} refined these findings with the Chinchilla scaling laws, demonstrating that many large models are significantly undertrained and that the optimal token-to-parameter ratio is approximately 20:1. Our Julian-600M model is trained on 39B tokens (65:1 ratio), exceeding the Chinchilla-optimal budget.

\paragraph{Open Language Models.}
GPT-2 \citep{radford2019language} pioneered the release of pre-trained language models, with sizes ranging from 124M to 1.5B parameters. OPT \citep{zhang2022opt} provided models from 125M to 175B parameters trained on 300B tokens with detailed training logs. Pythia \citep{biderman2023pythia} offered a suite of models from 70M to 12B parameters trained on 300B tokens from The Pile, specifically designed for studying model behavior during training. LLaMA \citep{touvron2023llama} introduced architectural improvements (RoPE, SwiGLU, RMSNorm) that have become standard in modern language models.

\paragraph{Small Language Models.}
TinyLlama \citep{zhang2024tinyllama} demonstrated that a 1.1B model trained on 3T tokens can achieve strong performance. MobileLLM \citep{liu2024mobilellm} explored architecture design for sub-billion parameter models. These works highlight the viability and growing interest in smaller, more efficient models.

\paragraph{Multilingual Models.}
While large multilingual models like mBERT \citep{devlin2019bert}, XLM-R \citep{conneau2020xlmr}, and BLOOM \citep{workshop2023bloom} cover many languages, few small models are specifically designed for bilingual English-French text generation from scratch.

% ============================================================================
% 3. Model Architecture
% ============================================================================
\section{Model Architecture}

Julian follows the LLaMA architecture \citep{touvron2023llama}: a decoder-only transformer with pre-normalization using RMSNorm \citep{zhang2019root}, SwiGLU feed-forward networks \citep{shazeer2020glu}, and Rotary Position Embeddings (RoPE) \citep{su2021roformer}. No bias terms are used in any linear projection.

\subsection{Architecture Details}

\begin{table}[h]
\centering
\caption{Julian model configurations. All models use RoPE ($\theta$=10000), SwiGLU, RMSNorm (pre-norm), and no bias terms.}
\label{tab:model_configs}
\begin{tabular}{lccc}
\toprule
\textbf{Parameter} & \textbf{Julian-100M} & \textbf{Julian-250M$^\dagger$} & \textbf{Julian-600M} \\
\midrule
Hidden size ($d_{\text{model}}$) & 640 & 1024 & 1280 \\
Layers ($L$) & 12 & 14 & 18 \\
Attention heads ($H$) & 10 & 16 & 20 \\
Head dimension ($d_h$) & 64 & 64 & 64 \\
FFN size ($d_{\text{ff}}$) & 2560 & 4096 & 5120 \\
Vocabulary size ($V$) & 50{,}000 & 50{,}000 & 50{,}000 \\
Context length & 2048 & 2048 & 2048 \\
Precision & bfloat16 & bfloat16 & bfloat16 \\
\bottomrule
\end{tabular}
\end{table}

\noindent{\small $^\dagger$ Julian-250M is currently in preparation and has not yet been trained.}

\paragraph{Rotary Position Embeddings (RoPE).}
We use RoPE \citep{su2021roformer} with base frequency $\theta = 10{,}000$. For each attention head, the query and key vectors are rotated by position-dependent angles:
\begin{equation}
f_{\theta}(x, m) = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \odot \begin{pmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \odot \begin{pmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{pmatrix}
\end{equation}
where $\theta_i = \theta^{-2i/d}$ and $m$ is the position index.
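
A minimal JAX sketch of this rotation applied to a query or key tensor (illustrative only, not the exact training code) is:

\begin{Verbatim}[fontsize=\small]
import jax.numpy as jnp

def apply_rope(x, positions, theta=10_000.0):
    # x: (..., seq, d) with d even; positions: (seq,) integer positions m.
    d = x.shape[-1]
    inv_freq = theta ** (-jnp.arange(0, d, 2) / d)     # theta_i
    angles = positions[:, None] * inv_freq[None, :]    # (seq, d/2)
    cos = jnp.repeat(jnp.cos(angles), 2, axis=-1)      # duplicated per pair
    sin = jnp.repeat(jnp.sin(angles), 2, axis=-1)
    x_even, x_odd = x[..., ::2], x[..., 1::2]
    rotated = jnp.stack([-x_odd, x_even], axis=-1).reshape(x.shape)
    return x * cos + rotated * sin
\end{Verbatim}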

\paragraph{SwiGLU Feed-Forward Network.}
Each transformer block uses a SwiGLU \citep{shazeer2020glu} feed-forward network:
\begin{equation}
\text{FFN}(x) = W_{\text{down}} \cdot (\text{SiLU}(W_{\text{gate}} x) \odot W_{\text{up}} x)
\end{equation}
where $W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$. The SwiGLU activation introduces an additional projection compared to standard FFNs but improves quality at equivalent compute.
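
A corresponding JAX sketch of the block (illustrative; bias-free projections as described above) is:

\begin{Verbatim}[fontsize=\small]
import jax

def swiglu_ffn(x, w_gate, w_up, w_down):
    # w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model); no bias terms.
    gate = jax.nn.silu(x @ w_gate)        # SiLU(W_gate x)
    return (gate * (x @ w_up)) @ w_down   # W_down (gate * W_up x)
\end{Verbatim}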

\paragraph{RMSNorm.}
We use Root Mean Square Layer Normalization \citep{zhang2019root} applied before each attention and feed-forward sub-layer (pre-norm architecture):
\begin{equation}
\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma
\end{equation}
where $\gamma$ is a learned scale parameter and $\epsilon = 10^{-6}$.
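
Equivalently, as a JAX sketch:

\begin{Verbatim}[fontsize=\small]
import jax.numpy as jnp

def rms_norm(x, gamma, eps=1e-6):
    # Scale by the reciprocal root-mean-square of the last dimension.
    rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma
\end{Verbatim}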

\subsection{Tokenizer}

We train a SentencePiece \citep{kudo2018sentencepiece} BPE tokenizer with a vocabulary of 50{,}000 tokens on a balanced sample of our training corpus. Key settings include:
\begin{itemize}[leftmargin=*]
\item Character coverage: 99.99\%
\item Byte fallback enabled (handles any UTF-8 input)
\item Special tokens: \texttt{<pad>} (0), \texttt{<unk>} (1), \texttt{<s>} (2), \texttt{</s>} (3), \texttt{<|code|>} (4), \texttt{<|endcode|>} (5), \texttt{<|im\_start|>} (6), \texttt{<|im\_end|>} (7)
\end{itemize}

The ChatML-style tokens (\texttt{<|im\_start|>} and \texttt{<|im\_end|>}) are included from the start of pre-training to support later instruction fine-tuning without vocabulary expansion.
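
A tokenizer with these settings can be reproduced approximately through the SentencePiece Python API; the sketch below is illustrative (the input file name is hypothetical, and the exact flags used for Julian may differ):

\begin{Verbatim}[fontsize=\small]
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_sample_en_fr.txt",     # hypothetical balanced EN/FR sample
    model_prefix="julian_tokenizer",
    model_type="bpe",
    vocab_size=50_000,
    character_coverage=0.9999,
    byte_fallback=True,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    user_defined_symbols=["<|code|>", "<|endcode|>",
                          "<|im_start|>", "<|im_end|>"],
)
\end{Verbatim}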

% ============================================================================
% 4. Training Data
% ============================================================================
\section{Training Data}

\subsection{Data Sources}

We curate a bilingual training corpus of approximately 39 billion tokens with a 70\% English / 30\% French ratio. Table~\ref{tab:data_sources} lists our data sources.

\begin{table}[H]
\centering
\caption{Training data composition for Julian-600M (39B tokens).}
\label{tab:data_sources}
\begin{tabular}{lccc}
\toprule
\textbf{Source} & \textbf{Languages} & \textbf{Tokens (approx.)} & \textbf{Quality} \\
\midrule
Wikipedia & EN + FR & 5.5B & High \\
OSCAR 2301 & EN + FR & 15B & Medium \\
FineWeb-Edu & EN & 8B & Very High \\
Project Gutenberg & EN + FR & 1B & High \\
The Stack (code) & Multi & 2B & High \\
\midrule
\textbf{Total} & & \textbf{$\sim$39B} & \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Data Processing Pipeline}

Our data processing pipeline consists of the following stages:

\begin{enumerate}[leftmargin=*]
\item \textbf{Download}: Raw data is obtained from HuggingFace datasets (OSCAR, FineWeb-Edu, The Stack), Wikipedia dumps, and Project Gutenberg mirrors.
\item \textbf{Cleaning}: Documents shorter than 100 characters or longer than 500K characters are removed. We enforce a minimum alphanumeric character ratio of 70\%.
\item \textbf{Deduplication}: MinHash Locality-Sensitive Hashing (LSH) with a Jaccard similarity threshold of 0.8 is used for near-duplicate removal.
\item \textbf{Language detection}: We use fastText language identification with a confidence threshold of 0.8 to ensure correct language labeling.
\item \textbf{Tokenization}: The cleaned corpus is tokenized using our SentencePiece tokenizer and packed into sequences of 2048 tokens.
\item \textbf{Sharding}: The tokenized data is split into 359 shards stored on Google Cloud Storage (GCS) for streaming during training.
\end{enumerate}
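
To illustrate stages 3--4 above, a simplified sketch of the near-duplicate and language filters using the \texttt{datasketch} and \texttt{fasttext} libraries follows; the production pipeline adds sharding, streaming, and error handling:

\begin{Verbatim}[fontsize=\small]
import fasttext
from datasketch import MinHash, MinHashLSH

lang_model = fasttext.load_model("lid.176.bin")   # fastText language-ID model
lsh = MinHashLSH(threshold=0.8, num_perm=128)     # Jaccard threshold 0.8

def keep_document(doc_id, text, expected_lang):
    # Language filter: keep confident predictions of the expected language only.
    labels, probs = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__" + expected_lang or probs[0] < 0.8:
        return False
    # Near-duplicate filter: MinHash over the document's word set.
    mh = MinHash(num_perm=128)
    for token in set(text.split()):
        mh.update(token.encode("utf-8"))
    if lsh.query(mh):                # an existing near-duplicate was found
        return False
    lsh.insert(doc_id, mh)
    return True
\end{Verbatim}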

% ============================================================================
% 5. Training Procedure
% ============================================================================
\section{Training Procedure}

\subsection{Infrastructure}

All training is conducted on Google Cloud TPU v4-32 pods (32 TPU chips across 4 hosts) provided through the TPU Research Cloud (TRC) program. We use the JAX \citep{bradbury2018jax} framework with Flax for model definition and Optax for optimization.

\subsection{Parallelism Strategy}

We employ Fully Sharded Data Parallelism (FSDP) \citep{xu2021gspmd} across the 32 TPU chips using JAX's \texttt{pmap} primitive. Model parameters are replicated across all devices, while the batch dimension is sharded. Gradient accumulation over 8 micro-steps yields an effective batch size of 1024 sequences. All computations use bfloat16 mixed precision \citep{micikevicius2018mixed} for both forward and backward passes, with optimizer states also stored in bfloat16.
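
In simplified form, one data-parallel training step looks as follows (a JAX sketch; the real step also handles gradient accumulation, RNG, and metrics):

\begin{Verbatim}[fontsize=\small]
import jax
import optax

def make_pmapped_step(loss_fn, optimizer):
    def train_step(params, opt_state, batch):
        loss, grads = jax.value_and_grad(loss_fn)(params, batch)
        # Average gradients across the 32 devices of the v4-32 pod.
        grads = jax.lax.pmean(grads, axis_name="batch")
        updates, opt_state = optimizer.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)
        return params, opt_state, jax.lax.pmean(loss, axis_name="batch")
    return jax.pmap(train_step, axis_name="batch", donate_argnums=(0, 1))
\end{Verbatim}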

\subsection{Optimizer and Schedule}

We use AdamW \citep{loshchilov2019decoupled} with the following configuration. The total compute budget for Julian-600M is approximately $1.4 \times 10^{20}$ FLOPs (estimated as $6 \times N \times D$ where $N = 600\text{M}$ parameters and $D = 39\text{B}$ tokens). Training was completed in approximately 21 days of wall-clock time on a single TPU v4-32 pod, achieving a Model FLOPs Utilization (MFU) of approximately 38\%.

\begin{table}[h]
\centering
\caption{Pre-training hyperparameters for Julian-600M.}
\label{tab:hyperparams}
\begin{tabular}{lc}
\toprule
\textbf{Hyperparameter} & \textbf{Value} \\
\midrule
Optimizer & AdamW \\
$\beta_1$, $\beta_2$ & 0.9, 0.95 \\
$\epsilon$ & $10^{-8}$ \\
Weight decay & 0.1 \\
Peak learning rate & $1.2 \times 10^{-3}$ \\
Minimum learning rate & $1.2 \times 10^{-4}$ (10\% of peak) \\
Warmup steps & 3{,}000 \\
Total steps & 300{,}000 \\
LR schedule & Cosine annealing \\
Gradient clipping & 1.0 (global norm) \\
Batch size (per device) & 4 \\
Gradient accumulation steps & 8 \\
Effective batch size & 1{,}024 \\
Sequence length & 2{,}048 \\
Tokens per step & $\sim$2.1M \\
Total tokens & $\sim$39B \\
Precision & bfloat16 \\
\bottomrule
\end{tabular}
\end{table}

We follow the Chinchilla cosine learning rate schedule \citep{hoffmann2022training}: linear warmup from 0 to the peak learning rate over 3{,}000 steps, followed by cosine decay to 10\% of the peak value. Optimizer states ($\mu$ and $\nu$) are stored in bfloat16 to reduce memory consumption by approximately 40\%.
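
In Optax, this configuration corresponds roughly to the following sketch (the full training script additionally handles gradient accumulation and bfloat16 optimizer states):

\begin{Verbatim}[fontsize=\small]
import optax

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1.2e-3,
    warmup_steps=3_000,
    decay_steps=300_000,
    end_value=1.2e-4,          # 10% of the peak learning rate
)
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=schedule, b1=0.9, b2=0.95,
                eps=1e-8, weight_decay=0.1),
)
\end{Verbatim}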

\subsection{Robustness}

Training on preemptible TPU instances requires robust checkpoint management. We implement:
\begin{itemize}[leftmargin=*]
\item \textbf{Asynchronous checkpointing} using Orbax, saving every 10{,}000 steps without blocking training.
\item \textbf{SIGTERM handler}: On preemption, an emergency checkpoint is written within the 30-second grace period.
\item \textbf{Health monitoring}: Automatic detection of NaN/Inf values in gradients and loss, with circuit-breaker logic for retries.
\item \textbf{Global synchronization}: JAX barrier synchronization before checkpoint writes to ensure multi-host consistency.
\end{itemize}
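
The preemption handling reduces to registering a SIGTERM handler that triggers a final save; a simplified sketch with Orbax is shown below (illustrative: the GCS path is hypothetical and Orbax call signatures vary across versions):

\begin{Verbatim}[fontsize=\small]
import signal
import orbax.checkpoint as ocp

checkpointer = ocp.AsyncCheckpointer(ocp.PyTreeCheckpointHandler())

def install_preemption_handler(get_state):
    # get_state() returns (step, train_state) for the most recent step.
    def on_sigterm(signum, frame):
        step, state = get_state()
        # Emergency checkpoint within the 30-second grace period.
        checkpointer.save(f"gs://julian-ckpts/step_{step}", state)
        checkpointer.wait_until_finished()
    signal.signal(signal.SIGTERM, on_sigterm)
\end{Verbatim}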

% ============================================================================
% 6. Supervised Fine-Tuning
% ============================================================================
\section{Supervised Fine-Tuning}

We perform supervised fine-tuning (SFT) on the pre-trained Julian-600M checkpoint (step 300{,}000) using a large instruction-following dataset.

\subsection{Instruction Dataset}

Our SFT dataset comprises 2.47 million instruction-response pairs drawn from multiple sources:

\begin{table}[H]
\centering
\caption{SFT dataset composition.}
\label{tab:sft_data}
\begin{tabular}{lcc}
\toprule
\textbf{Source} & \textbf{Examples (approx.)} & \textbf{Language} \\
\midrule
Stanford Alpaca & 52K & English \\
Databricks Dolly 15K & 15K & English \\
Code Alpaca & 20K & English \\
GPT4All-J & 20K & English \\
French instruction data & 15K+ & French \\
OpenHermes 2.5 (synthetic) & $\sim$900K & English \\
SlimOrca & $\sim$500K & English \\
Other open-source instruction data & $\sim$900K & Multilingual \\
\midrule
\textbf{Total} & \textbf{2.47M} & \\
\bottomrule
\end{tabular}
\end{table}

\subsection{ChatML Format}

All instruction data is formatted using the ChatML template \citep{openai2023chatml}:

\smallskip\noindent\begin{minipage}{\textwidth}
\begin{Verbatim}[fontsize=\small, vspace=0pt]
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
\end{Verbatim}
\end{minipage}
\smallskip\noindent During SFT, loss is computed only on assistant response tokens using a binary loss mask. System and user tokens receive zero loss weight, ensuring the model learns to generate responses rather than memorizing prompts.
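
Concretely, the masked objective can be written as follows (a JAX sketch, with \texttt{loss\_mask} equal to 1 on assistant tokens and 0 elsewhere):

\begin{Verbatim}[fontsize=\small]
import jax.numpy as jnp
import optax

def sft_loss(logits, labels, loss_mask):
    # logits: (batch, seq, vocab); labels, loss_mask: (batch, seq)
    token_loss = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    # Zero out system/user tokens; average over assistant tokens only.
    return jnp.sum(token_loss * loss_mask) / jnp.maximum(jnp.sum(loss_mask), 1)
\end{Verbatim}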

\subsection{SFT Hyperparameters}

\begin{table}[h]
\centering
\caption{SFT training hyperparameters.}
\label{tab:sft_hyperparams}
\begin{tabular}{lc}
\toprule
\textbf{Hyperparameter} & \textbf{Value} \\
\midrule
Base checkpoint & step 300{,}000 (39B tokens) \\
Learning rate & $2 \times 10^{-5}$ \\
Warmup steps & 1{,}000 \\
Batch size (effective) & 32--256 \\
Sequence length & 2{,}048 \\
Weight decay & 0.01 \\
Gradient clipping & 1.0 \\
\bottomrule
\end{tabular}
\end{table}

We train two SFT variants:
\begin{itemize}[leftmargin=*]
\item \textbf{SFT-30K}: 30{,}000 steps, approximately 2B tokens seen, final loss 1.86
\item \textbf{SFT-100K}: 100{,}000 steps, approximately 6.5B tokens seen ($\sim$2.2 epochs), final loss 1.69
\end{itemize}

An earlier variant, \textbf{Julian-600M-10B-Instruct-v0.1}, was fine-tuned from an intermediate pre-training checkpoint (step 170{,}000, $\sim$10B tokens) on a smaller instruction dataset ($\sim$185K examples). This variant serves as a baseline for comparison.

% ============================================================================
% 7. Evaluation
% ============================================================================
\section{Evaluation}

\subsection{Benchmark Suite}

We evaluate all Julian models on standard zero-shot benchmarks using the Language Model Evaluation Harness \citep{gao2023framework}:

\begin{itemize}[leftmargin=*]
\item \textbf{HellaSwag} \citep{zellers2019hellaswag}: Commonsense natural language inference (acc\_norm)
\item \textbf{PIQA} \citep{bisk2020piqa}: Physical intuition QA (acc)
\item \textbf{LAMBADA} \citep{paperno2016lambada}: Word prediction requiring broad context (acc, perplexity)
\item \textbf{ARC-Easy / ARC-Challenge} \citep{clark2018think}: Science question answering (acc / acc\_norm)
\item \textbf{WinoGrande} \citep{sakaguchi2020winogrande}: Commonsense coreference resolution (acc)
\item \textbf{BoolQ} \citep{clark2019boolq}: Yes/no question answering (acc)
\end{itemize}

\subsection{Evaluation Infrastructure}

Because standard lm-eval with HuggingFace models defaults to PyTorch on CPU when run on TPU VMs (no CUDA available), we implement a custom JAX-based evaluation wrapper that performs inference directly on TPU. This achieves approximately 5.8 items/second with batch size 48, completing the full evaluation suite ($\sim$72K requests) in approximately 3.5 hours on a single TPU v4-32 pod.
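
At its core, the wrapper scores each candidate continuation by its summed log-probability; in outline (a sketch omitting tokenization, padding, and request batching):

\begin{Verbatim}[fontsize=\small]
import jax
import jax.numpy as jnp

@jax.jit
def continuation_logprob(logits, tokens, continuation_mask):
    # logits: (seq, vocab) for the concatenated prompt+continuation;
    # token t is scored with the logits at position t-1 (next-token prediction).
    logprobs = jax.nn.log_softmax(logits[:-1], axis=-1)
    token_lp = jnp.take_along_axis(logprobs, tokens[1:, None], axis=-1)[:, 0]
    return jnp.sum(token_lp * continuation_mask[1:])
\end{Verbatim}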

% ============================================================================
% 8. Results
% ============================================================================
\section{Results}

\subsection{Julian Model Progression}

Table~\ref{tab:julian_results} presents the benchmark results across Julian model variants, illustrating the impact of additional pre-training and supervised fine-tuning.

\begin{table}[h]
\centering
\caption{Benchmark results (0-shot) for Julian model variants. Bold indicates best within Julian models for each benchmark.}
\label{tab:julian_results}
\begin{tabular}{lccccccc}
\toprule
\textbf{Model} & \textbf{HS} & \textbf{PIQA} & \textbf{LAM.} & \textbf{ARC-E} & \textbf{ARC-C} & \textbf{WG} & \textbf{BoolQ} \\
\midrule
Julian-600M Base & \textbf{53.5} & \textbf{66.8} & 37.3 & --- & --- & --- & --- \\
Julian-600M SFT-30K & 41.7 & \textbf{66.8} & \textbf{37.7} & 53.5 & \textbf{27.1} & \textbf{53.8} & 60.6 \\
Julian-600M SFT-100K & 41.6 & 66.6 & \textbf{37.7} & \textbf{53.8} & 26.7 & 52.8 & \textbf{60.8} \\
Julian-600M-10B-v0.1 & 42.7 & 66.2 & 34.6 & --- & --- & --- & --- \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{SFT Impact.} Supervised fine-tuning causes a notable drop in HellaSwag accuracy ($-$11.8 points), consistent with observations in other models where instruction tuning trades benchmark performance for instruction-following capability. Other benchmarks remain largely stable, with slight improvements in LAMBADA, ARC-Easy, and BoolQ.

\paragraph{SFT-30K vs SFT-100K.} The two SFT variants produce near-identical results, suggesting that 30K steps is sufficient for this dataset size. At 100K steps ($\sim$2.2 epochs), WinoGrande begins to degrade, likely due to overfitting.

\subsection{Comparison with Existing Models}

Table~\ref{tab:comparison} compares Julian-600M with publicly available models of similar or larger scale.

\begin{table}[h]
\centering
\caption{Comparison with existing models (0-shot). Julian-600M Base outperforms OPT-1.3B on HellaSwag despite 2$\times$ fewer parameters and 8$\times$ fewer training tokens.}
\label{tab:comparison}
\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccc}
\toprule
\textbf{Model} & \textbf{Params} & \textbf{Tokens} & \textbf{HS} & \textbf{PIQA} & \textbf{LAM.} & \textbf{ARC-E} & \textbf{ARC-C} & \textbf{WG} \\
\midrule
GPT-2 Small & 124M & 100B+ & 31.5 & --- & 46.0 & --- & --- & 50.4 \\
OPT-125M & 125M & 300B & 29.2 & 63.0 & 37.9 & 43.5 & 18.9 & 50.3 \\
OPT-350M & 331M & 300B & 32.0 & 64.4 & 45.2 & 44.0 & 20.7 & 52.3 \\
Pythia-410M & 405M & 300B & 33.3 & 66.8 & 50.5 & 50.4 & 21.3 & 53.0 \\
\midrule
\textbf{Julian-600M Base} & \textbf{600M} & \textbf{39B} & \textbf{53.5} & \textbf{66.8} & \textbf{37.3} & --- & --- & --- \\
\textbf{Julian-600M SFT-30K} & \textbf{600M} & \textbf{39B+2B} & \textbf{41.7} & \textbf{66.8} & \textbf{37.7} & \textbf{53.5} & \textbf{27.1} & \textbf{53.8} \\
\midrule
GPT-2 XL & 1{,}558M & 100B+ & 50.9 & 70.8 & 63.2 & --- & --- & 59.4 \\
Pythia-1B & 1B & 300B & 37.6 & 70.5 & 56.6 & 55.9 & 24.3 & 54.5 \\
OPT-1.3B & 1.3B & 300B & 41.5 & 71.7 & 57.9 & 57.0 & 23.4 & 59.5 \\
\bottomrule
\end{tabular}
}
\end{table}

\paragraph{Key Findings.}

\begin{itemize}[leftmargin=*]
\item \textbf{HellaSwag}: Julian-600M Base achieves 53.5\%, surpassing GPT-2~XL (50.9\%, 1.5B params), OPT-1.3B (41.5\%), and Pythia-1B (37.6\%). This is a remarkable result for a 600M model trained on only 39B tokens.
\item \textbf{PIQA}: Julian-600M matches Pythia-410M at 66.8\% and falls only slightly below models 2--3$\times$ larger.
\item \textbf{LAMBADA}: Julian-600M achieves 37.3\%, lower than similarly-sized models trained on more data. This likely reflects the smaller training corpus, as LAMBADA is particularly sensitive to the volume and diversity of training text.
\item \textbf{Token efficiency}: Julian-600M achieves its HellaSwag score with 39B tokens, while OPT and Pythia models were trained on 300B tokens (7.7$\times$ more).
\end{itemize}

\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
    xbar,
    bar width=7pt,
    width=0.88\textwidth,
    height=6cm,
    xlabel={HellaSwag (acc\_norm, \%)},
    ytick={0,1,2,3,4,5,6,7},
    yticklabels={
        {OPT-125M {\scriptsize(125M, 300B tok)}},
        {GPT-2 Small {\scriptsize(124M, 100B+ tok)}},
        {OPT-350M {\scriptsize(331M, 300B tok)}},
        {Pythia-410M {\scriptsize(405M, 300B tok)}},
        {Pythia-1B {\scriptsize(1B, 300B tok)}},
        {OPT-1.3B {\scriptsize(1.3B, 300B tok)}},
        {GPT-2 XL {\scriptsize(1.5B, 100B+ tok)}},
        {\textbf{Julian-600M} {\scriptsize\textbf{(600M, 39B tok)}}}
    },
    xmin=25, xmax=58,
    nodes near coords,
    nodes near coords style={font=\scriptsize, anchor=west},
    enlarge y limits=0.1,
    xmajorgrids=true,
    grid style={gray!20},
    y tick label style={font=\footnotesize},
]
\addplot[fill=gray!40, draw=gray!60] coordinates {
    (29.2,0) (31.5,1) (32.0,2) (33.3,3) (37.6,4) (41.5,5) (50.9,6) (53.5,7)
};
\end{axis}
\end{tikzpicture}
\caption{HellaSwag accuracy (acc\_norm) across models, sorted by score. Numbers in parentheses indicate parameter count and training data volume. Julian-600M achieves the highest score despite having fewer parameters and significantly less training data than most comparison models.}
\label{fig:hellaswag_comparison}
\end{figure}

% ============================================================================
% 9. Interpretation of Results
% ============================================================================
\section{Interpretation of Results}

This section provides an in-depth analysis of the results presented above, examining pre-training dynamics, the impact of SFT, and the saturation phenomena observed.

\subsection{Pre-training Progression}

The evolution of performance between the two pre-training checkpoints reveals sustained learning dynamics. Between the 10B token checkpoint (step 100{,}000) and the final 39B token checkpoint (step 300{,}000), we observe:

\begin{itemize}[leftmargin=*]
\item \textbf{HellaSwag}: 45.8\% $\rightarrow$ 53.5\% (+7.7 points)
\item \textbf{Loss}: 3.20 $\rightarrow$ 2.33 ($-$27\%)
\item \textbf{PIQA}: 67.6\% $\rightarrow$ 66.8\% ($-$0.8 point)
\item \textbf{LAMBADA}: 35.0\% $\rightarrow$ 37.3\% (+2.3 points)
\end{itemize}

The +7.7 point improvement on HellaSwag is particularly significant. This benchmark measures commonsense reasoning, and the continued improvement suggests that the model has not reached its maximum learning capacity at 39B tokens. The loss continuing to decrease substantially (from 3.20 to 2.33) confirms the absence of saturation: the model continues to learn effectively at each additional training step. PIQA remains stable, while LAMBADA shows a modest but encouraging improvement. Extrapolating this trajectory, continued training beyond 39B tokens would likely yield further gains, particularly on LAMBADA where Julian-600M remains behind models trained on 300B tokens.

\subsection{Impact of SFT on Benchmarks}

Supervised fine-tuning fundamentally transforms the model's behavior: from a text completer that statistically predicts the next token, it becomes an assistant capable of responding to structured instructions. This transformation has a measurable cost on benchmarks.

\paragraph{The HellaSwag sacrifice.} The most notable drop is on HellaSwag: $-$11.8 points (53.5\% $\rightarrow$ 41.7\%). This phenomenon is well documented in the literature \citep{ouyang2022training} and is explained by the very nature of SFT. HellaSwag measures the model's ability to naturally complete a text; however, SFT reorients the model toward producing responses in a specific conversational format (ChatML). The model partially ``unlearns'' free completion in favor of instruction following. This is an expected and generally accepted trade-off.

\paragraph{Reasoning stability.} In contrast, benchmarks measuring reasoning are remarkably stable after SFT:
\begin{itemize}[leftmargin=*]
\item \textbf{PIQA} stays at 66.8\% (identical to the base model), indicating that physical intuition is unaffected.
\item \textbf{WinoGrande} reaches 53.8\%, comparable to reference models of similar size.
\item \textbf{BoolQ} reaches 60.6\%, within the expected range for a 600M model.
\end{itemize}

These results suggest that SFT does not alter the model's underlying reasoning capabilities but primarily modifies the output distribution (the format of generated responses).

\paragraph{LAMBADA improvement.} Notably, LAMBADA slightly improves after SFT (+0.4 points, from 37.3\% to 37.7\%). This counterintuitive result can be explained by the fact that the instruction-response format encourages the model to better exploit provided context to produce a precise answer---exactly what LAMBADA measures (predicting a word from a long context).

\subsection{Over-SFT: Quantitative Analysis (30K vs 100K)}

The comparison between SFT-30K and SFT-100K constitutes one of the most instructive findings of this work. Table~\ref{tab:sft_delta} presents the detailed differences.

\begin{table}[H]
\centering
\caption{Detailed comparison between SFT-30K and SFT-100K. $\Delta$ represents the difference (100K $-$ 30K). SFT-100K uses 3.3$\times$ more compute for nearly identical results.}
\label{tab:sft_delta}
\begin{tabular}{lccc}
\toprule
\textbf{Benchmark} & \textbf{SFT-30K} & \textbf{SFT-100K} & \textbf{$\Delta$} \\
\midrule
Loss & 1.86 & 1.69 & $-$0.17 \\
HellaSwag & 41.7\% & 41.6\% & $-$0.1 \\
PIQA & 66.8\% & 66.6\% & $-$0.2 \\
LAMBADA & 37.7\% & 37.7\% & 0.0 \\
ARC-Easy & 53.5\% & 53.8\% & +0.3 \\
ARC-Challenge & 27.1\% & 26.7\% & $-$0.4 \\
WinoGrande & 53.8\% & 52.8\% & \textbf{$-$1.0} \\
BoolQ & 60.6\% & 60.8\% & +0.2 \\
\midrule
SFT tokens seen & 1.97B & 6.55B & --- \\
Epochs & 0.66 & 2.20 & --- \\
\bottomrule
\end{tabular}
\end{table}

\begin{figure}[t]
\centering
\begin{tikzpicture}
\begin{axis}[
    ybar=8pt,
    bar width=12pt,
    width=\textwidth,
    height=6.5cm,
    ylabel={Accuracy (\%)},
    symbolic x coords={HellaSwag, PIQA, LAMBADA},
    xtick=data,
    ymin=30, ymax=72,
    nodes near coords,
    nodes near coords style={font=\scriptsize, /pgf/number format/fixed, /pgf/number format/precision=1, anchor=south},
    legend style={at={(0.5,-0.15)}, anchor=north, legend columns=3, font=\small},
    enlarge x limits=0.35,
    ymajorgrids=true,
    grid style={gray!15},
]
\addplot[fill=blue!25, draw=blue!50] coordinates {
    (HellaSwag, 53.5) (PIQA, 66.8) (LAMBADA, 37.3)
};
\addplot[fill=orange!30, draw=orange!55] coordinates {
    (HellaSwag, 41.7) (PIQA, 66.8) (LAMBADA, 37.7)
};
\addplot[fill=red!20, draw=red!45] coordinates {
    (HellaSwag, 41.6) (PIQA, 66.6) (LAMBADA, 37.7)
};
\legend{Base 39B, SFT-30K (0.66 ep.), SFT-100K (2.2 ep.)}
\end{axis}
\end{tikzpicture}
\caption{Impact of supervised fine-tuning on benchmark performance. SFT causes a significant HellaSwag drop ($-$11.8 points) while preserving PIQA and slightly improving LAMBADA. SFT-30K and SFT-100K achieve near-identical results despite a 3.3$\times$ difference in compute, indicating clear saturation.}
\label{fig:sft_impact}
\end{figure}

\paragraph{Loss is not a good SFT quality indicator.} The most striking result is the disconnect between loss and benchmark performance. The loss drops significantly from 1.86 to 1.69 ($-$9\%), but benchmarks stagnate or degrade. This reveals that the model learns to better reproduce the \emph{format} of the SFT dataset responses (lower loss on response tokens) without improving its underlying \emph{knowledge} or \emph{reasoning} capabilities. In other words, the model becomes more fluent in the ChatML format without becoming more capable.

\paragraph{Overfitting signal: WinoGrande.} The degradation of WinoGrande from 53.8\% to 52.8\% ($-$1.0 point) is the clearest overfitting signal. WinoGrande tests commonsense reasoning on ambiguous pronoun resolution, a capability that should not degrade with additional training if the model were generalizing correctly. With 2.47M examples and 2.2 epochs, each example in the SFT dataset has been seen on average more than twice. The model begins to memorize dataset-specific patterns rather than generalize, which harms its general reasoning ability.

\paragraph{ARC-Challenge confirms the trend.} The drop in ARC-Challenge ($-$0.4 points) points in the same direction. This benchmark tests scientific reasoning on difficult questions, and its parallel degradation with WinoGrande reinforces the hypothesis of overfitting that specifically impacts reasoning capabilities.

\paragraph{Practical implication.} For a dataset of 2.47M examples with a batch size of 32, one epoch corresponds to 45{,}383 steps. SFT-30K (0.66 epochs) has not yet completed a full pass through the dataset but already achieves optimal performance. The additional compute of SFT-100K (3.3$\times$ more) is therefore largely wasted.

\subsection{Importance of the Base Checkpoint}

The comparison between the different fine-tuned variants reveals an apparent paradox:

\begin{itemize}[leftmargin=*]
\item \textbf{Instruct v0.1} (base 10B tokens, 5{,}500 SFT steps, 185K examples): HellaSwag = 42.7\%
\item \textbf{SFT-30K} (base 39B tokens, 30{,}000 SFT steps, 2.47M examples): HellaSwag = 41.7\%
\end{itemize}

The model fine-tuned from a weaker base (10B tokens) achieves a higher post-SFT HellaSwag (+1.0 point) than the one fine-tuned from the stronger base (39B tokens). Several factors may explain this result:

\begin{enumerate}[leftmargin=*]
\item \textbf{Different SFT datasets}: Instruct v0.1 uses 185K examples (likely of higher individual quality), while SFT-30K uses 2.47M examples (more diversity but potentially more noise). The quality of SFT examples has a direct impact on benchmark degradation.
\item \textbf{Different SFT duration}: 5{,}500 steps represent a much lighter SFT exposure than 30{,}000 steps, which preserves more of the base model's capabilities. With fewer steps, the model ``forgets'' less of its text completion abilities.
\item \textbf{Different loss surfaces}: The model at 10B tokens is in a different training regime (loss 3.20 vs 2.33), which may influence how SFT modifies the weights---a model with higher loss may be more ``malleable'' to SFT.
\end{enumerate}

This result underscores that post-SFT quality is not a simple function of the base checkpoint: the combination of base checkpoint, SFT dataset, and SFT duration forms a three-dimensional hyperparameter space that should be optimized jointly.

\subsection{Practical Recommendations}

Based on all of these observations, we formulate the following recommendations for fine-tuning small language models (under 1B parameters):

\begin{enumerate}[leftmargin=*]
\item \textbf{Limit SFT to less than 1 epoch}: For datasets on the order of millions of examples, 0.5--0.7 epochs appears optimal. Beyond that, the risk of overfitting increases with no measurable benefit on benchmarks.
\item \textbf{Monitor WinoGrande and ARC-Challenge}: These two benchmarks are the first to show signs of overfitting during SFT. A degradation of these metrics is a more reliable stopping signal than training loss.
\item \textbf{Do not trust loss for SFT quality}: Unlike pre-training, where loss is a reliable indicator of model quality, SFT loss primarily measures format compliance, not reasoning quality.
\item \textbf{Prefer diversity over volume}: A high-quality SFT dataset with diverse examples is preferable to a large noisy dataset trained over multiple epochs.
\item \textbf{Invest in pre-training}: The progression from 45.8\% to 53.5\% on HellaSwag shows that additional pre-training yields gains that far exceed those from increasing SFT.
\end{enumerate}
| 597 |
+
|
| 598 |
+
% ============================================================================
|
| 599 |
+
% 10. Analysis
|
| 600 |
+
% ============================================================================
|
| 601 |
+
\section{Analysis}
|
| 602 |
+
|
| 603 |
+
\subsection{Training Efficiency}
|
| 604 |
+
|
| 605 |
+
The strong HellaSwag performance of Julian-600M despite limited training data suggests that our architecture and training procedure are highly efficient. We hypothesize several contributing factors:
|
| 606 |
+
|
| 607 |
+
\begin{enumerate}[leftmargin=*]
|
| 608 |
+
\item \textbf{Modern architecture}: The combination of RoPE, SwiGLU, and RMSNorm (as in LLaMA) provides better inductive biases than the architectures used in GPT-2 and OPT (learned positional embeddings, standard FFN, LayerNorm).
|
| 609 |
+
\item \textbf{Data quality}: FineWeb-Edu and Wikipedia provide high-quality, factual training data, potentially offering more ``learning per token'' than noisier web crawls.
|
| 610 |
+
\item \textbf{Bilingual training}: Exposure to both English and French may provide cross-lingual transfer benefits, particularly for commonsense reasoning tasks.
|
| 611 |
+
\end{enumerate}
|
| 612 |
+
|
| 613 |
+
\begin{figure}[t]
|
| 614 |
+
\centering
|
| 615 |
+
\begin{tikzpicture}
|
| 616 |
+
\begin{axis}[
|
| 617 |
+
width=0.92\textwidth,
|
| 618 |
+
height=7cm,
|
| 619 |
+
xlabel={Training tokens},
|
| 620 |
+
ylabel={HellaSwag (acc\_norm, \%)},
|
| 621 |
+
xmode=log,
|
| 622 |
+
xmin=2e10, xmax=5e11,
|
| 623 |
+
ymin=25, ymax=58,
|
| 624 |
+
grid=both,
|
| 625 |
+
grid style={gray!15},
|
| 626 |
+
legend style={at={(0.97,0.97)}, anchor=north east, font=\small},
|
| 627 |
+
xtick={5e10, 1e11, 3e11},
|
| 628 |
+
xticklabels={50B, 100B, 300B},
|
| 629 |
+
]
|
| 630 |
+
\addplot[only marks, mark=*, mark size=2.5pt, gray!60] coordinates {
|
| 631 |
+
(3e11, 29.2)
|
| 632 |
+
(1e11, 31.5)
|
| 633 |
+
(3e11, 32.0)
|
| 634 |
+
(3e11, 33.3)
|
| 635 |
+
(3e11, 37.6)
|
| 636 |
+
(3e11, 41.5)
|
| 637 |
+
(1e11, 50.9)
|
| 638 |
+
};
|
| 639 |
+
\addplot[only marks, mark=*, mark size=3.5pt, black, fill=black!70] coordinates {
|
| 640 |
+
(3.9e10, 53.5)
|
| 641 |
+
};
|
| 642 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 29.2) {OPT-125M};
|
| 643 |
+
\node[font=\tiny, anchor=south east] at (axis cs:9.5e10, 31.5) {GPT-2 Small};
|
| 644 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 32.0) {OPT-350M};
|
| 645 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 33.3) {Pythia-410M};
|
| 646 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 37.6) {Pythia-1B};
|
| 647 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 41.5) {OPT-1.3B};
|
| 648 |
+
\node[font=\tiny, anchor=south east] at (axis cs:9.5e10, 50.9) {GPT-2 XL};
|
| 649 |
+
\node[font=\scriptsize, anchor=south west] at (axis cs:4.2e10, 53.5) {\textbf{Julian-600M}};
|
| 650 |
+
\legend{Other models, Julian (ours)}
|
| 651 |
+
\end{axis}
|
| 652 |
+
\end{tikzpicture}
|
| 653 |
+
\caption{Token efficiency: HellaSwag accuracy vs.\ training data volume. Julian-600M (upper left, 39B tokens) achieves the highest HellaSwag score with 7.7$\times$ less training data than the OPT and Pythia models (300B tokens). The filled black marker highlights Julian's position in the high-accuracy, low-data region.}
|
| 654 |
+
\label{fig:token_efficiency}
|
| 655 |
+
\end{figure}
|
| 656 |
+
|
| 657 |
+
\subsection{The HellaSwag Anomaly}
|
| 658 |
+
|
| 659 |
+
The HellaSwag score of 53.5\% for Julian-600M is remarkably high---surpassing even GPT-2~XL (50.9\%) which has 2.5$\times$ more parameters. Several hypotheses merit investigation:
|
| 660 |
+
|
| 661 |
+
\begin{itemize}[leftmargin=*]
|
| 662 |
+
\item \textbf{Architectural hypothesis}: Modern components (RoPE, SwiGLU, RMSNorm) may be particularly advantageous for text completion tasks measured by HellaSwag. The length-normalized scoring (acc\_norm) could also favor our architecture.
|
| 663 |
+
\item \textbf{Data quality hypothesis}: FineWeb-Edu's educational content may provide particularly relevant training signal for the commonsense scenarios tested by HellaSwag.
|
| 664 |
+
\item \textbf{Contamination hypothesis}: While we applied rigorous deduplication \citep{lee2022deduplicating}, we cannot fully exclude partial contamination with benchmark-adjacent data, particularly through FineWeb-Edu.
|
| 665 |
+
\end{itemize}
|
| 666 |
+
|
| 667 |
+
% ============================================================================
|
| 668 |
+
% 11. Limitations
|
| 669 |
+
% ============================================================================
|
| 670 |
+
\section{Limitations}
|
| 671 |
+
|
| 672 |
+
\begin{itemize}[leftmargin=*]
|
| 673 |
+
\item \textbf{Model size}: At 600M parameters, Julian has limited reasoning capabilities and factual accuracy compared to larger models.
|
| 674 |
+
\item \textbf{Training data volume}: While efficient, the 39B-token budget already exceeds the Chinchilla-optimal budget for a 600M-parameter model ($\sim$12B tokens at the $\sim$20:1 tokens-per-parameter ratio), yet it remains far below the multi-trillion-token budgets of recent small models such as TinyLlama \citep{zhang2024tinyllama}, suggesting the model could still benefit from continued pre-training (see the short calculation after this list).
|
| 675 |
+
\item \textbf{English-centric evaluation}: All benchmarks are in English. We lack standardized French evaluation benchmarks for language models of this size.
|
| 676 |
+
\item \textbf{Hallucination}: Like all language models, Julian frequently generates incorrect or fabricated information, particularly for factual queries.
|
| 677 |
+
\item \textbf{Basic instruction following}: SFT without reinforcement learning from human feedback (RLHF) \citep{christiano2017deep} or direct preference optimization (DPO) \citep{rafailov2023direct} produces basic instruction-following capabilities that are significantly weaker than those of RLHF-trained models.
|
| 678 |
+
\item \textbf{LAMBADA underperformance}: The relatively low LAMBADA accuracy (37.3\% vs.\ 50.5\% for Pythia-410M) indicates that broader text prediction capabilities lag behind the strong commonsense reasoning performance.
|
| 679 |
+
\end{itemize}
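To make the data-volume point above concrete, the short calculation below applies the approximate 20-tokens-per-parameter Chinchilla rule of thumb \citep{hoffmann2022training}. It is a back-of-the-envelope sketch only, since the exact optimum depends on the fitted scaling-law constants.

\begin{verbatim}
# Julian-600M token budget vs. the ~20 tokens-per-parameter Chinchilla heuristic.
n_params = 600e6                       # model parameters
d_tokens = 39e9                        # actual pre-training tokens
chinchilla_tokens = 20 * n_params      # ~1.2e10, i.e. ~12B tokens
ratio = d_tokens / n_params            # ~65 tokens per parameter
print(f"Chinchilla-optimal: {chinchilla_tokens/1e9:.0f}B tokens")
print(f"Actual:             {d_tokens/1e9:.0f}B tokens ({ratio:.0f} tokens/param)")
\end{verbatim}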
|
| 680 |
+
|
| 681 |
+
% ============================================================================
|
| 682 |
+
% 12. Conclusion
|
| 683 |
+
% ============================================================================
|
| 684 |
+
\section{Conclusion}
|
| 685 |
+
|
| 686 |
+
We have presented Julian, a family of bilingual language models trained from scratch on TPU infrastructure using JAX/Flax. Our flagship Julian-600M model achieves remarkable efficiency on HellaSwag (53.5\%), outperforming models with 2$\times$ more parameters trained on 8$\times$ more data. We have documented the complete training pipeline, from data collection and tokenizer training to pre-training, supervised fine-tuning, and evaluation.
|
| 687 |
+
|
| 688 |
+
\paragraph{Future Work.} We plan to: (1) scale Julian to 2B parameters using larger TPU configurations (v6e-64); (2) implement DPO \citep{rafailov2023direct} for improved instruction following; (3) develop French-language evaluation benchmarks; and (4) explore continued pre-training on larger datasets to improve LAMBADA and general text prediction performance.
|
| 689 |
+
|
| 690 |
+
\paragraph{Open Release.} All model weights are available at \url{https://huggingface.co/JulianKrgd} under the Apache 2.0 license.
|
| 691 |
+
|
| 692 |
+
% ============================================================================
|
| 693 |
+
% Acknowledgments
|
| 694 |
+
% ============================================================================
|
| 695 |
+
\section*{Acknowledgments}
|
| 696 |
+
|
| 697 |
+
This work was supported by the Google TPU Research Cloud (TRC) program, which provided access to Cloud TPU v4-32 pods. We thank the TRC team for their support and the allocation of compute resources that made this research possible.
|
| 698 |
+
|
| 699 |
+
% ============================================================================
|
| 700 |
+
% References
|
| 701 |
+
% ============================================================================
|
| 702 |
+
\bibliographystyle{plainnat}
|
| 703 |
+
|
| 704 |
+
\begin{thebibliography}{36}
|
| 705 |
+
|
| 706 |
+
\bibitem[Biderman et~al.(2023)]{biderman2023pythia}
|
| 707 |
+
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van~der Wal.
|
| 708 |
+
\newblock Pythia: A suite for analyzing large language models across training and scaling.
|
| 709 |
+
\newblock In \emph{ICML}, 2023.
|
| 710 |
+
\newblock \url{https://arxiv.org/abs/2304.01373}
|
| 711 |
+
|
| 712 |
+
\bibitem[Christiano et~al.(2017)]{christiano2017deep}
|
| 713 |
+
Paul~F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei.
|
| 714 |
+
\newblock Deep reinforcement learning from human preferences.
|
| 715 |
+
\newblock In \emph{NeurIPS}, 2017.
|
| 716 |
+
\newblock \url{https://arxiv.org/abs/1706.03741}
|
| 717 |
+
|
| 718 |
+
\bibitem[Bradbury et~al.(2018)]{bradbury2018jax}
|
| 719 |
+
James Bradbury, Roy Frostig, Peter Hawkins, Matthew~James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander{P}las, Skye Wanderman-{M}ilne, and Qiao Zhang.
|
| 720 |
+
\newblock {JAX}: Composable transformations of {Python}+{NumPy} programs.
|
| 721 |
+
\newblock 2018.
|
| 722 |
+
\newblock \url{https://github.com/jax-ml/jax}
|
| 723 |
+
|
| 724 |
+
\bibitem[Bisk et~al.(2020)]{bisk2020piqa}
|
| 725 |
+
Yonatan Bisk, Rowan Zellers, Ronan Le~Bras, Jianfeng Gao, and Yejin Choi.
|
| 726 |
+
\newblock {PIQA}: Reasoning about physical commonsense in natural language.
|
| 727 |
+
\newblock In \emph{AAAI}, 2020.
|
| 728 |
+
\newblock \url{https://arxiv.org/abs/1911.11641}
|
| 729 |
+
|
| 730 |
+
\bibitem[Brown et~al.(2020)]{brown2020language}
|
| 731 |
+
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared~D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et~al.
|
| 732 |
+
\newblock Language models are few-shot learners.
|
| 733 |
+
\newblock In \emph{NeurIPS}, 2020.
|
| 734 |
+
\newblock \url{https://arxiv.org/abs/2005.14165}
|
| 735 |
+
|
| 736 |
+
\bibitem[Chowdhery et~al.(2023)]{chowdhery2023palm}
|
| 737 |
+
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung~Won Chung, Charles Sutton, Sebastian Gehrmann, et~al.
|
| 738 |
+
\newblock {PaLM}: Scaling language modeling with {P}athways.
|
| 739 |
+
\newblock \emph{JMLR}, 2023.
|
| 740 |
+
\newblock \url{https://arxiv.org/abs/2204.02311}
|
| 741 |
+
|
| 742 |
+
\bibitem[Clark et~al.(2018)]{clark2018think}
|
| 743 |
+
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
|
| 744 |
+
\newblock Think you have solved question answering? {T}ry {ARC}, the {AI2} reasoning challenge.
|
| 745 |
+
\newblock \emph{arXiv preprint arXiv:1803.05457}, 2018.
|
| 746 |
+
\newblock \url{https://arxiv.org/abs/1803.05457}
|
| 747 |
+
|
| 748 |
+
\bibitem[Clark et~al.(2019)]{clark2019boolq}
|
| 749 |
+
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova.
|
| 750 |
+
\newblock {BoolQ}: Exploring the surprising difficulty of natural yes/no questions.
|
| 751 |
+
\newblock In \emph{NAACL}, 2019.
|
| 752 |
+
\newblock \url{https://arxiv.org/abs/1905.10044}
|
| 753 |
+
|
| 754 |
+
\bibitem[Conneau et~al.(2020)]{conneau2020xlmr}
|
| 755 |
+
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm{\'a}n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
|
| 756 |
+
\newblock Unsupervised cross-lingual representation learning at scale.
|
| 757 |
+
\newblock In \emph{ACL}, 2020.
|
| 758 |
+
\newblock \url{https://arxiv.org/abs/1911.02116}
|
| 759 |
+
|
| 760 |
+
\bibitem[Devlin et~al.(2019)]{devlin2019bert}
|
| 761 |
+
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
|
| 762 |
+
\newblock {BERT}: Pre-training of deep bidirectional transformers for language understanding.
|
| 763 |
+
\newblock In \emph{NAACL}, 2019.
|
| 764 |
+
\newblock \url{https://arxiv.org/abs/1810.04805}
|
| 765 |
+
|
| 766 |
+
\bibitem[Gao et~al.(2023)]{gao2023framework}
|
| 767 |
+
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le~Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou.
|
| 768 |
+
\newblock A framework for few-shot language model evaluation.
|
| 769 |
+
\newblock \emph{Zenodo}, 2023.
|
| 770 |
+
\newblock \url{https://zenodo.org/records/10256836}
|
| 771 |
+
|
| 772 |
+
\bibitem[Hoffmann et~al.(2022)]{hoffmann2022training}
|
| 773 |
+
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de~Las~Casas, Lisa~Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van~den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack~W. Rae, Oriol Vinyals, and Laurent Sifre.
|
| 774 |
+
\newblock Training compute-optimal large language models.
|
| 775 |
+
\newblock In \emph{NeurIPS}, 2022.
|
| 776 |
+
\newblock \url{https://arxiv.org/abs/2203.15556}
|
| 777 |
+
|
| 778 |
+
\bibitem[Kaplan et~al.(2020)]{kaplan2020scaling}
|
| 779 |
+
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom~B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever.
|
| 780 |
+
\newblock Scaling laws for neural language models.
|
| 781 |
+
\newblock \emph{arXiv preprint arXiv:2001.08361}, 2020.
|
| 782 |
+
\newblock \url{https://arxiv.org/abs/2001.08361}
|
| 783 |
+
|
| 784 |
+
\bibitem[Lee et~al.(2022)]{lee2022deduplicating}
|
| 785 |
+
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini.
|
| 786 |
+
\newblock Deduplicating training data makes language models better.
|
| 787 |
+
\newblock In \emph{ACL}, 2022.
|
| 788 |
+
\newblock \url{https://arxiv.org/abs/2107.06499}
|
| 789 |
+
|
| 790 |
+
\bibitem[Kudo and Richardson(2018)]{kudo2018sentencepiece}
|
| 791 |
+
Taku Kudo and John Richardson.
|
| 792 |
+
\newblock {SentencePiece}: A simple and language independent subword tokenizer and detokenizer for neural text processing.
|
| 793 |
+
\newblock In \emph{EMNLP (demo)}, 2018.
|
| 794 |
+
\newblock \url{https://arxiv.org/abs/1808.06226}
|
| 795 |
+
|
| 796 |
+
\bibitem[Liu et~al.(2024)]{liu2024mobilellm}
|
| 797 |
+
Zechun Liu, Changlin Li, Barlas O\u{g}uz, et~al.
|
| 798 |
+
\newblock {MobileLLM}: Optimizing sub-billion parameter language models for on-device use cases.
|
| 799 |
+
\newblock In \emph{ICML}, 2024.
|
| 800 |
+
\newblock \url{https://arxiv.org/abs/2402.14905}
|
| 801 |
+
|
| 802 |
+
\bibitem[Micikevicius et~al.(2018)]{micikevicius2018mixed}
|
| 803 |
+
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu.
|
| 804 |
+
\newblock Mixed precision training.
|
| 805 |
+
\newblock In \emph{ICLR}, 2018.
|
| 806 |
+
\newblock \url{https://arxiv.org/abs/1710.03740}
|
| 807 |
+
|
| 808 |
+
\bibitem[Loshchilov and Hutter(2019)]{loshchilov2019decoupled}
|
| 809 |
+
Ilya Loshchilov and Frank Hutter.
|
| 810 |
+
\newblock Decoupled weight decay regularization.
|
| 811 |
+
\newblock In \emph{ICLR}, 2019.
|
| 812 |
+
\newblock \url{https://arxiv.org/abs/1711.05101}
|
| 813 |
+
|
| 814 |
+
\bibitem[OpenAI(2023)]{openai2023chatml}
|
| 815 |
+
OpenAI.
|
| 816 |
+
\newblock {ChatML}: Chat markup language.
|
| 817 |
+
\newblock Technical documentation, 2023.
|
| 818 |
+
\newblock \url{https://github.com/openai/openai-python/blob/v0.28.1/chatml.md}
|
| 819 |
+
|
| 820 |
+
\bibitem[Ouyang et~al.(2022)]{ouyang2022training}
|
| 821 |
+
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et~al.
|
| 822 |
+
\newblock Training language models to follow instructions with human feedback.
|
| 823 |
+
\newblock In \emph{NeurIPS}, 2022.
|
| 824 |
+
\newblock \url{https://arxiv.org/abs/2203.02155}
|
| 825 |
+
|
| 826 |
+
\bibitem[Paperno et~al.(2016)]{paperno2016lambada}
|
| 827 |
+
Denis Paperno, Germ{\'a}n Kruszewski, Angeliki Lazaridou, Quan~Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern{\'a}ndez.
|
| 828 |
+
\newblock The {LAMBADA} dataset: Word prediction requiring a broad discourse context.
|
| 829 |
+
\newblock In \emph{ACL}, 2016.
|
| 830 |
+
\newblock \url{https://arxiv.org/abs/1606.06031}
|
| 831 |
+
|
| 832 |
+
\bibitem[Rafailov et~al.(2023)]{rafailov2023direct}
|
| 833 |
+
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher~D Manning, and Chelsea Finn.
|
| 834 |
+
\newblock Direct preference optimization: Your language model is secretly a reward model.
|
| 835 |
+
\newblock In \emph{NeurIPS}, 2023.
|
| 836 |
+
\newblock \url{https://arxiv.org/abs/2305.18290}
|
| 837 |
+
|
| 838 |
+
\bibitem[Radford et~al.(2019)]{radford2019language}
|
| 839 |
+
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
|
| 840 |
+
\newblock Language models are unsupervised multitask learners.
|
| 841 |
+
\newblock \emph{OpenAI blog}, 2019.
|
| 842 |
+
\newblock \url{https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf}
|
| 843 |
+
|
| 844 |
+
\bibitem[Sakaguchi et~al.(2020)]{sakaguchi2020winogrande}
|
| 845 |
+
Keisuke Sakaguchi, Ronan Le~Bras, Chandra Bhagavatula, and Yejin Choi.
|
| 846 |
+
\newblock {WinoGrande}: An adversarial winograd schema challenge at scale.
|
| 847 |
+
\newblock In \emph{AAAI}, 2020.
|
| 848 |
+
\newblock \url{https://arxiv.org/abs/1907.10641}
|
| 849 |
+
|
| 850 |
+
\bibitem[Shazeer(2020)]{shazeer2020glu}
|
| 851 |
+
Noam Shazeer.
|
| 852 |
+
\newblock {GLU} variants improve transformer.
|
| 853 |
+
\newblock \emph{arXiv preprint arXiv:2002.05202}, 2020.
|
| 854 |
+
\newblock \url{https://arxiv.org/abs/2002.05202}
|
| 855 |
+
|
| 856 |
+
\bibitem[Su et~al.(2021)]{su2021roformer}
|
| 857 |
+
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu.
|
| 858 |
+
\newblock {RoFormer}: Enhanced transformer with rotary position embedding.
|
| 859 |
+
\newblock \emph{arXiv preprint arXiv:2104.09864}, 2021.
|
| 860 |
+
\newblock \url{https://arxiv.org/abs/2104.09864}
|
| 861 |
+
|
| 862 |
+
\bibitem[Touvron et~al.(2023)]{touvron2023llama}
|
| 863 |
+
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth{\'e}e Lacroix, Baptiste Rozi{\`e}re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample.
|
| 864 |
+
\newblock {LLaMA}: Open and efficient foundation language models.
|
| 865 |
+
\newblock \emph{arXiv preprint arXiv:2302.13971}, 2023.
|
| 866 |
+
\newblock \url{https://arxiv.org/abs/2302.13971}
|
| 867 |
+
|
| 868 |
+
\bibitem[Xu et~al.(2021)]{xu2021gspmd}
|
| 869 |
+
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen.
|
| 870 |
+
\newblock {GSPMD}: General and scalable parallelization for {ML} computation graphs.
|
| 871 |
+
\newblock \emph{arXiv preprint arXiv:2105.04663}, 2021.
|
| 872 |
+
\newblock \url{https://arxiv.org/abs/2105.04663}
|
| 873 |
+
|
| 874 |
+
\bibitem[Vaswani et~al.(2017)]{vaswani2017attention}
|
| 875 |
+
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan~N Gomez, {\L}ukasz Kaiser, and Illia Polosukhin.
|
| 876 |
+
\newblock Attention is all you need.
|
| 877 |
+
\newblock In \emph{NeurIPS}, 2017.
|
| 878 |
+
\newblock \url{https://arxiv.org/abs/1706.03762}
|
| 879 |
+
|
| 880 |
+
\bibitem[Workshop et~al.(2023)]{workshop2023bloom}
|
| 881 |
+
BigScience Workshop, Teven Le~Scao, Angela Fan, et~al.
|
| 882 |
+
\newblock {BLOOM}: A 176B-parameter open-access multilingual language model.
|
| 883 |
+
\newblock \emph{arXiv preprint arXiv:2211.05100}, 2023.
|
| 884 |
+
\newblock \url{https://arxiv.org/abs/2211.05100}
|
| 885 |
+
|
| 886 |
+
\bibitem[Zellers et~al.(2019)]{zellers2019hellaswag}
|
| 887 |
+
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi.
|
| 888 |
+
\newblock {HellaSwag}: Can a machine really finish your sentence?
|
| 889 |
+
\newblock In \emph{ACL}, 2019.
|
| 890 |
+
\newblock \url{https://arxiv.org/abs/1905.07830}
|
| 891 |
+
|
| 892 |
+
\bibitem[Zhang et~al.(2022)]{zhang2022opt}
|
| 893 |
+
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi~Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit~Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer.
|
| 894 |
+
\newblock {OPT}: Open pre-trained transformer language models.
|
| 895 |
+
\newblock \emph{arXiv preprint arXiv:2205.01068}, 2022.
|
| 896 |
+
\newblock \url{https://arxiv.org/abs/2205.01068}
|
| 897 |
+
|
| 898 |
+
\bibitem[Zhang and Sennrich(2019)]{zhang2019root}
|
| 899 |
+
Biao Zhang and Rico Sennrich.
|
| 900 |
+
\newblock Root mean square layer normalization.
|
| 901 |
+
\newblock In \emph{NeurIPS}, 2019.
|
| 902 |
+
\newblock \url{https://arxiv.org/abs/1910.07467}
|
| 903 |
+
|
| 904 |
+
\bibitem[Zhang et~al.(2024)]{zhang2024tinyllama}
|
| 905 |
+
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu.
|
| 906 |
+
\newblock {TinyLlama}: An open-source small language model.
|
| 907 |
+
\newblock \emph{arXiv preprint arXiv:2401.02385}, 2024.
|
| 908 |
+
\newblock \url{https://arxiv.org/abs/2401.02385}
|
| 909 |
+
|
| 910 |
+
\end{thebibliography}
|
| 911 |
+
|
| 912 |
+
% ============================================================================
|
| 913 |
+
% Appendix
|
| 914 |
+
% ============================================================================
|
| 915 |
+
\appendix
|
| 916 |
+
\section{Full Hyperparameter Tables}
|
| 917 |
+
\label{app:hyperparams}
|
| 918 |
+
|
| 919 |
+
\begin{table}[h]
|
| 920 |
+
\centering
|
| 921 |
+
\caption{Complete pre-training configuration for Julian-600M.}
|
| 922 |
+
\begin{tabular}{lc}
|
| 923 |
+
\toprule
|
| 924 |
+
\textbf{Category} & \textbf{Value} \\
|
| 925 |
+
\midrule
|
| 926 |
+
\multicolumn{2}{l}{\textit{Model}} \\
|
| 927 |
+
Parameters & $\sim$600M \\
|
| 928 |
+
Hidden dimension & 1280 \\
|
| 929 |
+
Layers & 18 \\
|
| 930 |
+
Attention heads & 20 \\
|
| 931 |
+
Head dimension & 64 \\
|
| 932 |
+
FFN dimension & 5120 \\
|
| 933 |
+
Activation & SwiGLU (SiLU gate) \\
|
| 934 |
+
Normalization & RMSNorm ($\epsilon = 10^{-6}$) \\
|
| 935 |
+
Position encoding & RoPE ($\theta = 10{,}000$) \\
|
| 936 |
+
Vocabulary & 50{,}000 (SentencePiece BPE) \\
|
| 937 |
+
Context length & 2{,}048 \\
|
| 938 |
+
Dropout & 0.1 \\
|
| 939 |
+
\midrule
|
| 940 |
+
\multicolumn{2}{l}{\textit{Optimization}} \\
|
| 941 |
+
Optimizer & AdamW \\
|
| 942 |
+
$\beta_1, \beta_2$ & 0.9, 0.95 \\
|
| 943 |
+
$\epsilon$ & $10^{-8}$ \\
|
| 944 |
+
Weight decay & 0.1 \\
|
| 945 |
+
Peak LR & $1.2 \times 10^{-3}$ \\
|
| 946 |
+
Min LR & $1.2 \times 10^{-4}$ \\
|
| 947 |
+
LR schedule & Cosine with linear warmup \\
|
| 948 |
+
Warmup steps & 3{,}000 \\
|
| 949 |
+
Total steps & 300{,}000 \\
|
| 950 |
+
Gradient clipping & 1.0 (global norm) \\
|
| 951 |
+
Optimizer state precision & bfloat16 \\
|
| 952 |
+
\midrule
|
| 953 |
+
\multicolumn{2}{l}{\textit{Compute}} \\
|
| 954 |
+
Hardware & TPU v4-32 (32 chips, 4 hosts) \\
|
| 955 |
+
Batch per device & 4 \\
|
| 956 |
+
Gradient accumulation & 8 \\
|
| 957 |
+
Effective batch size & 1{,}024 \\
|
| 958 |
+
Precision & bfloat16 mixed \\
|
| 959 |
+
Tokens per step & $\sim$2.1M \\
|
| 960 |
+
Total tokens & $\sim$39B \\
|
| 961 |
+
Checkpointing & Orbax async, every 10K steps \\
|
| 962 |
+
\bottomrule
|
| 963 |
+
\end{tabular}
|
| 964 |
+
\end{table}
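For reference, the optimization settings above map onto Optax roughly as follows. This is a minimal sketch under assumptions (it omits the bfloat16 cast of optimizer states and any gradient-accumulation wrapper), not an excerpt of the actual training code.

\begin{verbatim}
import optax

# Linear warmup to 1.2e-3 over 3,000 steps, cosine decay to 1.2e-4 at step 300,000.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1.2e-3,
    warmup_steps=3_000,
    decay_steps=300_000,
    end_value=1.2e-4,
)

# Global-norm clipping at 1.0, then AdamW with the table's betas, eps and weight decay.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=schedule, b1=0.9, b2=0.95, eps=1e-8, weight_decay=0.1),
)
\end{verbatim}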
|
| 965 |
+
|
| 966 |
+
\section{Model Availability}
|
| 967 |
+
|
| 968 |
+
All Julian models are available on the HuggingFace Hub:
|
| 969 |
+
|
| 970 |
+
\begin{table}[h]
|
| 971 |
+
\centering
|
| 972 |
+
\begin{tabular}{ll}
|
| 973 |
+
\toprule
|
| 974 |
+
\textbf{Model} & \textbf{HuggingFace Repository} \\
|
| 975 |
+
\midrule
|
| 976 |
+
Julian-600M Base & \texttt{JulianKrgd/julian-600m-40b} \\
|
| 977 |
+
Julian-600M-10B-Instruct-v0.1 & \texttt{JulianKrgd/julian-600m-10b-instruct-v0.1} \\
|
| 978 |
+
Julian-600M SFT-30K & \texttt{JulianKrgd/julian-600m-40b-instruct-sft30k} \\
|
| 979 |
+
Julian-600M SFT-100K & \texttt{JulianKrgd/julian-600m-40b-instruct-sft100k} \\
|
| 980 |
+
\bottomrule
|
| 981 |
+
\end{tabular}
|
| 982 |
+
\end{table}
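As a usage note, any of the repositories above can be fetched locally with the \texttt{huggingface\_hub} client, as sketched below; how the downloaded weights are then loaded depends on the checkpoint format published in each repository.

\begin{verbatim}
from huggingface_hub import snapshot_download

# Download the Julian-600M base checkpoint into the local HuggingFace cache.
local_dir = snapshot_download(repo_id="JulianKrgd/julian-600m-40b")
print(local_dir)
\end{verbatim}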
|
| 983 |
+
|
| 984 |
+
\end{document}
|
julian_paper_fr.pdf
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0ec43641f185a3dac51a4d4691cf913b7cdc7a13499313b64c7cd6ae0fe3c3cb
|
| 3 |
+
size 224968
|
julian_paper_fr.tex
ADDED
|
@@ -0,0 +1,1028 @@
|
| 1 |
+
\documentclass[11pt,a4paper]{article}
|
| 2 |
+
|
| 3 |
+
% ============================================================================
|
| 4 |
+
% Packages
|
| 5 |
+
% ============================================================================
|
| 6 |
+
\usepackage[utf8]{inputenc}
|
| 7 |
+
\usepackage[T1]{fontenc}
|
| 8 |
+
\usepackage[french]{babel}
|
| 9 |
+
\usepackage{times}
|
| 10 |
+
\usepackage{geometry}
|
| 11 |
+
\geometry{margin=1in}
|
| 12 |
+
\usepackage{amsmath,amssymb}
|
| 13 |
+
\usepackage{graphicx}
|
| 14 |
+
\usepackage{booktabs}
|
| 15 |
+
\usepackage{hyperref}
|
| 16 |
+
\usepackage{url}
|
| 17 |
+
\urlstyle{same}
|
| 18 |
+
\usepackage{natbib}
|
| 19 |
+
\usepackage{xcolor}
|
| 20 |
+
\usepackage{array}
|
| 21 |
+
\usepackage{float}
|
| 22 |
+
\usepackage{enumitem}
|
| 23 |
+
\usepackage{fancyvrb}
|
| 24 |
+
\usepackage{pgfplots}
|
| 25 |
+
\pgfplotsset{compat=1.18}
|
| 26 |
+
|
| 27 |
+
\hypersetup{
|
| 28 |
+
colorlinks=true,
|
| 29 |
+
linkcolor=blue!60!black,
|
| 30 |
+
citecolor=blue!60!black,
|
| 31 |
+
urlcolor=blue!60!black
|
| 32 |
+
}
|
| 33 |
+
|
| 34 |
+
% ============================================================================
|
| 35 |
+
% Titre
|
| 36 |
+
% ============================================================================
|
| 37 |
+
\title{
|
| 38 |
+
\textbf{Julian : Entra\^inement Efficace d'un Mod\`ele de Langage \\
|
| 39 |
+
Bilingue de 600M de Param\`etres sur TPU avec JAX}
|
| 40 |
+
}
|
| 41 |
+
|
| 42 |
+
\author{
|
| 43 |
+
Julian Kerignard \\
|
| 44 |
+
Recherche Ind\'ependante \\
|
| 45 |
+
\texttt{github.com/JulianKrgd} \\
|
| 46 |
+
\texttt{huggingface.co/JulianKrgd}
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
\date{F\'evrier 2026}
|
| 50 |
+
|
| 51 |
+
\begin{document}
|
| 52 |
+
\maketitle
|
| 53 |
+
|
| 54 |
+
% ============================================================================
|
| 55 |
+
% R\'esum\'e
|
| 56 |
+
% ============================================================================
|
| 57 |
+
\begin{abstract}
|
| 58 |
+
Nous pr\'esentons \textbf{Julian}\footnote{Mod\`eles disponibles sur HuggingFace : \url{https://huggingface.co/JulianKrgd}}, une famille de mod\`eles de langage \`a d\'ecodeur seul (\emph{decoder-only}) allant de 100M \`a 600M de param\`etres, entra\^in\'es enti\`erement \`a partir de z\'ero sur jusqu'\`a 39 milliards de tokens de donn\'ees bilingues anglais-fran\c{c}ais, en utilisant JAX/Flax sur les TPU de Google Cloud. Notre mod\`ele principal, Julian-600M, emploie une architecture Transformer moderne avec des Rotary Position Embeddings (RoPE), des activations SwiGLU et une normalisation RMSNorm, suivant les principes de conception de LLaMA. Malgr\'e un entra\^inement sur significativement moins de tokens que les mod\`eles comparables, Julian-600M atteint 53,5\,\% de pr\'ecision normalis\'ee sur HellaSwag, surpassant OPT-1.3B (41,5\,\%) qui poss\`ede plus du double de param\`etres et a \'et\'e entra\^in\'e sur 8$\times$ plus de donn\'ees. Nous effectuons ensuite un \emph{Supervised Fine-Tuning} (SFT) de Julian-600M sur 2,47 millions de paires instruction-r\'eponse format\'ees avec le template ChatML, produisant des variantes capables de suivre des instructions \`a 30K et 100K pas d'entra\^inement. Nous fournissons un compte-rendu d\'etaill\'e de notre infrastructure d'entra\^inement, de notre pipeline de donn\'ees et des d\'efis de l'entra\^inement multi-h\^ote sur TPU avec JAX. Tous les poids des mod\`eles sont publi\'es ouvertement sous licence Apache 2.0 sur HuggingFace.
|
| 59 |
+
\end{abstract}
|
| 60 |
+
|
| 61 |
+
% ============================================================================
|
| 62 |
+
% 1. Introduction
|
| 63 |
+
% ============================================================================
|
| 64 |
+
\section{Introduction}
|
| 65 |
+
|
| 66 |
+
L'avanc\'ee rapide des grands mod\`eles de langage (\emph{Large Language Models}, LLM) a d\'emontr\'e des capacit\'es remarquables en compr\'ehension et g\'en\'eration de langage naturel \citep{brown2020language, chowdhery2023palm, touvron2023llama}. Cependant, l'entra\^inement de tels mod\`eles n\'ecessite g\'en\'eralement d'\'enormes ressources computationnelles, souvent inaccessibles aux chercheurs ind\'ependants et aux petites organisations.
|
| 67 |
+
|
| 68 |
+
Les travaux r\'ecents ont montr\'e que des mod\`eles de langage plus petits, lorsqu'ils sont entra\^in\'es avec des donn\'ees et des techniques appropri\'ees, peuvent atteindre des performances comp\'etitives sur de nombreux benchmarks \citep{biderman2023pythia, zhang2022opt}. Les lois d'\'echelle de Chinchilla \citep{hoffmann2022training} sugg\`erent de plus que de nombreux mod\`eles sont sous-entra\^in\'es par rapport \`a leur taille, et que la performance optimale n\'ecessite un \'equilibre soigneux entre la taille du mod\`ele et le volume de donn\'ees d'entra\^inement.
|
| 69 |
+
|
| 70 |
+
Un LLM fonctionne de mani\`ere fondamentalement simple : il pr\'edit le token suivant dans une s\'equence de texte. \`A partir d'un contexte donn\'e (par exemple, \og La capitale de la France est\fg), le mod\`ele calcule une distribution de probabilit\'e sur l'ensemble de son vocabulaire (50\,000 tokens dans notre cas) et s\'electionne le token le plus probable. C'est en empilant ces pr\'edictions que le mod\`ele g\'en\`ere du texte coh\'erent. La qualit\'e de ces pr\'edictions d\'epend directement de deux facteurs : l'architecture du mod\`ele (comment il traite l'information) et les donn\'ees d'entra\^inement (ce qu'il a appris \`a pr\'edire).
|
| 71 |
+
|
| 72 |
+
Dans ce travail, nous pr\'esentons \textbf{Julian}, une famille de mod\`eles de langage bilingues (anglais-fran\c{c}ais) entra\^in\'es \`a partir de z\'ero en utilisant JAX/Flax sur des pods TPU v4-32 de Google Cloud. Nos contributions sont les suivantes :
|
| 73 |
+
|
| 74 |
+
\begin{enumerate}[leftmargin=*]
|
| 75 |
+
\item \textbf{Entra\^inement efficace} : Nous entra\^inons un mod\`ele de 600M de param\`etres sur 39B tokens qui surpasse OPT-1.3B sur HellaSwag malgr\'e 2$\times$ moins de param\`etres et 8$\times$ moins de tokens d'entra\^inement.
|
| 76 |
+
\item \textbf{Capacit\'e bilingue} : \`A notre connaissance, Julian est parmi les rares petits mod\`eles de langage ouvertement publi\'es, entra\^in\'es \`a partir de z\'ero sur un m\'elange de donn\'ees anglaises et fran\c{c}aises (ratio 70\,\%/30\,\%).
|
| 77 |
+
\item \textbf{Pipeline complet} : Nous d\'ecrivons l'int\'egralit\'e du pipeline d'entra\^inement, incluant la collecte de donn\'ees, l'entra\^inement du tokenizer, le pr\'e-entra\^inement, le fine-tuning supervis\'e et l'\'evaluation, fournissant un guide pratique pour entra\^iner des LLM sur infrastructure TPU.
|
| 78 |
+
\item \textbf{Publication ouverte} : Tous les poids des mod\`eles, le tokenizer et le code d'entra\^inement sont publi\'es sous licence Apache 2.0.
|
| 79 |
+
\end{enumerate}
|
| 80 |
+
|
| 81 |
+
% ============================================================================
|
| 82 |
+
% 2. Travaux Connexes
|
| 83 |
+
% ============================================================================
|
| 84 |
+
\section{Travaux Connexes}
|
| 85 |
+
|
| 86 |
+
\paragraph{Lois d'\'echelle.}
|
| 87 |
+
\citet{kaplan2020scaling} ont \'etabli les lois d'\'echelle neuronales montrant des relations en loi de puissance entre la taille du mod\`ele, la taille du dataset, le budget de calcul et la loss. Intuitivement, ces lois indiquent qu'augmenter la taille d'un mod\`ele ou la quantit\'e de donn\'ees am\'eliore la performance de mani\`ere pr\'edictible, mais avec des rendements d\'ecroissants. \citet{hoffmann2022training} ont affin\'e ces r\'esultats avec les lois d'\'echelle de Chinchilla, d\'emontrant que de nombreux grands mod\`eles sont significativement sous-entra\^in\'es et que le ratio optimal tokens/param\`etres est d'environ 20:1. Concr\`etement, un mod\`ele de 600M de param\`etres devrait id\'ealement \^etre entra\^in\'e sur environ 12 milliards de tokens pour atteindre son optimum. Notre mod\`ele Julian-600M est entra\^in\'e sur 39B tokens (ratio 65:1), d\'epassant largement le budget optimal de Chinchilla, ce qui signifie que nous investissons davantage de calcul dans l'entra\^inement que ce que la th\'eorie sugg\`ere comme minimum.
|
| 88 |
+
|
| 89 |
+
\paragraph{Mod\`eles de langage ouverts.}
|
| 90 |
+
GPT-2 \citep{radford2019language} a \'et\'e pionnier dans la publication de mod\`eles de langage pr\'e-entra\^in\'es, avec des tailles allant de 124M \`a 1,5B de param\`etres. OPT \citep{zhang2022opt} a fourni des mod\`eles de 125M \`a 175B de param\`etres entra\^in\'es sur 300B tokens avec des journaux d'entra\^inement d\'etaill\'es. Pythia \citep{biderman2023pythia} a offert une suite de mod\`eles de 70M \`a 12B de param\`etres entra\^in\'es sur 300B tokens issus de The Pile, sp\'ecifiquement con\c{c}us pour \'etudier le comportement des mod\`eles pendant l'entra\^inement. LLaMA \citep{touvron2023llama} a introduit des am\'eliorations architecturales (RoPE, SwiGLU, RMSNorm) devenues standard dans les mod\`eles de langage modernes.
|
| 91 |
+
|
| 92 |
+
\paragraph{Petits mod\`eles de langage.}
|
| 93 |
+
TinyLlama \citep{zhang2024tinyllama} a d\'emontr\'e qu'un mod\`ele de 1,1B entra\^in\'e sur 3T tokens peut atteindre de bonnes performances. MobileLLM \citep{liu2024mobilellm} a explor\'e la conception d'architectures pour des mod\`eles \`a moins d'un milliard de param\`etres, destin\'es \`a fonctionner sur des appareils mobiles. Ces travaux soulignent la viabilit\'e et l'int\'er\^et croissant pour des mod\`eles plus petits et plus efficaces, capables de fonctionner sans infrastructure cloud co\^uteuse.
|
| 94 |
+
|
| 95 |
+
\paragraph{Mod\`eles multilingues.}
|
| 96 |
+
Alors que de grands mod\`eles multilingues comme mBERT \citep{devlin2019bert}, XLM-R \citep{conneau2020xlmr} et BLOOM \citep{workshop2023bloom} couvrent de nombreuses langues, peu de petits mod\`eles sont sp\'ecifiquement con\c{c}us pour la g\'en\'eration de texte bilingue anglais-fran\c{c}ais \`a partir de z\'ero. BLOOM (176B param\`etres) est notable pour son inclusion explicite du fran\c{c}ais, mais sa taille le rend inaccessible pour la plupart des cas d'usage en inf\'erence locale.
|
| 97 |
+
|
| 98 |
+
% ============================================================================
|
| 99 |
+
% 3. Architecture du Mod\`ele
|
| 100 |
+
% ============================================================================
|
| 101 |
+
\section{Architecture du Mod\`ele}
|
| 102 |
+
|
| 103 |
+
Julian suit l'architecture LLaMA \citep{touvron2023llama} : un Transformer \`a d\'ecodeur seul avec pr\'e-normalisation par RMSNorm \citep{zhang2019root}, des r\'eseaux feed-forward SwiGLU \citep{shazeer2020glu}, et des Rotary Position Embeddings (RoPE) \citep{su2021roformer}. Aucun terme de biais n'est utilis\'e dans les projections lin\'eaires.
|
| 104 |
+
|
| 105 |
+
Un Transformer \`a d\'ecodeur seul est une architecture de r\'eseau de neurones qui traite le texte s\'equentiellement, de gauche \`a droite. Chaque \og couche \fg{} du mod\`ele effectue deux op\'erations principales : (1) un m\'ecanisme d'\emph{attention} qui permet \`a chaque token de \og regarder \fg{} les tokens pr\'ec\'edents pour comprendre le contexte, et (2) un r\'eseau feed-forward qui transforme cette information. Ces deux op\'erations sont r\'ep\'et\'ees 18 fois dans Julian-600M, permettant au mod\`ele de construire des repr\'esentations de plus en plus abstraites du texte.
|
| 106 |
+
|
| 107 |
+
\subsection{D\'etails de l'Architecture}
|
| 108 |
+
|
| 109 |
+
\begin{table}[h]
|
| 110 |
+
\centering
|
| 111 |
+
\caption{Configurations des mod\`eles Julian. Tous les mod\`eles utilisent RoPE ($\theta$=10000), SwiGLU, RMSNorm (pr\'e-normalisation) et aucun terme de biais.}
|
| 112 |
+
\label{tab:model_configs}
|
| 113 |
+
\begin{tabular}{lccc}
|
| 114 |
+
\toprule
|
| 115 |
+
\textbf{Param\`etre} & \textbf{Julian-100M} & \textbf{Julian-250M$^\dagger$} & \textbf{Julian-600M} \\
|
| 116 |
+
\midrule
|
| 117 |
+
Dimension cach\'ee ($d_{\text{model}}$) & 640 & 1024 & 1280 \\
|
| 118 |
+
Couches ($L$) & 12 & 14 & 18 \\
|
| 119 |
+
T\^etes d'attention ($H$) & 10 & 16 & 20 \\
|
| 120 |
+
Dimension par t\^ete ($d_h$) & 64 & 64 & 64 \\
|
| 121 |
+
Dimension FFN ($d_{\text{ff}}$) & 2560 & 4096 & 5120 \\
|
| 122 |
+
Taille du vocabulaire ($V$) & 50\,000 & 50\,000 & 50\,000 \\
|
| 123 |
+
Longueur de contexte & 2048 & 2048 & 2048 \\
|
| 124 |
+
Pr\'ecision & bfloat16 & bfloat16 & bfloat16 \\
|
| 125 |
+
\bottomrule
|
| 126 |
+
\end{tabular}
|
| 127 |
+
\end{table}
|
| 128 |
+
|
| 129 |
+
\noindent{\small $^\dagger$ Julian-250M est actuellement en pr\'eparation et n'a pas encore \'et\'e entra\^in\'e.}
|
| 130 |
+
|
| 131 |
+
\paragraph{Rotary Position Embeddings (RoPE).}
|
| 132 |
+
Dans un Transformer, le mod\`ele doit savoir \`a quelle \emph{position} se trouve chaque token dans la s\'equence. Sans cette information, la phrase \og le chat mange la souris \fg{} et \og la souris mange le chat \fg{} seraient trait\'ees de mani\`ere identique. RoPE \citep{su2021roformer} encode cette information positionnelle en appliquant une \emph{rotation} aux vecteurs de requ\^ete (\emph{query}) et de cl\'e (\emph{key}) du m\'ecanisme d'attention. L'angle de rotation d\'epend de la position du token, ce qui permet au mod\`ele de distinguer les positions tout en pr\'eservant les propri\'et\'es alg\'ebriques utiles pour le calcul d'attention.
|
| 133 |
+
|
| 134 |
+
Formellement, pour la fr\'equence de base $\theta = 10\,000$ :
|
| 135 |
+
\begin{equation}
|
| 136 |
+
f_{\theta}(x, m) = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \odot \begin{pmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \odot \begin{pmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{pmatrix}
|
| 137 |
+
\end{equation}
|
| 138 |
+
o\`u $\theta_i = \theta^{-2i/d}$ et $m$ est l'indice de position. L'op\'erateur $\odot$ d\'esigne le produit \'el\'ement par \'el\'ement (produit de Hadamard). Intuitivement, chaque paire de dimensions du vecteur est \og tourn\'ee \fg{} d'un angle proportionnel \`a la position, ce qui fait que le produit scalaire entre deux vecteurs d\'epend naturellement de leur distance relative dans la s\'equence.
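\`A titre d'illustration, l'esquisse JAX ci-dessous applique cette rotation \`a un vecteur de requ\^ete ou de cl\'e. Il s'agit d'une version simplifi\'ee, donn\'ee sous hypoth\`eses (noms de fonction et de variables purement illustratifs), et non du code d'entra\^inement r\'eel de Julian :

\begin{verbatim}
import jax.numpy as jnp

def apply_rope(x, positions, base=10_000.0):
    # x : (seq_len, d) avec d pair ; positions : indices entiers (seq_len,).
    d = x.shape[-1]
    # Frequences theta_i = base^(-2i/d), une par paire de dimensions.
    inv_freq = base ** (-jnp.arange(0, d, 2) / d)            # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, d/2)
    cos = jnp.repeat(jnp.cos(angles), 2, axis=-1)            # (seq_len, d)
    sin = jnp.repeat(jnp.sin(angles), 2, axis=-1)            # (seq_len, d)
    # Vecteur "tourne" (-x2, x1, ..., -x_d, x_{d-1}) de l'equation ci-dessus.
    paires = x.reshape(-1, d // 2, 2)
    x_rot = jnp.stack([-paires[..., 1], paires[..., 0]], axis=-1).reshape(x.shape)
    return x * cos + x_rot * sin
\end{verbatim}

Dans le mod\`ele complet, cette fonction est appliqu\'ee aux requ\^etes et aux cl\'es de chaque t\^ete d'attention, avant le calcul des produits scalaires d'attention.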
|
| 139 |
+
|
| 140 |
+
\paragraph{R\'eseau Feed-Forward SwiGLU.}
|
| 141 |
+
Apr\`es le m\'ecanisme d'attention, chaque couche du Transformer poss\`ede un r\'eseau feed-forward (FFN) qui transforme la repr\'esentation de chaque token ind\'ependamment. SwiGLU \citep{shazeer2020glu} est une variante am\'elior\'ee du FFN standard qui utilise un m\'ecanisme de \og porte \fg{} (\emph{gating}). Au lieu d'appliquer simplement une transformation lin\'eaire suivie d'une activation, SwiGLU multiplie deux chemins de transformation :
|
| 142 |
+
\begin{equation}
|
| 143 |
+
\text{FFN}(x) = W_{\text{down}} \cdot (\text{SiLU}(W_{\text{gate}} x) \odot W_{\text{up}} x)
|
| 144 |
+
\end{equation}
|
| 145 |
+
o\`u $W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ et $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$. La fonction SiLU (\emph{Sigmoid Linear Unit}), d\'efinie par $\text{SiLU}(x) = x \cdot \sigma(x)$, est une activation lisse qui combine les avantages de ReLU et des fonctions sigmoid. Le terme $W_{\text{gate}} x$ agit comme une porte qui d\'etermine quelles informations de $W_{\text{up}} x$ doivent \^etre transmises. Cette architecture introduit une projection suppl\'ementaire par rapport aux FFN standards, mais am\'eliore la qualit\'e du mod\`ele \`a calcul \'equivalent.
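Une impl\'ementation minimale de ce bloc avec Flax pourrait ressembler \`a l'esquisse ci-dessous, donn\'ee \`a titre indicatif avec les dimensions de Julian-600M (elle ne reprend pas le code exact du mod\`ele) :

\begin{verbatim}
import flax.linen as nn

class SwiGLU(nn.Module):
    d_model: int = 1280   # dimension cachee de Julian-600M
    d_ff: int = 5120      # dimension FFN de Julian-600M

    @nn.compact
    def __call__(self, x):
        gate = nn.Dense(self.d_ff, use_bias=False, name="w_gate")(x)
        up = nn.Dense(self.d_ff, use_bias=False, name="w_up")(x)
        # Porte : SiLU(W_gate x) module element par element W_up x,
        # puis projection descendante vers d_model.
        return nn.Dense(self.d_model, use_bias=False, name="w_down")(nn.silu(gate) * up)
\end{verbatim}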
|
| 146 |
+
|
| 147 |
+
\paragraph{RMSNorm.}
|
| 148 |
+
La normalisation est cruciale dans les r\'eseaux profonds pour stabiliser l'entra\^inement. Sans normalisation, les valeurs des activations peuvent cro\^itre ou d\'ecro\^itre de mani\`ere incontr\^ol\'ee \`a travers les couches, rendant l'entra\^inement instable. RMSNorm (\emph{Root Mean Square Layer Normalization}) \citep{zhang2019root} est une version simplifi\'ee de LayerNorm qui ne recentre pas les activations (pas de soustraction de la moyenne), mais les renormalise uniquement par leur norme quadratique moyenne :
|
| 149 |
+
\begin{equation}
|
| 150 |
+
\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma
|
| 151 |
+
\end{equation}
|
| 152 |
+
o\`u $\gamma$ est un param\`etre d'\'echelle appris et $\epsilon = 10^{-6}$ est un terme de stabilit\'e num\'erique. Cette simplification r\'eduit le co\^ut computationnel tout en maintenant la stabilit\'e de l'entra\^inement. Nous utilisons la pr\'e-normalisation (\emph{pre-norm}), c'est-\`a-dire que RMSNorm est appliqu\'e \emph{avant} chaque sous-couche d'attention et de FFN, plut\^ot qu'apr\`es (ce qui \'etait la convention dans le Transformer original \citep{vaswani2017attention}).
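Cette op\'eration se traduit directement en JAX, comme dans l'esquisse simplifi\'ee ci-dessous (noms illustratifs) :

\begin{verbatim}
import jax.numpy as jnp

def rmsnorm(x, gamma, eps=1e-6):
    # Pas de recentrage : on divise uniquement par la norme quadratique moyenne.
    rms = jnp.sqrt(jnp.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma
\end{verbatim}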
|
| 153 |
+
|
| 154 |
+
\subsection{Tokenizer}
|
| 155 |
+
|
| 156 |
+
Un tokenizer est le composant qui convertit du texte brut en une s\'equence de nombres (tokens) que le mod\`ele peut traiter. Plut\^ot que de travailler au niveau des caract\`eres individuels (ce qui donnerait des s\'equences tr\`es longues) ou des mots entiers (ce qui n\'ecessiterait un vocabulaire \'enorme), les tokenizers modernes d\'ecoupent le texte en \emph{sous-mots}. Par exemple, le mot \og incompr\'ehensible \fg{} pourrait \^etre d\'ecoup\'e en \og in \fg, \og compr\'ehens \fg, \og ible \fg.
|
| 157 |
+
|
| 158 |
+
Nous entra\^inons un tokenizer SentencePiece \citep{kudo2018sentencepiece} de type BPE (\emph{Byte Pair Encoding}) avec un vocabulaire de 50\,000 tokens sur un \'echantillon \'equilibr\'e de notre corpus d'entra\^inement. Les param\`etres cl\'es incluent :
|
| 159 |
+
\begin{itemize}[leftmargin=*]
|
| 160 |
+
\item Couverture de caract\`eres : 99,99\,\%
|
| 161 |
+
\item Byte fallback activ\'e (g\`ere toute entr\'ee UTF-8, y compris les caract\`eres rares)
|
| 162 |
+
\item Tokens sp\'eciaux : \texttt{<pad>} (0), \texttt{<unk>} (1), \texttt{<s>} (2), \texttt{</s>} (3), \texttt{<|code|>} (4), \texttt{<|endcode|>} (5), \texttt{<|im\_start|>} (6), \texttt{<|im\_end|>} (7)
|
| 163 |
+
\end{itemize}
|
| 164 |
+
|
| 165 |
+
Les tokens de style ChatML (\texttt{<|im\_start|>} et \texttt{<|im\_end|>}) sont inclus d\`es le d\'ebut du pr\'e-entra\^inement pour supporter le fine-tuning d'instructions ult\'erieur sans expansion du vocabulaire. Cette d\'ecision de conception \'evite les probl\`emes d'embeddings non entra\^in\'es lors du SFT.
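\`A titre indicatif, un tokenizer reprenant ces param\`etres peut \^etre entra\^in\'e via l'API Python de SentencePiece, comme dans l'esquisse ci-dessous (le chemin du corpus et le pr\'efixe de sortie sont purement illustratifs) :

\begin{verbatim}
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="echantillon_corpus.txt",   # echantillon equilibre EN/FR (chemin illustratif)
    model_prefix="julian_tokenizer",
    vocab_size=50_000,
    model_type="bpe",
    character_coverage=0.9999,
    byte_fallback=True,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    user_defined_symbols=["<|code|>", "<|endcode|>", "<|im_start|>", "<|im_end|>"],
)
\end{verbatim}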
|
| 166 |
+
|
| 167 |
+
% ============================================================================
|
| 168 |
+
% 4. Donn\'ees d'Entra\^inement
|
| 169 |
+
% ============================================================================
|
| 170 |
+
\section{Donn\'ees d'Entra\^inement}
|
| 171 |
+
|
| 172 |
+
\subsection{Sources de Donn\'ees}
|
| 173 |
+
|
| 174 |
+
La qualit\'e et la diversit\'e des donn\'ees d'entra\^inement sont des facteurs d\'eterminants pour la performance d'un mod\`ele de langage. Nous constituons un corpus d'entra\^inement bilingue d'environ 39 milliards de tokens avec un ratio de 70\,\% d'anglais et 30\,\% de fran\c{c}ais (tableau~\ref{tab:data_sources}).
|
| 175 |
+
|
| 176 |
+
\begin{itemize}[leftmargin=*]
|
| 177 |
+
\item \textbf{Wikip\'edia} (EN + FR, 5,5B tokens) : texte encyclop\'edique factuel et bien structur\'e dans les deux langues.
|
| 178 |
+
\item \textbf{OSCAR 2301} (EN + FR, 15B tokens) : vaste corpus web multilingue extrait de Common Crawl, offrant de la diversit\'e linguistique mais avec un bruit plus important.
|
| 179 |
+
\item \textbf{FineWeb-Edu} (EN, 8B tokens) : corpus anglophone de haute qualit\'e filtr\'e pour son contenu \'educatif, contribuant significativement \`a la qualit\'e des repr\'esentations apprises.
|
| 180 |
+
\item \textbf{Projet Gutenberg} (EN + FR, 1B tokens) : textes litt\'eraires libres de droits.
|
| 181 |
+
\item \textbf{The Stack} (Multi, 2B tokens) : code source d\'edupliqu\'e provenant de d\'ep\^ots GitHub ouverts, permettant au mod\`ele d'acqu\'erir des comp\'etences basiques en compr\'ehension de code.
|
| 182 |
+
\end{itemize}
|
| 183 |
+
|
| 184 |
+
\begin{table}[H]
|
| 185 |
+
\centering
|
| 186 |
+
\caption{Composition des donn\'ees d'entra\^inement pour Julian-600M (39B tokens).}
|
| 187 |
+
\label{tab:data_sources}
|
| 188 |
+
\begin{tabular}{lccc}
|
| 189 |
+
\toprule
|
| 190 |
+
\textbf{Source} & \textbf{Langues} & \textbf{Tokens (approx.)} & \textbf{Qualit\'e} \\
|
| 191 |
+
\midrule
|
| 192 |
+
Wikip\'edia & EN + FR & 5,5B & \'Elev\'ee \\
|
| 193 |
+
OSCAR 2301 & EN + FR & 15B & Moyenne \\
|
| 194 |
+
FineWeb-Edu & EN & 8B & Tr\`es \'elev\'ee \\
|
| 195 |
+
Projet Gutenberg & EN + FR & 1B & \'Elev\'ee \\
|
| 196 |
+
The Stack (code) & Multi & 2B & \'Elev\'ee \\
|
| 197 |
+
\midrule
|
| 198 |
+
\textbf{Total} & & \textbf{$\sim$39B} & \\
|
| 199 |
+
\bottomrule
|
| 200 |
+
\end{tabular}
|
| 201 |
+
\end{table}
|
| 202 |
+
|
| 203 |
+
\subsection{Pipeline de Traitement des Donn\'ees}
|
| 204 |
+
|
| 205 |
+
Notre pipeline de traitement des donn\'ees comprend les \'etapes suivantes :
|
| 206 |
+
|
| 207 |
+
\begin{enumerate}[leftmargin=*]
|
| 208 |
+
\item \textbf{T\'el\'echargement} : Les donn\'ees brutes sont obtenues depuis les datasets HuggingFace (OSCAR, FineWeb-Edu, The Stack), les dumps Wikip\'edia et les miroirs du Projet Gutenberg.
|
| 209 |
+
\item \textbf{Nettoyage} : Les documents de moins de 100 caract\`eres ou de plus de 500K caract\`eres sont retir\'es. Nous imposons un ratio minimum de 70\,\% de caract\`eres alphanum\'eriques pour \'eliminer les documents principalement compos\'es de code HTML, de listes de liens ou de contenu non textuel.
|
| 210 |
+
\item \textbf{D\'eduplication} : Le MinHash Locality-Sensitive Hashing (LSH) avec un seuil de similarit\'e de Jaccard de 0,8 est utilis\'e pour la suppression des quasi-doublons. Cette technique permet de d\'etecter efficacement les documents similaires sans comparer chaque paire, ce qui serait prohibitif \`a grande \'echelle.
|
| 211 |
+
\item \textbf{D\'etection de langue} : Nous utilisons l'identification de langue fastText avec un seuil de confiance de 0,8 pour assurer un \'etiquetage linguistique correct et maintenir le ratio 70/30.
|
| 212 |
+
\item \textbf{Tokenisation} : Le corpus nettoy\'e est tokenis\'e avec notre tokenizer SentencePiece et empaquet\'e en s\'equences de 2048 tokens.
|
| 213 |
+
\item \textbf{Sharding} : Les donn\'ees tokenis\'ees sont d\'ecoup\'ees en 359 shards stock\'es sur Google Cloud Storage (GCS) pour le streaming pendant l'entra\^inement.
|
| 214 |
+
\end{enumerate}
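Pour illustrer l'\'etape de nettoyage ci-dessus (\'etape 2), l'esquisse Python suivante applique les deux filtres d\'ecrits ; les seuils reprennent ceux du texte et la fonction est purement illustrative :

\begin{verbatim}
def garder_document(texte):
    # Filtre de longueur : entre 100 et 500 000 caracteres.
    if not (100 <= len(texte) <= 500_000):
        return False
    # Filtre de contenu : au moins 70 % de caracteres alphanumeriques.
    ratio_alnum = sum(c.isalnum() for c in texte) / len(texte)
    return ratio_alnum >= 0.70
\end{verbatim}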
|
| 215 |
+
|
| 216 |
+
% ============================================================================
|
| 217 |
+
% 5. Proc\'edure d'Entra\^inement
|
| 218 |
+
% ============================================================================
|
| 219 |
+
\section{Proc\'edure d'Entra\^inement}
|
| 220 |
+
|
| 221 |
+
\subsection{Infrastructure}
|
| 222 |
+
|
| 223 |
+
Tout l'entra\^inement est r\'ealis\'e sur des pods TPU v4-32 de Google Cloud (32 puces TPU r\'eparties sur 4 h\^otes) fournis dans le cadre du programme TPU Research Cloud (TRC). Les TPU (\emph{Tensor Processing Units}) sont des acc\'el\'erateurs mat\'eriels con\c{c}us par Google, sp\'ecifiquement optimis\'es pour les op\'erations matricielles qui constituent le c\oe{}ur du calcul des r\'eseaux de neurones. Contrairement aux GPU qui sont des processeurs g\'en\'eralistes adapt\'es au calcul parall\`ele, les TPU sont des ASIC (\emph{Application-Specific Integrated Circuits}) d\'edi\'es, offrant un meilleur rapport performance/watt pour l'entra\^inement de mod\`eles de langage.
|
| 224 |
+
|
| 225 |
+
Nous utilisons le framework JAX \citep{bradbury2018jax} avec Flax pour la d\'efinition du mod\`ele et Optax pour l'optimisation. JAX est un framework de calcul num\'erique qui combine la familiarit\'e de NumPy avec la compilation JIT (\emph{Just-In-Time}) et la diff\'erentiation automatique. Son principal avantage pour l'entra\^inement sur TPU est sa gestion native du parall\'elisme multi-device.
|
| 226 |
+
|
| 227 |
+
\subsection{Strat\'egie de Parall\'elisme}
|
| 228 |
+
|
| 229 |
+
L'entra\^inement d'un mod\`ele de 600M de param\`etres sur un seul acc\'el\'erateur serait possible mais extr\^emement lent. Pour acc\'el\'erer le processus, nous distribuons le calcul sur les 32 puces TPU. Nous employons le Fully Sharded Data Parallelism (FSDP) \citep{xu2021gspmd} via la primitive \texttt{pmap} de JAX. Concr\`etement, les param\`etres du mod\`ele sont r\'epliqu\'es sur tous les devices, tandis que la dimension du batch est fragment\'ee : chaque puce TPU traite une portion diff\'erente des donn\'ees. Les gradients sont ensuite agr\'eg\'es (moyenn\'es) entre toutes les puces avant la mise \`a jour des param\`etres.
|
| 230 |
+
|
| 231 |
+
L'accumulation de gradient sur 8 micro-pas permet d'atteindre une taille de batch effective de 1024 s\'equences (4 s\'equences par device $\times$ 32 devices $\times$ 8 pas d'accumulation). \`A 2048 tokens par s\'equence, chaque pas d'entra\^inement traite environ 2,1 millions de tokens.
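L'esquisse ci-dessous montre la forme g\'en\'erale d'un pas d'entra\^inement data-parall\`ele avec \texttt{pmap}, o\`u la perte et les gradients sont moyenn\'es entre les devices via \texttt{pmean}. Il s'agit d'un exemple volontairement r\'eduit (perte jouet, mise \`a jour SGD au lieu d'AdamW, pas d'accumulation de gradient), donn\'e sous hypoth\`eses et non extrait du code r\'eel :

\begin{verbatim}
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Perte jouet (regression lineaire) ; dans Julian, il s'agit de la
    # cross-entropy du Transformer sur le token suivant.
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

def train_step(params, batch, lr=1e-3):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Moyenne de la perte et des gradients sur l'axe "data" (les 32 puces TPU).
    loss = jax.lax.pmean(loss, axis_name="data")
    grads = jax.lax.pmean(grads, axis_name="data")
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, loss

# Parametres repliques sur chaque device ; le batch est fragmente sur l'axe 0.
p_train_step = jax.pmap(train_step, axis_name="data")
\end{verbatim}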
|
| 232 |
+
|
| 233 |
+
\subsection{Optimiseur et Programme d'Apprentissage}
|
| 234 |
+
|
| 235 |
+
L'optimiseur est l'algorithme qui met \`a jour les poids du mod\`ele \`a chaque pas d'entra\^inement en fonction des gradients calcul\'es. Nous utilisons AdamW \citep{loshchilov2019decoupled}, une variante d'Adam qui d\'ecouple la r\'egularisation par d\'ecroissance des poids (\emph{weight decay}) de l'optimisation. AdamW maintient deux \og moments \fg{} pour chaque param\`etre : le premier moment $\mu$ (moyenne mobile des gradients) et le second moment $\nu$ (moyenne mobile des gradients au carr\'e), qui permettent d'adapter le taux d'apprentissage individuellement pour chaque param\`etre.
|
| 236 |
+
|
| 237 |
+
\paragraph{Budget de calcul.}
|
| 238 |
+
L'entra\^inement de Julian-600M n\'ecessite environ $1{,}4 \times 10^{20}$ FLOPs, estim\'es selon la formule $C \approx 6ND$ o\`u $N$ est le nombre de param\`etres et $D$ le nombre de tokens ($6 \times 6{,}0\times10^{8} \times 3{,}9\times10^{10} \approx 1{,}4\times10^{20}$). Le temps d'entra\^inement total est d'environ 21 jours sur un pod TPU v4-32, avec une utilisation des FLOPs du mod\`ele (\emph{Model FLOPs Utilization}, MFU) d'environ 38\,\%.
|
| 239 |
+
|
| 240 |
+
\begin{table}[h]
|
| 241 |
+
\centering
|
| 242 |
+
\caption{Hyperparam\`etres de pr\'e-entra\^inement pour Julian-600M.}
|
| 243 |
+
\label{tab:hyperparams}
|
| 244 |
+
\begin{tabular}{lc}
|
| 245 |
+
\toprule
|
| 246 |
+
\textbf{Hyperparam\`etre} & \textbf{Valeur} \\
|
| 247 |
+
\midrule
|
| 248 |
+
Optimiseur & AdamW \\
|
| 249 |
+
$\beta_1$, $\beta_2$ & 0,9 ; 0,95 \\
|
| 250 |
+
$\epsilon$ & $10^{-8}$ \\
|
| 251 |
+
D\'ecroissance des poids & 0,1 \\
|
| 252 |
+
Taux d'apprentissage maximal & $1,2 \times 10^{-3}$ \\
|
| 253 |
+
Taux d'apprentissage minimal & $1,2 \times 10^{-4}$ (10\,\% du max) \\
|
| 254 |
+
Pas de warmup & 3\,000 \\
|
| 255 |
+
Pas totaux & 300\,000 \\
|
| 256 |
+
Programme du taux d'apprentissage & Cosine annealing \\
|
| 257 |
+
Gradient clipping & 1,0 (norme globale) \\
|
| 258 |
+
Taille de batch (par device) & 4 \\
|
| 259 |
+
Pas d'accumulation de gradient & 8 \\
|
| 260 |
+
Taille de batch effective & 1\,024 \\
|
| 261 |
+
Longueur de s\'equence & 2\,048 \\
|
| 262 |
+
Tokens par pas & $\sim$2,1M \\
|
| 263 |
+
Tokens totaux & $\sim$39B \\
|
| 264 |
+
Pr\'ecision & bfloat16 \\
|
| 265 |
+
\bottomrule
|
| 266 |
+
\end{tabular}
|
| 267 |
+
\end{table}
|
| 268 |
+
|
| 269 |
+
Nous suivons le programme cosinus de Chinchilla \citep{hoffmann2022training} : un warmup lin\'eaire de 0 au taux d'apprentissage maximal sur 3\,000 pas, suivi d'une d\'ecroissance en cosinus jusqu'\`a 10\,\% de la valeur maximale. Le \emph{warmup} est une phase critique o\`u le taux d'apprentissage augmente progressivement depuis z\'ero, permettant au mod\`ele de se stabiliser avant de recevoir des mises \`a jour plus agressives. La d\'ecroissance en cosinus r\'eduit ensuite progressivement le taux d'apprentissage, permettant au mod\`ele de converger finement.
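Cette configuration se transcrit presque directement en Optax ; l'esquisse ci-dessous reprend les valeurs du tableau~\ref{tab:hyperparams} et illustre le principe plut\^ot que notre code exact :

\begin{Verbatim}[fontsize=\small]
# Esquisse : programme warmup lineaire + decroissance cosinus, puis AdamW
# avec clipping de la norme globale, en reprenant les hyperparametres du papier.
import optax

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,        # depart a zero
    peak_value=1.2e-3,     # taux d'apprentissage maximal
    warmup_steps=3_000,    # warmup lineaire
    decay_steps=300_000,   # nombre total de pas (warmup inclus)
    end_value=1.2e-4,      # 10 % du maximum en fin d'entrainement
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),                  # gradient clipping
    optax.adamw(learning_rate=schedule,
                b1=0.9, b2=0.95, eps=1e-8,
                weight_decay=0.1),
)
\end{Verbatim}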
|
| 270 |
+
|
| 271 |
+
Les \'etats de l'optimiseur ($\mu$ et $\nu$) sont stock\'es en bfloat16 pour r\'eduire la consommation m\'emoire d'environ 40\,\%. Le format bfloat16 (Brain Floating Point 16-bit) \citep{micikevicius2018mixed} utilise 16 bits au lieu de 32, r\'eduisant la m\'emoire de moiti\'e avec une perte de pr\'ecision n\'egligeable pour l'entra\^inement.
|
| 272 |
+
|
| 273 |
+
\subsection{Robustesse}
|
| 274 |
+
|
| 275 |
+
L'entra\^inement sur des instances TPU pr\'eemptibles (\emph{spot instances}, qui peuvent \^etre interrompues \`a tout moment par le fournisseur cloud pour lib\'erer des ressources) n\'ecessite une gestion robuste des checkpoints. Nous impl\'ementons :
|
| 276 |
+
\begin{itemize}[leftmargin=*]
|
| 277 |
+
\item \textbf{Checkpointing asynchrone} via Orbax, sauvegardant tous les 10\,000 pas sans bloquer l'entra\^inement. L'\'ecriture des checkpoints se fait en arri\`ere-plan tandis que le calcul continue.
|
| 278 |
+
\item \textbf{Gestionnaire SIGTERM} : Lors d'une pr\'eemption, le syst\`eme d'exploitation envoie un signal SIGTERM. Notre gestionnaire \'ecrit un checkpoint d'urgence dans la p\'eriode de gr\^ace de 30 secondes avant l'arr\^et forc\'e (voir l'esquisse apr\`es cette liste).
|
| 279 |
+
\item \textbf{Monitoring de sant\'e} : D\'etection automatique des valeurs NaN/Inf dans les gradients et la loss, avec logique de disjoncteur (\emph{circuit breaker}) pour les tentatives de r\'ecup\'eration.
|
| 280 |
+
\item \textbf{Synchronisation globale} : Barri\`ere de synchronisation JAX avant les \'ecritures de checkpoint pour assurer la coh\'erence multi-h\^ote. Sans cette synchronisation, les h\^otes pourraient sauvegarder des \'etats \`a des pas diff\'erents, corrompant le checkpoint.
|
| 281 |
+
\end{itemize}
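\`A titre d'illustration, le m\'ecanisme de pr\'eemption d\'ecrit ci-dessus repose sur un sch\'ema simple : le signal SIGTERM l\`eve un drapeau, et la boucle d'entra\^inement d\'eclenche le checkpoint d'urgence entre deux pas. Esquisse ind\'ependante de notre code, o\`u \texttt{save\_checkpoint} est un simple substitut de l'\'ecriture Orbax :

\begin{Verbatim}[fontsize=\small]
# Esquisse du gestionnaire de preemption : SIGTERM leve un drapeau,
# la boucle d'entrainement ecrit un checkpoint d'urgence puis s'arrete.
import signal
import threading
import time

preemption_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Ne rien ecrire ici : on signale simplement la preemption ; le
    # checkpoint est ecrit par la boucle pendant la periode de grace.
    preemption_requested.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def save_checkpoint(state, step):
    # Substitut : dans notre pipeline, cette etape delegue a Orbax
    # (ecriture asynchrone vers GCS) apres une barriere de synchronisation.
    print(f"checkpoint d'urgence au pas {step}")

state, step = {"params": None}, 0
while step < 1_000 and not preemption_requested.is_set():
    time.sleep(0.01)        # remplace un vrai pas d'entrainement
    step += 1
if preemption_requested.is_set():
    save_checkpoint(state, step)
\end{Verbatim}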
|
| 282 |
+
|
| 283 |
+
% ============================================================================
|
| 284 |
+
% 6. Supervised Fine-Tuning
|
| 285 |
+
% ============================================================================
|
| 286 |
+
\section{Supervised Fine-Tuning}
|
| 287 |
+
|
| 288 |
+
Le pr\'e-entra\^inement produit un mod\`ele capable de pr\'edire le prochain token dans un texte, c'est-\`a-dire un \og compl\'eteur de texte \fg{} : si on lui donne le d\'ebut d'un article Wikip\'edia, il peut le continuer de mani\`ere coh\'erente. Cependant, ce mod\`ele ne sait pas r\'epondre \`a des questions ou suivre des instructions. Le \emph{Supervised Fine-Tuning} (SFT) transforme ce compl\'eteur de texte en \og assistant \fg{} en le r\'e-entra\^inant sur des exemples de conversations instruction-r\'eponse.
|
| 289 |
+
|
| 290 |
+
Nous effectuons le SFT sur le checkpoint pr\'e-entra\^in\'e de Julian-600M (pas 300\,000, soit 39B tokens vus).
|
| 291 |
+
|
| 292 |
+
\subsection{Dataset d'Instructions}
|
| 293 |
+
|
| 294 |
+
Notre dataset SFT comprend 2,47 millions de paires instruction-r\'eponse issues de multiples sources :
|
| 295 |
+
|
| 296 |
+
\begin{table}[H]
|
| 297 |
+
\centering
|
| 298 |
+
\caption{Composition du dataset SFT.}
|
| 299 |
+
\label{tab:sft_data}
|
| 300 |
+
\begin{tabular}{lcc}
|
| 301 |
+
\toprule
|
| 302 |
+
\textbf{Source} & \textbf{Exemples (approx.)} & \textbf{Langue} \\
|
| 303 |
+
\midrule
|
| 304 |
+
Stanford Alpaca & 52K & Anglais \\
|
| 305 |
+
Databricks Dolly 15K & 15K & Anglais \\
|
| 306 |
+
Code Alpaca & 20K & Anglais \\
|
| 307 |
+
GPT4All-J & 20K & Anglais \\
|
| 308 |
+
Donn\'ees d'instructions fran\c{c}aises & 15K+ & Fran\c{c}ais \\
|
| 309 |
+
OpenHermes 2.5 & $\sim$900K & Anglais \\
|
| 310 |
+
SlimOrca & $\sim$500K & Anglais \\
|
| 311 |
+
Autres donn\'ees synth\'etiques & $\sim$900K & Multilingue \\
|
| 312 |
+
\midrule
|
| 313 |
+
\textbf{Total} & \textbf{2,47M} & \\
|
| 314 |
+
\bottomrule
|
| 315 |
+
\end{tabular}
|
| 316 |
+
\end{table}
|
| 317 |
+
|
| 318 |
+
\subsection{Format ChatML}
|
| 319 |
+
|
| 320 |
+
Toutes les donn\'ees d'instructions sont format\'ees avec le template ChatML \citep{openai2023chatml}. Ce format structure la conversation en segments clairement d\'elimit\'es par les tokens sp\'eciaux \texttt{<|im\_start|>} et \texttt{<|im\_end|>} :
|
| 321 |
+
|
| 322 |
+
\smallskip\noindent\begin{minipage}{\textwidth}
|
| 323 |
+
\begin{Verbatim}[fontsize=\small, vspace=0pt]
|
| 324 |
+
<|im_start|>system
|
| 325 |
+
You are a helpful assistant.<|im_end|>
|
| 326 |
+
<|im_start|>user
|
| 327 |
+
{instruction}<|im_end|>
|
| 328 |
+
<|im_start|>assistant
|
| 329 |
+
{response}<|im_end|>
|
| 330 |
+
\end{Verbatim}
|
| 331 |
+
\end{minipage}
|
| 332 |
+
\smallskip\noindent Pendant le SFT, la loss est calcul\'ee \emph{uniquement} sur les tokens de r\'eponse de l'assistant en utilisant un masque de loss binaire. Les tokens du syst\`eme et de l'utilisateur re\c{c}oivent un poids de loss nul, ce qui assure que le mod\`ele apprend \`a \emph{g\'en\'erer} des r\'eponses plut\^ot qu'\`a m\'emoriser les instructions.
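Concr\`etement, ce masque revient \`a pond\'erer la cross-entropy token par token. Esquisse JAX/Optax, ind\'ependante de notre impl\'ementation exacte :

\begin{Verbatim}[fontsize=\small]
# Esquisse : cross-entropy masquee pour le SFT. Seuls les tokens de la
# reponse de l'assistant (loss_mask = 1) contribuent a la loss.
import jax.numpy as jnp
import optax

def sft_loss(logits, labels, loss_mask):
    # logits : [batch, seq, vocab] ; labels et loss_mask : [batch, seq]
    per_token = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    per_token = per_token * loss_mask                # 0 sur system et user
    return per_token.sum() / jnp.maximum(loss_mask.sum(), 1.0)
\end{Verbatim}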
|
| 333 |
+
|
| 334 |
+
\subsection{Hyperparam\`etres du SFT}
|
| 335 |
+
|
| 336 |
+
\begin{table}[h]
|
| 337 |
+
\centering
|
| 338 |
+
\caption{Hyperparam\`etres d'entra\^inement SFT.}
|
| 339 |
+
\label{tab:sft_hyperparams}
|
| 340 |
+
\begin{tabular}{lc}
|
| 341 |
+
\toprule
|
| 342 |
+
\textbf{Hyperparam\`etre} & \textbf{Valeur} \\
|
| 343 |
+
\midrule
|
| 344 |
+
Checkpoint de base & pas 300\,000 (39B tokens) \\
|
| 345 |
+
Taux d'apprentissage & $2 \times 10^{-5}$ \\
|
| 346 |
+
Pas de warmup & 1\,000 \\
|
| 347 |
+
Taille de batch (effective) & 32--256 \\
|
| 348 |
+
Longueur de s\'equence & 2\,048 \\
|
| 349 |
+
D\'ecroissance des poids & 0,01 \\
|
| 350 |
+
Gradient clipping & 1,0 \\
|
| 351 |
+
\bottomrule
|
| 352 |
+
\end{tabular}
|
| 353 |
+
\end{table}
|
| 354 |
+
|
| 355 |
+
Nous entra\^inons deux variantes SFT avec des dur\'ees diff\'erentes :
|
| 356 |
+
\begin{itemize}[leftmargin=*]
|
| 357 |
+
\item \textbf{SFT-30K} : 30\,000 pas. Calcul : $30\,000 \times 32 \times 2048 = 1{,}97$ milliards de tokens vus, soit 0,66 \'epoque. Loss finale : 1,86.
|
| 358 |
+
\item \textbf{SFT-100K} : 100\,000 pas. Calcul : $100\,000 \times 32 \times 2048 = 6{,}55$ milliards de tokens vus, soit 2,20 \'epoques (chaque exemple vu en moyenne 2,2 fois). Loss finale : 1,69.
|
| 359 |
+
\end{itemize}
|
| 360 |
+
|
| 361 |
+
Une \'epoque repr\'esente un passage complet \`a travers le dataset. Compt\'ee en tokens, apr\`es empaquetage des 2,47M d'exemples en s\'equences de 2\,048 tokens, une \'epoque compl\`ete correspond \`a environ 45\,383 pas avec un batch de 32 ($45\,383 \times 32 \times 2\,048 \approx 2{,}97$ milliards de tokens).
|
| 362 |
+
|
| 363 |
+
Une variante ant\'erieure, \textbf{Julian-600M-10B-Instruct-v0.1}, a \'et\'e fine-tun\'ee \`a partir d'un checkpoint interm\'ediaire du pr\'e-entra\^inement (pas 100\,000, $\sim$10B tokens) sur un dataset d'instructions plus petit ($\sim$185K exemples, 5\,500 pas). Cette variante sert de ligne de base pour la comparaison.
|
| 364 |
+
|
| 365 |
+
% ============================================================================
|
| 366 |
+
% 7. \'Evaluation
|
| 367 |
+
% ============================================================================
|
| 368 |
+
\section{\'Evaluation}
|
| 369 |
+
|
| 370 |
+
\subsection{Suite de Benchmarks}
|
| 371 |
+
|
| 372 |
+
Les benchmarks sont des tests standardis\'es qui mesurent diff\'erentes capacit\'es d'un mod\`ele de langage. Nous \'evaluons tous les mod\`eles Julian en mode zero-shot (sans exemples fournis au mod\`ele) en utilisant le Language Model Evaluation Harness \citep{gao2023framework} :
|
| 373 |
+
|
| 374 |
+
\begin{itemize}[leftmargin=*]
|
| 375 |
+
\item \textbf{HellaSwag} \citep{zellers2019hellaswag} : Inf\'erence en langage naturel par sens commun. Le mod\`ele doit choisir la suite la plus plausible d'un sc\'enario parmi 4 propositions. Exemple : \og Une personne entre dans une cuisine et ouvre le r\'efrig\'erateur. Elle...\fg{} suivi de 4 fins possibles. Nous rapportons la pr\'ecision normalis\'ee par la longueur (acc\_norm).
|
| 376 |
+
\item \textbf{PIQA} \citep{bisk2020piqa} : Questions-r\'eponses sur l'intuition physique. Le mod\`ele doit choisir la m\'ethode correcte pour accomplir un objectif physique. Exemple : \og Pour s\'eparer un \oe{}uf, vous devriez...\fg{} Nous rapportons la pr\'ecision (acc).
|
| 377 |
+
\item \textbf{LAMBADA} \citep{paperno2016lambada} : Pr\'ediction du dernier mot d'un passage n\'ecessitant une compr\'ehension du contexte large. Ce benchmark mesure la capacit\'e du mod\`ele \`a utiliser un contexte long pour pr\'edire un mot sp\'ecifique. Nous rapportons la pr\'ecision (acc).
|
| 378 |
+
\item \textbf{ARC-Easy / ARC-Challenge} \citep{clark2018think} : Questions de sciences (niveau primaire/coll\`ege). ARC-Challenge contient les questions auxquelles les algorithmes de r\'ecup\'eration d'information \'echouent, testant le raisonnement plut\^ot que la m\'emorisation. Nous rapportons acc / acc\_norm respectivement.
|
| 379 |
+
\item \textbf{WinoGrande} \citep{sakaguchi2020winogrande} : R\'esolution de cor\'ef\'erence par sens commun. Le mod\`ele doit d\'eterminer \`a quoi se r\'ef\`ere un pronom dans une phrase ambigu\"e. Exemple : \og Le troph\'ee ne rentrait pas dans la valise parce qu'\emph{il} \'etait trop grand \fg{} --- \og il \fg{} d\'esigne-t-il le troph\'ee ou la valise ? Nous rapportons la pr\'ecision (acc).
|
| 380 |
+
\item \textbf{BoolQ} \citep{clark2019boolq} : Questions oui/non naturelles extraites de requ\^etes Google r\'eelles, associ\'ees \`a des passages Wikip\'edia. Nous rapportons la pr\'ecision (acc).
|
| 381 |
+
\end{itemize}
|
| 382 |
+
|
| 383 |
+
\subsection{Infrastructure d'\'Evaluation}
|
| 384 |
+
|
| 385 |
+
Un d\'efi technique notable : le harness standard lm-eval avec les mod\`eles HuggingFace utilise par d\'efaut PyTorch en mode CPU lorsqu'il est ex\'ecut\'e sur des VM TPU (pas de CUDA disponible). Cela rend l'\'evaluation extr\^emement lente ($\sim$7 minutes par item). Pour contourner cette limitation, nous impl\'ementons un wrapper d'\'evaluation JAX custom qui effectue l'inf\'erence directement sur TPU. Ce wrapper atteint environ 5,8 items/seconde avec un batch de 48, et ex\'ecute la suite d'\'evaluation compl\`ete ($\sim$72K requ\^etes) en environ 3,5 heures sur un seul pod TPU v4-32.
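Le principe de scoring reste celui du harness : pour chaque question \`a choix multiples, on calcule la log-vraisemblance de chaque continuation candidate sous le mod\`ele et on retient la plus probable, avec normalisation par la longueur pour acc\_norm. Esquisse, o\`u la fonction \texttt{logprob\_fn} est un substitut hypoth\'etique de notre wrapper d'inf\'erence TPU :

\begin{Verbatim}[fontsize=\small]
# Esquisse du scoring zero-shot a choix multiples (style lm-eval) :
# la reponse retenue est la continuation la plus probable, avec une
# normalisation optionnelle par la longueur (acc_norm).
import numpy as np

def score_choices(logprob_fn, context, choices, normalize=True):
    # logprob_fn(context, continuation) -> somme des log-probas des tokens
    # de la continuation ; fournie par le wrapper d'inference (hypothetique).
    scores = []
    for choice in choices:
        lp = logprob_fn(context, choice)
        if normalize:
            lp = lp / max(len(choice), 1)   # normalisation par la longueur
        scores.append(lp)
    return int(np.argmax(scores))
\end{Verbatim}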
|
| 386 |
+
|
| 387 |
+
% ============================================================================
|
| 388 |
+
% 8. R\'esultats
|
| 389 |
+
% ============================================================================
|
| 390 |
+
\section{R\'esultats}
|
| 391 |
+
|
| 392 |
+
\subsection{Progression des Mod\`eles Julian}
|
| 393 |
+
|
| 394 |
+
Le tableau~\ref{tab:julian_results} pr\'esente les r\'esultats de benchmarks \`a travers toutes les variantes du mod\`ele Julian, illustrant l'impact du pr\'e-entra\^inement additionnel et du fine-tuning supervis\'e.
|
| 395 |
+
|
| 396 |
+
\begin{table}[H]
|
| 397 |
+
\centering
|
| 398 |
+
\caption{R\'esultats des benchmarks (0-shot) pour les variantes du mod\`ele Julian. Les valeurs en gras indiquent le meilleur score parmi les mod\`eles Julian pour chaque benchmark.}
|
| 399 |
+
\label{tab:julian_results}
|
| 400 |
+
\resizebox{\textwidth}{!}{
|
| 401 |
+
\begin{tabular}{llccccccccc}
|
| 402 |
+
\toprule
|
| 403 |
+
\textbf{Mod\`ele} & \textbf{Base} & \textbf{SFT} & \textbf{Loss} & \textbf{HS} & \textbf{PIQA} & \textbf{LAM.} & \textbf{ARC-E} & \textbf{ARC-C} & \textbf{WG} & \textbf{BoolQ} \\
|
| 404 |
+
\midrule
|
| 405 |
+
Base 10B (ckpt 100K) & 10B & --- & 3,20 & 45,8 & \textbf{67,6} & 35,0 & --- & --- & --- & --- \\
|
| 406 |
+
Base 39B (ckpt 300K) & 39B & --- & 2,33 & \textbf{53,5} & 66,8 & 37,3 & --- & --- & --- & --- \\
|
| 407 |
+
Instruct v0.1 (base 10B) & $\sim$10B & 5,5K & 5,01 & 42,7 & 66,2 & 34,6 & --- & --- & --- & --- \\
|
| 408 |
+
SFT-30K (base 39B) & 39B & 30K & 1,86 & 41,7 & 66,8 & \textbf{37,7} & 53,5 & \textbf{27,1} & \textbf{53,8} & 60,6 \\
|
| 409 |
+
SFT-100K (base 39B) & 39B & 100K & 1,69 & 41,6 & 66,6 & \textbf{37,7} & \textbf{53,8} & 26,7 & 52,8 & \textbf{60,8} \\
|
| 410 |
+
\bottomrule
|
| 411 |
+
\end{tabular}
|
| 412 |
+
}
|
| 413 |
+
\end{table}
|
| 414 |
+
|
| 415 |
+
Pour r\'ef\'erence, le checkpoint brut 170K (interm\'ediaire de pr\'e-entra\^inement, $\sim$22B tokens) obtient : HS=39,0, PIQA=66,2, LAM.=34,7, ARC-E=56,1, ARC-C=26,5, WG=51,0, BoolQ=59,0.
|
| 416 |
+
|
| 417 |
+
\subsection{Comparaison avec les Mod\`eles Existants}
|
| 418 |
+
|
| 419 |
+
Le tableau~\ref{tab:comparison} compare Julian-600M avec des mod\`eles publiquement disponibles de taille similaire ou sup\'erieure.
|
| 420 |
+
|
| 421 |
+
\begin{table}[H]
|
| 422 |
+
\centering
|
| 423 |
+
\caption{Comparaison avec les mod\`eles existants (0-shot). Julian-600M Base surpasse OPT-1.3B sur HellaSwag malgr\'e 2$\times$ moins de param\`etres et 8$\times$ moins de tokens d'entra\^inement.}
|
| 424 |
+
\label{tab:comparison}
|
| 425 |
+
\resizebox{\textwidth}{!}{
|
| 426 |
+
\begin{tabular}{lccccccccc}
|
| 427 |
+
\toprule
|
| 428 |
+
\textbf{Mod\`ele} & \textbf{Param.} & \textbf{Tokens} & \textbf{HS} & \textbf{PIQA} & \textbf{LAM.} & \textbf{ARC-E} & \textbf{ARC-C} & \textbf{WG} \\
|
| 429 |
+
\midrule
|
| 430 |
+
GPT-2 Small & 124M & 100B+ & 31,5 & --- & 46,0 & --- & --- & 50,4 \\
|
| 431 |
+
OPT-125M & 125M & 300B & 29,2 & 63,0 & 37,9 & 43,5 & 18,9 & 50,3 \\
|
| 432 |
+
OPT-350M & 331M & 300B & 32,0 & 64,4 & 45,2 & 44,0 & 20,7 & 52,3 \\
|
| 433 |
+
Pythia-410M & 405M & 300B & 33,3 & 66,8 & 50,5 & 50,4 & 21,3 & 53,0 \\
|
| 434 |
+
\midrule
|
| 435 |
+
\textbf{Julian-600M Base} & \textbf{600M} & \textbf{39B} & \textbf{53,5} & \textbf{66,8} & 37,3 & --- & --- & --- \\
|
| 436 |
+
\textbf{Julian-600M SFT-30K} & \textbf{600M} & \textbf{39B+2B} & 41,7 & \textbf{66,8} & \textbf{37,7} & \textbf{53,5} & \textbf{27,1} & \textbf{53,8} \\
|
| 437 |
+
\midrule
|
| 438 |
+
GPT-2 XL & 1\,558M & 100B+ & 50,9 & 70,8 & 63,2 & --- & --- & 59,4 \\
|
| 439 |
+
Pythia-1B & 1B & 300B & 37,6 & 70,5 & 56,6 & 55,9 & 24,3 & 54,5 \\
|
| 440 |
+
OPT-1.3B & 1,3B & 300B & 41,5 & 71,7 & 57,9 & 57,0 & 23,4 & 59,5 \\
|
| 441 |
+
\bottomrule
|
| 442 |
+
\end{tabular}
|
| 443 |
+
}
|
| 444 |
+
\end{table}
|
| 445 |
+
|
| 446 |
+
\paragraph{R\'esultats cl\'es.}
|
| 447 |
+
|
| 448 |
+
\begin{itemize}[leftmargin=*]
|
| 449 |
+
\item \textbf{HellaSwag} : Julian-600M Base atteint 53,5\,\%, surpassant GPT-2~XL (50,9\,\%, 1,5B param\`etres), OPT-1.3B (41,5\,\%) et Pythia-1B (37,6\,\%). C'est un r\'esultat remarquable pour un mod\`ele de 600M entra\^in\'e sur seulement 39B tokens.
|
| 450 |
+
\item \textbf{PIQA} : Julian-600M \'egale Pythia-410M \`a 66,8\,\% et se situe l\'eg\`erement en dessous des mod\`eles 2--3$\times$ plus grands.
|
| 451 |
+
\item \textbf{LAMBADA} : Julian-600M atteint 37,3\,\%, inf\'erieur aux mod\`eles de taille similaire entra\^in\'es sur plus de donn\'ees. LAMBADA est particuli\`erement sensible au volume et \`a la diversit\'e du texte d'entra\^inement, ce qui explique probablement cet \'ecart.
|
| 452 |
+
\item \textbf{Efficacit\'e en tokens} : Julian-600M atteint son score HellaSwag avec 39B tokens, tandis que les mod\`eles OPT et Pythia ont \'et\'e entra\^in\'es sur 300B tokens (7,7$\times$ plus).
|
| 453 |
+
\end{itemize}
|
| 454 |
+
|
| 455 |
+
\begin{figure}[t]
|
| 456 |
+
\centering
|
| 457 |
+
\begin{tikzpicture}
|
| 458 |
+
\begin{axis}[
|
| 459 |
+
xbar,
|
| 460 |
+
bar width=7pt,
|
| 461 |
+
width=0.88\textwidth,
|
| 462 |
+
height=6cm,
|
| 463 |
+
xlabel={HellaSwag (acc\_norm, \%)},
|
| 464 |
+
ytick={0,1,2,3,4,5,6,7},
|
| 465 |
+
yticklabels={
|
| 466 |
+
{OPT-125M {\scriptsize(125M, 300B tok)}},
|
| 467 |
+
{GPT-2 Small {\scriptsize(124M, 100B+ tok)}},
|
| 468 |
+
{OPT-350M {\scriptsize(331M, 300B tok)}},
|
| 469 |
+
{Pythia-410M {\scriptsize(405M, 300B tok)}},
|
| 470 |
+
{Pythia-1B {\scriptsize(1B, 300B tok)}},
|
| 471 |
+
{OPT-1.3B {\scriptsize(1,3B, 300B tok)}},
|
| 472 |
+
{GPT-2 XL {\scriptsize(1,5B, 100B+ tok)}},
|
| 473 |
+
{\textbf{Julian-600M} {\scriptsize\textbf{(600M, 39B tok)}}}
|
| 474 |
+
},
|
| 475 |
+
xmin=25, xmax=58,
|
| 476 |
+
nodes near coords,
|
| 477 |
+
nodes near coords style={font=\scriptsize, anchor=west},
|
| 478 |
+
enlarge y limits=0.1,
|
| 479 |
+
xmajorgrids=true,
|
| 480 |
+
grid style={gray!20},
|
| 481 |
+
y tick label style={font=\footnotesize},
|
| 482 |
+
]
|
| 483 |
+
\addplot[fill=gray!40, draw=gray!60] coordinates {
|
| 484 |
+
(29.2,0) (31.5,1) (32.0,2) (33.3,3) (37.6,4) (41.5,5) (50.9,6) (53.5,7)
|
| 485 |
+
};
|
| 486 |
+
\end{axis}
|
| 487 |
+
\end{tikzpicture}
|
| 488 |
+
\caption{Pr\'ecision HellaSwag (acc\_norm) compar\'ee entre mod\`eles, tri\'ee par score. Les nombres entre parenth\`eses indiquent le nombre de param\`etres et le volume de donn\'ees d'entra\^inement. Julian-600M obtient le score le plus \'elev\'e malgr\'e moins de param\`etres et significativement moins de donn\'ees que la plupart des mod\`eles de comparaison.}
|
| 489 |
+
\label{fig:hellaswag_comparison}
|
| 490 |
+
\end{figure}
|
| 491 |
+
|
| 492 |
+
% ============================================================================
|
| 493 |
+
% 9. Interpr\'etation des R\'esultats
|
| 494 |
+
% ============================================================================
|
| 495 |
+
\section{Interpr\'etation des R\'esultats}
|
| 496 |
+
|
| 497 |
+
Cette section propose une analyse approfondie des r\'esultats pr\'esent\'es ci-dessus, en examinant les dynamiques du pr\'e-entra\^inement, l'impact du SFT et les ph\'enom\`enes de saturation observ\'es.
|
| 498 |
+
|
| 499 |
+
\subsection{Progression du Pr\'e-entra\^inement}
|
| 500 |
+
|
| 501 |
+
L'\'evolution des performances entre les deux checkpoints de pr\'e-entra\^inement r\'ev\`ele une dynamique d'apprentissage soutenue. Entre le checkpoint \`a 10B tokens (pas 100\,000) et le checkpoint final \`a 39B tokens (pas 300\,000), nous observons :
|
| 502 |
+
|
| 503 |
+
\begin{itemize}[leftmargin=*]
|
| 504 |
+
\item \textbf{HellaSwag} : 45,8\,\% $\rightarrow$ 53,5\,\% (+7,7 points)
|
| 505 |
+
\item \textbf{Loss} : 3,20 $\rightarrow$ 2,33 ($-$27\,\%)
|
| 506 |
+
\item \textbf{PIQA} : 67,6\,\% $\rightarrow$ 66,8\,\% ($-$0,8 point)
|
| 507 |
+
\item \textbf{LAMBADA} : 35,0\,\% $\rightarrow$ 37,3\,\% (+2,3 points)
|
| 508 |
+
\end{itemize}
|
| 509 |
+
|
| 510 |
+
La progression de +7,7 points sur HellaSwag est particuli\`erement significative. Ce benchmark mesure la capacit\'e de raisonnement par sens commun, et l'am\'elioration continue sugg\`ere que le mod\`ele n'a pas atteint sa capacit\'e maximale d'apprentissage \`a 39B tokens. La loss qui continue de d\'ecro\^itre de mani\`ere substantielle (de 3,20 \`a 2,33) confirme l'absence de saturation : le mod\`ele continue d'apprendre efficacement \`a chaque pas d'entra\^inement suppl\'ementaire.
|
| 511 |
+
|
| 512 |
+
Il est int\'eressant de noter la l\'eg\`ere baisse de PIQA ($-$0,8 point), qui pourrait refl\'eter une redistribution de la capacit\'e du mod\`ele \`a mesure qu'il apprend des repr\'esentations plus complexes. Ce ph\'enom\`ene, parfois appel\'e \og interf\'erence catastrophique \fg{} dans sa forme extr\^eme, est ici b\'enin et compens\'e par les gains substantiels sur d'autres m\'etriques.
|
| 513 |
+
|
| 514 |
+
En extrapolant cette trajectoire, on peut raisonnablement s'attendre \`a ce qu'un entra\^inement continu au-del\`a de 39B tokens apporte des am\'eliorations suppl\'ementaires, particuli\`erement sur LAMBADA o\`u Julian-600M reste en retrait par rapport aux mod\`eles entra\^in\'es sur 300B tokens.
|
| 515 |
+
|
| 516 |
+
\subsection{Impact du SFT sur les Capacit\'es du Mod\`ele}
|
| 517 |
+
|
| 518 |
+
Le SFT transforme fondamentalement le comportement du mod\`ele : d'un \og compl\'eteur de texte \fg{} qui pr\'edit statistiquement le prochain token, il devient un \og assistant \fg{} capable de r\'epondre \`a des instructions structur\'ees. Cette transformation a un co\^ut mesurable sur les benchmarks.
|
| 519 |
+
|
| 520 |
+
\paragraph{Le sacrifice HellaSwag.} La baisse la plus notable est celle de HellaSwag : $-$11,8 points (53,5\,\% $\rightarrow$ 41,7\,\%). Ce ph\'enom\`ene est bien document\'e dans la litt\'erature \citep{ouyang2022training} et s'explique par la nature m\^eme du SFT. HellaSwag mesure la capacit\'e du mod\`ele \`a compl\'eter naturellement un texte ; or, le SFT r\'eoriente le mod\`ele vers la production de r\'eponses dans un format conversationnel sp\'ecifique (ChatML). Le mod\`ele \og d\'esapprend \fg{} partiellement la compl\'etion libre au profit du suivi d'instructions. C'est un compromis attendu et g\'en\'eralement accept\'e.
|
| 521 |
+
|
| 522 |
+
\paragraph{Stabilit\'e du raisonnement.} En revanche, les benchmarks mesurant le raisonnement sont remarquablement stables apr\`es le SFT :
|
| 523 |
+
\begin{itemize}[leftmargin=*]
|
| 524 |
+
\item \textbf{PIQA} reste \`a 66,8\,\% (identique au mod\`ele de base), indiquant que l'intuition physique n'est pas affect\'ee.
|
| 525 |
+
\item \textbf{WinoGrande} atteint 53,8\,\% (non mesur\'e sur le mod\`ele de base, mais comparable aux mod\`eles de r\'ef\'erence).
|
| 526 |
+
\item \textbf{ARC-Challenge} \`a 27,1\,\% se situe dans la plage attendue pour un mod\`ele de 600M.
|
| 527 |
+
\end{itemize}
|
| 528 |
+
|
| 529 |
+
Ces r\'esultats sugg\`erent que le SFT n'alt\`ere pas les capacit\'es de raisonnement sous-jacentes du mod\`ele, mais modifie principalement la \emph{distribution de sortie} (le format des r\'eponses g\'en\'er\'ees).
|
| 530 |
+
|
| 531 |
+
\paragraph{Am\'elioration sur LAMBADA.} Fait notable, LAMBADA s'am\'eliore l\'eg\`erement apr\`es le SFT (+0,4 point, de 37,3\,\% \`a 37,7\,\%). Ce r\'esultat, a priori contre-intuitif, peut s'expliquer par le fait que le format instruction-r\'eponse encourage le mod\`ele \`a mieux exploiter le contexte fourni pour produire une r\'eponse pr\'ecise --- exactement ce que LAMBADA mesure (pr\'edire un mot \`a partir d'un contexte long).
|
| 532 |
+
|
| 533 |
+
\subsection{Le Sur-SFT : Analyse Quantitative (30K vs 100K)}
|
| 534 |
+
|
| 535 |
+
La comparaison entre SFT-30K et SFT-100K constitue l'un des r\'esultats les plus instructifs de ce travail. Le tableau~\ref{tab:sft_delta} pr\'esente le d\'etail des diff\'erences.
|
| 536 |
+
|
| 537 |
+
\begin{table}[H]
|
| 538 |
+
\centering
|
| 539 |
+
\caption{Comparaison d\'etaill\'ee entre SFT-30K et SFT-100K. $\Delta$ repr\'esente la diff\'erence (100K $-$ 30K). Le SFT-100K utilise 3,3$\times$ plus de compute pour des r\'esultats quasi identiques.}
|
| 540 |
+
\label{tab:sft_delta}
|
| 541 |
+
\begin{tabular}{lccccc}
|
| 542 |
+
\toprule
|
| 543 |
+
\textbf{Benchmark} & \textbf{SFT-30K} & \textbf{SFT-100K} & \textbf{$\Delta$} & \textbf{Tokens SFT} & \textbf{\'Epoques} \\
|
| 544 |
+
\midrule
|
| 545 |
+
Loss & 1,86 & 1,69 & $-$0,17 & --- & --- \\
|
| 546 |
+
HellaSwag & 41,7\,\% & 41,6\,\% & $-$0,1 & --- & --- \\
|
| 547 |
+
PIQA & 66,8\,\% & 66,6\,\% & $-$0,2 & --- & --- \\
|
| 548 |
+
LAMBADA & 37,7\,\% & 37,7\,\% & 0,0 & --- & --- \\
|
| 549 |
+
ARC-Easy & 53,5\,\% & 53,8\,\% & +0,3 & --- & --- \\
|
| 550 |
+
ARC-Challenge & 27,1\,\% & 26,7\,\% & $-$0,4 & --- & --- \\
|
| 551 |
+
WinoGrande & 53,8\,\% & 52,8\,\% & \textbf{$-$1,0} & --- & --- \\
|
| 552 |
+
BoolQ & 60,6\,\% & 60,8\,\% & +0,2 & --- & --- \\
|
| 553 |
+
\midrule
|
| 554 |
+
& & & & 1,97B vs 6,55B & 0,66 vs 2,20 \\
|
| 555 |
+
\bottomrule
|
| 556 |
+
\end{tabular}
|
| 557 |
+
\end{table}
|
| 558 |
+
|
| 559 |
+
\begin{figure}[t]
|
| 560 |
+
\centering
|
| 561 |
+
\begin{tikzpicture}
|
| 562 |
+
\begin{axis}[
|
| 563 |
+
ybar=8pt,
|
| 564 |
+
bar width=12pt,
|
| 565 |
+
width=\textwidth,
|
| 566 |
+
height=6.5cm,
|
| 567 |
+
ylabel={Pr\'ecision (\%)},
|
| 568 |
+
symbolic x coords={HellaSwag, PIQA, LAMBADA},
|
| 569 |
+
xtick=data,
|
| 570 |
+
ymin=30, ymax=72,
|
| 571 |
+
nodes near coords,
|
| 572 |
+
nodes near coords style={font=\scriptsize, /pgf/number format/fixed, /pgf/number format/precision=1, anchor=south},
|
| 573 |
+
legend style={at={(0.5,-0.15)}, anchor=north, legend columns=3, font=\small},
|
| 574 |
+
enlarge x limits=0.35,
|
| 575 |
+
ymajorgrids=true,
|
| 576 |
+
grid style={gray!15},
|
| 577 |
+
]
|
| 578 |
+
\addplot[fill=blue!25, draw=blue!50] coordinates {
|
| 579 |
+
(HellaSwag, 53.5) (PIQA, 66.8) (LAMBADA, 37.3)
|
| 580 |
+
};
|
| 581 |
+
\addplot[fill=orange!30, draw=orange!55] coordinates {
|
| 582 |
+
(HellaSwag, 41.7) (PIQA, 66.8) (LAMBADA, 37.7)
|
| 583 |
+
};
|
| 584 |
+
\addplot[fill=red!20, draw=red!45] coordinates {
|
| 585 |
+
(HellaSwag, 41.6) (PIQA, 66.6) (LAMBADA, 37.7)
|
| 586 |
+
};
|
| 587 |
+
\legend{Base 39B, {SFT-30K (0,66 \'ep.)}, {SFT-100K (2,2 \'ep.)}}
|
| 588 |
+
\end{axis}
|
| 589 |
+
\end{tikzpicture}
|
| 590 |
+
\caption{Impact du fine-tuning supervis\'e sur les performances aux benchmarks. Le SFT provoque une chute significative de HellaSwag ($-$11,8 points) tout en pr\'eservant PIQA et en am\'eliorant l\'eg\`erement LAMBADA. SFT-30K et SFT-100K obtiennent des r\'esultats quasi identiques malgr\'e 3,3$\times$ plus de calcul, indiquant une saturation claire.}
|
| 591 |
+
\label{fig:sft_impact}
|
| 592 |
+
\end{figure}
|
| 593 |
+
|
| 594 |
+
\paragraph{La loss n'est pas un bon indicateur de qualit\'e SFT.} Le r\'esultat le plus frappant est la d\'econnexion entre la loss et les performances aux benchmarks. La loss baisse significativement de 1,86 \`a 1,69 ($-$9\,\%), mais les benchmarks stagnent ou se d\'egradent. Ce ph\'enom\`ene r\'ev\`ele que le mod\`ele apprend \`a mieux reproduire le \emph{format} des r\'eponses du dataset SFT (baisse de la loss sur les tokens de r\'eponse) sans am\'eliorer ses \emph{connaissances} ou ses capacit\'es de \emph{raisonnement} sous-jacentes. En d'autres termes, le mod\`ele devient plus fluide dans le format ChatML sans devenir plus intelligent.
|
| 595 |
+
|
| 596 |
+
\paragraph{Signal d'overfitting : WinoGrande.} La d\'egradation de WinoGrande de 53,8\,\% \`a 52,8\,\% ($-$1,0 point) est le signal le plus clair d'overfitting. WinoGrande teste le raisonnement par sens commun sur la r\'esolution de pronoms ambigus, une capacit\'e qui ne devrait pas se d\'egrader avec un entra\^inement suppl\'ementaire si le mod\`ele g\'en\'eralisait correctement. Avec 2,47M d'exemples et 2,2 \'epoques, chaque exemple du dataset SFT a \'et\'e vu en moyenne plus de 2 fois. Le mod\`ele commence \`a m\'emoriser les patterns sp\'ecifiques du dataset plut\^ot qu'\`a g\'en\'eraliser, ce qui nuit \`a sa capacit\'e de raisonnement g\'en\'eral.
|
| 597 |
+
|
| 598 |
+
\paragraph{ARC-Challenge confirme la tendance.} La baisse d'ARC-Challenge ($-$0,4 point) va dans le m\^eme sens. Ce benchmark teste le raisonnement scientifique sur des questions difficiles, et sa d\'egradation parall\`ele \`a celle de WinoGrande renforce l'hypoth\`ese d'un overfitting qui impacte sp\'ecifiquement les capacit\'es de raisonnement.
|
| 599 |
+
|
| 600 |
+
\paragraph{Implication pratique.} Pour notre dataset SFT (compt\'e en tokens apr\`es empaquetage en s\'equences de 2\,048 tokens), une \'epoque correspond \`a environ 45\,383 pas avec un batch de 32. Le SFT-30K (0,66 \'epoque) n'a pas encore fait un passage complet du dataset, mais atteint d\'ej\`a des performances optimales. Le compute suppl\'ementaire du SFT-100K (3,3$\times$ plus) est donc largement gaspill\'e.
|
| 601 |
+
|
| 602 |
+
\subsection{Importance du Checkpoint de Base}
|
| 603 |
+
|
| 604 |
+
La comparaison entre les diff\'erentes variantes fine-tun\'ees r\'ev\`ele un paradoxe apparent :
|
| 605 |
+
|
| 606 |
+
\begin{itemize}[leftmargin=*]
|
| 607 |
+
\item \textbf{Instruct v0.1} (base 10B tokens, 5\,500 pas SFT, 185K exemples) : HellaSwag = 42,7\,\%
|
| 608 |
+
\item \textbf{SFT-30K} (base 39B tokens, 30\,000 pas SFT, 2,47M exemples) : HellaSwag = 41,7\,\%
|
| 609 |
+
\end{itemize}
|
| 610 |
+
|
| 611 |
+
Le mod\`ele fine-tun\'e \`a partir d'une base plus faible (10B tokens) obtient un HellaSwag post-SFT sup\'erieur (+1,0 point) \`a celui fine-tun\'e \`a partir de la base plus forte (39B tokens). Plusieurs facteurs peuvent expliquer ce r\'esultat :
|
| 612 |
+
|
| 613 |
+
\begin{enumerate}[leftmargin=*]
|
| 614 |
+
\item \textbf{Datasets SFT diff\'erents} : Instruct v0.1 utilise 185K exemples (probablement de meilleure qualit\'e unitaire), tandis que SFT-30K utilise 2,47M exemples (plus de diversit\'e mais potentiellement plus de bruit). La qualit\'e des exemples SFT a un impact direct sur la d\'egradation des benchmarks.
|
| 615 |
+
\item \textbf{Nombre de pas SFT diff\'erent} : 5\,500 pas repr\'esentent une exposition beaucoup plus l\'eg\`ere au SFT que 30\,000 pas, ce qui pr\'eserve davantage les capacit\'es du mod\`ele de base. Avec moins de pas, le mod\`ele \og oublie \fg{} moins ses capacit\'es de compl\'etion de texte.
|
| 616 |
+
\item \textbf{Surface de loss diff\'erente} : Le mod\`ele \`a 10B tokens se trouve dans un r\'egime d'entra\^inement diff\'erent (loss 3,20 vs 2,33), ce qui peut influencer la mani\`ere dont le SFT modifie les poids --- un mod\`ele avec une loss plus \'elev\'ee pourrait \^etre plus \og mall\'eable \fg{} au SFT.
|
| 617 |
+
\end{enumerate}
|
| 618 |
+
|
| 619 |
+
Ce r\'esultat souligne que la qualit\'e post-SFT n'est pas une simple fonction du checkpoint de base : la combinaison checkpoint de base, dataset SFT et dur\'ee de SFT forme un espace d'hyperparam\`etres \`a trois dimensions qu'il convient d'optimiser conjointement.
|
| 620 |
+
|
| 621 |
+
\subsection{Recommandations Pratiques}
|
| 622 |
+
|
| 623 |
+
\`A partir de l'ensemble de nos observations, nous formulons les recommandations suivantes pour le fine-tuning de petits mod\`eles de langage (moins d'1B de param\`etres) :
|
| 624 |
+
|
| 625 |
+
\begin{enumerate}[leftmargin=*]
|
| 626 |
+
\item \textbf{Limiter le SFT \`a moins d'1 \'epoque} : Pour un dataset de l'ordre du million d'exemples, 0,5--0,7 \'epoque semble optimal. Au-del\`a, le risque d'overfitting augmente sans b\'en\'efice mesurable sur les benchmarks.
|
| 627 |
+
\item \textbf{Surveiller WinoGrande et ARC-Challenge} : Ces deux benchmarks sont les premiers \`a montrer des signes d'overfitting lors du SFT. Une d\'egradation de ces m\'etriques est un signal d'arr\^et plus fiable que la loss d'entra\^inement.
|
| 628 |
+
\item \textbf{Ne pas se fier \`a la loss pour le SFT} : Contrairement au pr\'e-entra\^inement o\`u la loss est un indicateur fiable de la qualit\'e du mod\`ele, la loss SFT mesure principalement la conformit\'e au format, pas la qualit\'e du raisonnement.
|
| 629 |
+
\item \textbf{Privil\'egier la diversit\'e plut\^ot que le volume} : Un dataset SFT de haute qualit\'e avec des exemples diversifi\'es est pr\'ef\'erable \`a un large dataset bruit\'e entra\^in\'e sur plusieurs \'epoques.
|
| 630 |
+
\item \textbf{Investir dans le pr\'e-entra\^inement} : La progression de 45,8\,\% \`a 53,5\,\% sur HellaSwag montre que le pr\'e-entra\^inement suppl\'ementaire apporte des gains bien plus importants que l'augmentation du SFT.
|
| 631 |
+
\end{enumerate}
|
| 632 |
+
|
| 633 |
+
% ============================================================================
|
| 634 |
+
% 10. Analyse
|
| 635 |
+
% ============================================================================
|
| 636 |
+
\section{Analyse}
|
| 637 |
+
|
| 638 |
+
\subsection{Efficacit\'e d'Entra\^inement}
|
| 639 |
+
|
| 640 |
+
La performance \'elev\'ee de Julian-600M sur HellaSwag malgr\'e un volume limit\'e de donn\'ees d'entra\^inement sugg\`ere que notre architecture et notre proc\'edure d'entra\^inement sont hautement efficaces. Nous \'emettons l'hypoth\`ese de plusieurs facteurs contributifs :
|
| 641 |
+
|
| 642 |
+
\begin{enumerate}[leftmargin=*]
|
| 643 |
+
\item \textbf{Architecture moderne} : La combinaison de RoPE, SwiGLU et RMSNorm (comme dans LLaMA) fournit de meilleurs biais inductifs que les architectures utilis\'ees dans GPT-2 et OPT (embeddings positionnels appris, FFN standard, LayerNorm). Les biais inductifs sont les hypoth\`eses implicites de l'architecture sur la structure des donn\'ees ; de meilleurs biais inductifs permettent au mod\`ele d'apprendre plus efficacement \`a partir de moins de donn\'ees.
|
| 644 |
+
\item \textbf{Qualit\'e des donn\'ees} : FineWeb-Edu et Wikip\'edia fournissent des donn\'ees d'entra\^inement de haute qualit\'e et factuelles, offrant potentiellement plus d'\og apprentissage par token \fg{} que des crawls web plus bruit\'es comme ceux utilis\'es pour entra\^iner OPT.
|
| 645 |
+
\item \textbf{Entra\^inement bilingue} : L'exposition au fran\c{c}ais et \`a l'anglais peut fournir des b\'en\'efices de transfert cross-lingue, particuli\`erement pour les t\^aches de raisonnement par sens commun o\`u la structure logique transcende les fronti\`eres linguistiques.
|
| 646 |
+
\end{enumerate}
|
| 647 |
+
|
| 648 |
+
\begin{figure}[t]
|
| 649 |
+
\centering
|
| 650 |
+
\begin{tikzpicture}
|
| 651 |
+
\begin{axis}[
|
| 652 |
+
width=0.92\textwidth,
|
| 653 |
+
height=7cm,
|
| 654 |
+
xlabel={Tokens d'entra\^inement},
|
| 655 |
+
ylabel={HellaSwag (acc\_norm, \%)},
|
| 656 |
+
xmode=log,
|
| 657 |
+
xmin=2e10, xmax=5e11,
|
| 658 |
+
ymin=25, ymax=58,
|
| 659 |
+
grid=both,
|
| 660 |
+
grid style={gray!15},
|
| 661 |
+
legend style={at={(0.97,0.97)}, anchor=north east, font=\small},
|
| 662 |
+
xtick={5e10, 1e11, 3e11},
|
| 663 |
+
xticklabels={50B, 100B, 300B},
|
| 664 |
+
]
|
| 665 |
+
\addplot[only marks, mark=*, mark size=2.5pt, gray!60] coordinates {
|
| 666 |
+
(3e11, 29.2)
|
| 667 |
+
(1e11, 31.5)
|
| 668 |
+
(3e11, 32.0)
|
| 669 |
+
(3e11, 33.3)
|
| 670 |
+
(3e11, 37.6)
|
| 671 |
+
(3e11, 41.5)
|
| 672 |
+
(1e11, 50.9)
|
| 673 |
+
};
|
| 674 |
+
\addplot[only marks, mark=*, mark size=3.5pt, black, fill=black!70] coordinates {
|
| 675 |
+
(3.9e10, 53.5)
|
| 676 |
+
};
|
| 677 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 29.2) {OPT-125M};
|
| 678 |
+
\node[font=\tiny, anchor=south east] at (axis cs:9.5e10, 31.5) {GPT-2 Small};
|
| 679 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 32.0) {OPT-350M};
|
| 680 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 33.3) {Pythia-410M};
|
| 681 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 37.6) {Pythia-1B};
|
| 682 |
+
\node[font=\tiny, anchor=south west] at (axis cs:3.15e11, 41.5) {OPT-1.3B};
|
| 683 |
+
\node[font=\tiny, anchor=south east] at (axis cs:9.5e10, 50.9) {GPT-2 XL};
|
| 684 |
+
\node[font=\scriptsize, anchor=south west] at (axis cs:4.2e10, 53.5) {\textbf{Julian-600M}};
|
| 685 |
+
\legend{Autres mod\`eles, Julian (le n\^otre)}
|
| 686 |
+
\end{axis}
|
| 687 |
+
\end{tikzpicture}
|
| 688 |
+
\caption{Efficacit\'e en tokens : pr\'ecision HellaSwag en fonction du volume de donn\'ees d'entra\^inement. Julian-600M (en haut \`a gauche, 39B tokens) atteint le score HellaSwag le plus \'elev\'e avec 7,7$\times$ moins de donn\'ees que les mod\`eles OPT et Pythia (300B tokens). Le point noir met en \'evidence la position de Julian dans la r\'egion haute pr\'ecision / peu de donn\'ees.}
|
| 689 |
+
\label{fig:token_efficiency}
|
| 690 |
+
\end{figure}
|
| 691 |
+
|
| 692 |
+
\subsection{L'Anomalie HellaSwag}
|
| 693 |
+
|
| 694 |
+
Le score HellaSwag de 53,5\,\% pour Julian-600M est remarquablement \'elev\'e --- sup\'erieur m\^eme \`a GPT-2~XL (50,9\,\%) qui poss\`ede 2,5$\times$ plus de param\`etres. Plusieurs hypoth\`eses m\'eritent investigation :
|
| 695 |
+
|
| 696 |
+
\begin{itemize}[leftmargin=*]
|
| 697 |
+
\item \textbf{Hypoth\`ese architecturale} : Les composants modernes (RoPE, SwiGLU, RMSNorm) peuvent \^etre particuli\`erement avantageux pour les t\^aches de compl\'etion de texte mesur\'ees par HellaSwag. La normalisation par la longueur (acc\_norm) pourrait favoriser notre architecture.
|
| 698 |
+
\item \textbf{Hypoth\`ese de sur-sp\'ecialisation} : Il est possible que le mod\`ele ait d\'evelopp\'e une sp\'ecialisation particuli\`ere pour ce type de t\^ache, au d\'etriment d'autres capacit\'es --- ce que sugg\`erent les scores plus modestes sur LAMBADA.
|
| 699 |
+
\item \textbf{Hypoth\`ese de contamination} : Bien que nous ayons appliqu\'e une d\'eduplication rigoureuse \citep{lee2022deduplicating}, nous ne pouvons pas exclure compl\`etement une contamination partielle avec des donn\'ees proches du benchmark, en particulier via FineWeb-Edu qui contient du contenu \'educatif potentiellement li\'e aux sc\'enarios de sens commun test\'es par HellaSwag.
|
| 700 |
+
\end{itemize}
|
| 701 |
+
|
| 702 |
+
\subsection{Saturation du SFT}
|
| 703 |
+
|
| 704 |
+
Comme discut\'e en d\'etail autour du tableau~\ref{tab:sft_delta}, la comparaison SFT-30K vs SFT-100K r\'ev\`ele une saturation claire : l'entra\^inement suppl\'ementaire au-del\`a de 30K pas n'apporte qu'une am\'elioration n\'egligeable et commence \`a d\'egrader certains benchmarks. Pour un dataset de 2,47M d'exemples, une seule \'epoque ($\sim$45K pas) est suffisante, et les \'epoques multiples m\`enent \`a l'overfitting. Ce r\'esultat est coh\'erent avec les observations de la litt\'erature sur les mod\`eles de petite taille, dont la capacit\'e d'absorption est limit\'ee.
|
| 705 |
+
|
| 706 |
+
% ============================================================================
|
| 707 |
+
% 11. Limites
|
| 708 |
+
% ============================================================================
|
| 709 |
+
\section{Limites}
|
| 710 |
+
|
| 711 |
+
\begin{itemize}[leftmargin=*]
|
| 712 |
+
\item \textbf{Taille du mod\`ele} : \`A 600M de param\`etres, Julian a des capacit\'es de raisonnement et une pr\'ecision factuelle limit\'ees par rapport aux mod\`eles plus grands. Les t\^aches n\'ecessitant une cha\^ine de raisonnement longue ou des connaissances factuelles pr\'ecises restent hors de port\'ee.
|
| 713 |
+
\item \textbf{Volume de donn\'ees d'entra\^inement} : Bien que 39B tokens d\'epasse d\'ej\`a le ratio Chinchilla-optimal pour 600M de param\`etres ($\sim$20 tokens par param\`etre), ce volume reste tr\`es inf\'erieur aux 300B tokens des mod\`eles de comparaison, et la progression encore soutenue de la loss et de HellaSwag sugg\`ere que le mod\`ele pourrait b\'en\'eficier d'un entra\^inement continu sur davantage de donn\'ees. LAMBADA en particulier b\'en\'eficierait d'un corpus plus large et plus diversifi\'e.
|
| 714 |
+
\item \textbf{\'Evaluation anglo-centr\'ee} : Tous les benchmarks utilis\'es sont en anglais. Nous manquons de benchmarks d'\'evaluation standardis\'es en fran\c{c}ais pour les mod\`eles de langage de cette taille ; des efforts comme FrenchBench pourraient combler cette lacune.
|
| 715 |
+
\item \textbf{Hallucination} : Comme tous les mod\`eles de langage, Julian g\'en\`ere fr\'equemment des informations incorrectes ou fabriqu\'ees, particuli\`erement pour les requ\^etes factuelles. Ce probl\`eme est exacerb\'e par la petite taille du mod\`ele.
|
| 716 |
+
\item \textbf{Suivi d'instructions basique} : Le SFT sans apprentissage par renforcement \`a partir de feedback humain (RLHF) \citep{christiano2017deep, ouyang2022training} ou optimisation directe des pr\'ef\'erences (DPO) \citep{rafailov2023direct} produit des capacit\'es de suivi d'instructions significativement plus faibles que les mod\`eles entra\^in\'es avec RLHF. Le mod\`ele peut mal interpr\'eter des instructions complexes ou multi-\'etapes.
|
| 717 |
+
\item \textbf{Sous-performance LAMBADA} : La pr\'ecision relativement basse sur LAMBADA (37,3\,\% vs.\ 50,5\,\% pour Pythia-410M) indique que les capacit\'es de pr\'ediction de texte g\'en\'eral sont en retrait par rapport \`a la performance forte en raisonnement par sens commun.
|
| 718 |
+
\item \textbf{Reproductibilit\'e} : L'utilisation de TPU pr\'eemptibles implique que l'entra\^inement a \'et\'e interrompu et repris plusieurs fois. Bien que nous sauvegardions l'\'etat complet de l'optimiseur, ces interruptions introduisent une variabilit\'e non contr\^ol\'ee.
|
| 719 |
+
\end{itemize}
|
| 720 |
+
|
| 721 |
+
% ============================================================================
|
| 722 |
+
% 12. Conclusion
|
| 723 |
+
% ============================================================================
|
| 724 |
+
\section{Conclusion}
|
| 725 |
+
|
| 726 |
+
Nous avons pr\'esent\'e Julian, une famille de mod\`eles de langage bilingues entra\^in\'es \`a partir de z\'ero sur infrastructure TPU en utilisant JAX/Flax. Notre mod\`ele phare, Julian-600M, atteint une efficacit\'e remarquable sur HellaSwag (53,5\,\%), surpassant des mod\`eles poss\'edant 2$\times$ plus de param\`etres et entra\^in\'es sur 8$\times$ plus de donn\'ees. Nous avons document\'e l'int\'egralit\'e du pipeline d'entra\^inement, de la collecte des donn\'ees et de l'entra\^inement du tokenizer jusqu'au pr\'e-entra\^inement, au fine-tuning supervis\'e et \`a l'\'evaluation.
|
| 727 |
+
|
| 728 |
+
Notre analyse d\'etaill\'ee du SFT r\'ev\`ele des enseignements pratiques importants : (1) le SFT d\'egrade in\'evitablement les benchmarks de compl\'etion de texte comme HellaSwag, mais pr\'eserve les capacit\'es de raisonnement ; (2) au-del\`a d'une \'epoque, le SFT suppl\'ementaire apporte des rendements d\'ecroissants voire n\'egatifs ; (3) la loss d'entra\^inement n'est pas un indicateur fiable de la qualit\'e d'un mod\`ele fine-tun\'e.
|
| 729 |
+
|
| 730 |
+
\paragraph{Travaux futurs.} Nous pr\'evoyons de : (1) \'etendre Julian \`a 2B de param\`etres en utilisant des configurations TPU plus larges (v6e-64) ; (2) impl\'ementer le DPO pour un meilleur suivi d'instructions ; (3) d\'evelopper des benchmarks d'\'evaluation en fran\c{c}ais ; et (4) explorer l'entra\^inement continu sur des datasets plus larges pour am\'eliorer LAMBADA et la pr\'ediction de texte g\'en\'eral.
|
| 731 |
+
|
| 732 |
+
\paragraph{Publication ouverte.} Tous les poids des mod\`eles sont disponibles \`a l'adresse \url{https://huggingface.co/JulianKrgd} sous licence Apache 2.0.
|
| 733 |
+
|
| 734 |
+
% ============================================================================
|
| 735 |
+
% Remerciements
|
| 736 |
+
% ============================================================================
|
| 737 |
+
\section*{Remerciements}
|
| 738 |
+
|
| 739 |
+
Ce travail a \'et\'e rendu possible gr\^ace au programme \emph{TPU Research Cloud} (TRC) de Google, qui a fourni un acc\`es gratuit \`a des pods TPU v4-32. Nous remercions l'\'equipe TRC pour son soutien.
|
| 740 |
+
|
| 741 |
+
% ============================================================================
|
| 742 |
+
% R\'ef\'erences
|
| 743 |
+
% ============================================================================
|
| 744 |
+
\bibliographystyle{plainnat}
|
| 745 |
+
|
| 746 |
+
\begin{thebibliography}{36}
|
| 747 |
+
|
| 748 |
+
\bibitem[Biderman et~al.(2023)]{biderman2023pythia}
|
| 749 |
+
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, et Oskar van~der Wal.
|
| 750 |
+
\newblock Pythia: A suite for analyzing large language models across training and scaling.
|
| 751 |
+
\newblock In \emph{ICML}, 2023.
|
| 752 |
+
\newblock \url{https://arxiv.org/abs/2304.01373}
|
| 753 |
+
|
| 754 |
+
\bibitem[Bradbury et~al.(2018)]{bradbury2018jax}
|
| 755 |
+
James Bradbury, Roy Frostig, Peter Hawkins, Matthew~James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander{P}las, Skye Wanderman-{M}ilne, et Qiao Zhang.
|
| 756 |
+
\newblock {JAX}: Composable transformations of {Python}+{NumPy} programs.
|
| 757 |
+
\newblock 2018.
|
| 758 |
+
\newblock \url{https://github.com/jax-ml/jax}
|
| 759 |
+
|
| 760 |
+
\bibitem[Bisk et~al.(2020)]{bisk2020piqa}
|
| 761 |
+
Yonatan Bisk, Rowan Zellers, Ronan Le~Bras, Jianfeng Gao, et Yejin Choi.
|
| 762 |
+
\newblock {PIQA}: Reasoning about physical commonsense in natural language.
|
| 763 |
+
\newblock In \emph{AAAI}, 2020.
|
| 764 |
+
\newblock \url{https://arxiv.org/abs/1911.11641}
|
| 765 |
+
|
| 766 |
+
\bibitem[Brown et~al.(2020)]{brown2020language}
|
| 767 |
+
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared~D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et~al.
|
| 768 |
+
\newblock Language models are few-shot learners.
|
| 769 |
+
\newblock In \emph{NeurIPS}, 2020.
|
| 770 |
+
\newblock \url{https://arxiv.org/abs/2005.14165}
|
| 771 |
+
|
| 772 |
+
\bibitem[Chowdhery et~al.(2023)]{chowdhery2023palm}
|
| 773 |
+
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung~Won Chung, Charles Sutton, Sebastian Gehrmann, et~al.
|
| 774 |
+
\newblock {PaLM}: Scaling language modeling with {P}athways.
|
| 775 |
+
\newblock \emph{JMLR}, 2023.
|
| 776 |
+
\newblock \url{https://arxiv.org/abs/2204.02311}
|
| 777 |
+
|
| 778 |
+
\bibitem[Clark et~al.(2018)]{clark2018think}
|
| 779 |
+
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, et Oyvind Tafjord.
|
| 780 |
+
\newblock Think you have solved question answering? {T}ry {ARC}, the {AI2} reasoning challenge.
|
| 781 |
+
\newblock \emph{arXiv preprint arXiv:1803.05457}, 2018.
|
| 782 |
+
\newblock \url{https://arxiv.org/abs/1803.05457}
|
| 783 |
+
|
| 784 |
+
\bibitem[Clark et~al.(2019)]{clark2019boolq}
|
| 785 |
+
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, et Kristina Toutanova.
|
| 786 |
+
\newblock {BoolQ}: Exploring the surprising difficulty of natural yes/no questions.
|
| 787 |
+
\newblock In \emph{NAACL}, 2019.
|
| 788 |
+
\newblock \url{https://arxiv.org/abs/1905.10044}
|
| 789 |
+
|
| 790 |
+
\bibitem[Christiano et~al.(2017)]{christiano2017deep}
|
| 791 |
+
Paul~F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, et Dario Amodei.
|
| 792 |
+
\newblock Deep reinforcement learning from human preferences.
|
| 793 |
+
\newblock In \emph{NeurIPS}, 2017.
|
| 794 |
+
\newblock \url{https://arxiv.org/abs/1706.03741}
|
| 795 |
+
|
| 796 |
+
\bibitem[Conneau et~al.(2020)]{conneau2020xlmr}
|
| 797 |
+
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm{\'a}n, Edouard Grave, Myle Ott, Luke Zettlemoyer, et Veselin Stoyanov.
|
| 798 |
+
\newblock Unsupervised cross-lingual representation learning at scale.
|
| 799 |
+
\newblock In \emph{ACL}, 2020.
|
| 800 |
+
\newblock \url{https://arxiv.org/abs/1911.02116}
|
| 801 |
+
|
| 802 |
+
\bibitem[Devlin et~al.(2019)]{devlin2019bert}
|
| 803 |
+
Jacob Devlin, Ming-Wei Chang, Kenton Lee, et Kristina Toutanova.
|
| 804 |
+
\newblock {BERT}: Pre-training of deep bidirectional transformers for language understanding.
|
| 805 |
+
\newblock In \emph{NAACL}, 2019.
|
| 806 |
+
\newblock \url{https://arxiv.org/abs/1810.04805}
|
| 807 |
+
|
| 808 |
+
\bibitem[Gao et~al.(2023)]{gao2023framework}
|
| 809 |
+
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le~Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, et Andy Zou.
|
| 810 |
+
\newblock A framework for few-shot language model evaluation.
|
| 811 |
+
\newblock \emph{Zenodo}, 2023.
|
| 812 |
+
\newblock \url{https://zenodo.org/records/10256836}
|
| 813 |
+
|
| 814 |
+
\bibitem[Hoffmann et~al.(2022)]{hoffmann2022training}
|
| 815 |
+
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de~Las~Casas, Lisa~Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van~den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack~W. Rae, Oriol Vinyals, et Laurent Sifre.
|
| 816 |
+
\newblock Training compute-optimal large language models.
|
| 817 |
+
\newblock In \emph{NeurIPS}, 2022.
|
| 818 |
+
\newblock \url{https://arxiv.org/abs/2203.15556}
|
| 819 |
+
|
| 820 |
+
\bibitem[Kaplan et~al.(2020)]{kaplan2020scaling}
|
| 821 |
+
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom~B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, et Ilya Sutskever.
|
| 822 |
+
\newblock Scaling laws for neural language models.
|
| 823 |
+
\newblock \emph{arXiv preprint arXiv:2001.08361}, 2020.
|
| 824 |
+
\newblock \url{https://arxiv.org/abs/2001.08361}
|
| 825 |
+
|
| 826 |
+
\bibitem[Lee et~al.(2022)]{lee2022deduplicating}
|
| 827 |
+
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, et Nicholas Carlini.
|
| 828 |
+
\newblock Deduplicating training data makes language models better.
|
| 829 |
+
\newblock In \emph{ACL}, 2022.
|
| 830 |
+
\newblock \url{https://arxiv.org/abs/2107.06499}
|
| 831 |
+
|
| 832 |
+
\bibitem[Kudo et Richardson(2018)]{kudo2018sentencepiece}
|
| 833 |
+
Taku Kudo et John Richardson.
|
| 834 |
+
\newblock {SentencePiece}: A simple and language independent subword tokenizer and detokenizer for neural text processing.
|
| 835 |
+
\newblock In \emph{EMNLP (demo)}, 2018.
|
| 836 |
+
\newblock \url{https://arxiv.org/abs/1808.06226}
|
| 837 |
+
|
| 838 |
+
\bibitem[Liu et~al.(2024)]{liu2024mobilellm}
|
| 839 |
+
Zechun Liu, Changlin Li, Barlas O\u{g}uz, et~al.
|
| 840 |
+
\newblock {MobileLLM}: Optimizing sub-billion parameter language models for on-device use cases.
|
| 841 |
+
\newblock In \emph{ICML}, 2024.
|
| 842 |
+
\newblock \url{https://arxiv.org/abs/2402.14905}
|
| 843 |
+
|
| 844 |
+
\bibitem[Micikevicius et~al.(2018)]{micikevicius2018mixed}
|
| 845 |
+
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et Hao Wu.
|
| 846 |
+
\newblock Mixed precision training.
|
| 847 |
+
\newblock In \emph{ICLR}, 2018.
|
| 848 |
+
\newblock \url{https://arxiv.org/abs/1710.03740}
|
| 849 |
+
|
| 850 |
+
\bibitem[Loshchilov et Hutter(2019)]{loshchilov2019decoupled}
|
| 851 |
+
Ilya Loshchilov et Frank Hutter.
|
| 852 |
+
\newblock Decoupled weight decay regularization.
|
| 853 |
+
\newblock In \emph{ICLR}, 2019.
|
| 854 |
+
\newblock \url{https://arxiv.org/abs/1711.05101}
|
| 855 |
+
|
| 856 |
+
\bibitem[OpenAI(2023)]{openai2023chatml}
|
| 857 |
+
OpenAI.
|
| 858 |
+
\newblock {ChatML}: Chat markup language.
|
| 859 |
+
\newblock Documentation technique, 2023.
|
| 860 |
+
\newblock \url{https://github.com/openai/openai-python/blob/v0.28.1/chatml.md}
|
| 861 |
+
|
| 862 |
+
\bibitem[Ouyang et~al.(2022)]{ouyang2022training}
|
| 863 |
+
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et~al.
|
| 864 |
+
\newblock Training language models to follow instructions with human feedback.
|
| 865 |
+
\newblock In \emph{NeurIPS}, 2022.
|
| 866 |
+
\newblock \url{https://arxiv.org/abs/2203.02155}
|
| 867 |
+
|
| 868 |
+
\bibitem[Paperno et~al.(2016)]{paperno2016lambada}
|
| 869 |
+
Denis Paperno, Germ{\'a}n Kruszewski, Angeliki Lazaridou, Quan~Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, et Raquel Fern{\'a}ndez.
|
| 870 |
+
\newblock The {LAMBADA} dataset: Word prediction requiring a broad discourse context.
|
| 871 |
+
\newblock In \emph{ACL}, 2016.
|
| 872 |
+
\newblock \url{https://arxiv.org/abs/1606.06031}
|
| 873 |
+
|
| 874 |
+
\bibitem[Rafailov et~al.(2023)]{rafailov2023direct}
|
| 875 |
+
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher~D. Manning, et Chelsea Finn.
|
| 876 |
+
\newblock Direct preference optimization: Your language model is secretly a reward model.
|
| 877 |
+
\newblock In \emph{NeurIPS}, 2023.
|
| 878 |
+
\newblock \url{https://arxiv.org/abs/2305.18290}
|
| 879 |
+
|
| 880 |
+
\bibitem[Radford et~al.(2019)]{radford2019language}
|
| 881 |
+
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, et Ilya Sutskever.
|
| 882 |
+
\newblock Language models are unsupervised multitask learners.
|
| 883 |
+
\newblock \emph{OpenAI blog}, 2019.
|
| 884 |
+
\newblock \url{https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf}
|
| 885 |
+
|
| 886 |
+
\bibitem[Sakaguchi et~al.(2020)]{sakaguchi2020winogrande}
|
| 887 |
+
Keisuke Sakaguchi, Ronan Le~Bras, Chandra Bhagavatula, et Yejin Choi.
|
| 888 |
+
\newblock {WinoGrande}: An adversarial winograd schema challenge at scale.
|
| 889 |
+
\newblock In \emph{AAAI}, 2020.
|
| 890 |
+
\newblock \url{https://arxiv.org/abs/1907.10641}
|
| 891 |
+
|
| 892 |
+
\bibitem[Shazeer(2020)]{shazeer2020glu}
|
| 893 |
+
Noam Shazeer.
|
| 894 |
+
\newblock {GLU} variants improve transformer.
|
| 895 |
+
\newblock \emph{arXiv preprint arXiv:2002.05202}, 2020.
|
| 896 |
+
\newblock \url{https://arxiv.org/abs/2002.05202}
|
| 897 |
+
|
| 898 |
+
\bibitem[Su et~al.(2021)]{su2021roformer}
|
| 899 |
+
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, et Yunfeng Liu.
|
| 900 |
+
\newblock {RoFormer}: Enhanced transformer with rotary position embedding.
|
| 901 |
+
\newblock \emph{arXiv preprint arXiv:2104.09864}, 2021.
|
| 902 |
+
\newblock \url{https://arxiv.org/abs/2104.09864}
|
| 903 |
+
|
| 904 |
+
\bibitem[Touvron et~al.(2023)]{touvron2023llama}
|
| 905 |
+
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth{\'e}e Lacroix, Baptiste Rozi{\`e}re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, et Guillaume Lample.
|
| 906 |
+
\newblock {LLaMA}: Open and efficient foundation language models.
|
| 907 |
+
\newblock \emph{arXiv preprint arXiv:2302.13971}, 2023.
|
| 908 |
+
\newblock \url{https://arxiv.org/abs/2302.13971}
|
| 909 |
+
|
| 910 |
+
\bibitem[Vaswani et~al.(2017)]{vaswani2017attention}
|
| 911 |
+
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan~N Gomez, {\L}ukasz Kaiser, et Illia Polosukhin.
|
| 912 |
+
\newblock Attention is all you need.
|
| 913 |
+
\newblock In \emph{NeurIPS}, 2017.
|
| 914 |
+
\newblock \url{https://arxiv.org/abs/1706.03762}
|
| 915 |
+
|
| 916 |
+
\bibitem[Workshop et~al.(2023)]{workshop2023bloom}
|
| 917 |
+
BigScience Workshop, Teven Le~Scao, Angela Fan, et~al.
|
| 918 |
+
\newblock {BLOOM}: A 176B-parameter open-access multilingual language model.
|
| 919 |
+
\newblock \emph{arXiv preprint arXiv:2211.05100}, 2023.
|
| 920 |
+
\newblock \url{https://arxiv.org/abs/2211.05100}
|
| 921 |
+
|
| 922 |
+
\bibitem[Zellers et~al.(2019)]{zellers2019hellaswag}
|
| 923 |
+
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, et Yejin Choi.
|
| 924 |
+
\newblock {HellaSwag}: Can a machine really finish your sentence?
|
| 925 |
+
\newblock In \emph{ACL}, 2019.
|
| 926 |
+
\newblock \url{https://arxiv.org/abs/1905.07830}
|
| 927 |
+
|
| 928 |
+
\bibitem[Zhang et~al.(2022)]{zhang2022opt}
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi~Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit~Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer.
\newblock {OPT}: Open pre-trained transformer language models.
\newblock \emph{arXiv preprint arXiv:2205.01068}, 2022.
\newblock \url{https://arxiv.org/abs/2205.01068}

\bibitem[Zhang and Sennrich(2019)]{zhang2019root}
Biao Zhang and Rico Sennrich.
\newblock Root mean square layer normalization.
\newblock In \emph{NeurIPS}, 2019.
\newblock \url{https://arxiv.org/abs/1910.07467}

\bibitem[Xu et~al.(2021)]{xu2021gspmd}
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen.
\newblock {GSPMD}: General and scalable parallelization for {ML} computation graphs.
\newblock \emph{arXiv preprint arXiv:2105.04663}, 2021.
\newblock \url{https://arxiv.org/abs/2105.04663}

\bibitem[Zhang et~al.(2024)]{zhang2024tinyllama}
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu.
\newblock {TinyLlama}: An open-source small language model.
\newblock \emph{arXiv preprint arXiv:2401.02385}, 2024.
\newblock \url{https://arxiv.org/abs/2401.02385}

\end{thebibliography}

% ============================================================================
% Appendices
% ============================================================================
\appendix
\section{Complete Hyperparameter Tables}
\label{app:hyperparams}

\begin{table}[H]
\centering
\caption{Complete pre-training configuration for Julian-600M.}
\begin{tabular}{lc}
\toprule
\textbf{Category} & \textbf{Value} \\
\midrule
\multicolumn{2}{l}{\textit{Model}} \\
Parameters & $\sim$600M \\
Hidden dimension & 1280 \\
Layers & 18 \\
Attention heads & 20 \\
Head dimension & 64 \\
FFN dimension & 5120 \\
Activation & SwiGLU (SiLU gate) \\
Normalization & RMSNorm ($\epsilon = 10^{-6}$) \\
Positional encoding & RoPE ($\theta = 10\,000$) \\
Vocabulary & 50\,000 (SentencePiece BPE) \\
Context length & 2\,048 \\
Dropout & 0.1 \\
\midrule
\multicolumn{2}{l}{\textit{Optimization}} \\
Optimizer & AdamW \\
$\beta_1, \beta_2$ & 0.9, 0.95 \\
$\epsilon$ & $10^{-8}$ \\
Weight decay & 0.1 \\
Max learning rate & $1.2 \times 10^{-3}$ \\
Min learning rate & $1.2 \times 10^{-4}$ \\
LR schedule & Cosine with linear warmup \\
Warmup steps & 3\,000 \\
Total steps & 300\,000 \\
Gradient clipping & 1.0 (global norm) \\
Optimizer state precision & bfloat16 \\
\midrule
\multicolumn{2}{l}{\textit{Compute}} \\
Hardware & TPU v4-32 (32 chips, 4 hosts) \\
Per-device batch size & 4 \\
Gradient accumulation & 8 \\
Effective batch size & 1\,024 \\
Precision & Mixed bfloat16 \\
Tokens per step & $\sim$2.1M \\
Total tokens & $\sim$39B \\
Checkpointing & Asynchronous Orbax, every 10K steps \\
\bottomrule
\end{tabular}
\end{table}
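
As a consistency check, the effective batch size follows directly from the compute configuration: $4$ sequences per device $\times$ $8$ gradient-accumulation steps $\times$ $32$ chips $= 1\,024$ sequences per optimizer step, which at a context length of $2\,048$ tokens gives $1\,024 \times 2\,048 \approx 2.1$M tokens per step. The sketch below illustrates how the optimizer and learning-rate schedule from the table could be expressed with Optax; it is an illustration of the configuration only, not the training code used for Julian, and the \texttt{mu\_dtype} setting is only an approximation of the bfloat16 optimizer-state precision reported above.

\begin{verbatim}
# Illustrative Optax sketch of the Julian-600M optimizer configuration
# (not the actual training code).
import jax.numpy as jnp
import optax

# Linear warmup to 1.2e-3 over 3,000 steps, then cosine decay to 1.2e-4
# over the 300,000-step budget.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1.2e-3,
    warmup_steps=3_000,
    decay_steps=300_000,
    end_value=1.2e-4,
)

# AdamW (beta1=0.9, beta2=0.95, eps=1e-8, weight decay 0.1) with
# global-norm gradient clipping at 1.0.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(
        learning_rate=schedule,
        b1=0.9,
        b2=0.95,
        eps=1e-8,
        weight_decay=0.1,
        mu_dtype=jnp.bfloat16,  # low-precision first moment (approximation)
    ),
)

# 8-step gradient accumulation, matching the table above.
optimizer = optax.MultiSteps(optimizer, every_k_schedule=8)
\end{verbatim}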

\section{Model Availability}
\label{app:availability}

All Julian models are available on the HuggingFace Hub:

\begin{table}[H]
\centering
\caption{HuggingFace repositories for the Julian models.}
\begin{tabular}{ll}
\toprule
\textbf{Model} & \textbf{HuggingFace repository} \\
\midrule
Julian-600M Base & \texttt{JulianKrgd/julian-600m-40b} \\
Julian-600M-10B-Instruct-v0.1 & \texttt{JulianKrgd/julian-600m-10b-instruct-v0.1} \\
Julian-600M SFT-30K & \texttt{JulianKrgd/julian-600m-40b-instruct-sft30k} \\
Julian-600M SFT-100K & \texttt{JulianKrgd/julian-600m-40b-instruct-sft100k} \\
\bottomrule
\end{tabular}
\end{table}
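
As an illustration (assuming only the \texttt{huggingface\_hub} client; the exact checkpoint format stored in each repository is not described here), a repository can be fetched locally as follows:

\begin{verbatim}
# Illustrative sketch: download a Julian checkpoint from the HuggingFace Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="JulianKrgd/julian-600m-40b")
print(f"Checkpoint files downloaded to: {local_dir}")
\end{verbatim}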
\end{document}