
Toy GPT Training Repository

This repo is an example of a pre-trained next-token prediction model, illustrating how language models are trained and used.

  • README.md in the repository root for the home page
  • SE_MANIFEST.toml for intent, scope, and declared training corpus

py.typed (PEP 561)

Python projects may include a py.typed marker file (defined by PEP 561).

Type checkers (Pyright, Mypy) trust inline type hints when this marker is present. The file is typically empty; comments are allowed.

API Reference

Selected public APIs are provided below.

Training Math

toy_gpt_train.math_training

math_training.py - Mathematical utilities used during model training.

This module contains reusable math functions used by the training and inference code in this repository.

Scope:
  • Pure functions with no model, vocabulary, or artifact assumptions.
  • Reusable across unigram, bigram, and higher-context variants.

These functions are intentionally simple and explicit to support inspection and debugging.

argmax

argmax(values: list[float]) -> int

Return the index of the maximum value in a list.

Concept

argmax is the argument (index) at which a function reaches its maximum.

Common uses
  • Measuring accuracy during training (pick the most likely token)
  • Greedy decoding during inference (choose the top prediction)
In training and inference
  • A model outputs a probability distribution over possible next tokens.
  • The token with the highest probability is the model's most confident prediction.
  • argmax selects that token.
Example

values = [0.1, 0.7, 0.2] has indices 0, 1, 2 respectively.

argmax(values) -> 1 (since 0.7 is the largest value)

Parameters:

Name Type Description Default
values list[float]

A non-empty list of numeric values (typically logits or probabilities).

required

Returns:

Type Description
int

The index of the largest value in the list.

Raises:

Type Description
ValueError

If values is empty.

Source code in src/toy_gpt_train/math_training.py
def argmax(values: list[float]) -> int:
    """Return the index of the maximum value in a list.

    Concept:
        argmax is the argument (index) at which a function reaches its maximum.

    Common uses:
        - Measuring accuracy during training (pick the most likely token)
        - Greedy decoding during inference (choose the top prediction)

    In training and inference:
        - A model outputs a probability distribution over possible next tokens.
        - The token with the highest probability is the model's most confident prediction.
        - argmax selects that token.

    Example:
        values = [0.1, 0.7, 0.2] has indices 0, 1, 2 respectively.
        argmax(values) -> 1 (since 0.7 is the largest value)

    Args:
        values: A non-empty list of numeric values (typically logits or probabilities).

    Returns:
        The index of the largest value in the list.

    Raises:
        ValueError: If values is empty.
    """
    if not values:
        raise ValueError("argmax() requires a non-empty list")

    best_idx: int = 0
    best_val: float = values[0]

    for i in range(1, len(values)):
        v = values[i]
        if v > best_val:
            best_val = v
            best_idx = i

    return best_idx
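
The strict `>` comparison in the source above has a consequence worth seeing: when two values tie for the maximum, the earlier index wins. The sketch below repeats a compact copy of the function so that it runs on its own; the compact form is a restatement for illustration, not the repository's code.

```python
def argmax(values: list[float]) -> int:
    """Compact copy of the argmax shown above, repeated so this sketch is self-contained."""
    if not values:
        raise ValueError("argmax() requires a non-empty list")
    best_idx = 0
    for i in range(1, len(values)):
        # Strict ">" means a later equal value never replaces the current best.
        if values[i] > values[best_idx]:
            best_idx = i
    return best_idx


print(argmax([0.1, 0.7, 0.2]))  # 1
print(argmax([0.4, 0.4, 0.2]))  # 0: the first of the two tied maxima wins

try:
    argmax([])
except ValueError as exc:
    print(f"ValueError: {exc}")
```

With freshly initialized (all-zero) weights every probability ties, so a greedy prediction made before training would be token ID 0.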

cross_entropy_loss

cross_entropy_loss(
    probs: list[float], target_id: int
) -> float

Compute cross-entropy loss for a single training example.

Cross-Entropy Loss

Cross-entropy measures how well a predicted probability distribution matches the true outcome.

In next-token prediction:
  • The true distribution is "one-hot", which means we encode it as either 1 or 0:
      • Probability = 1.0 for the correct next token
      • Probability = 0.0 for all others
  • The model predicts a probability distribution over all tokens.

Cross-entropy answers the question: "How well does the predicted probability distribution align with the true outcome?"

Formula

loss = -log(p_correct)

  • If the model assigns high probability to the correct token, the loss is small.
  • If the probability is near zero, the loss is large.
Numerical safety

log(0) is undefined, so we clamp probabilities to a small minimum (1e-12). This does not change learning behavior in practice, but prevents runtime errors.

In training
  • This loss value drives gradient descent.
  • Lower loss means better predictions.

Parameters:

Name Type Description Default
probs list[float]

A probability distribution over the vocabulary (sums to 1.0).

required
target_id int

The integer ID of the correct next token.

required

Returns:

Type Description
float

A non-negative floating-point loss value. 0.0 means a perfect prediction; larger values indicate worse predictions.

Raises:

Type Description
ValueError

If target_id is out of range for probs.

Source code in src/toy_gpt_train/math_training.py
def cross_entropy_loss(probs: list[float], target_id: int) -> float:
    """Compute cross-entropy loss for a single training example.

    Concept: Cross-Entropy Loss
        Cross-entropy measures how well a predicted probability distribution
        matches the true outcome.

        In next-token prediction:
        - The true distribution is "one-hot" which means we encode it as either 1 or 0:
            - Probability = 1.0 for the correct next token
            - Probability = 0.0 for all others
        - The model predicts a probability distribution over all tokens.

        Cross-entropy answers the question:
            "How well does the predicted probability distribution align with the true outcome?"

    Formula:
        loss = -log(p_correct)

        - If the model assigns high probability to the correct token,
          the loss is small.
        - If the probability is near zero, the loss is large.

    Numerical safety:
        log(0) is undefined, so we clamp probabilities to a small minimum
        (1e-12). This does not change learning behavior in practice,
        but prevents runtime errors.

    In training:
        - This loss value drives gradient descent.
        - Lower loss means better predictions.

    Args:
        probs: A probability distribution over the vocabulary (sums to 1.0).
        target_id: The integer ID of the correct next token.

    Returns:
        A non-negative floating-point loss value.
        - 0.0 means a perfect prediction
        - Larger values indicate worse predictions
    Raises:
        ValueError: If target_id is out of range for probs.
    """
    if target_id < 0 or target_id >= len(probs):
        raise ValueError(
            f"target_id out of range: target_id={target_id} len(probs)={len(probs)}"
        )

    p: float = probs[target_id]

    # Guard against log(0): math.log(0.0) raises ValueError in Python
    p = max(p, 1e-12)

    return -math.log(p)
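
A few concrete numbers may help calibrate what "small" and "large" loss mean here. The sketch below restates the function in compact form (including the `import math` it needs) and evaluates it for a likely target, an unlikely target, and a zero-probability target that hits the 1e-12 clamp.

```python
import math


def cross_entropy_loss(probs: list[float], target_id: int) -> float:
    """Compact restatement of the function above, so this sketch runs on its own."""
    if target_id < 0 or target_id >= len(probs):
        raise ValueError(f"target_id out of range: {target_id}")
    # Clamp so that a zero probability yields a large finite loss instead of an error.
    p = max(probs[target_id], 1e-12)
    return -math.log(p)


confident = cross_entropy_loss([0.1, 0.7, 0.2], target_id=1)
unlikely = cross_entropy_loss([0.1, 0.7, 0.2], target_id=0)
clamped = cross_entropy_loss([0.0, 1.0], target_id=0)

print(f"{confident:.4f}")  # 0.3567  (-log 0.7)
print(f"{unlikely:.4f}")   # 2.3026  (-log 0.1)
print(f"{clamped:.4f}")    # 27.6310 (-log 1e-12)
```

One side effect of the clamp is that the loss for a single example cannot exceed roughly 27.63.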

Models

toy_gpt_train.c_model

c_model.py - Simple model module.

Defines a minimal next-token prediction model for unigram (no context). A unigram models P(next) - just word frequencies, ignoring all context.

Responsibilities:
  • Represent a simple parameterized model that outputs the same probability distribution regardless of input.
  • Convert scores into probabilities using softmax.
  • Provide a forward pass (no training in this module).

This model is intentionally simple:
  • one weight vector (1D: just next_token scores)
  • one forward computation that ignores input
  • no learning here

Training is handled in a different module.

SimpleNextTokenModel

A minimal next-token prediction model (unigram - no context).

Unigram ignores all context and predicts based solely on corpus word frequencies: P(next).

Source code in src/toy_gpt_train/c_model.py
class SimpleNextTokenModel:
    """A minimal next-token prediction model (unigram - no context).

    Unigram ignores all context and predicts based solely on
    corpus word frequencies: P(next).
    """

    def __init__(self, vocab_size: int) -> None:
        """Initialize the model with a given vocabulary size."""
        self.vocab_size: Final[int] = vocab_size

        # Weight matrix: 1 row x vocab_size columns
        # Unigram has only ONE row because predictions don't depend on input.
        # We store as list[list[float]] with 1 row for artifact compatibility.
        self.weights: list[list[float]] = [[0.0 for _ in range(vocab_size)]]

        LOG.info(f"Model initialized with vocabulary size {vocab_size} (unigram).")

    def forward(self, current_id: int | None = None) -> list[float]:
        """Perform a forward pass.

        Args:
            current_id: Ignored for unigram - included for API consistency.

        Returns:
            Probability distribution over next tokens (same for all inputs).
        """
        # Unigram ignores current_id - always returns the same distribution
        _ = current_id
        scores: list[float] = self.weights[0]
        return self._softmax(scores)

    @staticmethod
    def _softmax(scores: list[float]) -> list[float]:
        max_score: float = max(scores)
        exp_scores: list[float] = [math.exp(s - max_score) for s in scores]
        total: float = sum(exp_scores)
        return [s / total for s in exp_scores]
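
The `_softmax` helper subtracts the maximum score before exponentiating. A standalone sketch of the same computation (the free function name `softmax` is used here only for illustration) shows why: the shift leaves the result unchanged but keeps `math.exp` from overflowing on large scores.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Same computation as _softmax above, written as a free function for this sketch."""
    max_score = max(scores)
    # After the shift, the largest exponent is exp(0) = 1, so nothing overflows.
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


uniform = softmax([0.0, 0.0, 0.0, 0.0])     # all-zero weights, as set in __init__
large = softmax([1000.0, 1000.0])           # math.exp(1000.0) alone raises OverflowError
shifted_a = softmax([1.0, 2.0, 3.0])
shifted_b = softmax([101.0, 102.0, 103.0])  # same scores shifted by a constant

print(uniform)  # [0.25, 0.25, 0.25, 0.25]
print(large)    # [0.5, 0.5]
print(all(math.isclose(a, b) for a, b in zip(shifted_a, shifted_b)))  # True
```

The first result also shows what an untrained model predicts: with all-zero weights, every token is equally likely.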

__init__

__init__(vocab_size: int) -> None

Initialize the model with a given vocabulary size.

Source code in src/toy_gpt_train/c_model.py
def __init__(self, vocab_size: int) -> None:
    """Initialize the model with a given vocabulary size."""
    self.vocab_size: Final[int] = vocab_size

    # Weight matrix: 1 row x vocab_size columns
    # Unigram has only ONE row because predictions don't depend on input.
    # We store as list[list[float]] with 1 row for artifact compatibility.
    self.weights: list[list[float]] = [[0.0 for _ in range(vocab_size)]]

    LOG.info(f"Model initialized with vocabulary size {vocab_size} (unigram).")

forward

forward(current_id: int | None = None) -> list[float]

Perform a forward pass.

Parameters:

Name Type Description Default
current_id int | None

Ignored for unigram - included for API consistency.

None

Returns:

Type Description
list[float]

Probability distribution over next tokens (same for all inputs).

Source code in src/toy_gpt_train/c_model.py
def forward(self, current_id: int | None = None) -> list[float]:
    """Perform a forward pass.

    Args:
        current_id: Ignored for unigram - included for API consistency.

    Returns:
        Probability distribution over next tokens (same for all inputs).
    """
    # Unigram ignores current_id - always returns the same distribution
    _ = current_id
    scores: list[float] = self.weights[0]
    return self._softmax(scores)

main

main() -> None

Demonstrate a forward pass of the simple unigram model.

Source code in src/toy_gpt_train/c_model.py
def main() -> None:
    """Demonstrate a forward pass of the simple unigram model."""
    # Local imports keep modules decoupled.
    from toy_gpt_train.a_tokenizer import SimpleTokenizer
    from toy_gpt_train.b_vocab import Vocabulary

    log_header(LOG, "Simple Next-Token Model Demo (Unigram - No Context)")

    # Step 1: Tokenize input text.
    tokenizer: SimpleTokenizer = SimpleTokenizer()
    tokens: list[str] = tokenizer.get_tokens()

    if not tokens:
        LOG.info("No tokens available for demonstration.")
        return

    # Step 2: Build vocabulary.
    vocab: Vocabulary = Vocabulary(tokens)

    # Step 3: Initialize model.
    model: SimpleNextTokenModel = SimpleNextTokenModel(vocab_size=vocab.vocab_size())

    # Step 4: Forward pass (unigram ignores input).
    probs: list[float] = model.forward()

    # Step 5: Inspect results.
    LOG.info("Unigram ignores input - same predictions for any context:")
    LOG.info("Output probabilities for next token:")
    for idx, prob in enumerate(probs):
        tok: str | None = vocab.get_id_token(idx)
        LOG.info(f"  {tok!r} (ID {idx}) -> {prob:.4f}")

Training Pipeline

toy_gpt_train.d_train

d_train.py - Training loop module.

Trains the SimpleNextTokenModel on a small token corpus using unigram (no context - just word frequencies).

A unigram models P(next) - the probability of each word based purely on how often it appears in the corpus, ignoring all context.

Responsibilities:
  • Count token frequencies in the corpus
  • Train a single row of weights to predict based on frequency
  • Track loss and accuracy per epoch
  • Write a CSV log of training progress
  • Write inspectable training artifacts (vocabulary, weights, embeddings, meta)

Concepts:
  • unigram: predict next token using only corpus frequencies (no context)
  • softmax: converts raw scores into probabilities (so predictions sum to 1)
  • cross-entropy loss: measures how well predicted probabilities match the correct token
  • gradient descent: iterative weight updates to minimize loss

Notes:
  • This is intentionally simple: no deep learning framework, no Transformer.
  • The model has only ONE row of weights (predictions are context-independent).
  • Training updates the same single row for every example.
  • token_embeddings.csv is a visualization-friendly projection for levels 100-400; in later repos (500+), embeddings become a first-class learned table.

main

main() -> None

Run a simple training demo end-to-end.

Source code in src/toy_gpt_train/d_train.py
def main() -> None:
    """Run a simple training demo end-to-end."""
    from toy_gpt_train.a_tokenizer import CORPUS_DIR, SimpleTokenizer
    from toy_gpt_train.b_vocab import Vocabulary

    log_header(LOG, "Training Demo: Unigram (Frequency-Based) Model")

    base_dir: Final[Path] = Path(__file__).resolve().parents[2]
    outputs_dir: Final[Path] = base_dir / "outputs"
    train_log_path: Final[Path] = outputs_dir / "train_log.csv"

    # Step 0: Identify the corpus file (single file rule).
    corpus_path: Path = find_single_corpus_file(CORPUS_DIR)

    # Step 1: Load and tokenize the corpus.
    tokenizer: SimpleTokenizer = SimpleTokenizer(corpus_path=corpus_path)
    tokens: list[str] = tokenizer.get_tokens()

    if not tokens:
        LOG.error("No tokens found. Check corpus file.")
        return

    # Step 2: Build vocabulary (maps tokens <-> integer IDs).
    vocab: Vocabulary = Vocabulary(tokens)

    # Step 3: Convert token strings to integer IDs for training.
    token_ids: list[int] = []
    for tok in tokens:
        tok_id: int | None = vocab.get_token_id(tok)
        if tok_id is None:
            LOG.error(f"Token not found in vocabulary: {tok!r}")
            return
        token_ids.append(tok_id)

    # Step 4: Create training targets (just the tokens themselves for unigram).
    targets: list[int] = make_training_targets(token_ids)
    LOG.info(f"Created {len(targets)} training targets.")

    # Step 5: Initialize model (unigram has only 1 row of weights).
    model: SimpleNextTokenModel = SimpleNextTokenModel(vocab_size=vocab.vocab_size())

    # Step 6: Train the model.
    learning_rate: float = 0.1
    epochs: int = 50

    history: list[dict[str, float]] = train_model(
        model=model,
        targets=targets,
        learning_rate=learning_rate,
        epochs=epochs,
    )

    # Step 7: Save training metrics for analysis.
    write_training_log(train_log_path, history)

    # Step 7b: Write inspectable artifacts for downstream use.
    write_artifacts(
        base_dir=base_dir,
        corpus_path=corpus_path,
        vocab=vocab,
        model=model,
        model_kind="unigram",
        learning_rate=learning_rate,
        epochs=epochs,
        row_labeler=row_labeler_unigram(vocab, vocab.vocab_size()),
    )

    # Step 8: Qualitative check - what does the model predict?
    probs: list[float] = model.forward()
    best_id: int = argmax(probs)
    best_tok: str | None = vocab.get_id_token(best_id)
    LOG.info(
        f"After training, most likely token (based on frequency) "
        f"is {best_tok!r} (ID: {best_id})."
    )

make_training_targets

make_training_targets(token_ids: list[int]) -> list[int]

Extract training targets for unigram model.

For unigram, we don't need (input, target) pairs because the model ignores input. We just need the list of all tokens that appear in the corpus - each one is a target to predict.

Parameters:

Name Type Description Default
token_ids list[int]

Sequence of integer token IDs from the corpus.

required

Returns:

Type Description
list[int]

List of target token IDs (all tokens in corpus).

Example

Token sequence "the cat sat" with IDs [3, 1, 2] produces: [3, 1, 2]

Meaning: the model should learn to predict these tokens based on their frequency (3 appears once, 1 appears once, etc.)

Source code in src/toy_gpt_train/d_train.py
def make_training_targets(token_ids: list[int]) -> list[int]:
    """Extract training targets for unigram model.

    For unigram, we don't need (input, target) pairs because
    the model ignores input. We just need the list of all tokens
    that appear in the corpus - each one is a target to predict.

    Args:
        token_ids: Sequence of integer token IDs from the corpus.

    Returns:
        List of target token IDs (all tokens in corpus).

    Example:
        Token sequence "the cat sat" with IDs [3, 1, 2] produces:
        [3, 1, 2]
        Meaning: the model should learn to predict these tokens
        based on their frequency (3 appears once, 1 appears once, etc.)
    """
    return token_ids
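
Since the targets are simply the corpus token IDs, the distribution a unigram model is expected to approach is the relative frequency of each ID. The helper below, `empirical_unigram`, is a hypothetical name introduced for this sketch and is not part of the repository; it computes that reference distribution directly by counting.

```python
from collections import Counter


def empirical_unigram(targets: list[int]) -> dict[int, float]:
    """Hypothetical helper: relative frequency of each token ID in the targets."""
    counts = Counter(targets)
    total = len(targets)
    return {tok_id: n / total for tok_id, n in counts.items()}


# "the cat sat the" with assumed IDs the=3, cat=1, sat=2
freqs = empirical_unigram([3, 1, 2, 3])
print(freqs)  # {3: 0.5, 1: 0.25, 2: 0.25}
```

Note also that `make_training_targets` returns `token_ids` itself rather than a copy, so mutating the returned list would also mutate the caller's list.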

row_labeler_unigram

row_labeler_unigram(
    vocab: VocabularyLike, vocab_size: int
) -> RowLabeler

Map a unigram row index to a label.

Unigram has only one row, labeled to indicate it's context-free.

Source code in src/toy_gpt_train/d_train.py
def row_labeler_unigram(vocab: VocabularyLike, vocab_size: int) -> RowLabeler:
    """Map a unigram row index to a label.

    Unigram has only one row, labeled to indicate it's context-free.
    """
    _ = vocab  # unused - unigram doesn't label by token
    _ = vocab_size

    def label(row_idx: int) -> str:
        # Only one row in unigram - label it descriptively
        return "(no context)"

    return label

train_model

train_model(
    model: SimpleNextTokenModel,
    targets: list[int],
    learning_rate: float,
    epochs: int,
) -> list[dict[str, float]]

Train the unigram model using gradient descent on softmax cross-entropy.

Unigram training learns corpus frequencies. The model has a single row of weights that gets updated for every token in the corpus.

Training proceeds in epochs (full passes through all tokens). For each token, we:
  1. Compute the model's predicted probabilities (forward pass).
  2. Measure how wrong the prediction was (loss).
  3. Adjust weights to reduce the loss (gradient descent).

Parameters:

Name Type Description Default
model SimpleNextTokenModel

The model to train (weights will be modified in place).

required
targets list[int]

List of target token IDs from the corpus.

required
learning_rate float

Step size for gradient descent.

required
epochs int

Number of complete passes through the training data.

required

Returns:

Type Description
list[dict[str, float]]

List of per-epoch metrics dictionaries containing epoch number, average loss, and accuracy.

Source code in src/toy_gpt_train/d_train.py
def train_model(
    model: "SimpleNextTokenModel",
    targets: list[int],
    learning_rate: float,
    epochs: int,
) -> list[dict[str, float]]:
    """Train the unigram model using gradient descent on softmax cross-entropy.

    Unigram training learns corpus frequencies. The model has a single row
    of weights that gets updated for every token in the corpus.

    Training proceeds in epochs (full passes through all tokens).
    For each token, we:
    1. Compute the model's predicted probabilities (forward pass).
    2. Measure how wrong the prediction was (loss).
    3. Adjust weights to reduce the loss (gradient descent).

    Args:
        model: The model to train (weights will be modified in place).
        targets: List of target token IDs from the corpus.
        learning_rate: Step size for gradient descent.
        epochs: Number of complete passes through the training data.

    Returns:
        List of per-epoch metrics dictionaries containing epoch number,
        average loss, and accuracy.
    """
    history: list[dict[str, float]] = []

    for epoch in range(1, epochs + 1):
        total_loss: float = 0.0
        correct: int = 0

        for target_id in targets:
            # Forward pass: get probability distribution (same for all inputs).
            probs: list[float] = model.forward()

            # Compute loss: how surprised is the model by this token?
            loss: float = cross_entropy_loss(probs, target_id)
            total_loss += loss

            # Check if the model's top prediction matches the target.
            pred_id: int = argmax(probs)
            if pred_id == target_id:
                correct += 1

            # Backward pass: update the single row of weights.
            #
            # For softmax cross-entropy, the gradient is:
            #   gradient[j] = predicted_prob[j] - true_prob[j]
            #
            # This pushes probability mass toward frequently-seen tokens.
            row: list[float] = model.weights[0]  # unigram has only one row
            for j in range(model.vocab_size):
                y: float = 1.0 if j == target_id else 0.0
                grad: float = probs[j] - y
                row[j] -= learning_rate * grad

        # Compute epoch-level metrics.
        avg_loss: float = total_loss / len(targets) if targets else float("nan")
        accuracy: float = correct / len(targets) if targets else 0.0

        metrics: dict[str, float] = {
            "epoch": float(epoch),
            "avg_loss": avg_loss,
            "accuracy": accuracy,
        }
        history.append(metrics)

        LOG.info(
            f"Epoch {epoch}/{epochs} | avg_loss={avg_loss:.6f} | accuracy={accuracy:.3f}"
        )

    return history
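
To see the update rule pull the weights toward corpus frequencies, the loop can be reduced to its core. The sketch below is a stripped-down stand-in, not the repository's `train_model`: `train_unigram` is a hypothetical name, the weights are a flat list, and logging and metrics are omitted. The toy targets contain token 0 three times and token 1 once.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


def train_unigram(
    targets: list[int], vocab_size: int, learning_rate: float, epochs: int
) -> list[float]:
    """Hypothetical reduced training loop: one weight row, per-example updates."""
    weights = [0.0] * vocab_size
    for _ in range(epochs):
        for target_id in targets:
            probs = softmax(weights)
            for j in range(vocab_size):
                y = 1.0 if j == target_id else 0.0
                # Softmax cross-entropy gradient: predicted minus true probability.
                weights[j] -= learning_rate * (probs[j] - y)
    return softmax(weights)


probs = train_unigram([0, 0, 0, 1], vocab_size=2, learning_rate=0.1, epochs=200)
print([round(p, 2) for p in probs])  # close to the corpus frequencies [0.75, 0.25]
```

Because the weights are updated after every single example, the probabilities hover near the frequencies rather than settling on them exactly; a smaller learning rate narrows that jitter.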

Inference

toy_gpt_train.e_infer

e_infer.py - Inference module (artifact-driven).

Runs inference using previously saved training artifacts.

Responsibilities:
  • Load inspectable training artifacts from artifacts/
      • 00_meta.json
      • 01_vocabulary.csv
      • 02_model_weights.csv
  • Reconstruct a vocabulary-like interface and model weights
  • Generate tokens using greedy decoding (argmax)
  • Print top-k next-token probabilities for inspection

Notes:
  • This module does NOT retrain by default.
  • If artifacts are missing, run d_train.py first.

Unigram inference

The model ignores all context and predicts based solely on corpus word frequencies. Every call to forward() returns the same probability distribution.

ArtifactVocabulary dataclass

Vocabulary reconstructed from artifacts/01_vocabulary.csv.

Provides the same surface area used by inference:
  • vocab_size()
  • get_token_id()
  • get_id_token()
  • get_token_frequency()

Source code in src/toy_gpt_train/e_infer.py
@dataclass(frozen=True)
class ArtifactVocabulary:
    """Vocabulary reconstructed from artifacts/01_vocabulary.csv.

    Provides the same surface area used by inference:
    - vocab_size()
    - get_token_id()
    - get_id_token()
    - get_token_frequency()
    """

    token_to_id: dict[str, int]
    id_to_token: dict[int, str]
    token_freq: dict[str, int]

    def vocab_size(self) -> int:
        """Return the total number of tokens in the vocabulary."""
        return len(self.token_to_id)

    def get_token_id(self, token: str) -> int | None:
        """Return the token ID for a given token, or None if not found."""
        return self.token_to_id.get(token)

    def get_id_token(self, idx: int) -> str | None:
        """Return the token for a given token ID, or None if not found."""
        return self.id_to_token.get(idx)

    def get_token_frequency(self, token: str) -> int:
        """Return the frequency count for a given token, or 0 if not found."""
        return self.token_freq.get(token, 0)

get_id_token

get_id_token(idx: int) -> str | None

Return the token for a given token ID, or None if not found.

Source code in src/toy_gpt_train/e_infer.py
def get_id_token(self, idx: int) -> str | None:
    """Return the token for a given token ID, or None if not found."""
    return self.id_to_token.get(idx)

get_token_frequency

get_token_frequency(token: str) -> int

Return the frequency count for a given token, or 0 if not found.

Source code in src/toy_gpt_train/e_infer.py
def get_token_frequency(self, token: str) -> int:
    """Return the frequency count for a given token, or 0 if not found."""
    return self.token_freq.get(token, 0)

get_token_id

get_token_id(token: str) -> int | None

Return the token ID for a given token, or None if not found.

Source code in src/toy_gpt_train/e_infer.py
def get_token_id(self, token: str) -> int | None:
    """Return the token ID for a given token, or None if not found."""
    return self.token_to_id.get(token)

vocab_size

vocab_size() -> int

Return the total number of tokens in the vocabulary.

Source code in src/toy_gpt_train/e_infer.py
def vocab_size(self) -> int:
    """Return the total number of tokens in the vocabulary."""
    return len(self.token_to_id)

generate_tokens_unigram

generate_tokens_unigram(
    model: SimpleNextTokenModel,
    vocab: ArtifactVocabulary,
    num_tokens: int,
) -> list[str]

Generate tokens using unigram (no context - same prediction every time).

Note: Unigram ignores all context, so there's no start token. Every generated token will be the same (the most frequent word).

Source code in src/toy_gpt_train/e_infer.py
def generate_tokens_unigram(
    model: SimpleNextTokenModel,
    vocab: ArtifactVocabulary,
    num_tokens: int,
) -> list[str]:
    """Generate tokens using unigram (no context - same prediction every time).

    Note: Unigram ignores all context, so there's no start token.
    Every generated token will be the same (the most frequent word).
    """
    generated: list[str] = []

    for _ in range(num_tokens):
        probs: list[float] = model.forward()  # no input - unigram ignores context
        next_id: int = argmax(probs)
        next_token: str | None = vocab.get_id_token(next_id)

        if next_token is None:
            LOG.error(f"Generated invalid token ID: {next_id}")
            break

        generated.append(next_token)

    return generated
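
The repetition described in the note follows directly from combining a fixed distribution with greedy selection. A minimal sketch, using a made-up probability list and token mapping in place of the model and vocabulary (`generate_greedy` is a hypothetical name):

```python
def generate_greedy(
    probs: list[float], id_to_token: dict[int, str], num_tokens: int
) -> list[str]:
    """Hypothetical stand-in: greedy decoding over a distribution that never changes."""
    # max() returns the first index with the highest probability, like argmax above.
    best_id = max(range(len(probs)), key=lambda i: probs[i])
    return [id_to_token[best_id] for _ in range(num_tokens)]


# Assumed example values, not taken from any real artifact.
tokens = generate_greedy([0.2, 0.5, 0.3], {0: "cat", 1: "the", 2: "sat"}, num_tokens=3)
print(tokens)  # ['the', 'the', 'the']
```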

load_meta

load_meta(path: Path) -> JsonObject

Load 00_meta.json.

Source code in src/toy_gpt_train/e_infer.py
def load_meta(path: Path) -> JsonObject:
    """Load 00_meta.json."""
    with path.open("r", encoding="utf-8") as f:
        data: JsonObject = json.load(f)
    return data

load_model_weights_csv

load_model_weights_csv(
    path: Path, vocab_size: int
) -> list[list[float]]

Load 02_model_weights.csv -> weights matrix.

For unigram, expects exactly 1 row (context-independent predictions).

Source code in src/toy_gpt_train/e_infer.py
def load_model_weights_csv(path: Path, vocab_size: int) -> list[list[float]]:
    """Load 02_model_weights.csv -> weights matrix.

    For unigram, expects exactly 1 row (context-independent predictions).
    """
    weights: list[list[float]] = []

    with path.open("r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            raise ValueError("Weights CSV is empty.")
        if len(header) < 2 or header[0] != "input_token":
            raise ValueError("Weights CSV must start with header 'input_token'.")

        num_outputs = len(header) - 1
        if num_outputs != vocab_size:
            raise ValueError(
                f"Weights CSV output width mismatch. Expected {vocab_size} output columns "
                f"but found {num_outputs}."
            )

        for row in reader:
            if not row:
                continue
            if len(row) != vocab_size + 1:
                raise ValueError(
                    f"Invalid weights row length. Expected {vocab_size + 1} columns but found {len(row)}."
                )
            # row[0] is input token label; row[1:] are numeric weights
            weights.append([float(x) for x in row[1:]])

    # Unigram has exactly 1 row
    if len(weights) != 1:
        raise ValueError(
            f"Unigram weights CSV should have 1 row but found {len(weights)}."
        )

    return weights
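
The checks in the loader imply a specific file shape: a header whose first cell is `input_token` followed by one column per vocabulary entry, then exactly one data row. The sketch below builds such a file in memory and parses it the same way. The column names and weight values are invented for illustration, and the row label is the one produced by `row_labeler_unigram`.

```python
import csv
import io

# Assumed sample content for a 3-token vocabulary; values are made up.
sample = "input_token,cat,sat,the\n(no context),0.1,-0.2,0.5\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)

# row[0] is the row label; row[1:] are the numeric weights.
rows = [[float(x) for x in row[1:]] for row in reader if row]

print(header[0])        # input_token
print(len(header) - 1)  # 3 output columns, which must equal vocab_size
print(rows)             # [[0.1, -0.2, 0.5]]
```

A bigram-style file with one row per input token would pass the width checks but fail the final row-count check.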

load_vocabulary_csv

load_vocabulary_csv(path: Path) -> ArtifactVocabulary

Load 01_vocabulary.csv -> ArtifactVocabulary.

Source code in src/toy_gpt_train/e_infer.py
def load_vocabulary_csv(path: Path) -> ArtifactVocabulary:
    """Load 01_vocabulary.csv -> ArtifactVocabulary."""
    token_to_id: dict[str, int] = {}
    id_to_token: dict[int, str] = {}
    token_freq: dict[str, int] = {}

    with path.open("r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        expected = {"token_id", "token", "frequency"}
        if reader.fieldnames is None or set(reader.fieldnames) != expected:
            raise ValueError(
                f"Unexpected vocabulary header. Expected {sorted(expected)} "
                f"but got {reader.fieldnames}"
            )

        for row in reader:
            token_id = int(row["token_id"])
            token = row["token"]
            freq = int(row["frequency"])

            token_to_id[token] = token_id
            id_to_token[token_id] = token
            token_freq[token] = freq

    return ArtifactVocabulary(
        token_to_id=token_to_id,
        id_to_token=id_to_token,
        token_freq=token_freq,
    )
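
For reference, here is a small in-memory file that satisfies the header check above, parsed with the same `csv.DictReader` approach. The tokens and counts are invented for illustration.

```python
import csv
import io

# Assumed sample content; a real 01_vocabulary.csv is written by training.
sample = "token_id,token,frequency\n0,cat,1\n1,sat,1\n2,the,2\n"

reader = csv.DictReader(io.StringIO(sample))
assert reader.fieldnames is not None
# Set equality: column order is free, but extra or missing columns are rejected.
assert set(reader.fieldnames) == {"token_id", "token", "frequency"}

token_to_id: dict[str, int] = {}
token_freq: dict[str, int] = {}
for row in reader:
    token_to_id[row["token"]] = int(row["token_id"])
    token_freq[row["token"]] = int(row["frequency"])

print(token_to_id)  # {'cat': 0, 'sat': 1, 'the': 2}
print(token_freq)   # {'cat': 1, 'sat': 1, 'the': 2}
```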

main

main() -> None

Run inference using saved training artifacts.

Source code in src/toy_gpt_train/e_infer.py
def main() -> None:
    """Run inference using saved training artifacts."""
    log_header(LOG, "Inference Demo: Load Artifacts and Generate Text (Unigram)")

    base_dir: Final[Path] = Path(__file__).resolve().parents[2]
    artifacts_dir: Final[Path] = base_dir / "artifacts"
    meta_path: Final[Path] = artifacts_dir / "00_meta.json"
    vocab_path: Final[Path] = artifacts_dir / "01_vocabulary.csv"
    weights_path: Final[Path] = artifacts_dir / "02_model_weights.csv"
    require_artifacts(
        meta_path=meta_path,
        vocab_path=vocab_path,
        weights_path=weights_path,
        train_hint="uv run python src/toy_gpt_train/d_train.py",
    )

    meta: JsonObject = load_meta(meta_path)
    vocab: ArtifactVocabulary = load_vocabulary_csv(vocab_path)

    v: int = vocab.vocab_size()
    model: SimpleNextTokenModel = SimpleNextTokenModel(vocab_size=v)
    model.weights = load_model_weights_csv(weights_path, vocab_size=v)

    args: argparse.Namespace = parse_args([])

    LOG.info(
        f"Loaded repo_name={meta.get('repo_name')} model_kind={meta.get('model_kind')}"
    )
    LOG.info(f"Vocab size: {v}")
    LOG.info("Unigram model: predictions are the same regardless of input.")

    # Show predictions (same for any input)
    probs: list[float] = model.forward()
    LOG.info("Top next-token predictions (based on corpus frequency):")
    for tok_id, prob in top_k(probs, k=max(1, args.topk)):
        tok: str | None = vocab.get_id_token(tok_id)
        LOG.info(f"  {tok!r} (ID {tok_id}): {prob:.4f}")

    generated: list[str] = generate_tokens_unigram(
        model=model,
        vocab=vocab,
        num_tokens=max(0, args.num_tokens),
    )

    LOG.info("Generated sequence:")
    LOG.info(f"  {' '.join(generated)}")

parse_args

parse_args(
    argv: list[str] | None = None,
) -> argparse.Namespace

Parse command-line arguments for inference.

Source code in src/toy_gpt_train/e_infer.py
def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    """Parse command-line arguments for inference."""
    parser = argparse.ArgumentParser(
        description="Run inference using saved training artifacts (unigram)."
    )
    parser.add_argument(
        "--num-tokens",
        type=int,
        default=10,
        help="Number of tokens to generate (default: 10).",
    )
    parser.add_argument(
        "--topk",
        type=int,
        default=3,
        help="Number of top predictions to display (default: 3).",
    )
    return parser.parse_args(argv)

require_artifacts

require_artifacts(
    *,
    meta_path: Path,
    vocab_path: Path,
    weights_path: Path,
    train_hint: str,
) -> None

Fail fast with a helpful message if artifacts are missing.

Source code in src/toy_gpt_train/e_infer.py
def require_artifacts(
    *,
    meta_path: Path,
    vocab_path: Path,
    weights_path: Path,
    train_hint: str,
) -> None:
    """Fail fast with a helpful message if artifacts are missing."""
    missing: list[Path] = [
        p for p in (meta_path, vocab_path, weights_path) if not p.exists()
    ]

    if missing:
        LOG.error("Missing training artifacts:")
        for p in missing:
            LOG.error(f"  - {p}")
        LOG.error("Run training first:")
        LOG.error(f"  {train_hint}")
        raise SystemExit(2)
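
The fail-fast behavior can be observed by catching `SystemExit`. This is a sketch under stated assumptions: the function body is copied from above, the logger name `e_infer_sketch` is a stand-in for the module logger, and the temporary directory layout is made up for the demonstration.

```python
import logging
import tempfile
from pathlib import Path

LOG = logging.getLogger("e_infer_sketch")  # stand-in for the module-level logger


def require_artifacts(
    *, meta_path: Path, vocab_path: Path, weights_path: Path, train_hint: str
) -> None:
    """Same logic as the function above, copied so the sketch runs standalone."""
    missing = [p for p in (meta_path, vocab_path, weights_path) if not p.exists()]
    if missing:
        LOG.error("Missing training artifacts:")
        for p in missing:
            LOG.error(f"  - {p}")
        LOG.error("Run training first:")
        LOG.error(f"  {train_hint}")
        raise SystemExit(2)


with tempfile.TemporaryDirectory() as tmp:
    artifacts_dir = Path(tmp) / "artifacts"
    artifacts_dir.mkdir()
    # Only the metadata file exists; vocabulary and weights are missing.
    (artifacts_dir / "00_meta.json").write_text("{}\n", encoding="utf-8")

    try:
        require_artifacts(
            meta_path=artifacts_dir / "00_meta.json",
            vocab_path=artifacts_dir / "01_vocabulary.csv",
            weights_path=artifacts_dir / "02_model_weights.csv",
            train_hint="uv run python src/toy_gpt_train/d_train.py",
        )
        exit_code = 0
    except SystemExit as exc:
        # SystemExit carries the requested process exit status in .code.
        exit_code = exc.code

print(exit_code)
```

Raising `SystemExit(2)` rather than returning a flag means a script that forgets to check the result still stops before it tries to load files that do not exist.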

top_k

top_k(
    probs: list[float], k: int
) -> list[tuple[int, float]]

Return top-k (token_id, probability) pairs sorted by descending probability.

Source code in src/toy_gpt_train/e_infer.py
def top_k(probs: list[float], k: int) -> list[tuple[int, float]]:
    """Return top-k (token_id, probability) pairs sorted by descending probability."""
    pairs: list[tuple[int, float]] = list(enumerate(probs))
    pairs.sort(key=lambda x: x[1], reverse=True)
    return pairs[:k]
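
A short sketch of how `top_k` behaves at its edges. The function is copied from above so the snippet runs standalone; the probability values are illustrative.

```python
def top_k(probs: list[float], k: int) -> list[tuple[int, float]]:
    """Copy of the function above so the sketch is self-contained."""
    pairs = list(enumerate(probs))
    pairs.sort(key=lambda x: x[1], reverse=True)
    return pairs[:k]


probs = [0.1, 0.7, 0.2]

best_two = top_k(probs, k=2)     # highest probability first
everything = top_k(probs, k=10)  # k larger than the list simply returns all pairs
negative = top_k(probs, k=-1)    # a negative k slices off the tail instead of failing
tied = top_k([0.5, 0.5], k=1)    # stable sort: the lower token ID wins a tie

print(best_two)
print(negative)
print(tied)
```

Because a negative `k` silently drops items from the end rather than raising, the inference code above clamps the value with `max(1, args.topk)` before calling `top_k`.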

Artifacts and I/O

toy_gpt_train.io_artifacts

io_artifacts.py - Input/output and training-artifact utilities used by the models.

This module is responsible for persisting and describing the results of model training in a consistent, inspectable format.

It does not perform training. It:
  • Writes artifacts produced by training (weights, vocabulary, logs, metadata)
  • Assumes a conventional repository layout for reproducibility
  • Provides small helper utilities shared across training and inference

The expected directory structure is:
  • artifacts/ contains all inspectable model outputs
  • corpus/ contains training text files (often exactly one)
  • outputs/ contains training logs and diagnostics

External callers should treat paths as implementation details and interact through the functions provided here.

Concepts

Artifact
A concrete file written to disk that captures some aspect of training. In this project, artifacts are designed to be:
  • Human-readable (CSV / JSON)
  • Stable across model variants (unigram, bigram, context-3, etc.)
  • Reusable by inference without retraining

Epoch
One epoch is one complete pass through all training examples. Training typically consists of multiple epochs so the model can gradually improve its predictions by repeatedly adjusting weights.

Training Log
A CSV file recording per-epoch metrics such as:
  • average loss
  • accuracy
This allows learning behavior to be inspected after training.

Vocabulary
A mapping between token strings and integer token IDs. The vocabulary defines:
  • the size of the model output space
  • the meaning of each row and column in the weight tables

Row Labeler
A small function that maps a numeric row index in the model's weight table to a human-readable label. For example, as the number of context tokens increases, the row labeler produces context strings such as:
  • unigram: "cat"
  • bigram: "the|cat"
  • context-3: "the|black|cat"

Row labels are written into CSV artifacts to make model structure visible.

Model Weights
Numeric parameters learned during training. Conceptually:
  • each row corresponds to an input context
  • each column corresponds to a possible next token
Weights are written as plain numbers, formatted to four decimal places, so learning can be inspected or reused.

Token Embeddings (Derived)
A simple 2D projection derived from model weights for visualization. These are not learned embeddings yet. In later stages (500+), embeddings become first-class learned parameters.

Reproducibility Metadata
The 00_meta.json file records:
  • which corpus was used
  • how it was hashed
  • which model variant was trained
  • what training settings were applied
This allows results to be traced and compared across runs and repositories.
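
The Row Labeler concept above can be sketched as follows. This is an illustration, not the repository's implementation: the `RowLabeler` alias, the `make_context_row_labeler` helper, and the row ordering (contexts enumerated in row-major order, with the earliest token most significant) are all assumptions made for the example.

```python
from collections.abc import Callable

# Assumed alias: a row labeler maps a weight-row index to a display label.
RowLabeler = Callable[[int], str]


def make_context_row_labeler(id_to_token: list[str], context_size: int) -> RowLabeler:
    """Build a labeler for a table with vocab_size ** context_size rows.

    Assumption: row index = sum(token_id_i * vocab_size ** (context_size - 1 - i)),
    i.e. the first context token is the most significant "digit".
    """
    vocab_size = len(id_to_token)

    def label(row_idx: int) -> str:
        ids: list[int] = []
        remaining = row_idx
        # Peel off base-vocab_size digits, least significant (latest token) first.
        for _ in range(context_size):
            remaining, tok_id = divmod(remaining, vocab_size)
            ids.append(tok_id)
        ids.reverse()
        return "|".join(id_to_token[i] for i in ids)

    return label


tokens = ["the", "black", "cat"]  # hypothetical vocabulary: IDs 0, 1, 2

unigram_label = make_context_row_labeler(tokens, context_size=1)(2)
bigram_label = make_context_row_labeler(tokens, context_size=2)(0 * 3 + 2)
context3_label = make_context_row_labeler(tokens, context_size=3)(0 * 9 + 1 * 3 + 2)

print(unigram_label, bigram_label, context3_label)
```

Keeping the labeler as a plain function lets the same artifact writers serve every model variant: only the function passed in changes, not the CSV-writing code.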

Design Notes

  • This module is shared unchanged across model levels (100-400).
  • More advanced pipelines (embeddings, attention, batching) build on the same artifact-writing concepts.
  • Centralizing I/O logic prevents drift across repositories and keeps training code focused on learning.

VocabularyLike

Bases: Protocol

Protocol for vocabulary-like objects used in training and artifacts.

Source code in src/toy_gpt_train/io_artifacts.py
class VocabularyLike(Protocol):
    """Protocol for vocabulary-like objects used in training and artifacts."""

    def vocab_size(self) -> int:
        """Return the total number of unique tokens in the vocabulary."""
        ...

    def get_token_id(self, token: str) -> int | None:
        """Return the integer ID for a given token, or None if not found."""
        ...

    def get_id_token(self, idx: int) -> str | None:
        """Return the token string for a given integer ID, or None if not found."""
        ...

    def get_token_frequency(self, token: str) -> int:
        """Return the frequency count for a given token."""
        ...

get_id_token

get_id_token(idx: int) -> str | None

Return the token string for a given integer ID, or None if not found.

Source code in src/toy_gpt_train/io_artifacts.py
def get_id_token(self, idx: int) -> str | None:
    """Return the token string for a given integer ID, or None if not found."""
    ...

get_token_frequency

get_token_frequency(token: str) -> int

Return the frequency count for a given token.

Source code in src/toy_gpt_train/io_artifacts.py
def get_token_frequency(self, token: str) -> int:
    """Return the frequency count for a given token."""
    ...

get_token_id

get_token_id(token: str) -> int | None

Return the integer ID for a given token, or None if not found.

Source code in src/toy_gpt_train/io_artifacts.py
def get_token_id(self, token: str) -> int | None:
    """Return the integer ID for a given token, or None if not found."""
    ...

vocab_size

vocab_size() -> int

Return the total number of unique tokens in the vocabulary.

Source code in src/toy_gpt_train/io_artifacts.py
def vocab_size(self) -> int:
    """Return the total number of unique tokens in the vocabulary."""
    ...

artifact_paths_from_base_dir

artifact_paths_from_base_dir(
    base_dir: Path,
) -> dict[str, Path]

Return standard artifact paths under base_dir/artifacts/.

Source code in src/toy_gpt_train/io_artifacts.py
def artifact_paths_from_base_dir(base_dir: Path) -> dict[str, Path]:
    """Return standard artifact paths under base_dir/artifacts/."""
    artifacts_dir = artifacts_dir_from_base_dir(base_dir)
    return {
        "00_meta.json": artifacts_dir / "00_meta.json",
        "01_vocabulary.csv": artifacts_dir / "01_vocabulary.csv",
        "02_model_weights.csv": artifacts_dir / "02_model_weights.csv",
        "03_token_embeddings.csv": artifacts_dir / "03_token_embeddings.csv",
    }

artifacts_dir_from_base_dir

artifacts_dir_from_base_dir(base_dir: Path) -> Path

Return artifacts/ directory under a repository base directory.

Source code in src/toy_gpt_train/io_artifacts.py
def artifacts_dir_from_base_dir(base_dir: Path) -> Path:
    """Return artifacts/ directory under a repository base directory."""
    return base_dir / "artifacts"

find_single_corpus_file

find_single_corpus_file(corpus_dir: Path) -> Path

Find the single corpus file in corpus/ (same rule as SimpleTokenizer).

Source code in src/toy_gpt_train/io_artifacts.py
def find_single_corpus_file(corpus_dir: Path) -> Path:
    """Find the single corpus file in corpus/ (same rule as SimpleTokenizer)."""
    if not corpus_dir.exists():
        msg = f"Corpus directory not found: {corpus_dir}"
        raise FileNotFoundError(msg)

    files = sorted([p for p in corpus_dir.iterdir() if p.is_file()])
    if len(files) == 0:
        msg = f"No files found in corpus directory: {corpus_dir}"
        raise FileNotFoundError(msg)
    if len(files) > 1:
        msg = f"Expected exactly one file in corpus directory, found {len(files)}: {corpus_dir}"
        raise ValueError(msg)

    return files[0]
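
The three outcomes of the single-file rule can be walked through in a temporary directory. The function is copied from above; the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_single_corpus_file(corpus_dir: Path) -> Path:
    """Copy of the function above so the sketch runs standalone."""
    if not corpus_dir.exists():
        raise FileNotFoundError(f"Corpus directory not found: {corpus_dir}")
    files = sorted(p for p in corpus_dir.iterdir() if p.is_file())
    if len(files) == 0:
        raise FileNotFoundError(f"No files found in corpus directory: {corpus_dir}")
    if len(files) > 1:
        raise ValueError(
            f"Expected exactly one file in corpus directory, found {len(files)}: {corpus_dir}"
        )
    return files[0]


outcomes: list[str] = []

with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = Path(tmp) / "corpus"

    # Case 1: the directory does not exist yet.
    try:
        find_single_corpus_file(corpus_dir)
    except FileNotFoundError:
        outcomes.append("missing-dir")

    # Case 2: exactly one file, so it is returned.
    corpus_dir.mkdir()
    (corpus_dir / "000_cat_dog.txt").write_text("the cat sat\n", encoding="utf-8")
    found_name = find_single_corpus_file(corpus_dir).name

    # Case 3: a second file makes the choice ambiguous.
    (corpus_dir / "001_extra.txt").write_text("the dog ran\n", encoding="utf-8")
    try:
        find_single_corpus_file(corpus_dir)
    except ValueError:
        outcomes.append("too-many")

print(found_name, outcomes)
```

Subdirectories are ignored because of the `is_file()` filter, but every regular file counts, including hidden ones, so a stray placeholder file in corpus/ would trigger the "exactly one file" error.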

outputs_dir_from_base_dir

outputs_dir_from_base_dir(base_dir: Path) -> Path

Return outputs/ directory under a repository base directory.

Source code in src/toy_gpt_train/io_artifacts.py
def outputs_dir_from_base_dir(base_dir: Path) -> Path:
    """Return outputs/ directory under a repository base directory."""
    return base_dir / "outputs"

repo_name_from_base_dir

repo_name_from_base_dir(base_dir: Path) -> str

Infer repository name from base directory.

Source code in src/toy_gpt_train/io_artifacts.py
def repo_name_from_base_dir(base_dir: Path) -> str:
    """Infer repository name from base directory."""
    return base_dir.resolve().name

sha256_of_bytes

sha256_of_bytes(data: bytes) -> str

Return hex SHA-256 digest for given bytes.

Source code in src/toy_gpt_train/io_artifacts.py
def sha256_of_bytes(data: bytes) -> str:
    """Return hex SHA-256 digest for given bytes."""
    return hashlib.sha256(data).hexdigest()

sha256_of_file

sha256_of_file(path: Path) -> str

Return hex SHA-256 digest for a file.

Source code in src/toy_gpt_train/io_artifacts.py
def sha256_of_file(path: Path) -> str:
    """Return hex SHA-256 digest for a file."""
    data = path.read_bytes()
    return sha256_of_bytes(data)
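
`sha256_of_file` reads the whole file into memory, which is fine for a small corpus. For larger files, a chunked variant yields the same digest. The sketch below is an optional alternative, not part of the module: `sha256_of_file_chunked` and its `chunk_size` parameter are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_bytes(data: bytes) -> str:
    """Copy of the helper above."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: Path) -> str:
    """Copy of the helper above: hash the entire file in one read."""
    return sha256_of_bytes(path.read_bytes())


def sha256_of_file_chunked(path: Path, chunk_size: int = 65536) -> str:
    """Hypothetical variant: feed the hash incrementally to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    corpus.write_bytes(b"abc")
    whole = sha256_of_file(corpus)
    # A tiny chunk size forces several update() calls for the demonstration.
    chunked = sha256_of_file_chunked(corpus, chunk_size=2)

print(whole)
print(whole == chunked)
```

Hashing raw bytes rather than decoded text means the digest changes if line endings or encoding change, which is the desired behavior for detecting that a corpus file differs between runs.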

write_artifacts

write_artifacts(
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    model_kind: str,
    learning_rate: float,
    epochs: int,
    row_labeler: RowLabeler,
) -> None

Write all training artifacts to artifacts/.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_dir | Path | Repository base directory. | required |
| corpus_path | Path | Corpus file used for training. | required |
| vocab | VocabularyLike | VocabularyLike instance. | required |
| model | SimpleNextTokenModel | Trained model (weights already updated). | required |
| model_kind | str | Human-readable model kind (e.g., "unigram", "bigram"). | required |
| learning_rate | float | Training learning rate. | required |
| epochs | int | Number of training passes. | required |
| row_labeler | RowLabeler | Function that maps a model weight-row index to a label written in the first column of 02_model_weights.csv. | required |

Source code in src/toy_gpt_train/io_artifacts.py
def write_artifacts(
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    model_kind: str,
    learning_rate: float,
    epochs: int,
    row_labeler: RowLabeler,
) -> None:
    """Write all training artifacts to artifacts/.

    Args:
        base_dir: Repository base directory.
        corpus_path: Corpus file used for training.
        vocab: VocabularyLike instance.
        model: Trained model (weights already updated).
        model_kind: Human-readable model kind (e.g., "unigram", "bigram").
        learning_rate: Training learning rate.
        epochs: Number of training passes.
        row_labeler: Function that maps a model weight-row index to a label
            written in the first column of 02_model_weights.csv.
    """
    artifacts_dir: Final[Path] = base_dir / "artifacts"
    meta_path: Final[Path] = artifacts_dir / "00_meta.json"
    vocab_path: Final[Path] = artifacts_dir / "01_vocabulary.csv"
    weights_path: Final[Path] = artifacts_dir / "02_model_weights.csv"
    embeddings_path: Final[Path] = artifacts_dir / "03_token_embeddings.csv"

    artifacts_dir.mkdir(parents=True, exist_ok=True)

    write_vocabulary_csv(vocab_path, vocab)
    write_model_weights_csv(weights_path, vocab, model, row_labeler=row_labeler)
    write_token_embeddings_csv(embeddings_path, model, row_labeler=row_labeler)
    write_meta_json(
        meta_path,
        base_dir=base_dir,
        corpus_path=corpus_path,
        vocab_size=vocab.vocab_size(),
        model_kind=model_kind,
        learning_rate=learning_rate,
        epochs=epochs,
    )

write_meta_json

write_meta_json(
    path: Path,
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab_size: int,
    model_kind: str,
    learning_rate: float,
    epochs: int,
) -> None

Write 00_meta.json describing corpus, model, and training settings.

This file is the authoritative, human-readable summary of a training run. It records:
  • what corpus was used
  • what model architecture was trained
  • how training was configured
  • which artifacts were produced

The intent is transparency and reproducibility.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Output JSON path (artifacts/00_meta.json). | required |
| base_dir | Path | Repository base directory. | required |
| corpus_path | Path | Corpus file used for training. | required |
| vocab_size | int | Number of unique tokens. | required |
| model_kind | str | Human-readable model kind (e.g., "unigram", "bigram", "context2", "context3"). | required |
| learning_rate | float | Training learning rate. | required |
| epochs | int | Number of epochs (full passes over the training pairs). | required |

Source code in src/toy_gpt_train/io_artifacts.py
def write_meta_json(
    path: Path,
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab_size: int,
    model_kind: str,
    learning_rate: float,
    epochs: int,
) -> None:
    """Write 00_meta.json describing corpus, model, and training settings.

    This file is the authoritative, human-readable summary of a training run.
    It records:
    - what corpus was used
    - what model architecture was trained
    - how training was configured
    - which artifacts were produced

    The intent is transparency and reproducibility.

    Args:
        path: Output JSON path (artifacts/00_meta.json).
        base_dir: Repository base directory.
        corpus_path: Corpus file used for training.
        vocab_size: Number of unique tokens.
        model_kind: Human-readable model kind
            (e.g., "unigram", "bigram", "context2", "context3").
        learning_rate: Training learning rate.
        epochs: Number of epochs (full passes over the training pairs).
    """
    path.parent.mkdir(parents=True, exist_ok=True)

    # Derive sibling artifact paths from base_dir
    artifact_paths = artifact_paths_from_base_dir(base_dir)

    repo_name = repo_name_from_base_dir(base_dir)
    base_resolved = base_dir.resolve()
    corpus_resolved = corpus_path.resolve()
    try:
        corpus_rel = str(corpus_resolved.relative_to(base_resolved))
    except ValueError:
        corpus_rel = str(corpus_resolved)

    corpus_text = corpus_path.read_text(encoding="utf-8")
    corpus_lines = [ln for ln in corpus_text.splitlines() if ln.strip()]

    meta: JsonObject = {
        "repo_name": repo_name,
        "model_kind": model_kind,
        "vocab_size": vocab_size,
        "training": {
            "learning_rate": learning_rate,
            "epochs": epochs,
            "epoch_definition": (
                "One epoch is a complete pass through all training pairs. "
                "Each pair contributes one gradient update."
            ),
        },
        "corpus": {
            "path": corpus_rel,
            "filename": corpus_path.name,
            "sha256": sha256_of_file(corpus_path),
            "num_lines": len(corpus_lines),
            "num_chars": len(corpus_text),
            "description": (
                "The corpus is tokenized sequential text. "
                "Training pairs are derived by sliding a fixed-size context window "
                "over the token stream."
            ),
        },
        "artifacts": {
            "00_meta.json": artifact_paths["00_meta.json"].name,
            "01_vocabulary.csv": artifact_paths["01_vocabulary.csv"].name,
            "02_model_weights.csv": artifact_paths["02_model_weights.csv"].name,
            "03_token_embeddings.csv": artifact_paths["03_token_embeddings.csv"].name,
        },
        "concepts": {
            "token": "An atomic symbol produced by the tokenizer (e.g., a word).",
            "vocabulary": (
                "The set of all unique tokens observed in the corpus, "
                "mapped to integer IDs."
            ),
            "context": (
                "The fixed number of preceding tokens used as input "
                "to predict the next token."
            ),
            "softmax": (
                "A function that converts raw scores into probabilities "
                "that sum to 1.0."
            ),
            "cross_entropy_loss": (
                "A measure of how well the predicted probability distribution "
                "matches the correct next token."
            ),
            "gradient_descent": (
                "An optimization process that incrementally adjusts weights "
                "to reduce prediction error."
            ),
        },
        "notes": [
            "This is an intentionally inspectable training pipeline.",
            "Models are trained using softmax regression with cross-entropy loss.",
            "Weights are updated incrementally via gradient descent.",
            "Token embeddings are a derived 2D projection for visualization only "
            "in levels 100-400.",
            "In later stages (500+), embeddings are a learned parameter table.",
        ],
    }

    # WHY: Ensure the file always ends with a newline so pre-commit
    # end-of-file-fixer does not modify generated artifacts in CI.
    rendered = json.dumps(meta, indent=2, sort_keys=True) + "\n"
    path.write_text(rendered, encoding="utf-8")

    LOG.info(f"Wrote meta to {path}")
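
Two of the corpus fields in the metadata are derived rather than passed in: the recorded path (relative to the repository when possible, absolute otherwise) and the count of non-blank lines. The sketch below isolates that logic. The helper names `corpus_path_for_meta` and `count_nonblank_lines` are hypothetical; the bodies mirror the statements in `write_meta_json` above.

```python
import tempfile
from pathlib import Path


def corpus_path_for_meta(base_dir: Path, corpus_path: Path) -> str:
    """Prefer a repo-relative path; fall back to the absolute path."""
    base_resolved = base_dir.resolve()
    corpus_resolved = corpus_path.resolve()
    try:
        return str(corpus_resolved.relative_to(base_resolved))
    except ValueError:
        # relative_to raises ValueError when the corpus is outside base_dir.
        return str(corpus_resolved)


def count_nonblank_lines(text: str) -> int:
    """Count lines that contain something other than whitespace."""
    return len([ln for ln in text.splitlines() if ln.strip()])


with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp) / "repo"
    inside = base_dir / "corpus" / "sample.txt"
    outside = Path(tmp) / "elsewhere" / "sample.txt"
    for p in (inside, outside):
        p.parent.mkdir(parents=True)
        # Two content lines separated by an empty line and a whitespace-only line.
        p.write_text("the cat sat\n\n   \nthe dog ran\n", encoding="utf-8")

    inside_rel = corpus_path_for_meta(base_dir, inside)
    outside_rel = corpus_path_for_meta(base_dir, outside)
    num_lines = count_nonblank_lines(inside.read_text(encoding="utf-8"))

print(inside_rel)
print(Path(outside_rel).is_absolute())
print(num_lines)
```

Recording a relative path keeps 00_meta.json identical across machines when the corpus lives inside the repository, which matters because the JSON is also written with `sort_keys=True` for stable diffs.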

write_model_weights_csv

write_model_weights_csv(
    path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None

Write 02_model_weights.csv with token-labeled columns.

Shape
  • first column: input_token (serialized context label)
  • remaining columns: one per output token (weights)
Notes
  • Tokens may be typed internally.
  • This function serializes all token labels explicitly at the I/O boundary.
Source code in src/toy_gpt_train/io_artifacts.py
def write_model_weights_csv(
    path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None:
    """Write 02_model_weights.csv with token-labeled columns.

    Shape:
        - first column: input_token (serialized context label)
        - remaining columns: one per output token (weights)

    Notes:
        - Tokens may be typed internally.
        - This function serializes all token labels explicitly at the I/O boundary.
    """
    path.parent.mkdir(parents=True, exist_ok=True)

    # Header: output-token labels (serialized)
    out_tokens: list[str] = []
    for j in range(vocab.vocab_size()):
        tok = vocab.get_id_token(j)
        out_tokens.append(str(tok) if tok is not None else f"id_{j}")

    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input_token"] + out_tokens)

        for row_idx, row in enumerate(model.weights):
            # Serialize row label explicitly (may be a typed token/context)
            input_label = str(row_labeler(row_idx))

            writer.writerow([input_label] + [_fmt_float(w, decimals=4) for w in row])

    LOG.info(f"Wrote model weights to {path}")
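
The layout described above (a header of output-token labels, then one labeled row per input context) can be read back with the standard `csv` module. This is a sketch only: `read_weights_table` is a hypothetical reader, not the repository's `load_model_weights_csv`, and the CSV text is made-up sample data in the documented shape.

```python
import csv
import io
from collections.abc import Iterable

# Illustrative file contents: first column is the row label, the rest are weights.
csv_text = (
    "input_token,cat,sat,the\n"
    "cat,0.1000,0.9000,-0.2000\n"
    "sat,0.0000,0.3000,0.5000\n"
)


def read_weights_table(lines: Iterable[str]) -> tuple[list[str], dict[str, list[float]]]:
    """Return (output-token labels, {row label: weights}) from the CSV layout."""
    reader = csv.reader(lines)
    header = next(reader)
    out_tokens = header[1:]  # skip the "input_token" column name
    rows: dict[str, list[float]] = {}
    for record in reader:
        rows[record[0]] = [float(w) for w in record[1:]]
    return out_tokens, rows


out_tokens, rows = read_weights_table(io.StringIO(csv_text))

print(out_tokens)
print(rows["cat"])
```

Each weight in a row lines up with the header token in the same column, so `rows["cat"][1]` is the score for "sat" following the context labeled "cat".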

write_token_embeddings_csv

write_token_embeddings_csv(
    path: Path,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None

Write 03_token_embeddings.csv as a simple 2D projection.

This file is a derived visualization artifact, not a learned embedding table.

For each model weight row:
  • x coordinate = first weight (if present)
  • y coordinate = second weight (if present)

If a row has fewer than two weights, missing values default to 0.0.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Output CSV path. | required |
| model | SimpleNextTokenModel | Trained model providing a weight matrix. | required |
| row_labeler | RowLabeler | Function mapping row index to a human-readable label. | required |

Source code in src/toy_gpt_train/io_artifacts.py
def write_token_embeddings_csv(
    path: Path,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None:
    """Write 03_token_embeddings.csv as a simple 2D projection.

    This file is a derived visualization artifact, not a learned embedding table.

    For each model weight row:
        - x coordinate = first weight (if present)
        - y coordinate = second weight (if present)

    If a row has fewer than two weights, missing values default to 0.0.

    Args:
        path: Output CSV path.
        model: Trained model providing a weight matrix.
        row_labeler: Function mapping row index to a human-readable label.
    """
    path.parent.mkdir(parents=True, exist_ok=True)

    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["row", "label", "x", "y"])

        for row_idx, row in enumerate(model.weights):
            # Defensive defaults
            x: float = row[0] if len(row) >= 1 else 0.0
            y: float = row[1] if len(row) >= 2 else 0.0

            writer.writerow(
                [
                    row_idx,
                    str(row_labeler(row_idx)),
                    _fmt_float(x, decimals=4),
                    _fmt_float(y, decimals=4),
                ]
            )

    LOG.info(f"Wrote token embeddings to {path}")
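
The projection rule is small enough to show on its own. In this sketch, `project_row` is a hypothetical helper that mirrors the two "defensive defaults" lines above, and the weight values are illustrative.

```python
def project_row(row: list[float]) -> tuple[float, float]:
    """Take the first two weights as (x, y); missing values default to 0.0."""
    x = row[0] if len(row) >= 1 else 0.0
    y = row[1] if len(row) >= 2 else 0.0
    return x, y


weights = [
    [0.25, -0.5, 1.0],  # normal row: the third weight is ignored
    [0.75],             # only one weight: y falls back to 0.0
    [],                 # empty row: both coordinates fall back to 0.0
]

points = [project_row(row) for row in weights]
print(points)
```

Only the first two columns contribute, so each point reflects the scores a context assigns to output tokens 0 and 1 and nothing else. That is why the file is a visualization aid rather than a learned embedding table.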

write_training_log

write_training_log(
    path: Path, history: list[dict[str, float]]
) -> None

Write per-epoch training metrics to a CSV file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Output file path. | required |
| history | list[dict[str, float]] | List of per-epoch metrics dictionaries. | required |

Source code in src/toy_gpt_train/io_artifacts.py
def write_training_log(path: Path, history: list[dict[str, float]]) -> None:
    """Write per-epoch training metrics to a CSV file.

    Args:
        path: Output file path.
        history: List of per-epoch metrics dictionaries.
    """
    path.parent.mkdir(parents=True, exist_ok=True)

    fieldnames: list[str] = ["epoch", "avg_loss", "accuracy"]
    with path.open("w", encoding="utf-8", newline="") as f:
        writer: csv.DictWriter[str] = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in history:
            writer.writerow(
                {
                    "epoch": int(row["epoch"]),
                    "avg_loss": f"{row['avg_loss']:.8f}",
                    "accuracy": f"{row['accuracy']:.6f}",
                }
            )

    LOG.info(f"Wrote training log to {path}")
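
A short sketch of the expected `history` shape and the resulting file. The function is copied from above with logging omitted; the file name `train_log.csv` and the metric values are assumptions for the example.

```python
import csv
import tempfile
from pathlib import Path


def write_training_log(path: Path, history: list[dict[str, float]]) -> None:
    """Copy of the function above (logging omitted) so the sketch runs standalone."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = ["epoch", "avg_loss", "accuracy"]
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in history:
            writer.writerow(
                {
                    "epoch": int(row["epoch"]),
                    "avg_loss": f"{row['avg_loss']:.8f}",
                    "accuracy": f"{row['accuracy']:.6f}",
                }
            )


# Each dict needs the keys "epoch", "avg_loss", and "accuracy".
history = [
    {"epoch": 1.0, "avg_loss": 1.5, "accuracy": 0.25},
    {"epoch": 2.0, "avg_loss": 1.25, "accuracy": 0.5},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "outputs" / "train_log.csv"  # hypothetical file name
    write_training_log(log_path, history)
    lines = log_path.read_text(encoding="utf-8").splitlines()

print("\n".join(lines))
```

Formatting the metrics to a fixed number of decimals keeps the CSV stable between identical runs, and casting `epoch` to `int` avoids values like `1.0` in the first column.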

write_vocabulary_csv

write_vocabulary_csv(
    path: Path, vocab: VocabularyLike
) -> None

Write 01_vocabulary.csv: token_id, token, frequency.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | Output CSV path. | required |
| vocab | VocabularyLike | Vocabulary instance (must provide vocab_size(), get_id_token(), get_token_frequency()). | required |

Source code in src/toy_gpt_train/io_artifacts.py
def write_vocabulary_csv(path: Path, vocab: VocabularyLike) -> None:
    """Write 01_vocabulary.csv: token_id, token, frequency.

    Args:
        path: Output CSV path.
        vocab: Vocabulary instance (must provide vocab_size(), get_id_token(), get_token_frequency()).
    """
    path.parent.mkdir(parents=True, exist_ok=True)

    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["token_id", "token", "frequency"])

        for token_id in range(vocab.vocab_size()):
            token = vocab.get_id_token(token_id)
            if token is None:
                continue
            freq = vocab.get_token_frequency(token)
            writer.writerow([token_id, token, freq])

    LOG.info(f"Wrote vocabulary to {path}")
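
The three-column layout can be read back with `csv.DictReader`. This is a sketch only: `read_vocabulary_rows` is a hypothetical reader, not the repository's `load_vocabulary_csv`, and the CSV text is made-up sample data with the header written above.

```python
import csv
import io
from collections.abc import Iterable

# Illustrative contents using the header written by write_vocabulary_csv.
csv_text = "token_id,token,frequency\n0,cat,1\n1,sat,1\n2,the,2\n"


def read_vocabulary_rows(lines: Iterable[str]) -> tuple[dict[int, str], dict[str, int]]:
    """Return ({token_id: token}, {token: frequency}) from the CSV layout."""
    id_to_token: dict[int, str] = {}
    frequency: dict[str, int] = {}
    for record in csv.DictReader(lines):
        token_id = int(record["token_id"])
        id_to_token[token_id] = record["token"]
        frequency[record["token"]] = int(record["frequency"])
    return id_to_token, frequency


id_to_token, frequency = read_vocabulary_rows(io.StringIO(csv_text))

print(id_to_token)
print(frequency["the"])
```

The reader keys on the `token_id` column rather than on row position. The writer above skips any ID whose token is `None`, so row position and token ID are not guaranteed to match.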