API Reference

This page documents the public API of the package.

Documented functions below are considered stable.


Training Math

math_training

math_training.py - Mathematical utilities used during model training.

This module contains reusable mathematical functions that appear throughout the training process of language models.

Common themes:
  • These functions are not specific to any one model.
  • They are reused unchanged across unigram, bigram, and higher-context models.
  • Keeping them here avoids duplication and keeps training code readable.

As models become more complex (embeddings, attention, batching), these core ideas remain the same.

argmax

argmax(values: list[float]) -> int

Return the index of the maximum value in a list.

Concept

argmax is the argument (index) at which a function reaches its maximum value.

In training and inference
  • A model outputs a probability distribution over possible next tokens.
  • The token with the highest probability is the model's most confident prediction.
  • argmax selects that token.
Example

values = [0.1, 0.7, 0.2] has indices 0, 1, and 2. argmax(values) -> 1, since 0.7 is the largest value.

This is used for
  • Measuring accuracy during training
  • Greedy decoding during inference

Parameters:
  • values (list[float], required): A list of numeric values (typically probabilities).

Returns:
  • int: The index of the largest value in the list.

Raises:
  • ValueError: If the list is empty.
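The documented behavior can be sketched in a few lines; this is an illustrative implementation, not necessarily the package's own:

```python
def argmax(values: list[float]) -> int:
    """Return the index of the largest value; first index wins on ties."""
    if not values:
        raise ValueError("argmax() of an empty list")
    best_index = 0
    for i, v in enumerate(values):
        if v > values[best_index]:
            best_index = i
    return best_index
```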

cross_entropy_loss

cross_entropy_loss(
    probs: list[float], target_id: int
) -> float

Compute cross-entropy loss for a single training example.

Cross-Entropy Loss

Cross-entropy measures how well a predicted probability distribution matches the true outcome.

In next-token prediction:
  • The true distribution is "one-hot": probability 1.0 for the correct next token and 0.0 for all others.
  • The model predicts a probability distribution over all tokens.

Cross-entropy answers the question: "How well does the predicted probability distribution align with the true outcome?"

Formula

loss = -log(p_correct)

  • If the model assigns high probability to the correct token, the loss is small.
  • If the probability is near zero, the loss is large.
Numerical safety

log(0) is undefined, so we clamp probabilities to a small minimum (1e-12). This does not change learning behavior in practice, but prevents runtime errors.

In training
  • This loss value drives gradient descent.
  • Lower loss means better predictions.

Parameters:
  • probs (list[float], required): A probability distribution over the vocabulary (sums to 1.0).
  • target_id (int, required): The integer ID of the correct next token.

Returns:
  • float: A non-negative floating-point loss value.
    • 0.0 means a perfect prediction.
    • Larger values indicate worse predictions.
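Combining the formula and the numerical-safety note above, a minimal sketch (assuming the 1e-12 clamp described earlier):

```python
import math


def cross_entropy_loss(probs: list[float], target_id: int) -> float:
    # Clamp to avoid log(0), per the "Numerical safety" note above.
    p = max(probs[target_id], 1e-12)
    return -math.log(p)
```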

Models

c_model

c_model.py - Simple model module.

Defines a minimal next-token prediction model for a context-3 setting (uses three tokens in sequence as context).

Responsibilities:
  • Represent a simple parameterized model that maps a 3-tuple of token IDs (prev2, prev1, current) to a score for each token in the vocabulary.
  • Convert scores into probabilities using softmax.
  • Provide a forward pass (no training in this module).

This model is intentionally simple:
  • one weight table (conceptually a 4D tensor: prev2 x prev1 x curr x next, flattened for storage)
  • one forward computation
  • no learning here

Training is handled in a different module.

SimpleNextTokenModel

A minimal next-token prediction model (context-3).

__init__

__init__(vocab_size: int) -> None

Initialize the model with random weights.

forward

forward(
    prev2_id: int, prev1_id: int, current_id: int
) -> list[float]

Perform a forward pass to get next-token probabilities.

Parameters:
  • prev2_id (int, required): Token ID of the token two positions before current.
  • prev1_id (int, required): Token ID of the token one position before current.
  • current_id (int, required): Token ID of the current token.

Returns:
  • list[float]: Probabilities for each token in the vocabulary.
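A sketch of what such a context-3 forward pass might look like, assuming the flattened weight-table layout described in the module docstring and a numerically stable softmax; the actual module may differ in details:

```python
import math
import random


class SimpleNextTokenModel:
    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        rows = vocab_size ** 3  # one row per (prev2, prev1, current) context
        rng = random.Random(0)
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(vocab_size)]
            for _ in range(rows)
        ]

    def forward(self, prev2_id: int, prev1_id: int, current_id: int) -> list[float]:
        # Flattened row lookup: a * V^2 + b * V + c
        row = (prev2_id * self.vocab_size ** 2
               + prev1_id * self.vocab_size
               + current_id)
        scores = self.weights[row]
        # Stable softmax: subtract the max score before exponentiating.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]
```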

main

main() -> None

Demonstrate a forward pass of the simple context-3 model.


Training Pipeline

d_train

d_train.py - Training loop module.

Trains the SimpleNextTokenModel on a small token corpus using a context-3 window (three preceding tokens).

Responsibilities:
  • Create ((token_{t-2}, token_{t-1}, token_t) -> next_token) training pairs
  • Run a basic gradient-descent training loop
  • Track loss and accuracy per epoch
  • Write a CSV log of training progress
  • Write inspectable training artifacts (vocabulary, weights, embeddings, meta)

Concepts:
  • softmax: converts raw scores into probabilities (so predictions sum to 1)
  • cross-entropy loss: measures how well predicted probabilities match the correct token
  • gradient descent: iterative weight updates to minimize loss. Think of descending to the bottom of a valley in a landscape, where the valley floor corresponds to lower prediction error.

Notes:
  • This remains intentionally simple: no deep learning framework, no Transformer.
  • The model generalizes n-gram training by expanding the context window.
  • Training updates weight rows associated with the observed context-3 pattern.
  • token_embeddings.csv remains a derived visualization artifact; learned embeddings are introduced in later stages.

main

main() -> None

Run a simple training demo end-to-end (context-3).

make_training_pairs

make_training_pairs(
    token_ids: list[int],
) -> list[Context3Pair]

Convert token IDs into ((t-2, t-1, t), next) training pairs.

Example

ids = [3, 1, 2, 4, 5]
pairs = [((3, 1, 2), 4), ((1, 2, 4), 5)]
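The example above can be written as a simple sliding-window loop; this is an illustrative version, not necessarily the package's exact code:

```python
Context3Pair = tuple[tuple[int, int, int], int]


def make_training_pairs(token_ids: list[int]) -> list[Context3Pair]:
    pairs: list[Context3Pair] = []
    # Need two tokens of history before t and one token after it.
    for t in range(2, len(token_ids) - 1):
        context = (token_ids[t - 2], token_ids[t - 1], token_ids[t])
        pairs.append((context, token_ids[t + 1]))
    return pairs
```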

row_labeler_context3

row_labeler_context3(
    vocab: VocabularyLike, vocab_size: int
) -> RowLabeler

Map a context-3 row index to a label like 'tok_{t-2}|tok_{t-1}|tok_t'.
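An illustrative version of such a labeler, using a plain list of token strings in place of a VocabularyLike instance:

```python
from typing import Callable


def row_labeler_context3(
    id_to_token: list[str], vocab_size: int
) -> Callable[[int], str]:
    def label(row_index: int) -> str:
        # Invert the flattening: row = a * V^2 + b * V + c
        a, rem = divmod(row_index, vocab_size ** 2)
        b, c = divmod(rem, vocab_size)
        return f"{id_to_token[a]}|{id_to_token[b]}|{id_to_token[c]}"

    return label
```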

token_row_index_context3

token_row_index_context3(
    context_ids: Context3, vocab_size: int
) -> int

Return the row index for a context-3 token sequence.

Context order

(token_id_{t-2}, token_id_{t-1}, token_id_t)

Flattening scheme

row_index = a * vocab_size^2 + b * vocab_size + c

This is the context-3 analogue of

unigram: row = token_id
bigram: row = prev_id * vocab_size + curr_id
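The flattening scheme translates directly into code:

```python
Context3 = tuple[int, int, int]


def token_row_index_context3(context_ids: Context3, vocab_size: int) -> int:
    # row = a * V^2 + b * V + c, with (a, b, c) = (t-2, t-1, t)
    a, b, c = context_ids
    return a * vocab_size ** 2 + b * vocab_size + c
```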

train_model

train_model(
    model: SimpleNextTokenModel,
    pairs: list[Context3Pair],
    learning_rate: float,
    epochs: int,
) -> list[dict[str, float]]

Train the model using gradient descent on softmax cross-entropy (context-3).

Each example

context_ids = (token_id_{t-2}, token_id_{t-1}, token_id_t)
target_id = token_id_{t+1}

Returns:
  • list[dict[str, float]]: A list of per-epoch metrics dictionaries (epoch, avg_loss, accuracy).
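The core weight update for softmax cross-entropy can be sketched as follows. `sgd_step` is a hypothetical helper name, shown only to illustrate the gradient of the loss with respect to the scores, which is (probs minus one-hot):

```python
def sgd_step(
    row: list[float], probs: list[float], target_id: int, lr: float
) -> None:
    """Update one weight row in place using the softmax cross-entropy gradient."""
    for j in range(len(row)):
        # Gradient w.r.t. score j: predicted probability minus the one-hot target.
        grad = probs[j] - (1.0 if j == target_id else 0.0)
        row[j] -= lr * grad
```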


Inference

e_infer

e_infer.py - Inference module (artifact-driven).

Runs inference using previously saved training artifacts.

Responsibilities:
  • Load inspectable training artifacts from artifacts/
    • 00_meta.json
    • 01_vocabulary.csv
    • 02_model_weights.csv
  • Reconstruct a vocabulary-like interface and model weights
  • Generate tokens using greedy decoding (argmax)
  • Print top-k next-token probabilities for inspection

Notes:
  • This module does NOT retrain by default.
  • If artifacts are missing, run d_train.py first.
  • Context-3 bootstrapping: generation starts from a single start token. To form the first 3-token context, we use (start, start, start) as the initial context.

ArtifactVocabulary dataclass

Vocabulary reconstructed from artifacts/01_vocabulary.csv.

Provides the same surface area used by inference:
  • vocab_size()
  • get_token_id()
  • get_id_token()
  • get_token_frequency()

get_id_token

get_id_token(idx: int) -> str | None

Return the token for a given token ID, or None if not found.

get_token_frequency

get_token_frequency(token: str) -> int

Return the frequency count for a given token, or 0 if not found.

get_token_id

get_token_id(token: str) -> int | None

Return the token ID for a given token, or None if not found.

vocab_size

vocab_size() -> int

Return the total number of tokens in the vocabulary.

generate_tokens_context3

generate_tokens_context3(
    model: SimpleNextTokenModel,
    vocab: ArtifactVocabulary,
    start_token: str,
    num_tokens: int,
) -> list[str]

Generate tokens using a context-3 window (t-2, t-1, t).

Bootstrapping

If we only have one start token, we begin with (start, start, start) so that forward(prev2_id, prev1_id, current_id) is well-defined.
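The bootstrapping and greedy decoding described above can be sketched like this. `generate_greedy` and its callback-style arguments are illustrative stand-ins for the real model and ArtifactVocabulary objects:

```python
from typing import Callable


def generate_greedy(
    forward: Callable[[int, int, int], list[float]],
    token_to_id: dict[str, int],
    id_to_token: dict[int, str],
    start_token: str,
    num_tokens: int,
) -> list[str]:
    # Bootstrap the 3-token context from a single start token.
    start_id = token_to_id[start_token]
    ctx = [start_id, start_id, start_id]
    out = [start_token]
    for _ in range(num_tokens - 1):
        probs = forward(ctx[0], ctx[1], ctx[2])
        # Greedy decoding: pick the argmax token.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        out.append(id_to_token[next_id])
        ctx = [ctx[1], ctx[2], next_id]  # slide the context window
    return out
```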

load_meta

load_meta(path: Path) -> JsonObject

Load 00_meta.json.

load_model_weights_csv

load_model_weights_csv(
    path: Path, vocab_size: int, *, expected_rows: int
) -> list[list[float]]

Load 02_model_weights.csv -> weights matrix.

load_vocabulary_csv

load_vocabulary_csv(path: Path) -> ArtifactVocabulary

Load 01_vocabulary.csv -> ArtifactVocabulary.

main

main() -> None

Run inference using saved training artifacts.

parse_args

parse_args() -> argparse.Namespace

Parse command-line arguments.

require_artifacts

require_artifacts() -> None

Fail fast with a helpful message if artifacts are missing.

top_k

top_k(
    probs: list[float], k: int
) -> list[tuple[int, float]]

Return top-k (token_id, probability) pairs sorted by probability.
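A minimal sketch of the documented behavior:

```python
def top_k(probs: list[float], k: int) -> list[tuple[int, float]]:
    """Return the k highest-probability (token_id, probability) pairs."""
    indexed = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return indexed[:k]
```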


Artifacts and I/O

io_artifacts

io.py - Input/output and training-artifact utilities used by the models.

This module is responsible for persisting and describing the results of model training in a consistent, inspectable format.

It does not perform training. It:
  • Writes artifacts produced by training (weights, vocabulary, logs, metadata)
  • Enforces a fixed repository layout for reproducibility
  • Provides small helper utilities shared across training and inference

The directory structure is intentionally fixed:
  • artifacts/ contains all inspectable model outputs
  • corpus/ contains exactly one training text file
  • outputs/ contains training logs and diagnostics

External callers should treat paths as implementation details and interact only through the functions provided here.

Concepts

Artifact
A concrete file written to disk that captures some aspect of training. In this project, artifacts are designed to be:
  • Human-readable (CSV / JSON)
  • Stable across model variants (unigram, bigram, context-3, etc.)
  • Reusable by inference without retraining

Epoch
One epoch is one complete pass through all training examples. Training typically consists of multiple epochs so the model can gradually improve its predictions by repeatedly adjusting weights.

Training Log
A CSV file recording per-epoch metrics such as:
  • average loss
  • accuracy
This allows learning behavior to be inspected after training.

Vocabulary
A mapping between token strings and integer token IDs. The vocabulary defines:
  • the size of the model output space
  • the meaning of each row and column in the weight tables

Row Labeler
A small function that maps a numeric row index in the model's weight table to a human-readable label. For example, as the number of context tokens increases, the row labeler produces context strings such as:
  • unigram: "cat"
  • bigram: "the|cat"
  • context-3: "the|black|cat"

Row labels are written into CSV artifacts to make model structure visible.

Model Weights
Numeric parameters learned during training. Conceptually:
  • each row corresponds to an input context
  • each column corresponds to a possible next token
Weights are written verbatim so learning can be inspected or reused.

Token Embeddings (Derived)
A simple 2D projection derived from model weights for visualization. These are not learned embeddings yet. In later stages (500+), embeddings become first-class learned parameters.

Reproducibility Metadata
The 00_meta.json file records:
  • which corpus was used
  • how it was hashed
  • which model variant was trained
  • what training settings were applied
This allows results to be traced and compared across runs and repositories.

Design Notes

  • This module is shared unchanged across model levels (100-400).
  • More advanced pipelines (embeddings, attention, batching) build on the same artifact-writing concepts.
  • Centralizing I/O logic prevents drift across repositories and keeps training code focused on learning.

VocabularyLike

Bases: Protocol

Protocol for vocabulary-like objects used in training and artifacts.

get_id_token

get_id_token(idx: int) -> str | None

Return the token string for a given integer ID, or None if not found.

get_token_frequency

get_token_frequency(token: str) -> int

Return the frequency count for a given token.

get_token_id

get_token_id(token: str) -> int | None

Return the integer ID for a given token, or None if not found.

vocab_size

vocab_size() -> int

Return the total number of unique tokens in the vocabulary.

find_single_corpus_file

find_single_corpus_file(corpus_dir: Path) -> Path

Find the single corpus file in corpus/ (same rule as SimpleTokenizer).

repo_name_from_base_dir

repo_name_from_base_dir(base_dir: Path) -> str

Infer repository name from base directory.

sha256_of_bytes

sha256_of_bytes(data: bytes) -> str

Return hex SHA-256 digest for given bytes.

sha256_of_file

sha256_of_file(path: Path) -> str

Return hex SHA-256 digest for a file.
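Both hashing helpers are thin wrappers around the standard library; an illustrative version:

```python
import hashlib
from pathlib import Path


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return sha256_of_bytes(path.read_bytes())
```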

write_artifacts

write_artifacts(
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    model_kind: str,
    learning_rate: float,
    epochs: int,
    row_labeler: RowLabeler,
) -> None

Write all training artifacts to artifacts/.

Parameters:
  • base_dir (Path, required): Repository base directory.
  • corpus_path (Path, required): Corpus file used for training.
  • vocab (VocabularyLike, required): VocabularyLike instance.
  • model (SimpleNextTokenModel, required): Trained model (weights already updated).
  • model_kind (str, required): Human-readable model kind (e.g., "unigram", "bigram").
  • learning_rate (float, required): Training learning rate.
  • epochs (int, required): Number of training passes.
  • row_labeler (RowLabeler, required): Function that maps a model weight-row index to a label written in the first column of 02_model_weights.csv.

write_meta_json

write_meta_json(
    path: Path,
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab_size: int,
    model_kind: str,
    learning_rate: float,
    epochs: int,
) -> None

Write 00_meta.json describing corpus, model, and training settings.

This file is the authoritative, human-readable summary of a training run. It records:
  • what corpus was used
  • what model architecture was trained
  • how training was configured
  • which artifacts were produced

The intent is transparency and reproducibility.

Parameters:
  • path (Path, required): Output JSON path (artifacts/00_meta.json).
  • base_dir (Path, required): Repository base directory.
  • corpus_path (Path, required): Corpus file used for training.
  • vocab_size (int, required): Number of unique tokens.
  • model_kind (str, required): Human-readable model kind (e.g., "unigram", "bigram", "context2", "context3").
  • learning_rate (float, required): Training learning rate.
  • epochs (int, required): Number of epochs (full passes over the training pairs).

write_model_weights_csv

write_model_weights_csv(
    path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None

Write 02_model_weights.csv with token-labeled columns.

Shape
  • first column: input_token
  • remaining columns: one per output token (header names are tokens)

Parameters:
  • path (Path, required): Output CSV path.
  • vocab (VocabularyLike, required): Vocabulary instance (must provide vocab_size(), get_id_token()).
  • model (SimpleNextTokenModel, required): Trained model (must provide vocab_size and weights).
  • row_labeler (RowLabeler, required): Function that maps a model weight-row index to a label written in the first column.

write_token_embeddings_csv

write_token_embeddings_csv(
    path: Path,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None

Write 03_token_embeddings.csv as a simple 2D projection for plotting.

write_training_log

write_training_log(
    path: Path, history: list[dict[str, float]]
) -> None

Write per-epoch training metrics to a CSV file.

Parameters:
  • path (Path, required): Output file path.
  • history (list[dict[str, float]], required): List of per-epoch metrics dictionaries.

write_vocabulary_csv

write_vocabulary_csv(
    path: Path, vocab: VocabularyLike
) -> None

Write 01_vocabulary.csv: token_id, token, frequency.

Parameters:
  • path (Path, required): Output CSV path.
  • vocab (VocabularyLike, required): Vocabulary instance (must provide vocab_size(), get_id_token(), get_token_frequency()).

Marker File: py.typed

This package includes a py.typed marker file as defined by PEP 561.

  • Type checkers (Pyright, Mypy) trust inline type hints in installed packages only when this marker is present.
  • The file may be empty; comments are allowed.