API Reference¶
This page documents the public API of the package.
Documented functions below are considered stable.
Training Math¶
math_training ¶
math_training.py - Mathematical utilities used during model training.
This module contains reusable math functions used by the training and inference code in this repository.
Scope:

- Pure functions with no model, vocabulary, or artifact assumptions.
- Reusable across unigram, bigram, and higher-context variants.
These functions are intentionally simple and explicit to support inspection and debugging.
argmax ¶
Return the index of the maximum value in a list.
Concept
argmax is the argument (index) at which a function reaches its maximum.
Common uses
- Measuring accuracy during training (pick the most likely token)
- Greedy decoding during inference (choose the top prediction)
In training and inference
- A model outputs a probability distribution over possible next tokens.
- The token with the highest probability is the model's most confident prediction.
- argmax selects that token.
Example
values = [0.1, 0.7, 0.2] has indices 0, 1, 2 respectively. argmax(values) -> 1 (since 0.7 is the largest value)
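A minimal sketch of the documented behavior (the repository's own implementation may differ in details such as tie-breaking):

```python
def argmax(values: list[float]) -> int:
    """Return the index of the largest value (first occurrence wins on ties)."""
    if not values:
        raise ValueError("argmax requires a non-empty list")
    best_index = 0
    for i, v in enumerate(values):
        if v > values[best_index]:
            best_index = i
    return best_index

print(argmax([0.1, 0.7, 0.2]))  # 1
```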
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[float]` | A non-empty list of numeric values (typically logits or probabilities). | required |

Returns:

| Type | Description |
|---|---|
| `int` | The index of the largest value in the list. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If values is empty. |
cross_entropy_loss ¶
Compute cross-entropy loss for a single training example.
Cross-Entropy Loss
Cross-entropy measures how well a predicted probability distribution matches the true outcome.
In next-token prediction:

- The true distribution is "one-hot", meaning each entry is either 1 or 0:
  - Probability = 1.0 for the correct next token
  - Probability = 0.0 for all others
- The model predicts a probability distribution over all tokens.
Cross-entropy answers the question: "How well does the predicted probability distribution align with the true outcome?"
Formula
loss = -log(p_correct)
- If the model assigns high probability to the correct token, the loss is small.
- If the probability is near zero, the loss is large.
Numerical safety
log(0) is undefined, so we clamp probabilities to a small minimum (1e-12). This does not change learning behavior in practice, but prevents runtime errors.
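The formula and the clamp can be sketched together; this is a minimal version consistent with the documented behavior, not necessarily the repository's exact implementation:

```python
import math

def cross_entropy_loss(probs: list[float], target_id: int) -> float:
    """loss = -log(p_correct), with the documented 1e-12 clamp."""
    if not 0 <= target_id < len(probs):
        raise ValueError("target_id out of range for probs")
    # clamp: log(0) is undefined, so never take the log of anything below 1e-12
    p = max(probs[target_id], 1e-12)
    return -math.log(p)

print(round(cross_entropy_loss([0.1, 0.7, 0.2], 1), 4))  # 0.3567 (correct token is likely)
```

When the model assigns the correct token a probability near zero, the clamp caps the loss at about `-log(1e-12) ≈ 27.6` instead of raising an error.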
In training
- This loss value drives gradient descent.
- Lower loss means better predictions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `probs` | `list[float]` | A probability distribution over the vocabulary (sums to 1.0). | required |
| `target_id` | `int` | The integer ID of the correct next token. | required |

Returns:

| Type | Description |
|---|---|
| `float` | A non-negative floating-point loss value. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If target_id is out of range for probs. |
Models¶
c_model ¶
c_model.py - Simple model module.
Defines a minimal next-token prediction model for a context-2 setting (uses two tokens in sequence as context). A context-2 model computes P(next | previous, current).
Initial token sequence for demonstration:

    ((tokens[0], tokens[1]), tokens[2])
       prev       curr        next

Slide forward by one token for each prediction:

    ((tokens[1], tokens[2]), tokens[3])
       prev       curr        next
Responsibilities:

- Represent a simple parameterized model that maps a 2-tuple of token IDs (previous token, current token) to a score for each token in the vocabulary.
- Convert scores into probabilities using softmax.
- Provide a forward pass (no training in this module).
Comparing the context-2 model with the bigram model (train-200):

- Both models use the same mathematical structure: a conditional distribution p(next | previous, current).
- Both models use a weight table conceptually shaped as prev x curr x next (flattened for storage).
The difference is conceptual, not mathematical:

- The bigram model (train-200) is presented as a classical n-gram idea, emphasizing conditional next-token statistics.
- The context-2 model (train-300) reframes the same structure as a sliding context window, laying the foundation for:
  - context-3 models (train-400)
  - embeddings (train-500)
  - attention mechanisms (train-600)
Conceptually:

- Bigram answers: "What usually comes next after this pair?"
- Context-2 answers: "Given the recent local context, what comes next?"
- Results should be identical between the two models.
- The context-2 framing is more extensible for future models.
This model is intentionally simple:

- one weight table (conceptually a 3D tensor: prev x curr x next, flattened for storage)
- one forward computation
- no learning here
Training is handled in a different module.
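A sketch of how such a model could look. The attribute name `weights` and the zero initialization are assumptions for illustration, not guaranteed details of the actual module:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    # subtract the max score before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class SimpleNextTokenModel:
    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        # one row per (previous, current) pair, one column per candidate next token
        self.weights = [[0.0] * vocab_size for _ in range(vocab_size * vocab_size)]

    def forward(self, previous_id: int, current_id: int) -> list[float]:
        row = previous_id * self.vocab_size + current_id  # flatten 2D context to a row index
        return softmax(self.weights[row])
```

With zero-initialized weights, `forward()` returns a uniform distribution, which is a useful sanity check before any training has run.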
SimpleNextTokenModel ¶
A minimal next-token prediction model (context-2).
__init__ ¶
Initialize the model with a given vocabulary size.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Number of unique tokens in the vocabulary. | required |
forward ¶
Perform a forward pass to get next-token probabilities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `previous_id` | `int` | Integer ID of the previous token (t-1). | required |
| `current_id` | `int` | Integer ID of the current token (t). | required |

Returns:

| Type | Description |
|---|---|
| `list[float]` | Probabilities for each token in the vocabulary. |
Training Pipeline¶
d_train ¶
d_train.py - Training loop module.
Trains the SimpleNextTokenModel on a small token corpus using a context-2 window (two tokens of context).
Responsibilities:

- Create ((token_{t-1}, token_t) -> next_token) training pairs
- Run a basic gradient-descent training loop
- Track loss and accuracy per epoch
- Write a CSV log of training progress
- Write inspectable training artifacts (vocabulary, weights, embeddings, meta)
Concepts:

- context-2: predict the next token using (previous token, current token)
- epoch: one complete pass through all training pairs
- softmax: converts raw scores into probabilities (so predictions sum to 1)
- cross-entropy loss: measures how well predicted probabilities match the correct next token
- gradient descent: iterative weight updates to reduce prediction error; picture descending into a valley, where the valley floor corresponds to lower prediction error
Notes:

- This remains intentionally simple: no deep learning framework, no Transformer.
- The model generalizes n-gram training by expanding the context window.
- Training updates weight rows associated with the observed context-2 pattern.
- token_embeddings.csv is a visualization-friendly projection for levels 100-400; in later repos (500+), embeddings become a first-class learned table.
make_training_pairs ¶
Convert token IDs into ((t-1, t), next) training pairs.
Example
    ids = [3, 1, 2, 4]
    pairs = [((3, 1), 2), ((1, 2), 4)]
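The sliding-window pairing can be sketched in one comprehension (a minimal version; the actual module may validate input length):

```python
def make_training_pairs(ids: list[int]) -> list[tuple[tuple[int, int], int]]:
    # each pair: ((token_{t-1}, token_t), token_{t+1})
    return [((ids[i], ids[i + 1]), ids[i + 2]) for i in range(len(ids) - 2)]

print(make_training_pairs([3, 1, 2, 4]))  # [((3, 1), 2), ((1, 2), 4)]
```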
row_labeler_context2 ¶
Map a context-2 row index back to a readable label like 'the|cat'.
This reverses the flattening done by token_row_index_context2().
Example with vocab_size=10:

    row 0  -> (0, 0) -> "tok_0|tok_0"
    row 25 -> (2, 5) -> "tok_2|tok_5"
The math (integer division and modulo) undoes the flattening:

    token_id_{t-1} = row_index // vocab_size   (which "block" of vocab_size)
    token_id_t     = row_index % vocab_size    (position within that block)
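The unflattening arithmetic as runnable code; the exact signature here (passing vocab_size and a token-lookup function explicitly) is an assumption for illustration:

```python
def row_labeler_context2(row_index: int, vocab_size: int, id_to_token) -> str:
    prev_id = row_index // vocab_size  # which "block" of vocab_size rows
    curr_id = row_index % vocab_size   # position within that block
    return f"{id_to_token(prev_id)}|{id_to_token(curr_id)}"

print(row_labeler_context2(25, 10, lambda i: f"tok_{i}"))  # tok_2|tok_5
```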
token_row_index_context2 ¶
Return the row index for a context-2 token sequence.
We need to map a 2D context (previous, current) to a 1D row index. This is like converting 2D coordinates to a 1D array index.
Example with vocab_size=10:

    context (0, 0) -> row 0
    context (0, 1) -> row 1
    context (0, 9) -> row 9
    context (1, 0) -> row 10
    context (1, 1) -> row 11
    context (2, 5) -> row 25
Formula
row_index = token_id_{t-1} * vocab_size + token_id_t
This creates vocab_size * vocab_size unique rows, one for each possible (previous, current) pair.
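The formula as code; passing vocab_size explicitly is an assumption about the signature:

```python
def token_row_index_context2(prev_id: int, curr_id: int, vocab_size: int) -> int:
    # same idea as row-major indexing into a flattened 2D array
    return prev_id * vocab_size + curr_id

print(token_row_index_context2(2, 5, 10))  # 25
```

Note the round trip with the row labeler's arithmetic: `divmod(25, 10)` recovers `(2, 5)`.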
train_model ¶
train_model(
model: SimpleNextTokenModel,
pairs: list[Context2Pair],
learning_rate: float,
epochs: int,
) -> list[dict[str, float]]
Train the model using gradient descent on softmax cross-entropy (context-2).
Training proceeds in epochs (full passes through all training pairs). For each pair, we:

1. Compute the model's predicted probabilities (forward pass).
2. Measure how wrong the prediction was (loss).
3. Adjust weights to reduce the loss (gradient descent).
Each example
    context_ids = (token_id_{t-1}, token_id_t)
    target_id = token_id_{t+1}
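Softmax combined with cross-entropy has a notably simple per-example gradient: the predicted probabilities minus the one-hot target. A hypothetical helper (the name `sgd_step` is an assumption) showing one update on the weight row selected by the context:

```python
def sgd_step(weights_row: list[float], probs: list[float],
             target_id: int, learning_rate: float) -> None:
    # d(loss)/d(score_j) = probs[j] - one_hot(target)[j]
    for j in range(len(weights_row)):
        grad = probs[j] - (1.0 if j == target_id else 0.0)
        weights_row[j] -= learning_rate * grad

row = [0.0, 0.0, 0.0]
sgd_step(row, [1/3, 1/3, 1/3], target_id=1, learning_rate=0.3)
print(row)  # the correct token's weight rises, the others fall
```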
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `SimpleNextTokenModel` | The model to train (weights will be modified in place). | required |
| `pairs` | `list[Context2Pair]` | List of Context2Pair training pairs. | required |
| `learning_rate` | `float` | Step size for gradient descent. Larger values learn faster but may overshoot; smaller values are more stable but slower. | required |
| `epochs` | `int` | Number of complete passes through the training data. | required |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, float]]` | List of per-epoch metrics dictionaries containing epoch number, average loss, and accuracy. |
Inference¶
e_infer ¶
e_infer.py - Inference module (artifact-driven).
Runs inference using previously saved training artifacts.
Responsibilities:

- Load inspectable training artifacts from artifacts/:
  - 00_meta.json
  - 01_vocabulary.csv
  - 02_model_weights.csv
- Reconstruct a vocabulary-like interface and model weights
- Generate tokens using greedy decoding (argmax)
- Print top-k next-token probabilities for inspection
Notes:

- This module does NOT retrain by default.
- If artifacts are missing, run d_train.py first.
Context-2 startup
Generation requires 2 tokens to form the initial context (prev, curr). These can be provided via --start-tokens or default to start_tokens saved in 00_meta.json during training (the first 2 tokens from the corpus).
ArtifactVocabulary dataclass ¶
Vocabulary reconstructed from artifacts/01_vocabulary.csv.
Provides the same surface area used by inference:

- vocab_size()
- get_token_id()
- get_id_token()
- get_token_frequency()
generate_tokens_context2 ¶
generate_tokens_context2(
model: SimpleNextTokenModel,
vocab: ArtifactVocabulary,
start_tokens: tuple[str, str],
num_tokens: int,
) -> list[str]
Generate tokens using a context-2 window (prev, curr).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `SimpleNextTokenModel` | Trained context-2 model. | required |
| `vocab` | `ArtifactVocabulary` | Vocabulary for token <-> ID conversion. | required |
| `start_tokens` | `tuple[str, str]` | Tuple of (prev_token, curr_token) to start generation. | required |
| `num_tokens` | `int` | Number of new tokens to generate. | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of tokens: [prev, curr, generated_1, generated_2, ...]. |
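A sketch of greedy context-2 generation consistent with the documented return shape, using the vocabulary methods listed under ArtifactVocabulary (the loop body is illustrative, not the repository's exact code):

```python
def generate_tokens_context2(model, vocab, start_tokens, num_tokens):
    prev_id = vocab.get_token_id(start_tokens[0])
    curr_id = vocab.get_token_id(start_tokens[1])
    out = list(start_tokens)
    for _ in range(num_tokens):
        probs = model.forward(prev_id, curr_id)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        out.append(vocab.get_id_token(next_id))
        prev_id, curr_id = curr_id, next_id  # slide the context window forward
    return out
```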
load_model_weights_csv ¶
Load 02_model_weights.csv -> weights matrix.
load_vocabulary_csv ¶
Load 01_vocabulary.csv -> ArtifactVocabulary.
require_artifacts ¶
require_artifacts(
*,
meta_path: Path,
vocab_path: Path,
weights_path: Path,
train_hint: str,
) -> None
Fail fast with a helpful message if artifacts are missing.
top_k ¶
Return top-k (token_id, probability) pairs sorted by probability.
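A minimal sketch of the documented behavior:

```python
def top_k(probs: list[float], k: int) -> list[tuple[int, float]]:
    # pair each token ID with its probability, then sort by probability descending
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

print(top_k([0.1, 0.7, 0.2], 2))  # [(1, 0.7), (2, 0.2)]
```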
Artifacts and I/O¶
io_artifacts ¶
io_artifacts.py - Input/output and training-artifact utilities used by the models.
This module is responsible for persisting and describing the results of model training in a consistent, inspectable format.
It does not perform training. It:

- Writes artifacts produced by training (weights, vocabulary, logs, metadata)
- Assumes a conventional repository layout for reproducibility
- Provides small helper utilities shared across training and inference
The expected directory structure is:

- artifacts/ contains all inspectable model outputs
- corpus/ contains training text files (often exactly one)
- outputs/ contains training logs and diagnostics
External callers should treat paths as implementation details and interact through the functions provided here.
Concepts¶
Artifact: A concrete file written to disk that captures some aspect of training. In this project, artifacts are designed to be:

- Human-readable (CSV / JSON)
- Stable across model variants (unigram, bigram, context-3, etc.)
- Reusable by inference without retraining

Epoch: One epoch is one complete pass through all training examples. Training typically consists of multiple epochs so the model can gradually improve its predictions by repeatedly adjusting weights.

Training Log: A CSV file recording per-epoch metrics such as:

- average loss
- accuracy

This allows learning behavior to be inspected after training.

Vocabulary: A mapping between token strings and integer token IDs. The vocabulary defines:

- the size of the model output space
- the meaning of each row and column in the weight tables

Row Labeler: A small function that maps a numeric row index in the model's weight table to a human-readable label. As the number of context tokens increases, the row labeler produces context strings such as:

- unigram: "cat"
- bigram: "the|cat"
- context-3: "the|black|cat"

Row labels are written into CSV artifacts to make model structure visible.

Model Weights: Numeric parameters learned during training. Conceptually:

- each row corresponds to an input context
- each column corresponds to a possible next token

Weights are written verbatim so learning can be inspected or reused.

Token Embeddings (Derived): A simple 2D projection derived from model weights for visualization. These are not learned embeddings yet. In later stages (500+), embeddings become first-class learned parameters.

Reproducibility Metadata: The 00_meta.json file records:

- which corpus was used
- how it was hashed
- which model variant was trained
- what training settings were applied

This allows results to be traced and compared across runs and repositories.
Design Notes¶
- This module is shared unchanged across model levels (100-400).
- More advanced pipelines (embeddings, attention, batching) build on the same artifact-writing concepts.
- Centralizing I/O logic prevents drift across repositories and keeps training code focused on learning.
VocabularyLike ¶
Bases: Protocol
Protocol for vocabulary-like objects used in training and artifacts.
artifact_paths_from_base_dir ¶
Return standard artifact paths under base_dir/artifacts/.
artifacts_dir_from_base_dir ¶
Return artifacts/ directory under a repository base directory.
find_single_corpus_file ¶
Find the single corpus file in corpus/ (same rule as SimpleTokenizer).
outputs_dir_from_base_dir ¶
Return outputs/ directory under a repository base directory.
repo_name_from_base_dir ¶
Infer repository name from base directory.
write_artifacts ¶
write_artifacts(
*,
base_dir: Path,
corpus_path: Path,
vocab: VocabularyLike,
model: SimpleNextTokenModel,
model_kind: str,
learning_rate: float,
epochs: int,
row_labeler: RowLabeler,
) -> None
Write all training artifacts to artifacts/.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `base_dir` | `Path` | Repository base directory. | required |
| `corpus_path` | `Path` | Corpus file used for training. | required |
| `vocab` | `VocabularyLike` | VocabularyLike instance. | required |
| `model` | `SimpleNextTokenModel` | Trained model (weights already updated). | required |
| `model_kind` | `str` | Human-readable model kind (e.g., "unigram", "bigram"). | required |
| `learning_rate` | `float` | Training learning rate. | required |
| `epochs` | `int` | Number of training passes. | required |
| `row_labeler` | `RowLabeler` | Function that maps a model weight-row index to a label written in the first column of 02_model_weights.csv. | required |
write_meta_json ¶
write_meta_json(
path: Path,
*,
base_dir: Path,
corpus_path: Path,
vocab_size: int,
model_kind: str,
learning_rate: float,
epochs: int,
) -> None
Write 00_meta.json describing corpus, model, and training settings.
This file is the authoritative, human-readable summary of a training run. It records:

- what corpus was used
- what model architecture was trained
- how training was configured
- which artifacts were produced
The intent is transparency and reproducibility.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Output JSON path (artifacts/00_meta.json). | required |
| `base_dir` | `Path` | Repository base directory. | required |
| `corpus_path` | `Path` | Corpus file used for training. | required |
| `vocab_size` | `int` | Number of unique tokens. | required |
| `model_kind` | `str` | Human-readable model kind (e.g., "unigram", "bigram", "context2", "context3"). | required |
| `learning_rate` | `float` | Training learning rate. | required |
| `epochs` | `int` | Number of epochs (full passes over the training pairs). | required |
write_model_weights_csv ¶
write_model_weights_csv(
path: Path,
vocab: VocabularyLike,
model: SimpleNextTokenModel,
*,
row_labeler: RowLabeler,
) -> None
Write 02_model_weights.csv with token-labeled columns.
Shape
- first column: input_token (serialized context label)
- remaining columns: one per output token (weights)
Notes
- Tokens may be typed internally.
- This function serializes all token labels explicitly at the I/O boundary.
write_token_embeddings_csv ¶
write_token_embeddings_csv(
path: Path,
model: SimpleNextTokenModel,
*,
row_labeler: RowLabeler,
) -> None
Write 03_token_embeddings.csv as a simple 2D projection.
This file is a derived visualization artifact, not a learned embedding table.
For each model weight row
- x coordinate = first weight (if present)
- y coordinate = second weight (if present)
If a row has fewer than two weights, missing values default to 0.0.
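The projection rule can be expressed as a small helper (the name `project_row_2d` is hypothetical; the actual module writes CSV rows directly):

```python
def project_row_2d(weights_row: list[float]) -> tuple[float, float]:
    # x = first weight, y = second weight; missing values default to 0.0
    x = weights_row[0] if len(weights_row) > 0 else 0.0
    y = weights_row[1] if len(weights_row) > 1 else 0.0
    return (x, y)

print(project_row_2d([0.5]))  # (0.5, 0.0)
```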
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Output CSV path. | required |
| `model` | `SimpleNextTokenModel` | Trained model providing a weight matrix. | required |
| `row_labeler` | `RowLabeler` | Function mapping row index to a human-readable label. | required |
write_training_log ¶
Write per-epoch training metrics to a CSV file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Output file path. | required |
| `history` | `list[dict[str, float]]` | List of per-epoch metrics dictionaries. | required |
write_vocabulary_csv ¶
Write 01_vocabulary.csv: token_id, token, frequency.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Output CSV path. | required |
| `vocab` | `VocabularyLike` | Vocabulary instance (must provide vocab_size(), get_id_token(), get_token_frequency()). | required |
Marker File: py.typed¶
This package includes a py.typed marker file as defined by PEP 561.
- Type checkers (Pyright, Mypy) trust inline type hints in installed packages only when this marker is present.
- The file may be empty; comments are allowed.