API Reference¶
This page documents the public API of the package.
Documented functions below are considered stable.
Training Math¶
math_training ¶
math_training.py - Mathematical utilities used during model training.
This module contains reusable math functions used by the training and inference code in this repository.
Scope:

- Pure functions with no model, vocabulary, or artifact assumptions.
- Reusable across unigram, bigram, and higher-context variants.
These functions are intentionally simple and explicit to support inspection and debugging.
argmax ¶
Return the index of the maximum value in a list.
Concept
argmax is the argument (index) at which a function reaches its maximum.
Common uses
- Measuring accuracy during training (pick the most likely token)
- Greedy decoding during inference (choose the top prediction)
In training and inference
- A model outputs a probability distribution over possible next tokens.
- The token with the highest probability is the model's most confident prediction.
- argmax selects that token.
Example
values = [0.1, 0.7, 0.2] has indices 0, 1, and 2. argmax(values) -> 1, since 0.7 is the largest value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| values | list[float] | A non-empty list of numeric values (typically logits or probabilities). | required |

Returns:

| Type | Description |
|---|---|
| int | The index of the largest value in the list. |

Raises:

| Type | Description |
|---|---|
| ValueError | If values is empty. |
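A minimal sketch of the documented behavior (the repository's actual implementation may differ in detail):

```python
def argmax(values: list[float]) -> int:
    """Return the index of the maximum value in a list."""
    if not values:
        raise ValueError("argmax requires a non-empty list")
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best

print(argmax([0.1, 0.7, 0.2]))  # -> 1
```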
cross_entropy_loss ¶
Compute cross-entropy loss for a single training example.
Cross-Entropy Loss
Cross-entropy measures how well a predicted probability distribution matches the true outcome.
In next-token prediction:

- The true distribution is "one-hot", which means we encode it as either 1 or 0:
    - Probability = 1.0 for the correct next token
    - Probability = 0.0 for all others
- The model predicts a probability distribution over all tokens.
Cross-entropy answers the question: "How well does the predicted probability distribution align with the true outcome?"
Formula
loss = -log(p_correct)
- If the model assigns high probability to the correct token, the loss is small.
- If the probability is near zero, the loss is large.
Numerical safety
log(0) is undefined, so we clamp probabilities to a small minimum (1e-12). This does not change learning behavior in practice, but prevents runtime errors.
In training
- This loss value drives gradient descent.
- Lower loss means better predictions.
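The formula and the clamping described above can be sketched as follows (a sketch, not necessarily the exact implementation):

```python
import math

def cross_entropy_loss(probs: list[float], target_id: int) -> float:
    """loss = -log(p_correct), clamped so log(0) never occurs."""
    if not 0 <= target_id < len(probs):
        raise ValueError("target_id out of range for probs")
    p_correct = max(probs[target_id], 1e-12)  # numerical safety clamp
    return -math.log(p_correct)

# Confident correct prediction -> small loss; near-zero probability -> large loss.
print(cross_entropy_loss([0.1, 0.7, 0.2], 1))  # about 0.357
```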
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| probs | list[float] | A probability distribution over the vocabulary (sums to 1.0). | required |
| target_id | int | The integer ID of the correct next token. | required |

Returns:

| Type | Description |
|---|---|
| float | A non-negative floating-point loss value. |

Raises:

| Type | Description |
|---|---|
| ValueError | If target_id is out of range for probs. |
Models¶
c_model ¶
c_model.py - Simple model module.
Defines a minimal next-token prediction model for unigram (no context). A unigram models P(next) - just word frequencies, ignoring all context.
Responsibilities:

- Represent a simple parameterized model that outputs the same probability distribution regardless of input.
- Convert scores into probabilities using softmax.
- Provide a forward pass (no training in this module).

This model is intentionally simple:

- one weight vector (1D: just next_token scores)
- one forward computation that ignores input
- no learning here
Training is handled in a different module.
SimpleNextTokenModel ¶
A minimal next-token prediction model (unigram - no context).
Unigram ignores all context and predicts based solely on corpus word frequencies: P(next).
forward ¶
Perform a forward pass.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| current_id | int \| None | Ignored for unigram - included for API consistency. | None |

Returns:

| Type | Description |
|---|---|
| list[float] | Probability distribution over next tokens (same for all inputs). |
Training Pipeline¶
d_train ¶
d_train.py - Training loop module.
Trains the SimpleNextTokenModel on a small token corpus using unigram (no context - just word frequencies).
A unigram models P(next) - the probability of each word based purely on how often it appears in the corpus, ignoring all context.
Responsibilities:

- Count token frequencies in the corpus
- Train a single row of weights to predict based on frequency
- Track loss and accuracy per epoch
- Write a CSV log of training progress
- Write inspectable training artifacts (vocabulary, weights, embeddings, meta)

Concepts:

- unigram: predict next token using only corpus frequencies (no context)
- softmax: converts raw scores into probabilities (so predictions sum to 1)
- cross-entropy loss: measures how well predicted probabilities match the correct token
- gradient descent: iterative weight updates to minimize loss

Notes:

- This is intentionally simple: no deep learning framework, no Transformer.
- The model has only ONE row of weights (predictions are context-independent).
- Training updates the same single row for every example.
- token_embeddings.csv is a visualization-friendly projection for levels 100-400; in later repos (500+), embeddings become a first-class learned table.
make_training_targets ¶
Extract training targets for unigram model.
For unigram, we don't need (input, target) pairs because the model ignores input. We just need the list of all tokens that appear in the corpus - each one is a target to predict.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| token_ids | list[int] | Sequence of integer token IDs from the corpus. | required |

Returns:

| Type | Description |
|---|---|
| list[int] | List of target token IDs (all tokens in corpus). |
Example
Token sequence "the cat sat" with IDs [3, 1, 2] produces [3, 1, 2]: the model should learn to predict these tokens according to their frequency (ID 3 appears once, ID 1 appears once, etc.).
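Given the description above, the function reduces to a copy of the corpus token IDs (a sketch; the real implementation may differ):

```python
def make_training_targets(token_ids: list[int]) -> list[int]:
    # Unigram ignores input, so every corpus token is itself a target.
    return list(token_ids)  # copy, so the caller's list is untouched

print(make_training_targets([3, 1, 2]))  # -> [3, 1, 2]
```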
row_labeler_unigram ¶
Map a unigram row index to a label.
Unigram has only one row, labeled to indicate it's context-free.
train_model ¶
```python
train_model(
    model: SimpleNextTokenModel,
    targets: list[int],
    learning_rate: float,
    epochs: int,
) -> list[dict[str, float]]
```
Train the unigram model using gradient descent on softmax cross-entropy.
Unigram training learns corpus frequencies. The model has a single row of weights that gets updated for every token in the corpus.
Training proceeds in epochs (full passes through all tokens). For each token, we:

1. Compute the model's predicted probabilities (forward pass).
2. Measure how wrong the prediction was (loss).
3. Adjust weights to reduce the loss (gradient descent).
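The three steps above can be sketched for a single token. The gradient of softmax cross-entropy with respect to each score is probs[j] minus the one-hot target, a standard result; the helper names here are illustrative, not the repository's exact API:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(weights: list[float], target_id: int, learning_rate: float) -> float:
    probs = softmax(weights)                        # 1. forward pass
    loss = -math.log(max(probs[target_id], 1e-12))  # 2. cross-entropy loss
    for j in range(len(weights)):                   # 3. gradient descent
        grad = probs[j] - (1.0 if j == target_id else 0.0)
        weights[j] -= learning_rate * grad
    return loss

weights = [0.0, 0.0, 0.0]
losses = [train_step(weights, target_id=1, learning_rate=0.5) for _ in range(20)]
print(losses[0] > losses[-1])  # loss falls as the weights learn the target
```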
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | SimpleNextTokenModel | The model to train (weights will be modified in place). | required |
| targets | list[int] | List of target token IDs from the corpus. | required |
| learning_rate | float | Step size for gradient descent. | required |
| epochs | int | Number of complete passes through the training data. | required |
Returns:

| Type | Description |
|---|---|
| list[dict[str, float]] | List of per-epoch metrics dictionaries containing epoch number, average loss, and accuracy. |
Inference¶
e_infer ¶
e_infer.py - Inference module (artifact-driven).
Runs inference using previously saved training artifacts.
Responsibilities:

- Load inspectable training artifacts from artifacts/
    - 00_meta.json
    - 01_vocabulary.csv
    - 02_model_weights.csv
- Reconstruct a vocabulary-like interface and model weights
- Generate tokens using greedy decoding (argmax)
- Print top-k next-token probabilities for inspection

Notes:

- This module does NOT retrain by default.
- If artifacts are missing, run d_train.py first.
Unigram inference
The model ignores all context and predicts based solely on corpus word frequencies. Every call to forward() returns the same probability distribution.
ArtifactVocabulary dataclass ¶
Vocabulary reconstructed from artifacts/01_vocabulary.csv.
Provides the same surface area used by inference:

- vocab_size()
- get_token_id()
- get_id_token()
- get_token_frequency()
generate_tokens_unigram ¶
```python
generate_tokens_unigram(
    model: SimpleNextTokenModel,
    vocab: ArtifactVocabulary,
    num_tokens: int,
) -> list[str]
```
Generate tokens using unigram (no context - same prediction every time).
Note: Unigram ignores all context, so there's no start token. Every generated token will be the same (the most frequent word).
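A hypothetical simplification of that behavior: the real function takes a model and an ArtifactVocabulary, but plain lists are enough to show why greedy decoding over a context-free distribution repeats one token.

```python
def argmax(values: list[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])

def generate_tokens_unigram(probs: list[float], id_to_token: list[str], num_tokens: int) -> list[str]:
    # The distribution never changes, so every greedy step picks the
    # same (most frequent) token.
    best = argmax(probs)
    return [id_to_token[best]] * num_tokens

print(generate_tokens_unigram([0.2, 0.5, 0.3], ["the", "cat", "sat"], 3))
# -> ['cat', 'cat', 'cat']
```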
load_model_weights_csv ¶
Load 02_model_weights.csv -> weights matrix.
For unigram, expects exactly 1 row (context-independent predictions).
load_vocabulary_csv ¶
Load 01_vocabulary.csv -> ArtifactVocabulary.
require_artifacts ¶
```python
require_artifacts(
    *,
    meta_path: Path,
    vocab_path: Path,
    weights_path: Path,
    train_hint: str,
) -> None
```
Fail fast with a helpful message if artifacts are missing.
top_k ¶
Return top-k (token_id, probability) pairs sorted by probability.
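A sketch of the documented contract, assuming a plain probability list as input:

```python
def top_k(probs: list[float], k: int) -> list[tuple[int, float]]:
    # Pair each token ID with its probability, highest probability first.
    pairs = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    return pairs[:k]

print(top_k([0.1, 0.7, 0.2], 2))  # -> [(1, 0.7), (2, 0.2)]
```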
Artifacts and I/O¶
io_artifacts ¶
io_artifacts.py - Input/output and training-artifact utilities used by the models.
This module is responsible for persisting and describing the results of model training in a consistent, inspectable format.
It does not perform training. It:

- Writes artifacts produced by training (weights, vocabulary, logs, metadata)
- Assumes a conventional repository layout for reproducibility
- Provides small helper utilities shared across training and inference

The expected directory structure is:

- artifacts/ contains all inspectable model outputs
- corpus/ contains training text files (often exactly one)
- outputs/ contains training logs and diagnostics
External callers should treat paths as implementation details and interact through the functions provided here.
Concepts¶
Artifact: A concrete file written to disk that captures some aspect of training. In this project, artifacts are designed to be:

- Human-readable (CSV / JSON)
- Stable across model variants (unigram, bigram, context-3, etc.)
- Reusable by inference without retraining

Epoch: One epoch is one complete pass through all training examples. Training typically consists of multiple epochs so the model can gradually improve its predictions by repeatedly adjusting weights.

Training Log: A CSV file recording per-epoch metrics such as:

- average loss
- accuracy

This allows learning behavior to be inspected after training.

Vocabulary: A mapping between token strings and integer token IDs. The vocabulary defines:

- the size of the model output space
- the meaning of each row and column in the weight tables

Row Labeler: A small function that maps a numeric row index in the model's weight table to a human-readable label. For example, as the number of context tokens increases, the row labeler produces context strings such as:

- unigram: "cat"
- bigram: "the|cat"
- context-3: "the|black|cat"

Row labels are written into CSV artifacts to make model structure visible.
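The idea can be sketched with a hypothetical helper (not part of this module's API) that joins context token IDs into a "|"-separated label:

```python
def make_context_labeler(id_to_token: dict[int, str], contexts: list[tuple[int, ...]]):
    """Return a row labeler mapping a weight-row index to a '|'-joined context string."""
    def label(row_index: int) -> str:
        return "|".join(id_to_token[t] for t in contexts[row_index])
    return label

id_to_token = {0: "the", 1: "black", 2: "cat"}
labeler = make_context_labeler(id_to_token, contexts=[(0, 2), (0, 1, 2)])
print(labeler(0))  # -> the|cat
print(labeler(1))  # -> the|black|cat
```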
Model Weights: Numeric parameters learned during training. Conceptually:

- each row corresponds to an input context
- each column corresponds to a possible next token

Weights are written verbatim so learning can be inspected or reused.

Token Embeddings (Derived): A simple 2D projection derived from model weights for visualization. These are not learned embeddings yet. In later stages (500+), embeddings become first-class learned parameters.

Reproducibility Metadata: The 00_meta.json file records:

- which corpus was used
- how it was hashed
- which model variant was trained
- what training settings were applied

This allows results to be traced and compared across runs and repositories.
Design Notes¶
- This module is shared unchanged across model levels (100-400).
- More advanced pipelines (embeddings, attention, batching) build on the same artifact-writing concepts.
- Centralizing I/O logic prevents drift across repositories and keeps training code focused on learning.
VocabularyLike ¶
Bases: Protocol
Protocol for vocabulary-like objects used in training and artifacts.
artifact_paths_from_base_dir ¶
Return standard artifact paths under base_dir/artifacts/.
artifacts_dir_from_base_dir ¶
Return artifacts/ directory under a repository base directory.
find_single_corpus_file ¶
Find the single corpus file in corpus/ (same rule as SimpleTokenizer).
outputs_dir_from_base_dir ¶
Return outputs/ directory under a repository base directory.
repo_name_from_base_dir ¶
Infer repository name from base directory.
write_artifacts ¶
```python
write_artifacts(
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    model_kind: str,
    learning_rate: float,
    epochs: int,
    row_labeler: RowLabeler,
) -> None
```
Write all training artifacts to artifacts/.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| base_dir | Path | Repository base directory. | required |
| corpus_path | Path | Corpus file used for training. | required |
| vocab | VocabularyLike | VocabularyLike instance. | required |
| model | SimpleNextTokenModel | Trained model (weights already updated). | required |
| model_kind | str | Human-readable model kind (e.g., "unigram", "bigram"). | required |
| learning_rate | float | Training learning rate. | required |
| epochs | int | Number of training passes. | required |
| row_labeler | RowLabeler | Function that maps a model weight-row index to a label written in the first column of 02_model_weights.csv. | required |
write_meta_json ¶
```python
write_meta_json(
    path: Path,
    *,
    base_dir: Path,
    corpus_path: Path,
    vocab_size: int,
    model_kind: str,
    learning_rate: float,
    epochs: int,
) -> None
```
Write 00_meta.json describing corpus, model, and training settings.
This file is the authoritative, human-readable summary of a training run. It records:

- what corpus was used
- what model architecture was trained
- how training was configured
- which artifacts were produced
The intent is transparency and reproducibility.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output JSON path (artifacts/00_meta.json). | required |
| base_dir | Path | Repository base directory. | required |
| corpus_path | Path | Corpus file used for training. | required |
| vocab_size | int | Number of unique tokens. | required |
| model_kind | str | Human-readable model kind (e.g., "unigram", "bigram", "context2", "context3"). | required |
| learning_rate | float | Training learning rate. | required |
| epochs | int | Number of epochs (full passes over the training pairs). | required |
write_model_weights_csv ¶
```python
write_model_weights_csv(
    path: Path,
    vocab: VocabularyLike,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None
```
Write 02_model_weights.csv with token-labeled columns.
Shape
- first column: input_token (serialized context label)
- remaining columns: one per output token (weights)
Notes
- Tokens may be typed internally.
- This function serializes all token labels explicitly at the I/O boundary.
write_token_embeddings_csv ¶
```python
write_token_embeddings_csv(
    path: Path,
    model: SimpleNextTokenModel,
    *,
    row_labeler: RowLabeler,
) -> None
```
Write 03_token_embeddings.csv as a simple 2D projection.
This file is a derived visualization artifact, not a learned embedding table.
For each model weight row
- x coordinate = first weight (if present)
- y coordinate = second weight (if present)
If a row has fewer than two weights, missing values default to 0.0.
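The projection rule above can be sketched as a small helper (an illustrative name, not this module's API):

```python
def project_rows(weights: list[list[float]]) -> list[tuple[float, float]]:
    # x = first weight, y = second weight; missing values default to 0.0.
    points = []
    for row in weights:
        x = row[0] if len(row) > 0 else 0.0
        y = row[1] if len(row) > 1 else 0.0
        points.append((x, y))
    return points

print(project_rows([[1.5, -0.2, 0.9], [0.3]]))  # -> [(1.5, -0.2), (0.3, 0.0)]
```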
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output CSV path. | required |
| model | SimpleNextTokenModel | Trained model providing a weight matrix. | required |
| row_labeler | RowLabeler | Function mapping row index to a human-readable label. | required |
write_training_log ¶
Write per-epoch training metrics to a CSV file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output file path. | required |
| history | list[dict[str, float]] | List of per-epoch metrics dictionaries. | required |
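A sketch of how such a log could be written with the standard library, assuming each history entry is a flat metrics dict; the column names here (epoch, avg_loss, accuracy) are illustrative:

```python
import csv
import tempfile
from pathlib import Path

def write_training_log(path: Path, history: list[dict[str, float]]) -> None:
    # One CSV row per epoch; column names come from the first metrics dict.
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(history[0].keys()))
        writer.writeheader()
        writer.writerows(history)

history = [
    {"epoch": 1, "avg_loss": 1.21, "accuracy": 0.40},
    {"epoch": 2, "avg_loss": 0.95, "accuracy": 0.55},
]
log_path = Path(tempfile.gettempdir()) / "training_log.csv"
write_training_log(log_path, history)
print(log_path.read_text().splitlines()[0])  # -> epoch,avg_loss,accuracy
```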
write_vocabulary_csv ¶
Write 01_vocabulary.csv: token_id, token, frequency.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Output CSV path. | required |
| vocab | VocabularyLike | Vocabulary instance (must provide vocab_size(), get_id_token(), get_token_frequency()). | required |
Marker File: py.typed¶
This package includes a py.typed marker file as defined by PEP 561.
- Type checkers (Pyright, Mypy) trust inline type hints in installed packages only when this marker is present.
- The file may be empty; comments are allowed.