A feedforward network with backpropagation built from scratch. No framework – every computation step is transparent.
1 · Example & data
Training data are the examples from which the network is meant to discover the rule itself – never the rule directly. Each row is one example: inputs on the left, the desired outputs on the right (after the >). The network compares its prediction with the target and adjusts. What to expect: the presets below load ready-made tasks (curves, boundaries, mini-images). You can also freely change values or type your own data. Rule of thumb: more and cleaner examples → better prediction; too few → the network guesses.
First row = column names. Separate inputs from outputs with ">". Values comma-separated. Any number of columns and rows.
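The row format described above can be read with a few lines of code. This is a minimal sketch, not the page's own parser; the function name `parse_dataset` and the sample column names are made up for illustration.

```python
def parse_dataset(text):
    """Parse rows like 'a,b>c' into column names, input rows and target rows."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]

    def split_row(line):
        # Everything left of ">" is input, everything right of it is output.
        left, right = line.split(">")
        return ([c.strip() for c in left.split(",")],
                [c.strip() for c in right.split(",")])

    in_names, out_names = split_row(lines[0])  # first row = column names
    inputs, targets = [], []
    for line in lines[1:]:
        left, right = split_row(line)
        inputs.append([float(v) for v in left])
        targets.append([float(v) for v in right])
    return in_names, out_names, inputs, targets


sample = """
size,weight>time
1,50>180
2,60>240
"""
names_in, names_out, X, Y = parse_dataset(sample)
print(names_in, names_out, X, Y)
```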
Holds part of the data out of training and measures the error only on it. This reveals overfitting: if the training error keeps dropping while the test error rises again, the network is memorising the data instead of the rule. Try "Line + noise" with a large network.
When part of the data is held out, a second, red curve for the test data appears in the "Loss curve" tab.
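Holding out test rows can be sketched as a shuffled split. This is one possible approach under assumptions: the helper name `split_holdout`, the fixed seed and the rounding rule are illustrative and not taken from the page.

```python
import random


def split_holdout(rows, test_fraction, seed=0):
    """Shuffle row indices and set aside a fraction of the rows for testing."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))
train_rows, test_rows = split_holdout(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))
```

The network would be trained on `train_rows` only, while the error on `test_rows` is tracked separately.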
2 · Architecture
Architecture: here you define how the network is built – task type, number and size of hidden layers, activation function, optimiser, learning rate and scaling. More/larger layers can learn more complex things but need more data and are more prone to overfitting.
Regression: the network predicts a continuous number (e.g. cooking time). Linear output, error measure MSE. Classification: the network decides between class 0 and 1 (e.g. inside/outside). Sigmoid output 0..1, error measure BCE (cross-entropy), plus accuracy.
Sets the output function and error measure. For classification the target values stay raw (0/1) and are not scaled.
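The two error measures and the accuracy figure can be written out as follows. This is a sketch using the standard textbook definitions; whether the page averages in exactly this way, and the 0.5 threshold for accuracy, are assumptions.

```python
import math


def mse(predictions, targets):
    """Mean squared error for regression."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def bce(probabilities, targets, eps=1e-12):
    """Binary cross-entropy for targets that are 0 or 1."""
    total = 0.0
    for p, t in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)


def accuracy(probabilities, targets):
    """Share of rows where the rounded sigmoid output matches the class."""
    hits = sum((p > 0.5) == (t == 1) for p, t in zip(probabilities, targets))
    return hits / len(targets)


print(mse([1.0, 2.0], [1.0, 4.0]))
print(bce([0.5], [1]))
print(accuracy([0.9, 0.2, 0.6], [1, 0, 0]))
```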
Each number is one layer, the value is the number of neurons in it. 8,8 = two layers with 8 neurons each. Each neuron computes a weighted sum of its inputs plus a bias and passes the result through the activation function. More neurons = the network can model more complex, curved relationships – but it gets slower and may overfit.
e.g. "8,8" = two layers with 8 neurons each. "12" = one layer. More neurons = more capacity for non-linear patterns.
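A layer string such as "8,8" translates directly into layer sizes, and from those into a parameter count. The sketch below assumes a fully connected network in which every neuron has one weight per input plus one bias; the function names are illustrative.

```python
def parse_layers(spec):
    """Turn a string like '8,8' into a list of hidden layer sizes."""
    return [int(part) for part in spec.split(",") if part.strip()]


def count_parameters(n_inputs, hidden, n_outputs):
    """Weights plus biases of a fully connected network."""
    sizes = [n_inputs] + hidden + [n_outputs]
    # Each layer has (inputs x neurons) weights and one bias per neuron.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


hidden = parse_layers("8,8")
print(hidden, count_parameters(1, hidden, 1))
```

For one input, two hidden layers of 8 and one output this gives 16 + 72 + 9 = 97 parameters.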
How large each learning step is. After every example (in online mode) the network changes each weight by −learning rate × gradient. The gradient points in the direction that increases the error, so the weight moves against it, scaled by the learning rate. Too large: training overshoots and oscillates. Too small: training crawls. Typical range 0.01–0.1.
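The effect of the step size can be tried on a toy error with a single weight. This sketch assumes the error (w − 3)², chosen only because its gradient is easy to write down; it is not one of the page's tasks.

```python
def descend(lr, steps=50, w=0.0):
    """Minimise the toy error (w - 3)^2 by repeated gradient steps."""
    for _ in range(steps):
        gradient = 2 * (w - 3)   # derivative of (w - 3)^2
        w = w - lr * gradient    # step against the gradient
    return w


print(descend(0.1))    # ends close to the minimum at 3
print(descend(1.1))    # overshoots further on every step
print(descend(0.001))  # has barely moved after 50 steps
```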
The non-linear function applied after the weighted sum. Without it the whole network would be linear; the activation is what allows the network to learn curves. tanh: S-curve from −1 to 1, smooth and symmetric. sigmoid: S-curve from 0 to 1. ReLU: returns 0 for negative inputs, otherwise the value itself – simple and fast, but not smooth.
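The three activations, and the reason a network without them stays linear, can be checked in a few lines. The weights in the second half are arbitrary example numbers.

```python
import math


def tanh(x):
    return math.tanh(x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


# Two stacked neurons WITHOUT activation collapse into one linear function:
# w2 * (w1 * x + b1) + b2  ==  (w2 * w1) * x + (w2 * b1 + b2)
w1, b1, w2, b2 = 2.0, 1.0, 3.0, -1.0
x = 5.0
stacked = w2 * (w1 * x + b1) + b2
collapsed = (w2 * w1) * x + (w2 * b1 + b2)
print(stacked, collapsed)
```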
The strategy for updating weights. SGD: follows the gradient directly and can stall on flat regions. Momentum: accumulates velocity like a rolling ball and rolls through flat areas. Adam: momentum + individual step size per weight – learns most reliably. All three use the same gradient, but turn it into a step differently.
SGD = follows gradient only, easily gets stuck on plateaus. Momentum = accumulates velocity. Adam = momentum + automatic step size per weight. When switching to Adam, set the learning rate to ~0.01.
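The three update rules can be compared for a single weight. This is a sketch of the commonly published formulas; the default values for `beta`, `beta1`, `beta2` and `eps` are the usual textbook choices and are assumed here, not read from the page.

```python
import math


def sgd_step(w, grad, lr):
    """Plain gradient step."""
    return w - lr * grad


def momentum_step(w, grad, velocity, lr, beta=0.9):
    """Keep a running velocity so steps continue across flat regions."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity


def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum (m) plus a per-weight step size from the squared gradient (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction, t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


w_sgd = sgd_step(1.0, 2.0, lr=0.1)
w_mom, vel = momentum_step(1.0, 2.0, 0.0, lr=0.1)
w_mom, vel = momentum_step(w_mom, 2.0, vel, lr=0.1)
w_adam, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1, lr=0.01)
print(w_sgd, w_mom, w_adam)
```

On its first step Adam moves by roughly the learning rate regardless of the gradient's size, which is one reason a smaller rate such as 0.01 suits it.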
Scales all values to a small range before they enter the network and converts the output back. Necessary because a neuron fed raw values of very different magnitudes (e.g. 180 for seconds next to 3 for size) struggles to weigh them. [0,1]: standard. [−1,1]: symmetric around zero – works well with tanh and functions like x³ or sine.
Symmetric normalization greatly helps with point-symmetric functions around zero (e.g. x³, sine). tanh neurons are themselves symmetric around zero – data and building blocks then match.
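Scaling to a target range and converting back can be sketched with a min-max mapping per column. The helper name `fit_scaler` is hypothetical, and min-max scaling is an assumption about how the ranges are produced.

```python
def fit_scaler(values, lo=0.0, hi=1.0):
    """Return (scale, unscale) functions mapping the column onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1.0  # avoid division by zero for constant columns

    def scale(x):
        return lo + (x - vmin) / span * (hi - lo)

    def unscale(y):
        return vmin + (y - lo) / (hi - lo) * span

    return scale, unscale


scale, unscale = fit_scaler([60.0, 180.0, 300.0], lo=-1.0, hi=1.0)
print(scale(60.0), scale(180.0), scale(300.0), unscale(0.0))
```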
One epoch = one pass through all training rows. This value sets how many epochs one click of "Start training" runs. More epochs = more practice = smaller error, until it stops improving. After training completes, simply click again to add more epochs.
Online: adjusts immediately after each single example – fast but jittery. Batch: the gradients of all examples are averaged, then one step per epoch – calmer, smoother learning curve. The S in SGD stands for exactly this difference (stochastic = one at a time).
Online often adapts faster, batch gives more stable curves.
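The difference between the two update styles shows up even in a model with a single weight. The sketch assumes the toy model y = w · x with squared error; data and learning rate are example values.

```python
def epoch_online(w, data, lr):
    """One epoch: update the weight after every single example."""
    for x, y in data:
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w


def epoch_batch(w, data, lr):
    """One epoch: average all gradients, then take a single step."""
    grads = [2 * (w * x - y) * x for x, y in data]
    return w - lr * sum(grads) / len(grads)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true rule: y = 2 * x
w_online = epoch_online(0.0, data, lr=0.05)
w_batch = epoch_batch(0.0, data, lr=0.05)
print(w_online, w_batch)
```

After one epoch the online variant has taken three steps and is closer to 2 than the batch variant, which has taken one.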
3 · Curriculum (section-by-section training)
Curriculum: first limits training to part of the value range, then widens it – like a lesson plan that starts easy. This helps with difficult functions, because the network learns the rough shape before the details are added. "Auto-curriculum" widens the window automatically during training.
Choose a window: train only a section of the curve (left, right, or center). Helps when the network gets stuck in the middle. The window is applied to the first input.
active: full range
Auto: starts with a small left window and automatically expands it during training until the full range is covered.
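A window on the first input, and a schedule that widens it from the left, might look like this. The linear widening and the starting fraction of 0.25 are assumptions; the page does not state how fast the window grows.

```python
def window_rows(rows, lo, hi):
    """Keep only rows whose FIRST input lies inside [lo, hi]."""
    return [r for r in rows if lo <= r[0][0] <= hi]


def auto_windows(x_min, x_max, steps, start_fraction=0.25):
    """Yield (lo, hi) windows growing from the left edge to the full range.

    Requires steps >= 2.
    """
    span = x_max - x_min
    for k in range(steps):
        frac = start_fraction + (1 - start_fraction) * k / (steps - 1)
        yield x_min, x_min + frac * span


# Rows are (inputs, targets) pairs; here y = x^2 on x = 0..8.
rows = [((float(x),), (float(x * x),)) for x in range(9)]
windows = list(auto_windows(0.0, 8.0, steps=4))
for lo, hi in windows:
    print(lo, hi, len(window_rows(rows, lo, hi)))
```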
4 · Multi-run training (statistics)
Multi-run: starts the same task several times with random initialisation and shows the spread of final errors. This reveals whether a result is stable or depends on the luck of the start – important, because neural networks don't learn equally well every time.
Trains N times from scratch, each time with fresh random weights. Shows how much the result depends on the starting point. The best network stays loaded at the end.
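Multi-run training boils down to a loop over seeds plus some summary statistics. In this sketch a one-weight toy model stands in for the network; `train_once` and `multi_run` are illustrative names.

```python
import random
import statistics


def train_once(seed, data, epochs=5, lr=0.02):
    """Train y = w * x from a fresh random start; return (w, final error)."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # fresh random weight for this run
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    error = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return w, error


def multi_run(n_runs, data):
    """Run training n_runs times and summarise the spread of final errors."""
    results = [train_once(seed, data) for seed in range(n_runs)]
    errors = [err for _, err in results]
    best_w, best_error = min(results, key=lambda r: r[1])  # best run is kept
    return {
        "best_w": best_w,
        "best_error": best_error,
        "mean": statistics.mean(errors),
        "stdev": statistics.stdev(errors),
    }


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
summary = multi_run(5, data)
print(summary)
```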
5 · Save / load experiment
Saves all settings (task, layers, learning rate, activation, optimiser, scaling, update style, test fraction, epochs) and the training data to a JSON file. This lets you build deliberate experiments, save them and repeat them exactly, later or with others. Optionally the trained state (the learned weights) is saved too – then the finished result appears right after import; otherwise the network still needs training. Nothing is stored in the browser – a file is only downloaded or uploaded.
As a JSON file – the training data is always included. Nothing is stored in the browser.
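To give an idea of what such a file could contain, here is a round trip through JSON. All key names and values below are hypothetical; the page's actual file layout is not documented in this text.

```python
import json

# Hypothetical layout: one key per setting listed above, plus data and weights.
experiment = {
    "task": "regression",
    "layers": [8, 8],
    "learning_rate": 0.05,
    "activation": "tanh",
    "optimiser": "adam",
    "scaling": "[-1,1]",
    "update": "online",
    "test_fraction": 0.2,
    "epochs": 200,
    "data": "x>y\n0>0\n1>1",
    "weights": None,  # None = saved without the trained state
}

text = json.dumps(experiment, indent=2)  # what would be downloaded
restored = json.loads(text)              # what would be read on upload
print(restored["layers"], restored["weights"])
```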
Epoch: 0 · Error (MSE): – · Test error: – · Accuracy: – · Parameters: 0 · Data rows: 0
Network structure
Loss curve
Target vs. Network
Plane · view
Plane · draw
Latent space
Error landscape
Mathematics
What the correction is doing
Calculation in detail
Trains exactly this one row once – numbers change visibly.
Click "Compute 1 learning step" to see which weights change and by how much.
This neuron's computation
This is exactly what the neuron does with the current numbers. You can trace it step by step on paper: multiply each input by its weight, sum everything plus the bias, then pass through the activation function.
Click a neuron in the "Network structure" tab to see its calculation step by step.
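The same steps in code, for tracing a neuron with your own numbers. The inputs, weights and bias are example values; tanh is assumed as the activation.

```python
import math


def neuron(inputs, weights, bias, activation=math.tanh):
    """Return the weighted sum z and the activated output of one neuron."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, activation(z)


# 1.0 * 0.5 + 2.0 * (-0.25) + 0.1 = 0.1, then tanh(0.1)
z, out = neuron([1.0, 2.0], [0.5, -0.25], bias=0.1)
print(z, out)
```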
Test prediction
Here you enter your own input values and the trained network predicts the result – also for values not in the training data. This is how you check whether the network truly generalised the rule or only memorised the examples. In the boundary view you can click a point to copy its values here.
Enter input values and click "Compute".
Show full calculation
Here the network collects its training data itself instead of reading it from a table. The dino has 6 senses (inputs) and controls 2 things: whether to jump and whether to extend the jump (higher). For tall cacti, stretching pays off.
Learning method
Both look for the same thing: good weights – the dino’s “brain”. Only the path there is opposite.
Selection (evolution): computes no direction. Many dinos, each with randomly, slightly changed weights (mutation). Only how far one gets counts – the best pass on their weights, the rest drop out. Progress through selection. Needs no differentiable error, but many dinos and patience.
Reward (trial and error): a single dino. From each outcome (cleared/crash) backpropagation computes the direction each weight must move to lower the error – gradient descent. The Adam optimizer rolls with momentum into the “valley” of least error and picks the step size per weight itself. The only random part is which jump distance it tries.
Quality, not just survival: a pass doesn’t only count as “cleared” – it’s judged by safety: plenty of air above the cactus and enough room to the next one are good. A tight pass teaches jumping a bit earlier or higher; so the dino improves even without having to die.
Experience memory: every success and crash goes into a small memory and is re-practised continuously – so the dino keeps learning between cacti and improves faster.
Imitation (teacher): you demonstrate (Space = jump, hold = higher). Every ground decision instantly becomes a lesson: “wait” at this distance, or “jump” with this height. The net adjusts its weights to predict your action in every situation (supervised learning). It only learns while you steer – under “AI takes over” it is frozen, and it can only do what you show it.
When does the curve rise (teacher)? It measures how far the white student ghost gets. It rises once your jump timing is learned well enough that the ghost clears more cacti – i.e. when it has taken over your distance→action rule.
In a nutshell: selection = change at random and keep the best. Reward = compute where it gets better and roll there. Imitation = copy what you demonstrate.
Curriculum: both start easy (slow, low cacti); tall cacti appear only once the basic jump works.
Tip: you can keep training the same dino with both methods – just switch the method.
Trial & error: a dino learns from every outcome – cleared reinforces, crash discourages. More in (i).
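The experience memory described above can be sketched as a small fixed-size buffer that is sampled for extra practice. The class name, the capacity and the shape of a stored item are assumptions for illustration.

```python
import random
from collections import deque


class ExperienceMemory:
    """Fixed-size store of (senses, target) pairs; the oldest entries drop out."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, senses, target):
        self.items.append((senses, target))

    def sample(self, k, rng):
        """Pick up to k stored experiences at random for re-practice."""
        return rng.sample(list(self.items), min(k, len(self.items)))


memory = ExperienceMemory(capacity=3)
for step in range(5):
    # senses: 6 input values, target: desired (jump, hold) outputs
    memory.add([float(step)] * 6, (1.0, 0.0))

batch = memory.sample(2, random.Random(0))
print(len(memory.items), len(batch))
```

Each sampled pair would be fed through an ordinary training step, so learning continues between cacti.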
The hidden layers between the 6 inputs and the 2 outputs. One number = one layer (e.g. 6); commas for several (e.g. “8,4” = two layers). Feel free to try different sizes.
One number = one layer; commas for more, e.g. “8,4”.
How many dinosaurs run simultaneously per generation. More = more diversity, slower.
Fraction of weights randomly changed during inheritance. Low = population converges fast but easily gets stuck (local optimum). High = lots of exploration, but jittery and rarely stable.
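Selection with mutation can be sketched as: rank the population, keep the best, and fill up with mutated copies. The Gaussian noise, the `scale` parameter and the number of survivors are assumptions; only the mutation rate as a fraction of changed weights comes from the text.

```python
import random


def mutate(weights, rate, scale, rng):
    """Change each weight with probability `rate` by adding Gaussian noise."""
    return [w + rng.gauss(0.0, scale) if rng.random() < rate else w
            for w in weights]


def next_generation(population, fitness, rate, scale, rng, keep=2):
    """Keep the `keep` fittest unchanged, refill with their mutated copies."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:keep]
    children = [mutate(rng.choice(survivors), rate, scale, rng)
                for _ in range(len(population) - keep)]
    return survivors + children


def toy_fitness(weights):
    # Stand-in for "how far the dino gets": closer to zero is better here.
    return -sum(w * w for w in weights)


rng = random.Random(0)
population = [[1.0, 1.0], [0.1, 0.0], [2.0, -1.0], [0.5, 0.5]]
new_population = next_generation(population, toy_fitness, 0.5, 0.3, rng)
print(new_population)
```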
Game speed. Faster = learning visible in seconds, but choppier. Tempo only changes how fast it practises – not what the dino learns.
Keeps speed & height constant and dampens randomness so the dino can rehearse the ideal jump in peace and eventually run through safely. Also marks the learned jump-trigger distance.
Shows and sets the difficulty (higher = faster, tighter gaps, bigger cacti). In automatic mode it climbs on its own and the field follows; when frozen it stays fixed at your value.
Current level – adjustable. When frozen it stays fixed, otherwise it rises automatically.
What is happening
Not started yet.
The dinosaur's decision
Live: the 6 senses the network sees and its 2 outputs. Output “Jump” above 0.5 = jump; output “Hold” above 0.5 = extend the jump (higher). Click an input neuron in the weights canvas to mute that input (∅) – useful for exploring which senses the network actually needs.
–
Save / load dino
Saves the method and settings; with the checkbox also the trained state, so you can load a finished dino later and keep training it.
With the checkbox the brain is saved too – a finished dino can be loaded later and trained further (also in the other mode).
Here you’ll see live what the network is learning.
Learning progress
Learned weights
This is the dinosaur’s brain – the same network as in the lab above, only for 6 inputs → hidden layer → 2 outputs. Green = strengthening, red = inhibiting connection; thickness = strength.