Daily Notes: 2025-11-28
daily
ML Notes
Adaline: another single-layer neural network
- Adaline: Adaptive Linear Neuron. Also known as the Widrow-Hoff rule; published by Bernard Widrow and Ted Hoff a few years after Rosenblatt's perceptron algorithm, as an improvement on it.
- The Adaline algorithm is important because it introduces the idea of defining and minimizing a continuous loss function.
- This lays the groundwork for other ML classification algorithms such as logistic regression, support vector machines, multilayer neural networks, etc.
Comparison with Perceptron
- Perceptron learning tweaks the weights when the predicted class (the net input compared against the threshold) disagrees with the label. Each update uses eta * (target - predict(x)), and the errors are counted per epoch.
- Consider the following loop from Perceptron.fit():
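```python
for _ in range(self.n_iter):
    errors = 0
    for xi, target in zip(X, y):
        # Update is nonzero only when the thresholded prediction disagrees with the label
        update = self.eta * (target - self.predict(xi))
        self.w_ += update * xi
        self.b_ += update
        errors += int(update != 0.0)
    self.errors_.append(errors)
return self
```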
Adaline uses gradient descent and differs from the Perceptron in two important ways:
- Linear activation: No more thresholding the net input. Instead, Adaline keeps the raw value net_input = w·x + b and compares that to the true continuous target.
- Cost function and update: Adaline minimizes the sum of squared errors (SSE) between net_input and target (a short derivation is sketched after this list).
  - Gradient of SSE w.r.t. the weights: (target - net_input) * x
  - Weight update for weights & bias:
    w += eta * (target - net_input) * x
    b += eta * (target - net_input)
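My own derivation sketch of where those updates come from, assuming the per-sample loss is written with a convenience factor of 1/2 so the 2 from the power rule cancels (the book may define the constant differently):

```latex
% Per-sample loss, where net_input = w^T x + b
L(\mathbf{w}, b) = \tfrac{1}{2}\left(t - \mathbf{w}^{\top}\mathbf{x} - b\right)^{2}

% Gradients via the chain rule
\frac{\partial L}{\partial \mathbf{w}} = -\left(t - \mathbf{w}^{\top}\mathbf{x} - b\right)\mathbf{x},
\qquad
\frac{\partial L}{\partial b} = -\left(t - \mathbf{w}^{\top}\mathbf{x} - b\right)

% Stepping against the gradient with learning rate \eta
\mathbf{w} \leftarrow \mathbf{w} + \eta\,(t - \text{net\_input})\,\mathbf{x},
\qquad
b \leftarrow b + \eta\,(t - \text{net\_input})
```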
Note: Why is the actual difference important?
Using the actual (continuous) difference matters because the loss is differentiable and the updates change smoothly. This allows you to apply batch or stochastic gradient descent.
- Pseudocode for the per-sample gradient descent loop (see the sketch after this list):
  - Compute net_input for each sample.
  - Calculate the error, which is target - net_input.
  - Determine the direction and magnitude of the update by multiplying that error by the input vector x and the learning rate eta.
  - Aggregate the SSE per epoch to monitor whether it is converging.
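A minimal runnable sketch of that loop in NumPy, mirroring the Perceptron class above; the class name AdalineSketch, the sse_ attribute, the seeding, and the 0/1 labels in predict() are my own placeholders, not anything from the book:

```python
import numpy as np

class AdalineSketch:
    """Per-sample gradient descent on the SSE loss (illustrative sketch only)."""

    def __init__(self, eta=0.01, n_iter=15, seed=1):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes (epochs) over the training set
        self.seed = seed

    def net_input(self, x):
        # Linear activation: the raw value w·x + b, no thresholding
        return np.dot(x, self.w_) + self.b_

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        rng = np.random.default_rng(self.seed)
        self.w_ = rng.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = 0.0
        self.sse_ = []                                   # SSE per epoch, to monitor convergence
        for _ in range(self.n_iter):
            sse = 0.0
            for xi, target in zip(X, y):
                error = target - self.net_input(xi)      # continuous error, not a class disagreement
                self.w_ += self.eta * error * xi         # step along the per-sample gradient
                self.b_ += self.eta * error
                sse += error ** 2
            self.sse_.append(sse)
        return self

    def predict(self, X):
        # Threshold only at prediction time to turn net_input into class labels
        return np.where(np.dot(X, self.w_) + self.b_ >= 0.0, 1, 0)
```

With a small eta (around 0.01 on standardized features), sse_ should shrink epoch over epoch; if eta is too large, the steps overshoot and the SSE grows instead of converging.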
Note: Why is this better than the Perceptron?
Adaline relies on a continuous error surface, which allows it to converge when a Perceptron might oscillate (for example, when the classes are not perfectly linearly separable).
Personal Notes
- This is the first time I’ve seen the mathematical intuition emerge as extremely important.
- It’s hard to get back into ML studying. I’m trying to keep the big picture in mind of the “why” behind the learning.
- It’s extremely motivating to listen to how Gabriel Petersson thinks about empowering yourself to learn using AI - getting down to the bottom of things and truly understanding vs. vibecoding.
Questions I still have
- Why is Adaline a single layer neural network? I’m assuming it’s because there’s one “decision” that the algorithm makes before it corrects itself per epoch. This is consistent with what I’d expect given the videos I’ve seen of MNNs in action. Would love to understand whether this is correct.
Tomorrow’s plan
- I need to study up on Gradient Descent to truly understand Adaline. I will watch 3Blue1Brown’s video as well.