Daily Notes: 2025-12-10

daily
Published
December 10, 2025
ML Notes

Adaline implementation
import numpy as np


class AdalineGD:
    """ADAptive LInear NEuron classifier.

    Parameters
    --------------

    eta: float
        This is the learning rate (between 0.0 and 1.0)
    n_iter: int
        Passes over the training dataset
    random_state: int
        Random number generator seed for random weight initialization
    
    Attributes
    ---------------
    w_: 1D-Array
        Weights after fitting
    b_: Scalar
        Bias after fitting
    losses_: list
        Mean Squared Error loss function values in each epoch
    """
    def __init__(self, eta=0.01, n_iter = 50, random_state = 1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def net_input(self, X):
        """Calculate net input
        
        Think of this as the linear combination of features and weights + bias
        This helps get the neuron's "raw score"
        """
        return np.dot(X, self.w_) + self.b_

    def activation(self, X):
        """Compute linear activation
        
        This is just the identity."""
        return X
    
    def predict(self, X): 
        """Return class label after unit step
        
        Binary class label of 1 when the activation >= 0.5, else 0
        """
        return np.where(self.activation(self.net_input(X)) >= 0.5, 1, 0)

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        -----------
        X: array-like, shape = [m_examples, n_features]
            Training vectors
            m_examples: # of examples
            n_features: # of features
        
        y: array-like, shape = [m_examples]
            Target values
        
        Returns
        -----------
        self: object
        """

        # Initialize the RNG, weights w_ with small random values, bias b_ to zero, losses_ empty
        rgen = np.random.RandomState(self.random_state)
        # This starts weights as tiny random numbers
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1]) 
        # This starts the bias at zero
        # (np.float_ was removed in NumPy 2.0, so use np.float64)
        self.b_ = np.float64(0.)
        # This prepares a list to record training error at each pass
        self.losses_ = []

        for i in range(self.n_iter): # for each epoch
            # Given whatever is w_ and b_, we're computing the raw score for each example
            net_input = self.net_input(X) 
            # Activation is the identity (no "squashing" yet)
            output = self.activation(net_input)
            # How far off are we from the desired targets? 
            # Positive if too low, negative if too high
            errors = (y - output)
            
            """Weight update. 
            
            This step is important!
            We are calculating the gradient based on the whole training set,
            not just evaluating each individual training example (as in the perceptron).
            This makes the learning "smoother": fewer aberrations caused by individual examples.

            This is called "Batch Gradient Descent." 
            
            Think of the model as making guesses with two kinds of knobs:
            w_: one knob per input feature (like volume sliders for each input).
            b_: one extra knob that shifts everything up or down (like a master volume).

            Weight update (the "averaged nudge"):

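            Recall that error = (y - (X * w + b))
            The factor of 2.0 comes from the Chain Rule:
            the loss is error^2, so the derivative is 2 * error * derivative of error
            Derivative of error w.r.t. w is -X, so the gradient points along -X.T * error.
            Gradient descent subtracts the gradient, and the two minus signs cancel,
                which is why the update uses += rather than -=.
            Multiply X.T * error to aggregate per feature across all samples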

            X has rows of training examples, columns of features
            errors is rows of how wrong we were per example
            X.T is the transpose of X so that each feature lines up with the 
                errors across example
            X.T.dot(errors) is the dot product that combines every feature with 
                its errors. 
            X is (n_samples, n_features); errors is (n_samples,). 
            Flipping X gives X.T as (n_features, n_samples).
            Dotting (n_features, n_samples) with (n_samples,) yields (n_features,):
                a separate summed value for each feature.
            
            Recall that the loss is the mean squared error
            Therefore, we need to divide by N (# of examples) so the 
                summed gradient becomes an average
            / X.shape[0] means “take the average over all examples” so we don't 
                overreact to any single case.

            Bias update: the derivative of (X * w + b) w.r.t. b is 1,
                so nudge the bias by eta * 2.0 * the avg error
            """
            self.w_ += self.eta * 2.0 * X.T.dot(errors) / X.shape[0]
            self.b_ += self.eta * 2.0 * errors.mean()
            # Compute Mean Squared Error
            loss = (errors**2).mean()
            # Track loss history so we can see learning progress
            self.losses_.append(loss)
        return self
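To check that the update rule behaves as described, here is a minimal sketch that condenses the same batch update into a standalone function and runs it on a tiny dataset. The helper name `fit_adaline`, the toy data, and the choice of `eta=0.1` are my own assumptions for illustration, not part of the class above.

```python
import numpy as np


def fit_adaline(X, y, eta=0.1, n_iter=50, seed=1):
    """Full-batch gradient descent on MSE (hypothetical helper mirroring AdalineGD.fit)."""
    rgen = np.random.RandomState(seed)
    w = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
    b = 0.0
    losses = []
    for _ in range(n_iter):
        # Identity activation: the output is just the net input
        errors = y - (X.dot(w) + b)
        # Same averaged nudge as in the class: one update per epoch
        w += eta * 2.0 * X.T.dot(errors) / X.shape[0]
        b += eta * 2.0 * errors.mean()
        losses.append((errors ** 2).mean())
    return w, b, losses


# Made-up toy data: two features, two well-separated classes
X = np.array([[-1.0, -1.2], [-0.8, -0.9], [-1.1, -0.7],
              [1.0, 1.1], [0.9, 0.8], [1.2, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b, losses = fit_adaline(X, y)
preds = np.where(X.dot(w) + b >= 0.5, 1, 0)

print("first loss:", losses[0], "last loss:", losses[-1])
print("predictions:", preds)
```

On data this cleanly separated the loss should shrink from the first epoch to the last. A learning rate that is too large for the feature scale would instead make the loss grow, which is worth trying by hand.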
Personal Notes

I think I understand the implementation line by line. This tweet by GabrielPeterss4 helped. It’s worth reviewing again and again.
Questions I still have

Need to be able to implement this from scratch. I wonder if I’m missing the forest for the trees here, but I do think it’s important to really understand Gradient Descent forwards and backwards.
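One way I could test my understanding of the gradient is a numerical gradient check: compare the hand-derived gradient of the MSE against finite differences. The sketch below does that on random data. The function names, the random data, and the step size `h` are assumptions for illustration only.

```python
import numpy as np


def mse_loss(w, b, X, y):
    """Mean squared error of the linear output X.dot(w) + b."""
    errors = y - (X.dot(w) + b)
    return (errors ** 2).mean()


def analytic_grad(w, b, X, y):
    """Hand-derived gradient of the MSE w.r.t. w and b."""
    errors = y - (X.dot(w) + b)
    grad_w = -2.0 * X.T.dot(errors) / X.shape[0]
    grad_b = -2.0 * errors.mean()
    return grad_w, grad_b


def numeric_grad(w, b, X, y, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad_w = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad_w[j] = (mse_loss(w + step, b, X, y)
                     - mse_loss(w - step, b, X, y)) / (2 * h)
    grad_b = (mse_loss(w, b + h, X, y) - mse_loss(w, b - h, X, y)) / (2 * h)
    return grad_w, grad_b


# Random, made-up data just for the check
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = rng.randint(0, 2, size=20).astype(float)
w = rng.normal(scale=0.1, size=3)
b = 0.3

gw_a, gb_a = analytic_grad(w, b, X, y)
gw_n, gb_n = numeric_grad(w, b, X, y)

print("weights match:", np.allclose(gw_a, gw_n, atol=1e-6))
print("bias matches:", np.isclose(gb_a, gb_n, atol=1e-6))
```

The Adaline update is just `w -= eta * grad_w`, which is the same thing as the `+=` with `errors` in the class, since the analytic gradient carries the minus sign.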
Tomorrow’s plan