Daily Notes: 2025-11-29

November 29, 2025

ML Notes

Gradient Descent, or how machines learn

  • When measuring how “costly” a wrong prediction is, i.e., how far the model’s prediction on a training example is from the true value, you can use the Sum of Squared Errors (SSE).
  • It is said to be appropriate when modeling regression with Gaussian noise.
  • It is not well suited for classification or for data with heavy outliers: squaring the residuals penalizes large errors quadratically, so a few outliers can dominate the loss.

Formally, the Sum of Squared Errors is defined as:

\(SSE = \sum\limits_{i} (y_i - \hat{y}_i)^2\)
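As a quick sanity check, here is a minimal NumPy sketch of that formula; the values of \(y\) and \(\hat{y}\) are made up:

```python
import numpy as np

# True targets and model predictions (illustrative values)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# SSE = sum over i of (y_i - y_hat_i)^2
sse = np.sum((y_true - y_pred) ** 2)
print(sse)  # 0.25 + 0.25 + 0.0 + 1.0 = 1.5
```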

  • Recall that Weights \(w\) represent how strongly each input dimension influences the neuron, and the Bias \(b\) shifts the activation threshold and acts as an offset.
    • 3Blue1Brown says: Neurons are connected to the neurons in the previous layer. The weights are the strength of those connections, and for ReLU-like activations, the bias affects when the neuron is active/inactive (a small sketch of this follows the list).
  • A network “learns” by adjusting the parameters \(W\) and \(b\) to minimize a cost function / loss function.
  • Gradient Descent is one way the network can minimize this cost function: it iteratively steps downhill toward a local minimum, which is useful when we cannot simply solve \(\nabla J = 0\) in closed form.
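
A tiny sketch of a single neuron along those lines, assuming a ReLU activation as in the 3Blue1Brown framing; the inputs, weights, and bias are made-up numbers:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.2, 0.5, 0.1])   # inputs from the previous layer
w = np.array([1.0, -2.0, 3.0])  # weights: strength of each connection
b = -0.5                        # bias: shifts the activation threshold

z = np.dot(w, x) + b  # pre-activation: w.x + b
a = relu(z)           # neuron is "inactive" (outputs 0) whenever z <= 0

print(z, a)  # z = 0.2 - 1.0 + 0.3 - 0.5 = -1.0, so a = 0.0
```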
Note: W vs. w

\(W\) and \(w\) mean different things in neural networks.

\(W\): A weight matrix for an entire layer. If a layer has \(n_{in}\) inputs and \(m_{out}\) neurons, then \(W \in \mathbb{R}^{m_{out} \times n_{in}}\). Each row is the weight vector of one neuron, and \(z = Wx + b\) is the equivalent of doing many individual dot products at once.

\(w\): A weight vector for a single neuron.

\(w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}\)
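A small NumPy check of that claim, with made-up numbers: stacking each neuron’s \(w\) as a row of \(W\) makes \(z = Wx + b\) give the same result as doing the per-neuron dot products one by one.

```python
import numpy as np

x = np.array([0.2, 0.5, 0.1])          # n_in = 3 inputs

# One weight vector w per neuron (m_out = 2 neurons in this layer)
w1 = np.array([1.0, -2.0, 3.0])
w2 = np.array([0.5, 0.5, -1.0])
b = np.array([-0.5, 0.1])

# Layer form: rows of W are the per-neuron weight vectors
W = np.stack([w1, w2])                 # shape (m_out, n_in) = (2, 3)
z_layer = W @ x + b

# Per-neuron form: many individual dot products
z_manual = np.array([w1 @ x + b[0], w2 @ x + b[1]])

print(np.allclose(z_layer, z_manual))  # True
```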

Formally, Gradient Descent is the algorithm used to minimize a cost function \(J(\theta)\) by iteratively updating parameters in the direction that reduces the loss:

\(\theta := \theta - \eta \nabla_{\theta} J(\theta)\)
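A minimal sketch of that update rule, assuming a one-parameter model \(\hat{y} = \theta x\) with SSE as the cost \(J(\theta)\); the data and the learning rate \(\eta\) are made up:

```python
import numpy as np

# Toy data generated from y = 2x, so theta should converge near 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

theta = 0.0   # initial parameter
eta = 0.01    # learning rate (eta)

for step in range(200):
    residual = y - theta * x            # y_i - y_hat_i
    grad = -2.0 * np.sum(x * residual)  # dJ/dtheta for SSE
    theta = theta - eta * grad          # theta := theta - eta * grad J(theta)

print(theta)  # ~2.0
```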

Personal Notes

Questions I still have

  • To answer a question from yesterday, it does seem like MNNs are a version of the SNN where each layer does its own \(Wx + b\) weight-and-bias calculation, and the output of one layer becomes the input to the next, again and again (see the sketch below).
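
A sketch of that idea with made-up shapes and random weights: each layer applies its own \(Wx + b\) plus an activation, and the output of one layer is the input to the next.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Two layers: 3 inputs -> 4 hidden neurons -> 2 outputs (shapes are illustrative)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.2, 0.5, 0.1])

h = relu(W1 @ x + b1)  # layer 1: Wx + b, then activation
out = W2 @ h + b2      # layer 2: the same calculation again on layer 1's output

print(out.shape)  # (2,)
```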

Tomorrow’s plan