Intuition for Fundamentals of Machine Learning: Backpropagation Chain Rule
This is really meant for me to look back on, to reinforce and remember fundamental concepts of machine learning.
Forward Propagation
Given the following variables:
- \( x \): Input to the neuron
- \( W \): Weight
- \( b \): Bias
- \( y_{\text{true}} \): True output value
1. Linear Combination \( z \)
\({\Huge z = Wx + b}\)
2. Activation \( a \)
The activation function (sigmoid in this case) applied to \( z \): \({\Huge a = \sigma(z) = \frac{1}{1 + e^{-z}}}\)
3. Output \( y \)
For simplicity, we use the activation as the output: \({\Huge y = a}\)
4. Loss \( L \)
The loss function (Mean Squared Error): \({\Huge L = \frac{1}{2}(y - y_{\text{true}})^2}\)
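As a quick sanity check, here is a minimal sketch of this forward pass in Python; the specific values of x, W, b, and y_true are made up for illustration and are not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy values, chosen only for illustration
x, W, b = 0.5, 0.8, 0.1
y_true = 1.0

z = W * x + b                  # 1. linear combination
a = sigmoid(z)                 # 2. activation
y = a                          # 3. output
L = 0.5 * (y - y_true) ** 2    # 4. squared-error loss for a single sample

print(z, a, L)                 # 0.5, ~0.622, ~0.071
```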
Backpropagation
- Backpropagation is calculated using the chain rule. The chain rule is for composite functions.
- \[{\Huge \frac{d}{dx}[f(g(x))] = f'(g(x)) * g'(x)}\]
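- As a quick worked example of the chain rule (my own, not part of the derivation below), take \( f(u) = \sin(u) \) as the outer function and \( g(x) = x^2 \) as the inner one: \[{\Huge \frac{d}{dx}[\sin(x^2)] = \cos(x^2) * 2x}\]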
4a. Loss function as a composite function
- \[{\Huge L = \frac{1}{2}(y - y_{\text{true}})^2}\]
- \[{\Huge L = \frac{1}{2}(a - y_{\text{true}})^2}\]
- \[{\Huge L = \frac{1}{2}(\sigma(z) - y_{\text{true}})^2}\]
- \[{\Huge L = \frac{1}{2}(\sigma(Wx + b) - y_{\text{true}})^2}\]
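The same nesting can be written out in code. A minimal sketch (the helper names `linear` and `loss` are my own labels, and the toy values match the forward-pass snippet above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear(W, x, b):
    return W * x + b

def loss(y, y_true):
    return 0.5 * (y - y_true) ** 2

x, W, b, y_true = 0.5, 0.8, 0.1, 1.0

# The loss is a composition of three functions, exactly as written in 4a:
# L = loss(sigmoid(linear(W, x, b)), y_true)
L = loss(sigmoid(linear(W, x, b)), y_true)
```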
5. Showing the derivative of the composite function
Derivative of the loss function with respect to \( W \)
Reference: [https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-1a/a/chain-rule-review]
- We need to know how the loss changes as \( W \) changes, hence we need \({\frac{dL}{dW}}\).
- \[{\Huge \frac{dL}{dW} = \color{green}{\frac{2}{2}(\sigma(Wx+b) - y_{\text{true}})} * \color{red}{1} * \color{orange}{\sigma(Wx+b) * (1-\sigma(Wx+b))} * \color{purple}{x}}\]
- \[{\Huge \frac{dL}{dW} = \color{green}{\frac{dL}{dy}} * \color{red}{\frac{dy}{da}} *\color{orange}{\frac{da}{dz}} *\color{purple}{\frac{dz}{dW}}}\]
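Putting the four colored factors together in code: a minimal sketch (continuing the toy values from above), with a finite-difference check to confirm that the chain-rule gradient matches a numerical estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, W, b, y_true = 0.5, 0.8, 0.1, 1.0

# Forward pass
z = W * x + b
a = sigmoid(z)
y = a

# Chain-rule factors, in the same order as the colored equation above
dL_dy = y - y_true                       # dL/dy (the 2 and 1/2 cancel)
dy_da = 1.0                              # dy/da (output is the activation)
da_dz = sigmoid(z) * (1 - sigmoid(z))    # da/dz (sigmoid derivative)
dz_dW = x                                # dz/dW

dL_dW = dL_dy * dy_da * da_dz * dz_dW

# Numerical check: nudge W by a small epsilon and compare
eps = 1e-6
def loss_at(W_):
    return 0.5 * (sigmoid(W_ * x + b) - y_true) ** 2
dL_dW_numeric = (loss_at(W + eps) - loss_at(W - eps)) / (2 * eps)

print(dL_dW, dL_dW_numeric)   # the two values should agree closely
```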