In this note, I derive the bias-variance decomposition following Refs. 1 and 2.
Let us suppose there exists a true function \(g({\bf x})\) that generates our data \(y\) with additive noise, so
\[{\bf y} = g({\bf X})+{\bf \epsilon},\]where the noise \({\bf \epsilon}\) has zero mean, \(\langle {\bf \epsilon} \rangle = 0\). Let \(T\) be the set of training data
\[T=\lbrace ({\bf x}_i,y_i) \mid i=1,\dots,N \rbrace,\]and suppose we train a model \(\hat{g}_T\) on this set. The cost function \(J({\bf X},\hat{g}_T)\) that we consider here is given by
\[J({\bf X},\hat{g}_T) = \left( {\bf y}- \hat{g}_T({\bf X}) \right)^2,\]where \({\bf y}=(y_1,\dots,y_N)\) and \(\hat{g}_T({\bf X})=(\hat{g}_T({\bf x}_1),\dots,\hat{g}_T({\bf x}_N))\).
The expectation value of \(J\), denoted by \(\langle J \rangle\) and taken over both the noise and the draw of the training set \(T\), is given by
\[\langle J \rangle = \langle \left( {\bf y}-g({\bf X})+g({\bf X})- \hat{g}_T({\bf X}) \right)^2 \rangle \\ = \langle \left( {\bf y}-g({\bf X}) \right)^2 \rangle + \langle \left( g({\bf X})- \hat{g}_T({\bf X}) \right)^2 \rangle + 2 \langle \left( {\bf y}-g({\bf X}) \right)\left( g({\bf X})- \hat{g}_T({\bf X}) \right) \rangle \\ = \langle \left( {\bf y}-g({\bf X}) \right)^2 \rangle + \langle \left( g({\bf X})- \hat{g}_T({\bf X}) \right)^2 \rangle.\]The cross term vanishes because \({\bf y}-g({\bf X})={\bf \epsilon}\) has zero mean and is independent of \(g({\bf X})-\hat{g}_T({\bf X})\). The first term is the irreducible noise \(\langle {\bf \epsilon}^2 \rangle\), which no choice of model can remove.
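To exhibit the bias and variance explicitly, the second term can be split further by adding and subtracting the model averaged over training sets. This is the standard further step (the notation \(\langle \cdot \rangle_T\) for the average over training sets is my own shorthand, not necessarily that of the references):

```latex
\[\langle \left( g({\bf X})- \hat{g}_T({\bf X}) \right)^2 \rangle_T
= \left( g({\bf X})- \langle \hat{g}_T({\bf X}) \rangle_T \right)^2
+ \langle \left( \hat{g}_T({\bf X})- \langle \hat{g}_T({\bf X}) \rangle_T \right)^2 \rangle_T,\]
```

where the cross term again vanishes, this time because \(\hat{g}_T({\bf X})- \langle \hat{g}_T({\bf X}) \rangle_T\) averages to zero over training sets. The first piece is the squared bias of the model and the second is its variance across training sets.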
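The decomposition can be checked numerically. The sketch below (my own illustration, not from the references: the choices \(g = \sin\), \(\sigma = 0.3\), and a degree-3 polynomial model are all arbitrary) fits many models \(\hat{g}_T\) on independently drawn training sets and compares the averaged test error against noise + bias\(^2\) + variance:

```python
# Monte-Carlo check of <J> = noise + bias^2 + variance.
# g, sigma, and the polynomial model are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
g = np.sin                       # the "true" function g(x)
sigma = 0.3                      # std of the additive noise epsilon
x_test = np.linspace(0, np.pi, 50)

def fit_and_predict(degree=3, n_train=30):
    """Train one model g_T on a fresh training set; predict on x_test."""
    x = rng.uniform(0, np.pi, n_train)
    y = g(x) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
    return np.polyval(coeffs, x_test)

# One row per training set T: an ensemble of trained models.
preds = np.array([fit_and_predict() for _ in range(2000)])

# Left-hand side: <J>, averaged over noise, training sets, and x_test.
y_test = g(x_test) + rng.normal(0, sigma, preds.shape)
lhs = np.mean((y_test - preds) ** 2)

# Right-hand side: noise + bias^2 + variance, averaged over x_test.
noise = sigma ** 2
bias2 = np.mean((g(x_test) - preds.mean(axis=0)) ** 2)
var = np.mean(preds.var(axis=0))
rhs = noise + bias2 + var

print(lhs, rhs)  # the two should agree up to Monte-Carlo error
```

Pushing the polynomial degree up (more variance) or down (more bias) shifts the balance between the two terms while the identity continues to hold.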