While neural networks can be intimidating structures, the mechanism for training them is surprisingly simple: stochastic gradient descent. For each parameter in our network (such as a weight or a bias), all we have to do is calculate the derivative of the loss with respect to that parameter, and nudge the parameter a little bit in the opposite direction, so that the loss goes down.
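To make the "nudge" concrete, here is a minimal sketch of the update rule on a single parameter. The toy loss `(w - 3)^2`, the learning rate of 0.1, and the names `loss`, `dloss_dw`, and `gradient_step` are all assumptions made up for illustration. It is also plain gradient descent: the "stochastic" part, where the derivative is estimated from a random batch of training examples, is left out to keep the sketch short.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (chosen only for illustration).
    return (w - 3.0) ** 2


def dloss_dw(w):
    # Derivative of the toy loss with respect to the parameter w.
    return 2.0 * (w - 3.0)


def gradient_step(w, learning_rate=0.1):
    # Nudge w a little bit in the direction opposite to the derivative.
    return w - learning_rate * dloss_dw(w)


w = 0.0          # arbitrary starting value
history = [w]
for _ in range(50):
    w = gradient_step(w)
    history.append(w)

print(f"final w = {w:.4f}, final loss = {loss(w):.2e}")
```

Each step moves `w` by an amount proportional to the derivative, so a parameter whose derivative is tiny barely moves at all.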
Stochastic gradient descent seems simple enough, but in many networks we might begin to notice something odd: the weights closer to the end of the network change a lot more than those at the beginning. And the deeper the network, the less the beginning layers change. This is problematic, because our weights are initialized randomly. If the early weights are barely moving, they stay close to their random starting values: they're never going to reach the right values, or it'll take them years.
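The effect is easy to reproduce in a toy setting. The sketch below is only an illustration under several assumptions that are not from the text above: each "layer" is a single sigmoid unit with one weight and no bias, the loss is half the squared error, the weights are drawn uniformly from [-1, 1], and the function name `weight_gradients` is hypothetical. It computes the derivative of the loss with respect to every weight and prints the magnitudes from the first layer to the last.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weight_gradients(weights, x, y):
    # Forward pass: activations[0] is the input, activations[k + 1] is the
    # output of the unit that uses weights[k].
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass for loss = 0.5 * (out - y) ** 2.
    # delta holds dLoss/d(pre-activation) of the current unit.
    out = activations[-1]
    delta = (out - y) * out * (1.0 - out)
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        grads[k] = delta * activations[k]  # dLoss/d(weights[k])
        if k > 0:
            a_prev = activations[k]  # output of the previous unit
            # Chain rule: pass through weights[k], then the previous sigmoid.
            delta = delta * weights[k] * a_prev * (1.0 - a_prev)
    return grads


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(8)]  # random initialization
grads = weight_gradients(weights, x=1.0, y=1.0)

for layer, g in enumerate(grads, start=1):
    print(f"layer {layer}: |dLoss/dw| = {abs(g):.2e}")
```

In this setup every step backward multiplies the signal by a weight of magnitude at most 1 and by a sigmoid derivative of at most 0.25, so the derivative for the first weight ends up orders of magnitude smaller than the one for the last weight. Since the size of each update is proportional to the derivative, the first layers change far less than the last ones.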