Plumbing Deep Learning in the Shallows

Binomial Bottleneck

• Machine Learning, Dropout, Regularization, AWS, GPU, Python, and Numpy

Neural Net Architecture is not a playground for those that demand instant gratification. The endless trials with slight variations of one parameter or another (made ever so much worse when you don’t take rigorous notes, head slap) provide feedback only when they are good and ready. So it is much like watching water boil. While we’re waiting. Lets break out Python’s cProfile and see what is going on under the hood, just to pass the time, of course.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   529200 5912.656    0.011 5945.037    0.011 {method 'binomial' of 'mtrand.RandomState' objects}
  1075196 1768.545    0.002 8431.153    0.008 layer.py:53(_vector_pass)
  1058400  589.529    0.001  766.941    0.001 layer.py:124(_update_weights)
  1075196  462.097    0.000  462.097    0.000 {built-in method numpy.core.multiarray.dot}
  1058400  171.428    0.000  177.412    0.000 numeric.py:1003(outer)
  6932145  156.128    0.000  156.128    0.000 {built-in method numpy.core.multiarray.array}
   529200  126.677    0.000  126.677    0.000 {built-in method numpy.core.multiarray.copyto}
  3185634   33.730    0.000   33.730    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   537598   10.852    0.000   57.246    0.000 data.py:563(normalize)
  1058400   10.728    0.000   17.094    0.000 layer.py:96(_layer_level_backprop)

That’s a lot of time to be spending with the binomial method. Now something suddenly makes sense. Several weeks ago in early testing on AWS there seemed to be little difference between training on a CPU optimized instance and a GPU optimized one. That in itself did not seem to surprising as there isn’t much in a neural net that lends itself to parallelization. The backpropogation at any given point depeneds entirely on every calculation that comes before it.

However, in a recent of fit watching the water boil, I splurged and gave the GPU another go, this time with an identical architecture to one running on the CPU opt. instance. And magically … a 10-fold increase! What changed? Aha! During the first look at GPU’s I was testing regularization (so I had neuron dropout turn off) to prevent overfitting. But since then I had gone back to dropout experiments (and hence regularization) turned off.

So what does binomial have to do with all this? That is the function that creates the random mask over the neurons to turn some of them off for a particular pass. This however is something that can be parallelized. Of course, the GPU will be making hay with this. Now I just need to decide if dropout is worth that much time. Heh, I guess that’s what bigger computers are for.

Thanks Hobson for the insight on this one.

Happy learning!

comments powered by Disqus