I read with interest the recent paper out of Baidu about scaling up image recognition. In it they talk about creating a supercomputer to carry out the learning phase of training a deep convolutional network. Training such things is terribly slow, with their typical example taking 212 hours on a single GPU machine because of the enormous number of weight computations that need to be evaluated and the slow stochastic gradient process over large training sets.
Baidu has built a dedicated machine with 36 servers connected by an InfiniBand switch, each server with four GPUs. In the paper they describe different ways of partitioning the problem to run on this machine. They end up being able to train the model using 32 GPUs in 8.6 hours.