Stochastic Distributed Learning with Gradient Quantization and Double Variance Reduction

Abstract

We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance ω ≥ 1 that might slow down convergence. For strongly convex functions with condition number κ distributed among n machines, we (i) give a scheme that converges in O((κ + κω/n + ω) log(1/ϵ)) steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having fewer than m components, we (ii) present novel variance reduced schemes that converge in O((κ + κω/n + ω + m) log(1/ϵ)) steps to arbitrary accuracy ϵ > 0.
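To illustrate the setting, the sketch below shows an unbiased random-k sparsification operator (one standard example of a compressor with variance parameter ω = d/k − 1) and a single distributed step in which each worker sends a compressed gradient and the server averages them, so the compression variance shrinks by the number of workers n (the ω/n term in the rate). This is a minimal, assumed illustration in Python, not the paper's actual algorithms; the function names and the choice of rand-k compression are hypothetical.

```python
import numpy as np

def rand_k_sparsify(x, k, rng):
    """Unbiased random-k sparsification (illustrative compressor).

    Keeps k of the d coordinates, scaled by d/k, so that
    E[C(x)] = x and E||C(x) - x||^2 <= omega * ||x||^2 with omega = d/k - 1.
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def compressed_distributed_step(x, worker_grads, k, lr, rng):
    """One gradient step with compressed communication (sketch only).

    Each of the n workers compresses its local gradient before sending it;
    the server averages the n compressed gradients, which reduces the
    compression-induced variance by a factor of n, and takes a step.
    """
    compressed = [rand_k_sparsify(g(x), k, rng) for g in worker_grads]
    avg = sum(compressed) / len(compressed)
    return x - lr * avg

# Tiny usage example: two workers with quadratic local objectives.
rng = np.random.default_rng(0)
grads = [lambda x: x - 1.0, lambda x: x + 1.0]   # local gradients (hypothetical)
x = np.zeros(10)
for _ in range(100):
    x = compressed_distributed_step(x, grads, k=2, lr=0.5, rng=rng)
```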

Publication
Optimization Methods and Software