ORIGINAL RESEARCH
published: 18 December 2018
doi: 10.3389/fams.2018.00062

Randomized Distributed Mean Estimation: Accuracy vs. Communication

Jakub Konečný¹* and Peter Richtárik¹,²,³

¹ School of Mathematics, The University of Edinburgh, Edinburgh, United Kingdom
² Moscow Institute of Physics and Technology, Dolgoprudny, Russia
³ King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Edited by: Yiming Ying, University at Albany, United States
Reviewed by: Shiyin Qin, Beihang University, China; Shao-Bo Lin, Wenzhou University, China
*Correspondence: Jakub Konečný, konkey@google.com

Specialty section: This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

Received: 11 October 2018
Accepted: 28 November 2018
Published: 18 December 2018

Citation: Konečný J and Richtárik P (2018) Randomized Distributed Mean Estimation: Accuracy vs. Communication. Front. Appl. Math. Stat. 4:62. doi: 10.3389/fams.2018.00062
We consider the problem of estimating the arithmetic average of a finite collection of real vectors stored in a distributed fashion across several compute nodes subject to a communication budget constraint. Our analysis does not rely on any statistical assumptions about the source of the vectors. This problem arises as a subproblem in many applications, including reduce-all operations within algorithms for distributed and federated optimization and learning. We propose a flexible family of randomized algorithms exploring the trade-off between expected communication cost and estimation error. Our family contains the full-communication and zero-error method on one extreme, and an $\epsilon$-bit communication and $O(1/(\epsilon n))$ error method on the opposite extreme. In the special case where we communicate, in expectation, a single bit per coordinate of each vector, we improve upon existing results by obtaining $O(r/n)$ error, where $r$ is the number of bits used to represent a floating point value.

Keywords: communication efficiency, distributed mean estimation, accuracy-communication tradeoff, gradient compression, quantization
1. INTRODUCTION

We address the problem of estimating the arithmetic mean of $n$ vectors, $X_1, \dots, X_n \in \mathbb{R}^d$, stored in a distributed fashion across $n$ compute nodes, subject to a constraint on the communication cost.

In particular, we consider a star network topology with a single server at the centre and $n$ nodes connected to it. All nodes send an encoded (possibly via a lossy randomized transformation) version of their vector to the server, after which the server performs a decoding operation to estimate the true mean

$$\bar{X} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} X_i.$$

The purpose of the encoding operation is to compress the vector so as to save on communication cost, which is typically the bottleneck in practical applications.

To better illustrate the setup, consider the naive approach in which all nodes send the vectors without performing any encoding operation, followed by the application of a simple averaging decoder by the server. This results in zero estimation error at the expense of the maximum communication cost of $ndr$ bits, where $r$ is the number of bits needed to communicate a single floating point entry/coordinate of $X_i$.
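To make this baseline concrete, here is a minimal sketch (ours, not from the paper) of the naive scheme in Python/NumPy, with purely illustrative values of n, d, and r:

```python
import numpy as np

n, d, r = 16, 1_000_000, 32   # illustrative: 16 nodes, 10^6 coordinates, 32-bit floats

# Vectors X_i held by the nodes (random data, purely for illustration).
X = [np.random.randn(d).astype(np.float32) for _ in range(n)]

# Naive approach: every node ships its full vector; the server simply averages.
X_bar = sum(X) / n                    # zero estimation error
naive_bits = n * d * r                # maximum communication cost: n*d*r bits
print(f"naive communication cost: {naive_bits / 8 / 2**20:.1f} MiB")
```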

This operation appears as a computational primitive in numerous cases, and the communication cost can be reduced at the expense of accuracy. Our proposal for balancing accuracy and communication is in practice relevant for any application that uses the MPI_Gather or MPI_Allgather routines [1], or their conceptual variants, for efficient implementation and can tolerate inexactness in computation, such as many algorithms for distributed optimization.
1.1. Background and Contributions

The distributed mean estimation problem was recently studied in a statistical framework where it is assumed that the vectors $X_i$ are independent and identically distributed samples from some specific underlying distribution. In such a setup, the goal is to estimate the true mean of the underlying distribution [2–5]. These works formulate lower and upper bounds on the communication cost needed to achieve the minimax optimal estimation error.
In contrast, we do not make any statistical assumptions on the source of the vectors, and study the trade-off between expected communication cost and the mean squared error of the estimate. Arguably, this setup is a more robust and accurate model of the distributed mean estimation problems arising as subproblems in applications such as reduce-all operations within algorithms for distributed and federated optimization [6–10]. In these applications, the averaging operations need to be done repeatedly throughout the iterations of a master learning/optimization algorithm, and the vectors $\{X_i\}$ correspond to updates to a global model/variable. In such cases, the vectors evolve throughout the iterative process in a complicated pattern, typically approaching zero as the master algorithm converges to optimality. Hence, their statistical properties change over time, which means that fixed statistical assumptions are not satisfied in practice.
For instance, when training a deep neural network model in a distributed environment, the vector $X_i$ corresponds to a stochastic gradient based on a minibatch of data stored on node $i$. In this setup we do not have any useful prior statistical knowledge about the high-dimensional vectors to be aggregated. It has recently been observed that when communication cost is high, which is typically the case for commodity clusters, and even more so in a federated optimization framework, it can be very useful to sacrifice estimation accuracy in favor of reduced communication [11, 12].
In this paper we propose a parametric family of randomized methods for estimating the mean $\bar{X}$, with parameters being a set of probabilities $p_{ij}$ for $i = 1, \dots, n$ and $j = 1, 2, \dots, d$, and node centers $\mu_i \in \mathbb{R}$ for $i = 1, 2, \dots, n$. The exact meaning of these parameters is explained in section 3. By varying the probabilities, at one extreme, we recover the exact method described above, enjoying zero estimation error at the expense of full communication cost. At the opposite extreme are methods with arbitrarily small expected communication cost, which is achieved at the expense of an exploding estimation error. Practical methods appear somewhere on the continuum between these two extremes, depending on the specific requirements of the application at hand. Suresh et al. [13] propose a method combining a pre-processing step via a random structured rotation, followed by randomized binary quantization. Their quantization protocol arises as a suboptimal special case of our parametric family of methods¹.

¹ See Remark 4.
To illustrate our results, consider the special case presented in Example 7, in which we choose to communicate a single bit per element of $X_i$ only. We then obtain an $O\left(\frac{r}{n} R\right)$ bound on the mean squared error, where $r$ is the number of bits used to represent a floating point value, and $R = \frac{1}{n}\sum_{i=1}^{n} \|X_i - \mu_i \mathbf{1}\|^2$, with $\mu_i \in \mathbb{R}$ being the average of the elements of $X_i$ and $\mathbf{1}$ the all-ones vector in $\mathbb{R}^d$. Note that this bound improves upon the performance of the method of Suresh et al. [13] in two aspects. First, the bound is independent of $d$, improving from logarithmic dependence, as stated in Remark 4 in detail. Further, due to a preprocessing rotation step, their method requires $O(d \log d)$ time to be implemented on each node, while our method is linear in $d$. This and other special cases are summarized in Table 1 in section 5.
While the above already improves upon the state of the art, the improved results are in fact obtained for a suboptimal choice of the parameters of our method (constant probabilities $p_{ij}$, and node centers fixed to the mean $\mu_i$). One can decrease the MSE further by optimizing over the probabilities and/or node centers (see section 6). However, apart from a very low communication cost regime in which we have a closed-form expression for the optimal probabilities, the problem needs to be solved numerically, and hence we do not have expressions for how much improvement is possible. We illustrate the effect of fixed and optimal probabilities on the trade-off between communication cost and MSE experimentally on a few selected datasets in section 6 (see Figure 1).
Remark 1. Since the initial version of this work, an updated version of Suresh et al. [13] contains a rate similar to Example 7, using variable length coding. That work also formulates lower bounds, which are attained by both their and our results. Other works that were published since, such as [14, 15], propose algorithms that can also be represented as a particular choice of protocols α, β, γ, demonstrating the versatility of our proposal.
1.2. Outline

In section 2 we formalize the concepts of encoding and decoding protocols. In section 3 we describe a parametric family of randomized (and unbiased) encoding protocols and give a simple formula for the mean squared error. Subsequently, in section 4 we formalize the notion of communication cost, and describe several communication protocols which are optimal under different circumstances. We give simple instantiations of our protocol in section 5, illustrating the trade-off between communication cost and accuracy. In section 6 we address the question of the optimal choice of parameters of our protocol. Finally, in section 7 we comment on possible extensions we leave for future work.
2. THREE PROTOCOLS

In this work we consider (randomized) encoding protocols α, communication protocols β, and decoding protocols γ, using which the averaging is performed inexactly as follows. Node $i$ computes a (possibly stochastic) estimate of $X_i$ using the encoding protocol, which we denote $Y_i = \alpha(X_i) \in \mathbb{R}^d$, and sends it to the server using the communication protocol β. By $\beta(Y_i)$ we denote the number of bits that need to be transferred under β. The server then estimates $\bar{X}$ using the decoding protocol γ applied to the estimates:

$$Y \stackrel{\text{def}}{=} \gamma(Y_1, \dots, Y_n).$$

The objective of this work is to study the trade-off between the (expected) number of bits that need to be communicated and the accuracy of $Y$ as an estimate of $\bar{X}$.
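As a reading aid, the following Python sketch (our own naming, not the paper's code) wires the three protocols together, using the identity encoder and averaging decoder of the examples below, and a placeholder β that simply charges a fixed number of bits per coordinate:

```python
import numpy as np

def alpha_identity(x):
    """Identity encoder (Example 1): no compression; unbiased and independent."""
    return x

def beta_dense(y, r=32):
    """Placeholder communication protocol: r bits per coordinate of the encoded vector."""
    return y.size * r

def gamma_average(ys):
    """Averaging decoder (Example 2)."""
    return sum(ys) / len(ys)

def estimate_mean(X, alpha, beta, gamma):
    Y = [alpha(x) for x in X]              # each node encodes its vector locally
    bits = sum(beta(y) for y in Y)         # total number of bits sent to the server
    return gamma(Y), bits                  # server-side estimate Y of the true mean

X = [np.random.randn(5) for _ in range(4)]
Y_hat, bits = estimate_mean(X, alpha_identity, beta_dense, gamma_average)
print(np.allclose(Y_hat, sum(X) / len(X)), bits)   # exact mean, full cost (n*d*r bits)
```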
In this work we focus on encoders which are unbiased, in the following sense.

Definition 2.1 (Unbiased and Independent Encoder): We say that the encoder α is unbiased if $\mathbb{E}_{\alpha}[\alpha(X_i)] = X_i$ for all $i = 1, 2, \dots, n$. We say that it is independent if $\alpha(X_i)$ is independent from $\alpha(X_j)$ for all $i \neq j$.
Example 1 (Identity Encoder). A trivial example of an encoding protocol is the identity function: $\alpha(X_i) = X_i$. It is both unbiased and independent. This encoder, however, does not lead to any savings in communication.

Other examples of unbiased and independent encoders include the protocols introduced in section 3, as well as other existing techniques [12, 14, 15].
We now formalize the notion of accuracy of estimating $\bar{X}$ via $Y$. Since $Y$ can be random, the notion of accuracy will naturally be probabilistic.

Definition 2.2 (Estimation Error / Mean Squared Error): The mean squared error of protocol (α, γ) is the quantity

$$\mathrm{MSE}_{\alpha,\gamma}(X_1, \dots, X_n) = \mathbb{E}_{\alpha,\gamma}\left[\|Y - \bar{X}\|^2\right] = \mathbb{E}_{\alpha,\gamma}\left[\left\|\gamma(\alpha(X_1), \dots, \alpha(X_n)) - \bar{X}\right\|^2\right].$$
To illustrate the above concept, we now give a few examples:

Example 2 (Averaging Decoder). If γ is the averaging function, i.e., $\gamma(Y_1, \dots, Y_n) = \frac{1}{n}\sum_{i=1}^{n} Y_i$, then

$$\mathrm{MSE}_{\alpha,\gamma}(X_1, \dots, X_n) = \frac{1}{n^2}\,\mathbb{E}_{\alpha}\left[\left\|\sum_{i=1}^{n}\left(\alpha(X_i) - X_i\right)\right\|^2\right].$$
The next example generalizes the identity encoder and averaging decoder.

Example 3 (Linear Encoder and Inverse Linear Decoder). Let $A : \mathbb{R}^d \to \mathbb{R}^d$ be linear and invertible. Then we can set $Y_i = \alpha(X_i) \stackrel{\text{def}}{=} A X_i$ and $\gamma(Y_1, \dots, Y_n) \stackrel{\text{def}}{=} A^{-1}\frac{1}{n}\sum_{i=1}^{n} Y_i$. If $A$ is random, then α and γ are random (e.g., a structured random rotation, see [16]). Note that

$$\gamma(Y_1, \dots, Y_n) = \frac{1}{n}\sum_{i=1}^{n} A^{-1} Y_i = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X},$$

and hence the MSE of (α, γ) is zero.
We shall now prove a simple result for unbiased and independent encoders used in subsequent sections.

Lemma 2.3 (Unbiased and Independent Encoder + Averaging Decoder): If the encoder α is unbiased and independent, and γ is the averaging decoder, then

$$\mathrm{MSE}_{\alpha,\gamma}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\|Y_i - X_i\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}_{\alpha}\left(\alpha(X_i)\right).$$
Proof: Note that $\mathbb{E}_{\alpha}[Y_i] = X_i$ for all $i$. We have

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \mathbb{E}_{\alpha}\left[\|Y - \bar{X}\|^2\right] = \frac{1}{n^2}\,\mathbb{E}_{\alpha}\left[\left\|\sum_{i=1}^{n}\left(Y_i - X_i\right)\right\|^2\right] \stackrel{(*)}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\left\|Y_i - \mathbb{E}_{\alpha}[Y_i]\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}_{\alpha}\left(\alpha(X_i)\right),$$

where (*) follows from unbiasedness and independence.
One may wish to define the encoder as a combination of two or more separate encoders: $\alpha(X_i) = \alpha_2(\alpha_1(X_i))$. See Suresh et al. [13] for an example where $\alpha_1$ is a random rotation and $\alpha_2$ is binary quantization.
3. A FAMILY OF RANDOMIZED ENCODING PROTOCOLS

Let $X_1, \dots, X_n \in \mathbb{R}^d$ be given. We shall write $X_i = (X_i(1), \dots, X_i(d))$ to denote the entries of vector $X_i$. In addition, with each $i$ we also associate a parameter $\mu_i \in \mathbb{R}$. We refer to $\mu_i$ as the center of data at node $i$, or simply as the node center. For now, we assume these parameters are fixed. As a special case, we recover for instance classical binary quantization, see section 5.1. We shall comment on how to choose the parameters optimally in section 6.
We shall define the support of α on node $i$ to be the set $S_i \stackrel{\text{def}}{=} \{j : Y_i(j) \neq \mu_i\}$. We now define two parametric families of randomized encoding protocols. The first results in $S_i$ of random size, the second has $S_i$ of a fixed size.
3.1. Encoding Protocol With Variable-Size Support

With each pair $(i, j)$ we associate a parameter $0 < p_{ij} \leq 1$, representing a probability. The collection of parameters $\{p_{ij}, \mu_i\}$ defines an encoding protocol α as follows:

$$Y_i(j) = \begin{cases} \dfrac{X_i(j)}{p_{ij}} - \dfrac{1 - p_{ij}}{p_{ij}}\,\mu_i & \text{with probability } p_{ij}, \\[1ex] \mu_i & \text{with probability } 1 - p_{ij}. \end{cases} \tag{1}$$
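A possible NumPy implementation of the encoder (1) for a single node is sketched below (function and variable names are ours, not the paper's); the empirical coordinate-wise means illustrate the unbiasedness established in Lemma 3.1 below:

```python
import numpy as np

def encode_variable_support(x, p, mu, rng=None):
    """Encoding protocol (1) for one node: coordinate j becomes
    x[j]/p[j] - (1 - p[j])/p[j] * mu with probability p[j], and mu otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < p                        # which coordinates survive
    y = np.full(x.shape, float(mu))
    y[mask] = x[mask] / p[mask] - (1.0 - p[mask]) / p[mask] * mu
    return y

x = np.array([1.0, -2.0, 3.0, 0.5])
p = np.full(x.size, 0.25)                                 # constant probabilities p_ij
mu = x.mean()                                             # node center set to the mean
samples = np.stack([encode_variable_support(x, p, mu) for _ in range(50_000)])
print(samples.mean(axis=0))                               # approximately equal to x
```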
Remark 2. Enforcing the probabilities to be positive, as opposed to non-negative, leads to vastly simplified notation in what follows. However, it is more natural to allow $p_{ij}$ to be zero, in which case we have $Y_i(j) = \mu_i$ with probability 1. This raises issues such as potential lack of unbiasedness, which can be resolved, but only at the expense of a larger-than-reasonable notational overload.
In the rest of this section, let γ be the averaging decoder (Example 2). Since γ is fixed and deterministic, we shall for simplicity write $\mathbb{E}_{\alpha}[\cdot]$ instead of $\mathbb{E}_{\alpha,\gamma}[\cdot]$. Similarly, we shall write $\mathrm{MSE}_{\alpha}(\cdot)$ instead of $\mathrm{MSE}_{\alpha,\gamma}(\cdot)$.
We now prove two lemmas describing properties of the encoding protocol α. Lemma 3.1 states that the protocol yields an unbiased estimate of the average $\bar{X}$, and Lemma 3.2 provides the expected mean squared error of the estimate.

Lemma 3.1 (Unbiasedness): The encoder α defined in (1) is unbiased. That is, $\mathbb{E}_{\alpha}[\alpha(X_i)] = X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: $\mathbb{E}_{\alpha}[Y] = \bar{X}$.
Proof: Due to linearity of expectation, it is enough to show that $\mathbb{E}_{\alpha}[Y(j)] = \bar{X}(j)$ for all $j$. Since $Y(j) = \frac{1}{n}\sum_{i=1}^{n} Y_i(j)$ and $\bar{X}(j) = \frac{1}{n}\sum_{i=1}^{n} X_i(j)$, it suffices to show that $\mathbb{E}_{\alpha}[Y_i(j)] = X_i(j)$:

$$\mathbb{E}_{\alpha}\left[Y_i(j)\right] = p_{ij}\left(\frac{X_i(j)}{p_{ij}} - \frac{1 - p_{ij}}{p_{ij}}\,\mu_i\right) + (1 - p_{ij})\,\mu_i = X_i(j),$$

and the claim is proved.
Lemma 3.2 (Mean Squared Error): Let $\alpha = \alpha(p_{ij}, \mu_i)$ be the encoder defined in (1). Then

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i,j}\left(\frac{1}{p_{ij}} - 1\right)\left(X_i(j) - \mu_i\right)^2. \tag{2}$$
Proof: Using Lemma 2.3, we have

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\|Y_i - X_i\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\sum_{j=1}^{d}\left(Y_i(j) - X_i(j)\right)^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\mathbb{E}_{\alpha}\left[\left(Y_i(j) - X_i(j)\right)^2\right]. \tag{3}$$

For any $i, j$ we further have

$$\mathbb{E}_{\alpha}\left[\left(Y_i(j) - X_i(j)\right)^2\right] = p_{ij}\left(\frac{X_i(j)}{p_{ij}} - \frac{1 - p_{ij}}{p_{ij}}\,\mu_i - X_i(j)\right)^2 + (1 - p_{ij})\left(\mu_i - X_i(j)\right)^2 = \frac{(1 - p_{ij})^2}{p_{ij}}\left(X_i(j) - \mu_i\right)^2 + (1 - p_{ij})\left(\mu_i - X_i(j)\right)^2 = \frac{1 - p_{ij}}{p_{ij}}\left(X_i(j) - \mu_i\right)^2.$$

It suffices to substitute the above into (3).
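Formula (2) is easy to sanity-check numerically. The sketch below (our own code, with arbitrary data, constant probabilities, and node centers fixed to the per-node means) compares the closed form against a Monte Carlo estimate obtained with the averaging decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, prob, trials = 5, 8, 0.3, 20_000
X = [rng.standard_normal(d) for _ in range(n)]
mu = [x.mean() for x in X]                      # node centers fixed to the mean
X_bar = sum(X) / n

def encode(x, p, m):                            # protocol (1), as sketched in section 3.1
    mask = rng.random(x.shape) < p
    y = np.full(x.shape, float(m))
    y[mask] = x[mask] / p - (1.0 - p) / p * m
    return y

# Closed-form MSE from (2), with p_ij = prob for all i, j.
mse_formula = sum(((1.0 / prob - 1.0) * (X[i] - mu[i]) ** 2).sum() for i in range(n)) / n**2

# Monte Carlo estimate of E ||Y - X_bar||^2 under the averaging decoder.
errs = np.empty(trials)
for t in range(trials):
    Y = [encode(X[i], prob, mu[i]) for i in range(n)]
    errs[t] = np.sum((sum(Y) / n - X_bar) ** 2)

print(mse_formula, errs.mean())                 # the two values should nearly coincide
```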
3.2. Encoding Protocol With Fixed-Size Support

Here we propose an alternative encoding protocol, one with deterministic support size. As we shall see later, this results in deterministic communication cost.

Let $\sigma_k(d)$ denote the set of all subsets of $\{1, 2, \dots, d\}$ containing $k$ elements. The protocol α, with a single integer parameter $k$, works as follows: first, each node $i$ samples $D_i \in \sigma_k(d)$ uniformly at random, and then sets

$$Y_i(j) = \begin{cases} \dfrac{d\,X_i(j)}{k} - \dfrac{d - k}{k}\,\mu_i & \text{if } j \in D_i, \\[1ex] \mu_i & \text{otherwise.} \end{cases} \tag{4}$$

Note that, by design, the size of the support of $Y_i$ is always $k$, i.e., $|S_i| = k$. Naturally, we can expect this protocol to perform practically the same as protocol (1) with $p_{ij} = k/d$ for all $i, j$. Lemma 3.4 indeed suggests this is the case. While this protocol admits a more efficient communication protocol (as we shall see in section 4.4), protocol (1) enjoys a larger parameter space, ultimately leading to better MSE. We comment on this tradeoff in subsequent sections.
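A sketch of the fixed-size-support encoder (4) for one node (again with our own naming, assuming NumPy) is given below; each draw modifies exactly k coordinates, and averaging many draws recovers x, in line with Lemma 3.3 below:

```python
import numpy as np

def encode_fixed_support(x, k, mu, rng=None):
    """Encoding protocol (4) for one node: choose a uniformly random set D_i of k
    coordinates, map those to d*x[j]/k - (d - k)/k * mu, and set the rest to mu."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    D = rng.choice(d, size=k, replace=False)       # D_i drawn uniformly from sigma_k(d)
    y = np.full(d, float(mu))
    y[D] = d * x[D] / k - (d - k) / k * mu
    return y

x = np.array([1.0, -2.0, 3.0, 0.5])
samples = np.stack([encode_fixed_support(x, k=1, mu=x.mean()) for _ in range(50_000)])
print(samples.mean(axis=0))                        # approximately equal to x
```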
As for the data-dependent protocol, we prove basic properties. The proofs are similar to those of Lemmas 3.1 and 3.2, and we defer them to Appendix A.

Lemma 3.3 (Unbiasedness): The encoder α defined in (4) is unbiased. That is, $\mathbb{E}_{\alpha}[\alpha(X_i)] = X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: $\mathbb{E}_{\alpha}[Y] = \bar{X}$.
Lemma 3.4 (Mean Squared Error): Let $\alpha = \alpha(k)$ be the encoder defined in (4). Then

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\frac{d - k}{k}\left(X_i(j) - \mu_i\right)^2. \tag{5}$$
4. COMMUNICATION PROTOCOLS

Having defined the encoding protocols α, we need to specify the way the encoded vectors $Y_i = \alpha(X_i)$, for $i = 1, 2, \dots, n$, are communicated to the server. Given a specific communication protocol β, we write $\beta(Y_i)$ to denote the (expected) number of bits that are communicated by node $i$ to the server. Since $Y_i = \alpha(X_i)$ is in general not deterministic, $\beta(Y_i)$ can be a random variable.

Definition 4.1 (Communication Cost): The communication cost of communication protocol β under randomized encoding α is the total expected number of bits transmitted to the server:

$$C_{\alpha,\beta}(X_1, \dots, X_n) = \mathbb{E}_{\alpha}\left[\sum_{i=1}^{n}\beta(\alpha(X_i))\right]. \tag{6}$$
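Definition 4.1 only requires β to map an encoded vector to a bit count, so the cost (6) can be estimated by simulation for any candidate protocol. The sketch below uses a purely illustrative β of our own (transmit the node center, then a value and an index for every coordinate that differs from it); it is not one of the communication protocols analyzed in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, prob, trials = 4, 1000, 32, 0.05, 2_000
X = [rng.standard_normal(d) for _ in range(n)]

def encode(x, p, m):                        # protocol (1) with constant p_ij = p
    mask = rng.random(x.shape) < p
    y = np.full(x.shape, float(m))
    y[mask] = x[mask] / p - (1.0 - p) / p * m
    return y

def beta_illustrative(y, m):
    """Illustrative beta (not from the paper): r bits for the center m, plus r bits per
    transmitted value and ceil(log2(d)) bits per index for each coordinate != m."""
    idx_bits = int(np.ceil(np.log2(y.size)))
    support = int(np.count_nonzero(y != m))
    return r + support * (r + idx_bits)

cost = np.mean([sum(beta_illustrative(encode(x, prob, x.mean()), x.mean()) for x in X)
                for _ in range(trials)])
print(f"estimated C_alpha,beta: {cost:.0f} bits vs. naive n*d*r = {n * d * r} bits")
```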
Given $Y_i$, a good communication protocol is able to encode $Y_i = \alpha(X_i)$ using a few bits only. Let $r$ denote the number of bits used
