ORIGINAL RESEARCH
published: 18 December 2018
doi: 10.3389/fams.2018.00062

Randomized Distributed Mean Estimation: Accuracy vs. Communication

Jakub Konečný¹* and Peter Richtárik¹,²,³

¹ School of Mathematics, The University of Edinburgh, Edinburgh, United Kingdom
² Moscow Institute of Physics and Technology, Dolgoprudny, Russia
³ King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Edited by: Yiming Ying, University at Albany, United States
Reviewed by: Shiyin Qin, Beihang University, China; Shao-Bo Lin, Wenzhou University, China
*Correspondence: Jakub Konečný, konkey@google.com

Specialty section: This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

Received: 11 October 2018
Accepted: 28 November 2018
Published: 18 December 2018

Citation: Konečný J and Richtárik P (2018) Randomized Distributed Mean Estimation: Accuracy vs. Communication. Front. Appl. Math. Stat. 4:62. doi: 10.3389/fams.2018.00062
We consider the problem of estimating the arithmetic average of a finite collection of real vectors stored in a distributed fashion across several compute nodes subject to a communication budget constraint. Our analysis does not rely on any statistical assumptions about the source of the vectors. This problem arises as a subproblem in many applications, including reduce-all operations within algorithms for distributed and federated optimization and learning. We propose a flexible family of randomized algorithms exploring the trade-off between expected communication cost and estimation error. Our family contains the full-communication and zero-error method on one extreme, and an $\epsilon$-bit communication and $O(1/(\epsilon n))$ error method on the opposite extreme. In the special case where we communicate, in expectation, a single bit per coordinate of each vector, we improve upon existing results by obtaining $O(r/n)$ error, where $r$ is the number of bits used to represent a floating point value.

Keywords: communication efficiency, distributed mean estimation, accuracy-communication tradeoff, gradient compression, quantization
1. INTRODUCTION

We address the problem of estimating the arithmetic mean of $n$ vectors, $X_1, \dots, X_n \in \mathbb{R}^d$, stored in a distributed fashion across $n$ compute nodes, subject to a constraint on the communication cost.

In particular, we consider a star network topology with a single server at the centre and $n$ nodes connected to it. All nodes send an encoded (possibly via a lossy randomized transformation) version of their vector to the server, after which the server performs a decoding operation to estimate the true mean

$$\bar{X} \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} X_i.$$

The purpose of the encoding operation is to compress the vector so as to save on communication cost, which is typically the bottleneck in practical applications.

To better illustrate the setup, consider the naive approach in which all nodes send the vectors without performing any encoding operation, followed by the application of a simple averaging decoder by the server. This results in zero estimation error at the expense of the maximum communication cost of $ndr$ bits, where $r$ is the number of bits needed to communicate a single floating point entry/coordinate of $X_i$.
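To make this baseline concrete, here is a minimal sketch (ours, not from the paper) of the naive scheme in Python/NumPy, with purely illustrative values of n, d, and r:

```python
import numpy as np

n, d, r = 16, 1_000_000, 32   # illustrative: 16 nodes, 10^6 coordinates, 32-bit floats

# Vectors X_i held by the nodes (random data, purely for illustration).
X = [np.random.randn(d).astype(np.float32) for _ in range(n)]

# Naive approach: every node ships its full vector; the server simply averages.
X_bar = sum(X) / n                    # zero estimation error
naive_bits = n * d * r                # maximum communication cost: n*d*r bits
print(f"naive communication cost: {naive_bits / 8 / 2**20:.1f} MiB")
```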

This operation appears as a computational primitive in numerous cases, and the communication cost can be reduced at the expense of accuracy. Our proposal for balancing accuracy and communication is in practice relevant for any application that uses the MPI_Gather or MPI_Allgather routines [1], or their conceptual variants, for efficient implementation and can tolerate inexactness in computation, such as many algorithms for distributed optimization.
1.1. Background and Contributions

The distributed mean estimation problem was recently studied in a statistical framework where it is assumed that the vectors $X_i$ are independent and identically distributed samples from some specific underlying distribution. In such a setup, the goal is to estimate the true mean of the underlying distribution [2–5]. These works formulate lower and upper bounds on the communication cost needed to achieve the minimax optimal estimation error.
In contrast, we do not make any statistical assumptions on the source of the vectors, and study the trade-off between expected communication cost and the mean squared error of the estimate. Arguably, this setup is a more robust and accurate model of the distributed mean estimation problems arising as subproblems in applications such as reduce-all operations within algorithms for distributed and federated optimization [6–10]. In these applications, the averaging operations need to be done repeatedly throughout the iterations of a master learning/optimization algorithm, and the vectors $\{X_i\}$ correspond to updates to a global model/variable. In such cases, the vectors evolve throughout the iterative process in a complicated pattern, typically approaching zero as the master algorithm converges to optimality. Hence, their statistical properties change over time, which means that fixed statistical assumptions are not satisfied in practice.
For instance, when training a deep neural network model in a distributed environment, the vector $X_i$ corresponds to a stochastic gradient based on a minibatch of data stored on node $i$. In this setup we do not have any useful prior statistical knowledge about the high-dimensional vectors to be aggregated. It has recently been observed that when communication cost is high, which is typically the case for commodity clusters, and even more so in a federated optimization framework, it can be very useful to sacrifice estimation accuracy in favor of reduced communication [11, 12].
In this paper we propose a parametric family of randomized methods for estimating the mean $\bar{X}$, with parameters being a set of probabilities $p_{ij}$ for $i = 1, \dots, n$ and $j = 1, 2, \dots, d$, and node centers $\mu_i \in \mathbb{R}$ for $i = 1, 2, \dots, n$. The exact meaning of these parameters is explained in section 3. By varying the probabilities, at one extreme, we recover the exact method described above, enjoying zero estimation error at the expense of full communication cost. At the opposite extreme are methods with arbitrarily small expected communication cost, which is achieved at the expense of an exploding estimation error. Practical methods appear somewhere on the continuum between these two extremes, depending on the specific requirements of the application at hand. Suresh et al. [13] propose a method combining a pre-processing step via a random structured rotation, followed by randomized binary quantization. Their quantization protocol arises as a suboptimal special case of our parametric family of methods¹.

¹ See Remark 4.
To illustrate our results, consider the special case presented in Example 7, in which we choose to communicate a single bit per element of $X_i$ only. We then obtain an $O\left(\frac{r}{n} R\right)$ bound on the mean squared error, where $r$ is the number of bits used to represent a floating point value, and $R = \frac{1}{n}\sum_{i=1}^{n} \|X_i - \mu_i \mathbf{1}\|^2$, with $\mu_i \in \mathbb{R}$ being the average of the elements of $X_i$ and $\mathbf{1}$ the all-ones vector in $\mathbb{R}^d$. Note that this bound improves upon the performance of the method of Suresh et al. [13] in two aspects. First, the bound is independent of $d$, improving from logarithmic dependence, as stated in Remark 4 in detail. Further, due to a preprocessing rotation step, their method requires $O(d \log d)$ time to be implemented on each node, while our method is linear in $d$. This and other special cases are summarized in Table 1 in section 5.
While the above already improves upon the state of the art, the improved results are in fact obtained for a suboptimal choice of the parameters of our method (constant probabilities $p_{ij}$, and node centers fixed to the mean $\mu_i$). One can decrease the MSE further by optimizing over the probabilities and/or node centers (see section 6). However, apart from a very low communication cost regime in which we have a closed-form expression for the optimal probabilities, the problem needs to be solved numerically, and hence we do not have expressions for how much improvement is possible. We illustrate the effect of fixed and optimal probabilities on the trade-off between communication cost and MSE experimentally on a few selected datasets in section 6 (see Figure 1).
Remark 1. Since the initial version of this work, an updated version of Suresh et al. [13] contains a rate similar to Example 7, using variable length coding. That work also formulates lower bounds, which are attained by both their and our results. Other works that were published since, such as [14, 15], propose algorithms that can also be represented as a particular choice of protocols α, β, γ, demonstrating the versatility of our proposal.
1.2. Outline

In section 2 we formalize the concepts of encoding and decoding protocols. In section 3 we describe a parametric family of randomized (and unbiased) encoding protocols and give a simple formula for the mean squared error. Subsequently, in section 4 we formalize the notion of communication cost, and describe several communication protocols which are optimal under different circumstances. We give simple instantiations of our protocol in section 5, illustrating the trade-off between communication cost and accuracy. In section 6 we address the question of the optimal choice of parameters of our protocol. Finally, in section 7 we comment on possible extensions we leave for future work.
2. THREE PROTOCOLS

In this work we consider (randomized) encoding protocols α, communication protocols β, and decoding protocols γ, using which the averaging is performed inexactly as follows. Node $i$ computes a (possibly stochastic) estimate of $X_i$ using the encoding protocol, which we denote $Y_i = \alpha(X_i) \in \mathbb{R}^d$, and sends it to the server using the communication protocol β. By $\beta(Y_i)$ we denote the number of bits that need to be transferred under β. The server then estimates $\bar{X}$ using the decoding protocol γ applied to the estimates:

$$Y \stackrel{\text{def}}{=} \gamma(Y_1, \dots, Y_n).$$

The objective of this work is to study the trade-off between the (expected) number of bits that need to be communicated and the accuracy of $Y$ as an estimate of $\bar{X}$.
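As a reading aid, the following Python sketch (our own naming, not the paper's code) wires the three protocols together, using the identity encoder and averaging decoder of the examples below, and a placeholder β that simply charges a fixed number of bits per coordinate:

```python
import numpy as np

def alpha_identity(x):
    """Identity encoder (Example 1): no compression; unbiased and independent."""
    return x

def beta_dense(y, r=32):
    """Placeholder communication protocol: r bits per coordinate of the encoded vector."""
    return y.size * r

def gamma_average(ys):
    """Averaging decoder (Example 2)."""
    return sum(ys) / len(ys)

def estimate_mean(X, alpha, beta, gamma):
    Y = [alpha(x) for x in X]              # each node encodes its vector locally
    bits = sum(beta(y) for y in Y)         # total number of bits sent to the server
    return gamma(Y), bits                  # server-side estimate Y of the true mean

X = [np.random.randn(5) for _ in range(4)]
Y_hat, bits = estimate_mean(X, alpha_identity, beta_dense, gamma_average)
print(np.allclose(Y_hat, sum(X) / len(X)), bits)   # exact mean, full cost (n*d*r bits)
```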
In this work we focus on encoders which are unbiased, in the following sense.

Definition 2.1 (Unbiased and Independent Encoder): We say that the encoder α is unbiased if $\mathbb{E}_{\alpha}[\alpha(X_i)] = X_i$ for all $i = 1, 2, \dots, n$. We say that it is independent if $\alpha(X_i)$ is independent from $\alpha(X_j)$ for all $i \neq j$.
Example 1 (Identity Encoder). A trivial example of an encoding protocol is the identity function: $\alpha(X_i) = X_i$. It is both unbiased and independent. This encoder, however, does not lead to any savings in communication.

Other examples of unbiased and independent encoders include the protocols introduced in section 3, as well as other existing techniques [12, 14, 15].
We now formalize the notion of accuracy of estimating $\bar{X}$ via $Y$. Since $Y$ can be random, the notion of accuracy will naturally be probabilistic.

Definition 2.2 (Estimation Error / Mean Squared Error): The mean squared error of protocol (α, γ) is the quantity

$$\mathrm{MSE}_{\alpha,\gamma}(X_1, \dots, X_n) = \mathbb{E}_{\alpha,\gamma}\left[\|Y - \bar{X}\|^2\right] = \mathbb{E}_{\alpha,\gamma}\left[\left\|\gamma(\alpha(X_1), \dots, \alpha(X_n)) - \bar{X}\right\|^2\right].$$
To illustrate the above concept, we now give a few examples:

Example 2 (Averaging Decoder). If γ is the averaging function, i.e., $\gamma(Y_1, \dots, Y_n) = \frac{1}{n}\sum_{i=1}^{n} Y_i$, then

$$\mathrm{MSE}_{\alpha,\gamma}(X_1, \dots, X_n) = \frac{1}{n^2}\,\mathbb{E}_{\alpha}\left[\left\|\sum_{i=1}^{n}\left(\alpha(X_i) - X_i\right)\right\|^2\right].$$
The next example generalizes the identity encoder and averaging decoder.

Example 3 (Linear Encoder and Inverse Linear Decoder). Let $A : \mathbb{R}^d \to \mathbb{R}^d$ be linear and invertible. Then we can set $Y_i = \alpha(X_i) \stackrel{\text{def}}{=} A X_i$ and $\gamma(Y_1, \dots, Y_n) \stackrel{\text{def}}{=} A^{-1}\frac{1}{n}\sum_{i=1}^{n} Y_i$. If $A$ is random, then α and γ are random (e.g., a structured random rotation, see [16]). Note that

$$\gamma(Y_1, \dots, Y_n) = \frac{1}{n}\sum_{i=1}^{n} A^{-1} Y_i = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X},$$

and hence the MSE of (α, γ) is zero.
We shall now prove a simple result for unbiased and independent encoders used in subsequent sections.

Lemma 2.3 (Unbiased and Independent Encoder + Averaging Decoder): If the encoder α is unbiased and independent, and γ is the averaging decoder, then

$$\mathrm{MSE}_{\alpha,\gamma}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\|Y_i - X_i\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}_{\alpha}\left(\alpha(X_i)\right).$$
Proof: Note that $\mathbb{E}_{\alpha}[Y_i] = X_i$ for all $i$. We have

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \mathbb{E}_{\alpha}\left[\|Y - \bar{X}\|^2\right] = \frac{1}{n^2}\,\mathbb{E}_{\alpha}\left[\left\|\sum_{i=1}^{n}\left(Y_i - X_i\right)\right\|^2\right] \stackrel{(*)}{=} \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\left\|Y_i - \mathbb{E}_{\alpha}[Y_i]\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}_{\alpha}\left(\alpha(X_i)\right),$$

where (*) follows from unbiasedness and independence.
One may wish to define the encoder as a combination of two or more separate encoders: $\alpha(X_i) = \alpha_2(\alpha_1(X_i))$. See Suresh et al. [13] for an example where $\alpha_1$ is a random rotation and $\alpha_2$ is binary quantization.
3. A FAMILY OF RANDOMIZED ENCODING PROTOCOLS

Let $X_1, \dots, X_n \in \mathbb{R}^d$ be given. We shall write $X_i = (X_i(1), \dots, X_i(d))$ to denote the entries of vector $X_i$. In addition, with each $i$ we also associate a parameter $\mu_i \in \mathbb{R}$. We refer to $\mu_i$ as the center of data at node $i$, or simply as the node center. For now, we assume these parameters are fixed. As a special case, we recover for instance classical binary quantization, see section 5.1. We shall comment on how to choose the parameters optimally in section 6.
We shall define the support of α on node $i$ to be the set $S_i \stackrel{\text{def}}{=} \{j : Y_i(j) \neq \mu_i\}$. We now define two parametric families of randomized encoding protocols. The first results in $S_i$ of random size, the second has $S_i$ of a fixed size.
3.1. Encoding Protocol With Variable-Size Support

With each pair $(i, j)$ we associate a parameter $0 < p_{ij} \leq 1$, representing a probability. The collection of parameters $\{p_{ij}, \mu_i\}$ defines an encoding protocol α as follows:

$$Y_i(j) = \begin{cases} \dfrac{X_i(j)}{p_{ij}} - \dfrac{1 - p_{ij}}{p_{ij}}\,\mu_i & \text{with probability } p_{ij}, \\[1ex] \mu_i & \text{with probability } 1 - p_{ij}. \end{cases} \tag{1}$$
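A possible NumPy implementation of the encoder (1) for a single node is sketched below (function and variable names are ours, not the paper's); the empirical coordinate-wise means illustrate the unbiasedness established in Lemma 3.1 below:

```python
import numpy as np

def encode_variable_support(x, p, mu, rng=None):
    """Encoding protocol (1) for one node: coordinate j becomes
    x[j]/p[j] - (1 - p[j])/p[j] * mu with probability p[j], and mu otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < p                        # which coordinates survive
    y = np.full(x.shape, float(mu))
    y[mask] = x[mask] / p[mask] - (1.0 - p[mask]) / p[mask] * mu
    return y

x = np.array([1.0, -2.0, 3.0, 0.5])
p = np.full(x.size, 0.25)                                 # constant probabilities p_ij
mu = x.mean()                                             # node center set to the mean
samples = np.stack([encode_variable_support(x, p, mu) for _ in range(50_000)])
print(samples.mean(axis=0))                               # approximately equal to x
```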
Remark 2. Enforcing the probabilities to be positive, as opposed to non-negative, leads to vastly simplified notation in what follows. However, it is more natural to allow $p_{ij}$ to be zero, in which case we have $Y_i(j) = \mu_i$ with probability 1. This raises issues such as potential lack of unbiasedness, which can be resolved, but only at the expense of a larger-than-reasonable notational overload.
In the rest of this section, let γ be the averaging decoder (Example 2). Since γ is fixed and deterministic, we shall for simplicity write $\mathbb{E}_{\alpha}[\cdot]$ instead of $\mathbb{E}_{\alpha,\gamma}[\cdot]$. Similarly, we shall write $\mathrm{MSE}_{\alpha}(\cdot)$ instead of $\mathrm{MSE}_{\alpha,\gamma}(\cdot)$.
We now prove two lemmas describing properties of the encoding protocol α. Lemma 3.1 states that the protocol yields an unbiased estimate of the average $\bar{X}$, and Lemma 3.2 provides the expected mean squared error of the estimate.

Lemma 3.1 (Unbiasedness): The encoder α defined in (1) is unbiased. That is, $\mathbb{E}_{\alpha}[\alpha(X_i)] = X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: $\mathbb{E}_{\alpha}[Y] = \bar{X}$.
Proof: Due to linearity of expectation, it is enough to show that $\mathbb{E}_{\alpha}[Y(j)] = \bar{X}(j)$ for all $j$. Since $Y(j) = \frac{1}{n}\sum_{i=1}^{n} Y_i(j)$ and $\bar{X}(j) = \frac{1}{n}\sum_{i=1}^{n} X_i(j)$, it suffices to show that $\mathbb{E}_{\alpha}[Y_i(j)] = X_i(j)$:

$$\mathbb{E}_{\alpha}\left[Y_i(j)\right] = p_{ij}\left(\frac{X_i(j)}{p_{ij}} - \frac{1 - p_{ij}}{p_{ij}}\,\mu_i\right) + (1 - p_{ij})\,\mu_i = X_i(j),$$

and the claim is proved.
Lemma 3.2 (Mean Squared Error): Let $\alpha = \alpha(p_{ij}, \mu_i)$ be the encoder defined in (1). Then

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i,j}\left(\frac{1}{p_{ij}} - 1\right)\left(X_i(j) - \mu_i\right)^2. \tag{2}$$
Proof: Using Lemma 2.3, we have

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\|Y_i - X_i\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}_{\alpha}\left[\sum_{j=1}^{d}\left(Y_i(j) - X_i(j)\right)^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\mathbb{E}_{\alpha}\left[\left(Y_i(j) - X_i(j)\right)^2\right]. \tag{3}$$

For any $i, j$ we further have

$$\mathbb{E}_{\alpha}\left[\left(Y_i(j) - X_i(j)\right)^2\right] = p_{ij}\left(\frac{X_i(j)}{p_{ij}} - \frac{1 - p_{ij}}{p_{ij}}\,\mu_i - X_i(j)\right)^2 + (1 - p_{ij})\left(\mu_i - X_i(j)\right)^2 = \frac{(1 - p_{ij})^2}{p_{ij}}\left(X_i(j) - \mu_i\right)^2 + (1 - p_{ij})\left(\mu_i - X_i(j)\right)^2 = \frac{1 - p_{ij}}{p_{ij}}\left(X_i(j) - \mu_i\right)^2.$$

It suffices to substitute the above into (3).
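Formula (2) is easy to sanity-check numerically. The sketch below (our own code, with arbitrary data, constant probabilities, and node centers fixed to the per-node means) compares the closed form against a Monte Carlo estimate obtained with the averaging decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, prob, trials = 5, 8, 0.3, 20_000
X = [rng.standard_normal(d) for _ in range(n)]
mu = [x.mean() for x in X]                      # node centers fixed to the mean
X_bar = sum(X) / n

def encode(x, p, m):                            # protocol (1), as sketched in section 3.1
    mask = rng.random(x.shape) < p
    y = np.full(x.shape, float(m))
    y[mask] = x[mask] / p - (1.0 - p) / p * m
    return y

# Closed-form MSE from (2), with p_ij = prob for all i, j.
mse_formula = sum(((1.0 / prob - 1.0) * (X[i] - mu[i]) ** 2).sum() for i in range(n)) / n**2

# Monte Carlo estimate of E ||Y - X_bar||^2 under the averaging decoder.
errs = np.empty(trials)
for t in range(trials):
    Y = [encode(X[i], prob, mu[i]) for i in range(n)]
    errs[t] = np.sum((sum(Y) / n - X_bar) ** 2)

print(mse_formula, errs.mean())                 # the two values should nearly coincide
```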
3.2. Encoding Protocol With Fixed-Size Support

Here we propose an alternative encoding protocol, one with deterministic support size. As we shall see later, this results in deterministic communication cost.

Let $\sigma_k(d)$ denote the set of all subsets of $\{1, 2, \dots, d\}$ containing $k$ elements. The protocol α, with a single integer parameter $k$, works as follows: first, each node $i$ samples $D_i \in \sigma_k(d)$ uniformly at random, and then sets

$$Y_i(j) = \begin{cases} \dfrac{d\,X_i(j)}{k} - \dfrac{d - k}{k}\,\mu_i & \text{if } j \in D_i, \\[1ex] \mu_i & \text{otherwise.} \end{cases} \tag{4}$$

Note that, by design, the size of the support of $Y_i$ is always $k$, i.e., $|S_i| = k$. Naturally, we can expect this protocol to perform practically the same as protocol (1) with $p_{ij} = k/d$ for all $i, j$. Lemma 3.4 indeed suggests this is the case. While this protocol admits a more efficient communication protocol (as we shall see in section 4.4), protocol (1) enjoys a larger parameter space, ultimately leading to better MSE. We comment on this tradeoff in subsequent sections.
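A sketch of the fixed-size-support encoder (4) for one node (again with our own naming, assuming NumPy) is given below; each draw modifies exactly k coordinates, and averaging many draws recovers x, in line with Lemma 3.3 below:

```python
import numpy as np

def encode_fixed_support(x, k, mu, rng=None):
    """Encoding protocol (4) for one node: choose a uniformly random set D_i of k
    coordinates, map those to d*x[j]/k - (d - k)/k * mu, and set the rest to mu."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    D = rng.choice(d, size=k, replace=False)       # D_i drawn uniformly from sigma_k(d)
    y = np.full(d, float(mu))
    y[D] = d * x[D] / k - (d - k) / k * mu
    return y

x = np.array([1.0, -2.0, 3.0, 0.5])
samples = np.stack([encode_fixed_support(x, k=1, mu=x.mean()) for _ in range(50_000)])
print(samples.mean(axis=0))                        # approximately equal to x
```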
As for the data-dependent protocol, we prove basic properties. The proofs are similar to those of Lemmas 3.1 and 3.2, and we defer them to Appendix A.

Lemma 3.3 (Unbiasedness): The encoder α defined in (4) is unbiased. That is, $\mathbb{E}_{\alpha}[\alpha(X_i)] = X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: $\mathbb{E}_{\alpha}[Y] = \bar{X}$.
Lemma 3.4 (Mean Squared Error): Let $\alpha = \alpha(k)$ be the encoder defined in (4). Then

$$\mathrm{MSE}_{\alpha}(X_1, \dots, X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\frac{d - k}{k}\left(X_i(j) - \mu_i\right)^2. \tag{5}$$
4. COMMUNICATION PROTOCOLS

Having defined the encoding protocols α, we need to specify the way the encoded vectors $Y_i = \alpha(X_i)$, for $i = 1, 2, \dots, n$, are communicated to the server. Given a specific communication protocol β, we write $\beta(Y_i)$ to denote the (expected) number of bits that are communicated by node $i$ to the server. Since $Y_i = \alpha(X_i)$ is in general not deterministic, $\beta(Y_i)$ can be a random variable.

Definition 4.1 (Communication Cost): The communication cost of communication protocol β under randomized encoding α is the total expected number of bits transmitted to the server:

$$C_{\alpha,\beta}(X_1, \dots, X_n) = \mathbb{E}_{\alpha}\left[\sum_{i=1}^{n}\beta(\alpha(X_i))\right]. \tag{6}$$
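Definition 4.1 only requires β to map an encoded vector to a bit count, so the cost (6) can be estimated by simulation for any candidate protocol. The sketch below uses a purely illustrative β of our own (transmit the node center, then a value and an index for every coordinate that differs from it); it is not one of the communication protocols analyzed in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, prob, trials = 4, 1000, 32, 0.05, 2_000
X = [rng.standard_normal(d) for _ in range(n)]

def encode(x, p, m):                        # protocol (1) with constant p_ij = p
    mask = rng.random(x.shape) < p
    y = np.full(x.shape, float(m))
    y[mask] = x[mask] / p - (1.0 - p) / p * m
    return y

def beta_illustrative(y, m):
    """Illustrative beta (not from the paper): r bits for the center m, plus r bits per
    transmitted value and ceil(log2(d)) bits per index for each coordinate != m."""
    idx_bits = int(np.ceil(np.log2(y.size)))
    support = int(np.count_nonzero(y != m))
    return r + support * (r + idx_bits)

cost = np.mean([sum(beta_illustrative(encode(x, prob, x.mean()), x.mean()) for x in X)
                for _ in range(trials)])
print(f"estimated C_alpha,beta: {cost:.0f} bits vs. naive n*d*r = {n * d * r} bits")
```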
Given $Y_i$, a good communication protocol is able to encode $Y_i = \alpha(X_i)$ using a few bits only. Let $r$ denote the number of bits used
