Nearly-Linear Time and Massively Parallel Algorithms for $k$ -anonymity

Kevin Aydin, Honghao Lin, David Woodruff, Peilin Zhong

K-anonymity Nearly-linear time Massively parallel

Abstract

k

-anonymity is a widely-used privacy-preserving concept that ensures each record in a dataset is indistinguishable from at least

k - 1

other records. In this paper, we revisit

k

-anonymity by suppression and give an

O (k)

-approximation algorithm with a nearly-linear runtime of

\tilde{O} (n d + n^{1 + 1/ C^{2}} / k^{1/ C^{2}})

for an arbitrary constant

C

, where

n

is the number of records and

d

is the number of attributes. Previous algorithms with provable guarantees either:

achieve the same $O (k)$ approximation ratio but require at least $O (n^{2} k)$ runtime, or
provide a better $O (lo g k)$ approximation ratio at the cost of an impractical $O (n^{2 k})$ worst-case runtime for general $d$ and $k$ .

Our algorithm extends to the Massively Parallel Computation (MPC) model, where it can be adapted into an MPC algorithm requiring

\tilde{O} (lo g^{1 + ϵ} n)

rounds and total space

O (n^{1 + 1/ C^{2}} (d + k))

. Empirically, we also demonstrate that our algorithmic ideas can be adapted to existing heuristic methods, leading to significant speed-ups while preserving comparable performance.

Although PS07 introduced improvements to achieve more practical runtimes for their

O (lo g k)

-approximation algorithm, its worst-case runtime remains

O (n^{2 k})

. A natural question arises: can we develop an algorithm with an

o (k)

approximation ratio and a polynomial runtime?

We investigate the single-point

k

-anonymity problem, where the goal is to select

k - 1

additional records to make a given record indistinguishable. Surprisingly, assuming the dense vs random conjecture in complexity theory, we show that for

n = k^{c}

, no algorithm can achieve a

k^{1 - O (1/ c)}

approximation in

poly (k)

time. This provides evidence of the inherent hardness of the

k

-anonymity problem.