本地化差分隐私随机扰动（回推）机制代码详解（K-RR）-EW帮帮网

K-RR主要克服了随机响应技术（RR）针对二值变量这一问题，对于变量中含有k个候选值的情况，可以直接进行随机响应。对于任意的输入R，其输出R'如下图公式：

1.概率生成，用于取不同分布的数据。生成一些比较简单的分布用于实验，帮助初学者可以直接上手运行。主要是平均，正态，递减，偏斜四种。

def generate_distribution(distribution_name, domain):
    if distribution_name == "uniform":
        return np.full(shape=domain, fill_value=1.0 / domain)
    elif distribution_name == "gauss":
        u = domain / 2
        sigma = domain / 6
        x = np.arange(1, domain+1)
        fx = 1 / (np.sqrt(2*np.pi) * sigma) * np.e**(- (x-u)**2 / (2 * sigma**2))
        return fx / sum(fx)
    elif distribution_name == "exp":#返回一个domin的概率数组，并且单调递减且和为1
        lmda = 2
        prob_list = np.array([lmda * np.e**(-lmda * x) for x in np.arange(1, domain+1)/10])
        return prob_list / sum(prob_list)
    elif distribution_name == "one":
        return [1,0,0,0]
    else:
        raise Exception("the distribution is not contained")

2.数据离散化操作。因为随机扰动（RR）需要的是离散型，如果数据是连续行需要进行离散化操作。思路很简单，离1,近就赋值1反之为0，代码如下：

def discretization(value, lower=0, upper=1):
    if value > upper or value < lower:
        raise Exception("the range of value is not valid in Function @Func: discretization")
    p = (value - lower) / (upper - lower)#v决定p，若v = 1，p = 1，必输出upper
    rnd = np.random.random()
    return upper if rnd < p else lower

3.分桶统计数据，这里主要统计一下原始数据的分布。代码如下：

def generate_bucket(n, bucket_size, distribution_name):
    distribution = generate_distribution(distribution_name, domain=bucket_size)#len = 4
    bucket_list = np.random.choice(range(bucket_size), n, p=distribution)#100，在0到4之间
    hist = np.histogram(bucket_list, bins=range(bucket_size+1))#返回数据统计，这里是10000划分成4份，1,2,3，...每个点出现了多少次
    return bucket_list, hist[0]

4.K-RR实现（1）。数据离散化后，K-RR的扰动思路就是按一定概率p返回原值，1-p的概率返回扰动到其他值。

#k-RR实现，p/(（1-p）*1/k-1)suit DP
def k_random_response(value, values, epsilon):
    if not isinstance(values, list):
        raise Exception("The values should be list")
    if value not in values:
        raise Exception("Errors in k-random response")
    p = np.e ** epsilon / (np.e ** epsilon + len(values) - 1)
    if np.random.random() < p:
        return value
    x = values[np.random.randint(low=0, high=len(values))]#返回0到len(values)的任一一个数
    while x == value:#避免返回一样的值
        x = values[np.random.randint(low=0, high=len(values))]
    return x

4.K-RR的实现（2）。其实逻辑一样的，只是这里直接用choice函数，根据函数分布直接选出，看起来会简洁易懂。

def k_random_response_news(item, k, epsilon):
    if not item < k:
        raise Exception("the input domain is wrong, item = %d, k = %d." % (item, k))
    p_min = 1 / (np.e ** epsilon + (k - 1)*k/2)
    p_max = np.e ** epsilon / (np.e ** epsilon + (k - 1)*k/2)
    respond_probability = np.full(shape=k, fill_value=p_min)#构造一个k维，里面都是p1的数组
    respond_probability[item] = p_max
    perturbed_item = np.random.choice(a=range(k), p=respond_probability)#按照p里面的概率输出a
    return perturbed_item

5.回推函数，扰动完回推到原来的值。

推导过程：推出

然后根据下面公式反推，其中。

即。

def aggregate_histogram(k,private_bucket_list):
    private_hist = np.zeros(shape=k)
    p_l = 1 / (np.e ** epsilon + k - 1)
    p_h = np.e ** epsilon / (np.e ** epsilon + k - 1)
    for private_bucket in private_bucket_list:
        private_hist[private_bucket] += 1#统计private_bucket_list里面0-100出现的次数
    estimate_hist = (private_hist - len(private_bucket_list) * p_l) / (p_h - p_l)#通过回推原始的值，s+p*N-N/2*P-1
    return estimate_hist

6.运行查看效果：

bucket_size = 4
epsilon = 1
n = 10000
values = [0,1,2,3]
bucket_list, true_hist = generate_bucket(n=n, bucket_size=bucket_size, distribution_name='uniform')
print("this is buckets: ", bucket_list)#10000的一个0-3的数组,这是原始数据
print("this is true hist: ", true_hist)#10000里面出现了多少个数的统计，真实统计
private_bucket_list = [k_random_response(item,values,epsilon) for item in bucket_list]
estimate_hist = aggregate_histogram(bucket_size,private_bucket_list)
print("this is krr estimate_hist", estimate_hist)

7.运行效果

this is buckets:  [0 3 1 ... 1 0 1]
this is true hist:  [2467 2556 2465 2512]
this is krr estimate_hist [2766.2325462  2373.53954056 2413.47442249 2446.75349076]

8.介绍一下K-RR的原理

随机响应技术主要包括两个步骤：扰动性统计和校正．

为了具体介绍随机响应技术，下面首先引入一个具体的问题场景．假设有n个用户，其中艾滋病患者的真实比例为乃但我们并不知道．我们希望对其比例君进行统计．于是发起一个敏感的问题．‘‘你是否为艾滋病患者?”。每个用户对此进行响应，第f个用户的答案置为是或否，但出于隐私性考虑，用户不会直接响应真实答案．假设其借助于一枚非均匀的硬币来给出答案，其正面向上的概率为P，反面向上的概率为1—p．抛出该硬币，若正面向上，则回答真实答案，反面向上，则回答相反的答案．首先，进行扰动性统计．利用上述扰动方法对n个用户的回答进行统计，可以得到艾滋病患者人数的统计值．假设统计结果中，回答“是”的人数为，l。，则回答“否”的人数为阼一n，I显然，按照上述统计，回答“是”和“否”的用户比例如下：

显然，上述统计比例并非真实比例的无偏估计，因此需要对统计结果进行校正．接着，对统计结果进行校正．构建以下似然函数：

本地化差分隐私随机扰动（回推）机制代码详解（K-RR）

网站公告

今日签到

热门文章

最新发布