[Question] How can I always use the original data when computing clusters?

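Original poster: piacere (Beol)   2023-04-18 14:06:07
Hello everyone,
I have some code that uses KNN to compute vector similarity and then clusters the results.
However, I found that as soon as a point is clustered with its nearest point, the new averaged coordinates are used to evaluate the next-nearest point.
Sometimes points 1 and 3 were originally similar, but after 1 is clustered with 2 it is no longer similar to 3 (and the order of the data also changes the result QQ).
How should I modify the code so that every point i first finds its K nearest points using the original coordinates, and only after all the data has been processed are the clusters formed and the coordinates averaged together? QQ
The functions that may be causing the problem are below: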
import math
import numpy as np

def KNN(i, k, data_mid_point, tree):
    # Ask for k+1 neighbors because the query point itself is usually returned
    dist, ind = tree.query(np.expand_dims(data_mid_point[i], axis=0), k=k+1)
    nearest_ids = list(ind[0])
    if i in nearest_ids:
        nearest_ids.remove(i)
    else:
        nearest_ids = nearest_ids[:-1]
    distances = []
    for j in nearest_ids:
        distance = ((data_mid_point[j][0] - data_mid_point[i][0])**2 +
                    (data_mid_point[j][1] - data_mid_point[i][1])**2)**0.5
        distances.append(distance)
    print(f"The {k} nearest IDs to ID {i} are:")
    for j in range(len(nearest_ids)):
        print(f"ID: {nearest_ids[j]}, Distance: {(distances[j]/0.000009)} meters")
    return nearest_ids
def calcClusterFlow(c, data):
    # Weighted average of origin (columns 0-1) and destination (columns 2-3),
    # using column 8 as the weight
    ox = 0
    oy = 0
    dx = 0
    dy = 0
    for k in c:
        ox += data[k][0]*data[k][8]
        oy += data[k][1]*data[k][8]
        dx += data[k][2]*data[k][8]
        dy += data[k][3]*data[k][8]
    d = 0
    for k in c:
        d += data[k][8]
    ox /= d
    oy /= d
    dx /= d
    dy /= d
    return ox, oy, dx, dy
# Compute the similarity
def flowSim(vi, vj, alpha):
    leni = math.sqrt((vi[0]**2+vi[1]**2))
    lenj = math.sqrt((vj[0]**2+vj[1]**2))
    dv = math.sqrt((vi[0] - vj[0]) ** 2 + (vi[1] - vj[1]) ** 2)
    # Normalize the difference by the longer of the two vectors
    if leni > lenj:
        return dv/(alpha*leni)
    else:
        return dv/(alpha*lenj)
# Compute the similarity between the two clusters with cluster IDs ci and cj
def clusterSim(i, j, ci, cj, data, alpha):
    # Columns 4-7 of each cluster's first member hold the cluster's averaged flow
    oix, oiy, dix, diy = (data[ci[0]][4], data[ci[0]][5],
                          data[ci[0]][6], data[ci[0]][7])
    ojx, ojy, djx, djy = (data[cj[0]][4], data[cj[0]][5],
                          data[cj[0]][6], data[cj[0]][7])
    vi = [dix-oix, diy-oiy]
    vj = [djx-ojx, djy-ojy]
    sim = flowSim(vi, vj, alpha)
    return sim
# Merge two clusters
def merge(c, ci_ID, cj_ID, l):
    # Keep the smaller cluster ID
    if ci_ID > cj_ID:
        ci_ID, cj_ID = cj_ID, ci_ID
    for l_ID in c[cj_ID]:
        l[l_ID] = ci_ID
        c[ci_ID].append(l_ID)
    c.pop(cj_ID)
The main computation is here:
from tqdm import tqdm

for i in tqdm(range(dataLen)):
    neighbors = KNN(i, K, data_mid_point, tree)
    for j in neighbors:
        # Skip neighbors farther than Radius, using the same 0.000009-per-meter factor as KNN
        if (data_mid_point[i][0]-data_mid_point[j][0])**2+(data_mid_point[i][1]-data_mid_point[j][1])**2>(Radius*0.000009)**2:
            continue
        if l[i] != l[j]:
            if clusterSim(i, j, c[l[i]], c[l[j]], data, alpha) <= 1:
                new_cluster_ID = min(l[i],l[j])
                num_of_flow_in_cluster=0
                merge(c, l[i], l[j], l)
                for m in c[new_cluster_ID]:
                    num_of_flow_in_cluster+=data[m][8]
                # Weighted average of the merged cluster, computed once for all members
                cox, coy, cdx, cdy = calcClusterFlow(c[new_cluster_ID],data)
                for m in c[new_cluster_ID]:
                    data[m][4], data[m][5], data[m][6], data[m][7], data[m][9] = cox, coy, cdx, cdy, num_of_flow_in_cluster
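My feeling is that the problem is most likely in merge. I asked ChatGPT, but it didn't seem to understand the result I'm after either.
Any help would be greatly appreciated QQ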
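Author: wuyiulin (龙破坏剑士-巴斯达布雷达)   2023-04-18 16:17:00
Suppose you run KNN from point X1 and get the first layer of neighbors x_1j. You need to store the coordinates of those x_1j and pass them along for the second layer. So there is probably a mean being applied somewhere; removing or adjusting it should be enough. Also, why are you implementing this with a queue... that seems odd.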
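To illustrate why a mean applied between comparisons matters, here is a minimal sketch with three made-up flow vectors. `flow_sim` mirrors the `flowSim` formula from the post; the vectors, `alpha = 0.5`, and the unweighted mean are assumptions chosen for illustration, not values from the thread.

```python
import math

def flow_sim(vi, vj, alpha):
    # Same formula as flowSim in the post: difference length divided by
    # alpha times the longer of the two vectors
    longest = max(math.hypot(vi[0], vi[1]), math.hypot(vj[0], vj[1]))
    return math.hypot(vi[0] - vj[0], vi[1] - vj[1]) / (alpha * longest)

alpha = 0.5  # hypothetical value

# Hypothetical original flow vectors for points 1, 2 and 3
original = {1: (1.0, 0.0), 2: (1.0, -0.5), 3: (1.0, 0.5)}

# Using the original vectors, 1 and 3 pass the "<= 1" test
sim_13_original = flow_sim(original[1], original[3], alpha)

# If 1 is merged with 2 first and replaced by their mean (equal weights assumed),
# the same comparison against 3 now fails
mean_12 = tuple((a + b) / 2 for a, b in zip(original[1], original[2]))
sim_13_after_merge = flow_sim(mean_12, original[3], alpha)

print(sim_13_original <= 1, sim_13_after_merge <= 1)
```

In the posted code, `clusterSim` reads columns 4-7 of `data`, and the main loop overwrites exactly those columns with the result of `calcClusterFlow` after every merge, so that is one place where the averaged coordinates feed back into later comparisons.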
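Original poster: piacere (Beol)   2023-04-18 20:40:00
Thank you for the answer, but I don't understand it.... Right now I just can't find where the coordinates also get merged after clustering TT. By the way, I am using a ball-tree.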
Author: wuyiulin (龙破坏剑士-巴斯达布雷达)   2023-04-19 08:03:00
I was talking about brute force. If it's a ball-tree, I need to think about it.
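One possible way to get what the original post asks for is to split the work into phases: test every neighbor pair using only the original coordinates, then merge all similar pairs, and only then compute one weighted average per cluster. The sketch below is an illustration under assumptions, not a drop-in replacement. `cluster_from_original`, `find`, and the input layout (`flows[i] = (ox, oy, dx, dy)`, `weights[i]`, and `neighbors[i]` already filtered by radius, from any KNN such as a ball-tree) are hypothetical, and the zero-length guard in `flow_sim` is an addition that the posted `flowSim` does not have.

```python
import math

def flow_sim(vi, vj, alpha):
    # Same formula as flowSim in the post
    longest = max(math.hypot(vi[0], vi[1]), math.hypot(vj[0], vj[1]))
    if longest == 0:
        return 0.0  # assumption: two zero-length flows count as identical
    return math.hypot(vi[0] - vj[0], vi[1] - vj[1]) / (alpha * longest)

def find(parent, x):
    # Union-find root lookup with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_from_original(flows, weights, neighbors, alpha):
    n = len(flows)
    # Phase 1: compare neighbors using original coordinates only; nothing is modified
    vectors = [(f[2] - f[0], f[3] - f[1]) for f in flows]
    similar_pairs = [(i, j) for i in range(n) for j in neighbors[i]
                     if flow_sim(vectors[i], vectors[j], alpha) <= 1]
    # Phase 2: merge all similar pairs, keeping the smaller ID as the cluster ID
    parent = list(range(n))
    for i, j in similar_pairs:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    # Phase 3: one weighted average per cluster, computed after all merging is done
    averages = {}
    for cid, members in clusters.items():
        total = sum(weights[m] for m in members)
        averages[cid] = tuple(
            sum(flows[m][k] * weights[m] for m in members) / total for k in range(4))
    return clusters, averages

# Hypothetical data: flows 0, 1, 2 point roughly the same way, flow 3 points opposite
flows = [(0, 0, 1, 0), (0, 0, 1, -0.5), (0, 0, 1, 0.5), (0, 0, -1, 0)]
weights = [1, 1, 1, 1]
neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
clusters, averages = cluster_from_original(flows, weights, neighbors, alpha=0.5)
print(clusters)  # {0: [0, 1, 2], 3: [3]}
```

Because every comparison is made before any merge, the result no longer depends on the order of the data. One consequence to be aware of: merging is transitive, so if 1 is similar to 2 and 2 is similar to 3, all three end up in one cluster even when 1 and 3 are not directly similar.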
