Skip to Content

Clustering Large, Sparse Dataset [migrated]

I have a few datasets of "interactions" between pairs of elements like so:

element1 element2 1
element2 element3 1
element4 element5 1
...
element505535 element4 2

where the value in the 3rd column is the "strength" of interaction. Almost all of these strengths are "1." A strength of 1 means that this interaction was observed one time. A strength of 2 means 2x, etc. I have actually gone one step further and normalized all of my datasets by the total # of interactions observed in the dataset so that datasets interaction values can be compared.

There are 5-6 million interactions listed in each file and each dataset is obviously under-sampled since there are ~500k elements (making a square matrix of ~250 trillion positions).

I would like to cluster these datasets so that I can make statements about which types of elements tend to cluster with which other types elements. Obviously, robusticity of clustering will be a factor—but this is partially ameliorated by the fact that I will make biological replicates of the data.

I have tried a few different "naive" clustering approaches just to see what I could do easily with the data. I fully realize that these are problematic ways of clustering, either because they are not robust or because they rely on the data being very undersampled, but here is what I've done:

  1. Clustering elements together as long as there is at least one interaction between each element in the cluster and at least one other element in the cluster. When I do this, all the elements end up in a single cluster. This was important to do because it tells me that there are no pairs of elements that are totally isolated from the rest of the group.

  2. Finding "superclusters"—that is, clusters where every member of the cluster interacts with every other member of the cluster (e.g. a triangle for a cluster of 3 and a box with an X in the middle for a cluster of 4, etc). This yields almost exclusively clusters with 2 and 3 elements after about 10% of the data has been analyzed (this is still running).

I would love to be able to do some sort of hierarchical clustering using my "interaction strength" values as the distance measure between each pair of elements (unobserved interactions have a strength of 0). Does anyone know of a way to do HC on this sort of large, sparse data—or know of a clustering method that might be more appropriate? I've used R up until now.