Skip to content

Don't require user to have all pixel SOM clusters represented during assignment #1124

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 20, 2024

Conversation

alex-l-kong
Copy link
Contributor

What is the purpose of this PR?

During SOM clustering, it is possible in some cases that not all unique clusters discovered during training will be present in the dataset. In these cases, it's too strict for the SOM cluster mapping function to require all the SOM clusters to be represented.

How did you implement your changes

Add an additional parameter to the SOM cluster assignment function that is defaulted to not require all the SOM clusters to be present. This can be explicitly set to revert back to the original format.

The underlying averaging function is also adjusted accordingly.

@cliu72
Copy link
Contributor

cliu72 commented Mar 8, 2024

@alex-l-kong I don't think this is the best solution to the problem. The reason that this check is even in there in the first place is because when we have cohorts of like 2000 FOVs, it would take forever to generate the average file. So we decided to use a subset of all FOVs to generate the average file (the thought is that a subset is a representative sample of all clusters). However, when we take a subset of all FOVs, it's possible that some rare clusters are not included in that subset. For example, if there is a cluster that is in like 1000 FOVs, but not in the 100 subsetted FOVs, that cluster would not be in the average file. This would lead to downstream problems, since that cluster number does exist in some FOVs, but it didn't get a metacluster assignment. Therefore, we thought that adding a test to make sure all clusters are represented would be a good way to get around this issue. If not all clusters are represented in the average file, then the number of FOVs subsetted needs to be increased.

The issue arising now is that if we specify 100 SOM clusters, maybe only 98 SOM clusters are actually assigned across ALL FOVs. So in the average file, there are 98 SOM clusters. This is a different situation than if there are truly 100 SOM clusters assigned across all FOVs, but only 98 of them are in the subset of FOVs used for the average file generation. So the fundamental issue is that when we are checking for the number of clusters, we are using the weights file. However, the weights file can specify 100 clusters, when across all FOVs there are only 98 clusters. That's why the solution that I suggested in the issue is to keep track of the total number of clusters during the mapping step. Open to other suggestions though.

@alex-l-kong
Copy link
Contributor Author

@cliu72 thanks, I just realized that on my end too and am in the process of modifying it.

@alex-l-kong
Copy link
Contributor Author

alex-l-kong commented Mar 8, 2024

@cliu72 I still think the require_all_som_clusters argument could still be useful in some niche first scenario cases however. It was a bit of an annoyance to a few lab members in the past. We can allow the function to take None as a parameter but hide that functionality and default to requiring an explicit value (which will be the total number of SOM clusters seen).

@cliu72
Copy link
Contributor

cliu72 commented Mar 8, 2024

@alex-l-kong I think it's fine to have this as a temporary solution, but if we track the total number of SOM clusters seen and use that to make sure the average file is written correctly, this error should never come up. If it does, there is another problem.

@alex-l-kong
Copy link
Contributor Author

@alex-l-kong I think it's fine to have this as a temporary solution, but if we track the total number of SOM clusters seen and use that to make sure the average file is written correctly, this error should never come up. If it does, there is another problem.

Ok I did a quick test by removing one SOM cluster, I think this does the trick now.

Copy link
Contributor

@cliu72 cliu72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to properly stress test it because I didn't have a dataset easily on-hand that returned less than the desired number of clusters, but the logic makes sense to me.

@alex-l-kong alex-l-kong marked this pull request as ready for review March 20, 2024 02:28
@alex-l-kong alex-l-kong added this pull request to the merge queue Mar 20, 2024
Merged via the queue into main with commit 9bd4a39 Mar 20, 2024
@alex-l-kong alex-l-kong deleted the min_som_cluster branch March 20, 2024 04:59
@srivarra srivarra added the enhancement New feature or request label May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants