Random forest clustering

2/28/2024

    # plotting the supervised embedding with UMAP
    plt.scatter(sup_embed_umap[y == 0, 0], sup_embed_umap[y == 0, 1], s=1, c='C0', label='$y=0$')
    plt.scatter(sup_embed_umap[y == 1, 0], sup_embed_umap[y == 1, 1], s=1, c='C1', label='$y=1$')
    plt.title('Supervised embedding with UMAP: trivial separation')
    plt.legend(fontsize=16, markerscale=5)

The results are a little bit underwhelming, as we only observe 2 clusters (not the intended 20). It seems UMAP just separated the two classes without discarding the irrelevant features. The shape of the individual clusters resembles what we got in the unsupervised embedding.

Forest embeddings

As previously discussed here, forest embeddings present a nice solution for extracting relevant structure from messy, high-dimensional data. The intuition is simple: train a Random Forest (or, better, Extremely Randomized Trees) to predict the target variable, and extract similarities between samples by looking at how many times they co-occur in the leaves of the forest.

Model Validation

First, let us validate our model, an ExtraTrees from sklearn. We choose a large min_samples_leaf to keep our trees from growing too much and our leaf-based similarity easy to compute. We use a cheap 2-fold cross-validation, as we have a reasonable amount of data.

    # building a dataframe to inspect variable importance
    importance_df = pd.DataFrame({'variable': X.columns, 'importance': et.feature_importances_})
    # showing importances in decreasing order
    importance_df.sort_values('importance', ascending=False)

The algorithm correctly identified the 5 important variables as well. This fact helps explain why we could recover meaningful structure: our leaf similarity measure effectively throws away distances along the irrelevant dimensions' axes.

So now that we have a reliable way of comparing samples, we can perform clustering! However, we have a problem: the scale of the data. Given that we are using a non-Euclidean distance metric, the first algorithm that comes to mind is hierarchical clustering. However, we would need to provide it with the full distance matrix between all samples, which is very expensive to compute. Let us try to devise a way to make this viable, and experiment with some clustering alternatives to solve this problem.

Let us use a decision tree for clustering, which is a simplification of our forest. We just train a Decision Tree Classifier, setting the maximum number of leaf nodes max_leaf_nodes equal to our intended number of clusters, and read the leaf each sample falls into as its cluster label.

    # max_leaf_nodes should be our intended number of clusters
    dt = DecisionTreeClassifier(max_leaf_nodes=20)
    # fitting and applying
    dt.fit(X, y)
    clusters = dt.apply(X)
    plt.scatter(sup_embed_et[:, 0], sup_embed_et[:, 1], s=1, c=clusters, cmap='plasma')
    plt.title('Clustering with Decision Trees: room for improvement')
    plt.ylabel('$x_1$')

Clustering with a decision tree showed fair results, but with much room for improvement. The algorithm assigns the same label to different clusters, oversimplifying the structure. Also, it misses some of the assignments, especially on the large clusters.

Let us experiment with hierarchical clustering now. We start by taking a sample of our embedding and running single-linkage clustering on it. This way we save memory and time, at the cost of not clustering every instance in our data.

    plt.scatter(sup_embed_et[:, 0], sup_embed_et[:, 1], s=1, c=final_cluster_df, cmap='plasma')
    plt.title('Hierarchical clustering: good results')

There is much less confusion between clusters, and the algorithm did not assign the same label to different clusters as before.
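The model validation, the leaf-based similarity, and the sampled single-linkage step are only described in prose above, so here is a minimal sketch of how they could be wired together. The variable names (X, y, et, leaves), the sample size of 2,000, and hyperparameters such as n_estimators=100 and min_samples_leaf=500 are illustrative assumptions, not the exact values used in this post.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import cross_val_score
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # cheap 2-fold cross-validation of the ExtraTrees model;
    # a large min_samples_leaf keeps the trees small and the leaf similarity cheap
    et = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=500, n_jobs=-1)
    print(cross_val_score(et, X, y, cv=2, scoring='roc_auc'))

    # fit on all the data and record which leaf each sample lands in, per tree
    et.fit(X, y)
    leaves = et.apply(X)              # shape: (n_samples, n_trees)

    # work on a random sample so the pairwise distance matrix stays tractable
    n_sample = 2000
    idx = np.random.choice(len(X), size=n_sample, replace=False)
    sample_leaves = leaves[idx]

    # similarity = fraction of trees in which two samples share a leaf
    similarity = np.zeros((n_sample, n_sample))
    for t in range(sample_leaves.shape[1]):
        col = sample_leaves[:, t]
        similarity += (col[:, None] == col[None, :])
    similarity /= sample_leaves.shape[1]
    distance = 1.0 - similarity

    # single-linkage hierarchical clustering on the sampled distance matrix,
    # cut so that we get our 20 intended clusters
    np.fill_diagonal(distance, 0.0)
    Z = linkage(squareform(distance, checks=False), method='single')
    sample_clusters = fcluster(Z, t=20, criterion='maxclust')

Single linkage works directly from a precomputed distance matrix, which is why it pairs naturally with the leaf-based similarity; looping over the trees avoids materialising a (n_sample, n_sample, n_trees) boolean tensor in memory.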