Clustering is used to find out which objects are similar to each other by assigning the items to clusters.
In this project, information about each character on the Star Wars Wikia has been gathered and used as the dataset. After the data for each character had been gathered, very frequent and very infrequent terms have been removed, since these carry little information for telling characters apart; only terms with a frequency between 10% and 80% are kept.
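The term filtering described above can be sketched as follows. This is a minimal illustration and not the project's actual code: it assumes the 10% and 80% bounds refer to document frequency (the share of character pages that contain a term), and the function name, the toy corpus, and the thresholds used in the example call are all hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.10, max_df=0.80):
    """Keep only terms whose document frequency lies in [min_df, max_df].

    Assumption: `docs` is a list of token lists, one per character page.
    """
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often overall).
    df = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, c in df.items() if min_df <= c / n_docs <= max_df}
    return [[t for t in doc if t in keep] for doc in docs]

# Hypothetical toy corpus; real input would be the tokenized wiki pages.
docs = [
    ["jedi", "force", "the"],
    ["sith", "force", "the"],
    ["droid", "the"],
    ["jedi", "the", "council"],
    ["pilot", "the", "force"],
]
# A higher lower bound than 10% is used here only because the corpus is tiny.
filtered = filter_by_document_frequency(docs, min_df=0.25, max_df=0.80)
```

In this toy corpus "the" appears on every page and is dropped as too frequent, while terms that appear on a single page are dropped as too rare.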
Afterwards, the dataset has been transformed using TF-IDF, which weights each term by how often it occurs in a document and how rare it is across the dataset, so that the most relevant and descriptive terms stand out.
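As a rough illustration of the TF-IDF weighting, the sketch below uses one common textbook variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency). The exact variant used in the project is not stated, so the formula, the function name, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical token lists standing in for three character pages.
docs = [
    ["jedi", "force", "force"],
    ["sith", "force"],
    ["droid", "astromech"],
]
weights = tf_idf(docs)
```

In the first document "force" occurs twice, yet "jedi" receives the higher weight because it appears in only one document and is therefore more descriptive of that page.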
However, since anyone can change the content of the wiki, the resulting clusters may not always match the clusters that would normally emerge.
In this case, hierarchical clustering is shown as a dendrogram, where each parent item has a number of children that together form a cluster.
The distance from a child to its parent indicates how similar the items are: the greater the distance between a child and its parent, the more dissimilar the items within the cluster are.
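To make the relation between merge distance and dissimilarity concrete, here is a small sketch of bottom-up (agglomerative) hierarchical clustering. The linkage method and distance measure used in the project are not stated, so average linkage and cosine distance are assumptions, as are the function names and the toy vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def agglomerate(vectors):
    """Average-linkage clustering.

    Returns the merges in order as (members_a, members_b, height), where
    height is the distance at which the two clusters were joined, i.e. the
    height of the parent node in the dendrogram.
    """
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [cosine_distance(vectors[p], vectors[q])
                         for p in clusters[i] for q in clusters[j]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical 2-D vectors; real input would be the TF-IDF vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
merges = agglomerate(vectors)
```

The first two vectors point in nearly the same direction and are joined at a small height, whereas the third is attached only at a much larger height, which is what a long branch in the dendrogram expresses.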
Two sample dendrograms are shown below, one after and one before the TF-IDF transformation.