Parameter details for Mapper

Data Encoding: We have encoded the World Color Survey data by aggregating the responses by lexeme. For each lexeme, we store the fraction of speakers of that language that used that lexeme for that Munsell chart cell.
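As an illustration of this encoding, the following minimal sketch aggregates raw responses into per-lexeme fractions. The function name `encode_language`, the `(speaker, cell, lexeme)` tuple layout, and the toy data are hypothetical and are not taken from the tool itself; the sketch also assumes each speaker gives exactly one response per chart cell.

```python
from collections import defaultdict

def encode_language(responses, n_speakers):
    """Aggregate (speaker, cell, lexeme) responses into per-lexeme vectors.

    Returns {lexeme: {cell: fraction of speakers who used lexeme for cell}}.
    Assumes one response per speaker and cell (an assumption of this sketch).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for _speaker, cell, lexeme in responses:
        counts[lexeme][cell] += 1
    return {
        lexeme: {cell: n / n_speakers for cell, n in cells.items()}
        for lexeme, cells in counts.items()
    }

# Hypothetical toy data: two speakers naming two chart cells.
responses = [
    (1, "A1", "red"), (2, "A1", "red"),
    (1, "B2", "red"), (2, "B2", "blue"),
]
encoded = encode_language(responses, n_speakers=2)
```

Here both speakers used "red" for cell A1, while only one of the two used it for cell B2, so the two stored fractions differ.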

Available functions: Both the filter function and value extraction work by applying a function to the dataset and using the resulting values to sort, filter, or rearrange the observations. The tool currently implements four such functions: sum, variance, standard deviation, and mean.

Sum: The sum of a vector provides a measure of how many Munsell chart squares were selected, on average, by each speaker to represent that lexeme. The measure varies between

Variance, Standard Deviation: Both measure how far the response rates of a single lexeme deviate from that lexeme's average response rate across the cells of the Munsell chart. As such, they are a proxy for the locality of the lexeme footprint: low variance means the footprint covers much of the chart, while high variance means the footprint is very narrow. Because standard deviation is the square root of variance, the two differ almost exclusively in how they spread out mid-range values, so they may give different results there when used as filter functions. Variance varies between 0.003 and 155, while standard deviation varies between 0.05 and 12.5.

Mean: The mean is the average value of the vector of response counts; it is a measure of how localized the lexeme footprint is. Mean varies between 0.003 and 15.4.
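The four functions described above can be sketched with Python's standard library. This is only an illustration: the `FUNCTIONS` table, the `summarize` helper, and the example vector are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the tool's exact convention is not stated here.

```python
import statistics

# Hypothetical lookup table for the four functions named in the text.
# Population variance/standard deviation are an assumption of this sketch.
FUNCTIONS = {
    "sum": sum,
    "mean": statistics.fmean,
    "variance": statistics.pvariance,
    "std": statistics.pstdev,
}

def summarize(vector):
    """Apply every available function to one lexeme vector."""
    return {name: fn(vector) for name, fn in FUNCTIONS.items()}

# Toy lexeme vector over a four-cell chart (illustrative values only).
vector = [1.0, 0.5, 0.0, 0.0]
stats = summarize(vector)
```

Each resulting value can then serve either as a filter function value or as the basis for value extraction.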

Filter function: The Mapper method works by dividing the entire dataset into slices with similar filter function values, clustering within each slice, and then connecting similar clusters in adjacent slices to form a graph representation of the dataset.
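The slicing, clustering, and connecting steps can be sketched as follows. This is a simplified illustration rather than the implementation used by the tool: the names `cover`, `cluster`, and `mapper` and all parameters are hypothetical, the slices are overlapping intervals whose clusters are linked when they share a point (the construction in Singh, Mémoli and Carlsson), and clustering is reduced to a fixed distance threshold `eps`.

```python
import math

def cover(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal-length intervals overlapping by a fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

def cluster(indices, points, eps):
    """Connected components of the graph joining points at distance <= eps."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters

def mapper(points, filter_fn, n_intervals=3, overlap=0.25, eps=0.5):
    """Return (nodes, edges): clusters per slice, linked when they share points."""
    values = [filter_fn(p) for p in points]
    tol = 1e-9  # guard against rounding at the interval boundaries
    nodes = []
    for a, b in cover(min(values), max(values), n_intervals, overlap):
        in_slice = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(cluster(in_slice, points, eps))
    edges = {(s, t)
             for s in range(len(nodes)) for t in range(s + 1, len(nodes))
             if nodes[s] & nodes[t]}
    return nodes, edges

# Toy data: nine points on a line, filtered by their sum.
points = [(x * 0.5,) for x in range(9)]
nodes, edges = mapper(points, sum, n_intervals=3, overlap=0.25, eps=0.5)
```

On this toy input the three slices each yield one cluster, and neighbouring clusters are linked through the points they share, giving a path-shaped graph.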

Value extraction: To study subsets of the data, it is possible to extract only those observations whose value for one of the defined functions falls within a given allowed range. This way, it is possible to focus only on lexemes with particular response rates, or with limits on the size of their footprints.
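A minimal sketch of such an extraction step is shown below. The helper name `extract`, the toy vectors, and the choice of inclusive range bounds are assumptions made for illustration only.

```python
def extract(vectors, func, low, high):
    """Keep only lexemes whose function value lies in [low, high].

    Inclusive bounds are an assumption of this sketch.
    """
    return {lex: v for lex, v in vectors.items() if low <= func(v) <= high}

# Hypothetical lexeme vectors over a four-cell chart.
vectors = {
    "red": [1.0, 0.5, 0.0, 0.0],
    "blue": [0.0, 0.5, 1.0, 1.0],
    "rare": [0.0, 0.1, 0.0, 0.0],
}

# Keep lexemes whose footprint size (sum) is between 1 and 2 cells.
kept = extract(vectors, sum, 1.0, 2.0)
```

Any of the available functions can be passed as `func`, so the same helper covers extraction by sum, mean, variance, or standard deviation.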

Cluster linkage type: The clustering step uses a hierarchical clustering method. The method supports single, average, and complete linkage, which control the point at which two clusters are considered to merge. Single linkage merges at first contact (the smallest distance between members of the two clusters), average linkage merges according to the average distance between members of the two clusters, and complete linkage merges at last contact (the largest distance between members of the two clusters). After these linkages have been computed, the algorithm finds the largest step between subsequent merges and flattens the hierarchy into the clusters that exist just before this largest step.
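The linkage choices and the largest-step flattening rule can be sketched in plain Python. This is a naive illustration under stated assumptions, not the tool's code: the function names are hypothetical, Euclidean distance is used throughout, and inputs with fewer than two merges are simply returned unmerged.

```python
import math

def linkage_distance(a, b, points, kind):
    """Distance between clusters a and b under the chosen linkage."""
    d = [math.dist(points[i], points[j]) for i in a for j in b]
    if kind == "single":
        return min(d)        # first contact
    if kind == "complete":
        return max(d)        # last contact
    return sum(d) / len(d)   # "average": mean pairwise distance

def agglomerate(points, kind="single"):
    """Return [(merge height, partition just before that merge), ...]."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        h, s, t = min(
            (linkage_distance(clusters[s], clusters[t], points, kind), s, t)
            for s in range(len(clusters))
            for t in range(s + 1, len(clusters))
        )
        history.append((h, [list(c) for c in clusters]))
        clusters[s] = clusters[s] + clusters[t]
        del clusters[t]
    return history

def flatten_at_largest_gap(history):
    """Cut the hierarchy just before the largest jump in merge height."""
    if len(history) < 2:  # no gap to measure; keep points unmerged
        return history[0][1] if history else []
    heights = [h for h, _ in history]
    gaps = [heights[k + 1] - heights[k] for k in range(len(heights) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return history[k + 1][1]

# Toy data: two well-separated groups on a line.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
flat = flatten_at_largest_gap(agglomerate(points, kind="single"))
```

In this toy example the small merges within each group happen at similar heights, and the final merge between the groups is by far the largest step, so the hierarchy is cut into the two groups.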

Metric choice: We offer three different metrics to use when clustering: Euclidean distance between the vectors representing the mass distributions on the Munsell chart; cosine distance between these vectors; and normalized Earth Mover's distance, which takes the perceptual difference between colors into account when computing distances between distributions.
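The first two metrics can be sketched directly; the function names and example vectors below are hypothetical. The Earth Mover's distance is left out of this sketch because it depends on perceptual ground distances between chart cells, which are not specified here.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two lexeme vectors."""
    return math.dist(u, v)

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v.

    Undefined for an all-zero vector (division by zero); this sketch
    does not handle that case.
    """
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

# Toy vectors: v has the same footprint as u but twice the mass,
# while w covers a different cell entirely.
u = [1.0, 0.0, 0.0, 0.0]
v = [2.0, 0.0, 0.0, 0.0]
w = [0.0, 1.0, 0.0, 0.0]

d_uv = (euclidean(u, v), cosine_distance(u, v))
d_uw = (euclidean(u, w), cosine_distance(u, w))
```

The example highlights the design difference: Euclidean distance is sensitive to the overall response rate, whereas cosine distance compares only the shape of the footprint.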

These visualizations are a joint project between Mikael Vejdemo-Johansson, KTH Royal Institute of Technology, and Susanne Vejdemo, Stockholm University. The tools were written using d3.js for visualization and a not yet publicly released Mapper implementation by Aravind and Daniel.

Mapper was developed by Gurjeet Singh at Stanford, and has been used for knowledge discovery in bioinformatics.
Singh, Mémoli, Carlsson. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics, 2007.
Lum, Singh, Lehman, Ishkanov, Vejdemo-Johansson, Alagappan, Carlsson, Carlsson. Extracting insights from the shape of complex data using topology. Scientific Reports 3, 2013.
Ayasdi Inc. sells data analysis services based on this approach.

Berlin, Kay. Basic Color Terms: Their Universality and Evolution. Berkeley and Los Angeles: University of California Press, 1969.
Kay, Berlin, Maffi, Merrifield, Cook. The World Color Survey. Stanford: CSLI, July 2009 (ISBN (Cloth): 9781575864150).