Quantifying and Comparing Features in High-Dimensional Datasets
Abstract
Linking and brushing is a proven approach to analyzing multi-dimensional datasets in the context of multiple coordinated views. Nevertheless, most of the respective visualization techniques only offer qualitative visual results. Many user tasks, however, also require precise quantitative results as, for example, offered by statistical analysis. In succession of the useful Rank-by-Feature Framework, this paper describes a joint visual and statistical approach for guiding the user through a high-dimensional dataset by ranking dimensions (1D case) and pairs of dimensions (2D case) according to statistical summaries. While the original Rank-by-Feature Framework is limited to global features, the most important novelty here is the concept to consider local features, i.e., data subsets defined by brushing in linked views. The ability to compare subsets to other subsets and subsets to the whole dataset in the context of a large number of dimensions significantly extends the benefits of the approach especially in later stages of an exploratory data analysis. A case study illustrates the workflow by analyzing counts of keywords for classifying e-mails as spam or no-spam.
H. Piringer, W. Berger, and H. Hauser, "Quantifying and Comparing Features in High-Dimensional Datasets," in Proceedings of the International Conference on Information Visualisation (IV 2008), Washington, DC, USA, 2008, pp. 240–245.
@INPROCEEDINGS {piringer08comparing,
author = "Harald Piringer and Wolfgang Berger and Helwig Hauser",
title = "Quantifying and Comparing Features in High-Dimensional Datasets",
booktitle = "Proceedings of the International Conference on Information Visualisation (IV 2008)",
year = "2008",
pages = "240--245",
address = "Washington, DC, USA",
month = jul,
publisher = "IEEE Computer Society",
abstract = "Linking and brushing is a proven approach to analyzing multi-dimensional datasets in the context of multiple coordinated views. Nevertheless, most of the respective visualization techniques only offer qualitative visual results. Many user tasks, however, also require precise quantitative results as, for example, offered by statistical analysis. In succession of the useful Rank-by-Feature Framework, this paper describes a joint visual and statistical approach for guiding the user through a high-dimensional dataset by ranking dimensions (1D case) and pairs of dimensions (2D case) according to statistical summaries. While the original Rank-by-Feature Framework is limited to global features, the most important novelty here is the concept to consider local features, i.e., data subsets defined by brushing in linked views. The ability to compare subsets to other subsets and subsets to the whole dataset in the context of a large number of dimensions significantly extends the benefits of the approach especially in later stages of an exploratory data analysis. A case study illustrates the workflow by analyzing counts of keywords for classifying e-mails as spam or no-spam.",
images = "images/piringer08comparing1.png, images/piringer08comparing2.png, images/piringer08comparing3.png",
thumbnails = "images/piringer08comparing1_thumb.png, images/piringer08comparing2_thumb.jpg, images/piringer08comparing3_thumb.png",
location = "London, UK",
url = "//dx.doi.org/10.1109/IV.2008.17"
}