Exam 1 Part 7
Data Similarities and Distances and KNN
Question | Answer |
---|---|
What is similarity? | numerical measure of how alike two data objects are |
What is dissimilarity? | numerical measure of how different two data objects are |
What is proximity? | refers to a similarity or dissimilarity |
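How is a data matrix built? | an n × p matrix: n data points (rows) by p dimensions (columns). It has two modes, because rows and columns represent different entities |
How is a dissimilarity matrix built? | an n × n matrix over the n data points that registers only the pairwise distances. It is symmetric, so it is stored as a triangular matrix, and it has a single mode, because rows and columns represent the same entities |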
What is qualitative data? | categorical with nominal and ordinal attributes |
What is quantitative data? | numerical with discrete and continuous attributes |
Why is mean absolute deviation better at handling outliers than standard deviation? | MAD averages the absolute distances of the data points from the mean, while SD squares the difference between each data point and the mean before averaging. Because squaring magnifies large differences, outliers have a larger impact on SD than on MAD. |
Why is logarithmic transformation useful? | makes highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics. |
What is a norm? | a function that takes a vector from a vector space as input and returns a non-negative real number that tells us how big that vector is. |
What is cosine distance? In what cases is it useful? | cosine similarity is the cosine of the angle between two non-zero vectors in an inner product space; cosine distance is 1 minus cosine similarity. It ignores vector magnitude and is often used to measure document similarity in text analysis. |
What is KNN, and what type of learning happens? | k-nearest neighbors: a nonparametric, supervised method that predicts from the K closest labeled training examples |
What are the basic requirements for KNN? | An integer K, a set of labeled examples (training data), a metric to measure closeness |
What are the pros of using KNN? | analytically tractable, simple implementation, nearly optimal with large sample size, uses local info which yields highly adaptive behavior, parallel implementations |
What are the cons of using KNN? | has large storage requirements, computationally intensive, highly susceptible to curse of dimensionality |
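How to find k in KNN? | use cross-validation; as a rule of thumb let k < sqrt(n), where n is the number of training examples. A larger k smooths out noise and often performs better, but a k that is too large blurs the boundaries between classes. |
How to handle attributes with large range in KNN? | normalize the attributes to a common scale (e.g., min-max or z-score), so that no single attribute dominates the distance |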
How to handle correlated attributes in KNN? | eliminate some attributes, or vary (and possibly adapt) the weights of the attributes |
How to handle symbolic (categorical) attributes in KNN? | use Hamming distance: the number of attributes on which two examples differ |
How to handle KNN being expensive at test time? | use a subset of dimensions, pre-sort training examples into fast data structures (e.g., k-d trees), compute only an approximate distance, remove redundant data |
How to handle KNN storage requirements? | remove redundant data. Note that pre-sorting increases the storage requirement |
How to handle KNN curse of dimensionality? | increase amount of data |
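
The cards above on distances and KNN can be tied together in a small sketch. This is a minimal, standard-library-only illustration, not a reference implementation: the function names (`min_max_scale`, `euclidean`, `cosine_distance`, `hamming`, `knn_predict`) and the toy data are assumptions made up for this example. It shows min-max normalization for attributes with a large range, three of the metrics mentioned (Euclidean, cosine distance, Hamming), and a majority vote among the K nearest labeled examples.

```python
import math
from collections import Counter

def min_max_scale(rows):
    """Rescale every column to [0, 1]; also return a function for scaling new points."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # Guard against constant columns (span of 0) to avoid division by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]

    def scale(row):
        return [(v - lo) / span for v, lo, span in zip(row, lows, spans)]

    return [scale(r) for r in rows], scale

def euclidean(a, b):
    """L2 norm of the difference vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity; both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions where two symbolic records differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k, metric=euclidean):
    """Majority vote among the k training examples closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: metric(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the second attribute has a much larger range than the first.
train_x = [[1.0, 1000], [1.2, 1100], [0.9, 900],
           [3.0, 5000], [3.2, 5200], [2.9, 4800]]
train_y = ["a", "a", "a", "b", "b", "b"]

scaled_x, scale = min_max_scale(train_x)
prediction = knn_predict(scaled_x, train_y, scale([1.1, 1050]), k=3)
```

Sorting all training examples for each query is the simplest approach and illustrates why KNN is expensive at test time: every prediction touches the whole training set. Passing `metric=hamming` lets the same `knn_predict` work on symbolic attributes, in which case the scaling step is skipped.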