Exam 1 Part 7

Data Similarities and Distances and KNN

Question	Answer
What is similarity?	numerical measure of how alike two data objects are
What is dissimilarity?	numerical measure of how different two data objects are
What is proximity?	refers to a similarity or dissimilarity
How is a data matrix built?	n data points (rows) by p dimensions (columns), so it has two modes
How is a dissimilarity matrix built?	n data points by n data points, but it registers only the pairwise distances. It is a triangular matrix with a single mode
What is qualitative data?	categorical with nominal and ordinal attributes
What is quantitative data?	numerical with discrete and continuous attributes
Why is mean absolute deviation better at handling outliers than standard deviation?	MAD averages the absolute distances of the data points from the mean, while SD squares the difference between each data point and the mean. Because the standard deviation squares the differences, outliers have a larger impact on it than on MAD.
Why is logarithmic transformation useful?	makes highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.
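What is a norm?	a norm is a function that takes a vector from a vector space as input and returns a non-negative real number that tells us how big that vector is.
What is cosine distance? In what cases is it useful?	cosine similarity is the cosine of the angle between two non-zero vectors in an inner product space; cosine distance is 1 minus the cosine similarity. often used to measure document similarity in text analysis, where the direction of a vector matters more than its length.
What is KNN, and what type of learning happens?	k-nearest neighbors: a nonparametric, supervised method that labels a new point from the labels of its k closest training examples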
What are the basic requirements for KNN?	An integer K, a set of labeled examples (training data), a metric to measure closeness
What are the pros of using KNN?	analytically tractable, simple implementation, nearly optimal with large sample size, uses local info which yields highly adaptive behavior, parallel implementations
What are the cons of using KNN?	has large storage requirements, computationally intensive, highly susceptible to curse of dimensionality
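How to find k in KNN?	use cross-validation; as a rule of thumb, let k < sqrt(n), where n is the number of training examples. a larger k is usually chosen because it makes the vote less sensitive to noise, but too large a k blurs the boundaries between classes.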
How to handle attributes with large range in KNN?	normalize scale
How to handle correlated attributes in KNN?	eliminate some attributes or vary and possibly adapt the weight of attributes
How to handle symbolic attributes in KNN?	use Hamming distance
How to handle KNN being expensive at test time?	use a subset of dimensions, pre-sort training examples into fast data structures, compute only an approximate distance, remove redundant data
How to handle KNN storage requirements?	remove redundant data, note that pre-sorting increases storage requirement
How to handle KNN curse of dimensionality?	increase amount of data
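
The cards on norms, cosine distance, Hamming distance, scale normalization and the KNN requirements can be tied together in a small sketch. This is only an illustration in plain Python, not a reference implementation: the function names (minkowski, cosine_distance, hamming, min_max_scale, knn_predict) and the toy data are assumptions made up for this example, and min-max scaling is just one possible way to "normalize scale".

```python
import math
from collections import Counter

def minkowski(x, y, p=2):
    """L_p distance between two numeric vectors (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_distance(x, y):
    """1 minus cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def hamming(x, y):
    """Number of positions at which two symbolic sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def min_max_scale(rows):
    """Rescale each column to [0, 1]; also return a function for new points."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(row):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in rows], scale

def knn_predict(train_x, train_y, query, k=3, dist=minkowski):
    """Majority vote among the k labeled examples closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the second attribute has a much larger range,
# so it would dominate the distance if the columns were not rescaled.
train_x = [[1.0, 1000.0], [2.0, 1100.0], [8.0, 5000.0], [9.0, 5200.0]]
train_y = ["low", "low", "high", "high"]

scaled_x, scale = min_max_scale(train_x)
prediction = knn_predict(scaled_x, train_y, scale([7.5, 4800.0]), k=3)
print(prediction)
```

Here the three requirements from the card appear as arguments: the integer k, the labeled examples (train_x, train_y) and the closeness metric (dist). Passing dist=hamming would be one way to handle symbolic attributes, and dist=cosine_distance one way to compare document vectors.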

Created by: amhhh
