Exam 1 Part 5

Data Processing and Cleaning

Question Answer
What are data and attributes? A data object encapsulates a collection of attributes. An attribute is a property or characteristic of a data object.
What are the different types of attributes? Categorical (nominal, binary, ordinal) and numeric (interval, ratio; values may be continuous or discrete).
What is the difference between ordinal and nominal attributes? Ordinal attribute values can be ordered (<, >), while nominal attribute values can only be distinguished (=, !=).
What is graph data? Give an example. Data stored as a graph, whose structure allows traversal in a particular way and conveys additional information about the data. Examples: a generic graph, a molecule, linked web pages, binary trees, heaps.
What is noise in data? Give an example. Random or irrelevant information present in a dataset. Noise can be introduced by a variety of factors, such as measurement errors, outliers, incomplete data, or irrelevant features. For attributes, noise refers to modification of the original values, e.g. measurement errors from common measuring equipment or distortion of a person's voice by the equipment.
What is an outlier in data? Give an example. An outlier is an observation that is unusually small or large compared with the rest of the data. It may be an error in recording the value, a point that doesn't belong in the sample, or an observation that is actually valid.
Explain the techniques for handling missing values for different types of attributes. Eliminate the data object, impute values (e.g. the mean for numeric attributes, the mode for categorical attributes), or ignore the missing values during analysis.
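As a minimal sketch of imputation, assuming missing entries are stored as None: the helper `impute` and the sample lists below are made up for illustration and are not part of the flashcard set.

```python
from statistics import mean, mode

def impute(values, strategy):
    """Replace None entries with the mean or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":      # suits numeric attributes
        fill = mean(observed)
    elif strategy == "mode":    # suits categorical attributes
        fill = mode(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'mode'")
    return [fill if v is None else v for v in values]

# Hypothetical sample data with one missing entry each
ages = [20, None, 30, 40]
colors = ["red", "blue", None, "red"]

print(impute(ages, "mean"))    # missing age filled with the mean of 20, 30, 40
print(impute(colors, "mode"))  # missing color filled with the most common color
```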
What is the relation between correlation and covariance? Give examples. Covariance measures the degree to which two variables vary together (positive means they tend to increase together; negative means one tends to decrease as the other increases). Correlation is covariance standardized by the standard deviations: cov(X, Y) / (sd_X * sd_Y). It ranges from -1 to 1 and gives a unit-free measure of the strength of the linear relationship.
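The relationship can be checked numerically. The sketch below uses the sample (n - 1) covariance and made-up study-time data; the function names are illustrative. Rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    sd_x = sqrt(covariance(xs, xs))
    sd_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: hours studied and exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
minutes = [h * 60 for h in hours]  # same quantity in a different unit

print(covariance(hours, scores), covariance(minutes, scores))    # differ by a factor of 60
print(correlation(hours, scores), correlation(minutes, scores))  # essentially identical
```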
What is the curse of dimensionality? The amount of data needed to describe the space increases exponentially as the number of dimensions grows.
What are techniques to reduce dimensionality? PCA, LDA, feature selection, Autoencoder, manifold learning
Why is it important to reduce dimensionality? As dimensionality increases, the data becomes sparser and occupies more space. It becomes difficult to get enough samples to represent the space accurately, which leads to overfitting and poor generalization to new data. Reducing dimensionality also makes the data easier to visualize and can reduce noise.
Explain PCA. PCA (principal component analysis) finds the most important directions in a dataset and projects the data onto a lower-dimensional space. It does this by identifying the axes (principal components) that capture the most variance in the data, and then projecting the data onto those axes.
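For two attributes, the axis of greatest variance can be found in closed form from the 2x2 covariance matrix. The sketch below is a simplified illustration rather than a general PCA implementation: the function name and the sample points are made up, and it assumes the points are not all identical.

```python
from math import sqrt

def pca_first_component(points):
    """Return (unit axis, 1-D projections, explained variance ratio) for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix
    root = sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = (a + c) / 2 + root   # variance along the first principal axis
    lam2 = (a + c) / 2 - root   # variance along the second axis

    # Eigenvector for lam1; if b == 0 the axes are already uncorrelated
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    axis = (vx / norm, vy / norm)

    # Project each centered point onto the axis (2-D reduced to 1-D)
    projections = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, projections, lam1 / (lam1 + lam2)

# Hypothetical points lying exactly on the line y = 2x
axis, projections, ratio = pca_first_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(axis, ratio)
```

Because these points lie on a single line, one axis captures essentially all of the variance, so the second dimension can be dropped without losing information.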
What is binarization? Converting continuous or categorical attributes into one or more binary variables (e.g. thresholding a continuous attribute, or one-hot encoding a categorical attribute).
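A small sketch of both cases, with made-up helper names and sample values: a categorical attribute becomes one binary column per category, and a continuous attribute becomes a single binary column via an assumed threshold.

```python
def one_hot(values):
    """One binary column per distinct category, in sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def binarize(values, threshold):
    """1 if the value is above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]

categories, rows = one_hot(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(binarize([0.2, 0.7, 0.5], threshold=0.5))  # [0, 1, 0]
```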
What is discretization? The process of converting a continuous attribute into an ordinal attribute (e.g. petal width to S, M, L or 0, 1, 2).
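One simple way to discretize is equal-width binning, sketched below. The card does not say which binning method is intended, so equal-width is an assumption, and the width values are made up for illustration.

```python
def equal_width_bins(values, labels):
    """Map each value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        index = min(index, len(labels) - 1)  # the maximum falls in the last bin
        result.append(labels[index])
    return result

widths = [0.2, 0.4, 1.3, 1.5, 2.3, 2.5]  # hypothetical measurements
print(equal_width_bins(widths, ["S", "M", "L"]))  # ['S', 'S', 'M', 'M', 'L', 'L']
```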
Why is normalization done to data? Attributes can have different scales, units, means, and ranges. Normalization puts them on an even footing so they can be compared.
What is standardization (z-score normalization)? It rescales each attribute to a mean of 0 and a standard deviation of 1: z = (x - mean) / sd.
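A minimal z-score sketch follows. It uses the population standard deviation (`statistics.pstdev`), which is an assumption; some courses and libraries use the sample standard deviation instead. The input values are made up.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population sd 2
print(z_scores(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```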
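What is min-max normalization? It scales data to a fixed range (e.g. 0 to 1): x' = (x - min) / (max - min). It works poorly for data with outliers, because a single extreme value compresses all the other values into a narrow part of the range.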
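The outlier problem is easy to see with a small sketch. The function name, default range, and sample values are illustrative, and the code assumes the values are not all equal.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

clean = min_max([10, 20, 30, 40, 50])
skewed = min_max([10, 20, 30, 40, 1000])  # one extreme value

print(clean)   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(skewed)  # the first four values are squeezed close to 0
```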
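How do you determine if a categorical or numerical attribute is redundant? An attribute is likely redundant if it is highly collinear with another attribute, since the two probably capture the same information. This can be checked with a correlation coefficient for numeric attributes or a chi-square test for categorical attributes.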
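What is entropy-based binning? Splitting a continuous attribute into bins at the cut points that give the highest information gain, i.e. the largest reduction in the entropy of the class labels.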
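The sketch below finds a single cut point with the highest information gain. It is a simplified illustration with made-up data and function names; a full entropy-based binning would typically repeat the search within each resulting bin until some stopping rule is met, which is not shown here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, information gain) for the best single cut point."""
    pairs = sorted(zip(values, labels))
    base = entropy([label for _, label in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
            best_gain = gain
    return best_threshold, best_gain

# Hypothetical data: age and whether the person bought a product
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_split(ages, bought)
print(threshold, gain)  # cut at 32.5 separates the classes perfectly
```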
Created by: amhhh