CSI 343 Exam 1

Question Answer
What is Data Science? A cross-disciplinary practice that draws on methods from data engineering, descriptive statistics, data mining, machine learning, and predictive analysis
What are the stages of a data science project? Get, Explore, Clean, Visualize, Train & Test
What is the approach when data is expensive? Define goal/problem first, then explore data
What is the approach when data is cheap? Explore data first, then see how it can be used
What is a Primary Data source? Data collected directly by a researcher
Example of a primary data source HMDA, Census
What is a Secondary Data source? Data collected by someone else
Secondary Data source example Kaggle, UCI ML Repository, Data.gov
Problems to look for when exploring data Missing/unusual values, anomalies/outliers, lack of diversity/representation, bias
How can missing values be handled? By imputing with measures of central tendency (mean, median, mode)
Why are unbalanced distributions problematic? They bias models, especially classification models, making them less effective
Difference between correlation and causation correlation is a relationship/pattern between two variables and causation is the principle where one event causes another
What is a spurious correlation? Variables that appear correlated but have no causal relationship
Three main types of machine learning Supervised, unsupervised, reinforcement
What is supervised learning learning from labeled data
What is unsupervised learning learning from finding patterns in unlabeled data
What is reinforcement learning learning to make decisions by interacting with an environment
What is Qualitative Data Data that cannot be measured numerically; sorted by categories.
Examples of Qualitative Data Audio, Images, text, race
What is Nominal Data Categorical data with no order or inherent ranking
Nominal Data example sex, race, hair color
What is ordinal data Categorical data with ordering, but differences between ranks may not be equal
Ordinal Data example rankings, letter grades, satisfaction levels
What is Quantitative Data Numerical data that can be measured or counted
Quantitative data example height, weight, income
What is Continuous Data? Quantitative data that can take any value (measured)
Continuous data example temperature, GPA
What is discrete data? Quantitative data that takes specific countable values
Discrete data example number of dogs, birth year
What is structured data? Data organized into rows / columns in relational databases; about 20% of enterprise data; easier to manage
Structured Data example numbers, dates, strings
What is Unstructured Data? Data not stored in relational databases; lacks a schema; about 80% of enterprise data
Unstructured Data example NoSQL databases (MongoDB, Couchbase, Cassandra)
Define Mean Average value of a dataset
Arithmetic mean formula sum of values / number of values
Weighted mean Accounts for different weights of values: sum(val x weight)/sum(weight)
median middle value of dataset when ordered
mode values that occur most frequently
Unimodal 1 mode
Bimodal 2 modes
Multimodal 2+ modes
Range max-min
Percentile value below which a percentage of data falls
Percentile example 95th percentile = scored better than 95% of test takers
IQR Interquartile Range; 75th percentile-25th percentile
Variance Average squared deviation from mean; measures spread
Difference between population vs sample variance population uses n in denominator; sample uses n-1
Why is standard deviation easier to interpret than variance? It is in the same units as the data (sqrt of variance)
Experiment Repeatable procedure with defined outcomes
Experiment example Tossing a coin
Sample space set of all possible outcomes
Event Subset of sample space
Simple event one outcome
Compound event multiple outcomes
mutually exclusive events events that cannot occur together (disjoint)
3 axioms of probability Non-negativity, Unity, Additivity
Non-Negativity P(A) ≥ 0
Unity P(S) = 1
Additivity If A and B are disjoint, P(A ∪ B) = P(A) + P(B)
State the complement rule P(not A) = 1 – P(A).
State the general addition rule. P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
What is conditional probability P(A|B) = probability of A given B.
Define dependent vs independent events. Dependent: probability of one affects the other. Independent: no effect.
What does exhaustive mean in probability? Events that cover the entire sample space (one must happen).
Define outlier. Extreme observation differing from the rest. Identified as lying more than 1.5×IQR below the 25th percentile or above the 75th percentile, or more than 2 SD from the mean.
Define anomaly Unexpected data pattern; may indicate fraud, system failure, or novel events.
How to handle missing values? Impute (mean, median, mode), create new fields, or drop column if mostly missing.
How to handle invalid values? Drop them, replace with valid values, or create sentinel values.
MNIST dataset (28x28 grayscale digits) → What type of data? Nominal.
Fashion MNIST dataset (clothing images) → What type of data? Nominal.
Debt-to-income ratio as intervals/values → What type of data? Ordinal (mode but not mean).
Example when it’s okay to remove an entire column? When it isn’t relevant (e.g., phone number when analyzing grades).
Example when it’s okay to remove an entire row? When data is mostly missing or invalid.
What does an income dataset missing values illustrate? Mean of entire dataset may not be best imputation due to regional variation.
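
The descriptive-statistics cards above (weighted mean, population vs sample variance, the 1.5×IQR outlier rule, and median imputation) can be tried out with a minimal Python sketch. The helper names and the sample numbers are illustrative assumptions, not part of the course material, and the quartile method ("inclusive") is one of several conventions, so other tools may give slightly different quartiles.

```python
import statistics

def weighted_mean(values, weights):
    """sum(value x weight) / sum(weight), as on the weighted-mean card."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def iqr_outliers(data):
    """Values more than 1.5 x IQR below the 25th or above the 75th percentile."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

def impute_median(data):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [x for x in data if x is not None]
    fill = statistics.median(observed)
    return [fill if x is None else x for x in data]

# Hypothetical sample data, chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(scores)   # population variance: divides by n
samp_var = statistics.variance(scores)   # sample variance: divides by n - 1
pop_sd = statistics.pstdev(scores)       # sqrt of variance, same units as the data

print(weighted_mean([90, 80], [3, 1]))              # 87.5
print(pop_var, samp_var, pop_sd)
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))  # [100]
print(impute_median([1, None, 3, 5]))               # [1, 3, 3, 5]
```

The sample variance comes out larger than the population variance because it divides the same sum of squared deviations by n - 1 instead of n.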
Created by: user-1979725
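
The probability cards in this set (complement rule, general addition rule, conditional probability, independence) can be checked on a small example. The sketch below assumes one roll of a fair six-sided die, which is an illustrative choice and does not come from the cards, and it uses the standard definition P(A|B) = P(A ∩ B) / P(B). The function and event names are hypothetical.

```python
from fractions import Fraction

# Sample space: every possible outcome of one roll of a fair die (assumed example).
sample_space = frozenset(range(1, 7))

def prob(event):
    """P(event) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B); only defined when P(B) > 0."""
    return prob(a & b) / prob(b)

even = frozenset({2, 4, 6})            # compound event: several outcomes
greater_than_3 = frozenset({4, 5, 6})

# Complement rule: P(not A) = 1 - P(A)
p_not_even = prob(sample_space - even)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)

# Conditional probability and a dependence check
p_even_given_gt3 = cond_prob(even, greater_than_3)
independent = p_even_given_gt3 == prob(even)

print(p_not_even)        # 1/2
print(p_union)           # 2/3
print(p_even_given_gt3)  # 2/3
print(independent)       # False
```

Because P(even | greater than 3) differs from P(even), the two events are dependent: knowing the roll exceeded 3 changes the probability that it is even.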
 

 


