CSI 343 Exam 1
| Question | Answer |
|---|---|
| What is Data Science? | Cross-disciplinary practice that draws on methods from data engineering, descriptive statistics, data mining, machine learning, and predictive analysis |
| What are the stages of a data science project? | Get, Explore, Clean, Visualize, Train & Test |
| What is the approach when data is expensive? | Define goal/problem first, then explore data |
| What is the approach when data is cheap? | Explore data first, then see how it can be used |
| What is a Primary Data source? | Data collected directly by a researcher |
| Example of a primary data source | HMDA, Census |
| What is a Secondary Data source? | Data collected by someone else |
| Secondary Data source example | Kaggle, UCI ML Repository, Data.gov |
| Problems to look for when exploring data | Missing/unusual values, anomalies/outliers, lack of diversity/representation, bias |
| How can missing values be handled? | Impute with a measure of central tendency (mean, median, mode) |
| Why are unbalanced distributions problematic? | They bias models, especially classification models, making them less effective |
| Difference between correlation and causation | Correlation is a relationship or pattern between two variables; causation means one event causes another |
| What is a spurious correlation? | Variables appear correlated but have no causal relationship |
| Three main types of machine learning | Supervised, unsupervised, reinforcement |
| What is supervised learning? | Learning from labeled data |
| What is unsupervised learning? | Learning by finding patterns in unlabeled data |
| What is reinforcement learning? | Learning decisions through interaction with an environment |
| What is Qualitative Data? | Data that cannot be measured numerically; sorted by categories. |
| Examples of Qualitative Data | Audio, Images, text, race |
| What is Nominal Data? | Categorical data with no order or inherent ranking |
| Nominal Data example | sex, race, hair color |
| What is ordinal data? | Categorical data with ordering, but differences between ranks may not be equal |
| Ordinal Data example | rankings, letter grades, satisfaction levels |
| What is Quantitative Data? | Numerical data that can be measured or counted |
| Quantitative data example | height, weight, income |
| What is Continuous Data? | Quantitative data that can take any value (measured) |
| Continuous data example | temperature, GPA |
| What is discrete data? | Quantitative data that takes specific countable values |
| Discrete data example | number of dogs, birth year |
| What is structured data? | Data organized into rows and columns in relational databases; about 20% of enterprise data; easier to manage |
| Structured Data example | numbers, dates, strings |
| What is Unstructured Data? | Data not stored in relational databases; lacks a schema; about 80% of enterprise data |
| Unstructured Data example | NoSQL databases (MongoDB, Couchbase, Cassandra) |
| Define Mean | Average value of a dataset |
| Arithmetic mean formula | sum of values / number of values |
| Weighted mean | Accounts for different weights of values: sum(value × weight) / sum(weights) |
| median | middle value of dataset when ordered |
| mode | values that occur most frequently |
| Unimodal | 1 mode |
| Bimodal | 2 modes |
| Multimodal | 2+ modes |
| Range | max-min |
| Percentile | value below which a percentage of data falls |
| Percentile example | 95th percentile = scored better than 95% of test takers |
| IQR | Interquartile Range; 75th percentile − 25th percentile (Q3 − Q1) |
| Variance | Average squared deviation from mean; measures spread |
| Difference between population vs sample variance | population uses n in denominator; sample uses n-1 |
| Why is standard deviation easier to interpret than variance? | It is in the same units as the data (sqrt of variance) |
| Experiment | Repeatable procedure with defined outcomes |
| Experiment example | Tossing a coin |
| Sample space | set of all possible outcomes |
| Event | Subset of sample space |
| Simple event | one outcome |
| Compound event | multiple outcomes |
| mutually exclusive events | events that cannot occur together (disjoint) |
| 3 axioms of probability | Non-negativity, Unity, Additivity |
| Non-Negativity | P(A) ≥ 0 |
| Unity | P(S) = 1 |
| Additivity | If A and B are disjoint, P(A ∪ B) = P(A) + P(B) |
| State the complement rule | P(not A) = 1 – P(A). |
| State the general addition rule. | P(A ∪ B) = P(A) + P(B) – P(A ∩ B). |
| What is conditional probability? | P(A\|B) = probability of A given that B has occurred; P(A\|B) = P(A ∩ B) / P(B) when P(B) > 0. |
| Define dependent vs independent events. | Dependent: the occurrence of one event changes the probability of the other. Independent: no effect. |
| What does exhaustive mean in probability? | Events that cover the entire sample space (one must happen). |
| Define outlier. | Extreme observation differing from the rest. Commonly flagged when it lies more than 1.5×IQR below Q1 or above Q3, or more than 2 SD from the mean. |
| Define anomaly | Unexpected data pattern; may indicate fraud, system failure, or novel events. |
| How to handle missing values? | Impute (mean, median, mode), create new fields, or drop column if mostly missing. |
| How to handle invalid values? | Drop them, replace with valid values, or create sentinel values. |
| MNIST dataset (28x28 grayscale digits) → What type of data? | Nominal. |
| Fashion MNIST dataset (clothing images) → Data type? | Nominal. |
| Debt-to-income ratio as intervals/values → Data type? | Ordinal (the mode is meaningful, but the mean is not). |
| Example when it’s okay to remove an entire column? | When it isn’t relevant (e.g., phone number when analyzing grades). |
| Example when it’s okay to remove an entire row? | When data is mostly missing or invalid. |
| What does an income dataset with missing values illustrate? | The mean of the entire dataset may not be the best imputation because of regional variation. |
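
The descriptive-statistics cards above (mean, weighted mean, median, mode, range, IQR, variance, standard deviation) can be checked with a short sketch using Python's standard `statistics` module. The sample numbers and the `weighted_mean` helper are made up for illustration and are not from the course material.

```python
import statistics


def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weights)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value once sorted
modes = statistics.multimode(data)      # list of the most frequent value(s)
data_range = max(data) - min(data)      # max - min

# Quartiles: statistics.quantiles uses the "exclusive" method by default;
# other methods (and other libraries) can give slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # 75th percentile - 25th percentile

pop_var = statistics.pvariance(data)    # population variance: divides by n
sample_var = statistics.variance(data)  # sample variance: divides by n - 1
sample_sd = statistics.stdev(data)      # square root of the sample variance

# Hypothetical scores weighted 5 : 3 : 2.
course_grade = weighted_mean([90, 80, 70], [5, 3, 2])
```

Because the standard deviation is the square root of the variance, `sample_sd` is in the same units as `data`, which is why it is easier to interpret.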
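
The probability cards above (complement rule, general addition rule, conditional probability, independence) can be worked through on a single fair six-sided die. This is a minimal sketch that assumes all outcomes are equally likely; the events `A` and `B` and the `prob` helper are illustrative choices, not part of the course material.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (assumed equally likely outcomes).
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """P(event) = favorable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


A = {2, 4, 6}   # event: roll is even
B = {4, 5, 6}   # event: roll is greater than 3

# Complement rule: P(not A) = 1 - P(A)
p_not_a = 1 - prob(A)

# General addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# A and B are independent only if P(A ∩ B) == P(A) * P(B).
independent = prob(A & B) == prob(A) * prob(B)
```

Here P(A|B) = 2/3 differs from P(A) = 1/2, so knowing B occurred changes the probability of A: the two events are dependent.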
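
The outlier card above names two rules of thumb: the 1.5×IQR rule and the 2 SD rule. A rough sketch of both follows, assuming the IQR fences are measured from Q1 and Q3 and that the sample standard deviation is used. The function names and the `readings` data are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def sd_outliers(values, n_sd=2):
    """Flag values more than n_sd sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > n_sd * sd]


# Hypothetical readings with one extreme value.
readings = [10, 11, 12, 12, 13, 13, 14, 15, 40]

iqr_flagged = iqr_outliers(readings)
sd_flagged = sd_outliers(readings)
```

On this data both rules flag only the value 40, but they can disagree on other datasets because an extreme value inflates the mean and standard deviation while leaving the quartiles mostly unchanged.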
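
The last card notes that a dataset-wide mean can be a poor fill value when incomes vary by region. One possible alternative, sketched below, imputes each missing income with the mean of its own region. The records, field names, and the `impute_by_group` helper are all hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical records; None marks a missing income.
records = [
    {"region": "urban", "income": 80000},
    {"region": "urban", "income": 90000},
    {"region": "urban", "income": None},
    {"region": "rural", "income": 40000},
    {"region": "rural", "income": 50000},
    {"region": "rural", "income": None},
]


def impute_by_group(rows, group_key, value_key):
    """Fill missing values with the mean of the row's own group.

    Falls back to the overall mean when a group has no observed values.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])

    all_values = [v for vals in observed.values() for v in vals]
    overall_mean = statistics.mean(all_values)
    group_mean = {g: statistics.mean(vals) for g, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if new_row[value_key] is None:
            new_row[value_key] = group_mean.get(new_row[group_key], overall_mean)
        filled.append(new_row)
    return filled


imputed = impute_by_group(records, "region", "income")
```

With these numbers the overall mean is 65000, which would overstate the rural row and understate the urban one; the per-region fill values are 85000 and 45000 instead.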