# WGU-741

### Fundamentals of Data Analysis

Question | Answer |
---|---|

____ is a term used to describe information that derives from some form of measurement (counting, sorting, ranking, and so on). | Data |

True or false: the strength of the bootstrap method is that it can estimate the shape and spread for any distribution of interest. | True |

You can classify data as either quantitative or ___________. | Qualitative |

You can further classify quantitative data as discrete or _________. | Continuous |

True or false: recognizing quantitative data as discrete or continuous is another useful skill in statistics | True |

When a data set is at a __________ level, the data consists exclusively of categories (i.e. names, eye color, labels, gender, geographic location, etc.) | Nominal |

True or false: when the highest level of a data set is *ordinal level*, the data consists of categories that can be arranged in some meaningful order according to their relative size or quality | True |

A study is done to determine if the color of apples affects the purchase rate by consumers. Purchase counts for 3 colors of apples are collected. Which technique should be used to determine if there is a difference in purchase rate by color? | ANOVA |

The original scores are compared with the second scores to determine whether there was a decline. What is the appropriate model to test the hypothesis? Hint: Two-______ t-test. | Sample |

When performing ANOVA on a data set, the error terms follow a _______ probability distribution. | Normal |

Machine learning frees humans and depends on ____________ | Algorithms |

One of the objectives in Data Mining is to discover _________ within a given data set | Patterns |

The technical term for "data about data" | Metadata |

What does the "A" in API stand for? | Application |

_________ reveals features distinguishing one class of data objects from the other, leading to new discoveries | Clustering |

The difference between classification and clustering is that classification starts with ____________ labels, while for clustering the labels are created after the fact. | Predefined |

Developing a reasonable understanding of ____________ is a must for data scientists. | Statistics |

True or false: R^2 approaching 1 indicates a good fit for the model. | True |

The following formula is how the ______________ Sum of Squares is defined: SSE = ∑(Y − Ŷ)^2 | Residual |
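
As a minimal sketch of the residual sum of squares and R^2 cards above, the snippet below computes both by hand. The data values and the names `y`, `y_hat`, `sse`, `sst`, and `r_squared` are illustrative assumptions, not part of the course material.

```python
# Hypothetical observed responses and model predictions (illustrative only).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

y_mean = sum(y) / len(y)

# Residual (error) sum of squares: SSE = sum((Y - Y_hat)^2)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Total sum of squares: SST = sum((Y - Y_mean)^2)
sst = sum((yi - y_mean) ** 2 for yi in y)

# Coefficient of determination: share of total variation explained by the model
r_squared = 1 - sse / sst

print(f"SSE = {sse:.2f}, SST = {sst:.2f}, R^2 = {r_squared:.2f}")
```

The smaller SSE is relative to SST, the closer R^2 gets to 1.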

An ___________ Data Warehouse (EDW) is a specialized database that focuses on analyzing data. | Enterprise |

An organization is generally working with "big data" when all four "Vs" are present: Volume, Variety, Velocity, and ____________ | Veracity |

If you use a relational database, then you're going to be limited mostly to structured data. If you use a NoSQL cluster, then you'll be able to work with all types of data, but it will be more difficult to create __________. | Reports |

The degree that two things are related | Correlation |

True or false. With correlation coefficient, the closer you are to 1 or -1, the stronger the relationship | True |
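
To make the correlation cards concrete, here is a small sketch of the Pearson correlation coefficient computed from its definition. The helper name `pearson_r` and the sample data are assumptions chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
strong_positive = pearson_r(x, [2, 4, 6, 8, 10])  # perfectly increasing line
strong_negative = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly decreasing line
weak = pearson_r(x, [3, 1, 4, 1, 5])              # noisy, weak relationship

print(strong_positive, strong_negative, weak)
```

Values near 1 or -1 signal a strong linear relationship; values near 0 signal a weak one.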

Relative risk is the _____ of the risks of an outcome for two groups. | Ratio |
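
A short sketch of relative risk as a ratio of two risks. The function name `relative_risk` and the counts are hypothetical.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Risk in group A divided by risk in group B."""
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    return risk_a / risk_b


# Hypothetical counts: 15 of 100 treated vs. 30 of 100 untreated had the outcome.
rr = relative_risk(15, 100, 30, 100)
print(rr)
```

A relative risk below 1 means the outcome is less likely in the first group than in the second.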

A researcher uses a chi-square test to find χ² = 6.78. Which conclusion can be drawn from this? | The χ² value does not allow conclusions, and a p-value must be calculated. |
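
One way to see why the statistic alone is not enough: the p-value also depends on the degrees of freedom. The sketch below uses closed-form upper-tail formulas that hold only for 1 or an even number of degrees of freedom; the helper `chi2_p_value` is an illustrative assumption, not a library function.

```python
import math


def chi2_p_value(statistic, df):
    """Upper-tail chi-square p-value; closed forms for df == 1 or even df only."""
    half = statistic / 2
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df > 0 and df % 2 == 0:
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch only handles df == 1 or even df")


# The same statistic leads to different p-values as df changes.
p_df1 = chi2_p_value(6.78, 1)
p_df2 = chi2_p_value(6.78, 2)
p_df4 = chi2_p_value(6.78, 4)
print(p_df1, p_df2, p_df4)
```

With the same statistic of 6.78, the result falls below a 0.05 threshold for 1 degree of freedom but not for 4.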

An analyst was asked to analyze the test case of effectiveness of reducing a side effect, comparing the treatment versus placebo group on the result of having a side effect or not. What is the proper method to use in this analysis? | Logistic regression |

A data analyst is performing logistic regression on elements with high multicollinearity. What is an assumption of logistic regression that is being violated? | Variable independence |

What is the correct assumption for the logistic regression model? | The error terms for the variables are independent. |

True or false: z-tests are commonly done when the population SD is known. | True |

True or false: t-Tests are commonly done when the population SD is unknown. | True |

True or false: 2 conditions for using a t-test are: sigma is unknown, and n < 30 | True |

Participants are asked to rate severity. Responses are rated on a Likert-type scale (1-5). The randomly assigned outcomes are not normally distributed. A ____-_____ (non-parametric) test should be used. | rank-based |

True or false: the strength of the Bootstrap method is that it can estimate the shape and spread of any distribution of interest. | True |

Which test should be used when finding whether the outcome is success or failure where: a) the experiment consists of a fixed number of trials, b) each trial only has two possible outcomes, c) each trial is independent of any other trial? | Binomial test |
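
Under the three conditions listed above, an exact one-sided binomial p-value can be sketched by summing binomial probabilities. The function name `binomial_upper_p` and the example numbers are assumptions for illustration.

```python
from math import comb


def binomial_upper_p(successes, trials, p0):
    """P(X >= successes) when X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 9 successes in 10 independent trials, null p0 = 0.5.
p_value = binomial_upper_p(9, 10, 0.5)
print(p_value)
```

Here the p-value is the chance of seeing 9 or more successes if each trial truly had a 50% success rate.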

Apple color is studied to determine consumer purchase rate. Three colors are used and counts taken. Which technique is best? | ANOVA |

A pre-test and a post-test administered to determine if significant change is observed. Best test? | Two-sample t-test |

Which assumptions are made using a linear regression model? | Homoscedasticity; and B0, B1, and variance are constant |

If a researcher uses an ANOVA with a minimum sampling size of 30, what will need to be calculated? | Mean & Standard deviation |

True or false: if a multiple regression model needs to be constructed, dependent error terms violates the standard model assumptions. | True (you want them independent!) |

What are the 2 conditions that must be met in order to use a t-test? | 1. Population standard deviation must be unknown. 2. Sample size must be < 30. |

What is the z-test formula? | z = (x̄ − μ) / (σ / √n) |
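
The z statistic divides the gap between the sample mean and the hypothesized mean by the standard error σ/√n. A minimal sketch, with made-up numbers and a hypothetical helper name `z_statistic`:

```python
import math


def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """z = (sample mean - hypothesized mean) / (population SD / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))


# Hypothetical values: sample mean 52, hypothesized mean 50, known SD 6, n = 36.
z = z_statistic(52, 50, 6, 36)

# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
print(z, p_two_sided)
```

Note the parentheses around the numerator: the whole difference is divided by the standard error, not just the hypothesized mean.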

True or false: in an f-test, if R-squared approaches 1, that indicates a good fit for the model | True |

If the goal is to obtain an accurate p-value for the effectiveness of two medications, which method should be used? (Hint: "accurate") | Fisher's exact test |

Which test should be used if examining the relationship among 3 categories? | Chi-square |

When performing a hypothesis test for slope of a regression, which is the appropriate test to use? | One-sample t-test |

True or false: the Kruskal Wallis test is a rank-based non-parametric test to find statistical differences between two or more groups of an independent variable | True |

True or false: the f-test is a test designed to test if two population variances are equal | True |

True or false: the overall F-test considers all the regression parameters other than the intercept parameter, β0. | True |

True or false: the One-way t-test is a test used to estimate the mean of a sample population | True |

A Probability Mass Function (pmf) utilizes which type of variable? | a discrete random variable |

True or false: Akaike Information Criterion (AIC) is a measure commonly used to compare models with a different number of parameters. | True |

True or false: standard deviation is the square root of the variance. | True |

True or false: the Hypergeometric distribution models the number of events in a certain number of trials (draws). Once selected, an item is no longer available for selection (known as "draws without replacement"). | True |

True or false: the Beta distribution models probability for proportions. | True |

The Uniform distribution models possible values which fall within a range of _____ probabilities. | equal |

True or false: the sampling distribution of the mean is the distribution of sample means when taking random samples of the same size. | True |

True or false: the t-test is a method for hypothesis testing for the mean of a sample taken from a normally distributed population when the population standard deviation is known. | False (unknown) |

True or false: *bootstrapping* is a resampling technique that selects a set of data points, allowing the same data point to potentially be selected more than once. | True |
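
Resampling with replacement can be sketched with the standard library. The sample values, the helper `bootstrap_means`, the seed, and the number of resamples are all illustrative assumptions.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, each the size of the data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


sample = [12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 11.7]
boot = bootstrap_means(sample)

# Spread of the bootstrap means approximates the standard error of the mean.
boot_se = statistics.stdev(boot)

# Rough 95% percentile interval taken from the sorted bootstrap means.
boot_sorted = sorted(boot)
ci_low = boot_sorted[int(0.025 * len(boot))]
ci_high = boot_sorted[int(0.975 * len(boot)) - 1]
print(boot_se, ci_low, ci_high)
```

Because `rng.choices` samples with replacement, the same data point can appear several times in one resample, which is the defining feature of the bootstrap.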

The Wilcoxon rank-sum test can be performed using ______ data. | ranked |

The Kruskal-Wallis test is another rank-based statistical test. It is a ______________ equivalent to a one-way analysis of variance. | non-parametric |

Fisher's exact test is a method of calculating the exact _______ for contingency tables | p-value |

Computing probabilities for all possible tables is very time-consuming, so an approximation, the ____________ test, is sometimes used when the sample size is large. | chi-squared |

True or false: a Principal Component Analysis (PCA) transforms variables into possibly more components to simplify subsequent analyses. | False. "fewer" |

A standardized variable is a variable that has been adjusted to have a mean of zero and a standard deviation of ______. | one |
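
A brief sketch of standardizing a variable, assuming the sample standard deviation is used as the scale. The data and the helper name `standardize` are hypothetical.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
z_scores = standardize(heights_cm)
print(z_scores)
```

After the adjustment, the values have a mean of zero and a standard deviation of one, whatever the original units were.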

A principal component loading plot displays the ______ of each input variable between a pair of principal components. | weight |

True or false: Factor loading is the correlation between the original data and the new principal component | True |

If time dependence exists in the errors, the errors are said to be ______________. | autocorrelated |

True or false: the coefficient of determination, denoted R^2, measures the proportion of total variation in the response variable, Y, that is accounted for by the linear regression model. | True |

The category represented by the omitted indicator variable is the __________ ______ of the categorical predictor. | reference level |

True or false: an interaction term is two predictor variables multiplied together and used as an additional predictor term in a multiple linear regression model | True |

True or false: a Loess smooth curve is an estimate of the data trend obtained by predicting each data point using the points nearby. | True |

_____________ is the percentage of correctly classified observations with the desired outcome. | Sensitivity |

_____________ is the percentage of correctly classified observations WITHOUT the desired outcome. | Specificity |
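
Sensitivity and specificity can both be read off a confusion matrix. The counts below are invented for illustration, and the variable names are assumptions.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
true_positives = 40   # have the outcome, classified as having it
false_negatives = 10  # have the outcome, classified as not having it
true_negatives = 45   # lack the outcome, classified as lacking it
false_positives = 5   # lack the outcome, classified as having it

# Sensitivity: share of observations WITH the outcome that were classified correctly.
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: share of observations WITHOUT the outcome that were classified correctly.
specificity = true_negatives / (true_negatives + false_positives)

print(sensitivity, specificity)
```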

True or false: when there is an even number of terms, the median is the mean of the middle two terms. | True |

A _______________ distribution has a peak of high-frequency data on the right with a tail of low-frequency data on the left side. | left-skewed |

Rolling less than a 3 on a six-sided die is best described as _____. a) a simple event b) a compound event | b) a compound event |

The Poisson distribution models the number of _______ that occur in an interval (counts). Ex: The number of emails received each day. | events |

_______ is a statistical method used to compare the means of two or more groups. | ANOVA |
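
A minimal sketch of the one-way ANOVA F statistic, which compares variation between group means with variation within groups. The groups and the helper name `one_way_anova_f` are illustrative assumptions; turning F into a p-value would additionally need the F distribution, which is not covered here.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Three hypothetical groups of three observations each.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F means the group means differ by more than the within-group scatter would suggest.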

True or false: a t-test is a statistic that checks if two means are reliably different from each other. | True |

The concept of __________ ____________ formalizes the intuition of making inferences about populations from observed sample statistics. | statistical significance |

If the p-value is less than a threshold known as the ______________ (written as alpha), then one rejects the null hypothesis in favor of the alternative hypothesis. | significance level |

True or false: the f-distribution is a non-symmetric, continuous probability distribution that is used to perform inference on two population means | False (variances) |

True or false: the f-distribution is right-skewed and unimodal | True |

True or false: the Probability Mass Function can be defined as P(X=x) | True |

A _________________ describes characteristics of a population. | parameter |

True or false: a chi-square test compares an expected data set with an observed data set. | True |

A Logistic Regression Model is one such that the dependent variable is ____________. | categorical |

True or false: the t-distribution is used in place of the normal distribution in situations where the sample size (n) is small (<30) or the population standard deviation (σ) is unknown. | True |

Why must the ANOVA method be used on 3 or more groups? | Because there are variations within each group (columns) AND variations between each group (rows) |

Name the best test when conducting a Pre-test and Post-test from the same population? | Two-sample T-test |

True or false: the one-way t-test is used to estimate the mean of a sample population | True |

When conducting an experiment that involves a treatment group vs. a control group, what type of regression applies? | Logistic regression |

Which test would you use when the relevant data is found in a contingency table? | Fisher's exact test |

Fisher's exact test uses the __________________ distribution to calculate the probability that the observed contingency table (or one even more unusual) would occur under the null hypothesis. | hypergeometric |
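
The link between Fisher's exact test and the hypergeometric distribution can be sketched for a 2x2 table. The helper names and the example table are assumptions; the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention.

```python
from math import comb


def hypergeom_pmf(a, row1, col1, n):
    """Probability of `a` in the top-left cell with all margins held fixed."""
    return comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    p_observed = hypergeom_pmf(a, row1, col1, n)
    # Sum over every table that is as unusual as, or more unusual than, the observed one.
    return sum(
        p
        for p in (hypergeom_pmf(k, row1, col1, n) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical 2x2 table: 3 and 1 in the first row, 1 and 3 in the second.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```

Enumerating every possible table is what makes the test exact, and also what makes it slow for large samples.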

Twenty-five adults between 65 and 80 are selected to eat blackberries for six months and be retested. The original scores are compared with the second scores to determine whether there was a decline. Which test should be used? | Two-sample t-test |

A random sample of 30 students from Major One and 30 students from Major Two are selected to compare GPAs. The researcher decides to use an ANOVA test for this study. What will the researcher need to calculate? | Group means and standard deviations |

What are data mining and business analytics used for? | Collecting and analyzing data for better decision making with the goal of solving business-related problems. |

An example of a test to use when you have 2 discrete variables is: A) Kruskal-Wallis, B) ANOVA, C) Chi-squared | Chi-squared |

True or false. In the field of statistics, we study samples because collecting data for an entire population is generally not feasible. | True |

