Fundamentals of Data Analysis

____ is a term used to describe information that derives from some form of measurement (counting, sorting, ranking, and so on). Data
True or false: the strength of the bootstrap method is that it can estimate the shape and spread for any distribution of interest. True
You can classify data as either quantitative or ___________. Qualitative
You can further classify quantitative data as discrete or _________. Continuous
True or false: recognizing quantitative data as discrete or continuous is another useful skill in statistics. True
When a data set is at the __________ level, the data consist exclusively of categories (e.g., names, eye color, labels, gender, geographic location). Nominal
True or false: when the highest level of a data set is *ordinal level*, the data consists of categories that can be arranged in some meaningful order according to their relative size or quality True
A study is done to determine if the color of apples affects the purchase rate by consumers. Purchase counts for 3 colors of apples are collected. Which technique should be used to determine if there is a difference in purchase rate by color? ANOVA
The original scores are compared with the second scores to determine whether there was a decline. What is the appropriate model to test the hypothesis? Hint: Two-______ t-test. Sample
When performing ANOVA on a data set, the error terms follow a _______ probability distribution. Normal
Machine learning frees humans and depends on ____________ Algorithms
One of the objectives in Data Mining is to discover _________ within a given data set Patterns
The technical term for "data about data" Metadata
What does the "A" in API stand for? Application
_________ reveals features distinguishing one class of data objects from the other, leading to new discoveries Clustering
The difference between classification and clustering is that classification starts with ____________ labels, while for clustering the labels are created after the fact. Predefined
Developing a reasonable understanding of ____________ is a must for data scientists. Statistics
True or false: R^2 approaching 1 indicates a good fit for the model. True
The following formula is how the ______________ Sum of Squares is defined: SSE = ∑(Y − Ŷ)^2. Residual
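The two cards above (R^2 and SSE) can be tied together with a small worked example. This is a minimal sketch, not part of the original deck: the data values are made up, and it assumes the standard relationship R^2 = 1 − SSE/SST, where SST is the total sum of squares around the mean of Y.

```python
# Hypothetical observed values and model predictions (illustrative only).
y = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]

# Residual sum of squares: SSE = sum((Y - Y_hat)^2)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Total sum of squares: squared deviations of Y from its own mean
y_mean = sum(y) / len(y)
sst = sum((yi - y_mean) ** 2 for yi in y)

# Coefficient of determination (assumed definition: 1 - SSE/SST)
r_squared = 1 - sse / sst

print(f"SSE = {sse:.2f}, R^2 = {r_squared:.2f}")
```

Here the residuals are small relative to the total variation, so R^2 comes out close to 1, which is what the "good fit" card describes.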
An ___________ Data Warehouse (EDW) is a specialized database that focuses on analyzing data. Enterprise
An organization is generally working with "big data" when all four "Vs" are present: Volume, Variety, Velocity, and ____________ Veracity
If you use a relational database, then you're going to be limited mostly to structured data. If you use a NoSQL cluster, then you'll be able to work with all types of data, but it will be more difficult to create __________. Reports
The degree to which two things are related Correlation
True or false: with the correlation coefficient, the closer you are to 1 or -1, the stronger the relationship. True
Relative risk is the _____ of the risks of an outcome for two groups. Ratio
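As a quick illustration of relative risk as a ratio, here is a minimal sketch. The function name `relative_risk` and the counts are hypothetical and chosen only for the example.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Ratio of the risk in group A to the risk in group B."""
    risk_a = events_a / total_a  # proportion of group A with the outcome
    risk_b = events_b / total_b  # proportion of group B with the outcome
    return risk_a / risk_b

# Hypothetical counts: 10 of 100 in group A, 5 of 100 in group B.
rr = relative_risk(10, 100, 5, 100)
print(f"Relative risk = {rr:.2f}")
```

A value above 1 means the outcome is more common in group A; a value of 1 means the two risks are equal.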
A researcher uses a chi-square test to find χ² = 6.78. Which conclusion can be drawn from this? The χ² value does not allow conclusions, and a p-value must be calculated.
An analyst was asked to analyze the test case of effectiveness of reducing a side effect, comparing the treatment versus placebo group on the result of having a side effect or not. What is the proper method to use in this analysis? Logistic regression
A data analyst is performing logistic regression on elements with high multicollinearity. What is an assumption of logistic regression that is being violated? Variable independence
What is the correct assumption for the logistic regression model? The error terms for the variables are independent.
True or false: z-tests are commonly done when the population SD is known. True
True or false: t-Tests are commonly done when the population SD is unknown. True
True or false: 2 conditions for using a t-Test are: sigma is unknown, and n <30 True
Participants asked to rate severity. Responses are rated on Likert-type scale (1-5). Randomly assigned outcomes not normally distributed. The ____-_____ method (non-parametric) test should be used. rank-based
True or false: the strength of the Bootstrap method is that it can estimate the shape and spread of any distribution of interest. True
Which test should be used when finding whether the outcome is success or failure where: a) the experiment consists of a fixed number of trials, b) each trial only has two possible outcomes, c) each trial is independent of any other trial? Binomial test
Apple color is studied to determine consumer purchase rate. Three colors are used and counts taken. Which technique is best? ANOVA
A pre-test and a post-test are administered to determine if significant change is observed. Best test? Two-sample t-test (paired, because the same subjects are measured twice)
Which assumptions are made using a linear regression model? Homoscedasticity; and B0, B1, and variance are constant
If a researcher uses an ANOVA with a minimum sampling size of 30, what will need to be calculated? Mean & Standard deviation
True or false: if a multiple regression model needs to be constructed, dependent error terms violate the standard model assumptions. True (you want them independent!)
What are the 2 conditions that must be met in order to use a t-test? 1. Population standard deviation must be unknown. 2. Sample size must be < 30.
What is the z-test formula? z = (x̄ − μ) / (σ / √n)
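A worked example of the z statistic for a sample mean, as a minimal sketch. It assumes the population standard deviation is known (the condition for a z-test given in the cards above); the function name `z_statistic` and the numbers are hypothetical.

```python
import math

def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """z = (x_bar - mu) / (sigma / sqrt(n)), with sigma the known population SD."""
    standard_error = pop_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

# Hypothetical numbers: sample mean 105, hypothesized mean 100,
# known population SD 15, sample size 36.
z = z_statistic(105, 100, 15, 36)
print(z)  # 2.0
```

Note the parentheses around the numerator: the difference x̄ − μ is divided by the standard error, not just μ.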
True or false: in an f-test, if R-squared approaches 1, that indicates a good fit for the model True
If the goal is to obtain an accurate p-value for the effectiveness of two medications, which method should be used? (Hint: "accurate") Fisher's exact test
Which test should be used if examining the relationship among 3 categories? Chi-square
When performing a hypothesis test for slope of a regression, which is the appropriate test to use? One-sample t-test
True or false: the Kruskal Wallis test is a rank-based non-parametric test to find statistical differences between two or more groups of an independent variable True
True or false: the f-test is a test designed to test if two population variances are equal True
True or false: the overall F-test considers all the regression parameters other than the intercept parameter, β0. True
True or false: the One-way t-test is a test used to estimate the mean of a sample population True
A Probability Mass Function (pmf) utilizes which type of variable? a discrete random variable
True or false: Akaike Information Criterion (AIC) is a measure commonly used to compare models with a different number of parameters. True
True or false: standard deviation is the square root of the variance. True
True or false: the Hypergeometric distribution models the number of events in a certain number of trials (draws). Once selected, an item is no longer available for selection (known as "draws without replacement"). True
True or false: the Beta distribution models probability for proportions. True
The Uniform distribution models possible values which fall within a range of _____ probabilities. equal
True or false: the sampling distribution of the mean is the distribution of sample means when taking random samples of the same size. True
True or false: the t-test is a method for hypothesis testing for the mean of a sample taken from a normally distributed population when the population standard deviation is known. False (unknown)
True or false: *bootstrapping* is a resampling technique that selects a set of data points, allowing the same data point to potentially be selected more than once. True
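Resampling with replacement can be sketched in a few lines. This is an illustrative example only: the data, the function name `bootstrap_means`, the number of resamples, and the choice of the mean as the statistic are all assumptions, not something the deck specifies.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Draw resamples of len(data) WITH replacement and record each mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        # choices() samples with replacement, so a point can appear twice
        resample = rng.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means

data = [3, 5, 7, 8, 12, 13, 14, 18, 21]  # hypothetical sample
means = bootstrap_means(data)

# The spread of the resampled means approximates the standard error of the mean.
spread = statistics.stdev(means)
print(f"bootstrap SE of the mean ~ {spread:.2f}")
```

The collection of resampled means gives an empirical picture of the shape and spread of the sampling distribution without assuming a particular parametric form.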
The Wilcoxon rank-sum test can be performed using ______ data. ranked
The Kruskal-Wallis test is another rank-based statistical test. It is a ______________ equivalent to a one-way analysis of variance. non-parametric
Fisher's exact test is a method of calculating the exact _______ for contingency tables p-value
Computing probabilities for all possible tables is very time-consuming, so an approximation, the ____________ test, is sometimes used when the sample size is large. chi-squared
True or false: a Principal Component Analysis (PCA) transforms variables into possibly more components to simplify subsequent analyses. False. "fewer"
A standardized variable is a variable that has been adjusted to have a mean of zero and a standard deviation of ______. one
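Standardizing can be shown with a short sketch. The function name `standardize` and the data are hypothetical; the example assumes the population standard deviation (`statistics.pstdev`) is used, so the result has a standard deviation of exactly one in that sense.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD, not sample SD
    return [(v - mean) / sd for v in values]

z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data

print(statistics.mean(z_scores))    # approximately 0
print(statistics.pstdev(z_scores))  # approximately 1
```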
A principal component loading plot displays the ______ of each input variable between a pair of principal components. weight
True or false: Factor loading is the correlation between the original data and the new principal component True
If time dependence exists in the errors, the errors are said to be ______________. autocorrelated
True or false: the coefficient of determination, denoted R^2, measures the proportion of total variation in the response variable, Y, that is accounted for by the linear regression model. True
The category represented by the omitted indicator variable is the __________ ______ of the categorical predictor. reference level
True or false: an interaction term is two predictor variables multiplied together and used as an additional predictor term in a multiple linear regression model. True
True or false: a Loess smooth curve is an estimate of the data trend obtained by predicting each data point using the points nearby. True
_____________ is the percentage of correctly classified observations with the desired outcome. Sensitivity
_____________ is the percentage of correctly classified observations WITHOUT the desired outcome. Specificity
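Sensitivity and specificity can both be read off the counts of a two-class confusion matrix. The sketch below is illustrative: the labels are made up, 1 is assumed to mean "has the desired outcome", and the function name is hypothetical.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for 0/1 labels, where 1 = desired outcome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn)  # share of actual positives classified correctly
    specificity = tn / (tn + fp)  # share of actual negatives classified correctly
    return sensitivity, specificity

# Hypothetical labels: 4 observations with the outcome, 6 without.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```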
True or false: when there is an even number of terms, the median is the mean of the middle two terms. True
A _______________ distribution has a peak of high-frequency data on the right with a tail of low-frequency data on the left side. left-skewed
Rolling less than a 3 on a six-sided die is best described as _____. a) a simple event b) a compound event b) a compound event
The Poisson distribution models the number of _______ that occur in an interval (counts). Ex: The number of emails received each day. events
_______ is a statistical method used to compare the means of two or more groups. ANOVA
True or false: a t-test is a statistic that checks if two means are reliably different from each other. True
The concept of __________ ____________ formalizes the intuition of making inferences about populations from observed sample statistics. statistical significance
If the p-value is less than a threshold known as the ______________ (written as alpha), then one rejects the null hypothesis in favor of the alternative hypothesis. significance level
True or false: the f-distribution is a non-symmetric, continuous probability distribution that is used to perform inference on two population means. False (variances)
True or false: the f-distribution is right-skewed and unimodal True
True or false: the Probability Mass Function can be defined as P(X=x) True
A _________________ describes characteristics of a population. parameter
True or false: a chi-square test compares an expected data set with an observed data set. True
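The comparison of observed and expected counts is usually summarized by the statistic χ² = ∑(O − E)²/E. The sketch below computes only that statistic; the counts and the function name are hypothetical. As an earlier card notes, the statistic alone does not support a conclusion, so a p-value would still need to be obtained from the chi-square distribution with the appropriate degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three categories with equal expected frequencies.
observed = [18, 22, 20]
expected = [20, 20, 20]

chi2 = chi_square_statistic(observed, expected)
print(f"chi-square statistic = {chi2:.2f}")
```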
A Logistic Regression Model is one such that the dependent variable is ____________. categorical
True or false: the t-distribution is used in place of the normal distribution in situations where the sample size (n) is small (<30) or the population standard deviation (σ) is unknown. True
Why must the ANOVA method be used on 3 or more groups? Because there are variations within each group (columns) AND variations between each group (rows)
Name the best test when conducting a pre-test and post-test on the same population. Two-sample t-test (paired, because the same subjects are measured twice)
True or false: the one-way t-test is used to estimate the mean of a sample population. True
When conducting an experiment that involves a treatment group vs. a control group, what type of regression applies? Logistic regression
Which test would you use when the relevant data is found in a contingency table? Fisher's exact test
Fisher's exact test uses the __________________ distribution to calculate the probability that the observed contingency table (or one even more unusual) would occur under the null hypothesis. hypergeometric
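The hypergeometric calculation behind Fisher's exact test can be sketched for a 2×2 table with cells a, b (first row) and c, d (second row). This is a minimal illustration, not a full implementation: the function names and the table are hypothetical, and only a one-sided p-value (tables with a top-left count at least as large as observed, margins held fixed) is computed.

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed row/column totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def one_sided_p_value(a, b, c, d):
    """Sum the probabilities of the observed table and all tables with a larger
    top-left cell, keeping every margin fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
    return p

# Hypothetical table: rows = two groups, columns = outcome yes / no.
p_table = table_probability(3, 1, 1, 3)
p_value = one_sided_p_value(3, 1, 1, 3)
print(f"P(observed table) = {p_table:.4f}, one-sided p-value = {p_value:.4f}")
```

Enumerating every table this way is what becomes expensive for large samples, which is why the chi-squared approximation mentioned in an earlier card is sometimes used instead.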
Twenty-five adults between 65 and 80 are selected to eat blackberries for six months and be retested. The original scores are compared with the second scores to determine whether there was a decline. Two-sample t-test
A random sample of 30 students from Major One and 30 students from Major Two are selected to compare GPAs. The researcher decides to use an ANOVA test for this study. What will the researcher need to calculate? Group means and standard deviations
What are data mining and business analytics used for? Collecting and analyzing data for better decision making with the goal of solving business-related problems.
An example of a test to use when you have 2 discrete variables is: A) Kruskal-Wallis, B) ANOVA, C) Chi-squared Chi-squared
True or false. In the field of statistics, we study samples because collecting data for an entire population is generally not feasible. True
Created by: ronzStack