WGU-743
Data Mining and Analytics
Question | Answer |
---|---|
What is "Scoring"? (Hint: used in marketing.) | The probability that a customer will behave in a certain way in the future |
Segmenting the population into groups that are homogeneous (similar) in order to construct a specific model for each segment is called: A) Stratification of models, B) Unsupervised segmentation, C) PCA | A) Stratification of models |
What is "unsupervised" segmentation? | Segmenting where general characteristics of the sample have no direct relationship with the dependent variable |
The Shapiro–Wilk test tests the ____ __________ that a given sample comes from a normally distributed population. | null hypothesis |
True or false. The Kolmogorov-Smirnov test is a statistical hypothesis test. | True |
For variables that are qualitative or discrete, we test their link with the dependent variable by using which of the following: A) Cramér's V, B) Student's T-test, C) Kruskal-Wallis test | Cramér's V |
The net present value of profitability of a customer is known as the ________ value. | lifetime |
True or false. The Anderson-Darling test is often used where a family of distributions is being tested. | True |
True or false. The Cramér's V test can be used in a bi-variate analysis. | True |
Select the best test for when you have 1 discrete variable and 1 continuous variable (e.g. dosage vs. recovery time): A) Cramér's V, B) ANOVA, C) Wilcoxon signed-rank | B) ANOVA |
A tool which is very suited to the detection of extreme values is the ____ ______. | box plot |
Statistical replacement of missing values uses a process called what? | Imputation |
What does the acronym MCMC stand for? | Markov chain Monte Carlo |
True or false. Aberrant values can often be detected using simple frequency tables. | True |
True or false. The normality of a variable can be verified by the Shapiro–Wilk test. | True |
True or false. The values of R^2 and χ^2 in the Kruskal–Wallis test increase with the strength of the link. | True |
True or false. Collinearity does not affect decision trees. | True |
True or false. Calculating the correlation coefficients of the variables in pairs is the simplest way of detecting collinearity. | True |
Predictive or Descriptive? Adverse events of a drug were explored by clustering the therapeutic classes. | Descriptive |
Predictive or Descriptive? A data analyst receives detailed customer purchasing data and a manager tasks the data analyst with finding associations of any type among customers. | Descriptive |
A data analyst is using analytical CRM to extract, store, analyze, and output relevant customer information. What is the first step within the analytical CRM phase that this analyst will be performing? | Combining a customer’s records to develop a holistic view |
True or false. The development phase cannot be completed in the absence of data. | True |
After performing a normality test on a dataset, results show the null hypothesis should be rejected. Which type of test should now be performed to analyze the data? | A non-parametric test |
What technique should be used to discover links between Age and Income? | Pearson correlation |
What type of data is used in psychographic data within commercial applications? | Lifestyle |
Name 2 types of data used in commercial applications? | Data on products and contracts |
What type of analysis is used to determine the distribution of purchases during a given period of time? | RFM (Recency, Frequency, Monetary) |
What is the term for the analysis a data analyst uses to study the churn rate (customer retention) and the likely time of churn for customers of a local wireless company? | Survival analysis |
True or false. A decision tree algorithm is both a prediction model and a classification model. | True |
Which 2 methods should data analysts use to reduce processing time when working in SAS? A) Create Booleans as alphanumeric variables, B) Convert the data to a flat file, C) Place the analyzed file on a cloud network, D) Increase RAM | A & D |
A retailer is looking for patterns that will be used for marketing and has given a data analyst a large list of transactions with data on what customers purchased during each store visit. Which mining method should the analyst use? | Association rules |
An automobile manufacturer has obtained access to customer-related data that was previously unavailable. Which method should the manufacturer use to perform descriptive data mining? | Factor analysis |
An analyst has been tasked by a pizza company to provide recommendations for three new restaurants. The best indication of success is based on the population of a surrounding area. Which mining method should be used to provide the recommendation? | Clustering |
When the single dependent variable is quantitative, and the single independent variable is qualitative, which data mining method should the data analyst be using? | Decision trees |
A data analyst needs to identify the most frequently cited words in the documentation and classify them into groups. Which method should the analyst use for classification? | Clustering |
A data analyst wants to reduce the dimensionality of the text from a set of web pages. Which method should the data analyst apply to the dataset? A) Decision trees, B) Kohonen maps, C) ANOVA | B) Kohonen maps |
Understanding the expectations of customers and anticipating their needs is a major objective of which of the following? A) Data analysis, B) CRM, C) Inventory Management (IM) | CRM |
Looking at customer behavior and developing a descriptive profile in order to provide personalized marketing strategies for each group is known as: A) Correspondence Analysis (CA), B) Linear regression, C) Customer Segmentation | Customer Segmentation |
True or false. The variation in the original data set is not maintained when data is discretized. | False. Variation IS maintained |
Which of the following sampling methods divides the population and draws individuals at random from each group? A) Stratified, B) Clustering, C) Systematic | A) Stratified |
Relational data is a type of which category of data? A) Product, B) Geodemographic, C) Customer | C) Customer |
Which of the following statistical techniques uses several variables, collectively, to predict one outcome variable? A) PCA, B) Multiple regression, C) RFM | B) Multiple regression |
Which of the following statistical techniques allows you to transform your set of variables by using the variables with the highest variance? A) ANOVA, B) PCA, C) CA | B) PCA |
True or false. Principal Component Analysis (PCA) helps you identify which variables are important so you can compress the data by reducing the number of dimensions. | True |
True or false. Association Analysis allows you to determine the degree to which the items tend to be associated with one another. | True. For example, people who buy hamburger buns will also likely buy ketchup and mustard. You can associate items together and create rules. |
True or false. Logistic regression is used to explain the relationship between a dependent BINARY variable and one (or more) independent variables. | True |
True or false. Naive Bayes is an algorithm that uses a set of training data to construct a model that will classify new data points. | True |
Which of the following prediction techniques involves determining "weights" that describe the influence of the inputs on the target variable? A) Association analysis, B) Linear regression, C) Support Vector Machine (SVM) | B) Linear regression |
How should the data be transformed prior to using the ANOVA test? | By taking the natural log |
A non-parametric test doesn't require a ______ data distribution. | normal |
True or false. The Anderson-Darling test is a one-tailed test. | True |
True or false. Cramér's V measures association in frequency tables of categorical data types (2x2 or larger). | True |
Information collected by a company that measures the importance a consumer places on particular attributes of products or services is called: A) geodemographic data, B) bad data, C) attitudinal data | C) attitudinal data |
Winsorizing is the process of replacing an outlier's original value with the _________ value of an observation not seriously suspect. | nearest |
True or false. The lower the kurtosis, the smaller the range of values. | True |
Which of the following is a non-parametric test? A) Mann-Whitney, B) Independent-Samples T-test, C) Paired-Samples T-test | A) Mann-Whitney |
True or false. Parametric statistics are used to make inferences about population parameters. | True |
True or false. Non-parametric statistics do not assume that the data or population have any characteristic structure. | True |
The Wilcoxon rank-sum test can be performed using ______ data. | ranked |
Which 2 of the following non-parametric tests are used in conjunction with Chi-square? A) McNemar, B) Wilcoxon signed-rank, C) Fisher's exact | McNemar & Fisher's exact |
True or false. An "r" value in Pearson's correlation ranges between 0 and 1. | False. It ranges between -1 and 1 |
True or false. Cramér's V is a measure of association between two nominal variables, giving a value between 0 and 1. | True |
True or false. Correspondence analysis (CA) is a multivariate statistical technique. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. | True |
True or false. A specific target is defined when analyzing supervised data. | True |
True or false: Factor loading is the correlation between the original data and the new principal component | True |
Which of the following detects the two-way interactions between variables? A) Decision tree, B) Cluster analysis, C) Text mining | Decision tree |
Which of the following identifies large volumes of data distilled into homogeneous groups? A) Decision tree, B) Neural networks, C) Cluster analysis, D) Text mining | Cluster analysis |
Name the preferred test given the following: normality & homoscedasticity & 2 samples. | Student's t test |
Name the preferred test given the following: normality & homoscedasticity & 3+ samples. | ANOVA |
Name the preferred test given the following: normality & heteroscedasticity & 2 samples. | Welch's t test |
Name the preferred test given the following: non-normality & heteroscedasticity & 2 samples. | Wilcoxon-Mann-Whitney |
Name the preferred test given the following: non-normality & heteroscedasticity & 3+ samples. | Kruskal-Wallis |
True or false. Principal component analysis (PCA) algorithms are ideal when working with continuous data. | True |
True or false. Correspondence analysis (CA) algorithms are ideal when working with qualitative and binary variables. | True |
Name the test: one sample, normal distribution. | One sample t-test |
Name the test: one sample, non-normal distribution. | Wilcoxon signed-rank test |
Name the test: two samples, non-normal distribution, samples are NOT paired. | Mann-Whitney test |
Name the test: three samples, non-normal distribution, samples are NOT paired. | Kruskal-Wallis test |
Name the test: one independent continuous variable, testing for degree of relationship, normal distribution. (Hint: ____________ correlation.) | Pearson's correlation |
True or false. Factor analysis is a statistical method used to describe variability among observed, correlated variables | True |
Parametric statistics are used to make inferences about population ___________ . | parameters |
True or false. In the field of statistics, we study samples because collecting data for an entire population is generally not feasible. | True |
Of the following, which one is a parametric test: A) Mann-Whitney, B) One-way ANOVA, C) Kruskal-Wallis | B) One-way ANOVA |
Name any 3 tests that test for homoscedasticity. | Levene, Bartlett, Fisher |
Which of the following can be used to explore uni-variate data? A) Locating outliers, B) Graphs and tables, C) Statistical summaries | Statistical summaries (e.g. mean, median, mode, etc.) |
An example of a test to use when you have 2 discrete variables is: A) Kruskal-Wallis, B) ANOVA, C) Chi-squared | C) Chi-squared |
Name the best test when conducting a pre-test and a post-test from the same population. A) Chi-squared, B) Two-sample T-test, C) Levene | B) Two-sample T-test (in its paired form, since both measurements come from the same group) |
True or false. A t-test is a statistic that checks if two means are reliably different from each other. | True |
A(n) ___________ describes a characteristic of a population. | parameter |
What is prevalence? | The proportion of a population that has a predefined condition |
The chi-squared test is a _______________ test. A) parametric, B) non-parametric | non-parametric |
The Levene test assesses the _________ of variances for 2 or more groups. | equality |
What is Correspondence Analysis (CA)? | A multivariate statistical technique, similar to Principal Component Analysis, but it applies to categorical data rather than continuous data. |
True or false. Cramér's V is a way of calculating correlation in tables. It is used as a pre-test to determine strengths of association after chi-square has determined significance. | False. It is a post-test |
What is ANOVA? | Analysis of Variance |
What is the paired t-test? | A statistical procedure used to determine whether the mean difference between two sets of observations is zero. |
The Mann-Whitney test is used for __________ (rank) data; it is equivalent to the Independent T-test. | ordinal |
The Wilcoxon signed-rank test is used for ordinal (rank) data; it is equivalent to the __________ T-test. | Paired |
The McNemar test is used for __________ data; it is equivalent to the Paired T-test. | nominal (named) |
True or false. PCA (Principal Component Analysis) is a "dimension increasing" method, converting the correlations among all of the variables into a 2-D graph. | False. It's a dimension *decreasing* method |
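
Several cards above describe Cramér's V as a post-test that measures the strength of association between two nominal variables on a 0 to 1 scale, once chi-square has established significance. The sketch below shows one way to compute it by hand in plain Python. The function names (`chi_squared`, `cramers_v`) and the example counts are made up for illustration, no continuity correction is applied, and the code assumes every row and column of the table has a non-zero total.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_squared(table) / (n * (k - 1)))


# Hypothetical 2x2 tables (illustrative counts only).
perfect_link = [[10, 0], [0, 10]]   # each row category maps to one column category
no_link = [[5, 5], [5, 5]]          # rows and columns are independent

print(cramers_v(perfect_link))  # 1.0
print(cramers_v(no_link))       # 0.0
```

A value near 0 suggests little or no association and a value near 1 a strong one, which matches the 0 to 1 range stated in the cards.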