# QM Exam 1
| Term | Definition |
|---|---|
| data warehouses | vast digital repositories that record and store data electronically |
| Big Data | describes data sets so large that traditional methods of storage and analysis are inadequate |
| transactional data | data collected for recording a company's transactions |
| data mining or predictive analytics | the process of using data, especially transactional data, to make decisions and predictions |
| business analytics | describes any use of data and statistical analysis to drive business decisions from data whether the purpose is predictive or simply descriptive |
| data | numerical, alphabetic, or alphanumerical; useless unless we know what it represents |
| context | answering the questions who, what, when, where, why, and how can make data values meaningful |
| data table | clearly shows whom the data are about and what was measured |
| cases | rows of a data table correspond to individual __________ |
| variables | the recorded characteristics of the cases; the columns of a data table |
| respondents | individuals who answer a survey |
| subjects/participants | people on whom we experiment |
| experimental units | animals, plants, websites, and other inanimate subjects |
| records | rows in a database |
| metadata | typically contains information about how, when, and where (and possibly why) the data were collected; who each case represents; and the definitions of all variables |
| spreadsheet | a name that comes from bookkeeping ledgers of financial information |
| relational database | two or more separate data tables are linked together so that information can be merged across them |
| categorical/qualitative variable | when the values of a variable are simply the names of categories |
| quantitative variable | when the values of a variable are measured numerical quantities |
| identifier variables | categorical variables whose only purpose is to assign a unique identifier code to each individual in the data set |
| ordinal | the variable is ______________ when the values of a categorical variable have an intrinsic order |
| nominal | categorical variable with unordered categories |
| cross-sectional data | several variables are measured at the same time point |
| frequency table | records the counts for each of the categories of the variable |
| area principle | says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents |
| bar chart | displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison |
| relative frequency bar chart | replaces the counts with percentages in order to draw attention to the relative proportion of cases in each category |
| pie chart | shows how a whole group breaks into several categories |
| contingency tables | they show how individuals are distributed along each variable depending on, or contingent on, the value of the other variable |
| marginal distribution | when presented like this, at the margins of a contingency table, the frequency distribution of either one of the variables is called __________ |
| cell | any intersection of a row and column of the table; gives the count for a combination of values of the two variables |
| total percent, row percent, or column percent | the three percentage choices most statistics programs offer for contingency tables |
| conditional distribution | shows the distribution of one variable for just those cases that satisfy a condition on another |
| independent | in a contingency table, when the distribution of one variable is the same for all categories of another variable, we say that the two variables are ________ |
| segmented (or stacked) bar chart | treats each bar as the "whole" and divides it proportionally into segments corresponding to the percentage in each group |
| mosaic plot | looks like a segmented bar chart, but obeys the area principle better by making the bars proportional to the sizes of the groups |
| Simpson's Paradox | a reversal or change in a relationship that can occur when data from several groups are combined; the moral: only combine compatible measurements for comparable individuals |
| bins | intervals that slice up all the values of a quantitative variable; their counts give the distribution and provide the building blocks for the display called a histogram |
| histogram | plots the bin counts as the heights of bars |
| gaps | indicate a region where there are no values |
| relative frequency histogram | alternative is to report the percentage of cases in each bin |
| stem-and-leaf displays | like histograms, but they also show the individual values |
| quantitative data condition | the data must be values of a quantitative variable whose units are known |
| shape, center, and spread | when you describe a distribution, you should pay attention to these three things |
| shape | we describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values |
| modes | humps of a histogram |
| unimodal | a distribution whose histogram has one main hump |
| bimodal | distributions whose histograms have two humps |
| multimodal | histograms with three or more humps |
| uniform | a distribution whose histogram doesn't appear to have any mode and in which all the bars are approximately the same height |
| symmetric | the halves of a distribution on either side of the center look, at least approximately, like mirror images |
| tails | the (usually) thinner ends of a distribution |
| skewed | if one tail stretches out farther than the other, the distribution is said to be ________ to the side of the longer tail |
| outliers | any stragglers that stand off away from the body of the distribution |
| mean (average) | add up all the values of the variable, x, and divide that sum by the number of data values |
| median | the value that splits the histogram into two equal areas |
| range | the difference between the extremes: max-min |
| lower quartile (Q1) | value for which one quarter of the data lie below it |
| upper quartile (Q3) | value for which one quarter of the data lie above it |
| interquartile range (IQR) | summarizes the spread by focusing on the middle half of the data; it's defined as the difference between the two quartiles: Q3-Q1 |
| variance | the average of the squared deviations |
| standard deviation | we want measures of spread to have the same units as the data, so we usually take the square root of the variance, giving the __________ |
| standardized value | a value found by subtracting the mean and dividing by the standard deviation |
| z-score | tells us how many standard deviations a value is from its mean |
| five-number summary | reports a distribution's median, quartiles, and extremes (max and min) |
| boxplot | displays the information from a five-number summary |
| stationary | when a time series has no strong trend or change in variability |
| time series plot | a display of values against time |
| re-express/transform | one way to make a skewed distribution more symmetric is to ___________ the data by applying a simple function to all the data values |
| scatterplot | plots one quantitative variable against another |
| direction | pattern that can either be negative, positive, or neither |
| form | straight, curved, exotic, no pattern? |
| straight line relationship/linear form | will appear as a cloud or swarm of points stretched out in a generally consistent, straight form |
| strength | tightly clustered in a single stream or so variable and spread out that we can barely discern a trend or pattern? |
| explanatory or predictor variable | variable on the x-axis |
| response variable | variable on the y-axis |
| independent and dependent variables | the idea is that the y-variable depends on the x-variable and the x-variable acts independently to make y respond |
| correlation coefficient | a numerical measure of the direction and strength of a linear association |
| correlation | measures the strength of the linear association between two quantitative variables |
| quantitative variables condition | correlation applies only to quantitative variables |
| linearity condition | correlation measures the strength only of the linear association and will be misleading if the relationship is not straight enough |
| outlier condition | unusual observations can distort the correlation and can make an otherwise small correlation look big or, on the other hand, hide a large correlation |
| lurking variable | some third variable that affects both of the variables you have observed |
| linear model | just an equation of a straight line through the data |
| predicted value | the prediction for y found for each x-value in the data; found by substituting the x-value in the regression equation; values on the fitted line |
| residual | the difference between the observed value and the predicted value |
| line of best fit/least squares line | the line for which the sum of the squared residuals is smallest |
| slope | b1 is given in y-units per x-unit. differences of one unit in x are associated with differences of b1 units in predicted values of y |
| intercept | the value of the line when the x-variable is zero |
| regression lines | common name for least squares lines |
| regression to the mean | because the correlation is always less than 1.0 in magnitude, each predicted y tends to be fewer standard deviations from its mean than its corresponding x is from its mean |
| quantitative data condition | pretty easy to check, but don't be fooled by categorical data recorded as numbers |
| linearity assumption | the regression model assumes that the relationship between the variables is, in fact, linear |
| linearity condition | the two variables must have a linear association, or the model won't mean a thing and decisions you base on the model may be wrong |
| outlier condition | make sure that no points need special attention |
| independence assumption | assumption that the residuals are independent of each other |
| equal spread condition | the condition, arising from the assumption that the standard deviation of the residuals is the same everywhere along the line, that the scatter of the data around the line is roughly uniform |
| R-squared | the fraction of the variation in y accounted for by the model; all regression analyses include this statistic, written by tradition with a capital letter and often given as a percentage |
| Spearman rank correlation | works with the ranks of the data rather than their values |
| random phenomena | we can't predict the individual outcomes, but we can hope to understand characteristics of their long-run behavior |
| trial | each attempt of a random phenomenon |
| outcome | the value generated by each trial of a random phenomenon |
| event | more general term to refer to outcomes or combinations of outcomes |
| sample space | a special event; the collection of all possible outcomes |
| probability | the long-run relative frequency of an event's occurrence |
| independence | the outcome of one trial doesn't influence or change the outcome of another |
| Law of Large Numbers (LLN) | states that if the events are independent, then as the number of trials increases, the long-run relative frequency of any outcome gets closer and closer to a single value |
| empirical probability | because it is based on repeatedly observing the event's outcome, this definition of probability is often called ____________ |
| theoretical probability | when we have equally likely outcomes |
| personal probability | we call this kind of probability subjective |
| probability | a number between 0 and 1 |
| probability assignment rule | the probability of the set of all possible outcomes must be 1. P(S) =1 |
| complement rule | the probability of an event occurring is 1 minus the probability that it doesn't occur: P(A) = 1 - P(A^c) |
| multiplication rule | to find the probability that two independent events occur, we multiply the probabilities; P(A and B)=P(A) x P(B), provided that A and B are independent |
| disjoint or mutually exclusive | two events are _________ if they have no outcome in common |
| addition rule | allows us to add the probabilities of disjoint events to get the probability that either event occurs: P(A or B) = P(A) + P(B), provided that A and B are disjoint |
| general addition rule | does not require disjoint events: P(A or B) = P(A) + P(B) - P(A and B) for any two events A and B |
| marginal probability | uses a marginal frequency (from either the total row or total column) to compute the probability |
| joint probabilities | probability that two events occur together |
| conditional probability | a probability that takes into account a given condition |
| general multiplication rule | for compound events that does not require the events to be independent: P(A and B)=P(A) x P(B|A) for any two events A and B |
| independent | events A and B are __________ whenever P(B|A)=P(B) |
| tree diagram | probability tree used to help think through the decision-making process |
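The center-and-spread terms above (mean, median, quartiles, IQR, variance, standard deviation, z-score, five-number summary) can be sketched with Python's standard library. The data values here are made up purely for illustration:

```python
import statistics

# hypothetical sample data (not from the cards), chosen to illustrate the terms
data = [4, 7, 7, 8, 10, 12, 15, 18, 21, 28]

mean = statistics.mean(data)                  # sum of all values / number of values
median = statistics.median(data)              # splits the histogram into two equal areas
q1, q2, q3 = statistics.quantiles(data, n=4)  # lower quartile, median, upper quartile
iqr = q3 - q1                                 # spread of the middle half of the data
var = statistics.variance(data)               # average of the squared deviations
sd = statistics.stdev(data)                   # square root of the variance, in data units
z = (data[0] - mean) / sd                     # z-score: standard deviations from the mean

# five-number summary: min, Q1, median, Q3, max
five_number = (min(data), q1, median, q3, max(data))
```

Note that `statistics.quantiles` has several quartile-computing methods; textbooks and software packages often disagree slightly on quartile values for small samples.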
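The correlation and regression terms (correlation coefficient, least squares line, slope, intercept, predicted values, residuals, R-squared) can likewise be illustrated. The (x, y) pairs below are hypothetical, and the slope uses the standard identity b1 = r * s_y / s_x:

```python
import statistics

# hypothetical (x, y) pairs, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# correlation coefficient r: direction and strength of the linear association
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# least squares slope and intercept: b1 = r * s_y / s_x, b0 = y_bar - b1 * x_bar
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * xi for xi in x]                     # values on the fitted line
residuals = [yi - yh for yi, yh in zip(y, predicted)]      # observed - predicted

# R-squared: fraction of y's variation accounted for by the linear model
ss_res = sum(e**2 for e in residuals)
ss_tot = sum((yi - y_bar)**2 for yi in y)
r_squared = 1 - ss_res / ss_tot
```

For simple (one-predictor) regression, `r_squared` equals the square of the correlation coefficient, which is why it carries the name R-squared.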
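The probability rules (marginal, joint, and conditional probability, the complement rule, the general addition rule, and the independence check) can be verified against a small contingency table. The counts below are invented for illustration:

```python
# hypothetical contingency table counts (invented for illustration):
# rows = customer type, columns = response to an offer
table = {
    ("new", "responded"): 20, ("new", "no response"): 80,
    ("existing", "responded"): 60, ("existing", "no response"): 140,
}
total = sum(table.values())

def p(event):
    """Empirical probability: counts of cells matching the event / total count."""
    return sum(count for cell, count in table.items() if event(cell)) / total

def is_new(cell):
    return cell[0] == "new"

def responded(cell):
    return cell[1] == "responded"

p_resp = p(responded)                               # marginal probability
p_joint = p(lambda c: is_new(c) and responded(c))   # joint probability
p_resp_given_new = p_joint / p(is_new)              # conditional: P(B|A) = P(A and B) / P(A)

# general addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p(is_new) + p_resp - p_joint
assert abs(p_either - p(lambda c: is_new(c) or responded(c))) < 1e-12

# complement rule: P(A) = 1 - P(A^c)
assert abs(p(lambda c: not responded(c)) - (1 - p_resp)) < 1e-12

# independence check: A and B are independent whenever P(B|A) = P(B);
# here P(responded | new) differs from P(responded), so they are not independent
not_independent = abs(p_resp_given_new - p_resp) > 1e-12
```

Rearranging the conditional-probability line gives the general multiplication rule from the cards: P(A and B) = P(A) * P(B|A).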