
# BSTAT 5301

Term | Definition |
---|---|

Data | Facts and figures from which conclusions can be drawn. |

What is Statistics? | Statistics is a way to get information from data. |

Data set | The data that are collected for a particular study |

Elements | The individual entries that make up a data set. Ex: stocks, students, homes for sale, or other entries |

Variable | Any characteristic of an element. Ex: price of a stock, height of a student |

Measurement | A way to assign a value of a variable to an element. |

Quantitative | The possible measurements are numbers that represent quantities. |

Qualitative | The possible measurements are descriptive and not numbers. |

Cross-sectional data | Data collected at the same or approximately the same point in time |

Time series data | Data collected over different time periods |

Population | A set of all elements about which we wish to draw conclusions |

Census | An examination of all elements of a population |

Sample | A subset of the elements of a population |

Descriptive Statistics | The science of describing the important aspects of a set of measurements. DOES NOT allow us to draw any conclusions or make any inferences about the data. |

Inferential Statistics or Statistical Inference | A set of methods used to draw conclusions or inferences about characteristics of populations based on data from a sample. The process of making an estimate, prediction, or decision about a population based on a sample. |

Statistical Inference | The science of drawing conclusions or inferences about a population from a sample |

Bar Chart, Pie Chart, Pareto Chart, Histogram | A form of Descriptive Statistics using Graphical Techniques. Allows statistics practitioners to present data in ways that make it easy for the reader to extract useful information. |

Mean, Median | Popular numerical techniques in descriptive statistics to describe the location of the data. |

Range, Variance, Standard Deviation | Numerical technique in descriptive statistics to measure the variability of the data. |

Business analytics | The use of traditional and newly developed statistical methods, advances in information systems (IS), and techniques from management science to explore and investigate past performance. Types: descriptive analytics, predictive analytics, prescriptive analytics |

Big data | Often needs quick analysis to support business decision making. |

Descriptive modeling | Typically uses data aggregation to provide hindsight and insight into the past, and strives to answer: “What has happened?” Contrast with predictive modeling. |

Descriptive analytics | The use of traditional and newer graphics to present easy-to-understand visual summaries of up-to-the-minute data. Examples: dot plots, time series plots, bar charts, histograms, dashboards, numerical techniques |

Predictive analytics | Methods used to find anomalies, patterns, and associations in data sets in order to predict future outcomes. Examples: linear regression, logistic regression, decision trees, neural networks, cluster analysis, factor analysis, association rules |

Data mining | The use of predictive analytics, algorithms, and IS techniques to extract useful knowledge from huge amounts of data. Examples: K-means algorithm, support vector machines, Bayesian belief networks |

Prescriptive analytics | Looks at variables and constraints, along with predictions from predictive analytics, to recommend courses of action. Examples: optimization sub-routines, linear programming, non-linear programming, dynamic programming, integer programming, simulation |

Nominal | A qualitative variable for which there is no meaningful ordering, or ranking, of the categories. Example: gender, car color. Only limited statistical techniques are applicable. |

Ordinal | A qualitative variable for which there is a meaningful ordering, or ranking, of the categories. Example: teaching effectiveness, choice of preference. Only limited statistical techniques are applicable. |

Interval Variables | Real numbers, e.g., heights, weights, prices. Also referred to as quantitative or numerical data. Arithmetic operations can be performed on interval data, so it is meaningful to talk about 2 * Height, or Price + $1, and so on. |

Qualitative Variables | Nominal and Ordinal. The possible measurements are descriptive and not numbers. |

Graphical Descriptive Technique for Nominal/Ordinal (Qualitative) Data | Frequency, Relative Frequency, Percentage (%) Frequency, Cumulative Relative Frequency (Ogive), Bar Chart, Pie-Chart, Pareto Analysis, Contingency Table |

Graphical Descriptive Technique for Interval (Quantitative) Data | Frequency Table, Histogram, Ogive, Dot Plot, Stem-and-Leaf Plot, Scatterplot |

frequency distribution | We can summarize the data in a table that presents the categories and their counts called a frequency distribution. |

relative frequency distribution | Lists the categories and the proportion with which each occurs. |

Frequency | The number of items in each ‘class’ in the data |

Relative frequency | Summarizes the proportion of items in each class |
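
The frequency and relative frequency definitions above can be illustrated with a minimal Python sketch. The function names and the `colors` sample are hypothetical and chosen only for illustration; they do not come from the course material.

```python
from collections import Counter

def frequency_distribution(observations):
    """Count how many observations fall in each category."""
    return dict(Counter(observations))

def relative_frequency_distribution(observations):
    """Proportion of observations in each category (count / total)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Hypothetical sample: preferred car color of six survey respondents.
colors = ["red", "blue", "red", "white", "red", "blue"]
print(frequency_distribution(colors))
print(relative_frequency_distribution(colors))
```

Multiplying each relative frequency by 100 gives the percent frequency used on bar and pie charts.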

Bar chart | A vertical or horizontal rectangle represents the frequency for each category. Height can be frequency, relative frequency, or percent frequency. |

Pie chart | A circle divided into slices where the size of each slice represents its relative frequency or percent frequency |

Pareto principle | In many economies, most of the wealth is held by a small minority of the population (the 80%-20% principle). Application: a few classes of defects account for most quality problems in manufacturing. |

Development of Pareto Chart | Develop a bar chart representing the frequency of occurrence. Bars are arranged in decreasing height from left to right. The chart is augmented by plotting a cumulative percentage point for each bar (the Pareto line). |
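
The Pareto chart steps can be sketched as a small table builder: sort categories by decreasing frequency, then keep a running cumulative percentage for the Pareto line. This is only a sketch; `pareto_table` and the defect labels are hypothetical names used for illustration, and no plotting is attempted.

```python
from collections import Counter

def pareto_table(defects):
    """Return (category, frequency, cumulative percent) rows, largest bar first."""
    counts = Counter(defects)
    total = sum(counts.values())
    rows = []
    running = 0
    # most_common() yields categories in decreasing order of frequency,
    # which is the left-to-right bar order of a Pareto chart.
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, 100 * running / total))
    return rows

# Hypothetical inspection log of ten defects.
defects = ["scratch"] * 6 + ["dent"] * 3 + ["chip"] * 1
for row in pareto_table(defects):
    print(row)
```

The last cumulative percentage is always 100, since the Pareto line accounts for every observation.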

Cross Classification Table | To describe the relationship between two nominal variables, we must remember that we are permitted only to determine the frequency of the values. As a first step, a cross-classification table lists the frequency of each combination of values of the two variables. |

Contingency Tables | Classify data on two dimensions. Rows classify according to one dimension; columns classify according to a second dimension. |

Frequency Distribution | A frequency distribution is a list of data classes with the count of values that belong to each class. The frequency distribution is a table. |

Histogram | The histogram is a picture of the frequency distribution |

K | K is the number of classes. K = 1 + 3.3 log10(n), where n is the number of elements in the sample. |

n | n is the number of elements within the sample. |

N | N is the number of elements in the entire population. |

Length or Width of a class | (Max - Min) / K |
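
As a rough illustration, the two formulas above (number of classes and class width) can be computed as follows. Rounding K up to a whole number is an assumption made here, since a histogram needs a whole number of classes; the function names are hypothetical.

```python
import math

def number_of_classes(n):
    """K = 1 + 3.3 * log10(n), rounded up (the rounding is an assumption)."""
    return math.ceil(1 + 3.3 * math.log10(n))

def class_width(values, k):
    """Class length: (Max - Min) / K."""
    return (max(values) - min(values)) / k

# A sample of n = 100 measurements gives 1 + 3.3 * 2 = 7.6, so 8 classes.
k = number_of_classes(100)
print(k)

# Hypothetical measurements ranging from 10 to 50, split into k classes.
print(class_width([10, 20, 50], k))
```

In practice the width is often rounded up to a convenient value so that class boundaries are easy to read.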

Skewed to the right | The right tail of the histogram is longer than the left tail |

Skewed to the left | The left tail of the histogram is longer than the right tail |

Symmetrical | The right and left tails of the histogram appear to be mirror images of each other |

Cumulative Distributions | To construct one, use the same number of classes, class lengths, and class boundaries used for the frequency distribution. Rather than a count for each class, record the number of measurements that are less than the upper boundary of that class (a running total). |

Ogive | A graph of a cumulative distribution |

Frequency Polygons | Plot a point above each class midpoint at a height equal to the frequency of the class. Useful when comparing two or more distributions. |

Dot Plots | Dots placed on a real number line, one per measurement, to quickly show the data. Useful for detecting potential outliers. |

Stem-and-Leaf Displays | Purpose is to see the overall pattern of the data by grouping the data into classes: the variation from class to class, the amount of data in each class, and the distribution of the data within each class. Best for small to moderately sized data sets. |

Scatter Plots | Used to study relationships between two variables. Each data point has two dimensions: place one variable on the x-axis and the second variable on the y-axis, then place a dot at each pair of coordinates. |

Linear | A straight line relationship between the two variables |

Linear Positive | When one variable goes up, the other variable goes up |

Linear Negative | When one variable goes up, the other variable goes down |

No linear relationship | There is no coordinated linear movement between the two variables |

Data Warehouses | Data warehousing is a process for centralized data management and retrieval; its ideal objective is the creation and maintenance of a central repository for all of an organization's data. |

Response variable vs factors | When initiating a study, we first define our variable of interest, or response variable. Other variables, typically called factors, may be related to the response variable. |

Experimental Study | Means we are able to set or manipulate the values of the factors. |

Observational Study | Means we are not able to control the factors. |

Sample with or without replacement | When sampling with replacement, the selection is placed back into the population and can potentially be selected again. Sampling without replacement allows each element to be chosen only once, because it is not placed back into the population. |
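
The difference between the two sampling schemes can be sketched with Python's standard `random` module. The wrapper names, the `seed` parameter, and the `students` list are hypothetical and exist only to make the illustration repeatable.

```python
import random

def sample_with_replacement(population, n, seed=None):
    """Each draw is placed back, so the same element may be selected again."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)

def sample_without_replacement(population, n, seed=None):
    """Each element can be selected at most once; n cannot exceed the population size."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical finite population of ten student ID numbers.
students = list(range(1, 11))
print(sample_with_replacement(students, 5, seed=1))
print(sample_without_replacement(students, 5, seed=1))
```

Note that sampling with replacement can draw more items than the population contains, while sampling without replacement cannot.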

Finite population vs infinite population | A finite population has a fixed number of elements; no more can be added (ex. the number of cars produced in a specific year). An infinite population can potentially always have one more element added (ex. all cars of a model that could be produced, because in theory one more car could always be produced). |

Statistical Model | A Statistical model is a set of assumptions about how sample data are selected and about the population from which the sample data are selected. The assumptions concerning the sampled populations often specify the probability distributions. |

Probability Distribution | A theoretical equation, graph, or curve that can be used to find the probability or likelihood that a measurement or observation randomly selected from a population will equal a value or fall into a range of values. |

Anomaly or Outlier | A value or measure that is atypical or situated away from the general group or cluster of other values. |

Measures of Central Tendency | Mean, Median, Mode |

Measures of Variability | Range, Standard Deviation, Variance, Coefficient of Variation |

Measures of Relative Standing | Percentiles, Quartiles, Deciles |

Measures of Linear Relationship | (2 variables) Covariance, Correlation Coefficient, Coefficient of Determination, Least-Square Line |

Ogive | A graph of a cumulative distribution. Plot a point above each upper class boundary at a height equal to the cumulative frequency, then connect the points with line segments. |

Mean | The average or expected value: the sum of the observations divided by the number of observations |

Median | The value of the middle point of the ordered measurements. One advantage the median holds is that it is not as sensitive to extreme values as the mean is. |

Mode | The most frequent value |

Greek letter “mu” | Arithmetic mean for a population |

"x-bar" | Arithmetic mean for a sample |

MMM for Symmetrical Curve | The mode, mean, and median are all at the same point, which is the highest peak. |

MMM for Skewed to the right | Order goes towards the tail, so from left to right: mode, median, mean. Mode is the highest peak, median ("which is the longest word of the three") is in the middle, then mean, which is pulled farthest toward the right tail. |

MMM for Skewed to the left | Order goes towards the tail, so from right to left: mode, median, mean. Mode is the highest peak, median ("which is the longest word of the three") is in the middle, then mean, which is pulled farthest toward the left tail. |

M little o subscript | Mode |

M little d subscript | Median |

Range | Largest observation (minus) smallest observation |

Variance | Variance and its related measure, standard deviation, are arguably the most important statistics. Used to measure variability, they also play a vital role in almost all statistical inference procedures. |

Population variance symbol | Lowercase Greek letter “sigma” squared (σ²). Looks like a cursive o squared |

Sample variance symbol | Lower case “s” squared |

Population Standard Deviation | Square root of population variance. Its symbol, lowercase Greek “sigma” (σ), looks like a lowercase "o" |

Sample Standard Deviation | Square root of sample variance. Lowercase "s" |
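
The variance and standard deviation entries above can be made concrete with a short sketch. It assumes the usual textbook formulas, which the flashcards do not spell out: the population variance divides the sum of squared deviations by N, and the sample variance divides by n - 1. Function names and the data are hypothetical.

```python
import math

def population_variance(values):
    """sigma squared: squared deviations from the mean mu, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """s squared: squared deviations from x-bar, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical measurements with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data), math.sqrt(population_variance(data)))
print(sample_variance(data), math.sqrt(sample_variance(data)))
```

Treating the same eight numbers as a sample rather than a population gives a slightly larger variance, because the divisor is 7 instead of 8.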

68% | For a bell-shaped (approximately normal) distribution, approximately 68% of all observations fall within one standard deviation of the mean. |

95% | For a bell-shaped (approximately normal) distribution, approximately 95% of all observations fall within two standard deviations of the mean. |

99.7% | For a bell-shaped (approximately normal) distribution, approximately 99.7% of all observations fall within three standard deviations of the mean. |
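
The three percentages above can be checked against a normal distribution. The sketch below assumes the data follow a normal curve exactly and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

def share_within(k, mean=0.0, std_dev=1.0):
    """Proportion of a normal population within k standard deviations of the mean."""
    dist = NormalDist(mean, std_dev)
    return dist.cdf(mean + k * std_dev) - dist.cdf(mean - k * std_dev)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The printed proportions are close to 0.68, 0.95, and 0.997, and they do not depend on the particular mean or standard deviation chosen.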

Percentile | The pth percentile of a data set is a value such that p% of the data have values less than the percentile and (100 - p)% have values above it. The 0th percentile is the minimum; the 100th percentile is the maximum. |

Quartile | Divides the data by 25%. Q1, the first quartile, is the 25th percentile (25% of the data are less than or equal to it). Q2, the second quartile, is the 50th percentile (the median). Q3, the third quartile, is the 75th percentile. Q4, the fourth quartile, is the 100th percentile. |

Decile | Divides the data by 10%. The 1st decile is the 10th percentile (lower decile). The 2nd decile is the 20th percentile. The 5th decile is the 50th percentile (the median). The 9th decile is the 90th percentile (upper decile). |

Quintile | Divides the data by 20%. The 1st quintile is the 20th percentile, the 2nd quintile is the 40th percentile, the 3rd quintile is the 60th percentile, the 4th quintile is the 80th percentile, and the 5th quintile is the 100th percentile. |

5 number summary | Min, Q1, Q2 (median), Q3, Max |

Interquartile range (IQR) | Q3 - Q1. The spread of the middle 50% of the data; a measure of variability that is less sensitive to extreme values than the standard deviation. |

Box-whisker plot | A plot of the 5 number summary. Upper limit = Q3 + 1.5 * IQR. Lower limit = Q1 - 1.5 * IQR. Inner fence (left whisker): the smallest value in the data set that is at or above the lower limit. Outer fence (right whisker): the largest value in the data set that is at or below the upper limit. |

Inner fence | (Left whisker) The smallest value in the data set that is at or above the lower limit |

Outer fence | (Right whisker) The largest value in the data set that is at or below the upper limit |

Outliers | Data values that are lower than the inner fence or greater than the outer fence. |
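
The box-whisker definitions above can be sketched in a few lines. Textbooks and software compute quartiles in slightly different ways; this sketch assumes the default method of `statistics.quantiles` from the Python standard library, so Q1 and Q3 may differ a little from hand calculations. The function name, dictionary keys, and data are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, limits, whisker ends, and outliers for a box-whisker plot."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile convention is an assumption
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values still inside the limits.
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "left_whisker": inside[0], "right_whisker": inside[-1],
        "outliers": outliers,
    }

# Hypothetical data set with one value far from the rest.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50]
print(box_plot_summary(scores))
```

Here the value 50 lies above the upper limit, so it is reported as an outlier and the right whisker stops at 11.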

Strata | The sub-populations in a stratified sampling design |

Random Sample Population | A sample selected in such a way that every set of n elements in the population has the same chance of being selected. |