STATISTICS REVIEW
| Term | Definition |
|---|---|
| Statistics Etymology | The word statistics is derived from the Latin word status, meaning "state". |
| Statistics | is a Science that deals with the collection, presentation, analysis and interpretation of data. |
| Collection | refers to the gathering of information or data. |
| Organization or Presentation | involves summarizing data in textual, graphical or tabular form. |
| Analysis | involves describing the data by statistical methods or procedures. |
| Interpretation | refers to the process of making conclusions based on the analyzed data |
| variable | is a characteristic or attribute that can assume different values. |
| Data | are the values (measurements or observations) that the variables can assume. |
| random variables | Variables whose values are determined by chance |
| data set | A collection of data values |
| data value or datum | Each value in the data set |
| Descriptive statistics | collection, organization, summarization, and presentation of data. |
| descriptive statistics | the statistician tries to describe a situation. |
| Inferential statistics | generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions. |
| Inferential statistics | the statistician tries to make inferences from samples to populations |
| Inferential statistics | uses probability |
| population | consists of all subjects (human or otherwise) that are being studied. |
| sample | is a group of subjects selected from a population. |
| parameter | is a numerical summary or any measurement coming from a population. |
| statistic | is a measurement from a sample. |
| Qualitative variables | are variables that can be placed into distinct categories, according to some characteristic or attribute. |
| Quantitative Variables | are numerical in nature. |
| Quantitative Variables | These are obtained from counting (discrete) or measuring (continuous). |
| Quantitative Variables | meaningful arithmetic operations can be done with these kinds of data. |
| Dependent Variable | a variable, which is affected or influenced by another variable. |
| Independent Variable | one that affects, or influences another variable |
| nominal scale | the categories of a qualitative variable are unordered |
| nominal scale | is used when we want to distinguish one object from another for identification purposes. We can only say that one object is different from another, but the amount of difference cannot be determined. |
| ordinal scale | the categories of a qualitative variable can be put in order |
| ordinal scale | data are arranged in some specific order or rank. When objects are measured in this level, we can say that one is greater than the other, but we cannot tell how much more one has than the other. |
| interval scale | one can compare the differences between measurements of the quantitative variable meaningfully, but not the ratio of the measurements |
| interval scale | Data measured at this level tell us not only that one value is greater or less than another, but also the amount of difference between them. |
| ratio scale | one can compare both the differences between measurements of the quantitative variable and the ratio of the measurements meaningfully |
| ratio scale | the level always starts from the absolute or true zero point. |
| Data | consists of information coming from observations, counts, measurements, or responses |
| Primary | Collected specifically for the analysis desired. Most common type is doing a survey |
| Secondary | have already been collected/compiled and are available for statistical analysis |
| Survey Study | systematic method for gathering information. It is an investigation of one or more characteristics of a population |
| The Direct or Interview Method | In this method, the researcher has a direct contact with the interviewee. |
| The Direct or Interview Method | The researcher obtains the information needed by asking questions and inquiries from the interviewee |
| The Direct or Interview Method | This method gives precise and consistent information because clarifications can be made. |
| The Direct or Interview Method | this method is time consuming, expensive, and has limited field coverage. |
| The Indirect or Questionnaire Method | This method makes use of a written questionnaire. |
| The Indirect or Questionnaire Method | The researcher distributes the questionnaire to the respondents either by personal delivery or by mail. |
| The Indirect or Questionnaire Method | Using this method, the researcher can save a lot of time and money in gathering the information needed because questionnaires can be given to a large number of respondents at the same time. |
| The Indirect or Questionnaire Method | the researcher cannot expect that all distributed questionnaires will be retrieved because some respondents simply ignore the questionnaires. |
| The Indirect or Questionnaire Method | In addition, clarification cannot be made if the respondent does not understand the question. |
| The Registration Method | This method of collecting data is governed by laws. |
| Retrospective Study | Uses either all or a sample of previously recorded data, also called historical data. |
| Retrospective Study | Quickest and easiest way to collect process data. |
| Retrospective Study | Provides limited information. |
| a type of primary data | Internal data recorded by a company, such as sales and transactions |
| Primary data | is data that is collected directly from the source for a specific purpose or research question, and it has not been previously collected or analyzed by others. |
| Observational Study | Simply observes the process or population during a period of routine operation |
| Observational Study | Researcher interacts/disturbs the process only as much as is required to obtain data on the system. |
| Observational Study | May give valuable info but usually limited because you just altered a part of the system |
| The Experimental Design | This method is usually used to find out cause and effect relationships. |
| The Experimental Design | Scientific researchers often use this method. |
| The Experimental Design | We can establish cause-and-effect relationship unlike retrospective and observational studies where we are just informed about any interesting phenomena. |
| Simulation Study | Cost-effective, time efficient, safe-testing, increased understanding, optimization |
| Simulation data gathering | refers to the process of collecting data from a simulation, which is a computer model of a system that mimics its behavior. |
| Simulation | is a powerful technique and can be used to model many different types of systems. |
| Requirements of a Good Sample | a "scaled-down" version of the population, mirroring every characteristic of the whole population. |
| Observation Unit/element | basic unit of observation, an object on which a measurement is taken. |
| Target Population | the complete collection of observations we want to study. |
| Sample | subset of a population. |
| Sampled Population | the collection of all possible observation units that might have been chosen in a sample. |
| Sampling unit | a unit that can be selected for a sample. |
| Sampling frame | A list, map, or other specification of sampling units in the population from which a sample may be selected. |
| Selection Bias | It occurs when some part of the target population is not in the sampled population, or, more generally, when some population units are sampled at a different rate than intended by the investigator |
| A good sample | will be as free from selection bias as possible and will have accurate responses to the items of interest |
| Measurement Error | When a response in the survey differs from the true value |
| Sampling error | the error that results from taking one sample instead of examining the whole population. |
| Non-sampling error | selection bias and measurement error are types of non-sampling error. |
| Non-sampling error | These are the errors that cannot be attributed to the sample-to-sample variability. |
| Simple probability samples | each unit in the population has a known probability of selection |
| Simple probability samples | random number table or other randomization mechanism is used to choose the specific units to be included in the sample. |
| Simple probability samples | Investigator can use a relatively small sample to make inferences about an arbitrarily large population |
| Simple random sampling | The most basic form of probability sampling and provides theoretical basis for the more complicated forms. |
| Simple Random Sample with Replacement (SRSWR) | One unit is randomly selected from the population to be the first sampled unit, with probability 1/N. The unit is then returned to the population before the next draw, so the sample might include duplicates. |
| Simple Random Sample without Replacement (SRSWOR) | is generally preferred over Simple Random Sample with Replacement (SRSWR) |
| Simple Random Sample without Replacement (SRSWOR) | This sample is selected so that every possible subset of n distinct units in the population has the same probability of being selected as the sample. |
| Systematic sampling | It is used as a proxy for simple random sampling when no list of the population is available. |
| Systematic sampling | Selection of individuals is based on pre-determined interval (k) or sampling interval and we choose a random starting point. |
| Stratified random sampling | To stratify a population means to classify or to separate people into groups according to some of their characteristics, such as rank, income, education, sex, or ethnic background. |
| strata | It partitions population into subclasses with notable distinctions |
| Cluster sampling | is similar to stratified random sampling in that the total population is divided into groups (clusters), but here a simple random sample of clusters is selected and the units within the chosen clusters are observed. |
| Cluster | is usually based on geographic area. |
| Non-probability sampling | is a method of selecting sampling units from a target population using a subjective or non-random method. |
| Convenience sampling | The sample is selected based on accessibility or convenience. |
| Convenience sampling | It is the least effective of the non-probability sampling methods but if there are logistics and time constraints, it may be the only option. |
| Purposive Sampling | A method for obtaining sample units where researchers use their expertise to choose qualified participants to take the survey that will help the research study meet its goals. |
| Purposive Sampling | The researchers pick these participants purposively. |
| Quota sampling | The sample is selected based on certain quotas or predetermined criteria, such as age, educational attainment, gender or income level. |
| Quota sampling | is one of the most preferred methods of non-probability sampling because it forces the inclusion of members of different subpopulations. |
| Snowball sampling | The sample is selected based on referrals from other members of the population. |
| Snowball sampling | This type of sampling is used if the population of interest is hard to find, such as people with disabilities or certain diseases, drug users, or victims of a specific crime. |
| raw data | Information obtained by observing values of a variable |
| qualitative data | Data obtained by observing values of a qualitative variable |
| quantitative data | Data obtained by observing values of a quantitative variable |
| discrete data | Quantitative data obtained from a discrete variable |
| continuous data | quantitative data obtained from a continuous variable |
| Ungrouped data | are data that are not organized, or if arranged, could only be from highest to lowest or lowest to highest |
| Grouped data | are data that are organized and arranged into different classes or categories. |
| Tabular method | By organizing the data in tables, important features about the data can be readily understood and comparisons are easily made. |
| Table Heading | consists of the table number and the title |
| Column Header | It describes the data in each column. |
| Row Classifier | It shows the classes or categories. |
| Body | This is the main part of the table. |
| Source Note | This is placed below the table when the data written are not original. |
| Frequency Distribution Table | The most commonly used method in presenting data by tabular method |
| frequency distribution | is the organization of raw data in table form, using classes and frequencies |
| Frequency Distribution Table (FDT) | is a statistical table showing the frequency or number of observations contained in each of the defined classes or categories. |
| frequency distribution for qualitative data | lists all categories and the number of elements that belong to each of the categories |
| relative frequency | is obtained by dividing the frequency ($f$) for a category by the sum of all the frequencies ($n$) |
| relative frequency | They are commonly expressed as percentages |
| Class Limits | endpoints of a class interval |
| Upper Class Limit | represents the largest data value that can be included in the class. |
| Lower Class Limit | represents the smallest data value that can be included in the class. |
| Class boundaries | used to separate the classes so that there are no gaps in the frequency distribution |
| Lower boundary | Lower limit − 0.5 |
| Upper boundary | Upper limit + 0.5 |
| Class width (i) | the difference between the boundaries for any class, i.e., i = upper boundary − lower boundary, or i = (upper limit − lower limit) + 1 |
| Class mark | the midpoint of the class |
| less than cumulative frequency (<cf) | total number of observations less than the upper boundary of a class interval |
| greater than cumulative frequency (>cf) | total number of observations greater than the lower boundary of a class interval |
| Graphical Method | The purpose of graphs in statistics is to convey the data to the viewers in pictorial form. |
| Graphical Method | It is easier for most people to comprehend the meaning of data presented graphically than data presented numerically in tables or frequency distributions. |
| Bar Graph | is a graph composed of bars whose heights are the frequencies of the different categories. |
| Bar Graph | displays graphically the same information concerning qualitative data that a frequency distribution shows in tabular form. |
| Pie Chart | is also used to graphically display qualitative data |
| Histogram | is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes. |
| Frequency Polygon | is a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. |
| Ogive | is a graph that represents the cumulative frequencies for the classes in a frequency distribution. |
| Ogive | is a graph in which a point is plotted above each class boundary at a height equal to the cumulative frequency corresponding to that boundary. |
| measure of central tendency | gives a single value that acts as a representative or average of the values of all the outcomes of your data set. |
| central tendency | is a statistical measure that determines a single value that accurately describes the center of the distribution and represents the entire distribution of scores or measures. |
| goal of central tendency | is to identify the single value that is the best representative for the entire set of data. |
| measure of central tendency | Any measure indicating the center of a set of data, arranged in an increasing or decreasing order of magnitude |
| mean, median, and mode | most commonly used measures |
| Mean | is the most commonly used measure of central tendency. |
| Computation of the mean | requires scores that are numerical values measured on an interval or ratio scale. |
| Mean | is obtained by computing the sum, or total, for the entire set of scores(data), then dividing this sum by the number of scores. |
| Median | is defined as the midpoint of the list. |
| Median | divides the scores so that 50% of the scores in the distribution have values that are equal to or less than the median |
| Median | Computation of the median requires scores that can be placed in rank order (smallest to largest) and are measured on an ordinal, interval, or ratio scale |
| Mode | is defined as the most frequently occurring category or score in the distribution. |
| Mode | is the category or score corresponding to the peak or high point of the distribution. |
| Mode | is not unique. A data set can have more than one mode. |
| Measures of Variability or Dispersion | are measures of the average distance of each observation from the center of the distribution. |
| Measures of Variability or Dispersion | tell us how spread out the scores are |
| Measures of Variability or Dispersion | They summarize and describe the extent to which scores in a distribution differ from each other. |
| Measures of absolute dispersion | Range, Variance, Standard Deviation |
| Measures of relative dispersion | coefficient of variation |
| Measures of Absolute Dispersion | are expressed in the units of the original observations. |
| Measures of Absolute Dispersion | They cannot be used to compare variations of two data sets when the averages of these sets differ a lot in value or when the observations differ in units of measurements. |
| Range | is the difference between the highest and the lowest values. |
| Range | This is the simplest but most unreliable measure of dispersion since it only uses two values in the distribution |
| Variance | is the average of the squared deviation of each score from the mean |
| Standard Deviation | is the square root of the average of the squared deviation of each score from the mean, or simply, the square root of the variance. |
| Measures of Relative Dispersion | are unitless measures and are used when one wishes to compare the scatter of one distribution with another distribution. |
| Coefficient of Variation | is the ratio of the standard deviation to the mean and is usually expressed in percentage |
| Coefficient of Variation | It is used to compare variability of two or more sets of data even when they are expressed in different units of measurements. |
| fractiles or quantiles | are values below which a specific fraction or percentage of the observations in a given set must fall. |
| Quartiles, Deciles and Percentiles | fractiles of special interest |
| Measures of Location or Position | several measures that describe or locate the position of non-central pieces of data relative to the entire set of data. |
| Standard z-Score | measures how many standard deviations an observation is above or below the mean |
| Percentiles | are values that divide a set of observations into 100 equal parts |
| Deciles | are values that divide a set of observations into 10 equal parts |
| Quartiles | are values that divide a set of observations into 4 equal parts |
| Measures of Shapes | describe the shape of a certain distribution |
| Histogram | can give a general idea of the shape |
| skewness and kurtosis | two numerical measures of shape that can give a more precise evaluation |
| Skewness | refers to the degree of symmetry and asymmetry of a distribution |
| normal distribution | is bell-shaped and symmetric through the mean |
| normal distribution | It has the property mean=median=mode |
| distribution skewed to the left | the mean is less than the median |
| negatively skewed | The bulk of the distribution is on the right |
| distribution skewed to the right | the mean is greater than the median |
| positively skewed | The bulk of the distribution is on the left |
| Kurtosis | refers to the peakedness or flatness of a distribution |
| Mesokurtic | is a normal distribution |
| Leptokurtic | is more peaked than the normal distribution |
| Platykurtic | is flatter than the normal distribution |
| experiment | is the process of observing a phenomenon that has variation in its outcomes |
| experiment | It is a well-defined action leading to a single, well-defined result |
| outcome | is a result from a single trial of an experiment |
| Sample Space | The set of all possible outcomes of an experiment |
| event | is a collection of outcomes from an experiment, that is, a subset of the sample space |
| simple event | An event containing only one element |
| compound event | is one that can be expressed as a union of simple events. |
| null space or empty space | is a subset of the sample space that contains no elements |
| null space or empty space | It is denoted by $\emptyset$ |
| union of two events A and B | denoted by $A \cup B$ |
| union of two events A and B | is the event containing all elements that belong to A or to B, or to both |
| complement of an event A | is the set of all elements of S that are not in A |
| intersection of two events A and B | denoted by $A \cap B$ |
| intersection of two events A and B | is the event containing all elements common to both A and B |
| mutually exclusive events | Two events A and B are mutually exclusive if $A \cap B = \emptyset$, that is, if A and B have no elements in common |
| Multiplication Rule | If an operation can be performed in $n_1$ ways, and if for each of these a second operation can be performed in $n_2$ ways, then the two operations can be performed together in $n_1 n_2$ ways |
| Generalized Multiplication Rule | If an operation can be performed in $n_1$ ways, and if for each of these a second operation can be performed in $n_2$ ways, and for each of the first two a third operation can be performed in $n_3$ ways, and so forth, then the sequence of $k$ operations can be performed in $n_1 n_2 \cdots n_k$ ways |
| Permutation | is an ordered arrangement of all or part of a set of objects. |
| Combination | is an arrangement of objects without regard to order |
| Probability | refers to the likelihood of occurrence of an event |
| Subjective Probability | chance of occurrence is given by a particular person based on his/her educated guess, opinion, intuition or beliefs |
| Empirical Probability | probability is assigned based on the prior knowledge of the events that happened on the past, or based on research or experiment |
| Classical Probability | applied when all possible outcomes are equally likely to happen |
| Conditional Probability (that B occurs given A has occurred) | The Probability of an event B occurring when it is known that some event A has occurred |
| Independent Events | A occurring does not affect the probability of B occurring |
| Random Variable | is a variable whose values are determined by chance |
| Discrete Random Variables | have a finite number of possible values or an infinite number of values that can be counted |
| Continuous Random Variables | variables that can assume all values in the interval between any two given values |
| discrete probability distribution | consists of the values a random variable can assume and the corresponding probabilities of the values |
| discrete probability distribution | The probabilities are determined theoretically or by observation. |
| probability mass function | The probability distribution of a discrete random variable |
| family of probability distributions | The collection of all probability distributions for different values of the parameter |
| parameter of the distribution | a quantity that can be assigned any one of a number of possible values, with each different value determining a different probability distribution |
| cumulative distribution function | $F(x)$ of a discrete random variable $X$ with pmf $p(x)$ is defined for every number $x$ by $F(x) = P(X \le x) = \sum_{y \le x} p(y)$ |
| mean or expected value of a probability distribution | is the theoretical average of the variable |
| binomial distribution | The outcomes of a binomial experiment and the corresponding probabilities of these outcomes |
| Poisson distribution | A discrete probability distribution that is useful when n is large and p is small, and when the independent variables occur over a period of time |
| normal distribution curve | is bell-shaped |
| mean, median, and mode | are equal and are located at the center of the distribution |
| normal distribution curve | is unimodal (i.e., it has only one mode) |
| curve | is symmetric about the mean, which is equivalent to saying that its shape is the same on both sides of a vertical line passing through the center |
| curve | never touches the x axis. Theoretically, no matter how far in either direction the curve extends, it never meets the x axis, but it gets increasingly closer |
| total area under a normal distribution curve | is equal to 1.00, or 100% |
| standard normal distribution | is a normal distribution with a mean of 0 and a standard deviation of 1 |
| area under a normal distribution curve | is used to solve practical application problems |
| hypothesis testing | Define the population under study, state the particular hypotheses that will be investigated, give the significance level, select a sample from the population, collect the data, perform the calculations required for the statistical test, and reach a conclusion |
| statistical hypothesis | is a conjecture about a population parameter. This conjecture may or may not be true. |
| null hypothesis | symbolized by $H_0$ |
| null hypothesis | is a statistical hypothesis that states that there is no difference between a parameter and a specific value, or that there is no difference between two parameters |
| alternative hypothesis | symbolized by $H_1$ or $H_a$ |
| alternative hypothesis | is a statistical hypothesis that states the existence of a difference between a parameter and a specific value, or states that there is a difference between two parameters |
| statistical test | uses the data obtained from a sample to make a decision about whether the null hypothesis should be rejected |
| test value | numerical value obtained from a statistical test |
| type-I error | occurs if you reject the null hypothesis when it is true |
| type-II error | occurs if you do not reject the null hypothesis when it is false |
| level of significance | is the maximum probability of committing a type I error |
| level of significance | This probability is symbolized by $\alpha$ (Greek letter alpha). That is, $P(\text{type I error}) = \alpha$. |
| critical value | separates the critical region from the noncritical region |
| critical or rejection region | is the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected |
| noncritical or nonrejection region | is the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis should not be rejected |
| one-tailed test | indicates that the null hypothesis should be rejected when the test value is in the critical region on one side of the mean |
| one-tailed test | is either a right-tailed test or left-tailed test, depending on the direction of the inequality of the alternative hypothesis |
| two-tailed test | the null hypothesis should be rejected when the test value is in either of the two critical regions |
| z test | is a statistical test for the mean of a population. It can be used when $n \ge 30$, or when the population is normally distributed and $\sigma$ is known |
| t test | When the population standard deviation is unknown, t test is used. The distribution of the variable should be approximately normal. |
| t distribution | is similar to the standard normal distribution |
| t distribution | is bell-shaped |
| t distribution | is symmetric about the mean |
| t distribution | The mean, median, and mode are equal to 0 and are located at the center of the distribution |
| t distribution | The curve never touches the x axis |
| t distribution | The variance is greater than 1 |
| t distribution | is a family of curves based on the degrees of freedom, which is a number related to sample size |
| t distribution | As the sample size increases, the t distribution approaches the normal distribution |
| t test | is a statistical test for the mean of a population |
| t test | is used when the population is normally or approximately normally distributed, $\sigma$ is unknown, or when the sample is small, i.e., $n < 30$ |
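
The definitions of the mean, median, mode, range, variance, standard deviation, coefficient of variation, and z-score given in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `describe` and `z_score` and the sample `scores` list are made up for the example, and the variance divides by the number of scores (the "average of the squared deviations" wording in the table), not by n − 1 as a sample variance would.

```python
import statistics


def describe(scores):
    """Return descriptive measures for a non-empty list of numeric scores."""
    n = len(scores)
    # Mean: sum of all scores divided by the number of scores.
    mean = sum(scores) / n
    # Variance (assumption: population form): average squared deviation from the mean.
    variance = sum((x - mean) ** 2 for x in scores) / n
    # Standard deviation: square root of the variance.
    sd = variance ** 0.5
    return {
        "mean": mean,
        "median": statistics.median(scores),      # midpoint of the ordered list
        "modes": statistics.multimode(scores),    # the mode need not be unique
        "range": max(scores) - min(scores),       # highest value minus lowest value
        "variance": variance,
        "sd": sd,
        "cv_percent": sd / mean * 100,            # relative dispersion, in percent
    }


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical data set, chosen only so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(scores)
print(summary)
print(z_score(9, summary["mean"], summary["sd"]))
```

For this data set the mean is 5 and the standard deviation is 2, so the score 9 lies two standard deviations above the mean. Because the coefficient of variation is unitless, it is the value to compare when two data sets are measured in different units.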