BIOS480 Exam 1
Definitions and info for exam 1 of BIOS480
| Term | Definition |
|---|---|
| Sample | a random collection of observations from a population |
| Population | all possible observations |
| Frequentist statistics | Interprets the probability of an outcome as the frequency with which it occurs over a very long run of events |
| Random variable | a variable whose values are not known for certain before a sample is taken |
| Sample space | the set of all possible outcomes of a random variable |
| Probability distribution | The probability of observing each outcome in the sample space (tells you how high or low the likelihood of seeing each outcome is) |
| Probability mass function | A mathematical function that gives the probability of seeing each potential outcome of a discrete random variable |
| Probability density function | Used for continuous variables; the vertical axis is the probability density of the variable, f(x) |
| For a continuous variable, P(X = x) = ? | 0; it is impossible to get the probability of one exact outcome of a continuous variable (probabilities come from areas under the density curve) |
| Expected value (E(X)) | The mean of the probability distribution of a random variable ("the average"); the long-run average outcome across the sample space |
| Mean | the arithmetic average value |
| Standard deviation | the dispersion of the data about the mean |
| Kurtosis | The "fatness" of the tails of the distribution; degree of outliers in the distribution |
| Skewness | Refers to deviations in the distribution's symmetry |
| Right-skewed | mean > median > mode |
| Left-skewed | mode > median > mean |
| Normal (Gaussian) distribution | Symmetric distribution where most observations fall around the mean (bell curve) |
| Normal distribution features | Symmetric bell shape (curve); mean = median at center; parameters = mean and standard deviation; 68% of data within 1 SD of the mean, 95% within 2 SD, 99.7% within 3 SD (see the Python sketch after the table) |
| Lognormal distribution | Right-skewed data whose logarithm is normally distributed |
| Lognormal distribution features | Continuous, skewed distribution; low mean values; large variance; all-positive values; parameters = mean and standard deviation; if you take the log, the distribution shifts to normal (a data transformation) |
| Exponential distribution | Models the elapsed time between two events; a continuous probability distribution describing the waiting time until the next event in a Poisson process |
| Exponential distribution features | Single-parameter distribution; parameter = rate |
| Beta distribution | Used to represent percentages, proportions, or probability outcomes |
| Beta distribution features | Defined on the interval [0, 1]; parameters = alpha and beta (two positive shape parameters that appear as exponents of the random variable and control the shape of the distribution) |
| Bernoulli distribution | A single trial with two possible outcomes (e.g., a coin toss); the outcomes are "success" or "failure" |
| Binomial distribution | A sequence of Bernoulli trials; the probability distribution of the number of successes in a set number of independent trials |
| Binomial distribution features | Parameters = number of trials (n) and probability of success in a single trial (p); the expected value E(X) = np is the number of successes expected to occur in n total trials |
| Multinomial distribution | A generalization of the binomial distribution to more than two possible outcomes; the joint probability distribution of multiple outcomes from n fixed trials |
| Poisson distribution | The probability that an event (or number of events) may occur; describes variables representing the number of occurrences of a particular event in an interval of time/space |
| Poisson distribution features | Single parameter (λ): expected value = variance; generally right-skewed |
| Uniform distribution | All outcomes are equally likely |
| Sample size | # of observations in the sample (n) |
| Statistics | Measured characteristics of the SAMPLE (Ex., sample mean) |
| Parameters | Characteristics of the POPULATION (Ex., population mean) |
| Random (simple) sampling | The basic method of collecting observations in a sample; any observation has the same probability of being collected; the aim is to sample in a manner that doesn't create bias or favor any observation being selected |
| Random sample = ? | Independent and identically distributed (IID) |
| Independent (IID) | Sample items are all independent events; knowledge of the value of one variable gives no information about the other, and vice versa |
| Identically distributed (IID) | No overall trends; the distribution doesn't fluctuate, and all items in the sample are taken from the same probability distribution |
| Random sampling is usually ? | Haphazard; populations must be defined at the start of a study, so there are spatial and temporal limits |
| More than 1 in 20 US teens have diagnosed anxiety or depression (Parameter or statistic) | Statistic; the number describes the whole population of US teens, but it is impossible to collect info from every member, so it must be estimated from a sample |
| Latvian women are the tallest on the planet w/ a mean height of 170 cm (Parameter or statistic) | Statistic; it describes the whole population, but it is not feasible to measure the height of every Latvian woman, so the mean must come from a sample |
| The median annual income of all 37 employees at Company Y is $42,000 (Parameter or statistic) | Parameter; it covers ALL employees at Company Y, not just a portion of them |
| The avg final math exam scores of all seniors from high school A have increased from 70% to 78% in the past decade (Parameter or statistic) | Parameter; the percentage change refers to the entire population of seniors at high school A |
| A good estimator (i.e., statistic) of a population parameter should have the following characteristics: | Unbiased: the expected value of the sample statistic equals the parameter; Consistent: as sample size increases, the statistic gets closer to the population parameter; Efficient: it has the lowest variance among all competing statistics |
| The two broad types of estimation are ? and ? | Point estimates and interval estimates |
| Point estimate | provides a single value which estimates a population parameter |
| Types of point estimates | Mean: estimator of the population mean, each observation weighted by 1/n; Median: middle measurement of the data set; Trimmed mean; Winsorized mean |
| Trimmed mean | Mean calculated after omitting a proportion (usually 5%) of highest and lowest observations |
| Winsorized mean | Same as the trimmed mean, except the omitted observations are replaced by the nearest remaining value |
| Interval estimate | Provides a range of values that might include the parameter with a known probability (Ex., confidence intervals) |
| Range | The difference between the largest and smallest observation; the simplest measure of spread, but there is no clear link between the sample range and the population range; generally increases as sample size increases |
| Sample variance | Estimate of the population variance |
| Sample variance steps | 1. Calculate the mean; 2. Subtract the mean from each observation and square the result; 3. Work out the average of those squared differences (dividing by n − 1 for a sample); the result is s², so if you need the original units you must take the square root |
| Standard deviation | Square root of sample variance |
| Coefficient of variation | Ratio of standard deviation to the mean and shows the extent of variability in relation to the mean of the population |
| Interquartile range | The difference between the first quartile (the observation which has 25% of the observations below it) and the third quartile (the observation which has 25% of the observations above it); used in the construction of box plots |
| Median absolute deviation (MAD) | Less sensitive to outliers than the other measures of spread; the sensible measure of spread to present in association with medians |
| Confidence intervals | A range of values where you can be relatively confident the true value will be |
| Central limit theorem | The sampling distribution of a sample mean is approximately normal if the sample size is large enough (n > 30), even if the population distribution is not normal; it is the distribution of averages (means) (see the Python sketch after the table) |
| Formula for confidence interval estimate for a population mean | sample mean ± (critical value × standard error) (see the Python sketch after the table) |
| Standard error | The standard deviation of the sample means: SE = s/√n, where s = the sample estimate of the SD and n = the sample size |
| Critical value | A cutoff value from a reference distribution (e.g., the t-distribution) corresponding to the chosen confidence level; the sample mean is converted into an interval using it |
| How to obtain a "t" critical value | Can be obtained from the t-distribution based on the degrees of freedom (df) and the confidence level you are using |
| "t" critical value | Used during statistical tests to assess the statistical significance of the difference between two sample means, in the construction of confidence intervals, and in linear regression analysis |
| Degrees of freedom (df) | The number of independent values that a statistical analysis can estimate; df = n − 1; you want as many degrees of freedom as possible |
| What if you knew the population standard deviation and want to calculate the confidence interval? | sample mean ± (z critical value × standard error); with a known population SD, the critical value comes from the z-distribution instead of the t-distribution |
| z-distribution | The Standard Normal distribution; form of the normal distribution in which the mean is zero and the standard deviation is 1 |
| t- vs z-distribution | z-distribution assumes that you know the population standard deviation. t-distribution is based on the sample standard deviation |
| Methods for estimating parameters | Method of moments; maximum likelihood; ordinary least squares; bootstrapping |
| Moments of a function (Method of moments) | Every distribution has moments: mean (1st), variance/standard deviation (2nd), skewness (3rd), kurtosis (4th) |
| Maximum likelihood | Method that determines values for the parameters of a model "Which curve was most likely responsible for creating the data points that we observed?" |
| Maximum likelihood estimates (MLE) | Find the parameter values (e.g., the mean and standard deviation) that give the distribution that maximizes the probability of observing the data; calculated from the total probability of observing all of the data |
| With MLE you assume what if it follows a normal probability distribution? | Assuming the data follow a normal probability distribution, the probability of each point is multiplied by that of the others in one formula (see the Python sketch after the table) |
| Ordinary least squares (OLS) | The least squares estimator for a given parameter is the one that minimizes the sum of the squared differences between each value in a sample and the parameter |
| Major application of OLS estimation | When we are estimating parameters of linear models, where the previous equation represents the sum of squared differences between observed values and those predicted by the model |
| ML and OLS are most commonly used for... | Estimating population parameters for the analyses we will discuss in this class |
| Bootstrapping | Statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples drawn by resampling with replacement (see the Python sketch after the table) |
| Bootstrapping applications | Estimate confidence intervals for parameters (mean, median, variance, etc.); estimate p-values when traditional parametric methods are challenging or not applicable; assess the stability and reliability of models |
| Hypothesis testing | Statistical methods used to determine whether a pattern in sample data is likely to hold true in the population from which the sample was drawn |
| Hypothesis testing steps | 1. State the null and alternative hypotheses; 2. Choose a significance level; 3. Collect data and calculate the test statistic; 4. Calculate the p-value; 5. Make a decision about the null hypothesis (see the Python sketch after the table) |
| Reject H0 if... | p-value < significance level = reject H0 |
| Fail to reject H0 if... | p-value > significance level = fail to reject H0 |
| Two-tailed | In most cases in biology, H0 states there is no effect and HA allows an effect in either direction; the uncertainty (α) is distributed to both tails; tests for an effect in both directions (+/−) |
| One-tailed | Distributes the uncertainty (α) to only one tail; only allows testing for an effect in one direction (e.g., H0 is rejected only if one specific mean is bigger than the other) |
| Parametric tests | Statistical tests most commonly used by biologists Ex., z-test, t-test, ANOVA |
| 4 assumptions of parametric tests | Normality Homogeneity of variances Linearity Independence |
| How to check for normality | To test the assumption of a normal distribution, skewness should be within ±2 and kurtosis within ±7 (LOOK FOR A BELL CURVE); Shapiro-Wilk's W test; Kolmogorov-Smirnov test; histograms, boxplots, and Q-Q plots (see the Python sketch after the table) |
| Q-Q (quantile-quantile) plots | Observed quantiles are plotted against expected quantiles on a graph; if the plotted values deviate substantially from a straight line, the data are not normally distributed |
| Most common asymmetry in biological data | Positive skewness, often because variables have a lognormal or Poisson distribution; transformations of skewed variables can often improve normality |
| Homogeneity of variances | More important than normality (tests are more robust to unequal variances when sample sizes are equal); linear model hypotheses assume that the variance in the response variable is the same at each level, or combination of levels, of the predictor variables |
| How to check for homogeneity of variances | Bartlett's test; Levene's test; Fligner-Killeen test (see the Python sketch after the table) |
| Linearity | Parametric correlation and linear regression analyses are based on straight-line relationships between variables |
| Independence | Implies that all the observations should be independent of each other, both within and between groups; the most common situation where this assumption is not met is when data are recorded in a time sequence |
| Graphical exploration of data is used to ... | Assess assumption violations; detect errors in data entry; detect patterns in the data that may not be revealed by the statistical analysis you will use; detect unusual values (i.e., outliers) |
| Histograms | Graph used to represent the distribution of data points of one variable; often classify data into various "bins" or "range groups" and count how many data points belong to each of those bins |
| Boxplots | A plot showing the median in the center, the spread, the quartiles (25% and 75%), and potential outliers; because boxplots are based on medians and quartiles, they are very resistant to extreme values; good for displaying single-variable sample observations of size 8+ |
| Scatter plots | The vertical axis represents one variable, the horizontal axis represents the other, and the points on the plot are individual observations; often used because we are interested in the relationship between variables |
| Four possible outcomes of testing a null hypothesis (H0) | Correctly reject H0 ("significant" result), implying HA is true; correctly retain H0, implying HA is false; Type I error: rejecting H0 when it is true; Type II error: not rejecting H0 when it is false |
| Type I error | Rejecting the null hypothesis (H0) when it is true; concluding results are statistically significant when they actually occurred by chance or through unrelated factors; the risk is set by the significance level you choose |
| Type II error | Not rejecting H0 when it is false; failing to conclude there was an effect when there was one; the study may have needed more statistical power to detect it |
| Risk of type II error is inversely related to the ... | Statistical power of the study; higher statistical power = lower probability of a Type II error |
| To reduce the risk of type II error you can ... | increase sample size or significance level |
| Statistical power is determined by the ... | Size of the effect (larger effects are more easily detected); measurement error (systematic/random errors in the data reduce power); sample size (larger sample = reduced error + increased power); significance level (increased significance level = increased power) |
| Type I vs Type II error trade-offs | Type I and Type II errors influence one another; the significance level (Type I) affects statistical power, which is inversely related to the Type II error rate; setting a lower significance level decreases Type I risk but increases Type II risk; increasing the power of a test decreases Type II risk but increases Type I risk |
| Rank-based non-parametric test examples | Wilcoxon signed-rank test; Mann-Whitney U test; Kruskal-Wallis H test; Spearman's rank correlation coefficient |
| Non-parametric tests | Statistical tests that do not assume anything about the distribution followed by the data (aka distribution-free tests); based on the ranks held by different data points |
| Non-parametric test steps (1-4) | 1. Rank all observations, ignoring groups; 2. Calculate the sum of ranks for each sample; 3. Compare the smaller rank sum to the probability distribution of rank sums and test in the usual manner; 4. For larger samples, approximate with a normal distribution and use a z-statistic (see the Python sketch after the table) |
| Non-parametric test advantages | More statistical power when the assumptions of parametric tests are violated; the assumption of normality does not apply; small sample sizes are OK; can be used for all data types (ordinal, nominal, interval) |
| Non-parametric test disadvantages | Less powerful than parametric tests if the assumptions haven't been violated |
| Randomization/permutation tests | Resample/reshuffle the original data many times to generate the sampling distribution of the test statistic directly; generates simulated data like those we would expect under H0 (see the Python sketch after the table) |
| Randomization/permutation tests advantages | Useful when analyzing data for which the distribution is unknown, or when sampling from populations is not possible (e.g., museum specimens) |
| Randomization/permutation tests disadvantages | Computationally intensive; difficult to compare p-values across different analyses/studies; may have lower power when sample sizes are small |
| Randomization/permutation tests main application | Often used to double check more traditional hypothesis test methods. If both tests are significant, then you can be pretty confident about your results. |
| Correcting a violation can be done by ... | Transforming the data |
| Transformations can ... | Make data closer to a normal distribution; reduce the relationship between the mean and variance; reduce outlier influence; improve linearity in regression analyses |
| Types of transformations | Power; log; arc-sine; Box-Cox (see the Python sketch after the table) |
| Power transformation + application | Transforms Y to Y^p, where p is greater than 0; for data with right skew; p = 0.5 (square root) for data that are counts (Poisson) where the variance is related to the mean; cube roots (p = 0.33), fourth roots (p = 0.25), etc. are effective for increasingly skewed data |
| Log transformation + application | Transforming data to logarithms; makes positively skewed distributions more symmetrical, especially when the mean is related to the SD; the lognormal distribution is so named because it becomes normal after log-transforming the values |
| Arc-sine transformation + application | Taking the arcsine of the square root of a number; the result is given in radians and ranges from 0 to π/2; numbers must be in the range 0 to 1; commonly used for proportions and probabilities |
| Box-Cox transformation + application | Can be used to find the best transformation in terms of homogeneity of variance and normality; a transformation on an exponent (lambda, λ) that varies from −5 to 5; all values of λ are considered and the optimal value is selected |
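
Worked examples (Python sketches)

The sketches below are not part of the original cards; they are minimal Python illustrations (assuming numpy and scipy are installed) of several techniques defined above. All data values are made up for illustration.

First, the 68-95-99.7 rule from the normal distribution card can be checked directly against the standard normal CDF:

```python
from scipy import stats

for k in (1, 2, 3):
    # probability mass within k standard deviations of the mean
    # for a standard normal distribution (mean 0, SD 1)
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {p:.4f}")
# prints ~0.6827, ~0.9545, ~0.9973
```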
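A sketch of the central limit theorem card: sample means drawn from a right-skewed (exponential) population still pile up in an approximately normal shape around the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)
# 10,000 samples of size 40 from a right-skewed exponential population
sample_means = rng.exponential(scale=2.0, size=(10_000, 40)).mean(axis=1)

# The distribution of the sample means is approximately normal:
print(f"mean of sample means: {sample_means.mean():.3f}")  # near 2.0, the population mean
print(f"SD of sample means:   {sample_means.std():.3f}")   # near 2.0 / sqrt(40) ~ 0.316
```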
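A sketch of the confidence interval formula (sample mean ± critical value × standard error) using a t critical value from scipy:

```python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])
n = len(x)

se = x.std(ddof=1) / np.sqrt(n)        # standard error: s / sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # t critical value for a 95% CI, df = n - 1

lower = x.mean() - t_crit * se
upper = x.mean() + t_crit * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```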
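A sketch of maximum likelihood estimation under the normal assumption from the MLE cards: the total (log) probability of observing all of the data is maximized numerically (summing log densities is equivalent to multiplying the densities).

```python
import numpy as np
from scipy import optimize, stats

x = np.array([2.3, 1.9, 3.1, 2.7, 2.5, 3.4, 2.0, 2.8])

def neg_log_likelihood(params):
    mu, sigma = params
    # total log-probability of observing all of the data
    return -stats.norm.logpdf(x, loc=mu, scale=sigma).sum()

# minimizing the negative log-likelihood = maximizing the likelihood
result = optimize.minimize(neg_log_likelihood, x0=[1.0, 1.0],
                           bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = result.x
print(mu_hat, x.mean())    # MLE of the mean matches the sample mean
print(sigma_hat, x.std())  # MLE of the SD divides by n (np.std's default)
```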
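A sketch of the bootstrap cards: resampling with replacement many times to put a percentile confidence interval around the median.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([3.2, 4.8, 2.9, 5.1, 4.0, 3.7, 6.2, 4.5, 3.3, 5.0])

# resample with replacement many times, recomputing the median each time
boot_medians = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                         for _ in range(10_000)])

lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap CI for the median: ({lower:.2f}, {upper:.2f})")
```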
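A sketch of the five hypothesis-testing steps, using a two-sample t-test as the test statistic:

```python
import numpy as np
from scipy import stats

# 1. State the hypotheses: H0, the group means are equal; HA, they differ.
# 2. Choose a significance level.
alpha = 0.05

# 3. Collect data and calculate the test statistic.
a = np.array([5.1, 4.9, 5.6, 5.3, 4.8, 5.0])
b = np.array([5.9, 6.2, 5.7, 6.0, 6.4, 5.8])
t_stat, p = stats.ttest_ind(a, b)  # 4. ...along with its p-value

# 5. Make a decision about the null hypothesis.
print("reject H0" if p < alpha else "fail to reject H0", f"(p = {p:.4f})")
```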
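A sketch of the assumption checks named in the cards: Shapiro-Wilk for normality and Levene for homogeneity of variances (Bartlett and Fligner-Killeen follow the same call pattern in scipy).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10, 2, size=30)
group_b = rng.normal(12, 2, size=30)

w, p_norm = stats.shapiro(group_a)            # H0: the data are normal
stat, p_var = stats.levene(group_a, group_b)  # H0: the variances are equal
# stats.bartlett(group_a, group_b) and stats.fligner(group_a, group_b) work the same way

print(f"Shapiro-Wilk p = {p_norm:.3f}, Levene p = {p_var:.3f}")
# large p-values: fail to reject the assumption in each case
```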
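A sketch of the rank-based recipe from the non-parametric steps card, with the rank sums computed by hand and then checked against scipy's Mann-Whitney U test:

```python
import numpy as np
from scipy import stats

a = np.array([3.1, 4.2, 2.8, 5.0, 3.7])
b = np.array([5.5, 6.1, 4.9, 7.2, 5.8])

# 1. Rank all observations, ignoring groups.
ranks = stats.rankdata(np.concatenate([a, b]))
# 2. Calculate the sum of ranks for each sample.
print(ranks[:len(a)].sum(), ranks[len(a):].sum())

# 3-4. scipy compares the rank sums to their distribution under H0.
u_stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat}, p = {p:.3f}")
```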
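A sketch of a randomization/permutation test for a difference in means: reshuffling the group labels many times builds the null distribution of the test statistic directly.

```python
import numpy as np

rng = np.random.default_rng(7)
a = np.array([12.1, 14.3, 11.8, 13.5, 12.9])
b = np.array([15.2, 16.0, 14.8, 15.5, 16.3])

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)  # reshuffle group labels under H0
    diff = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
    if abs(diff) >= abs(observed):      # as or more extreme (two-tailed)
        count += 1

print(f"permutation p-value: {count / n_perm:.4f}")
```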
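Finally, a sketch of the four transformation cards applied to simulated right-skewed data, with scipy choosing the Box-Cox λ automatically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=1.0, sigma=0.8, size=200)  # right-skewed, all positive

log_y = np.log(y)    # log transformation
sqrt_y = np.sqrt(y)  # power transformation with p = 0.5

# arc-sine transformation: for proportions in [0, 1] only
props = rng.uniform(0.05, 0.95, size=200)
arcsine_p = np.arcsin(np.sqrt(props))

# Box-Cox: lambda chosen to best normalize the data
boxcox_y, lam = stats.boxcox(y)
print(f"optimal Box-Cox lambda: {lam:.2f}")  # near 0 for lognormal data, i.e., close to a log
```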