LIR 493: Quantitative Methods Professor Wallace Hendricks 

 

Glossary:

Note: Links for the words themselves refer back to class notes. 


Adjusted R-Squared:
The R2 from a regression that is adjusted for the number of explanatory variables relative to the size of the sample used. It is NOT the percentage of the variation explained by the regression, which is the R2 itself.
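As a rough illustration of the adjustment, the sketch below uses the common textbook formula 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the sample size and k is the number of explanatory variables (not counting the intercept). The function name and the numbers are hypothetical, chosen only for the example.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjust R-squared for the number of explanatory variables.

    n_predictors excludes the intercept (an assumption of this sketch).
    """
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical numbers: R2 = 0.50 from 21 observations and 4 explanatory variables.
adj = adjusted_r_squared(0.50, n_obs=21, n_predictors=4)
print(adj)  # lower than the unadjusted 0.50
```

With few observations per explanatory variable, the adjusted value falls noticeably below the raw R2.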
Alpha:
The conditional probability of a Type-I error in hypothesis testing, when the null hypothesis is true.
Alternative hypothesis:
A hypothesis against which the null hypothesis is tested and which will be held to be true if the null is held false. Often, the alternative hypothesis represents a new theory the scientist would like to prove. The theory's scientific status becomes stronger if experiments repeatedly show that the null hypothesis is untenable.
Analysis of Variance:
A technique that tests differences between two or more groups by comparing the variation between the groups with the variation within them (this is one-way anova). Can be thought of as an extension to the 2 sample t-test, or as linear regression with only categorical explanatory variables. See also F-test.
Analysis of Variance table:
a traditional formal layout that presents the main results of an analysis of variance.
ANOVA: see Analysis of Variance.
 
Arithmetic mean:
the sum of a set of numbers divided by the number of values. The most often used measure of central tendency. See also geometric mean.
Assumptions:
Statistical methods such as regression require the data to satisfy various conditions, for example that the data follow a normal distribution and are independent. When using such a method, we assume that these conditions hold: these are the assumptions required for the method to be valid. It is good statistical practice to check the assumptions as far as possible.
Autocorrelation:
Correlation of error terms in a time series of data. First order autocorrelation is the correlation of successive error terms.
Average:
usually the ordinary, or arithmetic, mean, but also used for other "measures of central tendency", such as the median or mode.
Beta (β):
The probability of a Type-II error in hypothesis testing, when the null hypothesis is false.
Beta Weights:
In a multiple regression, the beta weight associated with a particular variable measures the number of standard deviations that the dependent variable is estimated to change when the independent variable is changed by one standard deviation, all other things held constant.
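One way to see the definition is to rescale an ordinary slope coefficient by the ratio of standard deviations. The sketch below assumes the usual conversion, beta weight = b * sd(X) / sd(Y); the function name and the data are made up. In a regression with a single X variable the result equals the correlation coefficient.

```python
import statistics


def beta_weight(slope, x, y):
    """Convert an unstandardized slope into a beta weight (standardized slope)."""
    return slope * statistics.stdev(x) / statistics.stdev(y)


# Made-up data; the least-squares slope of y on x for these values is 0.6.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta = beta_weight(0.6, x, y)
print(beta)  # standard deviations of y per one standard deviation of x
```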
between SS:
The between sum of squares is the part of the total sum of squares that is attributable to differences between groups, that is, between the levels of a between (grouping) factor. In a repeated measures ANOVA, there will be at least one factor that is measured at each level for every subject. This is a within (repeated measures) factor. For example, in an experiment in which each subject performs the same task twice, trial (or trial number) is a within factor. There may also be one or more factors that are measured at only one level for each subject, such as gender. This type of factor is a between or grouping factor.
bias:
An estimate for a parameter is unbiased if its expected value is the true value of the parameter. Otherwise, the estimate is biased. Thus, bias is the difference between the expected value of the estimate and the true value of the parameter.
binary variable:
A binary random variable is a discrete random variable that has only two possible values, such as male or female.
Categorical variable:
A variable that can take only a few distinct values (eg. sex, race, state). See also nominal variable, ordinal variable.
Central Limit Theorem:
As the size of a random sample gets large, the sampling distribution of the sample mean approaches the normal distribution regardless of the distribution of the population from which the sample is drawn.
central tendency:
The generalized concept of the "average" value of a distribution. Typical measures of central tendency are the mean, the median, the mode, and the geometric mean.
Coefficient:
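A constant that multiplies a variable, or stands alone, in an equation. In the simple regression line Y = a + b*X, a is the intercept and b is the slope of the line.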
Coefficient Alpha:
Measures the internal reliability of a summed scale. A value of 0.7 or higher is generally considered to indicate a reliable scale.
Coefficient of determination:
A measure of the proportion of variability in the response variable explained by a linear regression model. It is a number between zero and one. A value close to zero suggests a poor model. Also called R-squared (R2); it is the square of the coefficient of multiple correlation, R.
Collinearity:
A numerical problem that results when explanatory variables in a regression model are highly correlated. Typical signs of collinearity include large standard errors and implausible estimated values for the regression coefficients.
Confidence Band:
In regression, a forecast interval or confidence band can be computed for both the expected value of the dependent variable and the individual values of the dependent variable. The confidence band will be much wider for the individual values of the dependent variable. Like a confidence interval, it is computed for a set probability (such as 95%) of including the forecast value. It is narrowest at the mean value(s) of the independent variable(s) and widens as the values of the independent variables used in the forecast deviate from their means.
Confidence interval:
a random interval that has a set probability of including the true value of a parameter. Defines an interval within which the true population parameter is likely to lie. It can be thought of as a measure of the precision of a sample statistic.
consistency:
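An estimator is consistent if it converges (in probability) to the true value of the parameter being estimated as the sample size increases toward infinity. In practice, this means that both the bias and the variance of the estimator shrink toward zero as the sample grows.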
correlation:
Correlation is the linear association between two random variables X and Y. It is usually measured by a correlation coefficient, such as Pearson's r, such that the value of the coefficient ranges from -1 to 1. A positive value of r means that the association is positive; i.e., that if X increases, the value of Y tends to increase linearly, and if X decreases, the value of Y tends to decrease linearly. A negative value of r means that the association is negative; i.e., that if X increases, the value of Y tends to decrease linearly, and if X decreases, the value of Y tends to increase linearly. The larger r is in absolute value, the stronger the linear association between X and Y. If r is 0, X and Y are said to be uncorrelated, with no linear association between X and Y. Independent random variables are always uncorrelated, but uncorrelated variables need not be independent.
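The sketch below computes Pearson's r directly from its definition and illustrates the last point of the entry: a variable can depend completely on another and still be uncorrelated with it. The function name and data are made up for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]  # exact linear relationship
y_square = [v * v for v in x]      # fully determined by x, but not linearly

r_linear = pearson_r(x, y_linear)  # perfect positive linear association
r_square = pearson_r(x, y_square)  # no linear association despite dependence
print(r_linear, r_square)
```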
covariate:
A covariate is a variable that may affect the relationship between two variables of interest, but is not of intrinsic interest itself. A covariate is often used to control for variation that is not attributable to the variables under study. A covariate may be a discrete factor, like a block effect, or it may be a continuous variable, like the X variable in an analysis of covariance.
Critical value:
a predetermined cutoff value for a test statistic, for deciding whether or not to reject the null hypothesis in hypothesis testing.
Cross-sectional:
a study design in which the data for each variable are measured at the same point in time.
curvilinear functions:
A curvilinear function is one whose value, when plotted, will follow a continuous but not necessarily straight line, such as a polynomial, logistic, or exponential curve.
Degrees of freedom (df):
A parameter that indexes the families of t-distributions and F-distributions. A t-distribution with many df is similar to the normal distribution, while one with few df has greater variance. An F-distribution has degrees of freedom associated with both the numerator and the denominator that make up the F-statistic. Degrees of freedom are computed as the number of unknown quantities (e.g., N) minus the number of independent equations linking the unknowns.
Decision rule:
a procedure that determines whether or not a hypothesis test is significant.
Density:
See probability density function.
Dependent variable:
the variable being explained in a regression or analysis of variance. It is assumed in regression that causation flows from the independent variables to the dependent variable.
distribution function:
A distribution function (also known as the probability distribution function) of a continuous random variable X is a mathematical relation that gives for each number x, the probability that the value of X is less than or equal to x. For example, a distribution function of height gives, for each possible value of height, the probability that the height is less than or equal to that value. For discrete random variables, the distribution function is often given as the probability associated with each possible discrete value of the random variable; for instance, the distribution function for a fair coin is that the probability of heads is 0.5 and the probability of tails is 0.5.
distribution-free tests:
Distribution-free tests are tests whose validity under the null hypothesis does not depend on the population distribution(s) from which the data have been sampled.
Dummy Variable:
A binary variable used in regression in place of a qualitative variable. The number of dummy variables necessary to replace a qualitative variable with k categories is k-1. The coefficient for a dummy variable measures the difference of means between the category represented by that variable and the omitted category.
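A small numerical check of the last sentence, using a regression on a single dummy variable: the slope should equal the difference between the two group means, and the intercept should equal the mean of the omitted category. The helper name, the variable names, and the wage figures are all hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: dummy = 0 for the omitted category, 1 for the other one.
dummy = [0, 0, 0, 1, 1, 1]
wage = [10, 12, 14, 15, 17, 19]

intercept, slope = fit_line(dummy, wage)
mean_omitted = sum(wage[:3]) / 3
mean_other = sum(wage[3:]) / 3
print(intercept, slope, mean_other - mean_omitted)
```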
Durbin-Watson Statistic:
A statistic that measures the first order autocorrelation of error terms in a time series regression.
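For reference, the statistic is usually computed as the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4: values near 2 suggest no first order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. The sketch below assumes the residuals are already available in time order; the function name and the residual series are made up.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a time-ordered list of regression residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Made-up residual series, for illustration only.
dw_runs = durbin_watson([1, 1, -1, -1])         # neighbours tend to agree in sign
dw_alternating = durbin_watson([1, -1, 1, -1])  # neighbours always differ in sign
print(dw_runs, dw_alternating)
```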
Efficiency:
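Efficiency is judged by the expected value of the squared error of an estimator (the mean squared error), which is the sum of the squared bias and the variance of the estimator. The smaller the mean squared error, the more efficient the estimator.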
Error Variance or residual variance:
The square of the standard error of the estimate.
Elasticity:
The percentage change in a dependent variable with a one percent change in an independent variable.
Estimator:
A statistic used to predict, or estimate, the value of a parameter in the population.
Expected value, Expectation:
the mean of a random variable.
F-distribution:
A family of probability distributions used for hypothesis tests in analysis of variance and regression. A particular distribution from the family is characterized by its numerator and denominator degrees of freedom.
F-test:
A test used in analysis of variance or regression. The test statistic is the ratio of the variance between groups to the variance within groups. If there is no difference between the groups (i.e., the null hypothesis is true), this statistic follows an F-distribution. Its expected value is then close to one (i.e., the variance between groups is about equal to the variance within groups if the null hypothesis is true).
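The ratio described in this entry can be sketched for a one-way analysis of variance as follows. The function name and the three small groups are made up; the code only computes the F statistic and its degrees of freedom, and leaves the comparison with the F-distribution to tables or software.

```python
def one_way_f(groups):
    """Return (F, numerator df, denominator df) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-groups sum of squares: observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)  # mean square between groups
    ms_within = ss_within / (n - k)    # mean square within groups
    return ms_between / ms_within, k - 1, n - k


f_stat, df_num, df_den = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_num, df_den)
```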
Forecast Interval:
See confidence band.
Geometric mean:
the n'th root of the product of a set of n positive numbers. Equivalently, the exponential of the arithmetic mean of the logarithms of the numbers.
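The two descriptions of the geometric mean give the same number, which a short check can confirm. The function name is hypothetical, and the sketch assumes all values are positive so that the logarithms exist.

```python
import math


def geometric_mean(values):
    """Exponential of the arithmetic mean of the logarithms (positive values only)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


values = [2, 8]
gm_from_logs = geometric_mean(values)
gm_from_root = (2 * 8) ** (1 / len(values))  # n'th root of the product
print(gm_from_logs, gm_from_root)
```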
Heteroscedastic:
having different variance: in a linear regression model, violation of the assumption of constant variance in the outcome variable (homoscedasticity) is called heteroscedasticity.
homoscedasticity (homogeneity of variance):
Normal-theory-based tests for the equality of population means such as the t test and analysis of variance, assume that the data come from populations that have the same variance, even if the test rejects the null hypothesis of equality of population means. If this assumption of homogeneity of variance is not met, the statistical test results may not be valid. Heteroscedasticity refers to lack of homogeneity of variances.
Hypothesis test:
the acceptance or rejection of an assertion (the "null hypothesis") about one or more parameters according to the assertion's compatibility with the data.
independent:
Two random variables are independent if their joint probability density is the product of their individual (marginal) probability densities. Less technically, if two random variables A and B are independent, then the probability of any given value of A is unchanged by knowledge of the value of B.
Independent variables:
variables that are controlled or considered fixed, that may affect the values taken by dependent variables. Often called "X-variables", "predictor variables" or "explanatory variables".
interaction:
the condition that the strength of association between two variables depends on the value of a third, or that the effect of each of two explanatory variables on a response variable depends on the level of the other explanatory variable. For example, in a drug experiment involving rats, there would be an interaction between sex and treatment if the effect of treatment was not the same for males and females.
Intercept:
the constant in a regression equation; the point where a regression line intercepts the vertical axis, if the horizontal axis has a true zero origin.
Least-squares (ordinary least squares):
a method of estimating unknown parameters by minimizing the sum of squared residuals. The usual method of fitting a linear regression model.
linear functions:
A linear function of one or more X variables is a linear combination of the values of the variables:

Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
An X variable in the equation could be a curvilinear function of an observed variable (e.g., one might measure distance, but think of distance squared as an X variable in the model, or X2 might be the square of X1), as long as the overall function (Y) remains a sum of terms that are each an X variable multiplied by a coefficient (i.e., the function Y is linear in the coefficients). Sometimes, an apparently nonlinear function can be made linear by a transformation of Y, such as the function
Y = exp(b0 + b1*X1),
which can be made a linear function by taking the logarithm of Y
(log(Y) = b0 + b1*X1),
and then considering log(Y) to be the overall function.
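As a sketch of the transformation just described, the code below generates noise-free values from Y = exp(b0 + b1*X1) with made-up coefficients (0.5 and 0.3), takes logarithms, and fits a straight line to log(Y) by least squares. The helper name is hypothetical; with real, noisy data the recovered coefficients would only be estimates.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


x1 = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(0.5 + 0.3 * v) for v in x1]  # made-up exponential relationship
log_y = [math.log(v) for v in y]           # the transformation of Y

b0, b1 = fit_line(x1, log_y)               # linear in the coefficients
print(b0, b1)
```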
linear logistic model:
A linear logistic model assumes that for each possible set of values for the independent (X) variables, there is a probability p that an event (success) occurs. Then the model is that Y is a linear combination of the values of the X variables:

Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk,
where Y is the logit transformation of the probability p.
linear regression:
In a linear regression, the fitted (predicted) value of the response variable Y is a linear combination of the values of one or more predictor (X) variables:

Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
An X variable in the model equation could be a nonlinear function of an observed variable (e.g., one might observe distance, but use distance squared as an X variable in the model, or X2 might be the square of X1), as long as the fitted Y remains a sum of terms that are each an X variable multiplied by a coefficient. The most basic linear regression model is simple linear regression, which involves one X variable:
Y = b0 + b1*X.
Multiple linear regression refers to a linear regression with more than one X variable.
 
Logistic regression:
a regression model for binary (dichotomous) outcomes. The data are assumed to follow binomial distributions with probabilities that depend on the independent variables.
logit transformation:
The logit transformation Y of a probability p of an event is the logarithm of the ratio between the probability that the event occurs and the probability that the event does not occur:

Y = log(p/(1-p)).
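A minimal sketch of the transformation and its inverse, using the natural logarithm (an assumption; the entry does not state the base, but the natural log is the usual choice). The function names are hypothetical.

```python
import math


def logit(p):
    """Log of the odds p / (1 - p); defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(y):
    """Map a logit value back to a probability."""
    return 1.0 / (1.0 + math.exp(-y))


even_odds = logit(0.5)               # odds of 1, so the logit is zero
round_trip = inverse_logit(logit(0.8))
print(even_odds, logit(0.8), round_trip)
```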
Longitudinal:
a study design in which subjects are monitored over a period of time, or are observed on several occasions over a period.
Mean square:
a sum of squares divided by its degrees of freedom. A mean square is an estimate of a variance.
median:
The median of a distribution is the value X such that the probability of an observation from the distribution being below X is the same as the probability of the observation being above X. For a continuous distribution, this is the same as the value X such that the probability of an observation being less than or equal to X is 0.5.
mode:
The mode of a distribution is the value in the distribution that has the highest frequency of occurrence.
multicollinearity:
In a multiple regression with more than one X variable, two or more X variables are collinear if they are nearly linear combinations of each other. Multicollinearity can make the calculations required for the regression unstable, or even impossible. It can also produce unexpectedly large estimated standard errors for the coefficients of the X variables involved. Multicollinearity is also known as collinearity and ill conditioning.
multiple regression:
Multiple regression refers to a regression model in which the fitted value of the response variable Y is a function of the values of two or more predictor (X) variables. The most common form of multiple regression is multiple linear regression, a linear regression model with more than one X variable.
Nominal scale:
a scale which consists of categories with no particular ordering (eg. race). See also ordinal scale.
nonlinear functions:
A nonlinear function is one that is not a linear function, and cannot be made into a linear function by transforming the Y variable.
nonlinear regression:
In a nonlinear regression, the fitted (predicted) value of the response variable is a nonlinear function of one or more X variables.
nonparametric tests:
Nonparametric tests are tests that do not make distributional assumptions, particularly the usual distributional assumptions of the normal-theory based tests. However, distribution-free tests generally do make some assumptions, such as equality of population variances.
normal (Gaussian) distribution:
The normal or Gaussian distribution is a symmetric distribution that follows the familiar bell-shaped curve. The distribution is uniquely determined by its mean and variance. Even when a distribution is nonnormal, the distribution of the mean of many independent observations from the same distribution becomes arbitrarily close to a normal distribution as the number of observations grows large.
null hypothesis:
 

A maintained hypothesis that is held to be true until sufficient evidence to the contrary is obtained. The hypothesis of no effect, no difference, no relationship etc. See also alternative hypothesis. A scientific theory should be challenged by conducting tests in which the theory is represented by a null hypothesis. If it survives such tests, its scientific status is strengthened. Social scientists generally pose their theory as the alternative hypothesis.
 
one-sample problem:
 

In the one-sample problem, an independent random sample is collected, and then that sample is used to test a hypothesis about the population from which the sample came (e.g., whether the mean of the population is 0, or any other fixed constant chosen in advance). Paired samples are usually reduced to a one-sample problem by replacing each pair of responses by the difference between them (e.g., in a pre-test/post-test experiment, recording the change from pre-test to post-test).
One-sided test; One-tailed test:
 

an hypothesis test in which large deviations in only one direction from the null hypothesis are to be considered significant. See also two-sided test.
Ordinal scale:
 

a scale whose values are categories that have a natural order but no quantitative relationship e.g. {"small", "medium", "large"}.
P value:
 

In a statistical hypothesis test, the P value is the probability of observing a test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. This probability is then compared to the pre-selected significance level of the test. If the P value is smaller than the significance level, the null hypothesis is rejected, and the test result is termed significant.
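As an illustration, the sketch below computes a two-sided P value for a test statistic that is assumed to follow the standard normal distribution under the null hypothesis, and compares it with a pre-selected significance level. The test statistic of 2.5 and the 0.05 level are made up, and the function name is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z, in either tail."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


significance_level = 0.05            # chosen before looking at the data
p_value = two_sided_p_value(2.5)     # hypothetical observed test statistic
reject_null = p_value < significance_level
print(p_value, reject_null)
```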
Parameter:
 

a fixed numerical value that describes a particular characteristic of a population (eg mean, proportion, coefficient from a regression equation). Because we can't make an infinite number of measurements, population parameters are never known exactly.
Parametric methods:
 

A group of statistical techniques that make strong assumptions about the distribution of the outcome variable (eg, that it is normally distributed). See also nonparametric.
Pearson Correlation Coefficient:
Another name for the simple correlation coefficient, r.
Point estimate:
 

a single number that is the best estimate of an unknown quantity from the available data.
pooled estimate of the variance:
 

The pooled estimate of the variance is a weighted average of each individual sample's variance estimate. When the estimates are all estimates of the same variance (i.e., when the population variances are equal), then the pooled estimate is more accurate than any of the individual estimates.
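In the usual form of the pooled estimate, each sample variance is weighted by its degrees of freedom (sample size minus one). The sketch below assumes that weighting; the function name and the two samples are made up.

```python
import statistics


def pooled_variance(*samples):
    """Average of the sample variances, weighted by degrees of freedom (n - 1)."""
    weighted_sum = sum((len(s) - 1) * statistics.variance(s) for s in samples)
    total_df = sum(len(s) - 1 for s in samples)
    return weighted_sum / total_df


sample_a = [1, 2, 3]        # sample variance 1
sample_b = [2, 4, 6, 8]     # sample variance 20/3
pooled = pooled_variance(sample_a, sample_b)
print(pooled)  # lies between the two individual variances
```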
population:
 

The population is the universe of all the objects from which a sample could be drawn for an experiment. If a representative random sample is chosen, the results of the experiment should be generalizable to the population from which the sample was drawn, but not necessarily to a larger population. For example, the results of medical studies on males may not be generalizable for females.
power of a test:
 

The power of a test is the probability of (correctly) rejecting the null hypothesis when it is in fact false. The power depends on the significance level (alpha-level) of the test, the components of the calculation of the test statistic, and on the specific alternative hypothesis under consideration.
Probability density function:
 

A mathematical function defined over the range of a continuous random variable such that the area under the function between two values (for example a and b) is the probability that the random variable is between a and b. The area under this curve over the entire range of the random variable is always 1.0.
qualitative variable:
 

Qualitative variables are variables for which an attribute or classification is measured. Examples of qualitative variables are gender or state.
quantitative variable:
 

Quantitative variables are variables for which a numeric value representing an amount is measured.
r-squared:
 

the coefficient of determination in regression or analysis of variance. The proportion of the variation of a dependent variable that can be explained by variation in the independent variables.
random sample:
 

A random sample of size N is a collection of N objects that are independent and identically distributed. In a random sample, each member of the population has an equal chance of becoming part of the sample.
Random variable:
 

a theoretical quantity that takes different values according to a probability distribution. Random variables are the primary tools for building statistical models for data. The data are assumed to behave as if they were observations on random variables.
Regression:
 

a class of statistical methods in which one dependent variable is related to one or more independent variables. See also linear regression.
Rejection region:
 

the set of values of a test statistic which if observed will lead to rejection of the null hypothesis.
Reliability:
The ability of a scale to yield the same result over repeated trials.
residuals:
 

A residual is the difference between the observed value of a response measurement and the value that is fitted under the hypothesized model.
robust:
 

Robust statistical tests are tests that operate well across a wide variety of distributions. A test can be robust for validity, meaning that it provides accurate results in the presence of (slight) departures from its assumptions. It may also be robust for efficiency, meaning that it maintains its statistical power (the probability that a true violation of the null hypothesis will be detected by the test) in the presence of those departures.
Sample statistic:
 

An estimate of a population parameter obtained from a sample. The value will vary from sample to sample according to the sampling distribution of the statistic.
Sampling distribution:
 

the theoretical probability distribution of a statistic viewed as a random variable.
significance level:
 

The significance level (also known as the alpha-level) of a statistical test is the pre-selected probability of (incorrectly) rejecting the null hypothesis when it is in fact true. Usually a small value such as 0.05 is chosen. If the P value calculated for a statistical test is smaller than the significance level, the null hypothesis is rejected.
Significant result:
 

a test result that leads to rejection of the null hypothesis.
Simple linear regression:
 

linear regression with only a single independent variable. The equation of the line is calculated by the method of least squares.
skewness:
 

Skewness is lack of symmetry in a distribution. Data from a positively skewed (skewed to the right) distribution have values that are bunched together below the mean, but have a long tail above the mean. (Distributions that are forced to be positive, such as annual income, tend to be skewed to the right.) Data from a negatively skewed (skewed to the left) distribution have values that are bunched together above the mean, but have a long tail below the mean.
Slope of a regression:
 

the change in the dependent variable (Y) per unit change in the independent variable (X).
Standard deviation:
 

the square root of the variance.
Standard error:
 

The standard deviation of a sampling distribution. That is, the standard error describes the variability of a sample statistic. It depends on the sample size and the variability of the individual measurements. For a sample mean, it is typically estimated as the sample standard deviation divided by the square root of the sample size.
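For the most common case, the standard error of a sample mean, the calculation can be sketched as below. The function name and the data are made up for the example.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error_of_mean(sample)
print(statistics.stdev(sample), se)  # the standard error is smaller than the sd
```

Quadrupling the sample size, with the same variability in the measurements, would roughly halve the standard error.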
Standard normal distribution:
 

the normal (or Gaussian) distribution with mean µ=0 and standard deviation σ=1. See also z-score.
Statistic:
 

a quantity calculated from data. Statistics that are used to estimate unknown parameters are called estimators: for example, the mean of a sample is a statistic that is commonly used as an estimator of the population mean.
Student's t-distribution:
 

a family of probability distributions similar to the normal distribution, discovered by W. S. Gosset, who published under the pseudonym 'Student'. All t-distributions have mean zero, but they have different shapes depending on the degrees of freedom.
Sum of squares:
 

the sum of squared differences of data values from their mean; an entry in an analysis of variance table.
t-distribution:
Student's t-distribution.
t-test:
 

a test about the mean of one normal distribution, or for comparing the means of two normal distributions, whose variance is not known but must be estimated from the data.
Test-statistic:
 

a quantity calculated from the sample data as part of the process of hypothesis testing. It is calculated under the assumption that the null hypothesis is true and compared to a theoretical probability distribution (e.g. the standard normal) to determine the p-value.
Time series:
 

a (usually long) sequence of observations made on a variable. Each observation may depend on (be correlated with) one or more preceding observations.
transformation:
 

A transformation of data values is done by applying the same function to each data value, such as by taking logarithms of the data.
two-sample problem:
 

In the two-sample problem, two independent random samples are collected, and then the samples are used to test a hypothesis about the populations from which the samples came (e.g., whether the means of the two populations are identical).
Two-sided test; Two-tailed test:
 

an hypothesis test in which large deviations in either direction from the null hypothesis are to be considered significant. See also one-sided test.
Type I error:
 

A type I error occurs if, based on the sample data, we decide to reject the null hypothesis when in fact (ie, in the population) the null hypothesis is true.
Type II error:
 

A type II error occurs if, based on the sample data, we decide not to reject the null hypothesis when in fact (ie, in the population) the null hypothesis is false.
Validity:
The ability of an indicator to measure what it is designed to measure.
Variance:
 

The most frequently used measure of dispersion. The expected squared deviation of a probability distribution from its mean. For a set of data, the average squared deviation from the mean, but with a denominator of n-1 rather than n to avoid bias. See also standard deviation.
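The difference between the two denominators can be checked against Python's standard library, where statistics.variance divides by n-1 and statistics.pvariance divides by n. The hand-written function below is a sketch with a hypothetical name and made-up data.

```python
import statistics


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((v - mean) ** 2 for v in data) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]
var_n_minus_1 = sample_variance(data)      # denominator n - 1
var_n = statistics.pvariance(data)         # denominator n
print(var_n_minus_1, var_n)
```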
Variation:
 

In regression, the r-squared measures the percentage of the variation in the dependent variable that is explained by the independent variable(s). In this case, variation is the same thing as the sum of squares (the sum of the squared deviations around the variable's mean).
violation of assumptions:
 

Statistical hypothesis tests generally make assumptions about the population(s) from which the data were sampled. For example, many normal-theory-based tests such as the t test and ANOVA assume that the data are normally distributed (or that the samples are large enough for the central limit theorem to apply), as well as that the variances of the different populations are the same (homoscedasticity). If test assumptions are violated, the test results may not be valid.
within SS:
 

The within sum of squares is the part of the total sum of squares that is attributable to variation within groups or, in a repeated measures design, within subjects. In a repeated measures ANOVA, there will be at least one factor that is measured at each level for every subject. This is a within (repeated measures) factor. For example, in an experiment in which each subject performs the same task twice, trial number is a within factor. There may also be one or more factors that are measured at only one level for each subject, such as gender. This type of factor is a between or grouping factor.
Z-score:
 

the number of standard deviations from the mean. For a value from a normal distribution, the z-score is found by subtracting the mean of the distribution and dividing by the standard deviation. Most commonly used for test statistics, since the z-score can be referred to tables of the standard normal distribution to determine the p-value.
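A short sketch of the calculation, with the table lookup replaced by the standard normal distribution from Python's standard library. The function name and the numbers (a value of 130 from a distribution with mean 100 and standard deviation 15) are made up for the example.

```python
from statistics import NormalDist


def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


z = z_score(130, mean=100, sd=15)
upper_tail = 1.0 - NormalDist().cdf(z)  # chance of a value at least this far above the mean
print(z, upper_tail)
```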