Adjusted R2:
The R2 from a regression that is adjusted for the number of explanatory
variables relative to the size of the sample used. It is NOT the percentage of the
variation explained by the regression, which is the R2 itself.
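The adjustment can be sketched as follows. This is a minimal illustration that assumes the common formula 1 - (1 - R2)(n - 1)/(n - k - 1) for a model with an intercept; the function name `adjusted_r2` and the example numbers are made up for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k explanatory variables (plus an
    intercept) fitted to n observations. Assumes the common formula
    1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical example: R2 = 0.80 from 30 observations and 3 predictors.
print(adjusted_r2(0.80, n=30, k=3))
```

The adjusted value is always at most the unadjusted R2, and the gap widens as more explanatory variables are added for a fixed sample size.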
Alternative hypothesis:
A hypothesis against which the null hypothesis is tested
and which will be held to be true if the null is held false. Often, the alternative
hypothesis represents a new theory the scientist would like to prove. The theory's
scientific status becomes stronger if experiments repeatedly show that the null hypothesis
is untenable.
Analysis of variance (ANOVA):
A technique that tests differences between two or more groups by comparing the variation
between the groups with the variation within them (this is one-way ANOVA). Can be thought
of as an extension to the two-sample t-test, or as linear regression with only categorical
explanatory variables. See also F-test.
Arithmetic mean:
The sum of a set of numbers divided by the number of values. The most often used measure
of central tendency. See also geometric mean.
Assumptions:
Statistical methods such as regression require the data to
satisfy various conditions, for example that the data follow a normal
distribution and are independent. When using such a method,
we assume that these conditions hold: these are the assumptions required for the method to
be valid. It is good statistical practice to check the assumptions as far as possible.
beta weight:
In a multiple regression, the beta weight associated with a particular variable measures
the number of standard deviations that the dependent variable is estimated to change when
the independent variable is changed by one standard deviation, all other things held
constant.
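As a rough sketch of how a beta weight relates to an ordinary (unstandardized) regression coefficient: it can be obtained by rescaling the coefficient by the ratio of the standard deviations of the X variable and of Y. The function name `beta_weight` and the data below are illustrative assumptions.

```python
import statistics


def beta_weight(b, x, y):
    """Convert an unstandardized coefficient b for predictor x into a
    beta weight: the change in y, in standard deviations of y, per one
    standard deviation change in x."""
    return b * statistics.stdev(x) / statistics.stdev(y)


# Made-up data in which y = 2 * x exactly, so the raw coefficient is 2.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(beta_weight(2.0, x, y))
```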
between factor:
In a repeated measures ANOVA, there will be at least one factor that is measured at each
level for every subject. This is a within (repeated measures)
factor. For example, in an experiment in which each subject performs the same task twice,
trial (or trial number) is a within factor. There may also be one or more factors that are
measured at only one level for each subject, such as gender. This type of factor is a
between or grouping factor.
bias:
An estimate for a parameter is unbiased if its expected value is the true value
of the parameter. Otherwise, the estimate is biased. Thus, bias is the difference
between the expected value of the estimate and the true value of the parameter.
Central tendency:
The generalized concept of the "average" value of a distribution.
Typical measures of central tendency are the mean, the median, the mode, and the geometric
mean.
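The four measures named above can be compared on a small data set. This is only a sketch using Python's standard `statistics` module; the data values are made up.

```python
import statistics

data = [1, 2, 2, 4, 8]  # made-up illustrative values

mean = statistics.mean(data)                  # sum divided by count
median = statistics.median(data)              # middle value of the sorted data
mode = statistics.mode(data)                  # most frequent value
geometric = statistics.geometric_mean(data)   # n-th root of the product

print(mean, median, mode, round(geometric, 3))
```

Because the data are skewed to the right, the arithmetic mean is pulled above the median and the geometric mean.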
Coefficient of determination:
A measure of the proportion of variability in the response variable explained by a linear regression model. It is a number between zero and one.
A value close to zero suggests a poor model. Also called R2; it is the square of the
coefficient of multiple correlation, R.
Confidence band:
In regression, a forecast interval or confidence band can be computed for both
the expected value of the dependent variable and
the individual values of the dependent variable. The confidence band will be much larger
for the individual values of the dependent variable. Like a confidence
interval, it is computed for a set probability (like 95%) of including the forecast
value. It is smallest at the mean value(s) of the independent variable(s) and increases in
size as the values for the independent variables used in the forecast deviate from their
means.
Confidence interval:
A random interval that has a set probability of including the true value of a parameter. It defines an interval within which the true population
parameter is likely to lie, and can be thought of as a measure of the precision of a sample statistic.
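A minimal sketch of a confidence interval for a population mean follows. It assumes a normal approximation (for small samples a t-distribution quantile would normally replace the normal quantile); the function name `mean_confidence_interval` and the measurements are illustrative.

```python
import math
import statistics


def mean_confidence_interval(sample, level=0.95):
    """Approximate confidence interval for a population mean.
    Assumption: normal approximation; a t quantile would be more
    appropriate for small samples."""
    n = len(sample)
    centre = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return centre - z * std_error, centre + z * std_error


# Made-up measurements.
low, high = mean_confidence_interval([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])
print(low, high)
```

A narrower interval at the same level indicates a more precise estimate.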
consistent:
An estimator is consistent if it converges in probability to the true value of the
parameter being estimated as the sample size increases toward infinity; that is,
its estimates concentrate ever more tightly around the true value.
correlation:
Correlation is the linear association between two random
variables X and Y. It is usually measured by a correlation coefficient, such as
Pearson's r, such that the value of the coefficient ranges from -1 to 1. A positive
value of r means that the association is positive; i.e., that if X increases, the
value of Y tends to increase linearly, and if X decreases, the value of Y tends to
decrease linearly. A negative value of r means that the association is negative;
i.e., that if X increases, the value of Y tends to decrease linearly, and if X decreases,
the value of Y tends to increase linearly. The larger r is in absolute value, the
stronger the linear association between X and Y. If r is 0, X and Y are said to be
uncorrelated, with no linear association between X and Y. Independent
variables are always uncorrelated, but uncorrelated variables need not be independent.
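The following sketch computes Pearson's r directly from its definition and illustrates the last point: Y = X squared is completely determined by X, yet the two are uncorrelated when X is symmetric about zero. The function name `pearson_r` and the data are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


x = [-2, -1, 0, 1, 2]
print(pearson_r(x, [2 * v + 1 for v in x]))  # perfect positive linear association
print(pearson_r(x, [v ** 2 for v in x]))     # dependent, but no linear association
```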
covariate:
A covariate is a variable that may affect the relationship between two variables of
interest, but is not of intrinsic interest itself. A covariate is often used to control
for variation that is not attributable to the variables under study. A covariate may be a
discrete factor, like a block effect, or it may be a continuous variable, like the X
variable in an analysis of covariance.
curvilinear functions:
A curvilinear function is one whose value, when plotted, will follow a continuous but
not necessarily straight line, such as a polynomial, logistic, or exponential curve.
Degrees of freedom (df):
A parameter which indexes the families of t-distributions
and F-distributions. A t-distribution with many df is similar
to the normal distribution, while one with few df has greater variance.
An F-distribution has degrees of freedom associated with both
the numerator and denominator that make up the F-statistic. The number of
degrees of freedom is computed as the number of unknown quantities (e.g., N) minus the
number of independent equations linking the unknowns.
Dependent variable:
The variable being explained in a regression or analysis of variance. It is assumed in regression that causation flows
from the independent variables to the dependent variable.
distribution function:
A distribution function (also known as the cumulative distribution function) of a
continuous random variable X is a mathematical relation
that gives for each number x, the probability that the value of X is less than or equal to
x. For example, a distribution function of height gives, for each possible value of
height, the probability that the height is less than or equal to that value. For discrete random variables, the distribution function is often given as
the probability associated with each possible discrete value of the random variable; for
instance, the distribution function for a fair coin is that the probability of heads is
0.5 and the probability of tails is 0.5.
distribution-free tests:
Distribution-free tests are tests whose validity under the null hypothesis does not
depend on the population distribution(s)
from which the data have been sampled.
Dummy variable:
A binary variable used in regression in place of a qualitative variable. The number of
dummy variables necessary to replace a qualitative variable with k categories is k-1. The
coefficient for a dummy variable measures the difference of means between the category
represented by that variable and the omitted category.
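A small sketch of the k-1 coding: each category other than a chosen reference (omitted) category gets its own 0/1 column. The function name `dummy_code` and the category labels are hypothetical.

```python
def dummy_code(values, reference):
    """Replace a qualitative variable with k-1 binary columns, one per
    category other than the omitted reference category."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Three categories, so two dummy variables are needed.
levels, rows = dummy_code(["a", "b", "c", "a"], reference="a")
print(levels)
print(rows)
```

Observations in the reference category are coded as all zeros, which is why each dummy coefficient is interpreted relative to that category.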
F-test:
A test used in analysis of variance or regression. The test statistic is the ratio of the variance between groups to the
variance within groups. If there is no difference between the groups (i.e. the null hypothesis is true), this statistic
follows an F distribution. Its expected
value is then close to one (i.e., the variance between groups is about equal to the variance
within groups if the null hypothesis is true).
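For the one-way analysis of variance case, the ratio can be sketched as below. The function name `one_way_f` and the group data are made up; obtaining a P value would additionally require the F distribution with the returned degrees of freedom, which is not shown here.

```python
def one_way_f(groups):
    """F statistic for a one-way analysis of variance: the between-group
    mean square divided by the within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up groups whose means are 2, 3 and 4.
print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```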
Heteroscedasticity:
Having different variance: in a linear regression model,
violation of the assumption of constant variance in the outcome variable (homoscedasticity) is called heteroscedasticity.
homogeneity of variance:
Normal-theory-based tests for the equality of population means, such as the t test and analysis of variance, assume that the
data come from populations that have the same variance, even if
the test rejects the null hypothesis of equality of
population means. If this assumption of homogeneity of variance is not met, the
statistical test results may not be valid. Heteroscedasticity
refers to lack of homogeneity of variances.
Hypothesis test:
The acceptance or rejection of an assertion (the "null
hypothesis") about one or more parameters according to the assertion's
compatibility with the data.
independent:
Two random variables are independent if their joint
probability density is the product of their individual (marginal) probability densities.
Less technically, if two random variables A and B are independent, then the probability of
any given value of A is unchanged by knowledge of the value of B.
Independent variables:
variables that are controlled or considered fixed, that may affect the values taken by dependent variables. Often called "X-variables",
"predictor variables" or "explanatory variables".
interaction:
the condition that the strength of association between two variables depends on the
value of a third, or that the effect of each of two explanatory
variables on a response variable depends on the level of the other explanatory
variable. For example, in a drug experiment involving rats, there would be an interaction
between sex and treatment if the effect of treatment was not the same for males and
females.
Intercept:
The constant in a regression equation; the point where a
regression line intercepts the vertical axis, if the horizontal axis has a true zero
origin.
Least-squares (ordinary least squares):
a method of estimating unknown parameters by minimizing the sum
of squared residuals. The usual method of fitting a linear regression model.
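For a single X variable, the least-squares estimates have a closed form, sketched below. The function name `fit_line` and the data are illustrative.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1*X,
    i.e. the values that minimize the sum of squared residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    if s_xx == 0:
        raise ValueError("x values must not all be equal")
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data lying exactly on Y = 1 + 2*X.
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
b0, b1 = fit_line(x, y)
residuals = [obs - (b0 + b1 * a) for a, obs in zip(x, y)]
print(b0, b1, residuals)
```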
linear functions:
A linear function of one or more X variables is a linear combination of the values of
the variables:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
An X variable in the equation could be a curvilinear function of an observed variable
(e.g., one might measure distance, but think of distance squared as an X variable in the
model, or X2 might be the square of X1), as long as the overall function (Y) remains a sum
of terms that are each an X variable multiplied by a coefficient (i.e., the function Y is
linear in the coefficients). Sometimes, an apparently nonlinear function can be made
linear by a transformation of Y, such as the function
Y = exp(b0 + b1*X1),
which can be made a linear function by taking the logarithm of Y
(log(Y) = b0 + b1*X1),
and then considering log(Y) to be the overall function.
linear logistic model:
A linear logistic model assumes that for each possible set of values for the independent
(X) variables, there is a probability p that an event (success) occurs. Then the
model is that Y is a linear combination of the values of the X variables:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk,
where Y is the logit transformation of the probability p.
linear regression:
In a linear regression, the fitted (predicted) value of the response variable Y is a
linear combination of the values of one or more predictor (X) variables:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
An X variable in the model equation could be a nonlinear function of an observed variable
(e.g., one might observe distance, but use distance squared as an X variable in the model,
or X2 might be the square of X1), as long as the fitted Y remains a sum of terms that are
each an X variable multiplied by a coefficient. The most basic linear regression model is
simple linear regression, which involves one X variable: Y = b0 + b1*X. Multiple linear regression refers to a linear
regression with more than one X variable.
logit:
The logit transformation Y of a probability p of an event
is the logarithm of the ratio between the probability that the event occurs and the
probability that the event does not occur:
Y = log(p/(1-p)).
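A short sketch of the transformation and its inverse, assuming the natural logarithm; the function names `logit` and `inverse_logit` are illustrative.

```python
import math


def logit(p):
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(y):
    """Map a logit value back to a probability."""
    return 1 / (1 + math.exp(-y))


print(logit(0.5))                 # even odds give a logit of zero
print(inverse_logit(logit(0.8)))  # the round trip recovers the probability
```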
Longitudinal:
a study design in which subjects are monitored over a period of time, or are observed on
several occasions over a period.
median:
The median of a distribution is the value X such that the probability of an observation
from the distribution being below X is the same as the probability of the observation
being above X. For a continuous distribution, this is the same as the value X such that
the probability of an observation being less than or equal to X is 0.5.
mode:
The mode of a distribution is the value in the distribution that has the highest
frequency of occurrence.
multicollinearity:
In a multiple regression with more than one X variable, two or
more X variables are collinear if they are nearly linear combinations of each other.
Multicollinearity can make the calculations required for the regression unstable, or even
impossible. It can also produce unexpectedly large estimated standard errors for the
coefficients of the X variables involved. Multicollinearity is also known as collinearity
and ill conditioning.
multiple regression:
Multiple regression refers to a regression model in which the fitted value of the
response variable Y is a function of the values of two or more predictor (X) variables. The
most common form of multiple regression is multiple linear
regression, a linear regression model with more than one
X variable.
Nominal scale:
a scale which consists of categories with no particular ordering (e.g., race). See also ordinal scale.
nonlinear functions:
A nonlinear function is one that is not a linear function, and
cannot be made into a linear function by transforming the Y
variable.
nonlinear regression:
In a nonlinear regression, the fitted (predicted) value of the response variable is a nonlinear function of one or more X variables.
nonparametric tests:
Nonparametric tests are tests that do not make distributional
assumptions, particularly the usual distributional assumptions of the normal-theory based
tests. However, distribution-free tests generally do make some assumptions, such as
equality of population variances.
normal distribution:
The normal or Gaussian distribution is a symmetric distribution
that follows the familiar bell-shaped curve. The distribution is uniquely determined by
its mean and variance. Even when a distribution is nonnormal, the distribution of the
suitably standardized mean of many independent observations from the same distribution
(with finite variance) becomes arbitrarily close to a normal distribution as the number
of observations grows large.
Null hypothesis:
A maintained hypothesis that is held to be true until sufficient evidence to the contrary
is obtained. The hypothesis of no effect, no difference, no relationship etc. See also alternative hypothesis. A scientific theory should be
challenged by conducting tests in which the theory is represented by a null hypothesis. If
it survives such tests, its scientific status is strengthened. Social scientists generally
pose their theory as the alternative hypothesis.
one-sample problem:
In the one-sample problem, an independent random sample is
collected, and then that sample is used to test a hypothesis about the population from which the sample came (e.g., whether the mean of
the population is 0, or any other fixed constant chosen in advance). Paired samples are
usually reduced to a one-sample problem by replacing each pair of responses by the
difference between them (e.g., in a pre-test/post-test experiment, recording the change
from pre-test to post-test).
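The reduction of paired samples to a one-sample problem can be sketched as follows: take the differences, then compute a one-sample t statistic for the hypothesis that the mean difference is zero. The function name `paired_t_statistic` and the scores are made up.

```python
import math
import statistics


def paired_t_statistic(pre, post):
    """Reduce paired data to differences and return the one-sample
    t statistic (null value 0) with its degrees of freedom."""
    differences = [after - before for before, after in zip(pre, post)]
    n = len(differences)
    std_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / std_error, n - 1


# Made-up pre-test and post-test scores for four subjects.
t_stat, df = paired_t_statistic([10, 12, 9, 11], [12, 13, 11, 14])
print(t_stat, df)
```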
Ordinal scale:
A scale whose values are categories that have a natural order but no quantitative
relationship, e.g. {"small", "medium", "large"}.
P value:
In a statistical hypothesis test, the P value is the probability of observing a test
statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. This probability is then compared to
the pre-selected significance level of the test. If the
P value is smaller than the significance level, the null hypothesis is rejected, and the
test result is termed significant.
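As an illustration, here is a two-sided P value for a test statistic that is assumed to follow the standard normal distribution under the null hypothesis (a z test). The function name `two_sided_p_value` and the 0.05 level are illustrative choices.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under a standard normal null distribution, of a
    statistic at least as extreme (in absolute value) as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05  # pre-selected significance level
p = two_sided_p_value(2.5)
print(p, "significant" if p < alpha else "not significant")
```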
Parameter:
a fixed numerical value that describes a particular characteristic of a population (eg mean, proportion, coefficient
from a regression equation). Because we can't make an infinite number of measurements,
population parameters are never known exactly.
Parametric methods:
A group of statistical techniques that make strong assumptions about the distribution of
the outcome variable (eg, that it is normally distributed). See also nonparametric.
pooled estimate of the variance:
The pooled estimate of the variance is a weighted average of each individual sample's variance estimate. When the estimates are all estimates
of the same variance (i.e., when the population variances are
equal), then the pooled estimate is more accurate than any of the individual estimates.
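A sketch of the weighted average, assuming the usual choice of weighting each sample variance by its degrees of freedom (n - 1); the function name `pooled_variance` and the samples are made up.

```python
import statistics


def pooled_variance(samples):
    """Average of the sample variances, each weighted by its degrees
    of freedom (sample size minus one)."""
    weighted_sum = sum((len(s) - 1) * statistics.variance(s) for s in samples)
    total_df = sum(len(s) - 1 for s in samples)
    return weighted_sum / total_df


# Sample variances are 1 and 20/3; the pooled value lies between them.
print(pooled_variance([[1, 2, 3], [2, 4, 6, 8]]))
```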
population:
The population is the universe of all the objects from which a sample
could be drawn for an experiment. If a representative random
sample is chosen, the results of the experiment should be generalizable to the
population from which the sample was drawn, but not necessarily to a larger population.
For example, the results of medical studies on males may not be generalizable for females.
power:
The power of a test is the probability of (correctly) rejecting the null hypothesis when it is in fact false. The power depends on
the significance level (alpha-level) of the test, the
components of the calculation of the test statistic, and on the specific alternative
hypothesis under consideration.
Probability density function:
A mathematical function defined over the range of a continuous random variable such that
the area under the function between two values (for example a and b) is the probability
that the random variable is between a and b. The area under this curve over the entire
range of the random variable is always 1.0.
qualitative variable:
Qualitative variables are variables for which an attribute or classification is measured.
Examples of qualitative variables are gender or state.
quantitative variable:
Quantitative variables are variables for which a numeric value representing an amount is
measured.
R-squared (R2):
The coefficient of determination in regression or analysis of variance. The proportion of
the variation of a dependent variable that can be explained by
variation in the independent variables.
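Given observed and fitted values, this proportion can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares. That form is an assumption that matches "proportion explained" for a least-squares fit with an intercept; the function name `r_squared` and the data are illustrative.

```python
def r_squared(observed, fitted):
    """1 - SS_residual / SS_total for paired observed and fitted values."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_residual / ss_total


observed = [1, 2, 3, 4]
print(r_squared(observed, [1, 2, 3, 4]))          # perfect fit
print(r_squared(observed, [2.5, 2.5, 2.5, 2.5]))  # fitting only the mean explains nothing
```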
random sample:
A random sample of size N is a collection of N objects that are independent and identically distributed.
In a random sample, each member of the population has an equal
chance of becoming part of the sample.
Random variable:
a theoretical quantity that takes different values according to a probability
distribution. Random variables are the primary tools for building statistical models for
data. The data are assumed to behave as if they were observations on random variables.
Regression:
a class of statistical methods in which one dependent variable is related to one or more
independent variables. See also linear regression.
Reliability:
The ability of a scale to yield the same result over repeated trials.
residuals:
A residual is the difference between the observed value of a response measurement and the
value that is fitted under the hypothesized model.
robust:
Robust statistical tests are tests that operate well across a wide variety of distributions. A test can be robust for validity, meaning that it
provides accurate results in the presence of (slight) departures from its assumptions. It
may also be robust for efficiency, meaning that it maintains its statistical power (the
probability that a true violation of the null hypothesis
will be detected by the test) in the presence of those departures.
Sample statistic:
An estimate of a population parameter obtained from a sample. The
value will vary from sample to sample according to the sampling
distribution of the statistic.
significance level:
The significance level (also known as the alpha-level) of a statistical test is the
pre-selected probability of (incorrectly) rejecting the null
hypothesis when it is in fact true. Usually a small value such as 0.05 is chosen. If
the P value calculated for a statistical test is smaller than the
significance level, the null hypothesis is rejected.
skewness:
Skewness is lack of symmetry in a distribution. Data from a
positively skewed (skewed to the right) distribution have values that are bunched together
below the mean, but have a long tail above the mean. (Distributions that are forced to be
positive, such as annual income, tend to be skewed to the right.) Data from a negatively
skewed (skewed to the left) distribution have values that are bunched together above the
mean, but have a long tail below the mean.
Standard error:
The standard deviation of a sampling distribution. That
is, the standard error describes the variability of a sample statistic. It depends on the sample size and the variability of
the individual measurements. For a sample mean, it is typically estimated as the sample
standard deviation divided by the square root of the sample size.
Statistic:
A quantity calculated from data. Statistics that are used to estimate unknown parameters are called estimators: for
example, the mean of a sample is a statistic that is commonly used as an estimator of the
population mean.
t-distribution:
A family of probability distributions similar to the normal distribution,
discovered by W. S. Gosset, who used the alias 'Student'. All t-distributions are
symmetric about zero, but they have different shapes depending on the degrees of
freedom.
t-test:
A test for comparing the mean of one normal distribution with a fixed value, or the means of two normal distributions,
whose variance is not known but must be estimated from the data.
Test-statistic:
A quantity calculated from the sample as part of the process of hypothesis testing. Its
distribution is derived under the assumption that the null hypothesis is true, and its
observed value is compared to that theoretical probability distribution (e.g. the standard
normal) to determine the p-value.
two-sample problem:
In the two-sample problem, two independent random samples are
collected, and then the samples are used to test a hypothesis about the populations from which the samples came (e.g., whether the means of
the two populations are identical).
type I error:
A type I error occurs if, based on the sample data, we decide to reject the null hypothesis when in fact (ie, in the population)
the null hypothesis is true.
type II error:
A type II error occurs if, based on the sample data, we decide not to reject the null hypothesis when in fact (ie, in the population) the null
hypothesis is false.
Variance:
The most frequently used measure of dispersion. The expected squared deviation of a
probability distribution from its mean. For a set of data, the average
squared deviation from the mean, but with a denominator of n-1 rather than n
to avoid bias. See also standard deviation.
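The data version with the n-1 denominator can be sketched as below; the function name `sample_variance` and the data are made up.

```python
def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(data) / n
    return sum((value - mean) ** 2 for value in data) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared deviations sum to 32
print(sample_variance(data))
```

Dividing the same sum of squares by n instead would give a slightly smaller value that underestimates the population variance on average.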
Variation:
In regression the r-squared measures the percentage of the variation in the dependent
variable that is explained by the independent variable(s). In this case, variation is the
same thing as the sum of squares (the sum of the squared deviations around the variable's
mean).
violation of assumptions:
Statistical hypothesis tests generally make assumptions about the population(s)
from which the data were sampled. For example, many
normal-theory-based tests such as the t test and ANOVA assume that
the central limit theorem holds, as well as that the variances of the different
populations are the same (homoscedasticity). If test
assumptions are violated, the test results may not be valid.
within factor:
In a repeated measures ANOVA, there will be at least one factor that
is measured at each level for every subject. This is a within
(repeated measures) factor. For example, in an experiment in which each subject performs
the same task twice, trial number is a within factor. There may also be one or more
factors that are measured at only one level for each subject, such as gender. This type of
factor is a between or grouping factor.
Z-score:
The number of standard deviations from the mean. For a value from a normal
distribution, the z-score is found by subtracting the mean of the
distribution and dividing by the standard deviation. Most commonly used for test statistics, since the z-score can be referred to tables of
the standard normal distribution to determine the p-value.
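A minimal sketch of the calculation; the function name `z_score` and the example values (a score of 130 on a scale with mean 100 and standard deviation 15) are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which value lies above the mean
    (negative if it lies below)."""
    return (value - mean) / std_dev


print(z_score(130, mean=100, std_dev=15))
```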