Glossary | Practical Applications of Statistics in the Social Sciences

ANOVA tests

ANOVA tests are significance tests used to determine whether differences in the mean of a variable across more than two categories are statistically significant. Because ANOVAs compare means, the variable being compared must be continuous, while the variable that defines the groups is categorical. An ANOVA works much like a t test, but is used when the grouping variable has more than two categories.
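
As a rough illustration of what a one-way ANOVA calculates, the sketch below computes the F statistic by hand: the variability between group means divided by the variability within groups. The function name and the exam scores are made up for this example, and a statistics package would also report the p value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)      # number of categories
    n = len(all_values)  # total number of cases
    # Between-group variability: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: distance of each value from its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical exam scores for three made-up groups
scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_anova_f(scores)
print(f_stat)
```

A larger F statistic means the group means differ more than the spread within each group would suggest.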

Bivariate analysis

Bivariate analysis refers to the analysis of a dependent variable and one independent variable in an effort to illuminate a relationship between the two variables. The type of bivariate analysis performed depends on the kind of variables being analysed. When both variables are categorical, analyses like cross tabulations and chi-squared tests can be used. When the dependent variable is continuous and the independent variable is categorical, t tests and ANOVA tests are useful. Both linear and logistic regression, usually referred to as multivariate analyses, can be bivariate if a model is built using a dependent variable and just one independent variable.

Box plot

A box plot (sometimes referred to as a box and whisker plot) represents continuous, numerical data graphically by splitting the data into quartiles, or fourths. The minimum and maximum data points are set at either end of the plot, and the median data point separates the first half of the data from the second half. The first quartile is the median of the lower half of the data, so roughly a quarter of the values fall below it. The third quartile is the median of the upper half of the data, so roughly a quarter of the values fall above it. A box is drawn from the first quartile to the third quartile, with a line inside the box marking the median. Two more lines (the whiskers) are drawn to connect the box to the minimum and maximum. This graph illustrates both the range and the distribution of the data analysed.
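
The five values a box plot is built from can be sketched in a few lines. This example finds the quartiles as the medians of the lower and upper halves of the sorted data; statistical packages use slightly different quartile rules, so their results may differ a little. The function name and the ages are invented for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum).

    Assumes at least two values. Quartiles are the medians of the lower
    and upper halves, leaving out the middle value when the count is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # lower half of the sorted data
    upper = data[-half:]   # upper half of the sorted data
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Made-up ages of nine respondents
ages = [18, 21, 22, 25, 30, 34, 41, 47, 52]
summary = five_number_summary(ages)
print(summary)
```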

Categorical variable

A categorical variable places each response into a distinct category, such as “employed” or “unemployed.” Categorical variables cannot be represented by means, as you can’t find an average religious affiliation or marital status. Descriptive statistics such as frequency tabulation can be used for categorical variables.

Chi Squared test

A chi squared test is used to compare observed results with expected results, where the expected results are those we would see if the variables were unrelated. The purpose of this test is to determine if a difference between observed data and expected data is due to chance, or if it is due to a relationship between the variables under review. Chi squared tests are used when variables are categorical.
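
To make the comparison of observed and expected results concrete, here is a minimal sketch that computes the chi squared statistic for a table of counts. The expected count for each cell is the count we would see if the two variables were unrelated. The function name and the counts are hypothetical.

```python
def chi_squared_statistic(table):
    """Return the chi squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up counts: rows are two groups, columns are two response categories
observed_counts = [[30, 20],
                   [20, 30]]
chi2 = chi_squared_statistic(observed_counts)
print(chi2)
```

The further the observed counts are from the expected counts, the larger the statistic and the less plausible it is that the difference is due to chance.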

Confidence intervals

Confidence intervals allow you to generalize your findings from the sample in your data to the population from which the sample was taken. We can calculate the means of variable values in our dataset, but we cannot say that these sample means are exactly equal to the true means across the population. A confidence interval is a range of values within which the population mean is likely to fall, at a stated level of confidence. The most commonly used levels are 95% and 99%; a 99% confidence interval is wider than a 95% confidence interval calculated from the same data.
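
A confidence interval for a mean can be sketched as the sample mean plus or minus a multiple of its standard error. The example below uses the normal approximation, which is an assumption that suits larger samples; for small samples, statistical packages typically use the t distribution instead. The function name and the heights are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds for the population mean.

    Uses the normal approximation, so it is only a rough guide
    when the sample is small.
    """
    sample_mean = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    # Multiplier for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (sample_mean - z * standard_error, sample_mean + z * standard_error)

# Hypothetical heights in centimetres
heights = [160, 165, 170, 175, 180]
low95, high95 = mean_confidence_interval(heights, 0.95)
low99, high99 = mean_confidence_interval(heights, 0.99)
print(low95, high95)
print(low99, high99)
```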

Continuous variable

A continuous variable is one that can have an infinite number of values. Examples of continuous variables are things such as age, height, income, weight, and exam score. Because these variables could take on any value within a set range, it is possible to calculate means, or averages, of the values.

Crosstabulations

Crosstabulation allows you to summarize the data in categorical variables and examine it to determine if there are any relationships present. SPSS provides cross tabulation tables that show you how many individuals (or cases) are present in each group. For example, if you ran a cross tabulation on gender and income bracket, with gender having two categories (female and male) and income level having five categories (very high, high, average, low, and very low), you would be able to see, for example, how many women have high incomes and how many men have average incomes.
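
Outside SPSS, the idea behind a cross tabulation is simply counting the cases in each combination of categories. The sketch below does this with made-up respondents; the variable names are illustrative only.

```python
from collections import Counter

# Made-up respondents as (gender, income bracket) pairs
respondents = [
    ("female", "high"), ("male", "average"), ("female", "high"),
    ("male", "low"), ("female", "average"), ("male", "average"),
]

# Count how many cases fall into each combination of categories
crosstab = Counter(respondents)

for (gender, income), count in sorted(crosstab.items()):
    print(f"{gender:<8}{income:<10}{count}")
```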

Crime Survey for England and Wales (CSEW)

Previously known as the British Crime Survey, the Crime Survey for England and Wales, or CSEW, has been reporting on individual and household perceptions and experiences of crime since 1981. Prior to 2001, the survey was biennial. In the years since 2001, the survey has been conducted all year round. Although the survey cycles and study name have changed, the format of the survey has remained the same since its inception. Respondents are initially asked to complete a non-victim questionnaire concerning their experiences with crime in the 12 months previous to their participation in the survey. This questionnaire includes topics such as performance of the criminal justice system, plastic card fraud, mass-marketing fraud, perceptions of crime, mobile and bicycle crime, demographics and media, and anti-social behaviour. In addition, the respondents are randomly assigned to one of four additional survey modules, in which they are asked questions specific to the parameters of the module:

1. Module A: experiences of the police
2. Module B: attitudes to the criminal justice system
3. Module C: crime prevention and security
4. Module D: ad-hoc crime topics

If a respondent has been the victim of crime, they are then asked to complete a victim questionnaire covering the emotional, social, and financial consequences of their experience. At the beginning of 2009, general non-victim and specific victim questionnaires for children aged 10-15 were introduced.

Histogram

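A histogram is a graph that shows the distribution of a continuous variable by grouping its values into intervals (bins) and drawing a bar for each interval, with the height of the bar representing the number of cases that fall within it. Histograms are often used to judge whether the distribution of values is approximately normal, meaning that the values are spread symmetrically about the mean, with most values close to the mean and fewer values further away.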
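
The counting behind a histogram can be sketched as follows: each value is assigned to a bin of fixed width, and the number of values in each bin becomes the height of a bar. The function name, the bin width, and the ages are assumptions chosen for this example.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's start."""
    return Counter((v // bin_width) * bin_width for v in values)

# Made-up ages of nine respondents
ages = [18, 21, 22, 25, 30, 34, 41, 47, 52]
width = 10
bins = histogram_counts(ages, width)

# Draw a simple text histogram, one row per bin
for start in sorted(bins):
    print(f"{start}-{start + width - 1}: {'*' * bins[start]}")
```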

Linear regression

Linear regression is a statistical analysis that allows us to model the relationship between two (or more) variables and predict the values of dependent variables. Because linear regression produces predicted numerical values or scores, it is used when the dependent variable is continuous.
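
For a single independent variable, the fitted line can be sketched with the usual least squares formulas for the slope and intercept. The function name and the data (hours of study and exam scores) are invented for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data: hours of study and exam scores
hours = [1, 2, 3, 4, 5]
exam_scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, exam_scores)
# Predict the score of a hypothetical student who studies for 6 hours
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```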

Logistic regression

If the dependent variable is categorical and binary, meaning that the outcome can be either 0 or 1, the binary logistic regression method can help illustrate trends in data. Rather than predicting a numerical score, binary logistic regression predicts the odds that the dependent variable takes a particular value.
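
The link between odds and probability can be sketched as below. A logistic regression model produces log odds, which are converted into a probability between 0 and 1. The function names, intercept, and coefficient here are hypothetical values chosen for illustration, not results from any dataset.

```python
from math import exp

def predicted_probability(intercept, coefficient, x):
    """Convert the log odds from a one-variable logistic model to a probability."""
    log_odds = intercept + coefficient * x
    return 1 / (1 + exp(-log_odds))

def odds(probability):
    """Return the odds that correspond to a probability."""
    return probability / (1 - probability)

# Hypothetical model: intercept of 0 and coefficient of 1
p = predicted_probability(0, 1, 0)
print(p, odds(p))
```

Odds of 1 mean the outcome is as likely to happen as not; odds above 1 mean it is more likely to happen than not.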

Mean

The mean of the numerical values in a data set is the average of all the values in that data set: the sum of the values divided by the number of values. Means can only be calculated for continuous, numerical variables.

Median

The median is the numerical value that splits the data in a variable into halves. The median is often referred to as the “middle” number, as all values lower than the median make up one half of the data and all values higher than the median make up the second half.

Mode

The mode is the value that occurs most often in a data set. Unlike the mean and the median, the mode can also be found for categorical variables.
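
The mean, median, and mode defined above can be compared on the same small data set. This sketch uses Python's standard statistics module with made-up values.

```python
from statistics import mean, median, mode

# Made-up values
values = [2, 3, 3, 5, 7]

average = mean(values)        # sum of the values divided by how many there are
middle = median(values)       # value that splits the sorted data in half
most_common = mode(values)    # value that occurs most often

print(average, middle, most_common)
```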

Multivariate analysis

Multivariate analyses are performed to determine if there is a relationship between a dependent variable and two or more independent variables. The type of multivariate analysis used depends on the research question being asked and the type of variables being analysed. For continuous dependent variables, linear regression analysis can predict values given certain circumstances. For categorical dependent variables, logistic regression can predict the odds of values given certain circumstances.

Normal distribution

Normal distribution refers to the spread of a variable’s data around its mean. The data in a variable is said to have normal distribution when a histogram of the data has a standard bell-shaped curve, with symmetry around the mean. Normal distribution is an important requirement to satisfy when performing t tests and ANOVAs.

Standard deviation

Standard deviation is a measure of how much the data in a variable deviate from the mean; it is the square root of the variance. A small standard deviation indicates that the variable’s data are all close to the mean. A large standard deviation shows that the data are widely spread and far from the mean.
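
To see the difference between a small and a large standard deviation, the sketch below compares two made-up data sets that share the same mean. It uses the sample standard deviation from Python's statistics module.

```python
from statistics import mean, stdev

# Two made-up data sets with the same mean of 50
close_to_mean = [49, 50, 50, 51]
far_from_mean = [20, 40, 60, 80]

small_sd = stdev(close_to_mean)   # values cluster tightly around 50
large_sd = stdev(far_from_mean)   # values are widely spread around 50

print(mean(close_to_mean), small_sd)
print(mean(far_from_mean), large_sd)
```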

Stem and leaf plot

A stem and leaf plot is very similar to a histogram in that it illustrates the distribution of the values in a variable. However, a stem and leaf plot uses the values themselves to graph the spread of the data.
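
A simple stem and leaf plot can be sketched by using the tens digit of each value as the stem and the units digit as the leaf. This example assumes non-negative whole numbers below 100; the function name and the ages are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem), listing units digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

# Made-up ages of nine respondents
ages = [18, 21, 22, 25, 30, 34, 41, 47, 52]
plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each row shows one stem, and the length of the row shows how many values fall in that range, much like the bars of a histogram.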

T test

A t test is used to determine if a difference in means between two groups is significant or is simply due to chance. T tests are used when comparing the mean of a continuous variable across the two categories of a categorical variable.
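
The t statistic compares the difference between two group means with the amount of variation in the groups. The sketch below uses one common variant, Welch's t test, which does not assume the two groups have equal variances; the function name and the data are made up, and a statistics package would also report the p value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(group_a, group_b):
    """Return Welch's t statistic for the difference between two group means."""
    # Standard error of the difference between the two means
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

# Made-up scores for two groups
group_a = [1, 2, 3]
group_b = [3, 4, 5]
t_stat = welch_t_statistic(group_a, group_b)
print(t_stat)
```

The further the t statistic is from zero, the less likely it is that the difference in means is due to chance.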

Univariate analysis

Univariate analysis refers to the descriptive statistical analyses performed on just one variable. These analyses help define the variable and can include things like frequency distributions or mean calculations.

Variance

The variance of a set of data is a measure of how far the data points are spread out from the mean value of the set. It is calculated as the average of the squared differences between each data point and the mean, and the standard deviation is its square root. A large variance shows that individual data points are quite spread out and have large differences both from the mean and each other. A small variance shows that individual data points are closer to the mean.
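
The calculation can be followed step by step in this sketch. It computes the sample variance, which divides the sum of squared differences by one less than the number of values, as is usual for sample data. The function name and the values are made up.

```python
from math import sqrt

def sample_variance(values):
    """Return the sample variance of a list of numbers."""
    m = sum(values) / len(values)
    # Squared difference between each data point and the mean
    squared_differences = [(v - m) ** 2 for v in values]
    # Divide by n - 1 because the data are treated as a sample
    return sum(squared_differences) / (len(values) - 1)

# Made-up values
values = [2, 4, 6]
var = sample_variance(values)
sd = sqrt(var)   # the standard deviation is the square root of the variance
print(var, sd)
```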

Youth Cohort Study of England and Wales, 2004-2007

The Youth Cohort Study of England and Wales (or YCS) began in 1985, using survey questionnaires to collect information about the educational goals and achievements of young people both during and after secondary school. The young people surveyed are drawn from schools all over England and Wales, and are selected for the YCS when they are 16 years old in Year 11. The data in the YCS is collected once a year, with answers from each year’s questionnaire stored as a “sweep,” so that all the information from the first year of a cohort is labelled “Sweep 1,” information from the second year is “Sweep 2,” and so on.

The YCS is broken up into cohorts based on what year the young people started participating in the study. The YCS dataset we’ll be using is from Cohort 12, which began in 2004. This dataset contains variables from the first four sweeps of the survey, and follows the young people from age 16 to age 20.