Running Preliminary Analysis for Multivariate Statistics using SPSS

Before running multivariate statistics (MVS), we have to screen the data by checking the following:

  • The accuracy of the data, by examining descriptive statistics.
  • Whether the underlying assumptions are met.
  • Outliers – extreme cases that can distort the results of an MVS analysis.
  • Multicollinearity and singularity – perfect or near-perfect correlations among variables that can threaten a multivariate analysis.
  • Missing data – the missing pattern is more important than the amount missing. If only a few data points (5% or less – note that this is not a firm guideline; some claim that 10% or less is also ignorable) are missing in a random pattern, the problems are less serious.

First, you should get a dataset for Multivariate Statistics (MVS). It could be raw data, a covariance matrix (S), a correlation matrix (R), or a sum-of-squares and cross-products matrix (SSCP, Q).

In SPSS:

  • Analyze -> Correlate -> Bivariate… -> move the variables of interest to Variables: -> Options… -> select “Cross-product deviations and covariances”.
  • Hit Continue and then hit Paste.

Or you could do it by writing the syntax (below):

CORRELATIONS
  /VARIABLES=Sepal.Length Sepal.Width Petal.Length Petal.Width
  /PRINT=TWOTAIL NOSIG
  /STATISTICS XPROD
  /MISSING=PAIRWISE.

From the resulting table, we can extract the R, S, and Q matrices and describe the pattern of relationships among the variables (direction and magnitude of the correlations).
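To make the relationship between the three matrices concrete, here is a minimal Python sketch that computes the SSCP (Q), covariance (S), and correlation (R) entries from raw data, mirroring what SPSS's "Cross-product deviations and covariances" output contains. The data and variable names are illustrative, not the full iris dataset.

```python
# Compute Q (SSCP), S (covariance), and R (correlation) from raw data.
# Q divided by (n - 1) gives S; S standardized by the SDs gives R.
import math

data = {
    "Sepal.Length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "Sepal.Width":  [3.5, 3.0, 3.2, 3.1, 3.6],
}
names = list(data)
n = len(data[names[0]])
means = {v: sum(data[v]) / n for v in names}

def sscp(a, b):
    # sum of cross-products of deviations: an entry of Q
    return sum((x - means[a]) * (y - means[b])
               for x, y in zip(data[a], data[b]))

Q = {(a, b): sscp(a, b) for a in names for b in names}
S = {(a, b): Q[(a, b)] / (n - 1) for a in names for b in names}   # covariance
R = {(a, b): S[(a, b)] / math.sqrt(S[(a, a)] * S[(b, b)])
     for a in names for b in names}                               # correlation
```

The diagonal of R is 1 and all three matrices are symmetric, which is a quick sanity check on the SPSS output as well.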

Checking Data Accuracy in SPSS

  • Descriptives: Analyze -> Descriptive Statistics -> Descriptives…
  • Frequencies: Analyze -> Descriptive Statistics -> Frequencies…

Checking Univariate Normality using SPSS

  • In SPSS, Analyze -> Descriptive Statistics -> Explore… -> Plots… -> check Histogram under Descriptive -> check Normality plots with tests.

Output:

We want the test of normality to be non-significant. Based on the Kolmogorov-Smirnov test, all variables are significant at p < .001 except Sepal.Length (p = .006). Based on the Shapiro-Wilk test, Petal.Length and Petal.Width are significant (p < .001), indicating problems with normality, while Sepal.Length and Sepal.Width are non-significant.

You can also get the same results by using the syntax (below):

EXAMINE VARIABLES=Sepal.Length Sepal.Width Petal.Length Petal.Width
  /PLOT BOXPLOT HISTOGRAM NPPLOT
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
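As a rough, stdlib-only complement to the K-S and Shapiro-Wilk tests above, sample skewness and excess kurtosis near 0 are consistent with normality. This is only a screening sketch, not a formal test, and the data below are illustrative.

```python
# Rough normality screen: skewness ~ 0 and excess kurtosis ~ 0 are
# consistent with a normal distribution (formal tests: K-S, Shapiro-Wilk).
def moments(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5                # 0 for a symmetric distribution
    excess_kurt = m4 / m2 ** 2 - 3       # 0 for a normal distribution
    return skew, excess_kurt

skew, kurt = moments([1.0, 2.0, 3.0, 4.0, 5.0])   # symmetric toy data -> skew = 0
```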

Outliers

  • An outlier is a case with an extreme value on one variable (univariate outlier) or an abnormal combination of values on several variables (multivariate outlier), which can make it very influential on the results of the analysis.
  • Univariate outliers are cases with associated z scores greater than 3 or smaller than -3.
  • In SPSS, Analyze -> Descriptive Statistics -> Descriptives…
  • In the dialog box, move all four variables to Variable(s): and check Save standardized values as variables.

How to compute Mahalanobis distance in SPSS?

  1. Regress a variable of no interest on all variables (DVs and IVs, treated here as ‘predictors’) that will be of concern in the analysis. (You may use subject ID as the DV in this regression.)
  2. Ask for the Mahalanobis distance to be saved as an additional variable in the original data set. No estimates, standard errors, or tests for this regression are of any interest; only the individual Mahalanobis scores matter.
  3. Flag cases with Mah > the chi-square cut-off, with degrees of freedom equal to the number of variables + 1.
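The quantity SPSS saves is each case's squared Mahalanobis distance from the centroid of the variables. The sketch below computes it directly for two variables, so the 2x2 covariance matrix can be inverted by hand; the data are illustrative.

```python
# Squared Mahalanobis distance D^2 = d' S^{-1} d for each case,
# where d is the case's deviation vector from the variable means
# and S is the sample covariance matrix (denominator n - 1).
xs = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]
ys = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

det = sxx * syy - sxy ** 2                       # determinant of the 2x2 S
d2 = [((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy
       + (y - my) ** 2 * sxx) / det
      for x, y in zip(xs, ys)]                   # one D^2 per case
```

A useful check: with the sample covariance matrix, the D^2 values always sum to (n - 1) × (number of variables).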

Detecting multivariate outliers

  • In SPSS, Analyze -> Regression -> Linear.
  • Move any continuous variable to Dependent: and all relevant variables to Independent(s): -> hit Save… -> check Mahalanobis -> hit Continue and then hit Paste.
  • From the menu, click Transform -> Compute Variable… -> enter a new target variable name, mv_outlier, in Target Variable: -> enter 0 in Numeric Expression: -> hit Paste.
  • Click Transform -> Compute Variable… -> keep mv_outlier in Target Variable: -> hit If… -> check Include if case satisfies condition: -> enter MAH_1 > 20.5150057 -> hit Continue -> enter 1 in Numeric Expression: -> hit Paste.

Syntax:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT ZSepal.Length
  /METHOD=ENTER Sepal.Length Sepal.Width Petal.Length Petal.Width
  /SAVE MAHAL.

COMPUTE mv_outlier=0.
EXECUTE.

IF (MAH_1 > 20.5150057) mv_outlier=1.
EXECUTE.

Which subjects are multivariate outliers?

Those with MAH_1 > the chi-square cut-off.

*Note: the chi-square cut-off can be found in Excel using CHIINV(alpha, df).

In R: qchisq(alpha, df, lower.tail=FALSE)

df = 5 (number of variables = 4, plus 1)
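If neither Excel nor R is at hand, the cut-off can be approximated with the Python standard library via the Wilson-Hilferty formula. This is an approximation (CHIINV / qchisq give the exact value of 20.515 for alpha = .001, df = 5); it is accurate to within about 1% here.

```python
# Wilson-Hilferty approximation to the upper-tail chi-square quantile:
# chi2(alpha, df) ~= df * (1 - 2/(9*df) + z * sqrt(2/(9*df)))^3,
# where z is the upper-tail standard normal quantile.
from statistics import NormalDist

def chi2_cutoff(alpha, df):
    z = NormalDist().inv_cdf(1 - alpha)      # upper-tail normal quantile
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

cutoff = chi2_cutoff(0.001, 5)               # exact value is 20.515
```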

Detecting univariate outliers

  • From the menu, click on Analyze -> Descriptive Statistics -> Descriptives
  • In the dialog box, move all four variables to Variable(s): and check Save standardized values as variables.

DESCRIPTIVES VARIABLES=Sepal.Length Sepal.Width Petal.Length Petal.Width
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.

  • From the menu, click Transform -> Compute Variable… -> enter a new target variable name, out_sepal.length, in Target Variable: -> enter 0 in Numeric Expression: -> hit Paste.
  • Click Transform -> Compute Variable… -> keep out_sepal.length in Target Variable: -> hit If… -> check Include if case satisfies condition: -> move ZSepal.Length over to the box and write the condition ZSepal.Length <= -3 or ZSepal.Length >= 3 -> hit Continue -> enter 1 in Numeric Expression: -> hit Paste.

COMPUTE out_sepal.length=0.

IF (ZSepal.Length <= -3 or ZSepal.Length >= 3) out_sepal.length=1.
EXECUTE.

Repeat the same procedure for all the other variables. New flag variables will be added in the Data View. If a case has a z value less than -3 or greater than 3, it is flagged as an outlier.
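The ZSepal.Length / out_sepal.length steps above amount to standardizing each value and flagging |z| >= 3, which can be sketched in Python as follows. The data are contrived so that one case is clearly extreme.

```python
# Standardize each value and flag |z| >= 3 as a univariate outlier,
# mirroring the ZSepal.Length -> out_sepal.length procedure in SPSS.
from statistics import mean, stdev

values = [5.0] * 20 + [50.0]            # 20 ordinary cases and one extreme case
m, s = mean(values), stdev(values)
z = [(v - m) / s for v in values]
flags = [1 if abs(zi) >= 3 else 0 for zi in z]
```

Note that in very small samples no case can reach |z| = 3 (the maximum possible |z| is (n - 1)/sqrt(n)), so the rule is only useful with a reasonable n.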

Checking missing pattern

  • Analyze -> Missing Value Analysis … -> Move all variables to either Quantitative Variables or Categorical Variables -> Check EM -> Click Patterns… -> Check Cases with missing values, … -> Click Descriptives… -> Check t tests with groups…

DATASET ACTIVATE DataSet3.

MVA VARIABLES=Ozone Solar.R Wind Temp Month Day
  /TTEST PROB PERCENT=5
  /MPATTERN
  /EM(TOLERANCE=0.001 CONVERGENCE=0.0001 ITERATIONS=25).

Description:

  • MVA (Missing Values Analysis).
  • TTEST is requested to see whether missingness is related to any other variables (mean difference on a variable between incomplete and complete cases), with an alpha of .05.
  • The EM subcommand requests a table of correlations and a test of whether the data are missing completely at random (Little’s MCAR test).

Output:

  • The above table shows that there is 1 missing value on ATTHOUSE and 26 missing values on income.

Income has the most missing data: 26 cases (5.6%) of n = 439. ATTHOUSE has 1 missing value (.2%), which could be ignored since it is very small.

  • The above table shows whether missingness is related to other variables. For instance, t(32.2) = .2 shows that missingness on income (n = 26) is not related to the variable timedrs.

From the above table we can see that case #52 has missing data (marked S) on income.

  • Little’s MCAR test shows whether the data are missing completely at random (MCAR). A statistically non-significant result is desired, indicating that the data are MCAR.
  • Little’s MCAR test is reported as a chi-square(df) and a p value (H0: the missing pattern is MCAR).

Write-up:

The results from Little’s MCAR test indicate that chi-square(12) = 19.55, p = .76, is not significant (p > .05), so we fail to reject the null hypothesis, meaning that the missing pattern is MCAR.

Conducting Multiple Imputation (MI):

  • Multiple Imputation (MI) using MCMC algorithms can be done in SPSS.
  • The basic idea is to impute the missing variables one at a time, using the filled-in values from one step as predictors in all subsequent steps. You should decide which variables to include to predict missingness and thus fill in the values for missing cases.
  • SPSS uses linear regression for continuous variables and logistic regression for categorical variables.
  • Before starting, incomplete variables must be defined as nominal or scale.
  • In SPSS, Analyze -> Multiple Imputation -> Impute Missing Data Values…
  • In the Variables tab: move variables to Variables in Model -> select the number of imputations in Imputations (e.g., 20) -> choose a name for the imputed dataset (e.g., imputed).

In the Method tab, choose Custom -> Fully conditional specification (MCMC) -> choose the number of iterations in Maximum iterations (e.g., 10).

  • Imputation output: SPSS stacks the imputed datasets into a single file.
  • Split File: choose the “Compare groups” option -> move Imputation_ to Groups Based on… (splitting the file by Imputation_ lets you run the analysis within each imputation).
  • Run the subsequent analyses using the split dataset. For instance, suppose our interest is to run a regression predicting attdrug (attitudes toward medication) from income and timedrs. The pooled estimates and standard errors appear at the bottom of the table, when available.
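The pooled estimates and standard errors are combined across the m imputed datasets with Rubin's rules; a minimal sketch follows. The per-imputation estimates and SEs below are made up for illustration, not taken from the SPSS output.

```python
# Rubin's rules: pool a coefficient across m imputations.
# Pooled estimate = mean of the per-imputation estimates;
# total variance = within-imputation variance + (1 + 1/m) * between-imputation variance.
import math

estimates = [0.52, 0.47, 0.55, 0.49, 0.51]   # coefficient from each imputation (made up)
ses       = [0.10, 0.11, 0.09, 0.10, 0.10]   # its SE in each imputation (made up)
m = len(estimates)

pooled_est = sum(estimates) / m
within = sum(se ** 2 for se in ses) / m
between = sum((e - pooled_est) ** 2 for e in estimates) / (m - 1)
total_var = within + (1 + 1 / m) * between
pooled_se = math.sqrt(total_var)
```

Because the between-imputation variance is added in, the pooled SE is always at least as large as the average within-imputation SE, reflecting the extra uncertainty due to the missing data.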

Based on the regression results, we can conclude that timedrs is a significant predictor and income is not (p = .709).

From the above table, we can see that the overall model is significant, F(5, 105) = 34.99, p < .05.