Data cleaning

When data are collected, they are often not in a format where they can be easily analyzed. Data management is the process of taking a dataset and converting it into a form that can be used for other things, including descriptives, visualization, and modeling. This process involves cleaning the data, which consists of organizing and polishing the data. This tutorial will introduce some helpful ways to do this.

There are many ways to clean data, all of which are equally valid. Here we’ll cover some older ways, using base R functionality. I prefer these ways because of their backwards compatibility (the code will always work the same way no matter what version of R is used). I also like the way the steps are sequential, allowing me to check the results of the code at each step to make sure it’s doing what I expect it to do. There are newer approaches that are gaining popularity (e.g. dplyr, tidyverse, etc.) that some find easier to use, but those won’t be covered here.

Ihno data

The dataset we’ll be using is known as the Ihno dataset, a hypothetical dataset from (the 4th Edition of) Cohen’s Explaining Psychological Statistics (2013). The dataset is uploaded nearby.

The dataset is stored in a .RData file. This is an R workspace, created by saving the data in an R environment. This format is easy to work with in R, but data can be stored in any number of formats. For spreadsheets, the easiest is the .csv format, but using the Import Dataset tab in RStudio allows us to read many commonly used formats.

To load the data, we can use

load(file.choose())

and click the dataset we want to select. Or we can use the path address on our machine like so

load("./ihnodata.RData")

Using the following code, we can see the dataset has 100 people measured on 18 variables and the following summary statistics.

dim(Ihno)
[1] 100  18
summary(Ihno)
    Sub_num           Gender         Major          Reason    
 Min.   :  1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
 1st Qu.: 25.75   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:1.00  
 Median : 50.50   Median :1.00   Median :2.00   Median :2.00  
 Mean   : 50.50   Mean   :1.43   Mean   :2.52   Mean   :1.89  
 3rd Qu.: 75.25   3rd Qu.:2.00   3rd Qu.:3.25   3rd Qu.:3.00  
 Max.   :100.00   Max.   :2.00   Max.   :5.00   Max.   :3.00  
                                                              
    Exp_cond        Coffee        Num_cups        Phobia     
 Min.   :1.00   Min.   :0.00   Min.   :0.00   Min.   : 0.00  
 1st Qu.:1.75   1st Qu.:0.00   1st Qu.:0.00   1st Qu.: 1.00  
 Median :2.50   Median :0.00   Median :0.00   Median : 3.00  
 Mean   :2.50   Mean   :0.42   Mean   :0.68   Mean   : 3.31  
 3rd Qu.:3.25   3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.: 4.00  
 Max.   :4.00   Max.   :1.00   Max.   :3.00   Max.   :10.00  
                                                             
    Prevmath       Mathquiz        Statquiz        Exp_sqz     
 Min.   :0.00   Min.   : 9.00   Min.   : 1.00   Min.   : 1.00  
 1st Qu.:1.00   1st Qu.:22.00   1st Qu.: 6.00   1st Qu.: 6.00  
 Median :1.00   Median :30.00   Median : 7.00   Median : 7.00  
 Mean   :1.38   Mean   :29.07   Mean   : 6.86   Mean   : 6.83  
 3rd Qu.:2.00   3rd Qu.:35.00   3rd Qu.: 8.00   3rd Qu.: 8.00  
 Max.   :6.00   Max.   :49.00   Max.   :10.00   Max.   :11.00  
                NA's   :15                                     
    Hr_base          Hr_pre         Hr_post        Anx_base    
 Min.   :64.00   Min.   :62.00   Min.   :64.0   Min.   :10.00  
 1st Qu.:70.00   1st Qu.:70.00   1st Qu.:69.0   1st Qu.:16.00  
 Median :72.00   Median :74.00   Median :73.0   Median :18.00  
 Mean   :72.27   Mean   :73.85   Mean   :72.8   Mean   :18.43  
 3rd Qu.:74.00   3rd Qu.:78.00   3rd Qu.:76.0   3rd Qu.:20.00  
 Max.   :80.00   Max.   :87.00   Max.   :86.0   Max.   :39.00  
                                                               
    Anx_pre         Anx_post   
 Min.   : 8.00   Min.   : 9.0  
 1st Qu.:14.00   1st Qu.:16.0  
 Median :19.00   Median :19.0  
 Mean   :19.58   Mean   :19.4  
 3rd Qu.:25.00   3rd Qu.:22.0  
 Max.   :39.00   Max.   :40.0  
                               

Subsetting

Subsetting is the process of selecting a smaller portion of a larger dataset. We’ll start with a one-dimensional array (a vector). Let’s take a vector of numbers 11 through 20

x = 11:20
x
 [1] 11 12 13 14 15 16 17 18 19 20

“x” is a numeric vector of length 10. To find this, we can use the length() function.

length(x)
[1] 10

This tells us that there are 10 elements in “x”. We can access elements using single square brackets.

x[3]
[1] 13

gives us the 3rd element of “x”, in this case 13. Note that the number in the square brackets extracts value in the given position. We can also extract more than 1 value at a time.

x[3:5]
[1] 13 14 15

gives us the 3rd, 4th, and 5th elements of “x”. These need not be in order or consecutive. E.g.

x[c(2,7,3)]
[1] 12 17 13

gives us the 2nd, 7th, and 3rd elements. We can also extract all elements except certain ones.

x[-4]
[1] 11 12 13 15 16 17 18 19 20

gives us all elements except the 4th.

To change an element we simply need to assign a different value to it. E.g. to change the 6th element from 16 to 26, we can write

x[6] = 26
x
 [1] 11 12 13 14 15 26 17 18 19 20

Two-dimensional arrays have two dimensions on which we can subset, i.e. rows and columns. Take the first 6 lines of the Ihno data

head(Ihno)

To access the element in the 5th row and 8th column, we can use the same single bracket syntax, with the row index, a comma, and the column index.

Ihno[5,8]
[1] 10

If we want all rows but only a certain column (or columns), we can leave the row blank.

Ihno[,8]
  [1]  1  1  4  4 10  4  4  4  4  5  5  4  7  4  3  8  4  5  0  4  4  3  4
 [24]  0  1  1  0  4  3  5  9  3  4  1  2  0  3  2  8  4  4  0  3  2  2  1
 [47]  1  5  5  4  4 10  7  3  4  2  2  1  7  5  5  1  3  3  5  1  0  5  6
 [70]  3  5  4  0  2  2  3  1  1  0  2 10  3  3  8  3  3  1  1  2  8  3  0
 [93]  0  4  7  0  0  2  1  2

Likewise if we want all columns for a given row, we can leave the column blank.

Ihno[3,]

If we want the first 3 rows and first 4 columns, we can write

Ihno[1:3,1:4]

Using the rules we learned for vectors, we can extract any subset of our dataset by telling R which combination of rows and columns we’re interested in.

We can also extract variables by name using $. This is often helpful when trying to extract specific variables from data frames.

Ihno$Phobia
  [1]  1  1  4  4 10  4  4  4  4  5  5  4  7  4  3  8  4  5  0  4  4  3  4
 [24]  0  1  1  0  4  3  5  9  3  4  1  2  0  3  2  8  4  4  0  3  2  2  1
 [47]  1  5  5  4  4 10  7  3  4  2  2  1  7  5  5  1  3  3  5  1  0  5  6
 [70]  3  5  4  0  2  2  3  1  1  0  2 10  3  3  8  3  3  1  1  2  8  3  0
 [93]  0  4  7  0  0  2  1  2

extracts the Phobia variable in the Ihno dataset. Notice how this produces the same result as Ihno[,8]. However, when using $ we don’t need to know the position of the variable in the dataset, only its name.

To access or extract elements from a list, we need to use double square brackets. Let’s first create a list and fill it with a data frame and a vector.

list2 = list(Ihno[1:3,1:4], Ihno$Phobia)
list2
[[1]]
  Sub_num Gender Major Reason
1       1      1     1      3
2       2      1     1      2
3       3      1     1      1

[[2]]
  [1]  1  1  4  4 10  4  4  4  4  5  5  4  7  4  3  8  4  5  0  4  4  3  4
 [24]  0  1  1  0  4  3  5  9  3  4  1  2  0  3  2  8  4  4  0  3  2  2  1
 [47]  1  5  5  4  4 10  7  3  4  2  2  1  7  5  5  1  3  3  5  1  0  5  6
 [70]  3  5  4  0  2  2  3  1  1  0  2 10  3  3  8  3  3  1  1  2  8  3  0
 [93]  0  4  7  0  0  2  1  2

To access the data frame within the list, we can write

list2[[1]]

We can also access the elements within the elements of the list in the same way that we did before, by tacking on the appropriate single bracket syntax. So to find the Phobia score of the 5th person, we can write

list2[[2]][5]
[1] 10

since Phobia data are stored in the 2nd element of the list. To get the subject number of the 3rd person we can write

list2[[1]][3,1]
[1] 3

Operators

R has a number of useful operators, both relational and logical. Let’s create 2 vectors.

y = 1:11
y
 [1]  1  2  3  4  5  6  7  8  9 10 11
z = 11:1
z
 [1] 11 10  9  8  7  6  5  4  3  2  1

< compares 2 values and returns TRUE if the former is less than the latter. If given vectors, this comparison is made element by element. In the following comparison, 1 is compared to 11, 2 to 10, 3 to 9, etc.

y < z
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

The first 5 results are TRUE because the first 5 values of “y” are less than the first 5 values of “z”. We can also check if they are less than or equal to

y <= z
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

greater than

y > z
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

or greater than or equal to

y >= z
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

To check if values are equal, we use the logical operator ==

y == z
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

This is because a single = represents the assignment character. We can also take the opposite, by using the NOT operator, !

y != z
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

returns TRUE for each value of “y” that is not equal to the corresponding value of “z”.

Logical operators AND and OR can also be useful. Logical AND (&) returns TRUE if all conditions are met.

(y < 10) & (z < 5)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

is only both true for pairs y=8, z=4 and y=9, z=3.

Logical OR (|, above the \ on a standard keyboard) returns TRUE if either (any) conditions are met.

(y > 10) | (z > 10)
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

returns true for the pairs y=11, z=1 and y=1, z=11, because these are the only pairs where at least one of the values exceeds 10.

The functions all() and any() can also be useful here. all() returns TRUE if all of its elements are TRUE, and any() returns true if any (at least 1) of its elements are TRUE

all(y > 0)
[1] TRUE
all(y > 5)
[1] FALSE
all(y > 12)
[1] FALSE
any(y > 0)
[1] TRUE
any(y > 5)
[1] TRUE
any(y > 12)
[1] FALSE

Missing data

Missing data can be detected using the is.na() function. R denotes missing data with NA. is.na() returns TRUE if a value is NA. Let’s review the Mathquiz scores

Ihno$Mathquiz
  [1] 43 49 26 29 31 20 13 23 38 NA 29 32 18 NA 21 NA 37 37 32 NA 25 22 35
 [24] 47 41 26 39 21 NA 22 21 NA 26 20 30 40 35 10 35 44 26 NA 15 42 33 29
 [47] 39 38 NA 24 NA 26 14 45 28 31 NA 32 15 26 26 32 21 43 NA 34 33 30 46
 [70] NA 33 29 30 34 32 37 NA 31 30 28 14  9 11 30 15 32 22 NA 25 18 11 11
 [93] 37 28 NA 33 28 38 41 39

We can find the NA values using

is.na(Ihno$Mathquiz)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [12] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [45] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
 [56] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [67] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[100] FALSE

We can find their positions using

which(is.na(Ihno$Mathquiz))
 [1] 10 14 16 20 29 32 42 49 51 57 65 70 77 88 95

and remove them using

Ihno1 = Ihno[-which(is.na(Ihno$Mathquiz)),]
dim(Ihno)
[1] 100  18
dim(Ihno1)
[1] 85 18

The new Ihno1 dataset now has only 85 people.

We can also change values to NA. Returning to the Phobia data

Ihno$Phobia
  [1]  1  1  4  4 10  4  4  4  4  5  5  4  7  4  3  8  4  5  0  4  4  3  4
 [24]  0  1  1  0  4  3  5  9  3  4  1  2  0  3  2  8  4  4  0  3  2  2  1
 [47]  1  5  5  4  4 10  7  3  4  2  2  1  7  5  5  1  3  3  5  1  0  5  6
 [70]  3  5  4  0  2  2  3  1  1  0  2 10  3  3  8  3  3  1  1  2  8  3  0
 [93]  0  4  7  0  0  2  1  2

we notice that scores range from 0 to 10. But what if the scale is only supposed to go from 1 to 10? What if the 0s were marked by coders to designate missing values? We can edit them using the tools we’ve learned.

Ihno$Phobia[Ihno$Phobia==0] = NA
Ihno$Phobia
  [1]  1  1  4  4 10  4  4  4  4  5  5  4  7  4  3  8  4  5 NA  4  4  3  4
 [24] NA  1  1 NA  4  3  5  9  3  4  1  2 NA  3  2  8  4  4 NA  3  2  2  1
 [47]  1  5  5  4  4 10  7  3  4  2  2  1  7  5  5  1  3  3  5  1 NA  5  6
 [70]  3  5  4 NA  2  2  3  1  1 NA  2 10  3  3  8  3  3  1  1  2  8  3 NA
 [93] NA  4  7 NA NA  2  1  2

This line finds the Phobia values of Phobia that are marked 0 and assigns NA to them.

These kinds of edits may be easier to do by hand in a spreadsheet, but only up to a certain point. The larger the dataset, the more impractical doing it by hand becomes. Once the sample size gets into the hundreds, managing data using code like we’ve done here becomes the only option!