When data are collected, they are often not in a format where they can be easily analyzed. Data management is the process of taking a dataset and converting it into a form that can be used for other things, including descriptives, visualization, and modeling. This process involves cleaning the data, which consists of organizing and polishing the data. This tutorial will introduce some helpful ways to do this.
There are many ways to clean data, all of which are equally valid. Here we’ll cover some older ways, using base R functionality. I prefer these ways because of their backwards compatibility (the code will always work the same way no matter what version of R is used). I also like the way the steps are sequential, allowing me to check the results of the code at each step to make sure it’s doing what I expect it to do. There are newer approaches that are gaining popularity (e.g. dplyr, tidyverse, etc.) that some find easier to use, but those won’t be covered here.
The dataset we’ll be using is known as the Ihno dataset, a hypothetical dataset from (the 4th Edition of) Cohen’s Explaining Psychological Statistics (2013). The dataset is uploaded nearby.
The dataset is stored in a .RData file. This is an R workspace, created by saving the data in an R environment. This format is easy to work with in R, but data can be stored in any number of formats. For spreadsheets, the easiest is the .csv format, but using the Import Dataset tab in RStudio allows us to read many commonly used formats.
To load the data, we can use
load(file.choose())
and click the dataset we want to select. Or we can use the path address on our machine like so
load("./ihnodata.RData")
Using the following code, we can see the dataset has 100 people measured on 18 variables and the following summary statistics.
dim(Ihno)
[1] 100 18
summary(Ihno)
Sub_num Gender Major Reason
Min. : 1.00 Min. :1.00 Min. :1.00 Min. :1.00
1st Qu.: 25.75 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.00
Median : 50.50 Median :1.00 Median :2.00 Median :2.00
Mean : 50.50 Mean :1.43 Mean :2.52 Mean :1.89
3rd Qu.: 75.25 3rd Qu.:2.00 3rd Qu.:3.25 3rd Qu.:3.00
Max. :100.00 Max. :2.00 Max. :5.00 Max. :3.00
Exp_cond Coffee Num_cups Phobia
Min. :1.00 Min. :0.00 Min. :0.00 Min. : 0.00
1st Qu.:1.75 1st Qu.:0.00 1st Qu.:0.00 1st Qu.: 1.00
Median :2.50 Median :0.00 Median :0.00 Median : 3.00
Mean :2.50 Mean :0.42 Mean :0.68 Mean : 3.31
3rd Qu.:3.25 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.: 4.00
Max. :4.00 Max. :1.00 Max. :3.00 Max. :10.00
Prevmath Mathquiz Statquiz Exp_sqz
Min. :0.00 Min. : 9.00 Min. : 1.00 Min. : 1.00
1st Qu.:1.00 1st Qu.:22.00 1st Qu.: 6.00 1st Qu.: 6.00
Median :1.00 Median :30.00 Median : 7.00 Median : 7.00
Mean :1.38 Mean :29.07 Mean : 6.86 Mean : 6.83
3rd Qu.:2.00 3rd Qu.:35.00 3rd Qu.: 8.00 3rd Qu.: 8.00
Max. :6.00 Max. :49.00 Max. :10.00 Max. :11.00
NA's :15
Hr_base Hr_pre Hr_post Anx_base
Min. :64.00 Min. :62.00 Min. :64.0 Min. :10.00
1st Qu.:70.00 1st Qu.:70.00 1st Qu.:69.0 1st Qu.:16.00
Median :72.00 Median :74.00 Median :73.0 Median :18.00
Mean :72.27 Mean :73.85 Mean :72.8 Mean :18.43
3rd Qu.:74.00 3rd Qu.:78.00 3rd Qu.:76.0 3rd Qu.:20.00
Max. :80.00 Max. :87.00 Max. :86.0 Max. :39.00
Anx_pre Anx_post
Min. : 8.00 Min. : 9.0
1st Qu.:14.00 1st Qu.:16.0
Median :19.00 Median :19.0
Mean :19.58 Mean :19.4
3rd Qu.:25.00 3rd Qu.:22.0
Max. :39.00 Max. :40.0
Subsetting is the process of selecting a smaller portion of a larger dataset. We’ll start with a one-dimensional array (a vector). Let’s take a vector of numbers 11 through 20
x = 11:20
x
[1] 11 12 13 14 15 16 17 18 19 20
“x” is a numeric vector of length 10. To find this, we can use the length() function.
length(x)
[1] 10
This tells us that there are 10 elements in “x”. We can access elements using single square brackets.
x[3]
[1] 13
gives us the 3rd element of “x”, in this case 13. Note that the number in the square brackets extracts value in the given position. We can also extract more than 1 value at a time.
x[3:5]
[1] 13 14 15
gives us the 3rd, 4th, and 5th elements of “x”. These need not be in order or consecutive. E.g.
x[c(2,7,3)]
[1] 12 17 13
gives us the 2nd, 7th, and 3rd elements. We can also extract all elements except certain ones.
x[-4]
[1] 11 12 13 15 16 17 18 19 20
gives us all elements except the 4th.
To change an element we simply need to assign a different value to it. E.g. to change the 6th element from 16 to 26, we can write
x[6] = 26
x
[1] 11 12 13 14 15 26 17 18 19 20
Two-dimensional arrays have two dimensions on which we can subset, i.e. rows and columns. Take the first 6 lines of the Ihno data
head(Ihno)
To access the element in the 5th row and 8th column, we can use the same single bracket syntax, with the row index, a comma, and the column index.
Ihno[5,8]
[1] 10
If we want all rows but only a certain column (or columns), we can leave the row blank.
Ihno[,8]
[1] 1 1 4 4 10 4 4 4 4 5 5 4 7 4 3 8 4 5 0 4 4 3 4
[24] 0 1 1 0 4 3 5 9 3 4 1 2 0 3 2 8 4 4 0 3 2 2 1
[47] 1 5 5 4 4 10 7 3 4 2 2 1 7 5 5 1 3 3 5 1 0 5 6
[70] 3 5 4 0 2 2 3 1 1 0 2 10 3 3 8 3 3 1 1 2 8 3 0
[93] 0 4 7 0 0 2 1 2
Likewise if we want all columns for a given row, we can leave the column blank.
Ihno[3,]
If we want the first 3 rows and first 4 columns, we can write
Ihno[1:3,1:4]
Using the rules we learned for vectors, we can extract any subset of our dataset by telling R which combination of rows and columns we’re interested in.
We can also extract variables by name using $. This is often helpful when trying to extract specific variables from data frames.
Ihno$Phobia
[1] 1 1 4 4 10 4 4 4 4 5 5 4 7 4 3 8 4 5 0 4 4 3 4
[24] 0 1 1 0 4 3 5 9 3 4 1 2 0 3 2 8 4 4 0 3 2 2 1
[47] 1 5 5 4 4 10 7 3 4 2 2 1 7 5 5 1 3 3 5 1 0 5 6
[70] 3 5 4 0 2 2 3 1 1 0 2 10 3 3 8 3 3 1 1 2 8 3 0
[93] 0 4 7 0 0 2 1 2
extracts the Phobia variable in the Ihno dataset. Notice how this produces the same result as Ihno[,8]. However, when using $ we don’t need to know the position of the variable in the dataset, only its name.
To access or extract elements from a list, we need to use double square brackets. Let’s first create a list and fill it with a data frame and a vector.
list2 = list(Ihno[1:3,1:4], Ihno$Phobia)
list2
[[1]]
Sub_num Gender Major Reason
1 1 1 1 3
2 2 1 1 2
3 3 1 1 1
[[2]]
[1] 1 1 4 4 10 4 4 4 4 5 5 4 7 4 3 8 4 5 0 4 4 3 4
[24] 0 1 1 0 4 3 5 9 3 4 1 2 0 3 2 8 4 4 0 3 2 2 1
[47] 1 5 5 4 4 10 7 3 4 2 2 1 7 5 5 1 3 3 5 1 0 5 6
[70] 3 5 4 0 2 2 3 1 1 0 2 10 3 3 8 3 3 1 1 2 8 3 0
[93] 0 4 7 0 0 2 1 2
To access the data frame within the list, we can write
list2[[1]]
We can also access the elements within the elements of the list in the same way that we did before, by tacking on the appropriate single bracket syntax. So to find the Phobia score of the 5th person, we can write
list2[[2]][5]
[1] 10
since Phobia data are stored in the 2nd element of the list. To get the subject number of the 3rd person we can write
list2[[1]][3,1]
[1] 3
Sometimes we don’t know the positions of the values we’re looking for. Sometimes they’re based on the values themselves. We can use R to search for the values we want by setting conditions. Take the Phobia scores
Ihno$Phobia
[1] 1 1 4 4 10 4 4 4 4 5 5 4 7 4 3 8 4 5 0 4 4 3 4
[24] 0 1 1 0 4 3 5 9 3 4 1 2 0 3 2 8 4 4 0 3 2 2 1
[47] 1 5 5 4 4 10 7 3 4 2 2 1 7 5 5 1 3 3 5 1 0 5 6
[70] 3 5 4 0 2 2 3 1 1 0 2 10 3 3 8 3 3 1 1 2 8 3 0
[93] 0 4 7 0 0 2 1 2
If we want to see which Phobia scores are greater than 8, we can write
Ihno$Phobia > 8
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE
If we want to select these values, we can use this condition to subset
Ihno$Phobia[Ihno$Phobia > 8]
[1] 10 9 10 10
prints the Phobia scores for which the previous line of code yielded TRUE values. In this case, it prints the values of the Phobia scores greater than 8.
To find the positions of these values, we can use the which() function. The which() function returns the indices of the values, instead of the values themselves.
which(Ihno$Phobia > 8)
[1] 5 31 52 81
tells us that the 5th, 31st, 52nd, and 81st values are all above 8. If we subset just these values we find the same results as before.
Ihno$Phobia[c(5, 31, 52, 81)]
[1] 10 9 10 10
We can use which() to select a subset of people to analyze
Ihno[which(Ihno$Phobia > 8),]
This returns the scores on all variables for those people with Phobia above 8. An alternative but equivalent way to do this is to use the subset() function
subset(Ihno, Phobia > 8)
R has a number of useful operators, both relational and logical. Let’s create 2 vectors.
y = 1:11
y
[1] 1 2 3 4 5 6 7 8 9 10 11
z = 11:1
z
[1] 11 10 9 8 7 6 5 4 3 2 1
< compares 2 values and returns TRUE if the former is less than the latter. If given vectors, this comparison is made element by element. In the following comparison, 1 is compared to 11, 2 to 10, 3 to 9, etc.
y < z
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The first 5 results are TRUE because the first 5 values of “y” are less than the first 5 values of “z”. We can also check if they are less than or equal to
y <= z
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
greater than
y > z
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
or greater than or equal to
y >= z
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
To check if values are equal, we use the logical operator ==
y == z
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
This is because a single = represents the assignment character. We can also take the opposite, by using the NOT operator, !
y != z
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
returns TRUE for each value of “y” that is not equal to the corresponding value of “z”.
Logical operators AND and OR can also be useful. Logical AND (&) returns TRUE if all conditions are met.
(y < 10) & (z < 5)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
is only both true for pairs y=8, z=4 and y=9, z=3.
Logical OR (|, above the \ on a standard keyboard) returns TRUE if either (any) conditions are met.
(y > 10) | (z > 10)
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
returns true for the pairs y=11, z=1 and y=1, z=11, because these are the only pairs where at least one of the values exceeds 10.
The functions all() and any() can also be useful here. all() returns TRUE if all of its elements are TRUE, and any() returns true if any (at least 1) of its elements are TRUE
all(y > 0)
[1] TRUE
all(y > 5)
[1] FALSE
all(y > 12)
[1] FALSE
any(y > 0)
[1] TRUE
any(y > 5)
[1] TRUE
any(y > 12)
[1] FALSE
Missing data can be detected using the is.na() function. R denotes missing data with NA. is.na() returns TRUE if a value is NA. Let’s review the Mathquiz scores
Ihno$Mathquiz
[1] 43 49 26 29 31 20 13 23 38 NA 29 32 18 NA 21 NA 37 37 32 NA 25 22 35
[24] 47 41 26 39 21 NA 22 21 NA 26 20 30 40 35 10 35 44 26 NA 15 42 33 29
[47] 39 38 NA 24 NA 26 14 45 28 31 NA 32 15 26 26 32 21 43 NA 34 33 30 46
[70] NA 33 29 30 34 32 37 NA 31 30 28 14 9 11 30 15 32 22 NA 25 18 11 11
[93] 37 28 NA 33 28 38 41 39
We can find the NA values using
is.na(Ihno$Mathquiz)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[12] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[45] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[56] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[67] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[89] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[100] FALSE
We can find their positions using
which(is.na(Ihno$Mathquiz))
[1] 10 14 16 20 29 32 42 49 51 57 65 70 77 88 95
and remove them using
Ihno1 = Ihno[-which(is.na(Ihno$Mathquiz)),]
dim(Ihno)
[1] 100 18
dim(Ihno1)
[1] 85 18
The new Ihno1 dataset now has only 85 people.
We can also change values to NA. Returning to the Phobia data
Ihno$Phobia
[1] 1 1 4 4 10 4 4 4 4 5 5 4 7 4 3 8 4 5 0 4 4 3 4
[24] 0 1 1 0 4 3 5 9 3 4 1 2 0 3 2 8 4 4 0 3 2 2 1
[47] 1 5 5 4 4 10 7 3 4 2 2 1 7 5 5 1 3 3 5 1 0 5 6
[70] 3 5 4 0 2 2 3 1 1 0 2 10 3 3 8 3 3 1 1 2 8 3 0
[93] 0 4 7 0 0 2 1 2
we notice that scores range from 0 to 10. But what if the scale is only supposed to go from 1 to 10? What if the 0s were marked by coders to designate missing values? We can edit them using the tools we’ve learned.
Ihno$Phobia[Ihno$Phobia==0] = NA
Ihno$Phobia
[1] 1 1 4 4 10 4 4 4 4 5 5 4 7 4 3 8 4 5 NA 4 4 3 4
[24] NA 1 1 NA 4 3 5 9 3 4 1 2 NA 3 2 8 4 4 NA 3 2 2 1
[47] 1 5 5 4 4 10 7 3 4 2 2 1 7 5 5 1 3 3 5 1 NA 5 6
[70] 3 5 4 NA 2 2 3 1 1 NA 2 10 3 3 8 3 3 1 1 2 8 3 NA
[93] NA 4 7 NA NA 2 1 2
This line finds the Phobia values of Phobia that are marked 0 and assigns NA to them.
These kinds of edits may be easier to do by hand in a spreadsheet, but only up to a certain point. The larger the dataset, the more impractical doing it by hand becomes. Once the sample size gets into the hundreds, managing data using code like we’ve done here becomes the only option!