3 Data handling
This chapter covers the basics of data handling in R.
3.1 Basic data handling
3.1.1 Creating objects
Anything created in R is an object. You can assign values to objects using the assignment operator <-.
Note that comments may be included in the code after a #. The text after # is not evaluated when the code is run; comments can be written directly after the code or on a separate line.
To see the value of an object, simply type its name into the console and hit enter:
## [1] "hello world"
You can also explicitly tell R to print the value of an object:
## [1] "hello world"
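The code behind the two outputs above is not shown here. A minimal sketch that is consistent with them could look as follows; the object name x is taken from the surrounding text, while the comments are only illustrative.

```r
x <- "hello world" # assigns the text "hello world" to the object x
# a comment can also be written on a separate line
x        # typing the name of the object prints its value
print(x) # explicitly prints the value of x
```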
Note that because we assign characters in this case (as opposed to, e.g., numeric values), we need to wrap the words in quotation marks, which must always come in pairs. Although RStudio automatically adds the closing quotation mark when you type the opening one, you may still end up with a mismatch by accident (e.g., x <- "hello). In this case, R will show you the continuation character “+”. The same happens if you accidentally execute only part of a command. The “+” means that R is expecting more input. If this happens, either add the missing quotation mark, or press ESCAPE to abort the expression and try again.
To change the value of an object, you can simply overwrite the previous value. For example, you could also assign a numeric value to “x” to perform some basic operations:
## [1] 2
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
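The commands that produced the five outputs above are not shown. One set of commands that yields exactly these values is sketched below; the specific comparisons are an assumption, chosen only because they reproduce the output.

```r
x <- 2   # overwrites the previous value of x with the number 2
print(x) # [1] 2
x == 2   # is x equal to 2?      [1] TRUE
x != 3   # is x not equal to 3?  [1] TRUE
x < 3    # is x smaller than 3?  [1] TRUE
x > 3    # is x larger than 3?   [1] FALSE
```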
Note that the name of the object is completely arbitrary. We could also define a second object “y”, assign it a different value and use it to perform some basic mathematical operations:
y <- 5 #assigns the value of 5 to the object y
x == y #checks whether the value of x is equal to the value of y
## [1] FALSE
## [1] 10
## [1] 7
## [1] 31
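The expressions behind the last three outputs are not shown. With x equal to 2 and y equal to 5, the following sketch reproduces all four values; the last expression is an assumption, since several different expressions evaluate to 31.

```r
x <- 2
y <- 5
x == y      # [1] FALSE
x * y       # multiplication:            [1] 10
x + y       # addition:                  [1] 7
y^2 + 3 * x # y squared plus 3 times x:  [1] 31
```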
Object names
Please note that object names must start with a letter and can only contain letters, numbers, and the separators . and _. It is important to give your objects descriptive names and to be as consistent as possible with the naming structure. In this tutorial we will be using lower case words separated by underscores (e.g., object_name). There are other naming conventions, such as using a . as a separator (e.g., object.name), or using upper case letters (objectName). It doesn’t really matter which one you choose, as long as you are consistent.
3.1.2 Data types
The most important types of data are:
Data type | Description |
---|---|
Numeric | Approximations of the real numbers, \(\normalsize\mathbb{R}\) (e.g., mileage a car gets: 23.6, 20.9, etc.) |
Integer | Whole numbers, \(\normalsize\mathbb{Z}\) (e.g., number of sales: 7, 0, 120, 63, etc.) |
Character | Text data (strings, e.g., product names) |
Factor | Categorical data for classification (e.g., product groups) |
Logical | TRUE, FALSE |
Date | Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.) |
Variables can be converted from one type to another using the appropriate functions (e.g., as.numeric(), as.integer(), as.character(), as.factor(), as.logical(), as.Date()). For example, we could convert the object y to character as follows:
## [1] "5"
Notice how the value is in quotation marks since it is now of type character.
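The conversion code itself is not shown above. A minimal sketch, assuming y still holds the value 5 from the previous section, might be:

```r
y <- 5
class(y)             # [1] "numeric"
y <- as.character(y) # converts y to type character
print(y)             # [1] "5"
class(y)             # [1] "character"
```

Converting back with as.numeric(y) would return the number 5 again.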
Entering a vector of data into R can be done with the c(x1, x2, ..., x_n) (“concatenate”) command. In order to be able to use our vector (or any other variable) later on, we assign it a name using the assignment operator <-. You can choose names arbitrarily (but the first character of a name cannot be a number). Just make sure they are descriptive and unique. Assigning the same name to two variables (e.g., vectors) overwrites the first one.
Instead of converting a variable, we can also create a new one and use an existing one as input. In this case we omit the as. and simply use the name of the type (e.g., factor()). There is a subtle difference between the two: when converting a variable with, e.g., as.factor(), we can only pass the variable we want to convert without additional arguments, and R determines the factor levels from the existing unique values in the variable, or just returns the variable itself if it is a factor already. When we specifically create a variable (just factor(), matrix(), etc.), we can and should set the options of this type explicitly. For a factor variable these could be the labels and levels, for a matrix the number of rows and columns, and so on.
#Numeric:
top10_sales <- c(163608, 126687, 120480, 110022, 108630, 95639, 94690, 89011, 87869, 85599)
#Character:
top10_product_names <- c("Bio-Kaisersemmel", "Laktosefreie Bio-Vollmilch", "Ottakringer Helles", "Milka Ganze Haselnüsse", "Bio-Toastkäse Scheiben", "Ottakringer Bio Zwickl", "Vienna Coffee House Espresso", "Bio-Mozzarella", "Basmati Reis", "Bananen") # Characters have to be put in ""
#Factor variable with two categories:
private_label <- c(1,1,0,0,1,0,0,1,1,1)
private_label <- factor(private_label,
                        levels = 0:1,
                        labels = c("national brand", "private label"))
#Factor variable with more than two categories:
top10_brand <- c("Ja! Natürlich", "Ja! Natürlich", "Ottakringer", "Milka", "Ja! Natürlich", "Ottakringer", "Julius Meinl", "Ja! Natürlich", "Billa Bio", "Clever")
top10_brand <- as.factor(top10_brand)
#Date:
date_most_sold <- as.Date(c("2023-05-24", "2023-06-23", "2023-09-01", "2023-06-30", "2023-05-05", "2023-06-09", "2023-07-14", "2023-06-16", "2023-05-18", "2023-05-19"))
#Logical
private_label_logical <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE)
In order to “return” a vector we can now simply enter its name:
## [1] 163608 126687 120480 110022 108630 95639 94690 89011 87869 85599
## [1] "2023-05-24" "2023-06-23" "2023-09-01" "2023-06-30" "2023-05-05"
## [6] "2023-06-09" "2023-07-14" "2023-06-16" "2023-05-18" "2023-05-19"
In order to check the type of a variable, the class() function is used.
## [1] "Date"
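The calls that produced the outputs above are not shown. As a self-contained sketch, the two hypothetical vectors below (shortened stand-ins for top10_sales and date_most_sold) illustrate printing a vector by name and checking its type:

```r
# Hypothetical, shortened stand-ins for the vectors defined above
demo_sales <- c(163608, 126687, 120480)
demo_dates <- as.Date(c("2023-05-24", "2023-06-23", "2023-09-01"))

demo_sales        # entering the name returns the vector
demo_dates
class(demo_sales) # [1] "numeric"
class(demo_dates) # [1] "Date"
```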
3.1.3 Data structures
Now let’s create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., Matrix, Vector, List, Array). In this course, we will mainly use data frames.
Data frames are similar to matrices but are more flexible in the sense that they may contain different data types (e.g., numeric, character, etc.), whereas all values of a vector or matrix have to be of the same type (e.g., character). It is often more convenient to use characters instead of numbers (e.g., when indicating a person’s sex: “F”, “M” instead of 1 for female, 2 for male). Thus we would like to combine both numeric and character values while retaining the respective desired features. This is where data frames come into play: each column can hold a different type of data. For example, we can combine the vectors created above in one data frame using data.frame(). This creates a separate column for each vector, which is usually what we want (similar to SPSS or Excel).
sales_data <- data.frame(top10_sales,
                         top10_product_names,
                         private_label,
                         top10_brand,
                         date_most_sold,
                         private_label_logical)
3.1.3.1 Accessing data in data frames
When entering the name of a data frame, R returns the entire data frame:
Hint: You may also use the View() function to view the data in a table format (like in SPSS or Excel), i.e., enter the command View(data). Note that you can achieve the same by clicking on the small table icon next to the data frame in the “Environment” window on the right in RStudio.
Sometimes it is convenient to return only specific values instead of the entire data frame. There are a variety of ways to identify the elements of a data frame. One easy way is to explicitly state which rows and columns you wish to view. The general form of the command is data.frame[rows,columns]. By leaving one of the arguments of data.frame[rows,columns] blank (e.g., data.frame[rows,]), we tell R that we want to access all rows or all columns, respectively. Note that a:b (where a and b are numbers and a < b) is shorthand notation for seq(from = a, to = b, by = 1). Here are some examples:
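The examples themselves are not reproduced here. The following sketch illustrates the [rows,columns] notation on a small hypothetical data frame (demo_df and its columns are made up for illustration and are not the sales_data object from the text):

```r
# Small hypothetical data frame used only for illustration
demo_df <- data.frame(
  product = c("A", "B", "C", "D"),
  sales = c(40, 10, 30, 20),
  private_label = c(TRUE, FALSE, TRUE, FALSE)
)

demo_df[2, ]          # second row, all columns
demo_df[, 2]          # all rows, second column (returned as a vector)
demo_df[1:3, ]        # rows 1 to 3, all columns
demo_df[c(1, 4), 1:2] # rows 1 and 4, columns 1 and 2
demo_df[2, "sales"]   # single value, column selected by name: [1] 10

all(1:3 == seq(from = 1, to = 3, by = 1)) # [1] TRUE
```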
Typically we don’t want to remember which row or column number is needed but use names and conditions instead (e.g., all private label products). To make that easier, we will add more functions to R by installing a package (sometimes also referred to as a “library”) called tidyverse. We only have to install it once (per computer), and subsequently we can load the functions the package provides by calling library(tidyverse). Typically library(PACKAGENAME) has to be called again whenever you restart R/RStudio. If you see the error message could not find function ..., make sure you have loaded all the required packages. The tidyverse provides us with convenient tools to manipulate data.
You may create subsets of the data frame based on, e.g., logical expressions using the filter() function:
library(tidyverse)
filter(sales_data, private_label == "private label") # show only products that belong to private labels
filter(sales_data, top10_product_names == 'Bio-Kaisersemmel') # returns all observations from product "Bio-Kaisersemmel"
private_labels <- filter(sales_data, private_label == "private label") # creates a new data.frame by assigning only observations belonging to private labels
You may also change the order of the rows in a data.frame by using the arrange() function.
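As a minimal sketch of sorting by a single variable, the example below uses a hypothetical data frame demo_df (not part of the original text) and loads dplyr, which is part of the tidyverse:

```r
library(dplyr) # provides arrange() and desc()

# Hypothetical data frame used only for illustration
demo_df <- data.frame(product = c("A", "B", "C"),
                      sales = c(20, 40, 10))

arrange(demo_df, sales)       # ascending by sales (the default)
arrange(demo_df, desc(sales)) # descending by sales
```

Note that arrange() returns a sorted copy; demo_df itself is unchanged unless the result is assigned back to it.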
You can order observations by several characteristics. Please note that the order of variables in the arrange() function specifies the order in which the data set is arranged. For example, here we first arrange the observations by the brand of the product, and only then by sales:
# Arrange by brand (ascending by default) and then sales (descending: most - least)
arrange(sales_data, top10_brand, desc(top10_sales))
3.1.3.2 Inspecting the content of a data frame
The head() function displays the first elements/rows (six by default) of a vector, matrix, table, data frame or function.
The tail() function is similar, except it displays the last elements/rows.
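A short sketch of both functions, using a made-up vector demo_vec and the mtcars data set that ships with R:

```r
demo_vec <- 1:20       # hypothetical example vector

head(demo_vec)         # first six elements (the default)
head(demo_vec, n = 3)  # [1] 1 2 3
tail(demo_vec, n = 3)  # [1] 18 19 20
head(mtcars, n = 2)    # first two rows of a built-in data frame
```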
names() returns the names of an R object. When, for example, it is called on a data frame, it returns the names of the columns.
## [1] "top10_sales" "top10_product_names" "private_label"
## [4] "top10_brand" "date_most_sold" "private_label_logical"
str() displays the internal structure of an R object. In the case of a data frame, it returns the class (e.g., numeric, factor, etc.) of each variable, as well as the number of observations and the number of variables.
## 'data.frame': 10 obs. of 6 variables:
## $ top10_sales : num 163608 126687 120480 110022 108630 ...
## $ top10_product_names : chr "Bio-Kaisersemmel" "Laktosefreie Bio-Vollmilch" "Ottakringer Helles" "Milka Ganze Haselnüsse" ...
## $ private_label : Factor w/ 2 levels "national brand",..: 2 2 1 1 2 1 1 2 2 2
## $ top10_brand : Factor w/ 6 levels "Billa Bio","Clever",..: 3 3 6 5 3 6 4 3 1 2
## $ date_most_sold : Date, format: "2023-05-24" "2023-06-23" ...
## $ private_label_logical: logi TRUE TRUE FALSE FALSE TRUE FALSE ...
nrow() and ncol() return the number of rows and columns of a data frame or matrix, respectively. dim() displays the dimensions of an R object.
## [1] 10
## [1] 6
## [1] 10 6
ls() can be used to list all objects that are associated with an R object. For a data frame, these are its column names, returned in alphabetical order.
## [1] "date_most_sold" "private_label" "private_label_logical"
## [4] "top10_brand" "top10_product_names" "top10_sales"
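The calls behind the outputs in this section are not shown. The sketch below applies the same inspection functions to a small hypothetical data frame (demo_df is made up for illustration):

```r
# Hypothetical data frame; columns deliberately not in alphabetical order
demo_df <- data.frame(sales = c(20, 40, 10),
                      product = c("A", "B", "C"))

names(demo_df) # [1] "sales"   "product"
str(demo_df)   # class of each column plus number of observations/variables
nrow(demo_df)  # [1] 3
ncol(demo_df)  # [1] 2
dim(demo_df)   # [1] 3 2
ls(demo_df)    # [1] "product" "sales"   (sorted alphabetically)
```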
3.1.3.3 Select, group, append and delete variables to/from data frames
To return a single column in a data frame, use the $ notation. For example, this returns all values associated with the variable “top10_sales”:
## [1] 163608 126687 120480 110022 108630 95639 94690 89011 87869 85599
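As a self-contained sketch of the $ notation, using a hypothetical data frame demo_df rather than sales_data:

```r
# Hypothetical data frame used only for illustration
demo_df <- data.frame(product = c("A", "B", "C"),
                      sales = c(20, 40, 10))

demo_df$sales      # returns the column as a vector: [1] 20 40 10
sum(demo_df$sales) # the vector can be passed to other functions: [1] 70
```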
If you want to select more than one variable, you can use the select function. It takes the data.frame containing the data as its first argument and the variables that you need after it:
select can also be used to remove columns by prepending a - to their name:
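The select() examples are not reproduced above. A minimal sketch of both uses, on a hypothetical data frame demo_df and with dplyr loaded (part of the tidyverse), could be:

```r
library(dplyr) # provides select()

# Hypothetical data frame used only for illustration
demo_df <- data.frame(product = c("A", "B", "C"),
                      sales = c(20, 40, 10),
                      brand = c("X", "Y", "X"))

select(demo_df, product, sales) # keep only the listed columns
select(demo_df, -brand)         # drop the column brand, keep the rest
```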
Assume that you wanted to add an additional variable to the data frame. You may use the $ notation to achieve this:
# Create new variable as the log of sales
sales_data$log_sales <- log(sales_data$top10_sales)
# Create an ascending count variable which might serve as an ID
sales_data$obs_number <- 1:nrow(sales_data)
head(sales_data)
In order to add functions (e.g., log) of existing variables to the data.frame, use mutate. Multiple commands can be chained using so-called pipes, operators that can be read as “then”. Since R version 4.1 there are native pipes (|>) as well as the ones provided by the tidyverse (%>%):
music_data_new <- mutate(sales_data,
                         sqrt_sales = sqrt(top10_sales),
                         # "%m" extracts the month, format returns a character so we convert it to integer
                         most_sales_month = as.integer(format(date_most_sold, "%m"))
                         ) %>%
  select(top10_product_names, sqrt_sales, most_sales_month)
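The example above uses the tidyverse pipe. For comparison, here is a small sketch of the native pipe, which needs no package but requires R 4.1 or later; the vector demo_values is made up for illustration.

```r
demo_values <- c(4, 9, 16) # hypothetical example values

# Nested call: has to be read from the inside out
sum(sqrt(demo_values))        # [1] 9

# Native pipe: "take demo_values, then take the square root, then sum"
demo_values |> sqrt() |> sum() # [1] 9
```

Both versions compute the same result; the piped version simply lists the steps in the order in which they are carried out.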
Two other important tidyverse functions help calculate summary statistics, such as totals, averages, etc. Using the group_by function, we can ask R to treat observations group by group (e.g., by brand, label, date, …) and then obtain the values of interest by calling summarize:
sales_total <- sales_data %>%
  group_by(top10_brand) %>%
  summarize(total_sales = sum(top10_sales), avg_sales = mean(top10_sales))
head(sales_total)
You can also rename variables in a data frame, e.g., using the rename() function. In the following code “::” signifies that the function “rename” should be taken from the package “dplyr” (note: this package is part of the tidyverse). This can be useful if multiple packages have a function with the same name. Calling a function this way also means that you can access a function without loading the entire package via library().
Note that the same can be achieved using:
Or by referring to the index of the variable:
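The code for the three renaming variants is not shown above. The following sketch illustrates each of them on a hypothetical data frame; demo_df and all column names are assumptions made for illustration.

```r
# Hypothetical data frame used only for illustration
demo_df <- data.frame(product = c("A", "B"),
                      sales = c(20, 40))

# 1) dplyr::rename(data, new_name = old_name); "::" avoids calling library(dplyr)
demo_df <- dplyr::rename(demo_df, product_name = product)

# 2) Base R: pick the name to replace by matching it
names(demo_df)[names(demo_df) == "sales"] <- "units_sold"

# 3) Base R: refer to the index (position) of the variable
names(demo_df)[1] <- "item"

names(demo_df) # [1] "item"       "units_sold"
```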
3.2 Data import and export
Before you can start your analysis in R, you first need to import the data you wish to perform the analysis on. You will often be faced with different types of data formats (usually produced by some other statistical software like SPSS or Excel or a text editor). Fortunately, R is fairly flexible with respect to the sources from which data may be imported and you can import the most common data formats into R with the help of a few packages. R can, among others, handle data from text files, spreadsheets, and other statistical software packages.
In the previous section, we saw how we may use the keyboard to input data in R. In the following sections, we will learn how to import data from text files and other statistical software packages.
3.2.1 Getting data for this course
Most of the data sets we will be working with in this course will be stored in text files (i.e., .dat, .txt, .csv). All data sets we will be working with are stored in a repository on GitHub (similar to other cloud storage services such as Dropbox). You can directly import these data sets from GitHub without having to copy data sets from one place to another. If you know the location where the files are stored, you may conveniently load the data directly from GitHub into R using the read.csv() function. To figure out the structure of the data, you can read the first couple of lines of a file using the readLines function. The header=TRUE argument in the read.csv function indicates that the first line of data represents the header, i.e., it contains the names of the columns. The sep=";" argument specifies the delimiter (the character used to separate the columns), which is a “;” in this case.
## [1] "\"isrc\";\"artist_id\";\"streams\";\"weeks_in_charts\";\"n_regions\";\"danceability\";\"energy\";\"speechiness\";\"instrumentalness\";\"liveness\";\"valence\";\"tempo\";\"song_length\";\"song_age\";\"explicit\";\"n_playlists\";\"sp_popularity\";\"youtube_views\";\"tiktok_counts\";\"ins_followers_artist\";\"monthly_listeners_artist\";\"playlist_total_reach_artist\";\"sp_fans_artist\";\"shazam_counts\";\"artistName\";\"trackName\";\"release_date\";\"genre\";\"label\";\"top10\";\"expert_rating\""
## [2] "\"BRRGE1603547\";3679;11944813;141;1;50,9;80,3;4;0,05;46,3;65,1;166,018;3,11865;228,285714285714;0;450;51;145030723;9740;29613108;4133393;24286416;3308630;73100;\"Luan Santana\";\"Eu, Você, O Mar e Ela\";\"2016-06-20\";\"other\";\"Independent\";1;\"excellent\""
## [3] "\"USUM71808193\";5239;8934097;51;21;35,3;75,5;73,3;0;39;43,7;191,153;3,228;144,285714285714;0;768;54;13188411;358700;3693566;18367363;143384531;465412;588550;\"Alessia Cara\";\"Growing Pains\";\"2018-06-14\";\"Pop\";\"Universal Music\";0;\"good\""
test_data <- read.csv("https://short.wu.ac.at/ma22_musicdata",
                      sep = ";",
                      header = TRUE,
                      dec = ",") # the raw lines shown above use "," as the decimal mark
head(test_data)
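To try the same steps without an internet connection, the sketch below writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents and the names demo_file and demo_data are made up for illustration. Since the sample values use a decimal comma, dec = "," is passed so that they are read as numbers.

```r
# Create a small hypothetical text file in a temporary location
demo_file <- tempfile(fileext = ".csv")
writeLines(c('"track";"streams";"energy"',
             '"Song A";100;80,3',
             '"Song B";250;75,5'),
           demo_file)

# Peek at the raw lines to find out the delimiter and the decimal mark
readLines(demo_file, n = 2)

# Import: first line is the header, ";" separates columns, "," is the decimal mark
demo_data <- read.csv(demo_file, sep = ";", header = TRUE, dec = ",")
str(demo_data)
```

Without dec = ",", the column energy would be imported as character rather than numeric.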
Note that it is also possible to download the data, place it in the working directory and import it from there. However, this requires the additional step of downloading the file manually first. If you choose this option, please remember to put the data file in the working directory first. If the import is not working, check your working directory setting using getwd(). Once you have placed the file in the working directory, you can import it using the same command as above, replacing the URL with the file name. Note that the file name must be given as a character string (i.e., in quotation marks) and has to end with the file extension (e.g., .csv, .tsv, etc.).
3.2.2 Export data
Exporting to different formats is also easy, as you can just replace “read” with “write” in many of the previously discussed functions (e.g., write.csv(object, "file_name")). This will save the data file to the working directory. To check what the current working directory is, you can use getwd(). By default, the write.csv(object, "file_name") function includes the row names (the row numbers, unless you set them yourself) as the first column. By specifying row.names = FALSE, you may exclude this column since it usually doesn’t contain any useful information.
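A minimal sketch of exporting and re-importing a data frame is shown below. The data frame demo_df and the file name are hypothetical, and the file is written to R's temporary directory rather than the working directory so that the example leaves no files behind.

```r
# Hypothetical data frame used only for illustration
demo_df <- data.frame(student = c("Max", "Saskia"),
                      grade = c(3, 1))

# Write to the temporary directory; row.names = FALSE omits the row numbers
demo_file <- file.path(tempdir(), "demo_export.csv")
write.csv(demo_df, demo_file, row.names = FALSE)

readLines(demo_file)           # raw content of the exported file
demo_back <- read.csv(demo_file) # read the exported file back in
demo_back
```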
Learning check
(LC3.1) Which of the following data types are recognized by R?
(LC3.2) What function should you use to check if an object is a data frame?
(LC3.3) You would like to combine three vectors (student, grade, date) in a data frame. What would happen when executing the following code?
student <- c("Max", "Jonas", "Saskia", "Victoria")
grade <- c(3, 2, 1, 2)
date <- as.Date(c("2020-10-06", "2020-10-08", "2020-10-09"))
df <- data.frame(student, grade, date)
You would like to analyze the following data frame
(LC3.4) How can you obtain Christina’s grade from the data frame?
(LC3.5) How can you add a new variable ‘student_id’ to the data frame that assigns numbers to students in an ascending order?
(LC3.6) How could you obtain all rows with students who obtained a 1?
(LC3.7) How could you create a subset of observations where the grade is not missing (NA)?
(LC3.8) What is the share of students with a grade better than 3?
(LC3.9) You would like to load a .csv file from your working directory. What function would you use to do it?
(LC3.10) After you loaded the file, you would like to inspect the types of data contained in it. How would you do it?