R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated.

# Read CSV files <- Everything after a # is a comment and not evaluated
library(tidyverse) # Load a library that provides more functions
charts <- read_csv("data/charts_global_at.csv") # Read the data; '<-' is the assignment operator; notice that 'charts' now appears in the Environment pane on the right
## Rows: 292600 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): trackID, trackName, artistName, artistIds, region, isrc, primary_...
## dbl  (19): rank, streams, dayNumber, explicit, trackPopularity, n_available_...
## date  (2): day, releaseDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
charts # View some of the data

R-Code

R code is contained in so-called “chunks”. These chunks always start with three backticks ` and r in curly braces ({r}) and end with three backticks. Optionally, parameters can be added after the r to influence how a chunk behaves. You can also give each chunk a name. Note that chunk names have to be unique; otherwise R will refuse to knit your document. A new code chunk can also be added by using the shortcut Ctrl+Alt+i (Strg+Alt+i on a German keyboard).

Chunk options

You can suppress messages and warnings by adding message=FALSE, warning=FALSE to the chunk header like so:

```{r charts_no_messages, message=FALSE, warning=FALSE}
# Read CSV files
library(tidyverse)
charts <- read_csv("charts_global_at.csv")
charts
```
# Read CSV files
library(tidyverse)
charts <- read_csv("charts_global_at.csv")
charts

In addition, you can hide the code itself using echo=FALSE (similar to slides):

```{r charts_no_code, message=FALSE, warning=FALSE, echo=FALSE}
# Read CSV files
library(tidyverse)
charts <- read_csv("charts_global_at.csv")
charts
```

All those options can be set using the knitr::opts_chunk$set(...) function that is already included in every new document, e.g.,

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
```

In this chunk we see the option include=FALSE which means “run the code but do not show the code or any output”. Perfect for setting preferences the audience does not need to be aware of! Another useful setting is scipen. It is not a chunk option but an R option that is set with options(), and it controls when numbers are printed in scientific notation. By default, large numbers are printed like so:

x <- 100000000000
x
## [1] 1e+11

Changing scipen to something large will tell R to print zeros instead. Can you hide the following code chunk?

options(scipen = 99999)
x
## [1] 100000000000
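
If you prefer not to change scipen for the whole session, a single value can also be formatted without scientific notation. The following is a minimal sketch using base R's format(); the variable name x_plain is only illustrative.

```r
x <- 100000000000
# format() returns a character string; scientific = FALSE suppresses the e-notation
x_plain <- format(x, scientific = FALSE)
x_plain
# big.mark adds thousands separators for readability
x_pretty <- format(x, scientific = FALSE, big.mark = ",")
x_pretty
```

Note that the result is text, not a number, so this is meant for printing and labels rather than for further calculations.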

We will see more chunk options related to visualizations later.

Adding text

R Markdown combines R code and its output with text. You can use this, for example, for your thesis.

Headings

Usually you want to include some kind of heading to structure your text. A heading is created using # signs. A single # creates a first level heading, two ## a second level and so on.

# First level heading

First level heading

## Second level heading

Second level heading

##### Fifth level heading

Fifth level heading

It is important to note here that the # symbol means something different within the code chunks as opposed to outside of them. If you continue to put a # in front of all your regular text, it will all be interpreted as a first level heading, making your text very large.

Lists

Bullet point lists are created using *, + or -. Sub-items are created by indenting the item using 4 spaces or 2 tabs (with RStudio's default settings, one tab inserts 2 spaces).

* First Item
* Second Item
    + first sub-item
        - first sub-sub-item
    + second sub-item
  • First Item
  • Second Item
    • first sub-item
      • first sub-sub-item
    • second sub-item

Ordered lists can be created using numbers and letters. If you need sub-sub-items use A) instead of A. on the third level.

1. First item
    a. first sub-item
        A) first sub-sub-item 
    b. second sub-item
2. Second item
  1. First item
    1. first sub-item
      1. first sub-sub-item
    2. second sub-item
  2. Second item

Text formatting

Text can be formatted in italics (*italics*) or bold (**bold**). In addition, you can add block quotes with >

> Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.

Lorem ipsum dolor amet chillwave lomo ramps, four loko green juice messenger bag raclette forage offal shoreditch chartreuse austin. Slow-carb poutine meggings swag blog, pop-up salvia taxidermy bushwick freegan ugh poke.

Exploring data

library(skimr)
skim(charts)
Data summary
Name charts
Number of rows 292600
Number of columns 34
_______________________
Column type frequency:
character 13
Date 2
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
trackID 0 1.00 22 22 0 5516 0
trackName 0 1.00 1 191 0 4619 0
artistName 0 1.00 2 329 0 2587 0
artistIds 0 1.00 22 436 0 2579 0
region 0 1.00 2 6 0 2 0
isrc 0 1.00 4 12 0 4713 0
primary_artistName 0 1.00 2 39 0 1131 0
primary_artistID 0 1.00 22 22 0 1127 0
artistIDs 0 1.00 22 436 0 2565 0
albumName 0 1.00 1 191 0 3178 0
albumID 0 1.00 22 22 0 3405 0
available_markets 10989 0.96 2 366 0 257 0
releaseDate_precision 0 1.00 3 5 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
day 0 1 2019-01-01 2021-01-01 2020-01-02 732
releaseDate 0 1 1942-01-01 2021-01-01 2019-06-28 860

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1 100.50 57.73 1.00 50.75 100.50 150.25 200.00 ▇▇▇▇▇
streams 0 1 635436.32 872930.82 2491.00 6386.00 552818.00 950352.00 17223237.00 ▇▁▁▁▁
dayNumber 0 1 1092.54 211.38 727.00 909.00 1093.00 1276.00 1458.00 ▇▇▇▇▇
explicit 0 1 0.38 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
trackPopularity 0 1 75.95 15.08 0.00 69.00 80.00 86.00 100.00 ▁▁▂▇▇
n_available_markets 0 1 71.80 24.36 0.00 78.00 79.00 79.00 92.00 ▁▁▁▁▇
danceability 27 1 0.70 0.13 0.13 0.63 0.72 0.80 0.98 ▁▁▃▇▃
energy 27 1 0.64 0.16 0.00 0.54 0.65 0.75 1.00 ▁▁▅▇▂
key 27 1 5.51 3.59 0.00 2.00 6.00 8.00 11.00 ▇▂▅▅▇
loudness 27 1 -6.28 2.35 -43.99 -7.34 -5.96 -4.76 1.51 ▁▁▁▂▇
mode 27 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
speechiness 27 1 0.13 0.11 0.02 0.05 0.08 0.18 0.94 ▇▂▁▁▁
acousticness 27 1 0.25 0.23 0.00 0.06 0.17 0.36 0.99 ▇▃▂▁▁
instrumentalness 27 1 0.01 0.06 0.00 0.00 0.00 0.00 0.96 ▇▁▁▁▁
liveness 27 1 0.17 0.13 0.02 0.09 0.12 0.19 0.96 ▇▂▁▁▁
valence 27 1 0.51 0.22 0.03 0.34 0.51 0.67 0.98 ▂▆▇▆▃
tempo 27 1 120.58 28.54 45.78 97.06 119.10 139.96 216.33 ▁▇▇▃▁
duration_ms 27 1 195400.99 38215.08 30133.00 170560.00 192172.00 214290.00 943529.00 ▇▃▁▁▁
time_signature 27 1 3.98 0.30 1.00 4.00 4.00 4.00 5.00 ▁▁▁▇▁
str(charts)
## spec_tbl_df [292,600 × 34] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ trackID              : chr [1:292600] "003VDDA7J3Xb2ZFlNx7nIZ" "003VDDA7J3Xb2ZFlNx7nIZ" "003vvx7Niy0yvhvHt4a68B" "003vvx7Niy0yvhvHt4a68B" ...
##  $ rank                 : num [1:292600] 108 168 197 194 194 193 193 158 158 195 ...
##  $ streams              : num [1:292600] 982766 747972 4574 4784 4403 ...
##  $ trackName            : chr [1:292600] "YELL OH" "YELL OH" "Mr. Brightside" "Mr. Brightside" ...
##  $ artistName           : chr [1:292600] "Trippie Redd feat. Young Thug" "Trippie Redd feat. Young Thug" "The Killers" "The Killers" ...
##  $ artistIds            : chr [1:292600] "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "0C0XlULifJtAgn6ZNCW2eu" "0C0XlULifJtAgn6ZNCW2eu" ...
##  $ day                  : Date[1:292600], format: "2020-02-07" "2020-02-08" ...
##  $ dayNumber            : num [1:292600] 1129 1130 1333 1375 1393 ...
##  $ region               : chr [1:292600] "global" "global" "at" "at" ...
##  $ isrc                 : chr [1:292600] "QZJ842000061" "QZJ842000061" "USIR20400274" "USIR20400274" ...
##  $ explicit             : num [1:292600] 1 1 0 0 0 0 0 0 0 0 ...
##  $ trackPopularity      : num [1:292600] 75 75 13 13 13 13 13 13 13 13 ...
##  $ primary_artistName   : chr [1:292600] "Trippie Redd" "Trippie Redd" "The Killers" "The Killers" ...
##  $ primary_artistID     : chr [1:292600] "6Xgp2XMz1fhVYe7i6yNAax" "6Xgp2XMz1fhVYe7i6yNAax" "0C0XlULifJtAgn6ZNCW2eu" "0C0XlULifJtAgn6ZNCW2eu" ...
##  $ artistIDs            : chr [1:292600] "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "6Xgp2XMz1fhVYe7i6yNAax,50co4Is1HCEo8bhOyUWKpn" "0C0XlULifJtAgn6ZNCW2eu" "0C0XlULifJtAgn6ZNCW2eu" ...
##  $ albumName            : chr [1:292600] "YELL OH" "YELL OH" "Hot Fuss" "Hot Fuss" ...
##  $ albumID              : chr [1:292600] "2orYogfKeURqyS1hRP1vZ4" "2orYogfKeURqyS1hRP1vZ4" "4piJq7R3gjUOxnYs6lDCTg" "4piJq7R3gjUOxnYs6lDCTg" ...
##  $ available_markets    : chr [1:292600] "AD, AE, AR, AT, AU, BE, BG, BH, BO, BR, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES, FI, FR, GB,"| __truncated__ "AD, AE, AR, AT, AU, BE, BG, BH, BO, BR, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES, FI, FR, GB,"| __truncated__ "AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, BR, BY, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES,"| __truncated__ "AD, AE, AL, AR, AT, AU, BA, BE, BG, BH, BO, BR, BY, CA, CH, CL, CO, CR, CY, CZ, DE, DK, DO, DZ, EC, EE, EG, ES,"| __truncated__ ...
##  $ n_available_markets  : num [1:292600] 79 79 92 92 92 92 92 92 92 92 ...
##  $ releaseDate          : Date[1:292600], format: "2020-02-07" "2020-02-07" ...
##  $ releaseDate_precision: chr [1:292600] "day" "day" "year" "year" ...
##  $ danceability         : num [1:292600] 0.842 0.842 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.352 ...
##  $ energy               : num [1:292600] 0.578 0.578 0.911 0.911 0.911 0.911 0.911 0.911 0.911 0.911 ...
##  $ key                  : num [1:292600] 6 6 1 1 1 1 1 1 1 1 ...
##  $ loudness             : num [1:292600] -6.05 -6.05 -5.23 -5.23 -5.23 -5.23 -5.23 -5.23 -5.23 -5.23 ...
##  $ mode                 : num [1:292600] 0 0 1 1 1 1 1 1 1 1 ...
##  $ speechiness          : num [1:292600] 0.138 0.138 0.0747 0.0747 0.0747 0.0747 0.0747 0.0747 0.0747 0.0747 ...
##  $ acousticness         : num [1:292600] 0.0042 0.0042 0.0012 0.0012 0.0012 0.0012 0.0012 0.0012 0.0012 0.0012 ...
##  $ instrumentalness     : num [1:292600] 0 0 0 0 0 0 0 0 0 0 ...
##  $ liveness             : num [1:292600] 0.228 0.228 0.0995 0.0995 0.0995 0.0995 0.0995 0.0995 0.0995 0.0995 ...
##  $ valence              : num [1:292600] 0.19 0.19 0.236 0.236 0.236 0.236 0.236 0.236 0.236 0.236 ...
##  $ tempo                : num [1:292600] 74.5 74.5 148 148 148 ...
##  $ duration_ms          : num [1:292600] 236779 236779 222973 222973 222973 ...
##  $ time_signature       : num [1:292600] 4 4 4 4 4 4 4 4 4 4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trackID = col_character(),
##   ..   rank = col_double(),
##   ..   streams = col_double(),
##   ..   trackName = col_character(),
##   ..   artistName = col_character(),
##   ..   artistIds = col_character(),
##   ..   day = col_date(format = ""),
##   ..   dayNumber = col_double(),
##   ..   region = col_character(),
##   ..   isrc = col_character(),
##   ..   explicit = col_double(),
##   ..   trackPopularity = col_double(),
##   ..   primary_artistName = col_character(),
##   ..   primary_artistID = col_character(),
##   ..   artistIDs = col_character(),
##   ..   albumName = col_character(),
##   ..   albumID = col_character(),
##   ..   available_markets = col_character(),
##   ..   n_available_markets = col_double(),
##   ..   releaseDate = col_date(format = ""),
##   ..   releaseDate_precision = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   key = col_double(),
##   ..   loudness = col_double(),
##   ..   mode = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   liveness = col_double(),
##   ..   valence = col_double(),
##   ..   tempo = col_double(),
##   ..   duration_ms = col_double(),
##   ..   time_signature = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(charts)

Functions

When analyzing data in R, you will access most of the functionalities by calling functions. A function is a piece of code written to carry out a specified task (e.g., the skim(charts)-function to get an overview of charts). It may or may not accept arguments or parameters and it may or may not return one or more values. Functions are generally called like this:

function_name(argument1 = value1, argument2 = value2)

Functions have a default order of arguments which allows us to omit the argument name and write

function_name(value1, value2)

if we know the correct order. The easiest way to learn about a function is to look at the help file using ?function_name. Try it out:

?skim

However, this will only work for loaded packages (after calling library(skimr)). If you are not sure which (installed) package provides a function try ??function_name, e.g.,

??nnet
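
To make the difference between named and positional arguments concrete, here is a small sketch using the base R functions seq() and round(). The choice of these two functions is only for illustration.

```r
# Named arguments: the order does not matter
by_name <- seq(from = 1, to = 10, by = 3)
shuffled <- seq(by = 3, to = 10, from = 1)
# Positional arguments: values are matched in the default order (from, to, by)
by_position <- seq(1, 10, 3)
by_position
# Mixing both is common: positional for the first argument, names for the rest
rounded <- round(3.14159, digits = 2)
rounded
```

Naming the arguments is a bit more typing but makes the call easier to read, especially for functions with many arguments.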

Many packages also come with companion websites and so-called vignettes (by the way, you can add hyperlinks to your R Markdown documents using [text](https://my-link.html)).

You can also define your own functions to reuse some operations

add_one <- function(x){
  new_value <- x + 1 # intermediate variables in functions are not saved in your environment. Check!
  return(new_value)
}
add_one(5)
## [1] 6
add_one(67)
## [1] 68
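
The comment in add_one() invites you to check that intermediate variables are not saved in your environment. One way to do that, sketched here under the assumption that you have not created a variable called new_value yourself, is the base R function exists():

```r
add_one <- function(x) {
  new_value <- x + 1  # exists only while the function runs
  return(new_value)
}
result <- add_one(5)
result
# new_value was never created outside the function
# (assuming you have not defined a variable with that name yourself)
exists("new_value")
```

Only the returned value, stored here in result, is available after the function call.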

Of course we can do more interesting stuff like converting temperatures:

\[ ^{\circ}\mathbf{C} = (^{\circ}\mathbf{F} - 32) \times 5/9 \]

FtoC <- function(temperature_f){
  return((temperature_f - 32) * 5/9)
}
FtoC(100)
## [1] 37.77778
FtoC(70)
## [1] 21.11111
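
Because the arithmetic operators in R work element by element, FtoC() also accepts a whole vector. The sketch below additionally defines CtoF(), a hypothetical inverse function that is not part of the material above, to check the conversion by going back and forth.

```r
FtoC <- function(temperature_f) {
  return((temperature_f - 32) * 5 / 9)
}
# Hypothetical inverse function, obtained by solving the formula for Fahrenheit
CtoF <- function(temperature_c) {
  return(temperature_c * 9 / 5 + 32)
}
# A whole vector can be converted at once
temps_f <- c(32, 70, 100)
temps_c <- FtoC(temps_f)
temps_c
# Converting back should reproduce the input (up to floating point error)
all.equal(CtoF(temps_c), temps_f)
```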

An example using ice cream sales data:

Icecream <- read.csv("icecream.csv")
skim(Icecream)
Data summary
Name Icecream
Number of rows 30
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X 0 1 15.50 8.80 1.00 8.25 15.50 22.75 30.00 ▇▇▇▇▇
cons 0 1 0.36 0.07 0.26 0.31 0.35 0.39 0.55 ▆▆▇▂▁
income 0 1 84.60 6.25 76.00 79.25 83.50 89.25 96.00 ▇▆▃▂▃
price 0 1 0.28 0.01 0.26 0.27 0.28 0.28 0.29 ▆▆▇▆▃
temp 0 1 49.10 16.42 24.00 32.25 49.50 63.75 72.00 ▇▃▂▃▇
temp_c 0 1 9.50 9.12 -4.44 0.14 9.72 17.64 22.22 ▇▃▂▃▇
Icecream$temp_c <- FtoC(Icecream$temp) # Using the $ operator we can assign a new variable in an existing data.frame
skim(Icecream)
Data summary
Name Icecream
Number of rows 30
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cons 0 1 0.36 0.07 0.26 0.31 0.35 0.39 0.55 ▆▆▇▂▁
income 0 1 84.60 6.25 76.00 79.25 83.50 89.25 96.00 ▇▆▃▂▃
price 0 1 0.28 0.01 0.26 0.27 0.28 0.28 0.29 ▆▆▇▆▃
temp 0 1 49.10 16.42 24.00 32.25 49.50 63.75 72.00 ▇▃▂▃▇
temp_c 0 1 9.50 9.12 -4.44 0.14 9.72 17.64 22.22 ▇▃▂▃▇
ggplot(Icecream, aes(x = temp_c, y = cons)) + # cons -> consumption
  geom_point()

Data types

The most important types of data are:

| Data type | Description |
|---|---|
| Numeric | Approximations of the real numbers, \(\normalsize\mathbb{R}\) (e.g., mileage a car gets: 23.6, 20.9, etc.) |
| Integer | Whole numbers, \(\normalsize\mathbb{Z}\) (e.g., number of sales: 7, 0, 120, 63, etc.) |
| Character | Text data (strings, e.g., product names) |
| Factor | Categorical data for classification (e.g., product groups) |
| Logical | TRUE, FALSE |
| Date | Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.) |

Variables can be converted from one type to another using the appropriate functions (e.g., as.numeric(), as.integer(), as.character(), as.factor(), as.logical(), as.Date()). For example, we could convert the object y to character as follows:

y <- 5
print(y)
## [1] 5
y <- as.character(y)
print(y)
## [1] "5"

Notice how the value is in quotation marks since it is now of type character.
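
Conversions do not always succeed silently, so it is worth checking the result. The following sketch shows a few typical cases using only base R; the example values and variable names are made up for illustration.

```r
# Text that looks like a number converts cleanly
mileage <- as.numeric("23.6")
mileage
# Text that is not a number becomes NA (R also issues a warning, suppressed here)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)
# as.integer() drops the decimal part instead of rounding
whole <- as.integer(3.7)
whole
# Dates need a format string if they are not written as YYYY-MM-DD
sale_date <- as.Date("21-06-2015", format = "%d-%m-%Y")
class(sale_date)
format(sale_date, "%Y-%m-%d")
```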

Entering a vector of data into R can be done with the c(x1, x2, ..., x_n) (“concatenate”) command. In order to be able to use our vector (or any other variable) later on, we assign it a name using the assignment operator <-. You can choose names freely (but the first character of a name cannot be a number); just make sure they are descriptive and unique. Assigning the same name to a second variable (e.g., a vector) overwrites the first.

Instead of converting a variable we can also create a new one and use an existing one as input. In this case we omit the as. and simply use the name of the type (e.g., factor()). There is a subtle difference between the two. When converting a variable with, e.g., as.factor(), we only pass the variable we want to convert, without additional arguments; R determines the factor levels from the existing unique values in the variable, or simply returns the variable itself if it is already a factor. When we explicitly create a variable (with factor(), matrix(), etc.), we can and should set the options of this type explicitly. For a factor these could be the labels and levels, for a matrix the number of rows and columns, and so on.

head(charts$explicit)
## [1] 1 1 0 0 0 0
charts$explicit <- factor(charts$explicit, levels = c(1,0), labels = c("Explicit", "Not Explicit"))
skim(charts, explicit)
Data summary
Name charts
Number of rows 292600
Number of columns 34
_______________________
Column type frequency:
factor 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
explicit 0 1 FALSE 2 Not: 181268, Exp: 111332

Data structures

Now let’s create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., Matrix, Vector, List, Array). In this course, we will mainly use data frames.

[Figure: data types]

vec <- c(1,2,3,4,5,6,7,8)
vec
## [1] 1 2 3 4 5 6 7 8
mat <- matrix(vec, ncol = 2)
mat
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
arr <- array(vec, c(2,2,2))
arr
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
li <- list(vector = vec, matrix = mat, array = arr)
li
## $vector
## [1] 1 2 3 4 5 6 7 8
## 
## $matrix
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
## 
## $array
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

data.frames are similar to matrices but are more flexible in the sense that they may contain different data types (e.g., numeric, character, etc.), whereas all values of a vector or matrix have to be of the same type (e.g., character). It is often more convenient to use characters instead of numbers (e.g., when indicating a person's sex: “F”, “M” instead of 1 for female, 2 for male). Thus we would like to combine both numeric and character values while retaining the respective desired features. This is where data frames come into play: each column of a data.frame can have a different type. data.frame() creates a separate column for each vector, which is usually what we want (similar to SPSS or Excel).

df <- data.frame(vec = vec, vec_plus_one = add_one(vec), letters = c('a','b', 'c', 'd', 'e', 'f','g','h'))
df
# matrix will convert everything to characters
matrix(c(vec, add_one(vec), c('a','b', 'c', 'd', 'e', 'f','g','h')), ncol = 3)
##      [,1] [,2] [,3]
## [1,] "1"  "2"  "a" 
## [2,] "2"  "3"  "b" 
## [3,] "3"  "4"  "c" 
## [4,] "4"  "5"  "d" 
## [5,] "5"  "6"  "e" 
## [6,] "6"  "7"  "f" 
## [7,] "7"  "8"  "g" 
## [8,] "8"  "9"  "h"
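
To check which type each column ended up with, you can ask for its class. This is a small sketch with made-up values; it assumes R 4.0 or later, where data.frame() keeps text as character (older versions convert it to factor by default).

```r
vec <- c(1, 2, 3)
df <- data.frame(vec = vec, letters = c("a", "b", "c"))
# Each column of a data.frame keeps its own type
sapply(df, class)
# cbind() on mixed vectors builds a matrix, so everything becomes character
mat <- cbind(vec, c("a", "b", "c"))
class(mat[, 1])
# Arithmetic works on the numeric data.frame column
doubled <- df$vec * 2
doubled
```

The same calculation on mat[, 1] would fail because the values are stored as text there.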

Accessing Variables in a data.frame

# Single column
Icecream$temp_c
##  [1]  5.0000000 13.3333333 17.2222222 20.0000000 20.5555556 18.3333333
##  [7] 16.1111111  8.3333333  0.0000000 -4.4444444 -2.2222222 -3.3333333
## [13]  0.0000000  4.4444444 12.7777778 17.2222222 22.2222222 22.2222222
## [19] 19.4444444 15.5555556  6.6666667  4.4444444  0.0000000 -2.7777778
## [25] -2.2222222  0.5555556  5.0000000 11.1111111 17.7777778 21.6666667
# Multiple columns
Icecream[, c("temp_c", "temp")]
# First row
Icecream[1, ]
# First 5 rows
Icecream[1:5, ]
# Combination
Icecream[1:5, c("temp_c", "temp")]
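
Rows can also be selected by a condition instead of by position. The sketch below uses a small made-up data.frame called sales (not the Icecream data) so that it runs on its own.

```r
# Small illustrative data.frame with made-up values
sales <- data.frame(temp_c = c(5, 13, 20, -2),
                    cons   = c(0.30, 0.35, 0.45, 0.28))
# Rows where a condition holds, all columns
warm_days <- sales[sales$temp_c > 10, ]
warm_days
# Same condition, but only one column; the result is a plain vector
warm_cons <- sales[sales$temp_c > 10, "cons"]
warm_cons
# [[ ]] is an alternative to $ that also accepts a column name stored in a variable
column_name <- "temp_c"
mean(sales[[column_name]])
```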

Data Input

In general: click on a file in the RStudio file explorer (the Files pane) and choose “Import Dataset”; the import dialog shows the code of the matching read function. Copy that code into your script.

In some cases we want to download data from external sources (e.g., APIs). There are a couple of packages that can facilitate that.

Wikipedia

Wikipedia includes many interesting and up-to-date tables. For example you might be looking for a suitable TikTok influencer for your products:

library(rvest)
library(janitor)
library(stringr)
most_followed_link <- 'https://en.wikipedia.org/wiki/List_of_most-followed_TikTok_accounts'
most_followed_page <- read_html(most_followed_link)
most_followed_tables <- html_nodes(most_followed_page, "table.wikitable")
most_followed <- most_followed_tables[[1]] %>% html_table(fill = TRUE)
most_followed
names(most_followed)
## [1] "Rank"                    "Username"               
## [3] "Owner"                   "Followers[10](millions)"
## [5] "Likes[10](millions)"     "Description"            
## [7] "Country"                 "Brand Account"
most_followed <- clean_names(most_followed)
names(most_followed)
## [1] "rank"                  "username"              "owner"                
## [4] "followers_10_millions" "likes_10_millions"     "description"          
## [7] "country"               "brand_account"
names(most_followed) <- str_remove(names(most_followed), "10_")
names(most_followed)
## [1] "rank"               "username"           "owner"             
## [4] "followers_millions" "likes_millions"     "description"       
## [7] "country"            "brand_account"

JSON APIs

Reading data from websites can be tricky since you need to analyze the page structure first. Many web services (e.g., Facebook, Twitter, YouTube) have application programming interfaces (APIs), which you can use to obtain data in a pre-structured format. JSON (JavaScript Object Notation) is a popular lightweight data-interchange format in which data can be obtained. The process of obtaining data is visualized in the following graphic:

[Figure: Obtaining data from APIs]

The process of obtaining data from APIs consists of the following steps:

1. Identify an API that has enough data to be relevant and reliable (e.g., www.programmableweb.com has >12,000 open web APIs in 63 categories).
2. Request information by calling (or, more technically speaking, creating a request to) the API (e.g., from R, Python, PHP or JavaScript).
3. Receive the response messages, which are usually in JavaScript Object Notation (JSON) or Extensible Markup Language (XML) format.
4. Write a parser to pull out the elements you want and put them into a simpler format.
5. Store, process or analyze the data according to the marketing research question.

Let’s assume that you would like to obtain population data again. The World Bank has an API that allows you to easily obtain this kind of data. The details are usually provided in the API reference, e.g., here. You simply “call” the API for the desired information and get a structured JSON file with the desired key-value pairs in return. For example, the population for Austria from 1960 to 2019 can be obtained using this call. The file can be easily read into R using the fromJSON() function from the jsonlite package. Again, the result is a list and the second element ctrydata[[2]] contains the desired data, from which we select the “value” and “date” columns using the square brackets as usual: [,c("value","date")]

library(jsonlite)
url <- "http://api.worldbank.org/v2/countries/AT/indicators/SP.POP.TOTL/?date=1960:2021&format=json&per_page=100" #specifies url
ctrydata <- fromJSON(url) #parses the data 
str(ctrydata)
## List of 2
##  $ :List of 7
##   ..$ page       : int 1
##   ..$ pages      : int 1
##   ..$ per_page   : int 100
##   ..$ total      : int 61
##   ..$ sourceid   : chr "2"
##   ..$ sourcename : chr "World Development Indicators"
##   ..$ lastupdated: chr "2022-02-15"
##  $ :'data.frame':    61 obs. of  8 variables:
##   ..$ indicator      :'data.frame':  61 obs. of  2 variables:
##   .. ..$ id   : chr [1:61] "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
##   .. ..$ value: chr [1:61] "Population, total" "Population, total" "Population, total" "Population, total" ...
##   ..$ country        :'data.frame':  61 obs. of  2 variables:
##   .. ..$ id   : chr [1:61] "AT" "AT" "AT" "AT" ...
##   .. ..$ value: chr [1:61] "Austria" "Austria" "Austria" "Austria" ...
##   ..$ countryiso3code: chr [1:61] "AUT" "AUT" "AUT" "AUT" ...
##   ..$ date           : chr [1:61] "2020" "2019" "2018" "2017" ...
##   ..$ value          : int [1:61] 8917205 8879920 8840521 8797566 8736668 8642699 8546356 8479823 8429991 8391643 ...
##   ..$ unit           : chr [1:61] "" "" "" "" ...
##   ..$ obs_status     : chr [1:61] "" "" "" "" ...
##   ..$ decimal        : int [1:61] 0 0 0 0 0 0 0 0 0 0 ...
ctrydata[[2]][,c("value","date")]
ctrydata[[2]]$date <- as.numeric(ctrydata[[2]]$date)
ggplot(ctrydata[[2]], aes(x = date, y = value)) +
  geom_line()
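
If you want to see how fromJSON() turns such a response into R objects without calling the API, you can parse a short JSON text directly. The sketch below uses a hand-written text that only imitates the two-part layout (metadata, then records) of the response above, with two values copied from that output; the names json_text, meta and records are illustrative.

```r
library(jsonlite)
# Hand-written JSON imitating the [metadata, records] layout of the response above
json_text <- '[
  {"page": 1, "total": 2},
  [
    {"date": "2020", "value": 8917205},
    {"date": "2019", "value": 8879920}
  ]
]'
parsed <- fromJSON(json_text)
# First element: metadata as a list; second element: the records as a data.frame
meta <- parsed[[1]]
records <- parsed[[2]]
records$date <- as.numeric(records$date)  # years arrive as text
# Sort by year and keep the two columns of interest
records[order(records$date), c("date", "value")]
```

The key-value pairs of each record become the columns of the data.frame, which is why the real response can be passed to ggplot() after converting the date column.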

Try to recreate the following table for the “ease of doing business” indicator (see function arrange)

doing_business_url <- "http://api.worldbank.org/v2/countries/all/indicators/IC.BUS.EASE.XQ/?date=2019&format=json&per_page=6000" #specifies url
#...

Data manipulation: dplyr & tidyr

Both dplyr and tidyr are already included in the tidyverse package so we don’t have to load anything else.

From dplyr we are going to use the following functions

  • select() picks variables based on their names. Reduces columns.
  • filter() picks cases based on their values. Reduces rows by removing based on filtering function.
  • mutate() adds new variables that are functions of existing variables. Adds column(s).
  • summarize() reduces multiple values down to a single summary. Reduces rows by summarizing values.
  • arrange() changes the ordering of the rows. Sorts data based on column(s)

These combine naturally with group_by() which allows you to perform any operation “by group”.

For filtering we will need the following logical operations

Logical operations

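| Operation | Description | Example | Result |
|---|---|---|---|
| a==b | a equal to b | 8/2==4 | TRUE |
| a!=b | a not equal to b | 8/2!=5 | TRUE |
| a>b | a greater than b | 2*2>3 | TRUE |
| a>=b | a greater than or equal to b | 5>=10/2 | TRUE |
| a<b | a less than b | 6/2 < 5 | TRUE |
| a<=b | a less than or equal to b | 5<=10/2 | TRUE |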

Logical AND: && e.g. 5>=4 && 7>5 \(\Rightarrow\) TRUE

Logical OR: || e.g. 5>=4 || 7>10 \(\Rightarrow\) TRUE

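  • & and |: element-wise; they return one result per element of a vector
  • && and ||: for single values only (older R versions silently used only the first element of longer vectors; since R 4.3.0 this is an error)
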
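
The difference between the element-wise and the single-value operators is easiest to see on a short vector. This is a minimal sketch with made-up numbers:

```r
a <- c(1, 5, 10)
# & and | compare element by element and return one value per element
between <- a > 2 & a < 8
between
outside <- a < 2 | a > 8
outside
# && and || are meant for single TRUE/FALSE values, e.g. inside if()
if (length(a) == 3 && all(a > 0)) {
  message("a has three positive values")
}
# any() and all() collapse a logical vector to a single value
any(a > 8)
```

Inside filter() you will therefore use & and |, because the condition has to be evaluated for every row.
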
select(charts, trackName, region, day)
filter(charts, danceability > 0.96, region == "at", explicit == "Not Explicit")
# %>% inserts the previous output as the first argument
mutate(charts, log_streams = log(streams)) %>% 
  select(trackName, region, day, streams, log_streams)
group_by(charts, trackName) %>%
  mutate(streams_std = scale(streams), 
         streams_mean = mean(streams), 
         streams_sd = sd(streams),
         streams_std_manual = (streams - streams_mean)/streams_sd) %>%
  select(trackName,streams_mean, streams_sd, streams_std, streams_std_manual,  streams)
summarize(charts, streams=sum(streams))
group_by(charts, trackName) %>% 
  summarize(total_streams = sum(streams)) 
group_by(charts, artistName) %>% 
  summarize(total_streams = sum(streams)) %>%
  arrange(desc(total_streams))
group_by(charts, trackName) %>%
  filter(region == "global") %>%
  summarize(days_in_charts = n(), total_streams = sum(streams), avg_rank = mean(rank)) %>%
  filter(days_in_charts > 720) %>%
  arrange(desc(days_in_charts))
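
To follow what each step of such a pipeline does, it can help to run it on a table that is small enough to check by hand. The sketch below uses toy_charts, a made-up miniature of the charts data with only three of its columns.

```r
library(dplyr)
# Made-up mini version of the charts data (values are illustrative only)
toy_charts <- tibble(
  trackName = c("A", "A", "B", "B", "C"),
  region    = c("global", "at", "global", "global", "at"),
  streams   = c(100, 10, 300, 200, 5)
)
toy_summary <- toy_charts %>%
  filter(region == "global") %>%            # keep only global rows
  group_by(trackName) %>%                   # one group per track
  summarize(days_in_charts = n(),           # number of rows per track
            total_streams = sum(streams)) %>%
  arrange(desc(total_streams))              # largest total first
toy_summary
```

Track C disappears because it has no global rows, and track B comes first because its two global rows add up to the largest total.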

The tidyr package provides functions to “pivot” tables from long to wide and vice versa.

year_streams <- filter(charts, region == "global") %>%
  group_by(year = format(day, "%Y"), trackName) %>%
  summarize(streams = sum(streams))
year_streams
year_wide <- pivot_wider(year_streams, names_from = year, values_from = streams)
year_wide
filter(year_wide, if_all(`2019`:`2021`, ~ !is.na(.))) # if_all() replaces the deprecated use of across() inside filter()
filter(year_wide, !is.na(`2019`) & !is.na(`2020`) & !is.na(`2021`))

Usually the more useful function is pivot_longer because most packages (e.g., ggplot2) expect long data.

pivot_longer(year_wide, `2019`:`2021`, 
             names_to = "year", 
             values_to = "streams", 
             values_drop_na = TRUE)
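
The round trip between the two shapes can be checked on a tiny table. The following sketch uses made-up values; toy_long, toy_wide and toy_back are illustrative names.

```r
library(tidyr)
# Made-up long data: one row per track and year
toy_long <- data.frame(
  trackName = c("A", "A", "B"),
  year      = c("2019", "2020", "2020"),
  streams   = c(100, 150, 300)
)
# Long to wide: one column per year; missing combinations become NA
toy_wide <- pivot_wider(toy_long, names_from = year, values_from = streams)
toy_wide
# Wide to long again: dropping the NA restores the original three rows
toy_back <- pivot_longer(toy_wide, `2019`:`2020`,
                         names_to = "year",
                         values_to = "streams",
                         values_drop_na = TRUE)
toy_back
```

Track B has no 2019 value, so the wide table contains an NA that values_drop_na = TRUE removes again.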

Including Plots

Let’s create a plot that shows the streams for the top 10 artists in the sample for 2019 and 2020. First we prepare the data

top10artists <- filter(charts, format(day, '%Y') %in% c("2019", "2020")) %>%
  group_by(artistName) %>% 
  summarize(total_streams = sum(streams)) %>%
  top_n(n=10,total_streams)
top10artists
top10streams <- filter(charts, artistName %in% top10artists$artistName & format(day, '%Y') %in% c("2019", "2020")) %>%
  group_by(artistName, year = format(day, "%Y")) %>%
  summarise(streams = sum(streams))
## `summarise()` has grouped output by 'artistName'. You can override using the
## `.groups` argument.
top10streams$year <- factor(top10streams$year, levels = c("2019", "2020"))
top10streams

Anatomy of a ggplot

The ggplot function prepares the “canvas” for the plot by looking at the data we want to plot. The axes and coloring/fill color can be passed to aes as variables.

ggplot(top10streams, aes(x = artistName, y = streams, fill = year))

Next we add (+) layers to the plot to show the data. To see which artists are increasing their success, it's better to draw the two bars next to each other.

ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
  geom_bar(stat = "identity", position = "dodge")

Alternatively, if we are more interested in total streams:

ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
  geom_bar(stat = "identity", position = "stack")

We can fix the overlapping labels by adding another layer:

ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

Next we want to have an ordering to easily compare the success of the artists:

# arrange by total streams:
top10streams$artistName <- factor(top10streams$artistName, 
                                  levels = arrange(top10artists, desc(total_streams))$artistName) 
ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
  geom_bar(stat = "identity", position = "dodge")  +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

# arrange by 2020 streams
level_order <- filter(top10streams, year=="2020") %>% arrange(desc(streams))
top10streams$artistName <- factor(top10streams$artistName, levels = level_order$artistName)
plt_streams <- ggplot(top10streams, aes(x = artistName, y = streams, fill = year)) +
  geom_bar(stat = "identity", position = "dodge")  +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))
plt_streams
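
The ordering trick works because ggplot2 draws a discrete axis in the order of the factor levels. Here is a stripped-down sketch with three hypothetical artists and made-up stream counts:

```r
library(ggplot2)
# Made-up totals for three hypothetical artists
toy <- data.frame(artist = c("X", "Y", "Z"), streams = c(20, 50, 30))
# Without explicit levels the x axis would be sorted alphabetically: X, Y, Z
# Setting the levels by descending streams changes the order to Y, Z, X
toy$artist <- factor(toy$artist, levels = toy$artist[order(-toy$streams)])
levels(toy$artist)
toy_plot <- ggplot(toy, aes(x = artist, y = streams)) +
  geom_col()  # geom_col() is shorthand for geom_bar(stat = "identity")
toy_plot
```

Changing the levels only changes the display order; the values in the data stay the same.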

Finally, let’s clean the plot up a bit. Notice that we can save plots and add to them later! In addition, the width of the plot is increased here through the chunk option fig.width=12.

plt_streams + 
  ggtitle("Total streams of most successful artists", subtitle = "2019-2020") + # add title layer
  theme_bw() +
  theme(panel.border = element_blank(), # remove box around plot
        axis.line = element_line(color = 'black'), # add x, y axes
        panel.grid.major.x = element_blank(), # remove x grid lines
        legend.title = element_blank(), # remove "year" from legend
        axis.title.y = element_text(size = 16), # increase y title text size
        axis.title.x = element_blank(), # remove x title
        axis.text = element_text(size = 15), # increase text size of labels on both axes
        legend.text = element_text(size = 15), # increase legend text size
        title = element_text(size = 18)) + # increase title text size
  scale_y_continuous(expand = expansion(mult = c(0, .1)), # remove spacing on bottom
                     labels = scales::comma) # add commas to the number of streams

Let’s try to clean this plot up together:

charts %>%
  filter(region == "global") %>% # Only global streams
  group_by(day) %>% # We want to summarize per day
  summarize(streams = sum(streams)) %>% # Calculate sum of streams
  ggplot(aes(x = day, y = streams)) + # plot setup
    geom_line() + # add lines
    ggtitle("Total global streams of top 200 songs") # add title