Intro to Data Analytics in R (Part 1)

A Brief Overview of R

The name "R" comes from the first names of its creators, Ross Ihaka and Robert Gentleman. It’s also a play on the name of its parent language, S.

In its early years, R’s syntax was very similar to that of S. R’s semantics, however, are closer to those of Scheme, a functional programming language. "Syntax" and "semantics" are terms borrowed from linguistics that describe different aspects of a language. In computer programming, syntax refers to the rules that dictate a programming language’s ‘spelling’ and ‘grammar,’ while semantics refers to what the language’s expressions and commands mean, that is, what they do when they are run.

R is often grouped with functional programming languages. In R, functions are first-class objects. This means that you can do with a function anything you can do with any other R object: assign it to a variable, pass it as an argument to another function, or return it from a function. Even assigning an object to a variable is a function call.

To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call. — John Chambers, quoted in Section 6.3 of Advanced R by Hadley Wickham

R, however, isn't a "pure" functional programming language. It also has features of imperative programming languages, such as loops and assignment.
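As a small illustration of both sides of R, the sketch below passes a function around like any other object, calls the operators + and <- as ordinary functions, and then repeats the same calculation with an imperative loop. The names square, apply_twice, squares_functional and squares_loop are made up for this example.

```r
# Functions are objects: store one in a variable and pass it to another function.
square <- function(x) x^2
apply_twice <- function(f, x) f(f(x))
apply_twice(square, 3)       # (3^2)^2 = 81

# Operators are function calls too; backticks let us call them by name.
`+`(2, 3)                    # same as 2 + 3
`<-`(y, 10)                  # same as y <- 10

# Functional style: apply square() to every element of a vector.
squares_functional <- sapply(1:4, square)

# Imperative style: build the same result with a loop and repeated assignment.
squares_loop <- numeric(4)
for (i in 1:4) {
  squares_loop[i] <- square(i)
}

identical(squares_functional, squares_loop)   # TRUE
```

Both styles give the same answer; which one reads better usually depends on the task.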

Why Learn R?

Open-source
Great for Statistical Analysis
Compatible with other programming languages
Compatible across platforms
Can process large datasets via parallel or distributed computing
Can interact with databases
Can be used for many common data-related tasks, e.g. data wrangling, web scraping and machine learning
Widely used in the data industry

Data Types in R

Integer e.g. 350L (The letter "L" is added to indicate that it's an integer, as integers are stored differently from objects of the numeric class. To read more on the difference between the integer and numeric classes in R, check out "What's the difference between integer class and numeric class in R". This difference is generally not important for most day-to-day work as a data analyst, though.)
Numeric e.g. 45, 45.4
Character e.g. "Intro to R"
Logical e.g. TRUE
Complex e.g. 30 + 6i
Raw: Values are displayed as raw bytes. The function charToRaw() converts an object of class "character" to class "raw"; rawToChar() reverses this.

A function is a block of code that performs a specific task. This block of code is given a name, and whenever the task is to be performed, the function is "called". Functions are useful when a task has to be performed many times and we don't want to write out the same code each time. R has many built-in functions, and users can also create their own. We can also use functions written by other users, which are shared via packages.
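To make this concrete, here is a minimal sketch of a user-defined function. The function percent_change and its digits argument are hypothetical, invented only for this illustration.

```r
# A hypothetical helper: percentage change from an old value to a new one.
# digits has a default value, so it can be left out when calling the function.
percent_change <- function(old, new, digits = 1) {
  change <- (new - old) / old * 100
  round(change, digits)
}

# "Calling" the function: the same code is reused with different inputs.
percent_change(200, 250)             # 25
percent_change(80, 60)               # -25
percent_change(3, 4, digits = 2)     # 33.33
```

Once defined, the function can be called as many times as needed without rewriting the calculation.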

The function class() can show the data type of an object. We can try this in the RStudio console:

NB: "//" indicates a comment in this article. If you copy the code into RStudio, please delete those sections or replace the symbol with "#"; otherwise, your code will throw errors when run.

> s <- 350L // "<-" is the main assignment operator in R
> class(s)
[1] "integer"
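The same check can be run on the other data types listed earlier. The snippet below is a small sketch that also round-trips a character value through raw bytes; the variable name bytes is just for illustration.

```r
# class() reports the class of each kind of value.
class(350L)          # "integer"
class(45.4)          # "numeric"
class("Intro to R")  # "character"
class(TRUE)          # "logical"
class(30 + 6i)       # "complex"

# Round trip between character and raw.
bytes <- charToRaw("R")
bytes                # 52, the byte for "R" written in hexadecimal
class(bytes)         # "raw"
rawToChar(bytes)     # "R"
```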

Commonly Used Data Structures by Data Analysts in R

A data structure is a particular way of organizing data in a computer so that it can be used effectively. Data Structures in R include vectors, matrices, arrays, lists, factors and data frames. I will touch briefly on vectors, factors, lists and data frames.

Vectors: A vector is a one-dimensional data structure. A vector in R can contain objects of only one data type. This example shows how to create a vector in R:

> example1 <- c(2,3,4) // c() is the function used to create vectors.
> example1
[1] 2 3 4
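The one-data-type rule is easiest to see by trying to break it. In the sketch below (the names nums and mixed are made up), R quietly converts every element to a single common type instead of raising an error.

```r
nums <- c(2, 3, 4)
nums * 10              # 20 30 40: arithmetic is applied element by element
length(nums)           # 3

# Mixing types: everything is converted to character.
mixed <- c(2, "three", TRUE)
mixed                  # "2" "three" "TRUE"
class(mixed)           # "character"

# Mixing numbers and logicals: TRUE is converted to 1.
class(c(2, TRUE))      # "numeric"
```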

Lists: Lists in R can contain objects of various data types and structures - other lists, vectors, matrices, and even functions. list() is the function used to create lists.
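
A minimal sketch of a list, with made-up element names and values, might look like this:

```r
# One list holding a character value, a numeric vector, a logical and a function.
profile <- list(
  name = "Ada",
  scores = c(88, 92, 79),
  passed = TRUE,
  summarise = function(x) mean(x)
)

profile$name                          # "Ada"
profile[["scores"]][2]                # 92
profile$summarise(profile$scores)     # the mean of the three scores
sapply(profile, class)                # the class of every element
```

Elements can be reached by name with $ or [[ ]], which is what makes lists handy for bundling related but differently shaped pieces of data.
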
Factors: Factors are used to categorize data and store it as levels. They are useful for storing categorical data. To create a factor, you use the function factor() and pass it a vector. For example, say we want to record which of our three favourite fruits we ate each time over a month. We can create a factor, "Fruits", which has the levels "apples", "oranges" and "plums".

> Fruits <- factor(c("oranges","apples", "plums", "oranges", "plums", "oranges", "apples"))
> Fruits
[1] oranges apples  plums  oranges plums oranges apples
Levels: apples oranges plums
> class(Fruits)
[1] "factor"
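
Since the point of the factor is to see how often each fruit was eaten, a short follow-up sketch can count the values per level. It recreates Fruits so that it runs on its own.

```r
Fruits <- factor(c("oranges", "apples", "plums", "oranges", "plums", "oranges", "apples"))

levels(Fruits)       # "apples" "oranges" "plums" (sorted alphabetically by default)
nlevels(Fruits)      # 3
table(Fruits)        # apples 2, oranges 3, plums 2
as.integer(Fruits)   # 2 1 3 2 3 2 1: each value is stored as the position of its level
```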

Data frame: This is a tabular representation of data, i.e. it is made of rows and columns, and it can contain objects of various data types. A data frame must have column names. Each column must have the same number of items, e.g. one column cannot have 4 items and another 20. Each item in a single column must be of the same data type. To create a data frame, you use the function data.frame().
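
Here is a minimal sketch of these rules in action. The data frame staff and its values are invented for illustration.

```r
# Three columns of different types, each with exactly three items.
staff <- data.frame(
  name = c("Ama", "Kofi", "Esi"),
  years = c(3L, 10L, 6L),
  salary = c(52000, 81000, 64500),
  stringsAsFactors = FALSE
)

str(staff)                   # structure: 3 observations of 3 variables
sapply(staff, class)         # one class per column
staff[staff$years > 5, ]     # rows where years is greater than 5

# Columns of mismatched length are rejected. (data.frame() only recycles a
# shorter column when its length divides the longer one exactly.)
result <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
result                       # the error message, returned as text
```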

Studio Time!

Warming up with some useful shortcuts

Ctrl+Shift+F10 -> restart the R session
Ctrl+Enter -> run the current line or selected block of code
Ctrl+Shift+C -> comment out a line or block of code
Ctrl+Shift+M -> %>% (the forward pipe operator). In a sequence of operations, it passes the result of one step as the input for the next step
Ctrl+I -> automatic indenting
Ctrl+L -> clear the console
Alt+- -> insert the assignment operator "<-"
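
The pipe mentioned in the shortcuts is easier to appreciate with a tiny example. This sketch assumes the magrittr package is installed (it comes with the tidyverse); the names values, nested and piped are made up.

```r
library(magrittr)   # provides %>%; the tidyverse packages make the same operator available

values <- c(4, 9, 16)

# Nested calls read inside-out ...
nested <- sum(sqrt(values))

# ... while the pipe reads left to right: take values, then sqrt(), then sum().
piped <- values %>% sqrt() %>% sum()

piped                      # 9
identical(nested, piped)   # TRUE
```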

For this practical session, we'll be analyzing the "Salaries" dataset from the carData package. The carData package contains a variety of datasets that we can perform analysis on. NB: Best practices in coding wouldn't allow for such frequent commenting, especially as the comments address the "what" instead of the "why", but do bear with me for the sake of absolute beginners.

install.packages("carData") // this installs the carData package
// A package is a shareable collection of code
// (usually functions), documentation and/or data
// that is usually geared towards solving a particular problem or set of problems

library(carData) // load the package so we can work with it in this R session
?Salaries // this opens the help page for Salaries so we can better understand our dataset
data <- Salaries // here we are assigning the Salaries dataset to a variable we are calling "data"

Each of the following lines of code gives us an idea of what the dataset looks like:

str(data)
head(data)
tail(data)
summary(data)
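
A few more base R functions can round out this first look. The sketch below assumes carData is installed and reloads the dataset so that it runs on its own.

```r
library(carData)
data <- Salaries

dim(data)                # number of rows and columns
names(data)              # column names
sapply(data, class)      # class of each column
table(data$rank)         # how many staff hold each rank
colSums(is.na(data))     # number of missing values per column
```

Checking for missing values this early helps decide how much cleaning the dataset needs.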

Now, let's clean up our data! We can replace the A's and B's in the "discipline" column with what they represent. From the dataset's documentation (which we explored with ?Salaries), we learn that "A" stands for theoretical departments and "B" stands for applied departments. Let's make the change. Similarly, let's clean up the "rank" column.

// First, we can check the structure of the column with:

str(data$discipline) // We see it's a Factor with 2 levels: "A", "B"

// You'll have to install the tidyverse first if you haven't done so already, with install.packages("tidyverse"). It's a popular and excellent collection of packages for data wrangling in R. After installing, we can load it into our R session:

library(tidyverse)  

data <- data %>%
  mutate(
    discipline = fct_recode(discipline,
                            "Theoretical Department" = "A",
                            "Applied Department" = "B"),
    rank = fct_recode(rank,
                      "Assistant Professor" = "AsstProf",
                      "Associate Professor" = "AssocProf",
                      "Professor" = "Prof")
  )
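
To see what fct_recode() does in isolation, here is a small sketch on a made-up factor. It assumes the forcats package is installed (it is loaded by library(tidyverse)); dept and dept_labelled are hypothetical names. Note the order of each pair: the new label goes on the left and the old level on the right.

```r
library(forcats)

dept <- factor(c("A", "B", "B", "A"))   # made-up values

dept_labelled <- fct_recode(dept,
                            "Theoretical Department" = "A",
                            "Applied Department" = "B")

levels(dept)                  # "A" "B"
levels(dept_labelled)         # the new labels, in the same order as the old levels
table(dept, dept_labelled)    # cross-check: every "A" became "Theoretical Department"
```

A quick table() like the last line is a cheap way to confirm that a recode did what was intended.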

// Next, we can change the column names to all caps, to make them stand out.
// Then, we rename the "YRS.SINCE.PHD" column to get rid of the dots.

data <- data %>% rename_with(toupper)
data <- data %>% rename("YEARS SINCE PhD" = YRS.SINCE.PHD)
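
One consequence of a column name that contains spaces is that it has to be wrapped in backticks whenever it is referred to. The sketch below shows this on a made-up two-row data frame called toy, assuming dplyr is installed (it is loaded by library(tidyverse)).

```r
library(dplyr)

toy <- data.frame(yrs.since.phd = c(19, 20), salary = c(139750, 173200))

toy <- toy %>%
  rename_with(toupper) %>%                     # all column names to upper case
  rename("YEARS SINCE PhD" = YRS.SINCE.PHD)    # new name on the left, old name on the right

names(toy)                     # "YEARS SINCE PhD" "SALARY"
mean(toy$`YEARS SINCE PhD`)    # backticks are needed because of the spaces
```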

That's enough data cleaning for now. Now, let's do some basic analysis. The gtsummary package has a great function, tbl_summary(), that gives an overview of our dataset in a presentable format. As before, you'll have to install the package first if you haven't already.

library(gtsummary)
data %>%  // we see the pipe operator at work here!
  tbl_summary()

We're off to a good start! However, there are a few things we can do to make our table more presentable. Some of the column names can be modified. We can also make the labels bold in the table. Let's add on to our code:

data %>%
  tbl_summary(
    label = list(
      YRS.SERVICE ~ "YEARS OF SERVICE",
      SALARY ~ "SALARY (in dollars)")
  ) %>%
  bold_labels() %>%
  modify_caption(
    "**Overview of the Salaries Dataset**") // the extra asterisks are the syntax for making the title bold

Okay! This looks much better. We can add a few more touches from the "gt" package. First, we assign our output from the previous step to a variable, which we can name "table1". More experienced users can skip this part.

table1 <- data %>%
  tbl_summary(
    label = list(
      YRS.SERVICE ~ "YEARS OF SERVICE",
      SALARY ~ "SALARY (in dollars)")
  ) %>%
  bold_labels() %>%
  modify_caption(
    "**Overview of the Salaries Dataset**")

Then, we convert table1 into a "gt" table and spruce it up a little. To do this, we must load the "gt" package. You'll have to install it first if you haven't already, as explained in the previous sections.

library(gt)
table1 %>%
  as_gt() %>%
  tab_style(
    locations = cells_body(
      columns = everything(),
      rows = c(4, 7, 12)), // we are selecting these rows because they have the highest percentages.
    style = list(cell_fill(color = "firebrick"),
                 cell_text(color = "white"))
  ) %>%
  tab_header(subtitle = "Data Digest Episode 8",
             title = "Table 1: General Overview of the Salaries Dataset")

NB: It is best to load all required packages at the beginning of your R script, but we separated them here so beginners could follow along to see what each package was accomplishing in the script.

We've done well for today! There's a more condensed version of this article at https://resagratia.com/resources/datadigest/introduction-data-analytics-r which includes how to report these findings using R Markdown. A live code-along can also be found on YouTube: https://www.youtube.com/watch?v=FPO__kebfyQ&t=3456s

Happy coding!

Acknowledgements

These resources helped me in preparing the article:

Stat 8054 Lecture Notes: R as a Functional Programming Language
DataCamp R Cheatsheet
The R Language: an Overview
R Data Types
Data Structures in R Programming
Stack Overflow