R¶

Data Structures¶

R has many data structures, which include

atomic vector
list
matrix
data frame
…

To check and convert data type:

class() high-level data type/structures: data frame, matrix, etc
typeof() low-level data type: character, logical, double, integer …
is.sth(): is numeric? character?
as.sth(): convert to numeric, character etc.
identical(x1, x2)

Comparison of the four data structures

Dimensions	Homogenous	Heterogeneous
1-D	atomic vector	list
2-D	matrix	data frame

Vector¶

A vector is the most common and basic data structure in R. A vector is a collection of elements that are of one (homogenous) mode character, logical, integer or numeric.

creation

empty vector: vector() (By default the mode is logical).
- vector("character", length = 5) creates a vector of mode ‘character’ with 5 elements.
- character(5) same thing
- numeric(5) a numeric vector with 5 elements
- logical(5) # a logical vector with 5 elements
integer vs numeric
- c(1, 2): double precision real numbers
- c(1L, 2L): integer, or use as.integer()
character c('a', 'b'), logical, etc.

Note

The contents of a list are not restricted to a single mode. If multiple types are inside a vector c(1, 'a'), R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”.

manual coercion: as.integer(), as.character(), etc.

List¶

A list can contains elements of different types. Lists are sometimes called generic vectors, because the elements of a list can by of any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.

Creation

from vector: as.list(1:10)
from input: pie = list(type="key lime", diameter=7, is.vegetarian=TRUE)

Indexing

by element name: pie$diameter
by brackets: pie[1] the first element, pie[[1]] the first element in the first element.

Matrix¶

A matrix is an extension of the numeric or character vectors. It is simply an atomic vector with dimensions.

Data Frames¶

A data frame is a special type of list where every element of the list has same length (i.e. data frame is a “rectangular” list). One can check is.list(df) returns TRUE.

Data frames can have additional attributes such as rownames().

Creation

from imported data: read.csv(), read.table()
from vectors: data.frame(id = letters[1:10], x = 1:10, y = 11:20)

Functions

shape: dim(), nrow(), ncol(), str()
preview data: head(), tail(), str()
names() or colnames(), na.omit()
sapply(df, fun): run fun() function for each column of df

Indexing

by column name: df$colname
df[index, col1], still matrix
df[index, ]$col1, vector
df[row_num, col_num], or df[vec_of_num, vec_of_num]

Data cleaning

is.na()
anyNA()

Factors¶

Factor objects are how R stores categorical variables, which is character data into fixed numbers of integers. Each category is called a ‘level’.

x = factor(c('a', 'b', 'c', 'a'))
str(x) # Factor w/ 3 levels "a","b","c": 1 2 3 1
typeof(x) # [1] "integer"
levels(x) # [1] "a" "b" "c"
nlevels(x) # 3

By default, R always sorts levels in alphabetical order. To specify levels, use ordered = TRUE.

chili <- factor(c("hotter", "hot", "hotter", "hottest", "hot", "hot"))
factor(chili, levels = c("hot", "hotter", "hottest"), ordered = TRUE)
min(chili) # hot
max(chili) # hottest

Data Visualization via `ggplot`¶

cheat sheet

Syntax¶

Basic syntax is as below:

ggplot(data = [dataset],
       mapping = aes(x = [x-variable],
                     y = [y-variable]) +
    geom_xxx() +
    other_options

For instance, scatter plot (point plot) of two variables

ggplot(data = starwars,
       mapping = aes(x = height,
                     y = mass)) +
    geom_point()

Aesthetic Mapping¶

To display values, map variables in the data to visual properties of the geom (aesthetics) in your plot, e.g. x and y location, size, shape, color etc. Most of the RHS variables are from data.

mapping = aes(x = height,
              y = mass,
              color = sex))

Warning

Putting aes() in different places can affect the final output.

In the coe below, only one smoothing line is fitted. The aesthetic mapping aes(color=sex) is in the geom_point() plot level. Points will have different colors according to the corresponding sex.

ggplot(data = starwars,
       mapping = aes(x = height, y = mass)) +
  geom_point(aes(color=sex)) +
  geom_smooth(se = FALSE) +
  xlim(80, 250) +
  ylim(0, 180)

In the code below, the aesthetic mapping aes(color=sex) is in the ggplot() input level, so the data sets are partitioned into parts by sex, and R runs the next geom commands for each part.

ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color=sex)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  xlim(80, 250) +
  ylim(0, 180)

aesthetics	discrete	continuous
color	rainbow of colors	gradient
size	discrete steps	linear mapping between radius and value
shape	different shape for each	NA

Use of `geom`¶

Numerical data

density: geom_density()
histogram: geom_histogram(binwidth = 10)
smoothing line: geom_smooth(se = FALSE) + xlim() + ylim()

Categorical data

bar plot: geom_bar()

Multivariate

box plot: geom_boxplot() where y is numeric and x is categorical
- many options, see documentation.
scatter plot: geom_point()

Other Options¶

y-axis label: + labs(y = 'ylabel')
create sub plots conditioning on some discrete variables
- facet_grid(): 2d grid, rows ~ cols, . for no split
- facet_wrap(): 1d ribbon wrapped into 2d, useful when the condition variable takes too many values.
  - + facet_wrap(. ~ x_discrete)
  - + facet_wrap(y_discrete ~ .)

Data Manipulation via `dplyr`¶

cheat sheet

Consider a set of nested functions expressing a sequence of actions

park(drive(start_car(find("keys")), to =
"campus"))

We need to read from the inside. Is there a more friendly way?

Introduction to `%>%`¶

The pipes operator %>%, in package magrittr, implements these functions in a more natural (and easier to read) way.

find("keys") %>%
start_car() %>%
drive(to = "campus") %>%
park()

By default, it sends the result of the LHS function as the first argument to the RHS function, to form a flow of results. To send results to a function argument other than first one, or to use the previous result for multiple arguments, use .:

starwars %>%
filter(species == "Human") %>%
lm(mass ~ height, data = .)

Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame. To store the output, add it at the front

output <- data %>%
          ... %>%
          ...

If we want to store intermediate results, use {. ->> obj}

c <- cars %>%
    mutate(var1 = dist*speed) %>%
    {. ->> b } %>%   # here is save
    summary()

Data Wrangling¶

select: pick columns by name (no quotes)
- select(col1, col2)
- To add a chunk of columns use the start_col:end_col syntax: select(col1:col3, col5:col7)
- Not select columns: select(-col1, -(col3:col5))
- one_of: the column names can be stored in a vector cols, then passed in select(one_of(cols)). This ignores the column name in col that is not in that of the data set. Also see any_of, all_of, and other select helpers.
- select(col2, col3, everything()) where everything() selects all columns after col3
- select(col1_name=col1) to rename, or use rename(col1_name=col1)
select_if: uses takes in each column same as sapply
- select_if(~is.numeric(.)) or in short select_if(is.numeric) selects columns with numerical values. For negation, use select_if(~!is.numeric(.)) or select_if(funs(!is.numeric(.))).
- select_if(funs(mean(.) > 4)) or select_if(~mean(.) > 4) select columns with column mean > 4. mean > 4 is not a function in itself, so you will need to add a tilde ~.
- together, select_if(~is.numeric(.) & mean(.,na.rm=TRUE) > 4)
select_all() function allows changes to all columns, and takes a function as an argument, e.g. select_all(tolower) upfront, or wrap it inside funs()
slice: pick rows using index(es)
- slice(c(7,8,14:15))
- slice(-c(7,8,14:15))
filter: pick rows matching criteria
- can use AND, OR and NOT, or below styles
- filter(condition1, condition2) is equivalent to using AND
- filter(condition1, !condition2)
- filter(condition1 | condition2)
- filter(xor(condition1, condition2) will return all rows where only one of the conditions is met, and not when both conditions are met.
arrange: order rows by values of column(s), default ascending.
- arrange(col1)
- arrange(desc(col1), col2)
- If col1 in desc(col1) is a character variable, then sort from Z to A.
rename: rename specific columns
- rename(col1_new_name=col1, col2_new_name=col2)
mutate: add new columns
- New columns can be made with aggregate functions such as average, median, max, min, sd, and ifelse
- mutate(col1 = col1 - mean(col1))
- mutate(col1or2 = ifelse(col1 > col2, 'col1', 'col2'))
- mutate(col2 = case_when(col1>80 ~ 'great', col1>60 ~ 'good', TRUE~'others') works like multi-level ifelse.
mutate_at applies a function to one or several columns, like sapply. It does not create new columns.
- mutate_at(c('col1', 'col2'), ~ .-mean(.)), note that column names are in quotes.
- can also use %>% inside, mutate_at(cols, ~ .-mean(.) %>% round(2)).
transmute_at return a new data frame with only the mutated variables.
bind_cols, bind_rows: bind multiple data frames by row and column.
- %>% bind_cols(., df2)
count: return frequency of discrete variables. Add sort=TRUE to sort by counts.
- %>% count(col1, col2, sort=TRUE)
group_by: create groups of rows according to a condition. Use ungroup to recover
summarise: apply computations across groups of rows, used with aggregation functions
- n, n_distinct, sum, max, min, mean, median, sd
gather: make wide data longer
- gather(key=new_key_col_name, value=new_value_col_name, -old_index)
spread: make long data wider
- spread(key=key_col_name, value=value_col_name)
separate: separate a character column to two by a separator.
- separate(col, c('new-col-1', 'new-col-2'), sep=' ')
unite: unite two columns to one by separator.
- unite('new-col', c(col1, col2), sep=' ')
map: apply computations across columns, related map_dbl, map_dfr
pmap: apply computations across rows, related pmap_dbl, pmap_dfr
*_join where * can be inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates how many rows to keep.
- *_join(df2, by='key'). If column names differ, use by=c("df1.name"="df2.name")

Comparison with SQL¶

Translate commands to SQL using translate_sql.

Objective	R	SQL
select columns	`select(col1, col2)`	`SELECT col1, col2`
filter	`filter(col1 > 2)`	`WHERE col1 > 2`
order rows	`arrange(col1, desc(col2))`	`ORDER BY col1, col2 DESC`

Pros¶

By comment a line

easy to see its effect on the final result, e.g. some filter

Comparison with Python¶

R is procedural, while Python is object-oriented.

R

find('key') %>%
  start('car') %>%
  drive(to='campus') %>%
  park()

Python

key.find()
car.start()
car.drive(to='campus')
car.park()

Functions¶

Objective	Python	R	Remark
function help	`?function`	`function?`
append to array	`mylist.append(x)`	`myvec = c(myvec, x)`
create a sequence	`range(start, stop, step)` or `range(10)`	`seq(from, to, by)` or `seq(10)`	python is zero indexed
add y-axis label	`plt.ylabel('ylabel')`	`plot(..., ylab = 'ylabel')` `... + labs(y = 'ylabel')`
add title	`plt.title('title')`	`title('title')`
length of a string	`len(my_string)`	`nchar(my_string)`
arrays comparison	`np.isin(a, b)` `np.isin(a, b, invert=True)`	`a %in% b` `!(a %in% b)`	returns a logical array of the same shape as `a`
add a column	`df['new_col'] = ...`	`df$new_col = ...` or `df = data.frame(df, new_col_name = ...)`
delete a column	`df.drop(['B', 'C'], axis=1)` or `df.drop(columns=['B', 'C'])`	`df$col <- NULL` or `df = df[, -col_index]`
delete rows	`df.drop([0, 1])`	`df[-c(0,1),]`
check missing	`df.isnull()`	`is.na(df)`
min index	`np.argmin()`	`which.min()`
frequency of discrete values in a vector	`pd.Series.value_counts()`	`summary()`
default vector format	row-major (‘C’)	column-major (‘F’/fortran)	mind broadcasting in `R`
array stack	`np.vstack((...))`, `np.hstack((...))`	`rbind()`, `cbind()`
array extension	`list1.extend(list2)`, `np.hstack((ary1, ary2))`	`c(v1, v2)`	In numpy the default format is row-major, so use `hstack`.

Miscellaneous¶

= vs <-

in most cases, they are equivalent
unlike =, the other direction -> also works, which is convenient at the end of a line
inside function argument assignment fun(x=...), must use =.

Data Science Handbook

R¶