R

Data Structures

R has many data structures, which include

  • atomic vector

  • list

  • matrix

  • data frame

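To check and convert data types:

  • class(): high-level data type/structure: data frame, matrix, etc.

  • typeof(): low-level storage type: character, logical, double, integer, …

  • is.<type>(), e.g. is.numeric(), is.character(): test whether an object is of that type

  • as.<type>(), e.g. as.numeric(), as.character(): convert an object to that type

  • identical(x1, x2): test whether two objects are exactly equal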
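
The helpers above can be tried on a couple of small objects. This is a minimal base R sketch; the names x, y and df are illustrative only.

```r
x  <- c(1.5, 2, 3)             # numeric (double) vector
df <- data.frame(a = 1:3)      # small data frame

class(df)                      # "data.frame": high-level structure
typeof(df)                     # "list": low-level storage type
typeof(x)                      # "double"

is.numeric(x)                  # TRUE
y <- as.character(x)           # convert to character: "1.5" "2" "3"
identical(as.numeric(y), x)    # TRUE: the round trip recovers the values
```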

Comparison of the four data structures

| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1-D | atomic vector | list |
| 2-D | matrix | data frame |

Vector

A vector is the most common and basic data structure in R. A vector is a collection of elements that all share one (homogeneous) mode: character, logical, integer or numeric.

creation

  • empty vector: vector() (By default the mode is logical).

    • vector("character", length = 5) creates a vector of mode ‘character’ with 5 elements.

    • character(5) same thing

    • numeric(5) a numeric vector with 5 elements

    • logical(5) a logical vector with 5 elements

  • integer vs numeric

    • c(1, 2): double precision real numbers

    • c(1L, 2L): integer, or use as.integer()

  • character c('a', 'b'), logical, etc.
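
A short base R sketch of the constructors listed above; the variable names are illustrative.

```r
v0 <- vector()                          # empty vector, mode logical by default
v1 <- vector("character", length = 5)   # five empty strings

identical(v0, logical(0))               # TRUE
identical(v1, character(5))             # TRUE: same thing
numeric(3)                              # 0 0 0

typeof(c(1, 2))                         # "double"
typeof(c(1L, 2L))                       # "integer"
typeof(as.integer(c(1, 2)))             # "integer"
```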

Note

The contents of a list are not restricted to a single mode, but those of an atomic vector are. If elements of different types are combined in a vector, e.g. c(1, 'a'), R creates a vector whose mode can accommodate all the elements it contains (here character, giving "1" and "a"). This conversion between modes of storage is called “coercion”. When R converts the mode of storage automatically based on the content, it is referred to as “implicit coercion”.

manual coercion: as.integer(), as.character(), etc.
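
As a rough illustration of implicit and manual coercion in base R (the variable names are made up for this sketch):

```r
v1 <- c(1, "a")      # number + character -> character
v2 <- c(1, TRUE)     # number + logical   -> double (TRUE becomes 1)
v3 <- c(1L, 2.5)     # integer + double   -> double

typeof(v1)           # "character"
typeof(v2)           # "double"
typeof(v3)           # "double"

# Manual coercion: values that cannot be converted become NA (with a warning)
ok  <- as.integer(c("1", "2"))              # 1 2
bad <- suppressWarnings(as.integer("abc"))  # NA
```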

List

A list can contain elements of different types. Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.

Creation

  • from vector: as.list(1:10)

  • from input: pie = list(type="key lime", diameter=7, is.vegetarian=TRUE)

Indexing

  • by element name: pie$diameter

  • by brackets: pie[1] returns a list containing the first element, pie[[1]] returns the first element itself.
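
The difference between single and double brackets is easiest to see by inspecting the results. A minimal sketch using the pie list from above; sub and elem are illustrative names.

```r
pie <- list(type = "key lime", diameter = 7, is.vegetarian = TRUE)

sub  <- pie[1]        # single bracket: a list of length 1
elem <- pie[[1]]      # double bracket: the element itself

class(sub)            # "list"
class(elem)           # "character"
pie$diameter          # 7
pie[["diameter"]]     # 7: the same element accessed by name
```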

Matrix

A matrix is an extension of the numeric or character vectors. It is simply an atomic vector with dimensions.
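
To see that a matrix is just an atomic vector with a dim attribute, one can add dimensions to a plain vector. A small base R sketch:

```r
v <- 1:6
dim(v) <- c(2, 3)              # adding dimensions turns the vector into a matrix
is.matrix(v)                   # TRUE

m <- matrix(1:6, nrow = 2)     # equivalent constructor; fills column by column
identical(v, m)                # TRUE
m[2, 3]                        # 6: row 2, column 3
```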

Data Frames

A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list). One can check that is.list(df) returns TRUE.

Data frames can have additional attributes such as rownames().

Creation

  • from imported data: read.csv(), read.table()

  • from vectors: data.frame(id = letters[1:10], x = 1:10, y = 11:20)

Functions

  • shape: dim(), nrow(), ncol(), str()

  • preview data: head(), tail(), str()

  • names() or colnames(), na.omit()

  • sapply(df, fun): run fun() function for each column of df

Indexing

  • by column name: df$colname

  • df[index, 'col1'] returns a vector when a single column is selected; add drop = FALSE to keep a data frame

  • df[index, ]$col1, vector

  • df[row_num, col_num], or df[vec_of_num, vec_of_num]
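
The indexing forms above return different kinds of objects. A minimal sketch on a small base R data frame built for illustration:

```r
df <- data.frame(id = letters[1:5], x = 1:5, y = 11:15)

is.list(df)                    # TRUE: a data frame is a list of columns
df$x                           # integer vector 1 2 3 4 5
df[2:3, "x"]                   # vector 2 3 (a single column is dropped to a vector)
df[2:3, "x", drop = FALSE]     # 2 x 1 data frame
df[2:3, c("x", "y")]           # 2 x 2 data frame
df[2:3, ]$x                    # vector 2 3
df[2, 3]                       # 12: row 2, column 3
```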

Data cleaning

  • is.na()

  • anyNA()
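
A small sketch of how these checks might be combined with na.omit(), using a made-up data frame with missing values:

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

anyNA(df$x)                        # TRUE: x has at least one missing value
n_missing <- colSums(is.na(df))    # missing values per column: x = 1, y = 1
clean <- na.omit(df)               # keep only the complete rows
nrow(clean)                        # 1
```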

Factors

Factors are how R stores categorical variables: the character values are encoded as integers that index a fixed set of labels. Each category is called a ‘level’.

x = factor(c('a', 'b', 'c', 'a'))
str(x) # Factor w/ 3 levels "a","b","c": 1 2 3 1
typeof(x) # [1] "integer"
levels(x) # [1] "a" "b" "c"
nlevels(x) # 3

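By default, R sorts levels in alphabetical order. To specify the order, pass it in levels; add ordered = TRUE to make the factor ordered, so that comparisons such as min() and max() are defined.

chili <- factor(c("hotter", "hot", "hotter", "hottest", "hot", "hot"))
chili <- factor(chili, levels = c("hot", "hotter", "hottest"), ordered = TRUE) # reassign to keep the ordering
min(chili) # hot
max(chili) # hottest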

Data Visualization via ggplot

Syntax

Basic syntax is as below:

ggplot(data = [dataset],
       mapping = aes(x = [x-variable],
                     y = [y-variable])) +
    geom_xxx() +
    other_options

For instance, scatter plot (point plot) of two variables

ggplot(data = starwars,
       mapping = aes(x = height,
                     y = mass)) +
    geom_point()

Aesthetic Mapping

To display values, map variables in the data to visual properties of the geom (aesthetics) in your plot, e.g. x and y location, size, shape, color etc. The right-hand sides in aes() are usually column names of the dataset.

mapping = aes(x = height,
              y = mass,
              color = sex)

Warning

Putting aes() in different places can affect the final output.

In the code below, only one smoothing line is fitted. The aesthetic mapping aes(color=sex) is at the geom_point() level, so only the points are colored according to the corresponding sex.

ggplot(data = starwars,
       mapping = aes(x = height, y = mass)) +
  geom_point(aes(color=sex)) +
  geom_smooth(se = FALSE) +
  xlim(80, 250) +
  ylim(0, 180)

In the code below, the aesthetic mapping aes(color=sex) is at the ggplot() level, so the data are partitioned by sex and every subsequent geom is drawn for each part. As a result, one smoothing line is fitted per sex.

ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color=sex)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  xlim(80, 250) +
  ylim(0, 180)

| Aesthetic | Discrete | Continuous |
|---|---|---|
| color | rainbow of colors | gradient |
| size | discrete steps | linear mapping between radius and value |
| shape | different shape for each value | not applicable |

Use of geom

Numerical data

  • density: geom_density()

  • histogram: geom_histogram(binwidth = 10)

  • smoothing line: geom_smooth(se = FALSE) + xlim() + ylim()

Categorical data

  • bar plot: geom_bar()

Multivariate

  • box plot: geom_boxplot() where y is numeric and x is categorical

  • scatter plot: geom_point()

Other Options

  • y-axis label: + labs(y = 'ylabel')

  • create sub plots conditioning on some discrete variables

    • facet_grid(): 2d grid, rows ~ cols, . for no split

    • facet_wrap(): 1d ribbon wrapped into 2d, useful when the condition variable takes too many values.

      • + facet_wrap(. ~ x_discrete)

      • + facet_wrap(y_discrete ~ .)
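
The options above could be combined as in the sketch below. It assumes the ggplot2 package is installed and uses the built-in mtcars data rather than starwars; the choice of wt, mpg, am and cyl is purely illustrative.

```r
library(ggplot2)   # assumption: ggplot2 is installed

# 2d grid of panels: rows split by am, columns split by cyl
p_grid <- ggplot(data = mtcars,
                 mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(y = "miles per gallon") +
  facet_grid(am ~ cyl)

# 1d ribbon of panels, one per value of cyl, wrapped into rows
p_wrap <- ggplot(data = mtcars,
                 mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)
```

Printing p_grid or p_wrap (for instance by typing its name at the console) draws the corresponding plot.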

Data Manipulation via dplyr

Consider a set of nested functions expressing a sequence of actions

park(drive(start_car(find("keys")), to = "campus"))

We need to read from the inside. Is there a more friendly way?

Introduction to %>%

The pipe operator %>%, from the magrittr package, expresses the same sequence in a more natural (and easier to read) way.

find("keys") %>%
start_car() %>%
drive(to = "campus") %>%
park()

By default, it sends the result of the LHS function as the first argument to the RHS function, to form a flow of results. To send results to a function argument other than first one, or to use the previous result for multiple arguments, use .:

starwars %>%
filter(species == "Human") %>%
lm(mass ~ height, data = .)
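
A self-contained variant of the same idea, assuming the magrittr package is installed and using the built-in mtcars data in place of starwars (an assumption made only so the sketch runs without dplyr):

```r
library(magrittr)   # assumption: magrittr is installed

# The LHS becomes the first argument of the RHS function
total <- c(1, 4, 9) %>% sqrt() %>% sum()   # sum(sqrt(c(1, 4, 9))) = 6

# Use . to place the LHS in another argument position:
# data is not the first argument of lm()
fit <- mtcars %>% lm(mpg ~ wt, data = .)
class(fit)                                  # "lm"
```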

Calling a dplyr verb always returns a new data frame; it does not alter the existing one. To store the output, assign it to a name at the start of the pipeline:

output <- data %>%
          ... %>%
          ...

If we want to store intermediate results, use {. ->> obj}

c <- cars %>%
    mutate(var1 = dist*speed) %>%
    {. ->> b } %>%   # save the intermediate result as b
    summary()

Data Wrangling

  • select: pick columns by name (no quotes)

    • select(col1, col2)

    • To add a chunk of columns use the start_col:end_col syntax: select(col1:col3, col5:col7)

    • Not select columns: select(-col1, -(col3:col5))

    • one_of: the column names can be stored in a character vector cols, then passed in select(one_of(cols)). Names in cols that are not in the data set are skipped with a warning. one_of is superseded by any_of (silently skips missing names) and all_of (errors on missing names); also see the other select helpers.

    • select(col2, col3, everything()) moves col2 and col3 to the front; everything() selects all remaining columns.

    • select(col1_name=col1) renames col1 but keeps only the selected columns; use rename(col1_name=col1) to rename and keep all columns.

  • select_if: applies a predicate function to each column, like sapply, and keeps the columns for which it returns TRUE

    • select_if(~is.numeric(.)) or in short select_if(is.numeric) selects columns with numerical values. For negation, use select_if(~!is.numeric(.)) or select_if(funs(!is.numeric(.))). Note that funs() is deprecated in recent dplyr versions, so the ~ form is preferable.

    • select_if(funs(mean(.) > 4)) or select_if(~mean(.) > 4) selects columns with column mean > 4. mean > 4 is not a function in itself, so you need the tilde ~ (or funs()) to turn it into one.

    • together, select_if(~is.numeric(.) & mean(.,na.rm=TRUE) > 4)

  • select_all: applies a function to the names of all columns, e.g. select_all(tolower) converts every column name to lower case. The function can be passed directly or wrapped inside funs().

  • slice: pick rows using index(es)

    • slice(c(7,8,14:15))

    • slice(-c(7,8,14:15))

  • filter: pick rows matching criteria

    • conditions can be combined with & (AND), | (OR) and ! (NOT), as in the styles below

    • filter(condition1, condition2) is equivalent to using AND

    • filter(condition1, !condition2)

    • filter(condition1 | condition2)

    • filter(xor(condition1, condition2)) returns all rows where exactly one of the conditions is met, and not those where both are met.

  • arrange: order rows by values of column(s), default ascending.

    • arrange(col1)

    • arrange(desc(col1), col2)

    • If col1 in desc(col1) is a character variable, then sort from Z to A.

  • rename: rename specific columns

    • rename(col1_new_name=col1, col2_new_name=col2)

  • mutate: add new columns

    • New columns can be computed with functions such as mean, median, max, min, sd, and ifelse

    • mutate(col1 = col1 - mean(col1))

    • mutate(col1or2 = ifelse(col1 > col2, 'col1', 'col2'))

    • mutate(col2 = case_when(col1>80 ~ 'great', col1>60 ~ 'good', TRUE~'others')) works like a multi-level ifelse.

  • mutate_at applies a function to one or several columns, like sapply. It overwrites those columns rather than creating new ones.

    • mutate_at(c('col1', 'col2'), ~ .-mean(.)), note that column names are in quotes.

    • can also use %>% inside, mutate_at(cols, ~ (.-mean(.)) %>% round(2)). The parentheses are needed because %>% binds more tightly than -.

  • transmute_at returns a new data frame with only the mutated variables.

  • bind_cols, bind_rows: bind multiple data frames by column and by row, respectively.

    • %>% bind_cols(., df2)

  • count: return frequency of discrete variables. Add sort=TRUE to sort by counts.

    • %>% count(col1, col2, sort=TRUE)

  • group_by: create groups of rows that share the same values in one or more columns. Use ungroup to remove the grouping.

  • summarise: apply computations across groups of rows, used with aggregation functions

    • n, n_distinct, sum, max, min, mean, median, sd

  • gather: make wide data longer

    • gather(key=new_key_col_name, value=new_value_col_name, -old_index)

  • spread: make long data wider

    • spread(key=key_col_name, value=value_col_name)

  • separate: separate a character column to two by a separator.

    • separate(col, c('new-col-1', 'new-col-2'), sep=' ')

  • unite: unite two columns to one by separator.

    • unite('new-col', c(col1, col2), sep=' ')

  • map (from purrr): apply a function to each element of a list or vector (for a data frame, to each column), related map_dbl, map_dfr

  • pmap (from purrr): apply a function to the parallel elements of several lists (for a data frame, row by row), related pmap_dbl, pmap_dfr

  • *_join where * can be inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates which rows to keep.

    • *_join(df2, by='key'). If column names differ, use by=c("df1.name"="df2.name")
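
Several of the verbs above can be chained into one pipeline. The sketch below assumes the dplyr package is installed; the scores and teachers data frames and their columns are made up for illustration.

```r
library(dplyr)   # assumption: dplyr is installed

# Two small, made-up data frames
scores <- data.frame(
  student = c("ann", "bob", "cat", "dan"),
  class   = c("A", "A", "B", "B"),
  score   = c(90, 70, 85, 55)
)
teachers <- data.frame(class = c("A", "B"), teacher = c("Lee", "Kim"))

result <- scores %>%
  filter(score > 60) %>%                              # drops dan (55)
  mutate(grade = case_when(score > 80 ~ "great",
                           TRUE       ~ "good")) %>%  # multi-level ifelse
  left_join(teachers, by = "class") %>%               # adds the teacher column
  arrange(desc(score))                                # highest score first

by_class <- scores %>%
  group_by(class) %>%
  summarise(n = n(), mean_score = mean(score)) %>%    # one row per class
  ungroup()
```

Here result keeps three students ordered by score, and by_class has one row per class with the count and the mean score. The original scores data frame is left unchanged.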

Comparison with SQL

Translate dplyr commands to SQL using translate_sql from the dbplyr package.

| Objective | R | SQL |
|---|---|---|
| select columns | select(col1, col2) | SELECT col1, col2 |
| filter | filter(col1 > 2) | WHERE col1 > 2 |
| order rows | arrange(col1, desc(col2)) | ORDER BY col1, col2 DESC |

Pros

By commenting out a line of the pipeline, it is easy to see that step's effect on the final result, e.g. the effect of a filter.

Comparison with Python

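R code is typically function-centric: functions are applied to data and chained with pipes. Python code is typically object-oriented: methods are called on objects.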

R

find('key') %>%
  start('car') %>%
  drive(to='campus') %>%
  park()

Python

key.find()
car.start()
car.drive(to='campus')
car.park()

Functions

| Objective | Python | R | Remark |
|---|---|---|---|
| function help | help(function), or function? in IPython | ?function | |
| append to array | mylist.append(x) | myvec = c(myvec, x) | |
| create a sequence | range(start, stop, step) or range(10) | seq(from, to, by) or seq(10) | Python is zero-indexed and excludes stop; R is one-indexed and includes to |
| add y-axis label | plt.ylabel('ylabel') | plot(..., ylab = 'ylabel') or ... + labs(y = 'ylabel') | |
| add title | plt.title('title') | title('title') | |
| length of a string | len(my_string) | nchar(my_string) | |
| array comparison | np.isin(a, b), np.isin(a, b, invert=True) | a %in% b, !(a %in% b) | returns a logical array of the same shape as a |
| add a column | df['new_col'] = ... | df$new_col = ... or df = data.frame(df, new_col_name = ...) | |
| delete a column | df.drop(['B', 'C'], axis=1) or df.drop(columns=['B', 'C']) | df$col <- NULL or df = df[, -col_index] | |
| delete rows | df.drop([0, 1]) | df[-c(1, 2), ] | R is one-indexed, so the first two rows are 1 and 2 |
| check missing | df.isnull() | is.na(df) | |
| min index | np.argmin() | which.min() | |
| frequency of discrete values in a vector | pd.Series.value_counts() | table(), or summary() for a factor | |
| default array layout | row-major (‘C’) | column-major (‘F’/Fortran) | mind recycling (broadcasting) in R |
| array stack | np.vstack((...)), np.hstack((...)) | rbind(), cbind() | |
| array extension | list1.extend(list2), np.hstack((ary1, ary2)) | c(v1, v2) | In numpy the default format is row-major, so use hstack. |
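
A few of the R entries from the comparison can be checked directly in base R. The vectors a and b are made up for this sketch.

```r
a <- c(3, 1, 4, 1, 5)
b <- c(1, 5)

a %in% b              # FALSE TRUE FALSE TRUE TRUE
!(a %in% b)           # TRUE FALSE TRUE FALSE FALSE
which.min(a)          # 2: index of the first minimum (1-based)
seq(2, 10, by = 4)    # 2 6 10: unlike Python's range(), the end point is included
nchar("hello")        # 5
counts <- table(a)    # frequency of each distinct value; the value 1 appears twice
ab <- c(a, b)         # concatenation: 3 1 4 1 5 1 5
```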

Miscellaneous

= vs <-

  • in most cases, they are equivalent

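  • unlike =, the arrow also works in the other direction (value -> name), which is convenient at the end of a line

  • inside a function call, arguments must be named with =, as in fun(x=...); fun(x <- ...) would instead create a variable x in the calling environment.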
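
The differences can be seen in a few lines of base R. This is a minimal sketch with illustrative names (x, y, z, w, m1, m2).

```r
x = 1          # assignment with =
y <- 2         # assignment with <-
3 -> z         # rightward assignment, only available in arrow form

# Inside a call, = names an argument and does not create or change a variable
m1 <- mean(x = c(1, 2, 3))
x              # still 1

# <- inside a call assigns w in the calling environment,
# then passes the value on as a positional argument
m2 <- mean(w <- c(1, 2, 3))
w              # 1 2 3
```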