R¶
Data Structures¶
R has many data structures, which include
- atomic vector 
- list 
- matrix 
- data frame 
- … 
To check and convert data type:
- class()high-level data type/structures: data frame, matrix, etc
- typeof()low-level data type: character, logical, double, integer …
- is.sth(): is numeric? character?
- as.sth(): convert to numeric, character etc.
- identical(x1, x2)
Comparison of the four data structures
| Dimensions | Homogenous | Heterogeneous | 
|---|---|---|
| 1-D | atomic vector | list | 
| 2-D | matrix | data frame | 
Vector¶
A vector is the most common and basic data structure in R. A vector is a collection of elements that are of one (homogenous) mode character, logical, integer or numeric.
creation
- empty vector: - vector()(By default the mode is logical).- vector("character", length = 5)creates a vector of mode ‘character’ with 5 elements.
- character(5)same thing
- numeric(5)a numeric vector with 5 elements
- logical(5)# a logical vector with 5 elements
 
- integer vs numeric - c(1, 2): double precision real numbers
- c(1L, 2L): integer, or use- as.integer()
 
- character - c('a', 'b'), logical, etc.
Note
The contents of a list are not restricted to a single mode. If multiple types are inside a vector c(1, 'a'), R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”.
manual coercion: as.integer(), as.character(), etc.
List¶
A list can contains elements of different types. Lists are sometimes called generic vectors, because the elements of a list can by of any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.
Creation
- from vector: - as.list(1:10)
- from input: - pie = list(type="key lime", diameter=7, is.vegetarian=TRUE)
Indexing
- by element name: - pie$diameter
- by brackets: - pie[1]the first element,- pie[[1]]the first element in the first element.
Matrix¶
A matrix is an extension of the numeric or character vectors. It is simply an atomic vector with dimensions.
Data Frames¶
A data frame is a special type of list where every element of the list has same length (i.e. data frame is a “rectangular” list). One can check is.list(df) returns TRUE.
Data frames can have additional attributes such as rownames().
Creation
- from imported data: - read.csv(), read.table()
- from vectors: - data.frame(id = letters[1:10], x = 1:10, y = 11:20)
Functions
- shape: - dim(), nrow(), ncol(), str()
- preview data: - head(), tail(), str()
- names()or- colnames(),- na.omit()
- sapply(df, fun): run- fun()function for each column of df
Indexing
- by column name: - df$colname
- df[index, col1], still matrix
- df[index, ]$col1, vector
- df[row_num, col_num], or- df[vec_of_num, vec_of_num]
Data cleaning
- is.na()
- anyNA()
Factors¶
Factor objects are how R stores categorical variables, which is character data into fixed numbers of integers. Each category is called a ‘level’.
x = factor(c('a', 'b', 'c', 'a'))
str(x) # Factor w/ 3 levels "a","b","c": 1 2 3 1
typeof(x) # [1] "integer"
levels(x) # [1] "a" "b" "c"
nlevels(x) # 3
By default, R always sorts levels in alphabetical order. To specify levels, use ordered = TRUE.
chili <- factor(c("hotter", "hot", "hotter", "hottest", "hot", "hot"))
factor(chili, levels = c("hot", "hotter", "hottest"), ordered = TRUE)
min(chili) # hot
max(chili) # hottest
Data Visualization via ggplot¶
Syntax¶
Basic syntax is as below:
ggplot(data = [dataset],
       mapping = aes(x = [x-variable],
                     y = [y-variable]) +
    geom_xxx() +
    other_options
For instance, scatter plot (point plot) of two variables
ggplot(data = starwars,
       mapping = aes(x = height,
                     y = mass)) +
    geom_point()
Aesthetic Mapping¶
To display values, map variables in the data to visual properties of the geom (aesthetics) in your plot, e.g. x and y location, size, shape, color etc. Most of the RHS variables are from data.
mapping = aes(x = height,
              y = mass,
              color = sex))
Warning
Putting aes() in different places can affect the final output.
In the coe below, only one smoothing line is fitted. The aesthetic mapping aes(color=sex) is in the geom_point() plot level. Points will have different colors according to the corresponding sex.
ggplot(data = starwars,
       mapping = aes(x = height, y = mass)) +
  geom_point(aes(color=sex)) +
  geom_smooth(se = FALSE) +
  xlim(80, 250) +
  ylim(0, 180)
In the code below, the aesthetic mapping aes(color=sex) is in the ggplot() input level, so the data sets are partitioned into parts by sex, and R runs the next geom commands for each part.
ggplot(data = starwars,
       mapping = aes(x = height, y = mass, color=sex)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  xlim(80, 250) +
  ylim(0, 180)
| aesthetics | discrete | continuous | 
|---|---|---|
| color | rainbow of colors | gradient | 
| size | discrete steps | linear mapping between radius and value | 
| shape | different shape for each | NA | 
Use of geom¶
Numerical data
- density: - geom_density()
- histogram: - geom_histogram(binwidth = 10)
- smoothing line: - geom_smooth(se = FALSE) + xlim() + ylim()
Categorical data
- bar plot: - geom_bar()
Multivariate
- box plot: - geom_boxplot()where- yis numeric and- xis categorical- many options, see documentation. 
 
- scatter plot: - geom_point()
Other Options¶
- y-axis label: - + labs(y = 'ylabel')
- create sub plots conditioning on some discrete variables - facet_grid(): 2d grid,- rows ~ cols,- .for no split
- facet_wrap(): 1d ribbon wrapped into 2d, useful when the condition variable takes too many values.- + facet_wrap(. ~ x_discrete)
- + facet_wrap(y_discrete ~ .)
 
 
Data Manipulation via dplyr¶
Consider a set of nested functions expressing a sequence of actions
park(drive(start_car(find("keys")), to =
"campus"))
We need to read from the inside. Is there a more friendly way?
Introduction to %>%¶
The pipes operator %>%, in package magrittr, implements these functions in a more natural (and easier to read) way.
find("keys") %>%
start_car() %>%
drive(to = "campus") %>%
park()
By default, it sends the result of the LHS function as the first argument to the RHS function, to form a flow of results. To send results to a function argument other than first one, or to use the previous result for multiple arguments, use .:
starwars %>%
filter(species == "Human") %>%
lm(mass ~ height, data = .)
Calling dplyr verbs always outputs a new data frame, it does not alter
the existing data frame. To store the output, add it at the front
output <- data %>%
          ... %>%
          ...
If we want to store intermediate results, use {. ->> obj}
c <- cars %>%
    mutate(var1 = dist*speed) %>%
    {. ->> b } %>%   # here is save
    summary()
Data Wrangling¶
- select: pick columns by name (no quotes)- select(col1, col2)
- To add a chunk of columns use the - start_col:end_colsyntax:- select(col1:col3, col5:col7)
- Not select columns: - select(-col1, -(col3:col5))
- one_of: the column names can be stored in a vector- cols, then passed in- select(one_of(cols)). This ignores the column name in- colthat is not in that of the data set. Also see- any_of,- all_of, and other select helpers.
- select(col2, col3, everything())where- everything()selects all columns after- col3
- select(col1_name=col1)to rename, or use- rename(col1_name=col1)
 
- select_if: uses takes in each column same as- sapply- select_if(~is.numeric(.))or in short- select_if(is.numeric)selects columns with numerical values. For negation, use- select_if(~!is.numeric(.))or- select_if(funs(!is.numeric(.))).
- select_if(funs(mean(.) > 4))or- select_if(~mean(.) > 4)select columns with column mean > 4.- mean > 4is not a function in itself, so you will need to add a tilde- ~.
- together, - select_if(~is.numeric(.) & mean(.,na.rm=TRUE) > 4)
 
- select_all()function allows changes to all columns, and takes a function as an argument, e.g.- select_all(tolower)upfront, or wrap it inside funs()
- slice: pick rows using index(es)- slice(c(7,8,14:15))
- slice(-c(7,8,14:15))
 
- filter: pick rows matching criteria- can use - AND, ORand- NOT, or below styles
- filter(condition1, condition2)is equivalent to using- AND
- filter(condition1, !condition2)
- filter(condition1 | condition2)
- filter(xor(condition1, condition2)will return all rows where only one of the conditions is met, and not when both conditions are met.
 
- arrange: order rows by values of column(s), default ascending.- arrange(col1)
- arrange(desc(col1), col2)
- If - col1in- desc(col1)is a character variable, then sort from Z to A.
 
- rename: rename specific columns- rename(col1_new_name=col1, col2_new_name=col2)
 
- mutate: add new columns- New columns can be made with aggregate functions such as - average,- median,- max,- min,- sd, and- ifelse
- mutate(col1 = col1 - mean(col1))
- mutate(col1or2 = ifelse(col1 > col2, 'col1', 'col2'))
- mutate(col2 = case_when(col1>80 ~ 'great', col1>60 ~ 'good', TRUE~'others')works like multi-level- ifelse.
 
- mutate_atapplies a function to one or several columns, like- sapply. It does not create new columns.- mutate_at(c('col1', 'col2'), ~ .-mean(.)), note that column names are in quotes.
- can also use - %>%inside,- mutate_at(cols, ~ .-mean(.) %>% round(2)).
 
- transmute_atreturn a new data frame with only the mutated variables.
- bind_cols,- bind_rows: bind multiple data frames by row and column.- %>% bind_cols(., df2)
 
- count: return frequency of discrete variables. Add- sort=TRUEto sort by counts.- %>% count(col1, col2, sort=TRUE)
 
- group_by: create groups of rows according to a condition. Use- ungroupto recover
- summarise: apply computations across groups of rows, used with aggregation functions- n,- n_distinct,- sum,- max,- min,- mean,- median,- sd
 
- gather: make wide data longer- gather(key=new_key_col_name, value=new_value_col_name, -old_index)
 
- spread: make long data wider- spread(key=key_col_name, value=value_col_name)
 
- separate: separate a character column to two by a separator.- separate(col, c('new-col-1', 'new-col-2'), sep=' ')
 
- unite: unite two columns to one by separator.- unite('new-col', c(col1, col2), sep=' ')
 
- map: apply computations across columns, related- map_dbl,- map_dfr
- pmap: apply computations across rows, related- pmap_dbl,- pmap_dfr
- *_joinwhere- *can be- inner,- left,- right, or- full: join two data frames together according to common values in certain columns, and- *indicates how many rows to keep.- *_join(df2, by='key'). If column names differ, use- by=c("df1.name"="df2.name")
 
Comparison with SQL¶
Translate commands to SQL using translate_sql.
| Objective | R | SQL | Remark | 
|---|---|---|---|
| select columns | 
 | 
 | |
| filter | 
 | 
 | |
| order rows | 
 | 
 | 
Comparison with Python¶
R is procedural, while Python is object-oriented.
R
find('key') %>%
  start('car') %>%
  drive(to='campus') %>%
  park()
Python
key.find()
car.start()
car.drive(to='campus')
car.park()
Functions¶
| Objective | Python | R | Remark | 
|---|---|---|---|
| function help | 
 | 
 | |
| append to array | 
 | 
 | |
| create a sequence | 
 | 
 | python is zero indexed | 
| add y-axis label | 
 | 
 | |
| add title | 
 | 
 | |
| length of a string | 
 | 
 | |
| arrays comparison | 
 | 
 | returns a logical array of the same shape as  | 
| add a column | 
 | 
 | |
| delete a column | 
 | 
 | |
| delete rows | 
 | 
 | |
| check missing | 
 | 
 | |
| min index | 
 | 
 | |
| frequency of discrete values in a vector | 
 | 
 | |
| default vector format | row-major (‘C’) | column-major (‘F’/fortran) | mind broadcasting in  | 
| array stack | 
 | 
 | |
| array extension | 
 | 
 | In numpy the default format is row-major, so use  | 
Miscellaneous¶
= vs <-
- in most cases, they are equivalent 
- unlike - =, the other direction- ->also works, which is convenient at the end of a line
- inside function argument assignment - fun(x=...), must use- =.