R¶
Data Structures¶
R has many data structures, which include
atomic vector
list
matrix
data frame
…
To check and convert data type:
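class(): high-level data type/structure: data frame, matrix, etc.
typeof(): low-level data type: character, logical, double, integer …
is.sth(): test the type, where sth stands for a type, e.g. is.numeric(), is.character()
as.sth(): convert to a type, e.g. as.numeric(), as.character()
identical(x1, x2): test whether two objects are exactly equal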
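As a rough illustration of the checking and conversion functions listed above, here is a minimal base R sketch. The variable names x, y and df are made up for this example.

```r
# A numeric vector: high-level class vs. low-level storage type
x <- c(1.5, 2, 3)
class(x)        # "numeric"
typeof(x)       # "double"
is.numeric(x)   # TRUE

# Convert to character and back again
y <- as.character(x)
identical(x, as.numeric(y))   # TRUE: the round trip recovers x here

# A data frame is a high-level structure stored as a list
df <- data.frame(a = 1:3)
class(df)       # "data.frame"
typeof(df)      # "list"
```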
Comparison of the four data structures
| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1-D | atomic vector | list |
| 2-D | matrix | data frame |
Vector¶
A vector is the most common and basic data structure in R. A vector is a collection of elements that all share one (homogeneous) mode: character, logical, integer, or numeric.
creation
empty vector:
vector() (by default the mode is logical)
vector("character", length = 5) creates a vector of mode ‘character’ with 5 elements
character(5) does the same thing
numeric(5) creates a numeric vector with 5 elements
logical(5) creates a logical vector with 5 elements
integer vs numeric
c(1, 2): double-precision real numbers
c(1L, 2L): integer, or use as.integer()
character
c('a', 'b'), logical, etc.
Note
Unlike a vector, the contents of a list are not restricted to a single mode. If elements of several types are combined in one vector, e.g. c(1, 'a'), R creates a vector whose mode can accommodate all of the elements. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”.
manual coercion: as.integer(), as.character(), etc.
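The sketch below shows implicit and manual coercion in base R. All variable names are illustrative only.

```r
# Implicit coercion: R picks the most general mode among the elements
mixed <- c(1, "a")      # numeric + character -> character
typeof(mixed)           # "character"

flags <- c(TRUE, 2L)    # logical + integer -> integer
typeof(flags)           # "integer"

nums <- c(1L, 2.5)      # integer + double -> double
typeof(nums)            # "double"

# Manual coercion: values that cannot be converted become NA (with a warning)
ok  <- as.integer("42")
bad <- suppressWarnings(as.integer("abc"))
is.na(bad)              # TRUE
```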
List¶
A list can contain elements of different types. Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.
Creation
from vector:
as.list(1:10)
from input:
pie = list(type="key lime", diameter=7, is.vegetarian=TRUE)
Indexing
by element name:
pie$diameter
by brackets:
pie[1] returns a list containing the first element; pie[[1]] returns the first element itself.
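A small sketch of the difference between single and double brackets, reusing the pie list from above. The nested list is a made-up example.

```r
pie <- list(type = "key lime", diameter = 7, is.vegetarian = TRUE)

# By name
pie$diameter           # 7
pie[["diameter"]]      # same element, selected with a string

# Single bracket keeps the list wrapper; double bracket extracts the element
sub   <- pie[1]        # a list of length 1
first <- pie[[1]]      # the character string "key lime"
class(sub)             # "list"
class(first)           # "character"

# Lists can contain further lists
nested <- list(a = list(b = 1:3))
nested$a$b[2]          # 2
```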
Matrix¶
A matrix is an extension of the numeric or character vectors. It is simply an atomic vector with dimensions.
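To make "an atomic vector with dimensions" concrete, here is a minimal sketch. The names v, m and m2 are only for illustration.

```r
# Start from a plain atomic vector and give it dimensions
v <- 1:6
m <- v
dim(m) <- c(2, 3)      # 2 rows, 3 columns, filled column by column

# The usual constructor produces the same values
m2 <- matrix(1:6, nrow = 2, ncol = 3)
all(m == m2)           # TRUE

m[2, 3]                # 6: row 2, column 3
typeof(m)              # "integer": the underlying vector type is unchanged
```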
Data Frames¶
A data frame is a special type of list in which every element has the same length (i.e. a data frame is a “rectangular” list). One can check that is.list(df) returns TRUE.
Data frames can have additional attributes such as rownames().
Creation
from imported data:
read.csv(), read.table()
from vectors:
data.frame(id = letters[1:10], x = 1:10, y = 11:20)
Functions
shape:
dim(), nrow(), ncol(), str()
preview data:
head(), tail(), str()
names() or colnames(), na.omit()
sapply(df, fun): run the function fun() on each column of df
Indexing
by column name:
df$colname
df[index, col1]: stays rectangular for a tibble or with drop = FALSE; a base data frame drops a single column to a vector
df[index, ]$col1: vector
df[row_num, col_num], or df[vec_of_num, vec_of_num]
Data cleaning
is.na()
anyNA()
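The following base R sketch ties together creation, shape, indexing and missing-value checks. It reuses the data.frame() call from above; the other names are illustrative.

```r
df <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

is.list(df)            # TRUE: a data frame is a rectangular list
dim(df)                # 10 rows, 3 columns
sapply(df, class)      # class of each column

# Indexing a single column of a base data frame
col_vec <- df[1:3, "x"]                  # dropped to a vector
col_df  <- df[1:3, "x", drop = FALSE]    # kept as a data frame

# Missing values
df$y[2] <- NA
anyNA(df)              # TRUE
clean <- na.omit(df)   # drops the row containing NA
nrow(clean)            # 9
```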
Factors¶
Factors are how R stores categorical variables: character data is encoded as a fixed set of integer codes. Each category is called a ‘level’.
x = factor(c('a', 'b', 'c', 'a'))
str(x) # Factor w/ 3 levels "a","b","c": 1 2 3 1
typeof(x) # [1] "integer"
levels(x) # [1] "a" "b" "c"
nlevels(x) # 3
By default, R sorts levels in alphabetical order. To specify the order, pass levels; add ordered = TRUE to make the factor ordered, so that comparisons such as min() and max() work.
chili <- factor(c("hotter", "hot", "hotter", "hottest", "hot", "hot"))
chili <- factor(chili, levels = c("hot", "hotter", "hottest"), ordered = TRUE)
min(chili) # hot
max(chili) # hottest
Data Visualization via ggplot¶
Syntax¶
Basic syntax is as below:
ggplot(data = [dataset],
mapping = aes(x = [x-variable],
y = [y-variable])) +
geom_xxx() +
other_options
For instance, a scatter plot (point plot) of two variables:
ggplot(data = starwars,
mapping = aes(x = height,
y = mass)) +
geom_point()
Aesthetic Mapping¶
To display values, map variables in the data to visual properties of the geom (aesthetics) in your plot, e.g. x and y location, size, shape, color, etc. The right-hand side of each mapping is usually a variable in the data.
mapping = aes(x = height,
y = mass,
color = sex))
Warning
Putting aes() in different places can affect the final output.
In the code below, only one smoothing line is fitted. The aesthetic mapping aes(color=sex) is set at the geom_point() level, so only the points are colored according to sex.
ggplot(data = starwars,
mapping = aes(x = height, y = mass)) +
geom_point(aes(color=sex)) +
geom_smooth(se = FALSE) +
xlim(80, 250) +
ylim(0, 180)
In the code below, the aesthetic mapping aes(color=sex) is set at the ggplot() level, so the data set is partitioned by sex and each subsequent geom is drawn separately for each part: the points are colored by sex, and one smoothing line is fitted per sex.
ggplot(data = starwars,
mapping = aes(x = height, y = mass, color=sex)) +
geom_point() +
geom_smooth(se = FALSE) +
xlim(80, 250) +
ylim(0, 180)
| aesthetics | discrete | continuous |
|---|---|---|
| color | rainbow of colors | gradient |
| size | discrete steps | linear mapping between radius and value |
| shape | different shape for each | NA |
Use of geom¶
Numerical data
density:
geom_density()
histogram:
geom_histogram(binwidth = 10)
smoothing line:
geom_smooth(se = FALSE) + xlim() + ylim()
Categorical data
bar plot:
geom_bar()
Multivariate
box plot:
geom_boxplot(), where y is numeric and x is categorical; many options, see the documentation.
scatter plot:
geom_point()
Other Options¶
y-axis label:
+ labs(y = 'ylabel')
create subplots conditioning on some discrete variables:
facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d, useful when the conditioning variable takes many values.
+ facet_wrap(. ~ x_discrete)
+ facet_wrap(y_discrete ~ .)
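As one possible combination of a geom, axis labels and faceting, here is a sketch using the built-in mtcars data set rather than starwars. The choice of data set and variables is an assumption made only to keep the example self-contained.

```r
library(ggplot2)

# Box plot of a numeric y (mpg) against a categorical x (cyl),
# with one panel per value of gear
p <- ggplot(data = mtcars,
            mapping = aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "cylinders", y = "miles per gallon") +
  facet_wrap(~ gear)

# print(p) draws the plot in an interactive session
```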
Data Manipulation via dplyr¶
Consider a set of nested functions expressing a sequence of actions
park(drive(start_car(find("keys")), to = "campus"))
We need to read it from the inside out. Is there a friendlier way?
Introduction to %>%¶
The pipe operator %>%, from the magrittr package, expresses this sequence of calls in a more natural (and easier to read) way.
find("keys") %>%
start_car() %>%
drive(to = "campus") %>%
park()
By default, it sends the result of the LHS as the first argument of the RHS function, forming a flow of results. To send the result to an argument other than the first one, or to use it for multiple arguments, use the placeholder .:
starwars %>%
filter(species == "Human") %>%
lm(mass ~ height, data = .)
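For a version that runs without the starwars data, the sketch below uses the built-in cars data set and only the magrittr package. The variable names are illustrative.

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested call, read from the inside out
nested <- sqrt(sum(values))

# Same computation as a pipeline, read left to right
piped <- values %>% sum() %>% sqrt()

# The dot sends the LHS to an argument other than the first one
fit <- cars %>% lm(dist ~ speed, data = .)
coef(fit)
```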
Calling a dplyr verb always returns a new data frame; it does not alter
the existing data frame. To store the output, assign it at the start of the pipeline:
output <- data %>%
... %>%
...
If we want to store intermediate results, use {. ->> obj}
c <- cars %>%
mutate(var1 = dist*speed) %>%
{. ->> b } %>% # save the intermediate result as b
summary()
Data Wrangling¶
select: pick columns by name (no quotes)
  select(col1, col2)
  To add a chunk of columns, use the start_col:end_col syntax: select(col1:col3, col5:col7)
  To exclude columns: select(-col1, -(col3:col5))
  one_of: the column names can be stored in a vector cols, then passed in as select(one_of(cols)). Names in cols that are not in the data set are ignored. Also see any_of, all_of, and other select helpers.
  select(col2, col3, everything()), where everything() selects all remaining columns after col3
  select(col1_name=col1) to rename, or use rename(col1_name=col1)
select_if: applies a predicate to each column, like sapply
  select_if(~is.numeric(.)), or in short select_if(is.numeric), selects columns with numerical values. For negation, use select_if(~!is.numeric(.)) or select_if(funs(!is.numeric(.))).
  select_if(funs(mean(.) > 4)) or select_if(~mean(.) > 4) selects columns with column mean > 4. mean > 4 is not a function in itself, so you need to add a tilde ~.
  together: select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 4)
  funs() is deprecated in recent dplyr versions; prefer the ~ form.
select_all(): allows changes to all columns and takes a function as an argument, e.g. select_all(tolower) directly, or wrapped inside funs()
slice: pick rows using index(es)
  slice(c(7,8,14:15))
  slice(-c(7,8,14:15))
filter: pick rows matching criteria
  can use AND, OR and NOT, in the styles below
  filter(condition1, condition2) is equivalent to using AND
  filter(condition1, !condition2)
  filter(condition1 | condition2)
  filter(xor(condition1, condition2)) returns all rows where exactly one of the conditions is met, and not those where both are met.
arrange: order rows by values of column(s), ascending by default
  arrange(col1)
  arrange(desc(col1), col2)
  If col1 in desc(col1) is a character variable, rows are sorted from Z to A.
rename: rename specific columns
  rename(col1_new_name=col1, col2_new_name=col2)
mutate: add new columns
  New columns can be made with aggregate functions such as mean, median, max, min, sd, and with ifelse
  mutate(col1 = col1 - mean(col1))
  mutate(col1or2 = ifelse(col1 > col2, 'col1', 'col2'))
  mutate(col2 = case_when(col1>80 ~ 'great', col1>60 ~ 'good', TRUE~'others')) works like a multi-level ifelse.
mutate_at: applies a function to one or several columns, like sapply. It does not create new columns.
  mutate_at(c('col1', 'col2'), ~ .-mean(.)); note that column names are in quotes.
  can also use %>% inside: mutate_at(cols, ~ (.-mean(.)) %>% round(2)). The parentheses are needed because %>% binds more tightly than -.
transmute_at: returns a new data frame with only the mutated variables.
bind_cols, bind_rows: bind multiple data frames by column or by row.
  %>% bind_cols(., df2)
count: return the frequency of discrete variables. Add sort=TRUE to sort by counts.
  %>% count(col1, col2, sort=TRUE)
group_by: create groups of rows according to a condition. Use ungroup to remove the grouping.
summarise: apply computations across groups of rows, used with aggregation functions n, n_distinct, sum, max, min, mean, median, sd
gather: make wide data longer
  gather(key=new_key_col_name, value=new_value_col_name, -old_index)
spread: make long data wider
  spread(key=key_col_name, value=value_col_name)
separate: separate a character column into two by a separator.
  separate(col, c('new-col-1', 'new-col-2'), sep=' ')
unite: unite two columns into one by a separator.
  unite('new-col', c(col1, col2), sep=' ')
map: apply computations across columns; related: map_dbl, map_dfr
pmap: apply computations across rows; related: pmap_dbl, pmap_dfr
*_join, where * can be inner, left, right, or full: join two data frames together according to common values in certain columns; * indicates which rows to keep.
  *_join(df2, by='key'). If column names differ, use by=c("df1.name"="df2.name")
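Several of the verbs above can be chained in one pipeline. The sketch below is only an illustration on the built-in mtcars data set; the lookup table labels and all column choices are made up for the example.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, hp) %>%                      # pick columns by name
  filter(hp > 100, cyl != 8) %>%                # both conditions must hold (AND)
  mutate(mpg_centered = mpg - mean(mpg)) %>%    # add a new column
  group_by(cyl) %>%                             # one group per cylinder count
  summarise(n = n(), mean_mpg = mean(mpg)) %>%  # one row per group
  arrange(desc(mean_mpg))                       # highest average first

# A hypothetical lookup table, joined on the common column cyl
labels <- data.frame(cyl = c(4, 6), label = c("four", "six"))
joined <- left_join(result, labels, by = "cyl")
```

The pipeline leaves mtcars itself unchanged; result and joined are new data frames.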
Comparison with SQL¶
Translate dplyr expressions to SQL using translate_sql() from the dbplyr package.
| Objective | R | SQL | Remark |
|---|---|---|---|
| select columns | | | |
| filter | | | |
| order rows | | | |
Comparison with Python¶
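R code is typically written in a functional style, passing data through a chain of functions, while Python code is typically object-oriented, calling methods on objects.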
R
find('key') %>%
start('car') %>%
drive(to='campus') %>%
park()
Python
key.find()
car.start()
car.drive(to='campus')
car.park()
Functions¶
| Objective | Python | R | Remark |
|---|---|---|---|
| function help | | | |
| append to array | | | |
| create a sequence | | | python is zero indexed |
| add y-axis label | | | |
| add title | | | |
| length of a string | | | |
| arrays comparison | | | returns a logical array of the same shape as |
| add a column | | | |
| delete a column | | | |
| delete rows | | | |
| check missing | | | |
| min index | | | |
| frequency of discrete values in a vector | | | |
| default vector format | row-major (‘C’) | column-major (‘F’/fortran) | mind broadcasting in |
| array stack | | | |
| array extension | | | In numpy the default format is row-major, so use |
Miscellaneous¶
= vs <-
in most cases, they are equivalent
unlike =, the other direction -> also works, which is convenient at the end of a line
inside function argument assignment fun(x=...), must use =.
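A short base R sketch of the points above. The variable names are illustrative.

```r
# At the top level, the two forms are equivalent
x <- 1
y = 2

# Rightward assignment works only with the arrow
3 -> z

# Inside a call, `=` names an argument and does not create a variable
m <- mean(x = c(1, 2, 3))
x    # still 1: the workspace x is untouched

# Inside a call, `<-` assigns in the calling environment
# and passes the value positionally
m2 <- mean(w <- c(4, 5, 6))
w    # 4 5 6
```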