R¶
Data Structures¶
R has many data structures, which include
atomic vector
list
matrix
data frame
…
To check and convert data types:
class(): high-level data type/structure: data frame, matrix, etc.
typeof(): low-level data type: character, logical, double, integer …
is.sth(): is numeric? character?
as.sth(): convert to numeric, character etc.
identical(x1, x2)
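A minimal sketch of these helpers in base R. The objects x and df are made up for illustration and are not part of the original notes.

```r
# Illustrative objects (assumed, not from the original notes).
x <- c(1.5, 2, 3)
df <- data.frame(a = 1:3)

class(x)    # "numeric"     high-level class
typeof(x)   # "double"      low-level storage type
class(df)   # "data.frame"
typeof(df)  # "list"        a data frame is stored as a list

is.numeric(x)                # TRUE
y <- as.character(x)         # "1.5" "2" "3"
identical(as.numeric(y), x)  # TRUE: same type and same values
```

Note that class() and typeof() can disagree, as they do for df: the first describes how the object behaves, the second how it is stored.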
Comparison of the four data structures
Dimensions | Homogenous | Heterogeneous |
---|---|---|
1-D | atomic vector | list |
2-D | matrix | data frame |
Vector¶
A vector is the most common and basic data structure in R. It is a collection of elements that all share one (homogenous) mode: character, logical, integer or numeric.
creation
empty vector:
vector() (by default the mode is logical)
vector("character", length = 5): creates a vector of mode ‘character’ with 5 elements
character(5): same thing
numeric(5): a numeric vector with 5 elements
logical(5): a logical vector with 5 elements
integer vs numeric
c(1, 2): double precision real numbers
c(1L, 2L): integer, or use as.integer()
character c('a', 'b'), logical, etc.
Note
The contents of a list are not restricted to a single mode, but those of an atomic vector are. If multiple types are put inside a vector, as in c(1, 'a'), R will create a vector with the mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”.
manual coercion: as.integer(), as.character(), etc.
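The creation and coercion rules above can be checked with a short base R sketch. All variable names here are illustrative.

```r
# Creation: these two calls give the same empty character vector.
v1 <- vector("character", length = 5)
v2 <- character(5)
identical(v1, v2)      # TRUE

typeof(c(1, 2))        # "double"
typeof(c(1L, 2L))      # "integer"

# Implicit coercion: R picks the most flexible mode present.
mixed <- c(1, "a")
typeof(mixed)          # "character"
flags <- c(TRUE, 2L)
typeof(flags)          # "integer"; TRUE becomes 1L

# Manual coercion: values that cannot be converted become NA (with a warning).
ok  <- as.integer("3")                    # 3L
bad <- suppressWarnings(as.integer("a"))  # NA
```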
List¶
A list can contain elements of different types. Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.
Creation
from vector:
as.list(1:10)
from input:
pie = list(type="key lime", diameter=7, is.vegetarian=TRUE)
Indexing
by element name:
pie$diameter
by brackets:
pie[1]: a list of length one holding the first element
pie[[1]]: the first element itself
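A small sketch of the difference between single and double brackets, reusing the pie list from above. The nested list is an extra illustration and not part of the original notes.

```r
pie <- list(type = "key lime", diameter = 7, is.vegetarian = TRUE)

sub_list <- pie[1]     # still a list, of length 1
first    <- pie[[1]]   # the element itself: "key lime"

class(sub_list)        # "list"
class(first)           # "character"
pie$diameter           # 7, same as pie[["diameter"]]

# Lists can hold further lists (illustrative example).
nested <- list(a = 1, inner = list(b = "x"))
nested$inner$b         # "x"
```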
Matrix¶
A matrix is an extension of the numeric or character vectors. It is simply an atomic vector with dimensions.
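As a quick illustration of "an atomic vector with dimensions", the sketch below turns a plain vector into a matrix by setting its dim attribute. The variable names are illustrative.

```r
v <- 1:6
m <- v
dim(m) <- c(2, 3)   # adding dimensions makes it a 2 x 3 matrix

is.matrix(m)        # TRUE
typeof(m)           # "integer", the elements keep their mode
m[2, 3]             # 6, because values are filled column by column

# Same result as calling matrix() directly.
identical(m, matrix(1:6, nrow = 2, ncol = 3))
```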
Data Frames¶
A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list). One can check that is.list(df) returns TRUE.
Data frames can have additional attributes such as rownames().
Creation
from imported data:
read.csv(), read.table()
from vectors:
data.frame(id = letters[1:10], x = 1:10, y = 11:20)
Functions
shape:
dim(), nrow(), ncol(), str()
preview data:
head(), tail(), str()
names() or colnames(), na.omit()
sapply(df, fun): run fun() for each column of df
Indexing
by column name:
df$colname
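df[index, col1]: still a data frame when several columns are selected; a single column of a base data frame is dropped to a vector unless drop = FALSE is given
df[index, ]$col1: vector
df[row_num, col_num], or df[vec_of_num, vec_of_num]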
Data cleaning
is.na()
anyNA()
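The sketch below runs the functions, indexing styles and cleaning helpers listed above on a small made-up data frame. The data and all names are assumptions for illustration.

```r
# Illustrative data with one missing value.
df <- data.frame(id = letters[1:5],
                 x  = c(1, NA, 3, 4, 5),
                 y  = 11:15,
                 stringsAsFactors = FALSE)

is.list(df)            # TRUE: a data frame is a rectangular list
dim(df)                # 5 3
sapply(df, class)      # class of each column

col_vec <- df[2:3, "x"]                # single column: dropped to a vector
col_df  <- df[2:3, "x", drop = FALSE]  # kept as a data frame
sub_df  <- df[c(1, 3), c("id", "y")]   # rows 1 and 3, two columns

anyNA(df)              # TRUE
is.na(df$x)            # FALSE TRUE FALSE FALSE FALSE
clean <- na.omit(df)   # drops the row containing NA
```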
Factors¶
Factor objects are how R stores categorical variables: character data is encoded as a fixed set of integer codes. Each category is called a ‘level’.
x = factor(c('a', 'b', 'c', 'a'))
str(x) # Factor w/ 3 levels "a","b","c": 1 2 3 1
typeof(x) # [1] "integer"
levels(x) # [1] "a" "b" "c"
nlevels(x) # 3
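By default, R sorts levels in alphabetical order. To specify the order of the levels, pass levels = ..., and add ordered = TRUE to make the factor ordered, so that comparisons such as min() and max() work.
chili <- factor(c("hotter", "hot", "hotter", "hottest", "hot", "hot"))
chili <- factor(chili, levels = c("hot", "hotter", "hottest"), ordered = TRUE) # reassign, otherwise chili stays unordered
min(chili) # hot
max(chili) # hottest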
Data Visualization via ggplot¶
Syntax¶
Basic syntax is as below:
ggplot(data = [dataset],
mapping = aes(x = [x-variable],
y = [y-variable])) +
geom_xxx() +
other_options
For instance, scatter plot (point plot) of two variables
ggplot(data = starwars,
mapping = aes(x = height,
y = mass)) +
geom_point()
Aesthetic Mapping¶
To display values, map variables in the data to visual properties of the geom (aesthetics) in your plot, e.g. x and y location, size, shape, color etc. The right-hand sides inside aes() are usually columns of the data.
mapping = aes(x = height,
y = mass,
color = sex))
Warning
Putting aes() in different places can affect the final output.
In the code below, only one smoothing line is fitted. The aesthetic mapping aes(color=sex) is at the geom_point() layer level, so only the points are colored according to the corresponding sex.
ggplot(data = starwars,
mapping = aes(x = height, y = mass)) +
geom_point(aes(color=sex)) +
geom_smooth(se = FALSE) +
xlim(80, 250) +
ylim(0, 180)
In the code below, the aesthetic mapping aes(color=sex) is at the ggplot() input level, so the data set is partitioned by sex, and R runs each subsequent geom command for each part.
ggplot(data = starwars,
mapping = aes(x = height, y = mass, color=sex)) +
geom_point() +
geom_smooth(se = FALSE) +
xlim(80, 250) +
ylim(0, 180)
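One way to verify the difference without looking at the plots is to count the groups that geom_smooth() actually fits. The sketch below is an illustration under assumptions: it uses the built-in mtcars data instead of starwars, and a linear smoother (method = "lm") so that it runs quietly on small groups. layer_data() is the ggplot2 helper that returns the data computed for a given layer.

```r
library(ggplot2)

# Built-in data, used only to keep the sketch self-contained.
cars_df <- transform(mtcars, cyl = factor(cyl))

# color mapped inside geom_point(): only the points are split by cyl.
p_layer <- ggplot(cars_df, aes(x = wt, y = mpg)) +
  geom_point(aes(color = cyl)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped inside ggplot(): every layer inherits it.
p_global <- ggplot(cars_df, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is geom_smooth(); each distinct group is one fitted line.
n_lines_layer  <- length(unique(layer_data(p_layer, 2)$group))
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
```

Here n_lines_layer counts a single line, while n_lines_global counts one line per level of cyl.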
aesthetics | discrete | continuous |
---|---|---|
color | rainbow of colors | gradient |
size | discrete steps | linear mapping between radius and value |
shape | different shape for each | NA |
Use of geom¶
Numerical data
density:
geom_density()
histogram:
geom_histogram(binwidth = 10)
smoothing line:
geom_smooth(se = FALSE) + xlim() + ylim()
Categorical data
bar plot:
geom_bar()
Multivariate
box plot:
geom_boxplot(), where y is numeric and x is categorical
many options, see documentation.
scatter plot:
geom_point()
Other Options¶
y-axis label:
+ labs(y = 'ylabel')
create sub plots conditioning on some discrete variables
facet_grid(): 2d grid, rows ~ cols, . for no split
facet_wrap(): 1d ribbon wrapped into 2d, useful when the condition variable takes too many values.
+ facet_wrap(. ~ x_discrete)
+ facet_wrap(y_discrete ~ .)
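A short sketch that combines a few of the geoms and options above. It uses the built-in mtcars data as an assumption, since it needs no extra package. The panel counts are read back with layer_data(), whose PANEL column records the facet each row falls in.

```r
library(ggplot2)

# Numerical data: histogram of one variable.
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# Multivariate: box plot with categorical x and numeric y, plus a y-axis label.
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(y = "miles per gallon")

# facet_grid(): 2d grid, rows ~ cols.
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)

# facet_wrap(): 1d ribbon of panels wrapped into 2d.
p_wrap <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ carb)

n_panels_grid <- length(unique(layer_data(p_grid, 1)$PANEL))
n_panels_wrap <- length(unique(layer_data(p_wrap, 1)$PANEL))
```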
Data Manipulation via dplyr¶
Consider a set of nested functions expressing a sequence of actions
park(drive(start_car(find("keys")), to = "campus"))
We need to read it from the inside out. Is there a more friendly way?
Introduction to %>%¶
The pipe operator %>%, from the magrittr package, expresses the same sequence of calls in a more natural (and easier to read) way.
find("keys") %>%
start_car() %>%
drive(to = "campus") %>%
park()
By default, it sends the result of the LHS function as the first argument to the RHS function, to form a flow of results. To send the result to an argument other than the first one, or to use the previous result for multiple arguments, use the . placeholder:
starwars %>%
filter(species == "Human") %>%
lm(mass ~ height, data = .)
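Both behaviours can be checked on built-in data. This is a minimal sketch that assumes only the magrittr package; the variable names are illustrative.

```r
library(magrittr)

x <- c(4, 9, 16)

# Default: the LHS becomes the first argument of the RHS.
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))          # the same computation, read inside out

# Placeholder: `.` sends the LHS to a named argument instead.
fit_piped  <- mtcars %>% lm(mpg ~ wt, data = .)
fit_direct <- lm(mpg ~ wt, data = mtcars)
```

The piped and nested versions give identical results; the pipe only changes the reading order.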
dplyr verbs always output a new data frame; they do not alter the existing data frame. To store the output, assign it at the front of the pipeline:
output <- data %>%
... %>%
...
If we want to store intermediate results, use {. ->> obj}
c <- cars %>%
mutate(var1 = dist*speed) %>%
{. ->> b } %>% # save the intermediate result as b
summary()
Data Wrangling¶
select: pick columns by name (no quotes)
select(col1, col2)
To add a chunk of columns use the start_col:end_col syntax: select(col1:col3, col5:col7)
To exclude columns: select(-col1, -(col3:col5))
one_of: the column names can be stored in a vector cols, then passed in as select(one_of(cols)). Names in cols that are not in the data set are ignored (with a warning). Also see any_of, all_of, and other select helpers.
select(col2, col3, everything()), where everything() selects all remaining columns after col3
select(col1_name=col1) to rename (keeping only the selected columns), or use rename(col1_name=col1)
select_if: takes in each column, like sapply
select_if(~is.numeric(.)), or in short select_if(is.numeric), selects columns with numerical values. For negation, use select_if(~!is.numeric(.)) or select_if(funs(!is.numeric(.))).
select_if(funs(mean(.) > 4)) or select_if(~mean(.) > 4) selects columns with column mean > 4. mean > 4 is not a function in itself, so you will need to add a tilde ~ upfront, or wrap it inside funs(). Note that funs() is deprecated in recent dplyr versions, so the ~ form is the safer choice.
together, select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 4)
select_all(): allows changes to all columns, and takes a function as an argument, e.g. select_all(tolower)
slice: pick rows using index(es)
slice(c(7,8,14:15))
slice(-c(7,8,14:15))
filter: pick rows matching criteria
can combine conditions with AND (&), OR (|) and NOT (!), or the styles below
filter(condition1, condition2) is equivalent to using AND
filter(condition1, !condition2)
filter(condition1 | condition2)
filter(xor(condition1, condition2)) will return all rows where only one of the conditions is met, and not when both conditions are met.
arrange: order rows by values of column(s), default ascending.
arrange(col1)
arrange(desc(col1), col2)
If col1 in desc(col1) is a character variable, then sort from Z to A.
rename: rename specific columns
rename(col1_new_name=col1, col2_new_name=col2)
mutate: add new columns
New columns can be made with aggregate functions such as mean, median, max, min, sd, and with ifelse
mutate(col1 = col1 - mean(col1))
mutate(col1or2 = ifelse(col1 > col2, 'col1', 'col2'))
mutate(col2 = case_when(col1>80 ~ 'great', col1>60 ~ 'good', TRUE~'others')) works like multi-level ifelse.
mutate_at: applies a function to one or several columns, like sapply. It does not create new columns.
mutate_at(c('col1', 'col2'), ~ .-mean(.)), note that column names are in quotes.
can also use %>% inside, mutate_at(cols, ~ (.-mean(.)) %>% round(2)). The parentheses matter, because %>% binds more tightly than -.
transmute_at: returns a new data frame with only the mutated variables.
bind_cols, bind_rows: bind multiple data frames by column and by row, respectively.
%>% bind_cols(., df2)
count: return frequency of discrete variables. Add sort=TRUE to sort by counts.
%>% count(col1, col2, sort=TRUE)
group_by: create groups of rows according to a condition. Use ungroup to recover
summarise: apply computations across groups of rows, used with aggregation functions n, n_distinct, sum, max, min, mean, median, sd
gather: make wide data longer
gather(key=new_key_col_name, value=new_value_col_name, -old_index)
spread: make long data wider
spread(key=key_col_name, value=value_col_name)
separate: separate a character column into two by a separator.
separate(col, c('new-col-1', 'new-col-2'), sep=' ')
unite: unite two columns into one by a separator.
unite('new-col', c(col1, col2), sep=' ')
map: apply computations across columns, related map_dbl, map_dfr
pmap: apply computations across rows, related pmap_dbl, pmap_dfr
*_join, where * can be inner, left, right, or full: join two data frames together according to common values in certain columns, and * indicates which rows to keep.
*_join(df2, by='key'). If column names differ, use by=c("df1.name"="df2.name")
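The sketch below chains several of the verbs above on toy data. The data frames scores and teachers, and all column names, are hypothetical and invented for illustration; only dplyr is assumed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
scores <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6),
  class = c("a", "a", "b", "b", "c", "c"),
  math  = c(90, 55, 72, 88, 61, 95),
  art   = c(70, 80, 65, 92, 58, 77)
)
teachers <- data.frame(
  class   = c("a", "b", "d"),
  teacher = c("Kim", "Lee", "Roy")
)

# select / filter / arrange: pick columns, pick rows, order rows.
top_math <- scores %>%
  select(id, class, math) %>%
  filter(math > 60, class != "c") %>%   # comma means AND
  arrange(desc(math))

# mutate: add columns; case_when works like a multi-level ifelse.
graded <- scores %>%
  mutate(
    math_centered = math - mean(math),
    best  = ifelse(math > art, "math", "art"),
    level = case_when(math > 80 ~ "great", math > 60 ~ "good", TRUE ~ "others")
  )

# group_by + summarise: one row per group.
by_class <- scores %>%
  group_by(class) %>%
  summarise(n = n(), mean_math = mean(math)) %>%
  ungroup()

# Joins differ in which rows they keep.
inner <- inner_join(scores, teachers, by = "class")  # only classes in both
left  <- left_join(scores, teachers, by = "class")   # all rows of scores
full  <- full_join(scores, teachers, by = "class")   # all rows of both
```

Class "c" has no teacher and class "d" has no scores, which is what makes the three joins return different numbers of rows.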
Comparison with SQL¶
Translate dplyr commands to SQL using translate_sql (from the dbplyr package).
Objective | R | SQL | Remark |
---|---|---|---|
select columns | | | |
filter | | | |
order rows | | | |
Comparison with Python¶
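R code is typically written in a functional style, with functions applied to data, while Python code is typically object-oriented, with methods called on objects.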
R
find('key') %>%
start('car') %>%
drive(to='campus') %>%
park()
Python
key.find()
car.start()
car.drive(to='campus')
car.park()
Functions¶
Objective | Python | R | Remark |
---|---|---|---|
function help | | | |
append to array | | | |
create a sequence | | | python is zero indexed |
add y-axis label | | | |
add title | | | |
length of a string | | | |
arrays comparison | | | returns a logical array of the same shape as |
add a column | | | |
delete a column | | | |
delete rows | | | |
check missing | | | |
min index | | | |
frequency of discrete values in a vector | | | |
default vector format | row-major (‘C’) | column-major (‘F’/fortran) | mind broadcasting in |
array stack | | | |
array extension | | | In numpy the default format is row-major, so use |
Miscellaneous¶
= vs <-
in most cases, they are equivalent
unlike =, the other direction -> also works, which is convenient at the end of a line
inside function argument assignment fun(x=...), must use =.
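A small base R sketch of the last two points. The function demo_assignment is a hypothetical helper written only for this illustration. It shows that = inside a call names an argument without creating a variable, whereas <- in the same position creates one in the calling environment.

```r
demo_assignment <- function() {
  m1 <- mean(x = c(1, 2, 3))    # `=` only names the argument x
  made_by_equals <- exists("x", envir = environment(), inherits = FALSE)

  m2 <- mean(x <- c(4, 5, 6))   # `<-` creates x here, then passes it on
  made_by_arrow <- exists("x", envir = environment(), inherits = FALSE)

  c(7, 8, 9) -> y               # rightward assignment also works
  list(m1 = m1, m2 = m2,
       made_by_equals = made_by_equals,
       made_by_arrow = made_by_arrow,
       y = y)
}

res <- demo_assignment()
```

In res, made_by_equals is FALSE and made_by_arrow is TRUE, which is why = is the right choice for naming arguments.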