Python¶

In this section we introduce intermediate python programming and some packages e.g. numpy, pandas.

references
- https://github.com/yasoob/intermediatePython
- https://www.liaoxuefeng.com/wiki/1016959663602400 (Chinese)
- https://wiki.python.org/moin/Powerful%20Python%20One-Liners

Programmer Tools¶

Debugging¶

methods
- pdb.set_trace() to pause running. now use breakpoint() after 3.7
- assert x == 2, 'msg'
- logging and output specific msg type
see details

Object Introspection¶

dir()
- return a list of attributes and methods belonging to an object
type()
id()
inspect.getmembers()

Decorators and Decorator Classes¶

see details
- note the methods __init__ and __call__
@lru_cache(maxsize=32) to cache the values of function calls
- not execute the function if the function has been called with the same args before
- returns value in cash
- saves time and effort

Syntax¶

exceptions¶

try, except, else, finally
try, except E1 as e, except E2 as e to catch multiple error one by one
try, except Exception as e to catch multiple errors at once

for/else¶

else clause is executes after the loop completes normally (without break)
e.g. loop to search, if found then break, if not found then go to else

tenery operator¶

x = 1 if a > 1 else a
name = 'a' or 'b' return a.
dynamic default name

def my_function(real_name, optional_display_name=None):
    optional_display_name = optional_display_name or real_name

`*args` and `**kwargs`¶

when define def fun(*arg, **kwarg)
- args passes an unspecified number of non-keyworded arguments in a list
- kwargs passes an unspecified number of keyworded arguments in a dictionary
- e.g. pass plot arguments to plt.plot() in self-defined plot functions.
when call fun(*arg, **kwarg)
- args can be a pre-defined tuple
- kwargs can be a pre-defined dictionary, with arg-value being the key-value pair
- * and ** is used to unpack

`open()` and context managers¶

see
- https://github.com/yasoob/intermediatePython/blob/master/open_function.rst
- https://github.com/yasoob/intermediatePython/blob/master/context_managers.rst

Data Structures¶

Mutable vs Immutable¶

identity, type, and value
- an object’s identity never changes once it has been created; you may think of it as the object’s address in memory.
  - The is operator compares the identity of two objects
  - The id() function returns an integer representing its identity
  - objects with different names but have the same identity are referencing to the same object in computer memory
- an object’s type defines the possible values and methods
  - the type() function returns the type of an object. An object type is unchangeable like the identity.
- an object’s value can or cannot be changed depending by its type
  - objects whose value is changed with their identity unchanged are said to be mutable, e.g., list, dictionary, set and user-defined classes.
  - objects whose identity must be changed once its value is changed are called immutable, this means if we change its value then a new object (with new identity) is created, e.g., int, float, decimal, bool, string, tuple, and range. After the value is changed, its id also changes
  - the == operator is used to compare the values of two objects

difference between mutable and immutable objects arises when you assign a variable to another variable

a = 1 ## int is immutable
b = a
print(b is a) ## True
idb_before = id(b)
b = b + 1
id(b) == idb_before ## False, a new object is created
print(a is b) ## False
print(a) ## 1

a = [1] ## list is mutable
b = a
print(b is a) ## True
idb_before = id(b)
b.append([2])
id(b) == idb_before ## True, only the value is changed
print(a is b) ## True
print(a) ## [1,2], changed

there is also difference if we set the default argument to be a mutable object in a function
```
def add_to(num, target=[]):
    target.append(num)
    return target

add_to(1) ## [1]
add_to(2) ## [1, 2]
add_to(3, target=[4]) ## [4, 3]
add_to(5) ## [1, 2, 5]
```
- in Python, the default arguments are evaluated (and their identities are created in the memory) once the function is defined, not each time the function is called
- add_to(1) used the default value of target [] which is created in the memory when the function is defined, and changed it to [1]
- add_to(2) used the target in the memory, which is [1]
- add_to(3, target=[4]) used the passed value [4], not the object in the memory
- add_to(5) used the target in the memory again, which is [1,2] resulting from add_to(2)
- to be safe, mutable type as default value should be defined in the following way
```
def add_to(num, target=None):
  if target is None:
      target = []
  target.append(num)
  return target
```
mutability of containers
- some objects contain references to other objects, these objects are called containers. Some examples of containers are a tuple, list, and dictionary. The value of an immutable container that contains a reference to a mutable member can be changed if that mutable member is changed. However, the container is still considered immutable because when we talk about the mutability of a container only the identities of the contained objects are implied.
- if an immutable container contains only immutable members, then the value of the container cannot be changed if we change the immutable members
ref:
- https://towardsdatascience.com/https-towardsdatascience-com-python-basics-mutable-vs-immutable-objects-829a0cb1530a

Classes and Magic Methods¶

functions vs methods
- functions may be associated with packages, e.g. np.sqrt()
- methods are always associated with objects, e.g. df.head()

class variables vs instance variables

instance variables are unique to every object
class variables are for data shared between different instances of a class
using mutable class variables is dangerous

e.g. in the example below, name is a instance variable, pi is a immutable class variable, and superpowers is a mutable class variable

class SuperClass():
  superpowers = []
  pi = 3.14

  def __init__(self, name):
      self.name = name

  def add_superpower(self, power):
      self.superpowers.append(power)

foo = SuperClass('foo')
bar = SuperClass('bar')
foo.name ## 'foo'
bar.name ## 'bar'

foo.pi = 10
print(foo.pi) ## 10
print(bar.pi) ## 3.14

foo.add_superpower('fly')
bar.superpowers ## ['fly']
foo.superpowers ## ['fly']

magic methods
- magic methods are also called dunder (double underscore) methods
  - e.g. __init__, __getitem__, __iter__, __next__, __call__, etc.
- __slots__
  - by default Python uses a dict to store an object’s instance attributes
    - pros: allows setting arbitrary new attributes at runtime
    - cons: wastes a lot of RAM if you create a lot of objects with known attributes
    - solution: store the fixed set of attributes in slots to save 50% RAM
  - just specify the attributes names in a list and pass it to __slots__
    class MyClass(object): __slots__ = ['name', 'identifier'] def __init__(self, name, identifier): self.name = name self.identifier = identifier self.set_up()
  - ref: https://stackoverflow.com/questions/472000/usage-of-slots

Iterables, Iterators, Generators and Coroutines¶

An iteratble is any object in Python which has an __iter__ or a __getitem__ method defined, which returns an iterator or can take indexes
An iterator is any object in Python which has a __next__ method defined
- e.g. str is an itertable but not an iterator. iter(iterable) will return an inerator object.
- next(iterator) allows us to access the next element
```
my_string = "Yasoob"
my_iter = iter(my_string)
print(next(my_iter)) ## 'Y'
```
An generator is an iterator, but you can only iterate over it once. They do not store all he values in memory, they generate the values on the fly.
- can be defined by generator comprehensions (i for i in range(10))
- can be defined by function using yield
```
def generator_function():
    for i in range(3):
        yield i
gen = generator_function()
print(next(gen)) ## 0
for i in gen:
    print(i) ## 1,2
```
coroutines are similar to generators but it takes value from input
- next() to execute it
- .send() to input the next value to yield
- .close() to close
- see https://github.com/yasoob/intermediatePython/blob/master/coroutines.rst

Operations¶

Complexity of operations (link)[https://www.cnblogs.com/luozx207/p/12793168.html].
set.remove() vs set.discard(): The remove() method raises an error when the specified element doesn’t exist in the given set, however the discard() method doesn’t raise any error if the specified element is not present in the set and the set remains unchanged.

`collections` module¶

the collections python module contains a number of useful container data types
defaultdict
- defaultdict is a sub-class of the dict class that returns a dictionary-like object. The functionality of both dictionaries and defualtdict are almost same except for the fact that defualtdict never raises a KeyError. It provides a default value for the key that does not exists.
- for details see https://www.geeksforgeeks.org/defaultdict-in-python/
- e.g.
```
from collections import defaultdict
def default_value():
    return "Not Present"
d = defaultdict(def_value) ## or defaultdict(lambda: "Not Present")
d["a"] = 1
print(d["a"]) ## 1
print(d["b"]) ## "Not Present"
```
- one can also specify the default ‘null’ type of the value by defaultdict(factory_function), where factory_function can be int, str, set, list etc.
OrderedDict
- OrderedDict keeps its entries sorted as they are initially inserted. Overwriting a value of an existing key doesn’t change the position of that key. However, deleting and reinserting an entry moves the key to the end of the dictionary.
- e.g.
```
from collections import OrderedDict
colours = OrderedDict([("Red", 198), ("Green", 170), ("Blue", 160)])
for k, v in colours.items():
    print(k, v)
```

Counter

Counter is used to count the number of occurrences of a particular item in an iterable, and return a dictionary-like Counter object.

e.g.

from collections import Counter
l = ['a', 'b', 'a', 'c']
freq = Counter(l)
for k, v in colours.items():
    print(k, v)

deque
- deque is preferred over list in the cases where we need quicker append and pop operations from both the ends of container, as deque provides an O(1) time complexity for append and pop operations as compared to list which provides O(n) time complexity.
- methods include appendleft(), extendleft() and popleft()
- We can also limit the amount of items a deque can hold, e.g. deque([0, 1, 2, 3, 5], maxlen=5). By doing this when we achieve the maximum limit of our deque it will simply pop out the items from the opposite end.

namedtuple

We’ve known that a tuple is basically a immutable list. Likewise, a namedtuple can be seen as a immutable dictionary.
namedtuples are backwards compatible with normal tuples (e.g. indexed by integer), and require no more memory than regular tuples.
namedtuples are more lightweight and faster than dictionaries, and can be convert to dictionaries

A named tuple has two required argument: tuple name and the field_names. e.g.

  from collections import namedtuple
  Animal = namedtuple('Animal', 'name age type') ## tuple name and field names
  perry = Animal(name='perry', age=31, type='cat')
  print(perry[0]) ## 'perry', index by integer like a regular tuple
  print(perry.name) ## 'perry', index by key like a dictionary
  print(perry._asdict()) ## convert to an OrderedDict
  perry.age = 42 ## error, since it is immutable

Enum

Enum is a data container that is preferred when we require immutable and unique keys or values. For instance, weekday names and weekday numbers.

from enum import Enum
class Weekday(Enum):
    Mon = 1
    Tue = 2
    Mon = 4 ## error, duplicate keys are not allowed
    Monday = 1 ## alias keys are allowed. can use @unique to disable them

Weekday.Mon = 1 ## error, immutable
Weekday.Monday ## output: Weekday.Mon

To get enumeration members, use Weekday(1), Weekday['Mon'] or Weekday.Mon
To get member names and values, use member.name and member.value

A one-liner to define an Enum class (indexing from 1 by default)

Weekday = Enum('Day', ('Mon', 'Tue', 'Wed', 'Th', 'Fri', 'Sat', 'Sun'))
print(Weekday.Mon.value) ## 1

Functional Programming¶

`enumerate()`¶

can take an optional argument to specify the starting index enumerate(my_list, 1)
can also be used to create a list of tuples list(enumerate(my_list, 1))

`lambda`¶

used to define a anonymous function

e.g. sort a list of tuples by the first element in that tuple

a = [(1, 2), (4, 1), (9, 10)]
a.sort(key=lambda x: x[1])

`sorted()`¶

the list.sort() method is only defined for lists.
in contrast, the sorted() function accepts any iterable.

e.g. sort words in a sentence in alphabet order.

sorted("This is a test string from Andrew".split(), key=str.lower)

the key-function can be itemgetter() or attrgetter() from the operator module.
see https://docs.python.org/3/howto/sorting.html

`map(), filter()` and `reduce()`¶

map(fun, iterable) may be faster than list comprehension if fun is pre-defined (not through lambda)
filter(fun, iterable) is used for masking, where fun should return True/False

reduce(fun, iterable, initilizer=None) applies a particular function passed in its argument to all of the list elements mentioned in the sequence passed along.

def reduce(function, iterable, initializer=None):  roughly equivalent
    it = iter(iterable)
    if initializer is None:
        value = next(it)
    else:
        value = initializer
    for element in it:
        value = function(value, element)
    return value

from functools import reduce
reduce(lambda a, b: a + b, l)  sum(l)
reduce(lambda a, b : a if a > b else b, l)  max(l)
reduce(lambda z, x: z + [y + [x] for y in z], l, [[]])  all subsets of l

Comprehensions¶

list comprehensions: squared = [x**2 for x in range(10)]
set comprehensions: {x**2 for x in [1, 1, 2]}
dict comprehensions: {key: value for ... }
- e.g. swap keys and values {v: k for k, v in some_dict.items()}
generator comprehensions
- don’t allocate memory for the whole list but generate one item at a time, thus more memory efficient.
```
my_gen = (i for i in range(30) if i % 3 == 0)
for x in my_gen:
  ...
```

Numpy¶

isin
repeat vs tile
concatenate, vstack and hstack
strides
empty
array vs asarray

Pandas¶

ref

note

Series¶

“Series = Vector + labels”

attributes

.index
.values

methods

.describe()
.head(), .tail()
.plot()

DataFrame¶

Indexing¶

Select	Syntax	Result
a column	`df['col']`	Series
columns by labels	`df[['col1', 'col3']`]	DataFrame
columns by labels	`df.loc[:, 'col1':'col3']`	DataFrame
a row by its label	`df.loc['row']`	Series
a row by its integer location	`df.iloc[1]`	Series
rows by integers	`df[1:5]`	DataFrame
rows by labels	`df['row1':'row5']`	DataFrame
rows by boolean	`df[mask]`	DataFrame
entries by integers	`df.iloc[1:5, 2:6]`	DataFrame
entries by labels	`df.loc[['row1', 'row2'], ['col1', 'col2']]`	DataFrame

Note that

if only select rows or columns, then [] is enough.
.loc is primarily label based, but may also be used with a boolean array. The following are valid input
- a single label, a list of labels, a slice of labels, a boolean array
.iloc is primarily integer based. The following are valid input
- an integer, a list of integers, a slice of integers, a boolean array

Methods¶

.max()
df['col1'].corr(df['col2'])
df.dtypes and df['col'].dtype
df['col'].astype(). To convert a column of numbers in string format to float format, two methods pd.to_numeric('col') or df['col'].astype(float) can be used. But if there is entry missing, it is better to use the former one, since we can specify errors='coerce' to set the invalid parsing as NaN.
.rename(columns=names) where names is a dictionary of 'old_name':'new_name'
.isnull() to check missing
.dropna() to drop all rows containing any missing entries
- add subset = ['col1', 'col2'] to specify columns
.fillna(value='') fill missing values with specific value
- .fillna(method='bfill')
- .fillna(method='ffill')
df.agg(func, axis=0) to apply built-in aggregation or self-defined functions to column(s) ‘0’ or row(s) ‘1’.
df.apply(fun) apply self-defined function to columns or rows
- axis = {0 or ‘index’, 1 or ‘columns’}, default 0
df.applymap(fun) apply self-defined function to each entry
df._get_numeric_data() filter only numeric columns
df.sort_values(by, axis=0, ascending=True) sort values

Multiple methods can be applied sequentially and arranged in a easy-to-read format using ()

(df
  .groupby('type')
  .mean()
  .sort_values()
  .plot
  .bar(
    figsize = (4,3),
    layout = (4,5),
  )
)

Plot¶

Equal axis aspect ratio using ax or plt
- axs[0, 1].axis('equal')
- axs.set_aspect('equal', 'box')
- plt.gca().set_aspect('equal', adjustable='box')
- plt.axis('square')

Spine placement docu

ax.spines.left.set_position('center')
ax.spines.bottom.set_position('center')
ax.spines.right.set_color('none')
ax.spines.top.set_color('none')

Shaded area

plt.fill_between(x, yhigh, ylow,
               facecolor="orange", # The fill color
               color='blue',       # The outline color
               alpha=0.2)          # Transparency of the fill

Miscellaneous¶

%who will give you a list of all current user-defined variables
%whos will give you more details on all current user-defined variables
dir() will give you the list of in scope variables

Data Science Handbook