A Summary of Machine Learning (StatQuest)

cross validation
sensitivity: the proportion of true positives (TP) that the model correctly identified. TP/(TP+FN)
specificity: the proportion of true negatives (TN) that the model correctly identified. TN/(TN+FP)
bias & variance: low bias but high variance = overfitting
3 commonly used methods to deal with it: regularization, boosting, and bagging (random forest)
ROC (x = FPR, y = TPR): used to choose the optimal threshold; prefer higher y (TPR) and lower x (FPR).
AUC (area under the ROC curve): used to choose between models. It is the area between the ROC curve and the x-axis; the larger, the better.
precision = TP/(TP+FP) [useful when studying rare conditions, because then there are many TN]

calculate AUC

code from StatQuest
par(pty = "s") # make the plotting region square
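A minimal end-to-end sketch of the ROC/AUC calculation (my own simulated data; assumes the pROC package is installed):

library(pROC)
set.seed(42)
obese <- sample(c(0, 1), 100, replace = TRUE)                  # true labels
prob  <- plogis(rnorm(100, mean = ifelse(obese == 1, 1, -1)))  # predicted probabilities from some model
roc(obese, prob, plot = TRUE, legacy.axes = TRUE,              # legacy.axes: x-axis = 1 - specificity (FPR)
    print.auc = TRUE)                                          # prints the AUC on the plot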

Entropy: the expected value of surprise
mutual information; joint probability; marginal probability
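A tiny numeric sketch of these definitions (toy probabilities of my own):

p <- c(0.9, 0.1)              # probabilities of the two outcomes
surprise <- log2(1 / p)       # surprise of each outcome
sum(p * surprise)             # entropy = expected surprise, about 0.47 bits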

design matrix

combine t-test with regression (F-test)

batch effect correction

odds and odds ratios

3 ways to determine if an odds ratio is statistically significant.

1) Fisher’s Exact Test
2) Chi-Square Test
3) The Wald Test
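A small sketch of the first two tests on a toy 2x2 table (numbers are made up):

tab <- matrix(c(23, 117, 6, 210), nrow = 2, byrow = TRUE,
              dimnames = list(c("mutated gene", "no mutated gene"), c("cancer", "no cancer")))
(tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])   # the odds ratio itself
fisher.test(tab)$p.value                            # 1) Fisher's Exact Test
chisq.test(tab)$p.value                             # 2) Chi-Square Test
# 3) the Wald test divides log(odds ratio) by its standard error, as in the logistic regression notes below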

coefficients of logistic regression

? why the standard deviations are calculated this way,
and how to calculate the standard error and the z-value (= Estimate / standard error, from the Wald test; it is the number of standard deviations the estimated intercept is away from 0 on a standard normal curve)

R square and p-value of logistic regression

log-likelihood based R square (aka McFadden’s Pseudo R square)

Saturated Model and Deviance

LL(Saturated Model) is always 0 because the saturated model fits every data point exactly.

AIC: Akaike Information Criterion. The residual deviance adjusted for the number of parameters; it can be used to compare one model to another.

logistic regression code

# chi-square value = 2*(LL(Proposed) - LL(Null))
# p-value = 1 - pchisq(chi-square value, df = 2-1)
1 - pchisq(2*(ll.proposed - ll.null), df=1)
1 - pchisq((logistic$null.deviance - logistic$deviance), df=1)
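A self-contained sketch of where logistic, ll.null and ll.proposed above could come from (my own toy example with the built-in mtcars data, not StatQuest's dataset):

logistic <- glm(am ~ wt, data = mtcars, family = "binomial")
summary(logistic)                          # Estimate, Std. Error, z value (= Estimate / Std. Error), p-value
ll.null <- logistic$null.deviance / -2     # log-likelihood of the null (intercept-only) model
ll.proposed <- logistic$deviance / -2      # log-likelihood of the proposed model
(ll.null - ll.proposed) / ll.null          # McFadden's pseudo R-squared
1 - pchisq(2 * (ll.proposed - ll.null), df = 1)   # its p-value (df = number of extra parameters)
predicted <- predict(logistic, type = "response") # predicted probability for each sample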

and use predict(logistic, type = "response") to get the predicted probabilities.

Deviance residuals

the square root of each data point’s contribution to the overall Residual Deviance; used to identify outliers.

PCA

using SVD(singular value decomposition)
singular vector/eigenvector
loading scores
eigenvalue



Tips: scaling, centering, and verifying the number of principal components.
code in R
note: prcomp() requires samples in rows. Squaring pca$sdev gives the eigenvalues, which give the % of variation each PC accounts for. pca$rotation holds the loading scores (the eigenvectors).
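A small prcomp() sketch with random data (my own example: 10 samples x 100 variables):

data.matrix <- matrix(rnorm(10 * 100), nrow = 10)            # samples in rows, as prcomp() expects
pca <- prcomp(data.matrix, center = TRUE, scale = TRUE)
pca.var.per <- round(pca$sdev^2 / sum(pca$sdev^2) * 100, 1)  # % of variation each PC accounts for
pca$x[, 1:2]                                                 # sample coordinates on PC1 and PC2
pca$rotation[, 1]                                            # loading scores for PC1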
code in python
from sklearn.preprocessing import StandardScaler
scaled_data = StandardScaler().fit_transform(data.T)  # center and scale; .T so the samples end up in rows

LDA

LDA maximizes the separability between the 2 groups, whereas PCA focuses on the variables (genes) with the most variation.
2 criteria (optimized at the same time): maximize the distance between the category means; minimize the variation (scatter, s²) within each category.
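A minimal sketch with MASS::lda() on the built-in iris data (my own example; assumes the MASS package):

library(MASS)
lda.fit <- lda(Species ~ ., data = iris)   # finds axes that maximize between-group vs within-group separation
lda.values <- predict(lda.fit)
head(lda.values$x)                         # samples projected onto the new axes (LD1, LD2)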

MDS and PCoA

# classical MDS / PCoA sketch: distance matrix -> cmdscale()
d <- dist(scale(mtcars), method = "euclidean")   # any samples-in-rows matrix works here
mds <- cmdscale(d, eig = TRUE, x.ret = TRUE)     # eig = TRUE lets us compute % variation per axis

code

t-SNE

Hierarchical Clustering

use similarity (distance) to form clusters;
decide how to compare sub-clusters:
centroid
single-linkage
complete-linkage;
the height of the branches in the dendrogram shows which clusters are most similar (lower merges = more similar).
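A minimal sketch in base R (my own example with the built-in mtcars data):

d <- dist(scale(mtcars), method = "euclidean")   # pairwise distances
hc <- hclust(d, method = "average")              # other options: "single", "complete", "centroid"
plot(hc)                                         # the dendrogram; lower merge heights = more similar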

k-means clustering

elbow plot to find ‘K’
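A small elbow-plot sketch (my own example): plot the total within-cluster variation for K = 1..10 and look for the bend.

data <- scale(mtcars)
wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "total within-cluster sum of squares")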

DBSCAN

the radius of the circle used to form an initial cluster
core point

K-nearest neighbors

Naive Bayes

  • Multinomial Naive Bayes Classifier
    the probabilities calculated for discrete values are called likelihoods
    add a pseudocount alpha to avoid probabilities of 0
    Naive Bayes has high bias and low variance
  • Gaussian Naive Bayes Classification
    use log() to avoid underflow
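A toy sketch of the Gaussian scoring step with log() (all numbers made up):

x    <- c(20, 500, 25)                                       # a new sample with 3 continuous features
mu.A <- c(24, 750, 20); sd.A <- c(4, 150, 5)                 # per-feature means/sds estimated for class A
mu.B <- c(4, 120, 40);  sd.B <- c(2, 40, 10)                 # per-feature means/sds estimated for class B
score.A <- log(0.5) + sum(dnorm(x, mu.A, sd.A, log = TRUE))  # log(prior) + sum of log likelihoods
score.B <- log(0.5) + sum(dnorm(x, mu.B, sd.B, log = TRUE))
ifelse(score.A > score.B, "class A", "class B")              # pick the class with the larger log score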

Decision and Classification tree

ways to measure how impure the leaves are: Gini impurity, entropy, and information gain.
how to calculate Gini impurity:
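A worked sketch (toy counts): Gini impurity per leaf, then the weighted total for the split.

gini <- function(yes, no) 1 - (yes / (yes + no))^2 - (no / (yes + no))^2
g.left  <- gini(105, 39)                                     # impurity of the left leaf
g.right <- gini(34, 125)                                     # impurity of the right leaf
n.left  <- 105 + 39; n.right <- 34 + 125
(n.left * g.left + n.right * g.right) / (n.left + n.right)   # total (weighted average) Gini impurity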


to calculate thresholds for numeric data, use the average of each pair of adjacent values (e.g. adjacent ages) as the candidate cutoffs.
the candidate split with the lowest impurity becomes the node.
a leaf with too few samples can overfit -> pruning
or -> put limits on how trees grow (i.e. require x samples per leaf, where x is determined by cross validation)
feature selection: drop features that do not reduce the impurity, to avoid overfitting.

Regression trees

determine thresholds by calculating the sum of squared residuals for each candidate split and picking the smallest one.

prune regression tree:

cost complexity pruning (aka weakest link pruning)
– Tree Score = SSR + alpha * T
(T represents the total number of leaves; alpha * T is the tree complexity penalty)
– keep the alpha that gave the lowest sum of squared residuals on the testing data across the cross validation folds; the final tree is then built with that alpha on the full data (train + test). (see the rpart sketch at the end of this section)

? what are the new training data and testing data in each cross validation round
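A minimal pruning sketch (my own example; assumes the rpart package, whose cp parameter plays the role of alpha):

library(rpart)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")             # a regression tree
printcp(fit)                                                       # cross-validated error for each cp value
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]   # cp with the lowest cross-validated error
pruned <- prune(fit, cp = best.cp)                                 # keep the corresponding subtree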

Encoding

one-hot encoding: each discrete option becomes its own column.
label encoding: map the options to 1, 2, 3, ...; not good when those numbers imply an order that has no meaning.
target encoding: replace each discrete option with the mean of the target for that option.
fewer samples for an option means less confidence in its mean,
so use Bayesian Mean Encoding (aka smoothed target encoding; the equation is shown in the sketch at the end of this section):
setting m = 2 means we need at least 3 rows of data before the option mean (e.g. the mean we calculated for blue) becomes more important than the overall mean.
using a row’s own target value causes data leakage -> overfitting,
so use K-fold target encoding to turn the discrete variable into a continuous one,
or leave-one-out target encoding.
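A small sketch of the smoothed (Bayesian) mean encoding equation, (n * option mean + m * overall mean) / (n + m), on made-up data:

df <- data.frame(color = c("blue", "blue", "red", "red", "red", "green"),
                 target = c(1, 0, 1, 1, 0, 1))
overall.mean <- mean(df$target)
m <- 2                                                          # smoothing weight
enc <- sapply(split(df$target, df$color), function(y) {
  (length(y) * mean(y) + m * overall.mean) / (length(y) + m)
})
df$color.encoded <- enc[df$color]
# note: this plain version still uses each row's own target (the leakage mentioned above);
# K-fold or leave-one-out target encoding computes the means while holding that row (or fold) out.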

Classification Trees in Python from Start to Finish. code & tutorial

random forest

bootstrapping the data plus using the aggregate of the trees to make a decision is called bagging.
the out-of-bag dataset is used to measure accuracy.
out-of-bag error: the proportion of out-of-bag samples that were incorrectly classified.

  • fill in missing values
    proximity matrix: (number of trees in which two samples end up in the same leaf) / (total number of trees)
    – original dataset: for a discrete item, the weighted frequency of ‘yes’ is the frequency of ‘yes’ * the weight for ‘yes’,
    where the weight for ‘yes’ = proximity of the ‘yes’ samples / sum of all proximities
    iterate until the filled-in values stop changing
    distance = 1 - proximity value -> heatmap / MDS
    – new sample: for a discrete item, make one copy of the sample per possible option, fill them in and run them through the trees iteratively, keep the option whose copy is classified correctly most often, then use it to fill in the data and classify the sample.

code
## mtry: the number of variables randomly tried at each split
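A minimal sketch (my own example; assumes the randomForest package) showing mtry, the out-of-bag error, and the proximity matrix:

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                   mtry = 2,                     # number of variables tried at each split (worth tuning)
                   proximity = TRUE)             # fraction of trees in which two samples share a leaf
rf$err.rate[500, "OOB"]                          # out-of-bag error after 500 trees
distance.matrix <- as.dist(1 - rf$proximity)     # distance = 1 - proximity, for a heatmap or MDS plot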

loss function

fix the slope and find the optimal intercept.
the chain rule (splitting a derivative into a product of simpler derivatives)
the sum of the squared residuals is one type of loss function.

gradient descent

for LR: step size = slope (derivative) * learning rate; stop when the step size becomes very small (e.g. < 0.001).
note: the result is sensitive to the learning rate; the way we change the learning rate over time is called a schedule.

  • new intercept = old intercept - step size (see the sketch after this list)
    (the same idea can also be used to find the optimal slope and intercept at the same time)
    stochastic gradient descent selects a subset of the data rather than the full dataset,
    using 1 sample or a mini-batch of samples for each step; this also makes it easy to update the parameters when new data are added.
    it can also be used for logistic regression and t-SNE.
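A tiny gradient descent sketch (toy numbers): slope fixed at 0.64, find the intercept.

x <- c(0.5, 2.3, 2.9); y <- c(1.4, 1.9, 3.2)
intercept <- 0; learning.rate <- 0.1
for (i in 1:1000) {
  residuals <- y - (intercept + 0.64 * x)
  gradient  <- -2 * sum(residuals)        # derivative of the sum of squared residuals w.r.t. the intercept
  step.size <- gradient * learning.rate
  intercept <- intercept - step.size      # new intercept = old intercept - step size
  if (abs(step.size) < 0.001) break       # stop when the step size is tiny
}
intercept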

AdaBoost

(a stump is a tree with just one node and 2 leaves.)
Amount of Say

Weighted Gini Function -> uses the sample weights (see the small sketch below for the Amount of Say and weight-update formulas)
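A small sketch of the two key formulas (toy numbers): the Amount of Say for a stump and the sample-weight update.

total.error <- 3 / 8                                          # weighted error of the current stump
amount.of.say <- 0.5 * log((1 - total.error) / total.error)
w <- rep(1 / 8, 8)                                            # current sample weights
misclassified <- c(TRUE, rep(FALSE, 7))
w.new <- ifelse(misclassified, w * exp(amount.of.say), w * exp(-amount.of.say))
w.new <- w.new / sum(w.new)                                   # normalize so the weights sum to 1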

Gradient Boost

  • Gradient Boost for Regression is different from doing Linear Regression.


    M (the number of trees) is often 100
  • for classification
    the predictions are log(odds)
    set an initial prediction
    logistic function -> probability (see the small sketch after these steps)
    limit the number of leaves to between 8 and 32.
    pseudo-residuals: the (observed - predicted) residuals are called ‘pseudo’ because they come from the derivative of the loss function, which keeps the calculations simple.
    the larger the log(likelihood), the better the prediction.
    multiply the log(likelihood) by -1 to use it as the loss function (so we argmin).
    a 2nd-order Taylor Polynomial is used to approximate the loss function in (A)

    (A) calculate new residuals
    (B) build a new regression tree
    (C) calculate the output value for each leaf
    (D) make a new prediction (log(odds))
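A small sketch of the classification bookkeeping (toy labels): initial log(odds), logistic function, pseudo-residuals.

observed <- c(1, 1, 0, 1)
log.odds <- log(sum(observed == 1) / sum(observed == 0))   # initial prediction from the training data
p <- exp(log.odds) / (1 + exp(log.odds))                   # logistic function -> probability
pseudo.residuals <- observed - p                           # (A) the residuals used to build the next tree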

XGBoost(eXtreme Gradient Boost)

  • regression
  • classification
    similarity score (see the small sketch after this list)
    lambda: a regularization parameter, which reduces the prediction’s sensitivity to individual observations, helps prevent overfitting, and shrinks the output values of the leaves.
    Gain - gamma -> whether or not to prune
    eta (the learning rate)
    cover (min_child_weight) -> another pruning criterion
    ? the default initial prediction (e.g. predicted drug effectiveness) = 0.5

    pick the optimal output value that minimizes the loss equation L
    the regularization penalty: as lambda increases, the optimal output value gets closer to 0.
    gradients (g); hessians (h)

    the first 3 parts are about making predictions; the next part is about optimizations for large datasets.
  • greedy algorithm
    quantile (default 33)
  • sketch algorithms
    weighted quantile sketch
    weight = previous probability * (1 - previous probability)
  • Sparsity-Aware Split Finding
    for missing data
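A small sketch of the regression-side formulas (toy residuals; lambda and gamma chosen arbitrarily): similarity score, Gain, the prune decision, and a leaf output value.

residuals <- c(-10.5, 6.5, 7.5, -7.5)
lambda <- 1; gamma <- 130
sim <- function(r) sum(r)^2 / (length(r) + lambda)                 # similarity score of a node
gain <- sim(residuals[1]) + sim(residuals[-1]) - sim(residuals)    # left + right - root
gain - gamma > 0                                                   # keep the split only if this is positive
sum(residuals[-1]) / (length(residuals[-1]) + lambda)              # output value of the right leaf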

Cosine Similarity

Support Vector Machines

Maximal Margin Classifiers
support vector classifiers, aka soft margin classifiers
the 2 above cannot handle data whose classes overlap or intertwine, so we use support vector machines, which find a support vector classifier in a relatively higher dimension.
hyperplane

Kernel functions:

  1. polynomial kernel
  • (a*b + r)^d
    r and d are determined by cross validation.
  2. radial kernel, aka radial basis function (RBF) kernel: works in infinite dimensions and behaves like a weighted Nearest Neighbor model.
  • e^(-gamma * (a-b)^2)
    gamma is determined by cross validation.
    Taylor Series Expansion


    Dot product
    when we plug numbers into the radial kernel, the value we get is the relationship between the 2 points in infinite dimensions (see the small numeric sketch below).
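A tiny numeric sketch of the two kernels for a pair of 1-D points (made-up numbers):

a <- 9; b <- 14
r <- 1/2; d <- 2
(a * b + r)^d               # polynomial kernel: the high-dimensional relationship, computed as a dot product
gamma <- 0.5
exp(-gamma * (a - b)^2)     # radial (RBF) kernel: the relationship in infinite dimensions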

Neural Networks

Backpropagation
Hidden Layers
Activation Function:

  • sigmoid e^x/(e^x + 1)
  • ReLU (Rectified Linear Unit) max(0,x)
  • softplus log(1 + e^x)
    initialize the weights using a standard normal distribution (one of many ways)
    initialize the biases to 0
    multi-input
    ? when the input is continuous and the output is discrete, how should the output be encoded, with one-hot encoding?

    multi-output: ArgMax (outputs 0 or 1) cannot be used for backpropagation,
    so use SoftMax (e^x / (e^x + e^y) for two outputs), which works with gradient descent because its derivative is not 0.
  • calculating the derivative of SoftMax -> the Quotient Rule

Cross Entropy
cross entropy for each sample = -sum(observed * log(predicted))

  • predicted = softmax output
  • observed = 0/1
    the Total Cross Entropy is used as the total error.
    why cross entropy instead of squared residuals: when a prediction is very wrong, cross entropy gives a much larger derivative (slope), and therefore a larger step size (= derivative * learning rate) for backpropagation.


    note: when finding the minimum total error, even though the symbol “p” looks the same in every term, each sample has its own predicted ‘p’, which must be calculated from that sample’s own inputs.
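A small sketch (made-up raw outputs): SoftMax, then the cross entropy for one sample.

raw <- c(setosa = 1.43, versicolor = -0.40, virginica = 0.23)   # raw output values of the network
predicted <- exp(raw) / sum(exp(raw))                           # SoftMax -> probabilities that sum to 1
observed <- c(setosa = 1, versicolor = 0, virginica = 0)        # the true class, one-hot encoded
-sum(observed * log(predicted))                                 # cross entropy for this sample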

Convolutional Neural Networks(CNNs)

input * filter (convolution) = feature map -> ReLU activation function -> max pooling (or average/mean pooling) -> flattened into input nodes
(? how are the filter values determined?)

Recurrent Neural Networks(RNNs)

feedback loops
w2 > 1 -> the exploding gradient problem
w2 < 1 -> the vanishing gradient problem
so:

Long Short-Term Memory(LSTM)

sigmoid activation function (0,1)
tanh activation function (-1,1): (e^x - e^-x) / (e^x + e^-x)
forget gate
input gate
output gate

Sequence-to-Sequence(seq2seq) Encoder-Decoder

Tokens
embedding values
cell, layer
context vector

  • word embedding (turns each word into numbers)

transformer

  • position encoding (encodes the positions of the words)
    ? how are the x-axis positions for the 1st and 2nd position encodings determined?
    query, key and value
    (dot product similarity between the query and the keys)
  • self-attention (encodes the relationships among the words)
    8 self-attention cells, aka multi-head attention
  • residual connections (make the model relatively easy and quick to train in parallel)
  • encoder-decoder attention (uses the similarity between the decoder and the encoder to determine which input words are most important for the word being translated)

    (encoder)

decoder-only transformers

uses masked self-attention (GPT uses 12 masked self-attention cells) instead of plain self-attention
differences:

  • masked self-attention only includes the words that came before the current word when working out how the words are related to each other.
  • a traditional transformer uses self-attention and encoder-decoder attention; in the decoder, during training, the attention works somewhat like masked self-attention and is also called that.
    n-dimensional Tensor
    automatic differentiation