COMP90014 Algorithms for Bioinformatics

Dale

Prerequisites 1. Introduction to Genomics 基因组学

eukaryotes, prokaryotes, viruses
- 真核生物、原核生物、病毒
- water 70%, protein 15%, RNA 6%, DNA 1%
genome 基因组
- bacterium, human, plant, virus
- 细菌、人类、植物、病毒
- chromosome 染色体
- plasmid 质粒
- mitochondria 线粒体
- chloroplast 叶绿体
ploidy 倍性
- 2 copies of the same chromosome, one from each parent
- haploid 单倍体
- diploid 二倍体
- polyploid, triploid, tetraploid
molecular biology 分子生物学
- DNA, RNA, protein
- nucleotide 核苷酸, amino acid 氨基酸
- gene, genome, transcriptome 转录组, proteome 蛋白质组
DNA: A-T, C-G
- double helix 双螺旋
- sugar-phosphate backbone 糖磷酸骨架
- 5' to 3' direction
- chromosome, gene, exon 外显子, intron 内含子, promoter 启动子, enhancer 增强子, terminator 终止子
protein
- encoded by gene
- polymer of amino acid monomers 氨基酸单体聚合物
- cellular functions 细胞功能
  - catalysis 催化
  - structure 结构
  - signalling 信号
amino acid
- 20 types
- 3-letter and 1-letter code
- hydrophobic 疏水, hydrophilic 亲水, charged 带电
central dogma
- DNA -> RNA -> protein
- transcription 转录, translation 翻译
- U (uracil) in RNA instead of T (thymine) in DNA
- ribosome 核糖体, tRNA, mRNA
reserve transcription 逆转录
- mRNA -> DNA
- reverse transcriptase 逆转录酶
- retrovirus 逆转录病毒
gene
- unit of heredity 遗传单位
- 2% of human genome codes for genes
- promoter, 5' UTR, coding sequence, 3' UTR, terminator
- coding sequence: exon, intron
variation
- sequence variation
  - single nucleotide variation (SNV) 单核苷酸变异
  - single nucleotide polymorphism (SNP) 单核苷酸多态性
  - indels: insertion, deletion
- structural variation
  - large-scale (>50bp)
  - involving entire sections of DNA
  - gain or loss of fucntion (GoF, LoF)
- conservation
sequencing technology
- DNA/RNA extraction -> library preparation -> sequencing -> export fastq files -> QC/analysis
- Illumina, Nanopore
FASTQ files, reads
- 4 lines per read
  - identifier - unique name
  - sequence - AGTC
  - spacer - +
  - quality - confidence score for each DNA base
sequencing strategies
- whole genome sequencing (WGS)
  - all DNA in cells
- whole exome sequencing (WES)
  - all exons
  - cheap, capture importants
- RNA sequencing (RNA-seq)
  - all RNA in cells
- scRNA-seq
  - all RNA but idividual cells
- Amplicon
  - use PCR target specific region
- Pair-end
  - sequence both ends of DNA fragment

1. Indexing

common classes of algorithms
- brute force
- divide and conquer
- recursive
- dynamic programming
- greedy
data in bioinformatics
- FASTA, sequence, genome
- FASTQ, with quality, read
- BAM/SAM, alignment
- Matrix
- VCF, genetic variants
data type
- primitives: int, float, char
- collections: array, list, set, map
- derived types: class, enum, tuple
data structure
- arrangement of data in algorithm
- queue, stack, heap, tree, graph
evaluation
- correctness
- efficiency
- asymptotic complexity 渐近复杂度 (big O notation)
- tractability 可解性 (P, NP, NP-complete)
when complex
- heuristic
- approximation
indexing
- pattern matching
- see if something exists
- linear search O(n)
- binary search O(log n)
- hash table O(1)
double hashing (perfect hashing): use another hash function to treat collision
kmer
- breaking sequence into smaller pieces
- key of genomic data in indexing

2. Sequence Alignment and Mapping

why align sequence
- compare sequence
- assess similarity
- find evolutionary relationships
- identify functionally conserved sequences
- homology 同源性，指不同生物体的基因或蛋白质序列由于共同的祖先而具有相似性，有助于理解生物进化、基因功能和结构
  - homologs 同源基因
  - orthologs 直系 separated by speciation
  - paralogs 旁系 separated by duplication
- phylogeny 系统发育 infer the evolutionary relationships among sequences
- protein function, find regions of known function, infer function
- conservation, find conserved regions 在不同物种或不同基因组中，某些基因或蛋白质序列在进化过程中保持不变或变化很小的区域，通常这些序列在生物学功能上具有重要性
- data searching
- de novo assembly, reconstruct the original sequence
- read mapping, infer variants
how to align
- pairwise
- BLAST
Hamming distance
- for equal length sequences, count the number of different characters
kmer distance
- for unequal length sequences, count the number of different kmers
Levenshtein distance (pairwise)
- edit distance
- 最小编辑距离，指两个字符串之间，由一个转换成另一个所需的最少编辑操作次数
- count mismatch, insertion, deletion as transformations
- argmin of
  - (i-1, j-1) + (missmatch 1 or match 0)
  - (i-1, j) + (gap 1)
  - (i, j-1) + (gap 1)
- return the bottom-right cell of the matrix
pairwise distance with penalties
- apply penalties
  - i.e. match 0, mismatch -1, gap -2
- return the bottom-right cell of the matrix
Needleman-Wunsch algorithm (global)
- return the bottom-right cell of the matrix
Smith-Waterman algorithm (local)
- return the maximum value in the matrix
semi-global alignment
- clipped by best offset
- Needleman-Wunsch variant
  - start from max cell, end at reaching top row/left column
(scoring) substitution matrix 替换矩阵
- assign scores to each pair of characters
- why not use fixed scores
  - two types of DNA mutations 两种DNA突变
    - transition, purine to purine or pyrimidine to pyrimidine, frequently (A↔G, C↔T)
    - transversion, purine to pyrimidine or vice versa, less likely (A↔C, A↔T, G↔C, G↔T)
- it reflect
  - chemical similarity
  - observed mutation frequencies
  - how proteins sequence evolve
Point Accepted Mutation (PAM) matrix
- the number represents the evolutionary distance between the sequences
- higher numbers denote greater distances
- limitation
  - only for closely related sequences and small dataset
  - not consider different rates of evolution
BLOcks SUbstitution Matrix (BLOSUM) matrix
- scores derived from frequencies of substitutions in blocks of ungapped local alignments
- BLOSUM62: 62% identity
gap panalty
- negative score
- fixed / affine
read mapping
- seed extension
  - combine indexing and alignment
  - process
    1. index kmer
    2. find seed
    3. extend seed using alignment
    4. return best
- seed chain align
  - for long, noisy sequences
  - process
    1. find seeds
    2. identify colinear chains
    3. base-level alignments to fill gaps
    4. return best
Basic Local Alignment Search Tool (BLAST)
- efficient heuristic
  - DNA translated to protein seq for use
    - Amino Acid (AA) sequences more conserved than DNA
    - degeneracy / codon wobble
    - 3rd base in codon: multiple different nucleotides encode same AA
    - eliminate unnecessary DNA mismatches
    - more useful kmer hits
  - low-complexity regions removed
    - repetitive DNA
  - exploit kmer
    - generate 'neighbors'
    - allow mismatches
    - ungapped alignment and pick with threshold
  - identify nearby hits on same diagonal
    - like seed chain align
    - extend region to a high-scoring segment pair (HSP)

3.1. Comparing Sequences

multiple sequence alignment
- identify similarities between sequences
- find conserved patterns (motifs 模体) in a protein family
- Sum-of-Pairs (SP) score
- dynamic programming
heuristic: progressive alignment
1. construct a matrix
2. construct a guide tree
3. align sequences
Basic Local Alignment Search Tool (BLAST)
- one to many query
1. split query into words
2. find neighbourhood words of similar words
3. collect seeds
4. extend seeds
5. get the best alignment
minhash
- fingerprint, known as kmers
- big data
Jaccard Coefficient
- measure the similarity
- proportion
- J(A, B) = |A ∩ B| / |A ∪ B|
minimizers
- window
- minimizer: select the minimum kmer from a window according to an order
shazam
- search audio
- audio, fingerprint, hash, reproducible
limitation
- window size W and kmer size k trade-off
- longer kmer, more informative
- bigger window, more accurate, less efficient

3.2. Graph Theory

graph, interaction, biology
describe and analyse relationships
nodes, edges
labels, directional, weights
adjacent, degree
walk (edge), path (node), cycle (closed path)
Eulerian path: visit each edge once and only once
Hamiltonian path: visit each node exactly once, NP-complete
tree, usually sparse
clique, edges between all pairs of nodes
graph, adjacency list, adjacency matrix
traversals: depth-first search, breadth-first search
recursion

4.1. Network Analysis

connected?
distance of nodes?
density?
diversity of distribution? (node degree distribution)
centrality
- degree
- closeness
- betweenness
- eigenvector
classification of networks
- scale-free
- small-world
- random
- regular

4.2. Advanced Indexing

trie (prefix tree or radix tree)
- used in word other
suffix tree 后缀树
- complexity, space
suffix array
- each item is a word
Burrrows-Wheeler Transform (BWT)
- popular
- efficient in indexing human genome

5. Evolutionary Trees

taxonomy 分类学
phylogenetics 系统发育
speciation 物种形成
multiple sequence alignment
- issue: local not represent global
greedy: pair the most similar sequences
building a tree
- next merge
- internal node
- branch length
UPGMA
- unweighted pair group method with arithmetic mean
- $O (n^{3})$ , n for number of merge, n * n for distance matrix
- distance based
- ultrameric: distance from root to leaf is the same
Neighbor joining
distance based methods
- pros: simple, flexible, fast, scalable
- cons: sensitive to distance, evolutionary rates not considered, uncertain
character based methods
- taxon be described by a number of characters
parsimony 解析 methods
- compute a phylogenetic tree T for a set of sequences that minimizes L(T)
- L(T) is the min of subsitutions requires to explain tree T
evaluating trees
- label internal nodes e.g. Hamming distance
small parsimony
- given T, find L(T)
FItch's algorithm
- repeatly assign labels
- $S_{v} = S_{u} \cup S_{v} i f S_{u} \cap S_{v} = \emptyset, S_{u} \cap S_{v} o t h e r w i s e$
Sankoff's algorithm
- count smallest number of possible changes needed on a given tree
- $S_{a} (i) = m i n [c_{i} j + S_{l} e f t (j)] + m i n [c_{i} K + S_{r} i g h t (k)]$
build a tree for character based method
- greedy
- sequential addition
bootstrapping 重复抽样
- re‐sample, create new alignment

6. Assembly

to build contigs 连续序列
greedy extension
- for N reads of L length
- O( $N^{3} * L^{2}$ )
overlap layout consensus (OLC)
- Newbler, Celera, Canu
sequencing by hybridisation
- microarray
De Bruijn Graphs
- kmer
- Eulerian path
- ACGT -> 0,1,2,3
- 4^k nodes
- O(kN) space
- O(N) time
considerations
OLC simplification
De Bruijn graph simplification
Kmers, coverage, depth
assessing assemblies
software

7. Dimensionality Reduction

approaches
- feature selection
- feature extraction
principal component analysis (PCA)
- linear transformation
- eigenvectors
- eigenvalues
- variance
- covariance
- projection
- dimensionality reduction
multidimensional scaling (MDS)
- distance matrix
- similarity matrix
- stress
- metric MDS
- non-metric MDS
ISOMAP
t-SNE
UMAP

8. Clustering

machine learning
clustering
distance metrics
partitional clustering: k-means
clustering approaches
density-based clustering
divisive clustering

9. Optimisation

exploring the solution space
- constraints
- branch and bound
search heuristics
- gradient descent
- simulated annealing
- genetic algorithms
optimisation

10. Supervised Machine Learning

unsupervised learning
- associative rule learning
unsupervised vs supervised ML
- GeoGuessr
supervised ML
- KNN
- naive Bayes
- SVM
- decision trees
- ensemble methods
genomics example
- markov models for gene identification

11. Evaluation

classification
regression