COMP90014 Algorithms for Bioinformatics
Prerequisites 1. Introduction to Genomics 基因组学
- eukaryotes, prokaryotes, viruses
- 真核生物、原核生物、病毒
- water 70%, protein 15%, RNA 6%, DNA 1%
- genome 基因组
- bacterium, human, plant, virus
- 细菌、人类、植物、病毒
- chromosome 染色体
- plasmid 质粒
- mitochondria 线粒体
- chloroplast 叶绿体
- ploidy 倍性
- 2 copies of the same chromosome, one from each parent
- haploid 单倍体
- diploid 二倍体
- polyploid, triploid, tetraploid
- molecular biology 分子生物学
- DNA, RNA, protein
- nucleotide 核苷酸, amino acid 氨基酸
- gene, genome, transcriptome 转录组, proteome 蛋白质组
- DNA: A-T, C-G
- double helix 双螺旋
- sugar-phosphate backbone 糖磷酸骨架
- 5' to 3' direction
- chromosome, gene, exon 外显子, intron 内含子, promoter 启动子, enhancer 增强子, terminator 终止子
- protein
- encoded by gene
- polymer of amino acid monomers 氨基酸单体聚合物
- cellular functions 细胞功能
- catalysis 催化
- structure 结构
- signalling 信号
- amino acid
- 20 types
- 3-letter and 1-letter code
- hydrophobic 疏水, hydrophilic 亲水, charged 带电
- central dogma
- DNA -> RNA -> protein
- transcription 转录, translation 翻译
- U (uracil) in RNA instead of T (thymine) in DNA
- ribosome 核糖体, tRNA, mRNA
- reserve transcription 逆转录
- mRNA -> DNA
- reverse transcriptase 逆转录酶
- retrovirus 逆转录病毒
- gene
- unit of heredity 遗传单位
- 2% of human genome codes for genes
- promoter, 5' UTR, coding sequence, 3' UTR, terminator
- coding sequence: exon, intron
- variation
- sequence variation
- single nucleotide variation (SNV) 单核苷酸变异
- single nucleotide polymorphism (SNP) 单核苷酸多态性
- indels: insertion, deletion
- structural variation
- large-scale (>50bp)
- involving entire sections of DNA
- gain or loss of fucntion (GoF, LoF)
- conservation
- sequence variation
- sequencing technology
- DNA/RNA extraction -> library preparation -> sequencing -> export fastq files -> QC/analysis
- Illumina, Nanopore
- FASTQ files, reads
- 4 lines per read
- identifier - unique name
- sequence - AGTC
- spacer - +
- quality - confidence score for each DNA base
- 4 lines per read
- sequencing strategies
- whole genome sequencing (WGS)
- all DNA in cells
- whole exome sequencing (WES)
- all exons
- cheap, capture importants
- RNA sequencing (RNA-seq)
- all RNA in cells
- scRNA-seq
- all RNA but idividual cells
- Amplicon
- use PCR target specific region
- Pair-end
- sequence both ends of DNA fragment
- whole genome sequencing (WGS)
1. Indexing
- common classes of algorithms
- brute force
- divide and conquer
- recursive
- dynamic programming
- greedy
- data in bioinformatics
- FASTA, sequence, genome
- FASTQ, with quality, read
- BAM/SAM, alignment
- Matrix
- VCF, genetic variants
- data type
- primitives: int, float, char
- collections: array, list, set, map
- derived types: class, enum, tuple
- data structure
- arrangement of data in algorithm
- queue, stack, heap, tree, graph
- evaluation
- correctness
- efficiency
- asymptotic complexity 渐近复杂度 (big O notation)
- tractability 可解性 (P, NP, NP-complete)
- when complex
- heuristic
- approximation
- indexing
- pattern matching
- see if something exists
- linear search O(n)
- binary search O(log n)
- hash table O(1)
- double hashing (perfect hashing): use another hash function to treat collision
- kmer
- breaking sequence into smaller pieces
- key of genomic data in indexing
2. Sequence Alignment and Mapping
- why align sequence
- compare sequence
- assess similarity
- find evolutionary relationships
- identify functionally conserved sequences
- homology 同源性,指不同生物体的基因或蛋白质序列由于共同的祖先而具有相似性,有助于理解生物进化、基因功能和结构
- homologs 同源基因
- orthologs 直系 separated by speciation
- paralogs 旁系 separated by duplication
- phylogeny 系统发育 infer the evolutionary relationships among sequences
- protein function, find regions of known function, infer function
- conservation, find conserved regions 在不同物种或不同基因组中,某些基因或蛋白质序列在进化过程中保持不变或变化很小的区域,通常这些序列在生物学功能上具有重要性
- data searching
- de novo assembly, reconstruct the original sequence
- read mapping, infer variants
- how to align
- pairwise
- BLAST
- Hamming distance
- for equal length sequences, count the number of different characters
- kmer distance
- for unequal length sequences, count the number of different kmers
- Levenshtein distance (pairwise)
- edit distance
- 最小编辑距离,指两个字符串之间,由一个转换成另一个所需的最少编辑操作次数
- count mismatch, insertion, deletion as transformations
- argmin of
- (i-1, j-1) + (missmatch 1 or match 0)
- (i-1, j) + (gap 1)
- (i, j-1) + (gap 1)
- return the bottom-right cell of the matrix
- pairwise distance with penalties
- apply penalties
- i.e. match 0, mismatch -1, gap -2
- return the bottom-right cell of the matrix
- apply penalties
- Needleman-Wunsch algorithm (global)
- return the bottom-right cell of the matrix
- Smith-Waterman algorithm (local)
- return the maximum value in the matrix
- semi-global alignment
- clipped by best offset
- Needleman-Wunsch variant
- start from max cell, end at reaching top row/left column
- (scoring) substitution matrix 替换矩阵
- assign scores to each pair of characters
- why not use fixed scores
- two types of DNA mutations 两种DNA突变
- transition, purine to purine or pyrimidine to pyrimidine, frequently (A↔G, C↔T)
- transversion, purine to pyrimidine or vice versa, less likely (A↔C, A↔T, G↔C, G↔T)
- two types of DNA mutations 两种DNA突变
- it reflect
- chemical similarity
- observed mutation frequencies
- how proteins sequence evolve
- Point Accepted Mutation (PAM) matrix
- the number represents the evolutionary distance between the sequences
- higher numbers denote greater distances
- limitation
- only for closely related sequences and small dataset
- not consider different rates of evolution
- BLOcks SUbstitution Matrix (BLOSUM) matrix
- scores derived from frequencies of substitutions in blocks of ungapped local alignments
- BLOSUM62: 62% identity
- gap panalty
- negative score
- fixed / affine
- read mapping
- seed extension
- combine indexing and alignment
- process
- index kmer
- find seed
- extend seed using alignment
- return best
- seed chain align
- for long, noisy sequences
- process
- find seeds
- identify colinear chains
- base-level alignments to fill gaps
- return best
- seed extension
- Basic Local Alignment Search Tool (BLAST)
- efficient heuristic
- DNA translated to protein seq for use
- Amino Acid (AA) sequences more conserved than DNA
- degeneracy / codon wobble
- 3rd base in codon: multiple different nucleotides encode same AA
- eliminate unnecessary DNA mismatches
- more useful kmer hits
- low-complexity regions removed
- repetitive DNA
- exploit kmer
- generate 'neighbors'
- allow mismatches
- ungapped alignment and pick with threshold
- identify nearby hits on same diagonal
- like seed chain align
- extend region to a high-scoring segment pair (HSP)
- DNA translated to protein seq for use
- efficient heuristic
3.1. Comparing Sequences
- multiple sequence alignment
- identify similarities between sequences
- find conserved patterns (motifs 模体) in a protein family
- Sum-of-Pairs (SP) score
- dynamic programming
- heuristic: progressive alignment
- construct a matrix
- construct a guide tree
- align sequences
- Basic Local Alignment Search Tool (BLAST)
- one to many query
- split query into words
- find neighbourhood words of similar words
- collect seeds
- extend seeds
- get the best alignment
- minhash
- fingerprint, known as kmers
- big data
- Jaccard Coefficient
- measure the similarity
- proportion
- J(A, B) = |A ∩ B| / |A ∪ B|
- minimizers
- window
- minimizer: select the minimum kmer from a window according to an order
- shazam
- search audio
- audio, fingerprint, hash, reproducible
- limitation
- window size W and kmer size k trade-off
- longer kmer, more informative
- bigger window, more accurate, less efficient
3.2. Graph Theory
- graph, interaction, biology
- describe and analyse relationships
- nodes, edges
- labels, directional, weights
- adjacent, degree
- walk (edge), path (node), cycle (closed path)
- Eulerian path: visit each edge once and only once
- Hamiltonian path: visit each node exactly once, NP-complete
- tree, usually sparse
- clique, edges between all pairs of nodes
- graph, adjacency list, adjacency matrix
- traversals: depth-first search, breadth-first search
- recursion
4.1. Network Analysis
- connected?
- distance of nodes?
- density?
- diversity of distribution? (node degree distribution)
- centrality
- degree
- closeness
- betweenness
- eigenvector
- classification of networks
- scale-free
- small-world
- random
- regular
4.2. Advanced Indexing
- trie (prefix tree or radix tree)
- used in word other
- suffix tree 后缀树
- complexity, space
- suffix array
- each item is a word
- Burrrows-Wheeler Transform (BWT)
- popular
- efficient in indexing human genome
5. Evolutionary Trees
taxonomy 分类学
phylogenetics 系统发育
speciation 物种形成
multiple sequence alignment
- issue: local not represent global
greedy: pair the most similar sequences
building a tree
- next merge
- internal node
- branch length
UPGMA
- unweighted pair group method with arithmetic mean
, n for number of merge, n * n for distance matrix- distance based
- ultrameric: distance from root to leaf is the same
Neighbor joining
distance based methods
- pros: simple, flexible, fast, scalable
- cons: sensitive to distance, evolutionary rates not considered, uncertain
character based methods
- taxon be described by a number of characters
parsimony 解析 methods
- compute a phylogenetic tree T for a set of sequences that minimizes L(T)
- L(T) is the min of subsitutions requires to explain tree T
evaluating trees
- label internal nodes e.g. Hamming distance
small parsimony
- given T, find L(T)
FItch's algorithm
- repeatly assign labels
Sankoff's algorithm
- count smallest number of possible changes needed on a given tree
build a tree for character based method
- greedy
- sequential addition
bootstrapping 重复抽样
- re‐sample, create new alignment
6. Assembly
to build contigs 连续序列
greedy extension
- for N reads of L length
- O(
)
overlap layout consensus (OLC)
- Newbler, Celera, Canu
sequencing by hybridisation
- microarray
De Bruijn Graphs
- kmer
- Eulerian path
- ACGT -> 0,1,2,3
- 4^k nodes
- O(kN) space
- O(N) time
considerations
OLC simplification
De Bruijn graph simplification
Kmers, coverage, depth
assessing assemblies
software
7. Dimensionality Reduction
- approaches
- feature selection
- feature extraction
- principal component analysis (PCA)
- linear transformation
- eigenvectors
- eigenvalues
- variance
- covariance
- projection
- dimensionality reduction
- multidimensional scaling (MDS)
- distance matrix
- similarity matrix
- stress
- metric MDS
- non-metric MDS
- ISOMAP
- t-SNE
- UMAP
8. Clustering
- machine learning
- clustering
- distance metrics
- partitional clustering: k-means
- clustering approaches
- density-based clustering
- divisive clustering
9. Optimisation
- exploring the solution space
- constraints
- branch and bound
- search heuristics
- gradient descent
- simulated annealing
- genetic algorithms
- optimisation
10. Supervised Machine Learning
- unsupervised learning
- associative rule learning
- unsupervised vs supervised ML
- GeoGuessr
- supervised ML
- KNN
- naive Bayes
- SVM
- decision trees
- ensemble methods
- genomics example
- markov models for gene identification
11. Evaluation
- classification
- regression