COMP90042 Natural Language Processing
8/1/24
What is this course about?
This course is an introduction to natural language processing for engineering students. The course covers the following topics:
- Preprocessing
  - sentence segmentation
  - tokenization, subword tokenization
  - word normalization
    - inflectional vs derivational morphology
    - lemmatization vs stemming
  - stopword removal
- N-gram Language Model
  - derivation
  - smoothing techniques
    - add-k (see the Python sketch after this list)
    - absolute discounting
    - Katz backoff
    - Kneser-Ney smoothing
    - interpolation
- Text Classification
  - build a classifier
    - tasks
      - topic classification
      - sentiment analysis
      - native language identification
    - algorithms
      - naive Bayes, logistic regression, SVM
      - kNN, neural networks
    - bias vs variance: the trade-off between underfitting and overfitting
    - evaluation
      - precision, recall, F1
- Part of Speech Tagging
  - English POS
    - closed vs open classes
  - tagsets
    - Penn Treebank tagset
  - automatic taggers
    - rule-based
    - statistical
      - unigram, classifier-based, HMM
- Hidden Markov Model
  - probabilistic formulation:
    - emission & transition
  - training
  - Viterbi algorithm (see the Python sketch after this list)
  - generative vs discriminative models
- Feedforward Neural Network
  - formulation
  - tasks:
    - topic classification
    - language models
    - POS tagging
  - word embeddings
  - convolutional networks
- Recurrent Neural Network
  - formulation
  - RNN: language models
  - LSTM:
    - functions of gates
    - variants
  - tasks:
    - text classification: sentiment analysis
    - POS tagging
- Lexical Semantics
  - definition of word sense, gloss
  - lexical relationships:
    - synonymy, antonymy, hypernymy, meronymy
  - structure of WordNet
  - word similarity
    - path length
    - depth information
    - information content
  - word sense disambiguation
    - supervised, unsupervised
- Distributional Semantics
  - matrices:
    - VSM, TF-IDF, word-word co-occurrence
  - association measures: PMI, PPMI
  - count-based methods: SVD
  - neural methods: skip-gram, CBOW
  - evaluation:
    - word similarity, analogy
- Contextual Representation
  - formulation with RNN
  - ELMo
  - BERT
    - objective
    - fine-tuning for downstream tasks
  - transformers
    - multi-head attention
- Attention
  - sequential model with attention
  - attention variants:
    - concat
    - dot product
    - scaled dot product (see the Python sketch after this list)
    - location-based
    - cosine similarity
  - global vs local attention
  - self-attention
- Machine Translation
  - statistical MT
  - neural MT with teacher forcing
- Transformer
  - multi-head self-attention
  - position encoding
- Pretrained Language Models
  - encoder architecture
    - BERT
      - bidirectional context
      - good for classification
  - encoder-decoder architecture
  - decoder architecture
    - GPT
      - unidirectional context
      - good for generation
- Large Language Models
  - in-context learning
  - step-by-step reasoning
  - reinforcement learning from human feedback (RLHF)
  - instruction following
- Named Entity Recognition
  - predict entities in a text
  - traditional ML methods
  - bi-LSTM with another RNN/CRF
  - multi-aspect NER
- Coreference Resolution
  - coreference, anaphora
  - B-cubed metric
- Question Answering and Reading Comprehension
  - knowledge-based QA
  - visual QA
- Spoken Language Understanding
  - attention RNNs
  - joint BERT
  - tri-level model
  - explainable NLU
- Vision Language Pretrained Model
  - V-L interaction model
    - self-attention, co-attention, VSE
  - contrastive language-image pretraining
- Ethics
  - social bias (stereotypes)
  - incivility (hate speech)
  - privacy violation
  - misinformation (rumors)
  - technological divide (unequal development)
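To make the smoothing item above concrete, here is a minimal Python sketch of add-k smoothing for a bigram language model. The toy corpus, the choice of k, and the function names are illustrative assumptions, not code from the course materials.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def add_k_prob(w_prev, w, unigrams, bigrams, k=0.5):
    """P(w | w_prev) = (count(w_prev, w) + k) / (count(w_prev) + k * |V|)."""
    vocab_size = len(unigrams)
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

# Toy corpus (illustrative only)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_counts(corpus)
print(add_k_prob("the", "cat", uni, bi))   # seen bigram
print(add_k_prob("the", "fish", uni, bi))  # unseen bigram still gets non-zero mass
```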
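Likewise, a small sketch of the Viterbi algorithm for HMM tagging, working in log space. The two-tag model and its hand-set start, transition, and emission probabilities are invented purely for illustration.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under an HMM, using log probabilities.
    start_p[t], trans_p[prev][t], emit_p[t][w] are (smoothed) probabilities."""
    # Trellis: one dict per word, mapping tag -> (best log score, backpointer)
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-10)), None)
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p][0] + math.log(trans_p[p][t]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][t])
                     + math.log(emit_p[t].get(w, 1e-10)))
            row[t] = (score, best_prev)
        V.append(row)
    # Backtrace from the best final tag
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

# Tiny illustrative model: two tags, hand-set probabilities
tags = ["DET", "NOUN"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.4, "NOUN": 0.6}}
emit_p = {"DET": {"the": 0.9}, "NOUN": {"dog": 0.5, "cat": 0.5}}
print(viterbi(["the", "dog"], tags, start_p, trans_p, emit_p))  # ['DET', 'NOUN']
```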
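Finally, a sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The random toy matrices and the helper name are assumptions made for demonstration only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy example: 2 queries attending over 3 key/value vectors of dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each query's attention weights sum to 1
```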
What is Natural Language Processing?
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language. It involves the development of algorithms and models that enable computers to understand, interpret, and generate human language. NLP is used in a wide range of applications, including machine translation, sentiment analysis, chatbots, and speech recognition.