[Exam] 106-1 陈建锦 Introduction to Information Retrieval and Text Mining, Midterm

Author: petC (sixeyeflyingfish)   2017-11-07 21:18:13
Course name: Introduction to Information Retrieval and Text Mining
Course type: Elective, Department of Information Management
Instructor: 陈建锦
College: College of Management
Department: Department of Information Management
Exam date (Y/M/D): Nov. 7, 2017
Time limit: 3 hours
Questions:
1. Term and vocabulary:
(a) (5 points) Why do we need to do token normalization?
(b) (5 points) Foreign names are often translated into different forms (e.g.,
Beijing and Peking); how do we normalize such tokens?
(c) (5 points) What are Zipf's law and Heaps' law? (5 points) Use them to
explain why statistics-based text mining remains difficult even if we extend
the size of our training corpus.
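For reference, the usual formulations of the two laws, written as LaTeX (the
constants k and b are corpus-dependent; typical values are 30 <= k <= 100 and
b ~ 0.5):

    % Zipf's law: the collection frequency cf_i of the i-th most frequent
    % term decays roughly as the inverse of its rank i.
    \mathrm{cf}_i \propto \frac{1}{i}

    % Heaps' law: the vocabulary size M grows as a power of the number of
    % tokens T in the collection.
    M = k\,T^{b}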
2. Term weighting and ranking:
(a) (5 points) TF-IDF and cosine similarity are useful for ranking documents
against a query. Explain why the length normalization in cosine similarity is
critical when ranking documents, and what would happen if we dropped it.
(b) (10 points) Suppose you are constructing an IR system using TF-IDF and
cosine similarity. Discuss the pros and cons of storing normalized TF-IDF
weights in the postings lists.
(c) (5 points) Show the formula of IDF and explain how it discriminates
terms.
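A minimal Python sketch of TF-IDF weighting with cosine (length)
normalization; the toy documents, function names, and the log-weighted TF
variant are illustrative only:

    import math
    from collections import Counter

    def tf_idf_vector(doc_tokens, df, n_docs):
        """Log-weighted TF times IDF for one document (one common variant)."""
        tf = Counter(doc_tokens)
        return {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
                for t, c in tf.items()}

    def cosine(u, v):
        """Cosine similarity: dot product of the length-normalized vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Toy collection: without the division by vector lengths, the long second
    # document would dominate simply because it repeats the query terms.
    docs = [["beijing", "news"], ["beijing", "news"] * 50, ["peking", "opera"]]
    df = Counter(t for d in docs for t in set(d))
    vecs = [tf_idf_vector(d, df, len(docs)) for d in docs]
    query = tf_idf_vector(["beijing", "news"], df, len(docs))
    print([round(cosine(query, v), 3) for v in vecs])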
3. PAT Tree:
(a) (10 points) Show the PAT trees obtained by inserting the first 8 sistrings
of the given text. You need to show the result after each sistring insertion.
Text: 000110011101110...
(b) (5 points) What is the longest string pattern in the PAT tree?
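A small sketch that lists the first 8 sistrings of the given text, assuming
the usual definition (the sistring at position i is simply the text read from
bit i onward); the trees themselves still have to be drawn by hand:

    # Enumerate the first 8 sistrings (semi-infinite strings) of the text.
    text = "000110011101110"
    for i in range(8):
        print(f"sistring {i + 1}: {text[i:]}")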
4. Evaluation:
(a) (5 points) Define precision and recall in terms of
the relevant-retrieved contingency table.
(b) (5 points) Why is recall a non-decreasing function of the number of
retrieved documents?
(c) (5 points) Why do precision and recall generally trade off against one
another?
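For reference, precision and recall computed from the contingency-table
counts (tp = relevant and retrieved, fp = retrieved but not relevant,
fn = relevant but not retrieved); the example counts are made up:

    def precision(tp, fp):
        """Fraction of retrieved documents that are relevant."""
        return tp / (tp + fp) if tp + fp else 0.0

    def recall(tp, fn):
        """Fraction of relevant documents that are retrieved."""
        return tp / (tp + fn) if tp + fn else 0.0

    # Retrieving more documents can only move relevant items from fn to tp
    # (or add fp), so recall never decreases as the retrieved set grows.
    print(precision(tp=8, fp=12), recall(tp=8, fn=2))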
5. BIM & BM25:
(a) (5 points) Under what circumstance would the log odds ratio of a term
be negative?
(b) (5 points) Show the retrieval status value (RSV) of BM25, and (5 points)
use it to discuss how BM25 improves on BIM.
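One commonly used form of the BM25 retrieval status value, as a reference
sketch (k1 and b are tuning parameters, L_d and L_avg the document length and
the average document length):

    \mathrm{RSV}_d = \sum_{t \in q}
        \log\frac{N}{\mathrm{df}_t} \cdot
        \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
             {k_1\bigl((1 - b) + b\,\tfrac{L_d}{L_{\mathrm{avg}}}\bigr) + \mathrm{tf}_{t,d}}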
6. Language Models:
(a) (5 points) The probability of a sentence under a unigram model is
normally higher than that under a bi-gram model. If so, why do we still
prefer more complex language models for NLP tasks?
(b) (10 points) The following table illustrates the statistics of a corpus
used for training a bi-gram language model. Calculate the probability of an
unseen bigram using Laplace's law and Good-Turing estimation, respectively.
(Good-Turing estimation is employed only for r < 3)
╔═════╦═══════╗
║  r  ║  Nr   ║
╠═════╬═══════╣
║  1  ║  4500 ║
║  2  ║ 15000 ║
║  3  ║  1000 ║
║  4  ║   500 ║
║  5  ║   500 ║
╚═════╩═══════╝
|V| = 100
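A sketch of the two estimators applied to the table. The size of the bigram
event space is taken here as |V|^2 (an assumption about how |V| = 100 is
meant), and N is the total number of bigram tokens, i.e. the sum of r * Nr:

    # Frequency-of-frequency table from the question: Nr[r] = number of
    # bigram types observed exactly r times in the training corpus.
    Nr = {1: 4500, 2: 15000, 3: 1000, 4: 500, 5: 500}
    V = 100                                # vocabulary size, |V| = 100
    B = V * V                              # assumed bigram event space: |V|^2
    N = sum(r * n for r, n in Nr.items())  # total bigram tokens

    # Laplace (add-one): an unseen bigram has raw count 0, so
    # P = (0 + 1) / (N + B).
    p_laplace = 1 / (N + B)

    # Good-Turing: adjusted counts are r* = (r + 1) * N_{r+1} / N_r, so the
    # total probability mass reserved for unseen bigrams (r = 0) is N1 / N,
    # to be shared among the bigrams that were never observed.
    p_unseen_mass = Nr[1] / N

    print(f"N = {N}, Laplace P(unseen) = {p_laplace:.6f}, "
          f"Good-Turing unseen mass = {p_unseen_mass:.4f}")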
