[Exam] 106-1 陈建锦 Introduction to Information Retrieval and Text Mining, Midterm

Author: petC (sixeyeflyingfish)   2017-11-07 21:18:13
Course name: Introduction to Information Retrieval and Text Mining
Course type: Elective, Department of Information Management
Instructor: 陈建锦
College: College of Management
Department: Department of Information Management
Exam date (Y/M/D): Nov. 7, 2017
Time limit: 3 hours
Questions:
1. Term and vocabulary:
(a) (5 points) Why do we need to do token normalization?
(b) (5 points) Foreign names are often translated into different forms (e.g.,
Beijing and Peking); how do we normalize such tokens?
(c) (5 points) What are Zipf's law and Heaps' law? (5 points) Use them to
explain why statistics-based text mining remains difficult even if we extend
the size of our training corpus.
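For reference, the usual formulations of the two laws, written as LaTeX (the
constants k and b are corpus-dependent; typical values are 30 <= k <= 100 and
b ~ 0.5):

    % Zipf's law: the collection frequency cf_i of the i-th most frequent
    % term decays roughly as the inverse of its rank i.
    \mathrm{cf}_i \propto \frac{1}{i}

    % Heaps' law: the vocabulary size M grows as a power of the number of
    % tokens T in the collection.
    M = k\,T^{b}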
2. Term weighting and ranking:
(a) (5 points) TF-IDF and cosine similarity are useful for ranking documents
against a query. Explain why the length normalization in cosine similarity is
critical when ranking documents, and what would happen if we dropped it.
(b) (10 points) Suppose you are constructing an IR system using TF-IDF and
cosine similarity. Discuss the pros and cons of storing normalized TF-IDF
weights in the postings lists.
(c) (5 points) Show the formula of IDF and explain how it discriminates
terms.
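A minimal Python sketch of TF-IDF weighting with cosine (length)
normalization; the toy documents, function names, and the log-weighted TF
variant are illustrative only:

    import math
    from collections import Counter

    def tf_idf_vector(doc_tokens, df, n_docs):
        """Log-weighted TF times IDF for one document (one common variant)."""
        tf = Counter(doc_tokens)
        return {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
                for t, c in tf.items()}

    def cosine(u, v):
        """Cosine similarity: dot product of the length-normalized vectors."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Toy collection: without the division by vector lengths, the long second
    # document would dominate simply because it repeats the query terms.
    docs = [["beijing", "news"], ["beijing", "news"] * 50, ["peking", "opera"]]
    df = Counter(t for d in docs for t in set(d))
    vecs = [tf_idf_vector(d, df, len(docs)) for d in docs]
    query = tf_idf_vector(["beijing", "news"], df, len(docs))
    print([round(cosine(query, v), 3) for v in vecs])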
3. PAT Tree:
(a) (10 points) Show the PAT trees obtained by inserting the first 8 sistrings
of the given text. You need to show the result after each sistring insertion.
Text: 000110011101110...
(b) (5 points) What is the longest string pattern in the PAT tree?
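A small sketch that lists the first 8 sistrings of the given text, assuming
the usual definition (the sistring at position i is simply the text read from
bit i onward); the trees themselves still have to be drawn by hand:

    # Enumerate the first 8 sistrings (semi-infinite strings) of the text.
    text = "000110011101110"
    for i in range(8):
        print(f"sistring {i + 1}: {text[i:]}")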
4. Evaluation:
(a) (5 points) Define precision and recall in terms of
the relevant-retrieved contingency table.
(b) (5 points) Why is recall a non-decreasing function of the number of
retrieved documents?
(c) (5 points) Why do precision and recall generally trade off against one
another?
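For reference, precision and recall computed from the contingency-table
counts (tp = relevant and retrieved, fp = retrieved but not relevant,
fn = relevant but not retrieved); the example counts are made up:

    def precision(tp, fp):
        """Fraction of retrieved documents that are relevant."""
        return tp / (tp + fp) if tp + fp else 0.0

    def recall(tp, fn):
        """Fraction of relevant documents that are retrieved."""
        return tp / (tp + fn) if tp + fn else 0.0

    # Retrieving more documents can only move relevant items from fn to tp
    # (or add fp), so recall never decreases as the retrieved set grows.
    print(precision(tp=8, fp=12), recall(tp=8, fn=2))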
5. BIM & BM25:
(a) (5 points) Under what circumstance would the log odds ratio of a term
be negative?
(b) (5 points) Show the retrieval status value (RSV) of BM25, and (5 points)
use it to discuss how BM25 improves on BIM.
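One commonly used form of the BM25 retrieval status value, as a reference
sketch (k1 and b are tuning parameters, L_d and L_avg the document length and
the average document length):

    \mathrm{RSV}_d = \sum_{t \in q}
        \log\frac{N}{\mathrm{df}_t} \cdot
        \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
             {k_1\bigl((1 - b) + b\,\tfrac{L_d}{L_{\mathrm{avg}}}\bigr) + \mathrm{tf}_{t,d}}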
6. Language Models:
(a) (5 points) The probability of a sentence under a unigram model is
normally higher than that under a bi-gram model. If so, why do we still
prefer more complex language models for NLP tasks?
(b) (10 points) The following table illustrates the statistics of a corpus
used for training a bi-gram language model. Calculate the probability of an
unseen bigram using Laplace's law and Good-Turing estimation, respectively.
(Good-Turing estimation is employed only for r < 3)
╔═════╦═══════╗
║  r  ║  Nr   ║
╠═════╬═══════╣
║  1  ║  4500 ║
║  2  ║ 15000 ║
║  3  ║  1000 ║
║  4  ║   500 ║
║  5  ║   500 ║
╚═════╩═══════╝
|V| = 100
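A sketch of the two estimators applied to the table. The size of the bigram
event space is taken here as |V|^2 (an assumption about how |V| = 100 is
meant), and N is the total number of bigram tokens, i.e. the sum of r * Nr:

    # Frequency-of-frequency table from the question: Nr[r] = number of
    # bigram types observed exactly r times in the training corpus.
    Nr = {1: 4500, 2: 15000, 3: 1000, 4: 500, 5: 500}
    V = 100                                # vocabulary size, |V| = 100
    B = V * V                              # assumed bigram event space: |V|^2
    N = sum(r * n for r, n in Nr.items())  # total bigram tokens

    # Laplace (add-one): an unseen bigram has raw count 0, so
    # P = (0 + 1) / (N + B).
    p_laplace = 1 / (N + B)

    # Good-Turing: adjusted counts are r* = (r + 1) * N_{r+1} / N_r, so the
    # total probability mass reserved for unseen bigrams (r = 0) is N1 / N,
    # to be shared among the bigrams that were never observed.
    p_unseen_mass = Nr[1] / N

    print(f"N = {N}, Laplace P(unseen) = {p_laplace:.6f}, "
          f"Good-Turing unseen mass = {p_unseen_mass:.4f}")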
