[Exam] 110-1 陈建锦 Introduction to Text Mining Final Exam

Posted by: unmolk (UJ)   2022-01-28 01:47:16
Course name: Introduction to Text Mining
Course type: Elective, Department of Information Management
Instructor: 陈建锦
College: College of Management
Department: Department of Information Management
Exam date (Y.M.D): 110.01.07
Time limit (minutes): 180
Exam questions:
1. (5 points) Why is token normalization (e.g., stemming) important to text
mining? (5 points) Also, what is the difference between stemming and
lemmatization?
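
A minimal sketch of the stemming vs. lemmatization contrast, assuming NLTK with
the WordNet data installed (the library choice and example words are illustrative):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips suffixes by rule; the result need not be a real word.
print(stemmer.stem("studies"))                  # studi
print(stemmer.stem("running"))                  # run
# Lemmatization maps a word to its dictionary form, using its part of speech.
print(lemmatizer.lemmatize("studies"))          # study
print(lemmatizer.lemmatize("better", pos="a"))  # good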
2. TF-IDF is a classic term weighting scheme in text mining. (5 points) Explain
why TF alone is not good enough to measure the weight of a term. (10 points)
Show the definition of IDF and explain how it helps discriminate important terms.
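
For reference, a common (unsmoothed) definition is idf(t) = log(N / df(t)), where
N is the number of documents and df(t) is the number of documents containing t;
variants add smoothing. A small illustrative sketch in plain Python (the toy
documents are made up):

import math
from collections import Counter

docs = [["text", "mining", "is", "fun"],
        ["text", "classification", "is", "useful"],
        ["text", "mining", "finds", "patterns"]]

N = len(docs)
df = Counter(t for d in docs for t in set(d))   # document frequency of each term
idf = {t: math.log(N / df[t]) for t in df}      # idf(t) = log(N / df(t))

tf = Counter(docs[0])                           # raw term frequency in document 0
tfidf = {t: tf[t] * idf[t] for t in tf}
print(tfidf)   # "text" occurs in every document, so its IDF (and TF-IDF) is 0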
3. (5 points) Explain the role of validation data in building supervised text
mining models (e.g., classification).
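
A minimal sketch of where validation data sits in the pipeline, assuming
scikit-learn with a synthetic dataset (the split sizes are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test set first, then carve a validation set out of the training data.
# The validation set is used to tune hyperparameters and decide when to stop training;
# the test set is touched only once, for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)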
4. (10 points) What are classification precision and recall? (5 points) Why do
we say precision and recall generally trade off against each other? (5 points)
Also, when measuring multi-class classification results, which average (micro
or macro) makes large classes dominate small classes?
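
A small sketch contrasting micro and macro averaging, assuming scikit-learn; the
toy labels are made up so that class 0 dominates:

from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 0, 2]

# precision = TP / (TP + FP), recall = TP / (TP + FN), computed per class.
# Micro-averaging pools every decision, so the large class dominates the score;
# macro-averaging averages per-class scores, weighting all classes equally.
print(precision_score(y_true, y_pred, average="micro"))
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="micro"))
print(recall_score(y_true, y_pred, average="macro"))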
5. (10 points) What is the advantage of Latent Semantic Analysis (dimension
reduction) over the bag-of-words model when computing cosine similarity between
documents?
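
An illustrative sketch of LSA on top of TF-IDF vectors, assuming scikit-learn
(the documents and the number of latent dimensions are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road",
        "the automobile travels on the highway",
        "stock markets fell sharply today"]

# Bag-of-words/TF-IDF: "car" and "automobile" are unrelated dimensions, so the
# content words of the first two documents do not overlap at all.
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X[0], X[1]))

# LSA (truncated SVD) projects documents into a low-dimensional latent space in
# which terms that co-occur across documents load on the same dimensions.
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(cosine_similarity(Z[0:1], Z[1:2]))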
6. (5 points) Explain why kernel SVM is capable of solving difficult
classification problems. (5 points) Is kernel SVM still a linear classification
model? Explain your answer.
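
A toy sketch of the kernel trick, assuming scikit-learn (the dataset and
parameters are illustrative):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line separates the classes in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)   # kernel implicitly maps points to a higher-dimensional space

print(linear_svm.score(X, y))           # around chance level
print(rbf_svm.score(X, y))              # close to 1.0 on this toy data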
7. (5 points) What is an n-gram? (10 points) Also, explain why n is usually not
a large number in practice.
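
A minimal sketch of word n-gram extraction in plain Python (the tokens are made up):

def ngrams(tokens, n):
    # Return all contiguous n-token subsequences.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["text", "mining", "is", "fun"]
print(ngrams(tokens, 2))  # [('text', 'mining'), ('mining', 'is'), ('is', 'fun')]
print(ngrams(tokens, 3))  # [('text', 'mining', 'is'), ('mining', 'is', 'fun')]
# The number of distinct n-grams grows quickly with n, so counts get sparse;
# this is one reason n is usually kept small (2 or 3) in practice.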
8. (10 points) Explain why Word2Vec is able to produce similar embeddings for
semantically similar words (e.g., synonyms).
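
An illustrative sketch with gensim (4.x parameter names); the toy corpus is far
too small to train useful vectors and is only meant to show the API:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "kitten", "sat", "on", "the", "rug"],
             ["stocks", "fell", "on", "the", "market"]]

# Skip-gram (sg=1) learns embeddings by predicting context words, so words that
# appear in similar contexts (e.g., "cat" and "kitten") drift toward similar vectors.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, seed=0)
print(model.wv.most_similar("cat", topn=3))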
9. (5 points) Please explain the following code.
model.fit(partial_x_train, partial_y_train,
          epochs=20,
          batch_size=512,
          validation_data=(x_val, y_val))
