[Question] How to get the context vectors from text2vec

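Original poster: augustana (微小的希望)   2017-07-26 18:09:02
[Question type]:
Programming help (I want to do something in R, but I don't know how to write it in R)
[Software familiarity]:
Beginner (I have written programs in other languages, but I am not familiar with the syntax)
[Problem description]:
I want to use text2vec to turn text into vectors,
but I can't get at its word vectors and context vectors.
It looks like these two vectors sit in the private section of the glove environment,
so nothing I write manages to pull them out.
When I type glove$, the completion menu that R pops up only lists the public members.
The environment is shown below; w_i and w_j are the vectors I want to retrieve.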
<GloVe>
Inherits from: <word_embedding_model>
Public:
clone: function (deep = FALSE)
dump_every_n: 0
dump_model: function ()
fit: function (x, n_iter, convergence_tol = -1, ...)
get_history: function ()
get_word_vectors: function ()
initialize: function (word_vectors_size, vocabulary, x_max, learning_rate = 0.15,
shuffle: FALSE
verbose: TRUE
Private:
alpha: 0.75
b_i: -0.088758796453476 -0.200479492545128 -0.276277631521225 ...
b_j: 0.158077865839005 0.00269329198636115 -0.506954908370972 ...
cost_history: 0.0582876185658562 0.0376007230450009 0.0264438356106707 ...
fitted: TRUE
glove_fitter: Rcpp_GloveFitter
grain_size: 100000
initial: NULL
internal_matrix_format: dgTMatrix
lambda: 0
learning_rate: 0.15
max_cost: 10
vocab_terms: 3000 体型 较 呵护 m62 护肤 新生儿 s12 级 白金 特极 顶级 ...
w_i: 0.194914728403091 -0.0265734232962132 -0.611702501773834 ...
w_j: 0.273217976093292 -0.193755224347115 0.475706458091736 0 ...
word_vectors_history: NULL
word_vectors_size: 50
x_max: 10
[Code sample]:
library(text2vec)
keyword <- as.character(article_list$keyword[1])
keyword <- enc2utf8(keyword) # convert to UTF-8
keyword <- strsplit(keyword,',')
######## count the unique words
# iterator
token <- itoken(keyword)
# to create unique word matrix
vocab <- create_vocabulary(token, ngram=c(1, 1)) # term, count, share of documents
## keep only words that appear at least 5 times
#vocab <- prune_vocabulary(vocab, term_count_min = 5L)
######## vectorization
# vectorization of words
vectorizer <- vocab_vectorizer(vocab,
grow_dtm= FALSE, #don't vectorize input
skip_grams_window= 5L) #use window of 5 for context words
# tcm = term co-occurrence matrix
tcm <- create_tcm(token, vectorizer)
# fit the GloVe model, i.e. factorize the TCM
glove <- GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
# word vectors
word.vec <- glove$word_vectors$w_i + # word vectors
glove$word_vectors$w_j # context vectors
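The last statement is where things go wrong.
I wonder whether it is because glove() has been removed from text2vec
and replaced by GlobalVectors(), which would explain why this statement fails.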
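One possible way to reach the private fields, as a minimal sketch: R6 objects keep a reference to their enclosing environment in `.__enclos_env__`, and the private members sit in the `private` environment inside it. The `ToyEmbedding` class below is a hypothetical stand-in with made-up values, not the text2vec model; only the field names w_i and w_j are borrowed from the printout above.

```r
library(R6)

# Hypothetical stand-in for the fitted model: two private matrices named
# like the w_i / w_j fields in the printout (the values are made up).
ToyEmbedding <- R6Class(
  "ToyEmbedding",
  public = list(
    # Public accessor that returns the sum of both private matrices.
    get_word_vectors = function() private$w_i + private$w_j
  ),
  private = list(
    w_i = matrix(1:6, nrow = 2),  # stands in for the word vectors
    w_j = matrix(6:1, nrow = 2)   # stands in for the context vectors
  )
)

toy <- ToyEmbedding$new()

# names() on the object does not list the private fields,
# so w_i and w_j do not show up here.
public_names <- names(toy)

# The private environment is reachable through the enclosing environment.
priv <- toy$.__enclos_env__$private
word_vec    <- priv$w_i
context_vec <- priv$w_j

# The same kind of sum the failing statement was trying to build.
combined <- word_vec + context_vec
```

If glove is an ordinary R6 object, as the printout suggests, the same pattern would presumably be glove$.__enclos_env__$private$w_i and glove$.__enclos_env__$private$w_j. Private fields are not part of the package's public interface, so this may break between versions; the public get_word_vectors() listed in the printout is probably the safer route if it already returns what is needed.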
[Environment]:
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
[2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.2 RODBC_1.3-14 text2vec_0.4.0
loaded via a namespace (and not attached):
[1] compiler_3.4.0 magrittr_1.5 R6_2.2.0 Matrix_1.2-9 tools_3.4.0
[6] Rcpp_0.12.11 codetools_0.2-15 grid_3.4.0 iterators_1.0.8 foreach_1.4.3
[11] data.table_1.10.4 digest_0.6.10 RcppParallel_4.3.20 lattice_0.20-35
[Keywords]:
text2vec, environment, private
