Re: [Question] Regular expressions (regex) in R

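Original poster: cywhale (cywhale)   2016-04-30 23:51:48
※ Quoting celestialgod (天):
: ※ Quoting cywhale (cywhale):
: : [Question type]:
: : Programming help (I want to do something in R but don't know how to write it in R)
: : I want to keep only English letters at the beginning and end of a string, so I wrote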
: : gsub("^[^a-zA-Z]+|[^a-zA-Z]+$", "", x)
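: : But if the string ends with "sp." or "spp.", I want to keep the "." so that the expression above does not strip it
: : For example, "aaa bbb sp." should stay as it is
: : In all other cases the "." should be stripped, e.g. "aaa bbb22." -> "aaa bbb"
: : I tried syntax like ?: and ?! but couldn't get it to work, so I'm asking everyone for help~~ Thanks~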
: str <- c("aaa bbb sp.", "aaa bbb sp2.")
: gsub("[^a-zA-Z]*([a-zA-Z. ]+).*", "\\1", str)
: (the space inside [a-zA-Z. ] has to stay, otherwise things go wrong XD)
: # [1] "aaa bbb sp." "aaa bbb sp"
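: I forgot to ask: could there be cases like "aa2 bb3 cc." that should become "aa bb cc."?
: If so, I'd suggest using regmatches to pull out "aa", "bb" and "cc." and then process them QQ
: Something like this (it may not cover every case):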
: str <- c("aaa bbb sp.", "aaa bbb sp2.", "aa2 bb3 cc.")
: sapply(regmatches(str, gregexpr("[a-zA-Z. ]+", str)), function(x) {
:   paste0(x[x != "."], collapse = "")
: })
: # [1] "aaa bbb sp." "aaa bbb sp" "aa bb cc."
From the previous post (thanks, celestialgod) I learned about "\\1" and got some ideas,
so I tried writing the following code.
The results are close to what I want: simplifying scientific names collected
from the web, whose formats are a real mess. ><
After these trials I learned a lot about handling regex... ^_^
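# The pattern must be a single string with no line break inside it,
# so it is assembled from two pieces with paste0().
pattern <- paste0("^[^a-zA-Z]+|(?!\\.)[^a-zA-Z]+$|",
                  "\\b((sp\\.)+$)|\\b((spp\\.)+$)|((\\w{0,})\\.+$)")
gsub(pattern, "\\2\\4\\6",
     c("33aaa sp.", "aaa sp.bb33", "aaasp.bb 33 de", "aaa w2sp.",
       "aaa www spp. ", "spp.", "bb.", "XXX sp. ",
       "YYY spp.()", "ZZZZ.."), perl = TRUE)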
[1] "aaa sp." "aaa sp.bb" "aaasp.bb 33 de" "aaa w2sp" "aaa www spp."
[6] "spp." "bb" "XXX sp." "YYY spp." "ZZZZ"
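The replacement "\\2\\4\\6" relies on one property of alternation: only the capture groups of the branch that actually matched hold text, and the others come back empty, so writing the backreferences side by side keeps whichever one matched. A minimal sketch of that idea, using a toy pattern and inputs made up for illustration (not the pattern above):

```r
# Toy example (hypothetical pattern and inputs, for illustration only).
# Branch 1 captures a trailing "sp." in group 1.
# Branch 2 captures the word before any other trailing dots in group 2.
toy <- "\\b(sp\\.)$|(\\w+)\\.+$"

# Only one branch matches per string, so one of \\1 and \\2 is always empty.
demo <- gsub(toy, "\\1\\2", c("aaa sp.", "bbb.", "ccc"), perl = TRUE)
print(demo)
# [1] "aaa sp." "bbb"     "ccc"
```

In the first string branch 1 puts "sp." back unchanged; in the second, branch 2 puts back "bbb" without its dot; the third has no match and is left alone.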
If you have any comments or find any bugs, just tell me! Thanks for the help~
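For comparison, the same goal can probably be reached in smaller steps that are easier to read. This is only a sketch: `clean_name` and `keep_dot` are names made up here, and it has only been checked against the ten test strings from this post, so it may behave differently from the single regex on other inputs (for example "YYY spp.." would keep one dot here).

```r
# Hypothetical helper, not from the thread: same goal, split into small steps.
clean_name <- function(x) {
  # Does the string end in "sp." or "spp." at a word boundary,
  # followed only by non-letter junk (spaces, brackets, digits)?
  keep_dot <- grepl("\\bspp?\\.[^a-zA-Z]*$", x, perl = TRUE)
  # Strip every non-letter from both ends (this also drops the dot).
  trimmed <- gsub("^[^a-zA-Z]+|[^a-zA-Z]+$", "", x)
  # Put the dot back only where it belongs.
  ifelse(keep_dot, paste0(trimmed, "."), trimmed)
}

x <- c("33aaa sp.", "aaa sp.bb33", "aaasp.bb 33 de", "aaa w2sp.",
       "aaa www spp. ", "spp.", "bb.", "XXX sp. ",
       "YYY spp.()", "ZZZZ..")
result <- clean_name(x)
print(result)
```

On these ten strings it should give the same results as the output above.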
Author: celestialgod (天)   2016-04-30 23:55:00
This regex is really ugly XDD
Original poster: cywhale (cywhale)   2016-05-01 00:01:00
haha.. really.. @@
