[Notes] Repeatedly extracting data from a specific HTML tag (felaray, PTT 批踢踢实业坊)


Original poster: felaray (傲娇鱼) 2013-06-19 13:07:09

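I started studying RegExp the day before yesterday, and it is infuriating. It took me two days to get what I wanted working,
so I'm sharing it here. The code is C#.
Note: html is the page content fetched with HttpWebRequest;
the goal is to extract whatever is inside <dt>...</dt>.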
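// Requires: using System.Text.RegularExpressions;
string pattern = @"<dt[^>]*?>(?<word>.*?)</dt>";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(html);
int index = 0;
foreach (Match match in matches)
{
    GroupCollection groups = match.Groups;
    string x = groups["word"].Value.Trim();
    if (x != "") // some matches come back blank for reasons I don't know, so skip them here
        Response.Write(x + "<BR>");
    // For a numbered list, add ++index to the line above,
    // e.g. Response.Write(++index + ": " + x + "<BR>");
}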
Output
1: absolute bolometric magnitude 绝对热星等
2: absolute zero 绝对零度，绝对零点
3: acceleration 加速度
4: acceleration of gravity 重力加速度
5: accretion 吸积
6: Achernar 水委一
7: achondrites 无球粒陨石
8: achromatic lens 消色差透镜
9: albedo 反照率
10: Alcaid 摇光
11: Alcor 辅、开阳伴星
12: Alcyone 昂宿六
HTML source (excerpt)
<dt><b>absolute zero 绝对零度，绝对零点 </b></dt>
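
For anyone who wants to try this outside a web page, here is a minimal sketch of the same loop as a console program. It assumes C# 9 or later (top-level statements). ExtractDtContents is a hypothetical helper name, and sampleHtml is a made-up stand-in for the page fetched with HttpWebRequest.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the trimmed, non-empty contents of every <dt> element.
static List<string> ExtractDtContents(string html)
{
    // Same pattern as in the post; IgnoreCase also accepts <DT>.
    var regex = new Regex(@"<dt[^>]*?>(?<word>.*?)</dt>", RegexOptions.IgnoreCase);
    var results = new List<string>();
    foreach (Match match in regex.Matches(html))
    {
        string x = match.Groups["word"].Value.Trim();
        if (x != "") results.Add(x); // skip blank captures
    }
    return results;
}

// Assumed sample input standing in for the downloaded page.
string sampleHtml = "<dt><b>absolute zero</b></dt>\n<dt> </dt>\n<dt class=\"term\">albedo</dt>";

int index = 0;
foreach (string item in ExtractDtContents(sampleHtml))
    Console.WriteLine(++index + ": " + item); // numbered output
```

The capture keeps inner markup such as <b>, which a browser renders but a console prints literally. Also, in .NET the dot does not match newlines by default, so a <dt> whose content spans several lines would need RegexOptions.Singleline.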

Original poster: felaray (傲娇鱼) 2013-01-09 19:20:00

Sorry, I only just saw the reply. I did end up using that to solve it XD

Author: s25g5d4 (function(){})() 2013-06-19 14:31:00

[^>]*? : since you are already using a negated class, don't make it non-greedy as well; it hurts performance.
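
A rough sketch of why the lazy quantifier is unnecessary here: [^>] can never consume the closing >, so the greedy form stops at the same place. Captures is a hypothetical helper and the sample string is made up; this only checks that both patterns return the same captures and does not measure speed. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: trimmed "word" captures for a given pattern.
static string[] Captures(string pattern, string html) =>
    Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>()
         .Select(m => m.Groups["word"].Value.Trim())
         .ToArray();

string lazyPattern = @"<dt[^>]*?>(?<word>.*?)</dt>";   // pattern from the post
string greedyPattern = @"<dt[^>]*>(?<word>.*?)</dt>";  // suggested variant

// Assumed sample input.
string sample = "<dt class=\"term\">accretion</dt><dt>albedo</dt>";

bool same = Captures(lazyPattern, sample)
    .SequenceEqual(Captures(greedyPattern, sample));
Console.WriteLine(same); // prints True
```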

Original poster: felaray (傲娇鱼) 2013-06-19 14:54:00

OK, thanks for the suggestion. The data is small for now, so I can't feel the difference yet XD

Author: henry10423 (MrElsonXu) 2013-08-19 21:19:00

If you are parsing web pages in C#, I'd suggest HtmlAgilityPack. It is quite powerful, and you don't need Regex at all.
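
A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is installed and C# 9 or later (top-level statements). DtTexts is a hypothetical helper name and the sample string is made up.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, assumed to be installed

// Hypothetical helper: trimmed, non-empty text of every <dt> element.
static List<string> DtTexts(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var result = new List<string>();
    // SelectNodes returns null when nothing matches the XPath.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//dt");
    if (nodes == null) return result;

    foreach (HtmlNode node in nodes)
    {
        string text = node.InnerText.Trim(); // text only, inner tags such as <b> are dropped
        if (text != "") result.Add(text);
    }
    return result;
}

// Assumed sample input.
string page = "<dl><dt><b>absolute zero </b></dt><dt> </dt><dt>albedo</dt></dl>";
foreach (string text in DtTexts(page))
    Console.WriteLine(text);
```

Unlike the regex capture, InnerText returns only the text, so the <b> wrapper seen in the HTML excerpt does not show up in the result.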
