H.P 琥珀
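BeautifulSoup parsing example: getting the friend list from a Xiaonei profile page - [python]
Posted by canri62 on 2008-12-09 22:56:43
Copyright notice: when reposting, please use a hyperlink to credit the article's original source, the author, and this notice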
http://blozer.blogbus.com/logs/32328244.html
def parse_buddies(self, html):  # method name assumed; the original post omits the def line
    '''Parse the HTML page and load the buddy list.
    args:
        html - HTML doc as string
    return:
        a list of (id, name) tuples, one per buddy.'''
    soup = BeautifulSoup(html)
    li = soup.findAll('a', href=re.compile(self.GET_FRIEND_REX))
    # keep only the links that have contents (this skips the empty avatar links)
    return [(tag['href'].split('=')[-1], tag.contents[0])
            for tag in li if tag.contents]
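The requirement is as follows:
* Get the information (ID and name) of the friends listed on the right side of a user's page
Assume the HTML of that page has already been fetched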
>>> print doc
# excerpt of the string
<div class="box-body">
<div class="clearfix">
<ul class="people-list">
<li><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=1879406230" title="查看夏文杰的个人主页" style="background-image:url(http://hd52.xiaonei.com/photos/hd52/20080803/18/52/tiny_4903n169.jpg);"></a><span><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=1879406230">夏文杰</a></span></li>
<li><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=170169633" title="查看叶雨*刘洋的个人主页" style="background-image:url(http://hd37.xiaonei.com/photos/hd37/20081029/14/05/tiny_LgRL_2498m204215.jpg);"></a><span><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=170169633">叶雨*刘洋</a></span></li>
<li><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=222680340" title="查看艾俊峰的个人主页" style="background-image:url(http://hd11.xiaonei.com/photos/hd11/20071006/21/22/tiny_115g171.jpg);"></a><span><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=222680340">艾俊峰</a></span></li>
<li><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=228684452" title="查看彭巍的个人主页" style="background-image:url(http://hd50.xiaonei.com/photos/hd50/20071110/23/47/tiny_7830g169.jpg);"></a><span><a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=228684452">彭巍</a></span></li>
....
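The quickest route seems to be searching for a specific string in the href attribute, i.e. the URL of each friend's profile. Reading http://xiaonei.com/profile.do?portal=profileFriendlist&id=XXX literally, portal is presumably used to record where a page visit "came from", and id is the user ID.
I tried it: the page loads no matter what value follows portal= :)
Here is the code: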
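Since only id identifies the user and portal can apparently take any value, the ID can also be read from the query string instead of relying on the position of "=". A minimal sketch, assuming Python 3 and the standard-library urllib.parse module; the helper name friend_id and the second URL's portal value are hypothetical:

```python
from urllib.parse import urlparse, parse_qs


def friend_id(url):
    """Return the value of the id query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    return values[0] if values else None


url_a = "http://xiaonei.com/profile.do?portal=profileFriendlist&id=1879406230"
# Hypothetical portal value, only to show that it does not affect the result.
url_b = "http://xiaonei.com/profile.do?portal=anything&id=1879406230"

print(friend_id(url_a))
print(friend_id(url_b))
```

Both calls yield the same ID string, whatever portal is set to.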
from BeautifulSoup import BeautifulSoup
import re
soup = BeautifulSoup(doc) # doc contains html strings
soup.findAll('a', href=re.compile(r"^http://xiaonei\.com/profile\.do\?portal=profileFriendlist&id=."))
# note: "." and "?" are regex metacharacters and must be escaped with a backslash
Result:
[<a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=1879406230" title="查看夏文杰的个人主页" style="background-image:url(http://hd52.xiaonei.com/photos/hd52/20080803/18/52/tiny_4903n169.jpg);"></a>, <a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=1879406230">夏文杰</a>, <a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=170169633" title="查看叶雨*刘洋的个人主页" style="background-image:url(http://hd37.xiaonei.com/photos/hd37/20081029/14/05/tiny_LgRL_2498m204215.jpg);"></a>, <a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=170169633">叶雨*刘洋</a>,...]
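The warning about "." and "?" in the pattern is worth a quick check. Left unescaped, "." matches any character and "o?" makes the "o" optional, so the pattern no longer matches the real URL at all. A small sketch using only the standard re module (Python 3); the bogus URL is made up purely for illustration:

```python
import re

prefix = "http://xiaonei.com/profile.do?portal=profileFriendlist&id="

# Unescaped: "." is a wildcard and "o?" means "an optional o".
unescaped = re.compile("^" + prefix + ".")
# Escaped: re.escape() backslash-escapes the metacharacters for us.
escaped = re.compile("^" + re.escape(prefix) + ".")

good = "http://xiaonei.com/profile.do?portal=profileFriendlist&id=170169633"
# Made-up string that is not a valid profile URL.
bogus = "http://xiaoneiXcom/profileXdportal=profileFriendlist&id=1"

print(bool(unescaped.match(good)), bool(unescaped.match(bogus)))
print(bool(escaped.match(good)), bool(escaped.match(bogus)))
```

The unescaped pattern rejects the real URL (after the optional "o" it expects "portal", but the next character is the literal "?") and accepts the bogus one. The escaped pattern does the opposite.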
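The findAll result looks somewhat redundant: each person produces two matches, one of which carries the avatar URL. An entry such as <a href="http://xiaonei.com/profile.do?portal=profileFriendlist&id=1879406230">夏文杰</a> already seems to satisfy the requirement.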
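One way to drop the redundant avatar links is to keep only the matching anchors that actually contain text. The sketch below illustrates that filtering idea with Python 3's standard-library html.parser instead of BeautifulSoup, so it runs without third-party packages. The class name FriendLinkParser is hypothetical, and the sample markup is a shortened version of one list item from the page above:

```python
import re
from html.parser import HTMLParser

# Same pattern as in the post; "." and "?" are escaped.
FRIEND_HREF = re.compile(
    r"^http://xiaonei\.com/profile\.do\?portal=profileFriendlist&id=.")


class FriendLinkParser(HTMLParser):
    """Collect (user_id, name) for matching <a> tags that contain text."""

    def __init__(self):
        super().__init__()
        self.friends = []
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if FRIEND_HREF.match(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:  # the avatar link has no text, so it is skipped
                self.friends.append((self._href.split("=")[-1], name))
            self._href = None


sample = (
    '<li><a href="http://xiaonei.com/profile.do?portal=profileFriendlist'
    '&id=222680340" title="查看艾俊峰的个人主页"></a>'
    '<span><a href="http://xiaonei.com/profile.do?portal=profileFriendlist'
    '&id=222680340">艾俊峰</a></span></li>'
)

parser = FriendLinkParser()
parser.feed(sample)
parser.close()
friends = parser.friends
print(friends)
```

Each person then appears once, as an (id, name) pair. This is the same rule as checking tag.contents on the BeautifulSoup result.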
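Copyright notice: articles on this site are licensed under Attribution 3.0 Unported; when reposting, please credit the article's original source, the author, and this notice.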