Charming Python: Easy Web data collection with mechanize and Beautiful Soup
2010-01-20 | Source: WEB开发网

Whichever tool you use to experiment with the Web site whose interactions you intend to automate, you will spend more time figuring out what the site actually does than you will spend writing the terse mechanize code that performs your task.
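mechanize itself is handy for that kind of reconnaissance. As a minimal sketch (the URL here is a placeholder, not from the article), you can enumerate the forms and links a page exposes before scripting against it:

from mechanize import Browser

br = Browser()
br.set_handle_robots(False)   # many sites' robots.txt would otherwise block us
br.open('http://www.example.com/login')
for form in br.forms():       # every HTML form on the page, with its controls
    print form
for link in br.links():       # every hyperlink, with anchor text and URL
    print link.text, '->', link.url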
The search results scraper
Given the intended project described above, I divided my roughly 100 lines of script into two functions:

Retrieve all the results I am interested in
Pull the information I care about out of the retrieved pages

Organizing the script this way is a development convenience; when I began the task, I knew I needed to work out how to do each of those two things. I had a feeling that the information I wanted lived in a common collection of pages, but I had not yet examined the specific layout of those pages.
I would first retrieve a set of pages and save them to disk, then carry out the second task of pulling the wanted information out of those saved files (a hedged sketch of that second stage appears after the listing below). Of course, if your task involves using the retrieved information to form new interactions within the same session, you will need a slightly different sequence of development steps.
So, first let us look at my fetch() function:
Listing 1. Fetching page content

import sys, time, os
import mechanize                  # needed below to name the exception class
from mechanize import Browser

LOGIN_URL = 'http://www.example.com/login'
USERNAME = 'DavidMertz'
PASSWORD = 'TheSpanishInquisition'
SEARCH_URL = 'http://www.example.com/search?'
FIXED_QUERY = 'food=spam&' 'utensil=spork&' 'date=the_future&'
VARIABLE_QUERY = ['actor=%s' % actor for actor in
                      ('Graham Chapman',
                       'John Cleese',
                       'Terry Gilliam',
                       'Eric Idle',
                       'Terry Jones',
                       'Michael Palin')]

def fetch():
    result_no = 0                 # Number the output files
    br = Browser()                # Create a browser
    br.open(LOGIN_URL)            # Open the login page
    br.select_form(name="login")  # Find the login form
    br['username'] = USERNAME     # Set the form values
    br['password'] = PASSWORD
    resp = br.submit()            # Submit the form

    # Automatic redirect sometimes fails, follow manually when needed
    if 'Redirecting' in br.title():
        resp = br.follow_link(text_regex='click here')

    # Loop through the searches, keeping fixed query parameters
    for actor in VARIABLE_QUERY:
        # I like to watch what's happening in the console
        print >> sys.stderr, '***', actor
        # Let's do the actual query now
        br.open(SEARCH_URL + FIXED_QUERY + actor)
        # The query actually gives us links to the content pages we like,
        # but there are some other links on the page that we ignore
        nice_links = [l for l in br.links()
                          if 'good_path' in l.url
                              and 'credential' in l.url]
        if not nice_links:        # Maybe the relevant results are empty
            break
        for link in nice_links:
            try:
                response = br.follow_link(link)
                # More console reporting on title of followed link page
                print >> sys.stderr, br.title()
                # Increment output filenames, open and write the file
                result_no += 1
                out = open('result_%04d' % result_no, 'w')
                print >> out, response.read()
                out.close()
            # Nothing ever goes perfectly, ignore if we do not get page
            except mechanize._response.httperror_seek_wrapper:
                print >> sys.stderr, "Response error (probably 404)"
            # Let's not hammer the site too much between fetches
            time.sleep(1)
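The second half of the job, pulling the interesting bits out of the saved result files, is where Beautiful Soup earns its keep. The sketch below is only a hedged illustration of that stage, not the article's own extraction code: it reads the result_%04d files that fetch() wrote, but the tag and class names ('td', 'code') are placeholder assumptions you would replace after inspecting the real pages.

import os
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 API, matching the era

def extract():
    for fname in sorted(os.listdir('.')):
        if not fname.startswith('result_'):        # only the files fetch() saved
            continue
        soup = BeautifulSoup(open(fname).read())   # parses even sloppy HTML
        # Placeholder assumption: each datum sits in a <td class="code"> cell
        for cell in soup.findAll('td', {'class': 'code'}):
            print fname, cell.string

Because the pages are already on disk, this stage can be rerun and refined as often as needed without touching the Web site again.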