开发学院软件开发 Python 可爱的 Python: 使用 mechanize 和 Beautiful Sou... 阅读

可爱的 Python: 使用 mechanize 和 Beautiful Soup 轻松收集 Web 数据

　2010-01-20 00:00:00　来源：WEB开发网　　　

核心提示： 不管使用哪一种工具来对准备实现自动化交互的 Web 站点做实验，您都需要花比编写简洁的 mechanize 代码（用于执行您的任务）更多的时间来了解站点实际发生的行为，可爱的 Python: 使用 mechanize 和 Beautiful Soup 轻松收集 Web 数据(3)，搜索结果 s

不管使用哪一种工具来对准备实现自动化交互的 Web 站点做实验，您都需要花比编写简洁的 mechanize 代码（用于执行您的任务）更多的时间来了解站点实际发生的行为。

搜索结果 scraper

考虑到上面提到的项目的意图，我将把包含 100 行代码的脚本分为两个功能：

检索所有感兴趣的结果

从被检索的页面中拉取我感兴趣的信息

使用这种方式组织脚本是为了便于开发；当我开始任务时，我需要知道如何完成这两个功能。我觉得我需要的信息位于一个普通的页面集合中，但是我还没有检查这些页面的具体布局。

首先我将检索一组页面并将它们保存到磁盘，然后执行第二个任务，从这些已保存的文件中拉取所需的信息。当然，如果任务涉及使用检索到的信息构成同一会话内的新交互，那么您将需要使用顺序稍微不同的开发步骤。

因此，首先让我们查看我的 fetch() 函数：

清单 1. 获取页面内容
import　sys,　time,　os　 from　mechanize　import　Browser　　 LOGIN_URL　=　'http://www.example.com/login'　 USERNAME　=　'DavidMertz'　 PASSWORD　=　'TheSpanishInquisition'　 SEARCH_URL　=　'http://www.example.com/search?'　 FIXED_QUERY　=　'food=spam&'　'utensil=spork&'　'date=the_future&'　 VARIABLE_QUERY　=　['actor=%s'　%　actor　for　actor　in　　　　　('Graham　Chapman',　　　　　　'John　Cleese',　　　　　　'Terry　Gilliam',　　　　　　'Eric　Idle',　　　　　　'Terry　Jones',　　　　　　'Michael　Palin')]　　 def　fetch():　　　result_no　=　0　　　　　　　　　#　Number　the　output　files　　　br　=　Browser()　　　　　　　　#　Create　a　browser　　　br.open(LOGIN_URL)　　　　　　#　Open　the　login　page　　　br.select_form(name="login")　#　Find　the　login　form　　　br['username']　=　USERNAME　　　#　Set　the　form　values　　　br['password']　=　PASSWORD　　　resp　=　br.submit()　　　　　　#　Submit　the　form　　　　#　Automatic　redirect　sometimes　fails,　follow　manually　when　needed　　　if　'Redirecting'　in　br.title():　　　　　resp　=　br.follow_link(text_regex='click　here')　　　　#　Loop　through　the　searches,　keeping　fixed　query　parameters　　　for　actor　in　in　VARIABLE_QUERY:　　　　　#　I　like　to　watch　what's　happening　in　the　console　　　　　print　>>　sys.stderr,　'***',　actor　　　　　#　Lets　do　the　actual　query　now　　　　　br.open(SEARCH_URL　+　FIXED_QUERY　+　actor)　　　　　#　The　query　actually　gives　us　links　to　the　content　pages　we　like,　　　　　#　but　there　are　some　other　links　on　the　page　that　we　ignore　　　　　nice_links　=　[l　for　l　in　br.links()　　　　　　　　　　　　　if　'good_path'　in　l.url　　　　　　　　　　　　　and　'credential'　in　l.url]　　　　　if　not　nice_links:　　　　#　Maybe　the　relevant　results　are　empty　　　　　　　break　　　　　for　link　in　nice_links:　　　　　　　try:　　　　　　　　　response　=　br.follow_link(link)　　　　　　　　　#　More　console　reporting　on　title　of　followed　link　page　　　　　　　　　print　>>　sys.stderr,　br.title()　　　　　　　　　#　Increment　output　filenames,　open　and　write　the　file　　　　　　　　　result_no　+=　1　　　　　　　　　out　=　open(result_%04d'　%　result_no,　'w')　　　　　　　　　print　>>　out,　response.read()　　　　　　　　　out.close()　　　　　　　#　Nothing　ever　goes　perfectly,　ignore　if　we　do　not　get　page　　　　　　　except　mechanize._response.httperror_seek_wrapper:　　　　　　　　　print　>>　sys.stderr,　"Response　error　(probably　404)"　　　　　　　#　Let's　not　hammer　the　site　too　much　between　fetches　　　　　　　time.sleep(1)