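Installing lxml

1. First run `pip install wheel`.
2. Download the matching .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and do not rename it. The number after `cp` is the Python version (27 means 2.7), so pick the file that matches your Python version.
3. Run `pip install` followed by the full file name, including the .whl extension.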
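After installing, one quick way to confirm that the packages used in the examples below can be imported is to ask Python's import machinery. This is only a minimal sketch: `missing_packages` is a hypothetical helper name chosen for illustration, not part of pip or lxml.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names used in this post (the beautifulsoup4 package is imported as "bs4").
print(missing_packages(["lxml", "bs4", "requests"]))
```

An empty list means everything is in place; any name printed still needs to be installed.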
Example 1

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the page and decode the bytes as UTF-8.
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
soup = BeautifulSoup(resp, 'lxml')

# select() takes a CSS selector and returns a list of matching tags.
titles = soup.select('title')
for title in titles:
    print(title.get_text())
```
Example 2

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
soup = BeautifulSoup(resp, 'lxml')

# Collect every <img> tag and print its src attribute.
images = soup.select('img')
for image in images:
    print(image.get('src'))
```
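The `src` values printed in Example 2 may be relative paths, and `image.get('src')` returns `None` for an `<img>` tag that has no `src`. A minimal sketch of cleaning them up with the standard library's `urljoin` follows. `absolute_image_urls` is a hypothetical helper name, and the sample `src` values are made up for illustration rather than taken from the real page.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each src against page_url, skipping missing or empty values."""
    return [urljoin(page_url, src) for src in srcs if src]


page_url = 'http://www.tooopen.com/img/87.aspx'
# Hypothetical src values, standing in for the image.get('src') results.
srcs = ['/static/a.jpg', 'b.png', None, 'http://img.example.com/c.gif']
for link in absolute_image_urls(page_url, srcs):
    print(link)
```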
Example 3

```python
from bs4 import BeautifulSoup
import requests

url = 'http://zhuanzhuan.58.com/detail/789842404801118212z.shtml'


def getInfo(url):
    wpData = requests.get(url)
    soup = BeautifulSoup(wpData.text, 'lxml')
    # title and price are extracted as well, but only the tags are returned.
    title = soup.title.text
    price = soup.select('.price_now > i')
    tags = soup.select('.qual_label')
    return tags


data = getInfo(url)
for tag in data:
    print(tag.text)
```
Passing HTTP headers

```python
from bs4 import BeautifulSoup
import requests

url = 'http://www.tripadvisor.cn/Saves#525792'
# Copy the User-Agent and Cookie values from a logged-in browser session.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWeb....36',
    'Cookie': 'TASSK=eNs%2BEc2TG.....0N9BrZpKwsTH%2BAwg%3D...; NPID='
}

wpData = requests.get(url, headers=headers)
soup = BeautifulSoup(wpData.text, 'lxml')
titles = soup.select('div.title')
# Print the whole document to check that the logged-in page came back.
print(soup)
```
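As an alternative to sending the cookie as one raw header string, `requests.get` also accepts a `cookies=` dictionary. The sketch below splits a copied `Cookie` header into such a dictionary. `cookie_header_to_dict` is a hypothetical helper name, and the sample cookie string is invented for illustration.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw_cookie.split(';'):
        # partition() splits on the first '=' only, so '=' inside a value is kept.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical cookie string; paste the real one copied from your browser.
print(cookie_header_to_dict('session=abc123; lang=zh-CN'))
# → {'session': 'abc123', 'lang': 'zh-CN'}
```

The result could then be passed as `requests.get(url, cookies=cookie_header_to_dict(raw))`.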
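Building multi-page URLs and delaying requests

```python
from bs4 import BeautifulSoup
import requests
import time

# The page offset in the URL goes 30, 60, ..., 900; fill it into the {} slot.
urls = ['http://www.tripadvisor.cn/...-oa{}-New_Y...rk.html#ATTRACTION_LIST'.format(str(i))
        for i in range(30, 930, 30)]

for url in urls:
    # Wait 2 seconds between pages so the site is not hit too quickly.
    time.sleep(2)
    print(url)
```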
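The same pattern can be wrapped into two small functions so that the URL list and the delay are reusable. This is a sketch under assumptions: `build_page_urls` and `crawl` are hypothetical names, the `example.com` template is a placeholder, and `fetch` stands for whatever download function you use (for instance one that calls `requests.get`).

```python
import time


def build_page_urls(template, start, stop, step):
    """Fill each offset from range(start, stop, step) into the template's {} slot."""
    return [template.format(offset) for offset in range(start, stop, step)]


def crawl(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # no need to wait before the first request
        results.append(fetch(url))
    return results


# Hypothetical URL template, used only to show the generated offsets.
urls = build_page_urls('http://example.com/list-oa{}.html', 30, 930, 30)
print(len(urls), urls[0], urls[-1])
```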
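Asynchronously loaded content

In Chrome, open the developer tools and look through the XHR requests for links that follow a regular pattern, then request those links directly.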
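Once such a link is found, its response is frequently JSON rather than HTML, so it can be parsed with the `json` module instead of BeautifulSoup. The sketch below is purely illustrative: the endpoint `https://example.com/api/list`, the `page` and `size` parameters, and the `items`/`title` fields are all assumptions, and the real names must be read from the request shown in the browser.

```python
import json
from urllib.parse import urlencode


def build_xhr_url(base, page, page_size=20):
    """Build a paged XHR URL; the parameter names here are hypothetical."""
    return base + '?' + urlencode({'page': page, 'size': page_size})


def extract_titles(json_text):
    """Pull the title of every entry out of a JSON response body."""
    payload = json.loads(json_text)
    return [item['title'] for item in payload.get('items', [])]


# Made-up response body, standing in for what the XHR link would return.
sample = '{"items": [{"title": "first"}, {"title": "second"}]}'
print(build_xhr_url('https://example.com/api/list', 2))
print(extract_titles(sample))
```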
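Tips

If the data cannot be fetched directly from the desktop (PC) version of a site, try requesting the mobile version instead. It often exposes the same data more directly.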
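One common way to reach the mobile version is to send a mobile browser's `User-Agent` header. Whether a given site switches layout on this header is an assumption to verify per site, since some sites use a separate mobile address instead. In this sketch `mobile_request` is a hypothetical helper name, and the User-Agent string is just an example of the iPhone Safari format.

```python
from urllib.request import Request

# Example iPhone Safari User-Agent; any current mobile browser string would do.
MOBILE_UA = (
    'Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) '
    'AppleWebKit/605.1.15 (KHTML, like Gecko) '
    'Version/16.0 Mobile/15E148 Safari/604.1'
)


def mobile_request(url):
    """Build (but do not send) a request that identifies as a mobile browser."""
    return Request(url, headers={'User-Agent': MOBILE_UA})


req = mobile_request('http://example.com/')
# urllib stores header names capitalized, hence 'User-agent'.
print(req.get_header('User-agent'))
```

Passing the returned object to `urlopen` sends the request. With `requests`, the same string can go into the `headers` dictionary shown in the HTTP headers example.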