Python Web Scraping, Week 1

My body knows neither hunger nor cold, so heaven has never failed me; if my learning makes no progress, how can I face heaven?

Installing lxml

1. First install wheel: pip install wheel
2. Download the matching .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml (do not rename the file!).
   The "cp" tag in the filename is the Python version: cp27 means Python 2.7. Pick the download that matches your Python version.
3. Install it with pip install followed by the complete filename, including the .whl suffix.
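
To confirm the wheel installed correctly, here is a minimal check from the Python interpreter (the version tuple printed depends on the wheel you downloaded):

# Quick sanity check that lxml is importable after installing the wheel
from lxml import etree
print(etree.LXML_VERSION)  # a version tuple such as (3, x, y, z), depending on the wheel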

Example 1

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and decode the response bytes to text
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
soup = BeautifulSoup(resp, 'lxml')

# Select the <title> tag and print its text
titles = soup.select('title')
for title in titles:
    print(title.get_text())

Example 2

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and decode the response bytes to text
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
soup = BeautifulSoup(resp, 'lxml')

# Select every <img> tag and print its src attribute
images = soup.select('img')
for image in images:
    print(image.get('src'))

Example 3

from bs4 import BeautifulSoup
import requests

url = 'http://zhuanzhuan.58.com/detail/789842404801118212z.shtml'

def getInfo(url):
    # Download the listing page and parse it with lxml
    wpData = requests.get(url)
    soup = BeautifulSoup(wpData.text, 'lxml')
    # Pull out the title, current price, and quality tags via CSS selectors
    title = soup.title.text
    price = soup.select('.price_now > i')
    tags = soup.select('.qual_label')
    # Only the quality tags are returned; title and price are extracted for illustration
    return tags

data = getInfo(url)
for tag in data:
    print(tag.text)

Passing HTTP headers

from bs4 import BeautifulSoup
import requests

url = 'http://www.tripadvisor.cn/Saves#525792'

# Send a browser User-Agent plus a logged-in Cookie so the server returns the
# same page a real browser session would see (values truncated here)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWeb....36',
    'Cookie': 'TASSK=eNs%2BEc2TG.....0N9BrZpKwsTH%2BAwg%3D...; NPID='
}

wpData = requests.get(url, headers=headers)
soup = BeautifulSoup(wpData.text, 'lxml')
titles = soup.select('div.title')  # titles of the saved items
print(soup)  # print the full page to verify the cookie returned the logged-in content

Building multi-page URLs with a request delay

from bs4 import BeautifulSoup
import requests
import time

# Build the paginated attraction-list URLs (the offset steps by 30: 30, 60, ..., 900)
urls = ['http://www.tripadvisor.cn/...-oa{}-New_Y...rk.html#ATTRACTION_LIST'.format(str(i)) for i in range(30, 930, 30)]

for url in urls:
    time.sleep(2)  # pause 2 seconds between requests so we don't hammer the server
    print(url)

Asynchronous loading

In Chrome's developer tools (Network panel, XHR filter), look for request URLs that follow a regular pattern; a sketch of fetching one such endpoint is shown below.
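
Once a patterned XHR URL is found, it can usually be requested directly and returns JSON. The sketch below is only an illustration: the endpoint and its page parameter are hypothetical placeholders, not taken from these notes.

import requests
import time

# Hypothetical XHR endpoint spotted in the Chrome Network panel (XHR filter);
# the URL and its page parameter are placeholders for illustration only.
xhr_url = 'http://example.com/api/list?page={}'

for page in range(1, 4):
    resp = requests.get(xhr_url.format(page))
    data = resp.json()  # XHR endpoints typically return JSON
    print(data)         # inspect the structure, then pick out the fields you need
    time.sleep(2)       # same politeness delay as above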

Tips

  • If the desktop (PC) site does not expose the data directly, try the mobile site instead; it often returns the same data in a simpler form (see the sketch below).
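
Many sites serve a separate mobile page when they see a mobile User-Agent (or on an m. subdomain). The sketch below sends a mobile User-Agent as one way to reach the mobile version; the URL is a placeholder and the User-Agent string is just one common example.

import requests
from bs4 import BeautifulSoup

# Pretend to be a mobile browser; this UA string is one common example, not the only option
mobile_headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) '
                  'AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1'
}

resp = requests.get('http://example.com/some-page', headers=mobile_headers)  # placeholder URL
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title)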