A screenshot of the program running is shown below:
Now let's analyze the page: every Baidu Baike entry lives under the URL https://baike.baidu.com/item/xxxxx, so the entry path can be extracted directly from any link that matches this pattern.
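
For instance, the seed entry used in the crawler below, 统计学 (statistics), appears in the URL path in percent-encoded form. As a minimal sketch (the entry name here is only an illustration), the standard library can convert between the readable name and its URL form:

```python
from urllib.parse import quote, unquote

# The entry "统计学" (statistics) as it appears in the Baike URL path.
encoded = quote('统计学')            # '%E7%BB%9F%E8%AE%A1%E5%AD%A6'
url = 'https://baike.baidu.com/item/' + encoded

# And back again, e.g. when logging which entry a queued path refers to.
print(unquote('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175'))  # 统计学/1175
```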
Here we use a queue: all of the related entry URLs found on the current page are enqueued, and URLs are then dequeued one at a time and visited:
One more thing to note: the request has to be disguised as coming from a browser, otherwise the server will not return the page data.
```python
import requests
import queue
import time
from bs4 import BeautifulSoup

# Pretend to be a real browser, otherwise Baidu Baike will not return the page data.
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Cookie': 'xxxxxxx',
    'Host': 'baike.baidu.com',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"'
}

baseUrl = "https://baike.baidu.com/item/"
urlQueue = queue.Queue(10000)


def getRequest(url):
    # Fetch a page and return its HTML text.
    response = requests.get(url, headers=header)
    return response.text


if __name__ == '__main__':
    # Seed the queue with the entry "统计学/1175" (percent-encoded).
    urlQueue.put('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175')
    for i in range(100):
        url = urlQueue.get()
        content = getRequest(baseUrl + url)
        contentSoup = BeautifulSoup(content, "html.parser")
        # Collect every link on the page and keep only other Baike entries.
        urlAllList = contentSoup.select("a")
        for urlTmp in urlAllList:
            if 'href' in urlTmp.attrs:
                urlString = urlTmp['href']
                if '/item/' in urlString:
                    testUrl = urlString.split('/item/')[1]
                    if not urlQueue.full():
                        urlQueue.put(testUrl)
        time.sleep(1)  # be polite: pause between requests
    print('over')
```
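
One limitation of the loop above is that the same entry can be enqueued over and over, since nothing records what has already been seen. A hedged sketch of adding a simple visited set (the helper name enqueueOnce is my own, not part of the original program):

```python
import queue

urlQueue = queue.Queue(10000)
visited = set()   # entry paths we have already enqueued


def enqueueOnce(entryPath):
    # Only enqueue an entry the first time we see it.
    if entryPath not in visited and not urlQueue.full():
        visited.add(entryPath)
        urlQueue.put(entryPath)


enqueueOnce('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175')
enqueueOnce('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175')  # ignored: already seen
print(urlQueue.qsize())  # 1
```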