A screenshot of the program running is shown below:
Now let's analyze the page: every Baidu Baike entry lives under the URL https://baike.baidu.com/item/xxxxx, so the entry path can be extracted directly from any link that matches this pattern.
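
For instance, the seed entry used in the crawler below, 统计学 (statistics), appears in the URL path in percent-encoded form. As a minimal sketch (the entry name here is only an illustration), the standard library can convert between the readable name and its URL form:

```python
from urllib.parse import quote, unquote

# The entry "统计学" (statistics) as it appears in the Baike URL path.
encoded = quote('统计学')            # '%E7%BB%9F%E8%AE%A1%E5%AD%A6'
url = 'https://baike.baidu.com/item/' + encoded

# And back again, e.g. when logging which entry a queued path refers to.
print(unquote('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175'))  # 统计学/1175
```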
Here we use a queue: all of the related entry URLs found on the current page are enqueued, and URLs are then dequeued one at a time and visited:
One more thing to note: the request has to be disguised as coming from a browser, otherwise the server will not return the page data.
```python
import requests
import queue
import time
from bs4 import BeautifulSoup

# Pretend to be a real browser, otherwise Baidu Baike will not return the page data.
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Cookie': 'xxxxxxx',
    'Host': 'baike.baidu.com',
    'Pragma': 'no-cache',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"'
}

baseUrl = "https://baike.baidu.com/item/"
urlQueue = queue.Queue(10000)


def getRequest(url):
    # Fetch a page and return its HTML text.
    response = requests.get(url, headers=header)
    return response.text


if __name__ == '__main__':
    # Seed the queue with the entry "统计学/1175" (percent-encoded).
    urlQueue.put('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175')
    for i in range(100):
        url = urlQueue.get()
        content = getRequest(baseUrl + url)
        contentSoup = BeautifulSoup(content, "html.parser")
        # Collect every link on the page and keep only other Baike entries.
        urlAllList = contentSoup.select("a")
        for urlTmp in urlAllList:
            if 'href' in urlTmp.attrs:
                urlString = urlTmp['href']
                if '/item/' in urlString:
                    testUrl = urlString.split('/item/')[1]
                    if not urlQueue.full():
                        urlQueue.put(testUrl)
        time.sleep(1)  # be polite: pause between requests
    print('over')
```
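
One limitation of the loop above is that the same entry can be enqueued over and over, since nothing records what has already been seen. A hedged sketch of adding a simple visited set (the helper name enqueueOnce is my own, not part of the original program):

```python
import queue

urlQueue = queue.Queue(10000)
visited = set()   # entry paths we have already enqueued


def enqueueOnce(entryPath):
    # Only enqueue an entry the first time we see it.
    if entryPath not in visited and not urlQueue.full():
        visited.add(entryPath)
        urlQueue.put(entryPath)


enqueueOnce('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175')
enqueueOnce('%E7%BB%9F%E8%AE%A1%E5%AD%A6/1175')  # ignored: already seen
print(urlQueue.qsize())  # 1
```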