Shenma AI LocoySpider (火车头采集器) Pseudo-Original Rewriting [PHP Source] (Part 4)

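5. Hands-On Crawler Practice
For real scraping work you have to analyze the page structure yourself. The code in this post is only a reference, and the class names it relies on will stop matching whenever the site changes its markup.
5.1 Baidu Hot Search Board
Fetch each entry's title, heat index, summary, news link, and image; save the text to a txt file and print it.
Note: the script does not create the output folder automatically. Create it beforehand or adjust the code.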
# Import libraries
import requests
from bs4 import BeautifulSoup

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.42'
}
# Request URL
url = 'https://top.baidu.com/board?tab=realtime'


def Hot_search():
    # Send the request
    html = requests.get(url=url, headers=headers).text
    # Parse the response with BeautifulSoup
    res = BeautifulSoup(html, "html.parser")
    # Select the divs by class name (each one holds a single news entry)
    news_div = res.find_all("div", class_="category-wrap_iQLoo horizontal_1eKyQ")
    # Iterate over the divs and extract each news entry
    for new in news_div:
        try:
            # Title: text of the first matching div
            title = new.find_all("div", class_="c-single-text-ellipsis")[0].text
            # Heat index: text of the first matching div
            hot = new.find_all("div", class_="hot-index_1Bl1a")[0].text
            # Summary: text of the first matching div
            content = new.find_all("div", class_="hot-desc_1m_jR small_Uvkd3 ellipsis_DupbZ")[0].text
            # News link: href attribute of the first <a> tag
            newUrl = new.find('a').attrs.get('href')
            # Image URL: src attribute of the first <img> tag
            imgUrl = new.find_all("img")[0].attrs.get('src')
            text = ('\n' + 'Title: ' + str(title) + '\n' + 'Heat: ' + str(hot) + '\n'
                    + 'Summary: ' + str(content) + '\n' + 'News link: ' + str(newUrl) + '\n\n')
            print(text)
            # Request the image
            img = requests.get(url=imgUrl, headers=headers).content
            # Write the image
            with open('./百度热搜/' + str(title.strip()) + '.png', 'wb') as f:
                f.write(img)
            # Append the text (UTF-8 so Chinese titles are written correctly on any platform)
            with open('./百度热搜/热搜文本.txt', 'a', encoding='utf-8') as f:
                f.write(text)
        except Exception:
            # Skip entries that are missing a field or fail to download
            continue
    print('Written successfully!')


Hot_search()
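The note for this section says the output folder is not created automatically. A minimal sketch of one way to handle that, and to make a headline safe to use as a file name, follows. The helper name `prepare_output_path`, the `untitled` fallback, and the set of replaced characters are illustrative assumptions, not part of the original script.

```python
import os
import re


def prepare_output_path(folder, title, ext='.png'):
    """Create `folder` if needed and return a file path for `title` inside it."""
    # exist_ok=True makes this safe to call once per news entry
    os.makedirs(folder, exist_ok=True)
    # Replace characters that are not allowed in Windows file names
    # (assumed set: \ / : * ? " < > |), then trim surrounding whitespace
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    # Hypothetical fallback name for titles that end up empty
    if not safe_title:
        safe_title = 'untitled'
    return os.path.join(folder, safe_title + ext)
```

With a helper like this, the image could be written with `open(prepare_output_path('./百度热搜', title), 'wb')` instead of concatenating the path by hand.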
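5.2 Scraping Wallpaper Images
Note: the script does not create the output folder automatically. Create it beforehand or adjust the code.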
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40"
}


def indexUrl():
    # Listing pages: index.htm, then index_2.htm and index_3.htm
    indexUrls = ['http://www.netbian.com/meinv/index.htm']
    for i in range(2, 4):
        indexUrls.append('http://www.netbian.com/meinv/index_' + str(i) + '.htm')
    # Collected image URLs
    imgUrls = []
    # Counter used as the image file name
    imgName = 1
    for pageUrl in indexUrls:
        # The site serves GBK-encoded pages
        html = requests.get(url=pageUrl, headers=headers).content.decode('gbk')
        soup = BeautifulSoup(html, 'html.parser')
        Alable = soup.find('div', class_='list').find_all('a')
        # print(Alable)
        for a in Alable:
            # Relative path of the image detail page
            iUrlP = a.attrs.get('href')
            # Skip missing hrefs and absolute links (not wallpaper pages)
            if not iUrlP or 'http' in iUrlP:
                continue
            # Join the relative path with the site root to get the full page URL
            imgUrlP = 'http://www.netbian.com' + iUrlP
            # Request the image detail page
            imgPage = requests.get(url=imgUrlP, headers=headers).content.decode('gbk')
            # Parse the detail page with BeautifulSoup
            pageSoup = BeautifulSoup(imgPage, 'html.parser')
            # Get the image URL
            imgUrl = pageSoup.find_all('div', class_='pic')[0].find('img').attrs.get('src')
            # Keep the image URL in a list (optional)
            imgUrls.append(imgUrl)
            # Request the image itself
            imgget = requests.get(url=imgUrl, headers=headers).content
            # Save the image
            with open('./壁纸/' + str(imgName) + '.jpg', 'wb') as f:
                f.write(imgget)
            imgName += 1
    print('Download complete')


indexUrl()
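The listing pages follow a simple pattern (`index.htm`, then `index_2.htm`, `index_3.htm`), and the detail-page links are relative paths. The sketch below shows both steps using the standard library's `urllib.parse.urljoin` instead of string concatenation. The function names and the sample href `/desk/12345.htm` are hypothetical. Note one difference from the script: it skips any href containing `http`, while this version keeps absolute links that stay on the same site.

```python
from urllib.parse import urljoin

BASE = 'http://www.netbian.com/meinv/'
SITE_ROOT = 'http://www.netbian.com/'


def build_index_urls(last_page):
    """Return listing-page URLs for pages 1 through last_page."""
    urls = [urljoin(BASE, 'index.htm')]
    for page in range(2, last_page + 1):
        urls.append(urljoin(BASE, 'index_' + str(page) + '.htm'))
    return urls


def absolute_url(page_url, href):
    """Resolve an href found on page_url; return None for missing or off-site links."""
    if not href:
        return None
    full = urljoin(page_url, href)
    # Keep only links that stay on the same site
    if not full.startswith(SITE_ROOT):
        return None
    return full


if __name__ == '__main__':
    for u in build_index_urls(3):
        print(u)
    # '/desk/12345.htm' is a made-up href used only for illustration
    print(absolute_url(BASE + 'index.htm', '/desk/12345.htm'))
```

`build_index_urls(3)` yields the same three pages the script visits, so the page count becomes a parameter rather than a hard-coded `range(2, 4)`.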

