Today's demo: fetching a list of proxy IPs from IP海.
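What are proxy IPs for? They come up all the time when writing crawlers, because they help a crawler get around the anti-crawling mechanisms on the target server. Another way to get around those mechanisms is to randomize the User-Agent (UA). Using both together works best.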
Enough talk, let's go straight to the code. It is commented in detail, so reading it should be all you need.
' Sleep function: pauses execution for the given number of milliseconds
Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Function 取得网页源码(Optional ByVal pages As Integer = 1) As String
    On Error GoTo er
    Dim iurl As String: iurl = "https://www.kuaidaili.com/free/inha/" & pages
    ' Fetch the page source
    With CreateObject("WinHttp.WinHttpRequest.5.1") ' request object
        .Open "GET", iurl, False ' synchronous GET request
        .send ' send the request
        ' Return the response body as the page source
        取得网页源码 = .responseText
    End With
    Exit Function
er:
    取得网页源码 = "Request failed: " & Err.Description
End Function
Sub 解析网页源码()
    Dim sht As Worksheet: Set sht = Worksheets("IP地址池")
    Dim p As Long, i As Long, j As Long, k As Long
    sht.Range("A1:AA65536").ClearContents
    ' Fetch 5 pages as a test
    For p = 1 To 5
        ' Parse the HTML
        Dim xmldocstr As String: xmldocstr = 取得网页源码(p)
        Dim HTMLDoc As Object, TDElements As Object
        Set HTMLDoc = CreateObject("htmlfile")
        ' Rough sanity check: a very short response is an error message, not a page
        If Len(xmldocstr) < 100 Then Exit Sub
        HTMLDoc.body.innerhtml = xmldocstr
        ' Locate the HTML table
        Set TDElements = HTMLDoc.getElementById("list")
        ' Stop if the page layout is not what this code expects
        If TDElements Is Nothing Then Exit Sub
        Dim infotb As Object
        Set infotb = TDElements.Children(1)
        ' Read the header row
        Dim heads As Object: Set heads = infotb.Children(0).Children(0)
        For j = 0 To heads.Cells.Length - 1
            ' Write the column headers to the worksheet
            sht.Cells(1, j + 1) = heads.Children(j).innertext
            DoEvents
        Next
        ' Read the table body
        Dim Contents As Object: Set Contents = infotb.Children(1)
        For i = 0 To Contents.Rows.Length - 1
            Dim Content As Object: Set Content = Contents.Children(i)
            ' Find the last used row (Long avoids overflow past row 32767)
            Dim rw As Long: rw = sht.Range("A65536").End(xlUp).Row
            DoEvents
            For k = 0 To Content.Cells.Length - 1
                ' Write the cell contents to the worksheet
                sht.Cells(rw + 1, k + 1) = Content.Children(k).innertext
                DoEvents
            Next
            DoEvents
        Next
        Sleep 800 ' If the second page cannot be fetched, increase this delay (in milliseconds)
        DoEvents
    Next
End Sub
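The two techniques mentioned at the start, randomizing the UA and sending requests through a proxy, are not used in the code above. Here is a rough sketch of how they might be added to the request step. The names `RandomUserAgent` and `FetchWithProxy`, the sample UA strings, and the `proxyAddr` parameter are all made up for illustration and are not part of the original code. `SetProxy` and `SetRequestHeader` are methods of the `WinHttp.WinHttpRequest.5.1` object.

```vba
' Hypothetical helper: returns one User-Agent string picked at random.
' The UA strings below are just examples; use whichever ones you like.
Function RandomUserAgent() As String
    Dim agents As Variant
    agents = Array( _
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", _
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0", _
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15")
    Randomize
    ' Pick an index between LBound and UBound, whatever the array base is
    RandomUserAgent = agents(LBound(agents) + Int(Rnd * (UBound(agents) - LBound(agents) + 1)))
End Function

' Hypothetical variant of the request function: sends a random UA and,
' when proxyAddr ("host:port") is not empty, routes the request through that proxy.
Function FetchWithProxy(ByVal url As String, Optional ByVal proxyAddr As String = "") As String
    On Error GoTo er
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        ' 2 = HTTPREQUEST_PROXYSETTING_PROXY (use the given proxy server)
        If Len(proxyAddr) > 0 Then .SetProxy 2, proxyAddr
        .Open "GET", url, False
        ' Request headers can only be set after Open
        .SetRequestHeader "User-Agent", RandomUserAgent()
        .send
        FetchWithProxy = .responseText
    End With
    Exit Function
er:
    FetchWithProxy = "Request failed: " & Err.Description
End Function
```

To use it, pass one of the scraped worksheet rows as `proxyAddr` in the form IP & ":" & port. Free proxies often stop working quickly, so a failed request should be expected and handled.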
Note: a crawler must never collect private data, and it is best to follow the Robots protocol (robots.txt)!
Source: https://mp.weixin.qq.com/s/ZMborUHj6p4hkNFt3LR10w