How does a crawler obtain available IPs? How does a crawler call a proxy IP?

  A problem often encountered when using a web crawler to scrape content from online platforms is that your crawling behavior is recognized by the target site, which then blocks your IP and stops returning normal data to you. In that case you must use a proxy IP to disguise your requests. But where do proxy IPs come from, and how does the crawler get ones that actually work? The most common approach is to scrape candidate IPs from the web, verify them, save the ones that work, and then call on them as needed.


  How does the crawler obtain available IPs

  1. Request web page data

  Requesting web page data means using the requests library to make a network request: fill in the URL and header information, then send it with the GET method. Once the response is received, response.text holds the text of the response, that is, the HTML content of the page.
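A minimal sketch of this step, assuming a hypothetical proxy-list page URL (substitute the real one you are scraping) and a browser-like User-Agent header:

```python
import requests

# Hypothetical proxy-list page -- replace with the actual site you scrape.
PROXY_LIST_URL = "https://example.com/proxy-list"

# A browser-like User-Agent makes the request look less like a bot.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch_page(url: str) -> str:
    """Send a GET request and return the page's text content."""
    response = requests.get(url, headers=HEADERS, timeout=5)
    response.raise_for_status()  # fail loudly on 4xx/5xx status codes
    return response.text
```

The timeout and raise_for_status() call are defensive additions so a dead or misbehaving site cannot hang or silently corrupt the crawl.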

  2. Analyze web page data

  Once the correct page data has been returned, you can start to parse it. Here the BeautifulSoup library is used to parse the page content. In Chrome, press F12 to inspect the page's source code.

  Each proxy sits in a tr row, with the IP in the first td column and the port in the second. So we can use BeautifulSoup to find all tr rows, then read the first and second td columns of each row, which yields all of the IP and port information on the page.
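The row-and-column extraction described above can be sketched as follows; the sample HTML stands in for the real page's table, whose exact markup will vary:

```python
from bs4 import BeautifulSoup

# Stand-in for the real proxy-list page: each <tr> holds one proxy,
# IP in the first <td> and port in the second.
sample_html = """
<table>
  <tr><td>1.2.3.4</td><td>8080</td></tr>
  <tr><td>5.6.7.8</td><td>3128</td></tr>
</table>
"""


def parse_proxies(html):
    """Return a list of (ip, port) tuples parsed from the page."""
    soup = BeautifulSoup(html, "html.parser")
    proxies = []
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skip header or malformed rows
            proxies.append((tds[0].get_text(strip=True),
                            tds[1].get_text(strip=True)))
    return proxies
```

Real proxy-list pages often have header rows or extra columns, which is why the sketch checks that a row has at least two td cells before reading them.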

  3. Verify that the IP and port are available

  After all the IP and port information on the page has been parsed, each entry must be verified and filtered to obtain a list of valid IPs. The verification method is to use the candidate as a proxy address and make a network request through it: if the request succeeds, the proxy is valid. A timeout must be added here to keep the wait from dragging on; set it to 5 seconds.
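A sketch of the verification step, assuming a hypothetical test URL (any reliably reachable page will do) and the 5-second timeout mentioned above:

```python
import requests


def is_proxy_alive(ip, port, test_url="http://example.com"):
    """Try a request through ip:port; True only if it completes in time."""
    proxy = f"http://{ip}:{port}"
    proxies = {"http": proxy, "https": proxy}
    try:
        # timeout=5 keeps a dead proxy from blocking the whole scan
        requests.get(test_url, proxies=proxies, timeout=5)
        return True
    except requests.RequestException:
        return False
```

Filtering a parsed list then reduces to a comprehension such as `valid = [(ip, port) for ip, port in candidates if is_proxy_alive(ip, port)]`.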

  4. Initiate requests and store the available IP and port information
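One simple way to persist the verified pool, sketched here as a plain-text file with one "ip:port" per line (the filename is an assumption; any storage format works):

```python
def save_proxies(proxies, path="valid_proxies.txt"):
    """Write verified (ip, port) pairs to a text file, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for ip, port in proxies:
            f.write(f"{ip}:{port}\n")


def load_proxies(path="valid_proxies.txt"):
    """Read the saved pool back as a list of 'ip:port' strings."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Storing the pool on disk lets the crawler reuse verified proxies across runs instead of re-scraping and re-verifying every time.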

  How does the crawler call the proxy IP

  1. In Python 3, import request from urllib and call ProxyHandler, which accepts the proxy IP as its parameter.

  2. Pass the IP address in as a dictionary. Set the key to http (or https for some proxies), followed by the IP address and port number. Which key and port you use depends on the type of proxy you have; different proxies may use different ports.

  3. Use build_opener() to build an opener object.

  4. Call the open method of the constructed opener object to make a request. In fact, urlopen works the same way internally, calling a default opener's open(), so this amounts to replacing that default with our own. If we call install_opener(), the customized opener becomes the global one.

  5. Once the opener is set globally, any request sent with urlopen goes out through the proxy IP rather than the machine's own IP address.
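The five steps above can be sketched together as follows; the proxy address is a hypothetical placeholder to be replaced with a verified ip:port from your pool:

```python
from urllib import request

# Step 1-2: ProxyHandler takes a dict mapping scheme -> proxy address.
# "1.2.3.4:8080" is a placeholder -- substitute a verified proxy.
proxy_handler = request.ProxyHandler({
    "http": "http://1.2.3.4:8080",
    "https": "http://1.2.3.4:8080",
})

# Step 3: build an opener that routes requests through the proxy.
opener = request.build_opener(proxy_handler)

# Step 4, option A: use the opener directly (network call, shown commented):
# response = opener.open("http://example.com", timeout=5)

# Step 4-5, option B: install it globally, after which plain urlopen()
# also sends its requests through the proxy IP:
request.install_opener(opener)
# response = request.urlopen("http://example.com", timeout=5)
```

Installing the opener globally is convenient for small scripts; larger crawlers often keep the opener object local so different requests can use different proxies.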

  The above has introduced how a crawler obtains available IPs and how it calls a proxy IP. After understanding these two points, it is recommended to practice them hands-on and master the technique of getting around IP restrictions.