What is a dynamic page? How to crawl dynamic web pages

  Crawlers are everywhere on the Internet. Many companies need to collect information, and crawlers can extract what they need from huge volumes of data far faster than manual collection. Website owners, however, do not want to hand this information over for free, and crawler traffic also affects the website itself, for example by adding load. Websites therefore adopt various countermeasures, such as IP restrictions, verification codes (CAPTCHAs), and dynamic web pages, to make crawling harder.

  For IP restrictions and verification codes, proxy IPs and verification code recognition tools can be used to get past the restrictions. Dynamic web pages are a more complicated problem. Today, Tomato Accelerator will explain what a dynamic web page is and how to crawl one.

  1. What is a dynamic webpage

  A dynamic web page is defined in contrast to a static web page; the term refers to the technology used to generate the page.

  With a static web page, once the HTML code has been generated, the content and appearance of the page basically do not change unless you modify the page code. This is not the case for dynamic web pages: although the page code has not changed, the displayed content can change with time, the environment, or the results of database operations.
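
  The difference can be illustrated with a minimal sketch. It assumes a made-up page template and made-up headlines standing in for database results; the names PAGE_TEMPLATE and render_page are hypothetical and not taken from any particular framework. The template (the "page code") stays the same, while the HTML that is sent out depends on the time and the data.

```python
from datetime import datetime
from html import escape
from string import Template

# The page "code" is fixed; only the data substituted into it changes.
PAGE_TEMPLATE = Template(
    "<html><body><h1>Latest news</h1>"
    "<p>Generated at $generated_at</p>"
    "<ul>$items</ul></body></html>"
)


def render_page(headlines, now):
    """Build the HTML for one request from the current time and data."""
    items = "".join(f"<li>{escape(title)}</li>" for title in headlines)
    return PAGE_TEMPLATE.substitute(
        generated_at=now.strftime("%Y-%m-%d %H:%M"),
        items=items,
    )


# Two requests at different times, with different (made-up) database contents.
morning = render_page(["Market opens"], datetime(2024, 1, 1, 9, 0))
evening = render_page(
    ["Market opens", "Market closes"], datetime(2024, 1, 1, 18, 0)
)
```

  A static page would correspond to saving one of these strings to a file and serving that same file to every visitor.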

  It is worth emphasizing that a dynamic web page should not be confused with a page whose content merely looks dynamic. The dynamic web pages discussed here have no direct relation to visual effects such as animations and scrolling text. A dynamic web page can be pure text or can contain various animations; these are only the form in which the page's content is presented. Regardless of whether a page has dynamic visual effects, as long as it is generated using dynamic website technology, it can be called a dynamic web page.

  In short, dynamic web pages combine basic HTML syntax with high-level programming languages such as Java, VB, and VC, together with database programming and other technologies, to manage website content and style efficiently, dynamically, and interactively. In this sense, any web page generated by combining HTML with a high-level programming language and database technology is a dynamic web page.

  2. How to crawl dynamic web pages

  The first solution is to use third-party tools that simulate a browser's behavior to load the data.

  For example: Selenium, PhantomJS.

  Advantages: There is no need to consider how the dynamic page varies (however the dynamic data is loaded, what is finally rendered on the page is fixed). We only need to care about the final rendered result, so all pages can be processed in a uniform way.

  Disadvantages: Performance is low; with Selenium, for example, a browser process has to be started each time. Configuration is also cumbersome: different browsers need different drivers and jar packages, and the driver and jar versions must match strictly, otherwise the setup will not work.
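
  As a rough illustration of this first solution, here is a minimal sketch using Selenium's Python bindings. It assumes Selenium 4 and Chrome are installed; the URL and the CSS selector in the usage comment are hypothetical, and the helper names make_headless_chrome and fetch_rendered_html are made up for this example. The function loads the page in a real browser, waits until the dynamically loaded element shows up, and returns the rendered HTML.

```python
import time


def make_headless_chrome():
    """Start a headless Chrome session (assumes Selenium 4 and Chrome)."""
    # Imported lazily because Selenium is a third-party dependency.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url, wait_css, timeout=10.0, poll=0.5):
    """Load url, wait until wait_css matches an element, return the HTML."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        if driver.find_elements("css selector", wait_css):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{wait_css!r} did not appear within {timeout}s"
            )
        time.sleep(poll)


# Usage (hypothetical URL and selector):
#   driver = make_headless_chrome()
#   try:
#       html = fetch_rendered_html(
#           driver, "https://example.com/news", "ul.news-list li"
#       )
#   finally:
#       driver.quit()
```

  The waiting loop is written by hand here so that the function depends only on the driver's get, find_elements, and page_source; Selenium's own WebDriverWait class does the same job. Calling driver.quit() in a finally block matters because, as noted above, every session starts a full browser process.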

  The second solution is to analyze the page, find the request interface that supplies the data, and fetch the data directly.

  Advantages: High performance and ease of use. We call the original data interface directly (in other words, the API that serves the page's dynamic data), which is convenient to use and relatively unlikely to change.

  Disadvantages: These are also obvious. How do you find the API? Some websites, out of concern for data security, impose various restrictions and obfuscation. Getting past them depends on the developer's fundamentals and takes careful analysis.
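
  A minimal sketch of this second solution, using only the Python standard library. Everything specific in it is an assumption for illustration: the endpoint URL, the page and size query parameters, and the JSON shape {"items": [{"title": ...}]} are hypothetical. On a real site they would be found by watching the requests the page makes in the browser's developer tools.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; find the real one in the browser's network panel.
API_URL = "https://example.com/api/articles"


def build_request(page, page_size=20):
    """Build the same kind of request the page's own script would send."""
    query = urlencode({"page": page, "size": page_size})
    return Request(
        f"{API_URL}?{query}",
        headers={
            "Accept": "application/json",
            "User-Agent": "Mozilla/5.0",
        },
    )


def parse_titles(body):
    """Extract titles from a JSON body shaped like {"items": [{"title": ...}]}."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("items", [])]


def fetch_titles(page):
    """Call the data interface directly and return the article titles."""
    with urlopen(build_request(page), timeout=10) as response:
        return parse_titles(response.read())
```

  No browser is started and no HTML is parsed, which is where the performance advantage comes from. If the site adds signatures, tokens, or other obfuscation to its requests, build_request is the part that would need the extra analysis mentioned above.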

  How can you tell whether a website is static or dynamic? A few simple checks help. If the page shows a “see more” link, or loads additional content when you scroll down, it is dynamic. Likewise, if content is visible in the browser but cannot be found when you view the page source, the page uses dynamic technology. If the web page uses dynamic technology, you can use either of the methods described above.
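
  The "visible in the browser but missing from the source" check can be sketched with the standard library alone. This is a rough heuristic, not a complete detector: the function names are made up for this example, and it assumes you have already downloaded the raw HTML and copied a piece of text that you can see in the browser.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text from HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(raw_html):
    """Return the text of the raw source with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def looks_dynamic(raw_html, text_seen_in_browser):
    """True if text seen in the browser is absent from the raw source."""
    needle = re.sub(r"\s+", " ", text_seen_in_browser).strip()
    return needle not in visible_text(raw_html)
```

  If looks_dynamic returns True for text you can clearly read in the browser, the content was most likely filled in after the page loaded, and one of the two solutions above applies. Script bodies are skipped on purpose, so text that only appears inside embedded JavaScript counts as missing.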