
When building intelligent crawlers in Python, cross-origin requests (CORS) and dynamically loaded content are two common and tricky problems. This article walks through hands-on cases that show how to solve both with current techniques.
Environment Setup and Core Dependencies
First, make sure your Python environment meets the following requirements:
| Dependency | Required Version |
| --- | --- |
| Python | 3.9 or later |
| Requests | 2.31.0 |
| BeautifulSoup | 4.12.3 |
| Selenium | 4.16.0 |
| ChromeDriver | 120.0.6167.140 |
Install the dependencies (webdriver-manager is included because the Selenium example below uses it):
pip install requests beautifulsoup4 selenium webdriver-manager
Download the matching version of ChromeDriver and make sure its executable is on your system PATH.
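To confirm that the installed packages match the versions in the table above, a quick check such as the following can be run (assuming the packages are already installed):
import sys

import bs4
import requests
import selenium

# Print interpreter and library versions to compare against the table above
print(sys.version)
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("selenium", selenium.__version__)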
Case Study: Scraping Product Data from an API Protected by a CORS Policy
Suppose we need to scrape product data from an e-commerce platform's API. The API enforces a strict CORS policy, and a direct request returns a 403 Forbidden error. The solutions below address this.
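For illustration, a direct request to such an endpoint (the URL here is a hypothetical placeholder) typically fails and exposes little in its CORS-related response headers:
import requests

# Hypothetical endpoint, used only to illustrate the blocked request
resp = requests.get("https://api.example.com/products", timeout=10)
print(resp.status_code)  # e.g. 403
# Restrictive APIs often omit or lock down this CORS response header
print(resp.headers.get("Access-Control-Allow-Origin"))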
Solution 1: Bypass the CORS Restriction with a Proxy Server
Forwarding requests through a reverse proxy hides the real client IP and sidesteps the CORS restriction. Here is an example reverse-proxy configuration based on Nginx:
server {
    listen 8080;

    location / {
        proxy_pass http://backend_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Python code example:
import requests

def fetch_data_with_proxy(url, proxy_url):
    # Route both HTTP and HTTPS traffic through the proxy
    proxies = {
        'http': proxy_url,
        'https': proxy_url
    }
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")
        return None

# api_url points at the local reverse proxy configured above
api_url = "http://localhost:8080/api/products"
proxy_url = "http://127.0.0.1:8080"
product_data = fetch_data_with_proxy(api_url, proxy_url)
if product_data:
    print(product_data)
Solution 2: Modify Request Headers to Mimic Browser Behavior
Some sites decide whether to serve a request by inspecting specific request headers (such as User-Agent), relaxing their CORS-related restrictions for browser-like clients. The following example mimics a Chrome browser request:
import requests

def fetch_data_with_headers(url):
    # Header set modeled on a real Chrome 120 request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6167.140 Safari/537.36",
        "Accept": "application/json",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")
        return None

api_url = "http://backend_api/api/products"
product_data = fetch_data_with_headers(api_url)
if product_data:
    print(product_data)
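When several requests must share the same browser-like identity (and any cookies the server sets), it can help to attach these headers to a requests.Session instead of a single call. A minimal sketch, reusing the hypothetical API URL above:
import requests

session = requests.Session()
# Headers set on the session are sent with every subsequent request
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.6167.140 Safari/537.36",
    "Accept": "application/json",
})

# Cookies returned by earlier responses are replayed automatically on later calls
resp = session.get("http://backend_api/api/products", timeout=10)
resp.raise_for_status()
print(resp.json())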
Case Study: Scraping Dynamically Loaded Page Content
Many modern websites load content dynamically with JavaScript, so parsing the static HTML alone does not yield the full data. The following example uses Selenium to scrape dynamic content.
Configuring Selenium and ChromeDriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

def setup_driver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # headless mode
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    return driver

driver = setup_driver()
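As an aside, with Selenium 4.16 (the version in the dependency table) the built-in Selenium Manager can usually resolve a matching ChromeDriver on its own, so webdriver_manager is optional. A minimal variant, assuming a local Chrome installation (the function name setup_driver_selenium_manager is just for illustration):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def setup_driver_selenium_manager():
    # No explicit Service or driver path: Selenium Manager resolves a matching driver
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    return webdriver.Chrome(options=chrome_options)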
Scraping the Dynamically Loaded Data
Suppose we need to scrape paginated stock quotes, where each page of data is loaded asynchronously by JavaScript:
import time

def fetch_dynamic_data(driver, url):
    driver.get(url)
    # Wait for the page to finish its dynamic loading
    time.sleep(5)
    # Parse the page content
    try:
        elements = driver.find_elements(By.XPATH, "//table[@class='stock-table']//tr")
        data = []
        for element in elements[1:]:  # skip the header row
            cells = element.find_elements(By.TAG_NAME, "td")
            if cells:
                row_data = {
                    "name": cells[0].text,
                    "price": cells[1].text,
                    "change": cells[2].text
                }
                data.append(row_data)
        return data
    except Exception as e:
        print(f"Error parsing page: {e}")
        return []

# Suppose this is the first page's URL
page_url = "https://example.com/stocks?page=1"
stock_data = fetch_dynamic_data(driver, page_url)
if stock_data:
    print(stock_data)
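The fixed time.sleep(5) above is simple, but it wastes time on fast pages and can be too short on slow ones. A more robust alternative is sketched below with Selenium's explicit waits (wait_for_table is a hypothetical helper; the XPath matches the stock table assumed above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_table(driver, url, timeout=15):
    driver.get(url)
    # Block until at least one table row is present, or raise TimeoutException
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(
            (By.XPATH, "//table[@class='stock-table']//tr")
        )
    )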
Handling Paginated Loading
The following example shows how to page through the site automatically and scrape all paginated data:
def fetch_all_pages(driver, base_url):
    all_data = []
    page = 1
    while True:
        current_url = f"{base_url}?page={page}"
        print(f"Fetching page {page}")
        page_data = fetch_dynamic_data(driver, current_url)
        if not page_data:
            break
        all_data.extend(page_data)
        page += 1
        time.sleep(2)  # throttle requests to avoid being blocked
    return all_data

all_stock_data = fetch_all_pages(driver, "https://example.com/stocks")
print(f"Total stocks fetched: {len(all_stock_data)}")
Summary
Through these case studies we have shown how to get around CORS restrictions with a proxy server and crafted request headers, and how to handle dynamically loaded content with Selenium. These techniques are an important foundation for building efficient, intelligent crawlers.