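Introduction: The Evolution of Web Scraping
In fast-moving, data-driven industries, extracting valuable insights from online sources is essential. From market analysis to academic research, the demand for specific data has fueled the need for robust web scraping tools. Traditionally, Python libraries such as BeautifulSoup and Scrapy have been the go-to solutions, requiring users to apply programming expertise to navigate complex web structures.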
# BeautifulSoup example
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)  # the page's <title> tag
# Scrapy example
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        print(title)
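Introducing ScrapeGraphAI: Simplifying Data Extraction
ScrapeGraphAI is a Python library that takes a different approach to web scraping. Developed by Marco Vinciguerra, Marco Perini, and Lorenzo Padoan, it uses large language models (LLMs) and direct graph logic to streamline data collection. Unlike its predecessors, ScrapeGraphAI lets users state the data they need in plain language, abstracting away the mechanics of scraping.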
%%capture
!apt install chromium-chromedriver
!pip install nest_asyncio
!pip install scrapegraphai
!playwright install
# If you plan to use the text-to-speech and GPT-4 Vision models, make sure
# the API key you supply has access to them.
OPENAI_API_KEY = "YOUR API KEY"
GOOGLE_API_KEY = "YOUR API KEY"
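Hard-coded keys are easy to leak when a notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable; the helper `load_api_key` is hypothetical and is not part of ScrapeGraphAI:

```python
import os


def load_api_key(name, env=None):
    """Return the API key stored under `name`, or raise if it is missing.

    Illustrative helper only; it is not a ScrapeGraphAI function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the placeholder string from the setup cell as "not set".
    if not key or key == "YOUR API KEY":
        raise RuntimeError(f"Set the {name} environment variable before running the graph.")
    return key
```

With this in place, the assignment could become `OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")`, which fails early with a clear message instead of sending a placeholder key to the provider.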
import json

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = smart_scraper_graph.run()

# Pretty-print the result as indented JSON
print(json.dumps(result, indent=2))
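SpeechGraph
SpeechGraph is a class that represents one of the default scraping pipelines. It generates the answer together with an audio file. It is similar to SmartScraperGraph, with an added TextToSpeechNode.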
import json

from IPython.display import Audio, display
from scrapegraphai.graphs import SpeechGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy",
    },
    "output_path": "website_summary.mp3",
}

# Create the SpeechGraph instance
speech_graph = SpeechGraph(
    prompt="Create a summary of the website",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
answer = result.get("answer", "No answer found")

# Pretty-print the text answer as indented JSON
print(json.dumps(answer, indent=2))

# Play the generated audio file inline in the notebook
wn = Audio("website_summary.mp3", autoplay=True)
display(wn)
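GraphBuilder (Experimental)
GraphBuilder creates a scraping pipeline from scratch based on a user prompt. It returns a graph containing nodes and edges.
GraphBuilder is an experimental class that helps you create custom graphs from a prompt. It produces a JSON description of the graph's essential elements and lets you visualize it with graphviz. It knows the node types the library provides by default and connects them to help you reach your goal.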
from scrapegraphai.builders import GraphBuilder

# Define the configuration for the graph
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
}

# Example usage of GraphBuilder
graph_builder = GraphBuilder(
    user_prompt="Extract the news and generate a text summary with a voiceover.",
    config=graph_config,
)

graph_json = graph_builder.build_graph()

# Convert the resulting JSON to Graphviz format
graphviz_graph = graph_builder.convert_json_to_graphviz(graph_json)

# Save the graph to a file and open it in the default viewer
graphviz_graph.render('ScrapeGraphAI_generated_graph', view=True)

# In a notebook, evaluate each of these in its own cell to display it
graph_json
graphviz_graph
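To make the "graph of nodes and edges" idea concrete, here is a minimal sketch that turns a small nodes-and-edges dictionary into Graphviz DOT text using only the standard language features. The JSON shape used here (`name`, `type`, `from`, `to`) and the function `graph_json_to_dot` are assumptions made for illustration; the structure GraphBuilder actually returns may differ, and the node names other than TextToSpeechNode are placeholders.

```python
def graph_json_to_dot(graph):
    """Render a {"nodes": [...], "edges": [...]} dict as Graphviz DOT text."""
    lines = ["digraph pipeline {"]
    # One DOT node per entry, labelled with its (assumed) node type.
    for node in graph["nodes"]:
        lines.append(f'    "{node["name"]}" [label="{node["type"]}"];')
    # One DOT edge per (source, target) pair.
    for edge in graph["edges"]:
        for target in edge["to"]:
            lines.append(f'    "{edge["from"]}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical graph: fetch a page, parse it, then generate a voiceover.
example = {
    "nodes": [
        {"name": "fetch", "type": "FetchNode"},
        {"name": "parse", "type": "ParseNode"},
        {"name": "speak", "type": "TextToSpeechNode"},
    ],
    "edges": [
        {"from": "fetch", "to": ["parse"]},
        {"from": "parse", "to": ["speak"]},
    ],
}

dot = graph_json_to_dot(example)
print(dot)
```

The printed text can be pasted into any DOT viewer, which is a handy way to inspect a pipeline's shape without installing the graphviz Python package.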
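How ScrapeGraphAI Works: A Closer Look
ScrapeGraphAI obtains the requested information by interpreting the user's query and intelligently navigating web content. Using LLMs, it can build the scraping pipeline autonomously, minimizing user intervention. This approach improves efficiency and lowers the barrier to entry, so users can focus on data analysis rather than technical complexity.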
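The idea of a pipeline made of nodes that pass shared state along graph edges can be sketched in plain Python. This is a toy illustration under stated assumptions, not ScrapeGraphAI's implementation: the node functions, the state keys, and `run_pipeline` are all hypothetical, no page is downloaded, and the LLM step is replaced by a simple keyword filter.

```python
def fetch(state):
    # Stand-in for downloading a page: the "source" is already plain text.
    state["document"] = state["source"]
    return state


def parse(state):
    # Split the document into non-empty, stripped lines.
    state["chunks"] = [
        line.strip() for line in state["document"].splitlines() if line.strip()
    ]
    return state


def answer(state):
    # Stub for the LLM step: keep chunks that share a word with the prompt.
    words = {w.lower() for w in state["prompt"].split()}
    state["answer"] = [
        c for c in state["chunks"] if words & set(c.lower().split())
    ]
    return state


def run_pipeline(nodes, edges, entry, state):
    """Follow `edges` from `entry`, applying each node function to the state."""
    current = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)  # None once the last node has run
    return state


nodes = {"fetch": fetch, "parse": parse, "answer": answer}
edges = {"fetch": "parse", "parse": "answer"}
state = run_pipeline(
    nodes,
    edges,
    "fetch",
    {"prompt": "projects", "source": "Rotary Pendulum projects page\nContact form\n"},
)
print(state["answer"])
```

Keeping each step as a node that reads and writes a shared state dictionary is what lets a pipeline be extended, for example with a text-to-speech step, by adding one node and one edge rather than rewriting the flow.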
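Improving Efficiency with ScrapeGraphAI
ScrapeGraphAI can automate complex scraping tasks while aiming for accurate results, which makes it a significant change for professionals across industries. Whether the goal is monitoring competitors or conducting academic research, the tool helps users put web data to work efficiently. As the digital landscape continues to evolve, ScrapeGraphAI is becoming a valuable ally for data-driven decision making.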
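Conclusion
In a data-centric world, the importance of efficient data extraction is hard to overstate. ScrapeGraphAI represents a shift in how web scraping is done, offering a user-friendly approach driven by LLMs. As businesses and researchers work to stay ahead in competitive environments, adopting tools such as ScrapeGraphAI can help them obtain actionable insights and make informed decisions.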