📦 RobotsNode Module
The RobotsNode module implements a node responsible for checking whether a website is scrapeable based on its robots.txt file. It uses a language model to determine whether the website allows scraping of the provided path.
Classes
RobotsNode
RobotsNode is a node responsible for checking whether a website is scrapeable based on its robots.txt file.
Attributes
- llm_model: An instance of the language model client used for checking scrapeability.
- force_scraping (bool): A flag indicating whether scraping should be enforced even if disallowed by robots.txt.
- verbose (bool): A flag indicating whether to print status messages during execution.
Methods
- __init__(self, input: str, output: List[str], node_config: Optional[dict] = None, node_name: str = "Robots")
  - Initializes the RobotsNode with a node name and other optional configurations.
  - Args:
    - input (str): Boolean expression defining the input keys needed from the state.
    - output (List[str]): List of output keys to be updated in the state.
    - node_config (dict, optional): Additional configuration for the node. Defaults to None.
    - node_name (str, optional): The unique identifier name for the node. Defaults to "Robots".
- execute(self, state: dict) -> dict
  - Checks whether a website is scrapeable based on its robots.txt file and updates the state with the scrapeability status. The method constructs a prompt for the language model, submits it, and parses the output to determine whether scraping is allowed (a rough sketch of this flow follows this list).
  - Args:
    - state (dict): The current state of the graph. The input keys will be used to fetch the correct data from the state.
  - Returns:
    - dict: The updated state with the output key containing the scrapeability status.
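The node's internal implementation is not shown here, but the flow above can be pictured with a minimal sketch. Everything in it is hypothetical: the check_robots helper, the prompt wording, and the assumption that llm_model is a LangChain-style client exposing invoke() are illustrative, not the library's actual API.

from urllib.parse import urlparse
from urllib.request import urlopen

def check_robots(llm_model, source: str) -> str:
    # Illustrative only: fetch the site's robots.txt and ask the language
    # model whether scraping the given URL is allowed.
    parsed = urlparse(source)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    robots_txt = urlopen(robots_url).read().decode("utf-8")
    prompt = (
        "Given the following robots.txt content:\n"
        f"{robots_txt}\n"
        f"Answer only 'yes' or 'no': is scraping {source} allowed?"
    )
    # Assumes a LangChain-style client exposing invoke(); adapt to your client.
    answer = str(llm_model.invoke(prompt)).lower()
    return "yes" if "yes" in answer else "no"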
Example Usage
Here is an example of how to use the RobotsNode class:
from robots_node import RobotsNode
# Define a RobotsNode
robots_node = RobotsNode(
    input="source",
    output=["is_scrapable"],
    # Pass your language model client instance as llm_model
    node_config={"llm_model": ..., "force_scraping": False, "verbose": True},
    node_name="Robots"
)
# Define the state
state = {
    "source": "https://example.com",
}
# Execute the RobotsNode
state = robots_node.execute(state)
# Retrieve the scrapeability status from the state
is_scrapable = state["is_scrapable"]
print(f"Is Scrapable: {is_scrapable}")