ChromiumLoader Module
The ChromiumLoader module scrapes HTML pages from URLs using a (headless) instance of the Chromium web driver with proxy protection.
Classes
ChromiumLoader
ChromiumLoader is designed to scrape HTML pages from a list of URLs using a specified web driver backend, such as Playwright.
Attributes
- backend (str): The web driver backend library; defaults to 'playwright'.
- browser_config (dict): A dictionary containing additional browser kwargs.
- headless (bool): Whether to run the browser in headless mode.
- proxy (Optional[Proxy]): A dictionary containing proxy settings; None disables protection.
- urls (List[str]): A list of URLs to scrape content from.
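The way these attributes relate to the constructor arguments can be illustrated with a small stand-in class. This is a sketch under assumptions, not the module's actual code: LoaderSketch is a hypothetical name, the proxy is typed as a plain dictionary in place of the Proxy type, and slow_mo is used only as an example of an extra browser keyword argument.

```python
from typing import Any, Dict, List, Optional


class LoaderSketch:
    """Hypothetical stand-in that only stores the documented attributes."""

    def __init__(
        self,
        urls: List[str],
        *,
        backend: str = "playwright",
        headless: bool = True,
        proxy: Optional[Dict[str, Any]] = None,  # assumption: Proxy is dict-like
        **kwargs: Any,
    ) -> None:
        self.urls = urls
        self.backend = backend
        self.headless = headless
        self.proxy = proxy  # None means no proxy is configured
        # Any extra keyword arguments are kept as additional browser options.
        self.browser_config = kwargs


# Everything after urls must be passed by keyword because of the bare `*`.
loader = LoaderSketch(["https://example.com"], headless=False, slow_mo=50)
print(loader.browser_config)  # {'slow_mo': 50}
```

Keyword-only arguments keep call sites readable, and collecting the remainder into browser_config lets backend-specific options pass through without changing the signature.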
Methods
- __init__(self, urls: List[str], *, backend: str = "playwright", headless: bool = True, proxy: Optional[Proxy] = None, **kwargs: Any)
  - Initializes the loader with a list of URL paths.
  - Args:
    - backend (str): The web driver backend library; defaults to 'playwright'.
    - headless (bool): Whether to run the browser in headless mode.
    - proxy (Optional[Proxy]): A dictionary containing proxy information; None disables protection.
    - urls (List[str]): A list of URLs to scrape content from.
    - kwargs (Any): A dictionary containing additional browser kwargs.
  - Raises:
    - ImportError: If the required backend package is not installed.
- ascrape_playwright(self, url: str) -> str
  - Asynchronously scrape the content of a given URL using Playwright's async API.
  - Args:
    - url (str): The URL to scrape.
  - Returns:
    - str: The scraped HTML content or an error message if an exception occurs.
- lazy_load(self) -> Iterator[Document]
  - Lazily load text content from the provided URLs.
  - Yields:
    - Document: The scraped content encapsulated within a Document object.
- alazy_load(self) -> AsyncIterator[Document]
  - Asynchronously load text content from the provided URLs.
  - Yields:
    - Document: A Document object containing the scraped content, along with its source URL as metadata.
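The difference between lazy_load and alazy_load can be sketched with a synchronous generator and an asynchronous generator that yield the same documents. Everything here is illustrative: the Document dataclass is a stand-in for the real class, fake_scrape replaces the browser call with canned HTML, the "source" metadata key is an assumption, and running the scrapes concurrently in alazy_load is one possible design rather than a statement about the module's implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Dict, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for the real Document class."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


async def fake_scrape(url: str) -> str:
    """Placeholder for a browser-backed scraper; returns canned HTML."""
    await asyncio.sleep(0)
    return f"<html><body>{url}</body></html>"


def lazy_load(urls: List[str]) -> Iterator[Document]:
    # One URL is scraped per iteration step; nothing runs until the caller iterates.
    for url in urls:
        html = asyncio.run(fake_scrape(url))
        yield Document(page_content=html, metadata={"source": url})


async def alazy_load(urls: List[str]) -> AsyncIterator[Document]:
    # Run all scrapes concurrently, then yield the results in input order.
    results = await asyncio.gather(*(fake_scrape(url) for url in urls))
    for url, html in zip(urls, results):
        yield Document(page_content=html, metadata={"source": url})


async def collect(urls: List[str]) -> List[Document]:
    # An async generator must be consumed with `async for`.
    return [doc async for doc in alazy_load(urls)]


urls = ["https://example.com", "https://example.org"]
sync_docs = list(lazy_load(urls))
async_docs = asyncio.run(collect(urls))
```

In this sketch both paths produce equal documents; the asynchronous version is mainly useful when the caller already runs inside an event loop or wants several pages fetched at once.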
Example Usage
Here is an example of how to use the ChromiumLoader class:
import asyncio
from chromium_loader import ChromiumLoader

# Define the list of URLs to scrape
urls = ["https://example.com", "https://example.org"]

# Initialize ChromiumLoader
loader = ChromiumLoader(urls, backend="playwright", headless=True, proxy=None)

# Lazy load documents (synchronous)
for document in loader.lazy_load():
    print(document.page_content)

# Asynchronously load documents
async def load_documents():
    async for document in loader.alazy_load():
        print(document.page_content)

# Run the asynchronous loader
asyncio.run(load_documents())
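Because ascrape_playwright is documented to return an error message instead of raising, the loops above may print error text in place of HTML for a URL that failed. The pattern behind that behaviour might look like the sketch below. The names scrape_or_error and fake_fetch, and the "Error: " prefix, are hypothetical; the real message format is not specified in this reference.

```python
import asyncio
from typing import Awaitable, Callable


async def scrape_or_error(fetch: Callable[[str], Awaitable[str]], url: str) -> str:
    """Return the page HTML, or an error string if fetching fails (hypothetical helper)."""
    try:
        return await fetch(url)
    except Exception as exc:  # deliberately broad: failures become return values
        return f"Error: {exc}"


async def fake_fetch(url: str) -> str:
    # Stand-in for a real browser call; fails for one URL to show the error path.
    if "bad" in url:
        raise RuntimeError("navigation failed")
    return f"<html><body>{url}</body></html>"


ok = asyncio.run(scrape_or_error(fake_fetch, "https://example.com"))
failed = asyncio.run(scrape_or_error(fake_fetch, "https://bad.example"))

print(ok)      # <html><body>https://example.com</body></html>
print(failed)  # Error: navigation failed
```

Since a failure arrives as ordinary string content, a caller that needs to tell the two cases apart has to inspect the returned text rather than catch an exception.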