When you are writing a data provider function, you need to retrieve data from a source website. RSSerpent ships with convenient utility functions for you to retrieve data in various scenarios.
The most basic, bare-bone way to retrieve data from the Internet is by using the HTTP protocol. HTTPX, a powerful HTTP library for Python with asynchronous API, is bundled with the RSSerpent core.
Sometimes the source website has a directly accessible JSON API, so you can simply:
from rsserpent.utils import HTTPClient async with HTTPClient() as client: response = await client.get("https://httpbin.org/json") data = response.json()
rsserpent.utils.HTTPClient is a subclass of
httpx.AsyncClient, despite that
HTTPClient will automatically set
User-Agent headers for you.
The source website is sometimes not kind enough to offer you convenient JSON API, so you have to parse the HTML in order to retrieve data. PyQuery is here to help -- it lets you retrieve data from HTML documents by using a jQuery-like API, and it is also bundled with RSSerpent!
from pyquery import PyQuery from rsserpent.utils import HTTPClient async with HTTPClient() as client: response = await client.get("https://httpbin.org/json") dom = PyQuery(response.text) title = dom("h1").text() description = dom("div > p").text()
Playwright is a powerful browser automation library and it's bundled with RSSerpent. RSSerpent also provides a convenient wrapper on top of playwright:
from pyquery import PyQuery from rsserpent.utils import Browser async with Browser() as browser: await browser.goto("https://httpbin.org/html") content = await browser.content() dom = PyQuery(content) title = dom("h1").text() description = dom("div > p").text()
Browser emulation's power comes with a cost, as it strains CPU & memory resources. Be cautious! Use browser emulation only when necessary.