activesoup package¶
- class activesoup.Driver(**kwargs)¶
Bases: object
Driver is the main entrypoint into activesoup.
The Driver provides navigation functions, and keeps track of the current page. Note that this class is re-exposed via activesoup.Driver.
>>> d = Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> assert d.url == "https://github.com/jelford/activesoup"
Navigation updates the current page.
Any methods which are not defined directly on Driver are forwarded on to the most recent Response object.
A single requests.Session is held open for the lifetime of the Driver - the Session will accumulate cookies and open connections. Driver may be used as a context manager to automatically close all open connections when finished:
with Driver() as d:
    d.get("https://github.com/jelford/activesoup")
See Getting Started for a full demo of usage.
- Parameters
kwargs – optional keyword arguments may be passed, which will be set as attributes of the requests.Session which will be used for the lifetime of this Driver:
>>> d = Driver(headers={"User-Agent": "activesoup script"})
>>> d.session.headers["User-Agent"]
'activesoup script'
- get(url, **kwargs) → Driver ¶
Move the Driver to a new page.
This is the primary means of navigating the Driver to the page of interest.
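The forwarding and context-manager behaviour described for Driver can be pictured with a small stand-alone sketch. This is not activesoup's implementation: MiniDriver, FakeSession and FakeResponse are hypothetical names invented here, and no network request is made. It only illustrates the general pattern of setting keyword arguments on a session, delegating unknown attributes to the most recent response, and closing the session on exit.

```python
class FakeResponse:
    """Hypothetical stand-in for a loaded page (not an activesoup class)."""

    def __init__(self, url):
        self.url = url
        self.content_type = "text/html"


class FakeSession:
    """Hypothetical stand-in for requests.Session."""

    def __init__(self):
        self.headers = {}
        self.closed = False

    def close(self):
        self.closed = True


class MiniDriver:
    """Illustrative only: mimics the behaviour documented for Driver."""

    def __init__(self, **kwargs):
        self.session = FakeSession()
        # Keyword arguments become attributes of the session.
        for name, value in kwargs.items():
            setattr(self.session, name, value)
        self._last_response = None

    def get(self, url):
        # A real driver would perform an HTTP request here.
        self._last_response = FakeResponse(url)
        return self

    def __getattr__(self, name):
        # Called only when normal lookup fails, so attributes defined on
        # MiniDriver win; anything else is forwarded to the last response.
        last = self.__dict__.get("_last_response")
        if last is None:
            raise AttributeError(name)
        return getattr(last, name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Close the session even if the body raised.
        self.session.close()
        return False
```

With this sketch, `with MiniDriver(headers={"User-Agent": "activesoup script"}) as d:` followed by `d.get(...)` lets `d.url` resolve through the forwarded response, and the session is closed when the block ends.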
- class activesoup.Response(raw_response: Response, content_type: Optional[str])¶
Bases: object
The result of a page load by activesoup.Driver.
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
content_type (str) – The datatype used for interpreting this response object.
This top-level class contains attributes common to all responses. Child classes contain response-type-specific helpers. Check the content_type of this object to determine what data you have (and therefore which methods are available).
Generally, fields of a Response can be accessed directly through the Driver:
>>> import activesoup
>>> d = activesoup.Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> d.content_type
'text/html'
>>> links = d.find_all("a") # ... etc
- property content_type¶
The type of content contained in this response, e.g. application/csv
- Return type
- property response¶
The raw requests.Response object returned by the server.
You can use this object to inspect information not directly available through the activesoup API.
- Return type
requests.Response
Submodules¶
activesoup.driver module¶
- class activesoup.driver.Driver(**kwargs)¶
Bases: object
Driver is the main entrypoint into activesoup.
The Driver provides navigation functions, and keeps track of the current page. Note that this class is re-exposed via activesoup.Driver.
>>> d = Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> assert d.url == "https://github.com/jelford/activesoup"
Navigation updates the current page.
Any methods which are not defined directly on Driver are forwarded on to the most recent Response object.
A single requests.Session is held open for the lifetime of the Driver - the Session will accumulate cookies and open connections. Driver may be used as a context manager to automatically close all open connections when finished:
with Driver() as d:
    d.get("https://github.com/jelford/activesoup")
See Getting Started for a full demo of usage.
- Parameters
kwargs – optional keyword arguments may be passed, which will be set as attributes of the requests.Session which will be used for the lifetime of this Driver:
>>> d = Driver(headers={"User-Agent": "activesoup script"})
>>> d.session.headers["User-Agent"]
'activesoup script'
- get(url, **kwargs) → Driver ¶
Move the Driver to a new page.
This is the primary means of navigating the Driver to the page of interest.
- exception activesoup.driver.DriverError¶
Bases: RuntimeError
Errors that occur as part of operating the driver.
These errors reflect logic errors (such as accessing the last_response before navigating) or that the Driver is unable to carry out the action that was requested (e.g. the server returned a bad redirect).
activesoup.html module¶
- class activesoup.html.BoundForm(driver: Driver, raw_response: Response, element: Element)¶
Bases: BoundTag
A BoundForm is a specialisation of the BoundTag class, returned when the tag is a <form> element. BoundForm adds the ability to submit forms to the server.
>>> d = activesoup.Driver()
>>> page = d.get("https://github.com/jelford/activesoup/issues/new")
>>> f = page.form
>>> page = f.submit({"title": "Misleading examples", "body": "Examples appear to show interactions with GitHub.com but don't reflect GitHub's real page structure"})
>>> page.url
'https://github.com/jelford/activesoup/issues/1'
- submit(data: Dict, suppress_unspecified: bool = False) → Driver ¶
Submit the form to the server
- Parameters
data (Dict) – The values that should be provided for the various fields in the submitted form. Keys should correspond to the form inputs’ name attribute; values may be simple strings, or lists (in the case where a form input can take several values).
suppress_unspecified (bool) – If False (the default), then activesoup will augment the data parameter to include the values of fields that are:
- not specified in the data parameter
- present with default values in the form as it was presented to us.
The most common use-cases for this are to pick up fields with type="hidden" (commonly used for CSRF protection) or fields with type="checkbox" (commonly some default values are ticked).
If the form has an action attribute specified, then the form will be submitted to that URL. If the form does not specify a method, then POST will be used as a default.
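The augmentation described for suppress_unspecified can be sketched with the standard library alone. The helpers form_defaults and build_submission below are hypothetical names invented for this illustration, and the handling of checkboxes (only ticked boxes are sent, with "on" as the fallback value) is an assumption modelled on ordinary browser behaviour, not a statement about activesoup's internals.

```python
import xml.etree.ElementTree as ET


def form_defaults(form):
    """Collect the values a browser would send if the user changed nothing."""
    defaults = {}
    for field in form.iter("input"):
        name = field.get("name")
        if name is None:
            continue
        if field.get("type") in ("checkbox", "radio"):
            # Unticked boxes are not submitted at all.
            if field.get("checked") is None:
                continue
            value = field.get("value", "on")
        else:
            value = field.get("value")
            if value is None:
                continue
        defaults.setdefault(name, []).append(value)
    # Unwrap single values so simple fields stay simple strings.
    return {k: v[0] if len(v) == 1 else v for k, v in defaults.items()}


def build_submission(form, data, suppress_unspecified=False):
    """Merge caller-supplied data over the form's default values."""
    if suppress_unspecified:
        return dict(data)
    merged = form_defaults(form)
    merged.update(data)  # caller-supplied values win
    return merged


# A made-up form used only for this illustration.
FORM = ET.fromstring(
    '<form action="/issues" method="post">'
    '<input type="hidden" name="csrf" value="abc123" />'
    '<input type="text" name="title" />'
    '<input type="checkbox" name="notify" value="yes" checked="checked" />'
    '<input type="checkbox" name="private" value="yes" />'
    "</form>"
)
```

Calling `build_submission(FORM, {"title": "Hello"})` picks up the hidden csrf field and the ticked notify checkbox alongside the supplied title, while passing `suppress_unspecified=True` sends only the title.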
- class activesoup.html.BoundTag(driver: Driver, raw_response: Response, element: Element)¶
Bases: Response
A BoundTag represents a single node in an HTML document.
When a new HTML page is opened by the activesoup.Driver, the page is parsed, and a new BoundTag is created, which is a handle to the top-level <html> element.
BoundTag provides convenient access to data in the page:
Via field-style find operation (inspired by BeautifulSoup):
>>> page = html_page('<html><body><a id="link">link-text</a></body></html>')
>>> page.a.text()
'link-text'
Via dictionary-style attribute lookup:
>>> page.a["id"]
'link'
A BoundTag wraps an xml.etree.ElementTree.Element, providing shortcuts for common operations. The underlying Element can be accessed via etree. When child elements are accessed via those helpers, they are also wrapped in a BoundTag object.
Note: a BoundTag object is created internally by the activesoup.Driver - you will generally not need to construct one directly.
- etree() → Element ¶
Access the wrapped etree.Element object.
The other methods on this class are generally shortcuts to functionality provided by the underlying Element - with the difference that where applicable they wrap the results in a new BoundTag.
- Return type
Element
- find(xpath: str = None, **kwargs) → Optional[BoundTag] ¶
Find a single element matching the provided xpath expression
- Parameters
xpath (str) – xpath expression that will be forwarded to etree's find
kwargs – Optional dictionary of attribute values. If present, activesoup will append attribute filters to the XPath expression
- Return type
Optional[BoundTag]
Note that unlike find_all(), the path is not first made relative.
>>> page = html_page('<html><body><input type="text" name="first" /><input type="checkbox" name="second" /></body></html>')
>>> page.find(".//input", type="checkbox")["name"]
'second'
The simplest use-case, of returning the first matching item for a particular tag, can be done via the field-style find shortcut:
>>> first_input = page.input
>>> first_input["name"]
'first'
find is a shortcut for .etree().find():
# The following are equivalent except that the returned value is wrapped in a BoundTag
page.find('input', type="checkbox")
page.find('input[@type="checkbox"]')
page.etree().find('input[@type="checkbox"]')
# The following are equivalent except that the returned value is wrapped in a BoundTag
page.find('.//input')
page.input
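To make the attribute-filter behaviour concrete, here is a minimal sketch that works directly on xml.etree.ElementTree rather than on a BoundTag. The helper with_attribute_filters is a hypothetical name invented for this illustration; it only shows how keyword arguments could be turned into XPath predicates of the kind shown in the equivalences above.

```python
import xml.etree.ElementTree as ET


def with_attribute_filters(xpath, **attrs):
    """Append one [@name="value"] predicate per keyword argument."""
    filters = "".join(f'[@{name}="{value}"]' for name, value in attrs.items())
    return xpath + filters


# The same markup as the find() example, parsed with the standard library.
PAGE = ET.fromstring(
    '<html><body><input type="text" name="first" />'
    '<input type="checkbox" name="second" /></body></html>'
)

# The path is passed through unchanged, so it must already be relative
# (".//input") to reach inputs nested below the root element.
expr = with_attribute_filters(".//input", type="checkbox")
match = PAGE.find(expr)
```

Here `expr` is `.//input[@type="checkbox"]`, and `match` is the second input. Attribute names that clash with Python keywords (such as class) would need to be passed with `**{"class": "cool"}` in a helper like this one.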
- find_all(element_matcher: str) → List[BoundTag] ¶
Find all matching elements on the current page
The match expression is made relative (by prefixing with .//) and then forwarded to etree's findall on the parsed Element.
Note that the general power of xml.etree’s XPath support is available, so filter expressions work too:
>>> page = html_page('<html><body><a class="uncool">first link</a><a class="cool">second link</a></body></html>')
>>> links = page.find_all('a')
>>> links[0].text()
'first link'
>>> links[1].text()
'second link'
>>> cool_links = page.find_all('a[@class="cool"]')
>>> len(cool_links)
1
>>> cool_links[0].text()
'second link'
find_all is a shortcut for .etree().findall() with a relative path:
# The following are equivalent:
tag.find_all("a")
tag.etree().findall(".//a")
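The relative-path rule can be tried out on a plain ElementTree element. The function find_all_relative below is a hypothetical stand-in written for this sketch (it returns raw Element objects rather than BoundTag wrappers), using the markup from the example above.

```python
import xml.etree.ElementTree as ET


def find_all_relative(element, matcher):
    """Prefix the matcher with './/' so it searches all descendants."""
    return element.findall(".//" + matcher)


PAGE = ET.fromstring(
    '<html><body><a class="uncool">first link</a>'
    '<a class="cool">second link</a></body></html>'
)

links = find_all_relative(PAGE, "a")
cool_links = find_all_relative(PAGE, 'a[@class="cool"]')
# Without the prefix, findall only looks at direct children of <html>.
direct_only = PAGE.findall("a")
```

In this sketch `links` holds both anchors, `cool_links` holds only the second, and `direct_only` is empty because the anchors sit under `<body>` rather than directly under the root.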
- html() → bytes ¶
Render this element’s HTML as bytes
- Return type
bytes
The output is generated from the parsed HTML structure, as interpreted by html5lib. activesoup uses html5lib to interpret pages in the same way as a browser would, and that might mean making some changes to the structure of the document - for example, if the original HTML contained errors.
activesoup.response module¶
The module contains the various types of response object, used to access data after navigating to a page with activesoup.Driver.get(). All responses are instances of activesoup.response.Response. When activesoup recognises the type of data, the response is specialized for convenient access. This detection is driven by the Content-Type header in the server’s response (so, if a web server labels a CSV file as HTML, activesoup will just assume it’s HTML and try to parse it as such).
The following specialisations are applied:
text/html
activesoup.html.BoundTag. The HTML page is parsed, and a handle to the top-level <html> element is provided.
text/csv
activesoup.response.CsvResponse.
application/json
activesoup.response.JsonResponse. The JSON data is parsed into python objects via json.loads, and made available via dictionary-like access.
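The Content-Type driven selection can be sketched as a simple lookup. Everything here is illustrative: media_type and specialisation_for are hypothetical helpers, the classes are represented by their dotted names as plain strings, the text/csv entry assumes CsvResponse is the matching specialisation, and falling back to the generic Response for unrecognised types is an assumption rather than documented behaviour.

```python
SPECIALISATIONS = {
    "text/html": "activesoup.html.BoundTag",
    "text/csv": "activesoup.response.CsvResponse",  # assumed mapping
    "application/json": "activesoup.response.JsonResponse",
}


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    if not content_type_header:
        return None
    return content_type_header.split(";", 1)[0].strip().lower()


def specialisation_for(content_type_header):
    """Pick a response class name purely from the header value."""
    # Assumption: unknown or missing types fall back to the generic class.
    return SPECIALISATIONS.get(
        media_type(content_type_header), "activesoup.response.Response"
    )
```

Because only the header is consulted, a CSV body served as `text/html` would be routed to the HTML specialisation in this sketch, which mirrors the caveat in the text above.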
- class activesoup.response.CsvResponse(raw_response)¶
Bases: Response
A response object representing a CSV page
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
- class activesoup.response.JsonResponse(raw_response: Response)¶
Bases: Response
A response object representing a JSON page
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
JSON data returned by the page will be parsed into a Python object:
>>> raw_content = '{"key": "value"}'
>>> resp = json_page(raw_content)
>>> resp["key"]
'value'
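Dictionary-like access over parsed JSON can be pictured with a tiny wrapper. JsonView is a hypothetical class written for this sketch and is not the JsonResponse implementation; it only shows the general idea of parsing once with json.loads and delegating item lookup to the parsed object.

```python
import json


class JsonView:
    """Illustrative wrapper: parse once, then delegate item access."""

    def __init__(self, raw_text):
        self._data = json.loads(raw_text)

    def __getitem__(self, key):
        # Works for both object keys and list indices.
        return self._data[key]

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)


view = JsonView('{"key": "value", "items": [1, 2, 3]}')
```

With this wrapper, `view["key"]` returns the parsed string and `view["items"]` returns an ordinary Python list.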
- class activesoup.response.Response(raw_response: Response, content_type: Optional[str])¶
Bases: object
The result of a page load by activesoup.Driver.
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
content_type (str) – The datatype used for interpreting this response object.
This top-level class contains attributes common to all responses. Child classes contain response-type-specific helpers. Check the content_type of this object to determine what data you have (and therefore which methods are available).
Generally, fields of a Response can be accessed directly through the Driver:
>>> import activesoup
>>> d = activesoup.Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> d.content_type
'text/html'
>>> links = d.find_all("a") # ... etc
- property content_type¶
The type of content contained in this response, e.g. application/csv
- Return type
- property response¶
The raw requests.Response object returned by the server.
You can use this object to inspect information not directly available through the activesoup API.
- Return type
requests.Response
- exception activesoup.response.UnknownResponseType¶
Bases: RuntimeError