activesoup package
- class activesoup.Driver(**kwargs)
Bases: object
Driver is the main entrypoint into activesoup.
The Driver provides navigation functions, and keeps track of the current page. Note that this class is re-exposed via activesoup.Driver.
    >>> d = Driver()
    >>> page = d.get("https://github.com/jelford/activesoup")
    >>> assert d.url == "https://github.com/jelford/activesoup"
Navigation updates the current page.
Any methods which are not defined directly on Driver are forwarded on to the most recent Response object.
A single requests.Session is held open for the lifetime of the Driver; the Session will accumulate cookies and open connections. Driver may be used as a context manager to automatically close all open connections when finished:
    with Driver() as d:
        d.get("https://github.com/jelford/activesoup")
See Getting Started for a full demo of usage.
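The forwarding behaviour described above can be pictured with a small stand-alone sketch. It is not the library's implementation: ForwardingDriver and FakeResponse are hypothetical names, no HTTP request is made, and raising RuntimeError before the first navigation is an assumption chosen for illustration.

```python
class FakeResponse:
    """Hypothetical stand-in for a parsed page (not an activesoup class)."""

    def __init__(self, url, content_type):
        self.url = url
        self.content_type = content_type


class ForwardingDriver:
    """Hypothetical driver that forwards unknown attributes to the last response."""

    def __init__(self):
        self.last_response = None

    def get(self, url):
        # A real driver would perform an HTTP request here; this sketch
        # only records a fake response so that forwarding can be observed.
        self.last_response = FakeResponse(url, "text/html")
        return self

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails, so
        # attributes defined on the driver itself always take priority.
        last = self.__dict__.get("last_response")
        if last is None:
            raise RuntimeError("no page has been loaded yet")
        return getattr(last, name)


d = ForwardingDriver()
d.get("https://example.com/first")
print(d.url)           # https://example.com/first
d.get("https://example.com/second")
print(d.url)           # https://example.com/second
print(d.content_type)  # text/html
```

Because the lookup is resolved at access time, each navigation changes what the forwarded attributes return.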
- Parameters
kwargs – optional keyword arguments, which will be set as attributes of the requests.Session used for the lifetime of this Driver:
    >>> d = Driver(headers={"User-Agent": "activesoup script"})
    >>> d.session.headers["User-Agent"]
    'activesoup script'
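As a rough illustration of what "set as attributes" could mean in practice, the sketch below copies keyword arguments onto a session-like object with setattr. StandInSession and configure_session are hypothetical names used so the example runs without requests installed; whether activesoup assigns attributes in exactly this way is an assumption.

```python
class StandInSession:
    """Hypothetical stand-in for requests.Session with two default attributes."""

    def __init__(self):
        self.headers = {"User-Agent": "default-agent", "Accept": "*/*"}
        self.verify = True


def configure_session(session, **kwargs):
    # Each keyword argument becomes an attribute on the session object,
    # replacing whatever value the attribute held before.
    for name, value in kwargs.items():
        setattr(session, name, value)
    return session


session = configure_session(
    StandInSession(), headers={"User-Agent": "activesoup script"}
)
print(session.headers["User-Agent"])  # activesoup script
print("Accept" in session.headers)    # False
print(session.verify)                 # True
```

Under this assumption, passing headers replaces the whole headers mapping rather than merging into the defaults, while attributes that are not mentioned keep their default values.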
- get(url, **kwargs) → Driver
Move the Driver to a new page.
This is the primary means of navigating the Driver to the page of interest.
- class activesoup.Response(raw_response: Response, content_type: Optional[str])
Bases: object
The result of a page load by activesoup.Driver.
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
content_type (str) – The datatype used for interpreting this response object.
This top-level class contains attributes common to all responses. Child classes contain response-type-specific helpers. Check the content_type of this object to determine what data you have (and therefore which methods are available).
Generally, fields of a Response can be accessed directly through the Driver:
    >>> import activesoup
    >>> d = activesoup.Driver()
    >>> page = d.get("https://github.com/jelford/activesoup")
    >>> d.content_type
    'text/html'
    >>> links = d.find_all("a") # ... etc
- property content_type
The type of content contained in this response, e.g. application/csv
- property response
The raw requests.Response object returned by the server.
You can use this object to inspect information not directly available through the activesoup API.
- Return type
requests.Response
Submodules
activesoup.driver module
- class activesoup.driver.Driver(**kwargs)
Bases: object
This is the same class that is re-exposed as activesoup.Driver; see its description above.
- exception activesoup.driver.DriverError
Bases: RuntimeError
Errors that occur as part of operating the driver.
These errors reflect logic errors (such as accessing the last_response before navigating) or that the Driver is unable to carry out the action that was requested (e.g. the server returned a bad redirect).
activesoup.html module
- class activesoup.html.BoundForm(driver: Driver, raw_response: Response, element: Element)
Bases: BoundTag
A BoundForm is a specialisation of the BoundTag class, returned when the tag is a <form> element.
BoundForm adds the ability to submit forms to the server.
    >>> d = activesoup.Driver()
    >>> page = d.get("https://github.com/jelford/activesoup/issues/new")
    >>> f = page.form
    >>> page = f.submit({"title": "Misleading examples", "body": "Examples appear to show interactions with GitHub.com but don't reflect GitHub's real page structure"})
    >>> page.url
    'https://github.com/jelford/activesoup/issues/1'
- submit(data: Dict, suppress_unspecified: bool = False) → Driver
Submit the form to the server.
- Parameters
data (Dict) – The values that should be provided for the various fields in the submitted form. Keys should correspond to the form inputs’ name attribute; values may be simple strings, or lists (in the case where a form input can take several values).
suppress_unspecified (bool) – If False (the default), then activesoup will augment the data parameter to include the values of fields that are:
  - not specified in the data parameter
  - present with default values in the form as it was presented to us.
The most common use-cases for this are picking up fields with type="hidden" (commonly used for CSRF protection) or fields with type="checkbox" (where some default values are commonly ticked).
If the form has an action attribute specified, then the form will be submitted to that URL. If the form does not specify a method, then POST will be used as a default.
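The augmentation rule for suppress_unspecified can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not activesoup's code: form_defaults and build_payload are hypothetical helpers, only <input> elements are considered, and unticked checkboxes are assumed to contribute nothing (as a browser would behave).

```python
import xml.etree.ElementTree as ET

# A hypothetical form, written as well-formed XML so ElementTree can parse it.
FORM = ET.fromstring(
    '<form action="/issues">'
    '<input type="hidden" name="csrf_token" value="abc123" />'
    '<input type="checkbox" name="notify" value="yes" checked="checked" />'
    '<input type="checkbox" name="private" value="yes" />'
    '<input type="text" name="title" value="" />'
    '</form>'
)


def form_defaults(form):
    """Collect the values the form would submit if left untouched."""
    defaults = {}
    for field in form.iter("input"):
        name = field.get("name")
        if name is None:
            continue  # inputs without a name are never submitted
        is_toggle = field.get("type") in ("checkbox", "radio")
        if is_toggle and field.get("checked") is None:
            continue  # unticked boxes contribute nothing
        defaults[name] = field.get("value", "")
    return defaults


def build_payload(form, data, suppress_unspecified=False):
    """Merge caller-supplied data over the form's defaults unless suppressed."""
    if suppress_unspecified:
        return dict(data)
    payload = form_defaults(form)
    payload.update(data)  # values supplied by the caller win
    return payload


method = FORM.get("method", "POST")  # POST when the form names no method
print(method, FORM.get("action"))
print(build_payload(FORM, {"title": "Hello"}))
print(build_payload(FORM, {"title": "Hello"}, suppress_unspecified=True))
```

With the default setting the hidden CSRF token and the ticked checkbox travel along with the title; with suppress_unspecified=True only the title is sent.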
- class activesoup.html.BoundTag(driver: Driver, raw_response: Response, element: Element)
Bases: Response
A BoundTag represents a single node in an HTML document.
When a new HTML page is opened by the activesoup.Driver, the page is parsed, and a new BoundTag is created, which is a handle to the top-level <html> element.
BoundTag provides convenient access to data in the page:
Via field-style find operation (inspired by BeautifulSoup):
    >>> page = html_page('<html><body><a id="link">link-text</a></body></html>')
    >>> page.a.text()
    'link-text'
Via dictionary-style attribute lookup:
    >>> page.a["id"]
    'link'
A BoundTag wraps an xml.etree.ElementTree.Element, providing shortcuts for common operations. The underlying Element can be accessed via etree. When child elements are accessed via those helpers, they are also wrapped in a BoundTag object.
Note: a BoundTag object is created internally by the activesoup.Driver; you will generally not need to construct one directly.
- etree() → Element
Access the wrapped etree.Element object.
The other methods on this class are generally shortcuts to functionality provided by the underlying Element, with the difference that where applicable they wrap the results in a new BoundTag.
- Return type
Element
- find(xpath: str = None, **kwargs) → Optional[BoundTag]
Find a single element matching the provided xpath expression.
- Parameters
xpath (str) – xpath expression that will be forwarded to etree's find
kwargs – Optional dictionary of attribute values. If present, activesoup will append attribute filters to the XPath expression
- Return type
Optional[BoundTag]
Note that unlike find_all(), the path is not first made relative.
    >>> page = html_page('<html><body><input type="text" name="first" /><input type="checkbox" name="second" /></body></html>')
    >>> page.find(".//input", type="checkbox")["name"]
    'second'
The simplest use-case, of returning the first matching item for a particular tag, can be done via the field-style find shortcut:
    >>> first_input = page.input
    >>> first_input["name"]
    'first'
find is a shortcut for .etree().find():
    # The following are equivalent except that the returned value is wrapped in a BoundTag
    page.find('input', type="checkbox")
    page.find('input[@type="checkbox"]')
    page.etree().find('input[@type="checkbox"]')
    # The following are equivalent except that the returned value is wrapped in a BoundTag
    page.find('.//input')
    page.input
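One way to picture how keyword arguments turn into attribute filters is the sketch below, which works directly on xml.etree.ElementTree. The helper with_attribute_filters is hypothetical and only illustrates the idea; it assumes attribute values contain no double quotes and is not taken from activesoup's source.

```python
import xml.etree.ElementTree as ET


def with_attribute_filters(xpath, **attrs):
    """Append one [@name="value"] predicate per keyword argument."""
    predicates = "".join(
        f'[@{name}="{value}"]' for name, value in attrs.items()
    )
    return xpath + predicates


root = ET.fromstring(
    "<html><body>"
    '<input type="text" name="first" />'
    '<input type="checkbox" name="second" />'
    "</body></html>"
)

path = with_attribute_filters(".//input", type="checkbox")
print(path)                         # .//input[@type="checkbox"]
print(root.find(path).get("name"))  # second

# Without the ".//" prefix the search only looks at direct children of
# <html>, so nothing is found here.
print(root.find("input"))           # None
```

The last call shows why it matters that the path is not made relative automatically: a bare tag name only matches direct children of the element being searched.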
- find_all(element_matcher: str) → List[BoundTag]
Find all matching elements on the current page.
The match expression is made relative (by prefixing with .//) and then forwarded to etree's findall on the parsed Element.
Note that the general power of xml.etree’s XPath support is available, so filter expressions work too:
    >>> page = html_page('<html><body><a class="uncool">first link</a><a class="cool">second link</a></body></html>')
    >>> links = page.find_all('a')
    >>> links[0].text()
    'first link'
    >>> links[1].text()
    'second link'
    >>> cool_links = page.find_all('a[@class="cool"]')
    >>> len(cool_links)
    1
    >>> cool_links[0].text()
    'second link'
find_all is a shortcut for .etree().findall() with a relative path:
    # The following are equivalent:
    tag.find_all("a")
    tag.etree().findall(".//a")
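The effect of the .// prefix can be checked with plain ElementTree. In this sketch the module-level find_all function is a hypothetical stand-in for the method described above, and the sample document nests one link inside a <div> to show that depth does not matter.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><body>"
    '<a class="uncool">first link</a>'
    '<div><a class="cool">second link</a></div>'
    "</body></html>"
)


def find_all(element, matcher):
    # Prefix with ".//" so the search covers every descendant,
    # not only the direct children of `element`.
    return element.findall(".//" + matcher)


all_links = [a.text for a in find_all(root, "a")]
cool_links = [a.text for a in find_all(root, 'a[@class="cool"]')]
direct_only = root.findall("a")

print(all_links)    # ['first link', 'second link']
print(cool_links)   # ['second link']
print(direct_only)  # []
```

Results come back in document order, and the unprefixed findall("a") finds nothing because <html> has no <a> elements as direct children.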
- html() → bytes
Render this element’s HTML as bytes.
- Return type
bytes
The output is generated from the parsed HTML structure, as interpreted by html5lib. activesoup uses html5lib to interpret pages in the same way as a browser would, and that might mean making some changes to the structure of the document, for example if the original HTML contained errors.
activesoup.response module
This module contains the various types of response object, used to access data after
navigating to a page with activesoup.Driver.get(). All responses
are instances of activesoup.response.Response. When activesoup
recognises the type of data, the response is specialized for convenient access.
This detection is driven by the Content-Type header in the server’s response
(so, if a web server labels a CSV file as HTML, activesoup will just assume
it’s HTML and try to parse it as such).
The following specialisations are applied:
- text/html: activesoup.html.BoundTag. The HTML page is parsed, and a handle to the top-level <html> element is provided.
- text/csv: activesoup.response.CsvResponse.
- application/json: activesoup.response.JsonResponse. The JSON data is parsed into Python objects via json.loads, and made available via dictionary-like access.
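Header-driven detection of this kind can be sketched as a lookup table keyed on the media type. Everything named here (PARSERS, media_type, interpret) is hypothetical, HTML is left out to keep the example within the standard library, and raising ValueError for unrecognised types is an assumption rather than documented library behaviour.

```python
import csv
import io
import json


def parse_csv(body):
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(body)))


# Map each recognised media type to the function that interprets the body.
PARSERS = {
    "text/csv": parse_csv,
    "application/json": json.loads,
}


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def interpret(content_type_header, body):
    """Choose a parser from the header alone; the body is never sniffed."""
    kind = media_type(content_type_header)
    parser = PARSERS.get(kind)
    if parser is None:
        raise ValueError(f"unrecognised content type: {kind}")
    return kind, parser(body)


print(interpret("application/json; charset=utf-8", '{"key": "value"}'))
print(interpret("text/csv", "a,b\n1,2\n"))
```

Because only the header is consulted, a CSV body labelled application/json would be handed to json.loads and fail to parse, which mirrors the mislabelling caveat above.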
- class activesoup.response.CsvResponse(raw_response)
Bases: Response
A response object representing a CSV page.
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
- class activesoup.response.JsonResponse(raw_response: Response)
Bases: Response
A response object representing a JSON page.
- Parameters
raw_response (requests.Response) – The raw data returned from the server.
JSON data returned by the page will be parsed into a Python object:
    >>> raw_content = '{"key": "value"}'
    >>> resp = json_page(raw_content)
    >>> resp["key"]
    'value'
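Dictionary-style access over parsed JSON needs very little machinery. The sketch below uses a hypothetical JsonView class to show the general pattern of parsing once and delegating item lookups; it makes no claim about how JsonResponse is implemented internally.

```python
import json


class JsonView:
    """Hypothetical wrapper giving dictionary-style access to parsed JSON."""

    def __init__(self, raw_text):
        # Parse once, up front; later lookups read from the parsed object.
        self._data = json.loads(raw_text)

    def __getitem__(self, key):
        return self._data[key]

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)


resp = JsonView('{"key": "value", "items": [1, 2]}')
print(resp["key"])       # value
print(resp["items"][1])  # 2
print("missing" in resp) # False
```

Nested values come back as ordinary Python lists and dictionaries, so they can be indexed further without any extra wrapping.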
- class activesoup.response.Response(raw_response: Response, content_type: Optional[str])
Bases: object
Documented identically to activesoup.Response above.
- exception activesoup.response.UnknownResponseType
Bases: RuntimeError