activesoup package

class activesoup.Driver(**kwargs)

Bases: object

Driver is the main entrypoint into activesoup.

The Driver provides navigation functions, and keeps track of the current page. Note that this class is re-exposed via activesoup.Driver.

>>> d = Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> assert d.url == "https://github.com/jelford/activesoup"
  • Navigation updates the current page

  • Any methods which are not defined directly on Driver are forwarded on to the most recent Response object
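The forwarding described in the second bullet can be pictured with Python's `__getattr__` hook. This is a minimal sketch, not activesoup's actual implementation: `SketchDriver` and `SketchResponse` are hypothetical stand-ins, and no network request is made.

```python
class SketchResponse:
    """Hypothetical stand-in for a parsed response object."""

    def __init__(self, url, content_type):
        self.url = url
        self.content_type = content_type


class SketchDriver:
    """Hypothetical driver that forwards unknown attributes to its last response."""

    def __init__(self):
        self.last_response = None

    def get(self, url):
        # A real driver would fetch the page; here we only record a fake response.
        self.last_response = SketchResponse(url, "text/html")
        return self

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup has failed,
        # so attributes defined on the driver itself always take precedence.
        last = self.__dict__.get("last_response")
        if last is None:
            raise AttributeError(name)
        return getattr(last, name)


driver = SketchDriver().get("https://www.example.com")
forwarded_type = driver.content_type  # resolved on the response, not the driver
```

Because the hook only fires for names the driver does not define, anything defined on the driver itself (such as `get`) is never shadowed by the response.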

A single requests.Session is held open for the lifetime of the Driver - the Session will accumulate cookies and open connections. Driver may be used as a context manager to automatically close all open connections when finished:

with Driver() as d:
    d.get("https://github.com/jelford/activesoup")

See Getting Started for a full demo of usage.

Parameters

kwargs

Optional keyword arguments. Each one is set as an attribute of the requests.Session that is used for the lifetime of this Driver:

>>> d = Driver(headers={"User-Agent": "activesoup script"})
>>> d.session.headers["User-Agent"]
'activesoup script'

get(url, **kwargs) → Driver

Move the Driver to a new page.

This is the primary means of navigating the Driver to the page of interest.

Parameters
  • url (str) – the new URL for the Driver to navigate to (e.g. https://www.example.com)

  • kwargs – additional keyword arguments are passed in to the constructor of the requests.Request used to fetch the page.

Returns

the Driver object itself

Return type

Driver

property last_response: Optional[Response]

Get the response object that was the result of the most recent page load

Returns

None if no page has been loaded, otherwise the parsed result of the most recent page load

Return type

activesoup.Response

property url: Optional[str]

The URL of the current page

Returns

None if no page has been loaded, otherwise the URL of the most recently loaded page.

Return type

str

class activesoup.Response(raw_response: Response, content_type: Optional[str])

Bases: object

The result of a page load by activesoup.Driver.

Parameters
  • raw_response (requests.Response) – The raw data returned from the server.

  • content_type (str) – The datatype used for interpreting this response object.

This top-level class contains attributes common to all responses. Child classes contain response-type-specific helpers. Check the content_type of this object to determine what data you have (and therefore which methods are available).

Generally, fields of a Response can be accessed directly through the Driver:

>>> import activesoup
>>> d = activesoup.Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> d.content_type
'text/html'
>>> links = d.find_all("a")  # ... etc

property content_type

The type of content contained in this response

e.g. application/csv

Return type

str

property response

The raw requests.Response object returned by the server.

You can use this object to inspect information not directly available through the activesoup API.

Return type

requests.Response

property status_code

Status code from the HTTP response

e.g. 200

Return type

int

property url: str

Which URL was requested that resulted in this response?

Return type

str

Submodules

activesoup.driver module

class activesoup.driver.Driver(**kwargs)

Bases: object

Driver is the main entrypoint into activesoup.

The Driver provides navigation functions, and keeps track of the current page. Note that this class is re-exposed via activesoup.Driver.

>>> d = Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> assert d.url == "https://github.com/jelford/activesoup"
  • Navigation updates the current page

  • Any methods which are not defined directly on Driver are forwarded on to the most recent Response object

A single requests.Session is held open for the lifetime of the Driver - the Session will accumulate cookies and open connections. Driver may be used as a context manager to automatically close all open connections when finished:

with Driver() as d:
    d.get("https://github.com/jelford/activesoup")
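The context-manager behaviour can be sketched with the standard `__enter__` / `__exit__` protocol. This is an illustration under assumptions, not activesoup's code: `SketchSession` and `SketchDriver` are hypothetical names, and the stand-in session only records whether `close()` was called.

```python
class SketchSession:
    """Hypothetical stand-in for a session that holds connections open."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SketchDriver:
    """Hypothetical driver that owns one session for its whole lifetime."""

    def __init__(self):
        self.session = SketchSession()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and also when the block raises.
        self.session.close()
        return False  # do not swallow exceptions


with SketchDriver() as d:
    open_inside_block = not d.session.closed
closed_after_block = d.session.closed
```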

See Getting Started for a full demo of usage.

Parameters

kwargs

Optional keyword arguments. Each one is set as an attribute of the requests.Session that is used for the lifetime of this Driver:

>>> d = Driver(headers={"User-Agent": "activesoup script"})
>>> d.session.headers["User-Agent"]
'activesoup script'
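One way to read "set as attributes" is a plain `setattr` loop over the keyword arguments. The sketch below assumes that reading; `configure_session` is a hypothetical helper, and `SimpleNamespace` stands in for requests.Session so the example has no third-party dependency.

```python
from types import SimpleNamespace


def configure_session(session, **kwargs):
    """Copy each keyword argument onto the session as an attribute."""
    for name, value in kwargs.items():
        setattr(session, name, value)
    return session


# SimpleNamespace stands in for requests.Session in this sketch.
session = configure_session(
    SimpleNamespace(headers={}, verify=True),
    headers={"User-Agent": "activesoup script"},
)
```

Under this reading, passing `headers` replaces the whole headers mapping rather than merging into it, while attributes that are not mentioned (here `verify`) keep their defaults. Check the library's behaviour before relying on either point.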

get(url, **kwargs) → Driver

Move the Driver to a new page.

This is the primary means of navigating the Driver to the page of interest.

Parameters
  • url (str) – the new URL for the Driver to navigate to (e.g. https://www.example.com)

  • kwargs – additional keyword arguments are passed in to the constructor of the requests.Request used to fetch the page.

Returns

the Driver object itself

Return type

Driver

property last_response: Optional[Response]

Get the response object that was the result of the most recent page load

Returns

None if no page has been loaded, otherwise the parsed result of the most recent page load

Return type

activesoup.Response

property url: Optional[str]

The URL of the current page

Returns

None if no page has been loaded, otherwise the URL of the most recently loaded page.

Return type

str

exception activesoup.driver.DriverError

Bases: RuntimeError

Errors that occur as part of operating the driver

These errors reflect logic errors (such as accessing the last_response before navigating) or that the Driver is unable to carry out the action that was requested (e.g. the server returned a bad redirect).

activesoup.html module

class activesoup.html.BoundForm(driver: Driver, raw_response: Response, element: Element)

Bases: BoundTag

A BoundForm is a specialisation of the BoundTag class, returned when the tag is a <form> element.

BoundForm adds the ability to submit forms to the server.

>>> d = activesoup.Driver()
>>> page = d.get("https://github.com/jelford/activesoup/issues/new")
>>> f = page.form
>>> page = f.submit({"title": "Misleading examples", "body": "Examples appear to show interactions with GitHub.com but don't reflect GitHub's real page structure"})
>>> page.url
'https://github.com/jelford/activesoup/issues/1'

submit(data: Dict, suppress_unspecified: bool = False) → Driver

Submit the form to the server

Parameters
  • data (Dict) – The values that should be provided for the various fields in the submitted form. Keys should correspond to the form inputs' name attributes; values may be simple strings, or lists (in the case where a form input can take several values)

  • suppress_unspecified (bool) –

    If False (the default), then activesoup will augment the data parameter to include the values of fields that are:

    • not specified in the data parameter

    • present with default values in the form as it was presented to us.

    The most common use cases for this are picking up fields with type="hidden" (commonly used for CSRF protection) and fields with type="checkbox" (where some default values are commonly ticked).

If the form has an action attribute specified, then the form will be submitted to that URL. If the form does not specify a method, then POST will be used as a default.
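The rules above (fill in unspecified defaults, use the action attribute, fall back to POST) can be sketched with the standard library. This is an illustration under assumptions rather than activesoup's implementation: `build_submission` is a hypothetical helper, only `<input>` elements are considered, and the form markup is invented for the example.

```python
import xml.etree.ElementTree as ET


def build_submission(form, data, suppress_unspecified=False):
    """Return (method, action, payload) for a parsed <form> element."""
    method = form.get("method", "POST").upper()
    action = form.get("action")
    payload = dict(data)
    if not suppress_unspecified:
        for field in form.iter("input"):
            name = field.get("name")
            if name is None or name in payload:
                continue  # caller-supplied values always win
            if field.get("type") == "checkbox" and field.get("checked") is None:
                continue  # unticked checkboxes contribute nothing
            value = field.get("value")
            if value is not None:
                payload[name] = value
    return method, action, payload


form = ET.fromstring(
    '<form action="/issues">'
    '<input type="hidden" name="csrf_token" value="abc123" />'
    '<input type="text" name="title" />'
    '<input type="checkbox" name="notify" value="yes" checked="checked" />'
    '<input type="checkbox" name="private" value="yes" />'
    '</form>'
)

method, action, payload = build_submission(form, {"title": "Hello"})
_, _, bare_payload = build_submission(
    form, {"title": "Hello"}, suppress_unspecified=True
)
```

With the defaults, the hidden token and the ticked checkbox are added to the caller's data; with `suppress_unspecified=True`, only the caller's data is sent.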

class activesoup.html.BoundTag(driver: Driver, raw_response: Response, element: Element)

Bases: Response

A BoundTag represents a single node in an HTML document.

When a new HTML page is opened by the activesoup.Driver, the page is parsed, and a new BoundTag is created, which is a handle to the top-level <html> element.

BoundTag provides convenient access to data in the page:

Via field-style find operation (inspired by BeautifulSoup):

>>> page = html_page('<html><body><a id="link">link-text</a></body></html>')
>>> page.a.text()
'link-text'

Via dictionary-style attribute lookup:

>>> page.a["id"]
'link'

A BoundTag wraps an xml.etree.ElementTree.Element, providing shortcuts for common operations. The underlying Element can be accessed via etree. When child elements are accessed via those helpers, they are also wrapped in a BoundTag object.

Note: a BoundTag object is created internally by the activesoup.Driver - you will generally not need to construct one directly.
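The two access styles can be pictured as a thin wrapper around an Element. This is a minimal sketch, assuming field-style access searches descendants by tag name; `TagWrapper` is a hypothetical class and the standard-library XML parser stands in for activesoup's HTML parsing.

```python
import xml.etree.ElementTree as ET


class TagWrapper:
    """Hypothetical wrapper offering field-style and dictionary-style access."""

    def __init__(self, element):
        self._element = element

    def etree(self):
        return self._element

    def text(self):
        return self._element.text

    def __getitem__(self, attribute):
        # Dictionary-style lookup reads an attribute of the wrapped element.
        return self._element.attrib[attribute]

    def __getattr__(self, tag_name):
        # Field-style lookup finds the first descendant with that tag name.
        if tag_name.startswith("_"):
            raise AttributeError(tag_name)
        found = self._element.find(f".//{tag_name}")
        if found is None:
            raise AttributeError(tag_name)
        return TagWrapper(found)  # children are wrapped too


page = TagWrapper(
    ET.fromstring('<html><body><a id="link">link-text</a></body></html>')
)
link_text = page.a.text()
link_id = page.a["id"]
```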

etree() → Element

Access the wrapped etree.Element object

The other methods on this class are generally shortcuts to functionality provided by the underlying Element, with the difference that, where applicable, they wrap the results in a new BoundTag.

Return type

Element

find(xpath: str = None, **kwargs) → Optional[BoundTag]

Find a single element matching the provided xpath expression

Parameters
  • xpath (str) – xpath expression that will be forwarded to etree's find

  • kwargs – Optional dictionary of attribute values. If present, activesoup will append attribute filters to the XPath expression

Return type

Optional[BoundTag]

Note that unlike find_all(), the path is not first made relative.

>>> page = html_page('<html><body><input type="text" name="first" /><input type="checkbox" name="second" /></body></html>')
>>> page.find(".//input", type="checkbox")["name"]
'second'
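The attribute filters built from kwargs can be approximated by appending one predicate per keyword to the XPath expression. The sketch below assumes that behaviour; `with_attribute_filters` is a hypothetical helper, and the standard-library XML parser stands in for the parsed page.

```python
import xml.etree.ElementTree as ET


def with_attribute_filters(xpath, **attrs):
    """Append one [@name="value"] predicate per keyword argument."""
    filters = "".join(f'[@{name}="{value}"]' for name, value in attrs.items())
    return xpath + filters


root = ET.fromstring(
    '<html><body>'
    '<input type="text" name="first" />'
    '<input type="checkbox" name="second" />'
    '</body></html>'
)

expression = with_attribute_filters(".//input", type="checkbox")
match = root.find(expression)

# Without the .// prefix the path is not relative to all descendants,
# so only direct children of <html> are considered and nothing matches.
direct_child = root.find("input")
```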

The simplest use case, returning the first matching item for a particular tag, can be done via the field-style find shortcut:

>>> first_input = page.input
>>> first_input["name"]
'first'

find is a shortcut for .etree().find():

# The following are equivalent except that the returned value is wrapped in a BoundTag
page.find('input', type="checkbox")
page.find('input[@type="checkbox"]')
page.etree().find('input[@type="checkbox"]')

# The following are equivalent except that the returned value is wrapped in a BoundTag
page.find('.//input')
page.input

find_all(element_matcher: str) → List[BoundTag]

Find all matching elements on the current page

Parameters

element_matcher (str) – match expression to be used.

Return type

List[BoundTag]

The match expression is made relative (by prefixing with .//) and then forwarded to etree's findall on the parsed Element.

Note that the general power of xml.etree’s XPath support is available, so filter expressions work too:

>>> page = html_page('<html><body><a class="uncool">first link</a><a class="cool">second link</a></body></html>')
>>> links = page.find_all('a')
>>> links[0].text()
'first link'
>>> links[1].text()
'second link'
>>> cool_links = page.find_all('a[@class="cool"]')
>>> len(cool_links)
1
>>> cool_links[0].text()
'second link'
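The effect of making the expression relative can be seen directly with xml.etree. This sketch uses a hypothetical helper, `find_all_relative`, and compares it with an unprefixed `findall` on the same document.

```python
import xml.etree.ElementTree as ET


def find_all_relative(element, element_matcher):
    """Prefix the matcher with .// so it searches all descendants."""
    return element.findall(".//" + element_matcher)


root = ET.fromstring(
    '<html><body>'
    '<a class="uncool">first link</a>'
    '<a class="cool">second link</a>'
    '</body></html>'
)

all_links = find_all_relative(root, "a")
cool_links = find_all_relative(root, 'a[@class="cool"]')
top_level_only = root.findall("a")  # no prefix: only direct children of <html>
```

The unprefixed call returns an empty list here because the links sit inside `<body>`, not directly under `<html>`.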

find_all is a shortcut for .etree().findall() with a relative path:

# The following are equivalent:
tag.find_all("a")
tag.etree().findall(".//a")

html() → bytes

Render this element’s HTML as bytes

Return type

bytes

The output is generated from the parsed HTML structure, as interpreted by html5lib. activesoup uses html5lib to interpret pages the way a browser would, which may change the structure of the document, for example if the original HTML contained errors.
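That the output comes from the parsed tree rather than the original bytes can be shown with a small round trip. This sketch uses the standard-library XML parser and serialiser as a stand-in for html5lib, so the normalisation shown is only indicative of the kind of change to expect.

```python
import xml.etree.ElementTree as ET

source = "<p class='greeting'>Hello</p>"  # single-quoted attribute in the input
element = ET.fromstring(source)

# Serialising the parsed tree yields bytes, normalised by the serialiser.
rendered = ET.tostring(element, method="html")
# rendered is b'<p class="greeting">Hello</p>'
```

The quoting style of the attribute changes because the serialiser writes the tree in its own form; the original text is not retained.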

text() → Optional[str]

Access the text content of an HTML node

Return type

Optional[str]

>>> page = html_page('<html><body><p>Hello world</p></body></html>')
>>> p = page.p
>>> p.text()
'Hello world'

text is a shortcut for .etree().text:

# The following are equivalent:
p.text()
p.etree().text
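Because text() maps onto Element.text, it inherits that attribute's semantics: it holds only the text before the first child element, and is None when there is none. The following sketch shows this on plain xml.etree elements, assuming text() returns exactly that value.

```python
import xml.etree.ElementTree as ET

simple = ET.fromstring("<p>Hello world</p>")
mixed = ET.fromstring("<p>Hello <b>big</b> world</p>")
empty = ET.fromstring("<p />")

simple_text = simple.text   # 'Hello world'
leading_text = mixed.text   # 'Hello ' (text before the first child only)
missing_text = empty.text   # None, hence the Optional[str] return type

# itertext() walks the whole subtree when all nested text is wanted.
all_text = "".join(mixed.itertext())  # 'Hello big world'
```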

activesoup.response module

This module contains the various types of response object, which are used to access data after navigating to a page with activesoup.Driver.get(). All responses are instances of activesoup.response.Response. When activesoup recognises the type of data, the response is specialised for convenient access. This detection is driven by the Content-Type header in the server's response (so, if a web server labels a CSV file as HTML, activesoup will assume it is HTML and try to parse it as such).

The following specialisations are applied:

text/html

activesoup.html.BoundTag. The HTML page is parsed, and a handle to the top-level <html> element is provided.

text/csv

activesoup.response.CsvResponse

application/json

activesoup.response.JsonResponse. The JSON data is parsed into python objects via json.loads, and made available via dictionary-like access.
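Header-driven detection of this kind can be sketched as a lookup on the media type. This is an illustration under assumptions, not activesoup's dispatch code: `media_type` and `interpret` are hypothetical helpers, and the handling of header parameters and of unrecognised types is a guess.

```python
import json


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def interpret(content_type_header, body):
    """Return a (kind, value) pair chosen from the Content-Type header alone."""
    kind = media_type(content_type_header)
    if kind == "application/json":
        return kind, json.loads(body)
    if kind in ("text/html", "text/csv"):
        # A real implementation would parse the HTML or wrap the CSV here.
        return kind, body
    return "unknown", body


kind, value = interpret("application/json; charset=utf-8", '{"key": "value"}')

# The body is never inspected, so CSV labelled as HTML is treated as HTML.
mislabelled = interpret("text/html", "a,b\n1,2\n")
```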

class activesoup.response.CsvResponse(raw_response)

Bases: Response

A response object representing a CSV page

Parameters

raw_response (requests.Response) – The raw data returned from the server.

save(to: Union[Path, str, IO])

Saves the current page to the location given by to

Parameters

to – Where to save the file. to may be a path (in which case that path will be opened in binary mode, and truncated if it already exists) or a file-like object (in which case that object will be written to directly)
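Accepting either a path or a file-like object usually comes down to one type check. The sketch below assumes that approach; `save_bytes` is a hypothetical helper operating on raw bytes, not the CsvResponse method itself.

```python
import io
import tempfile
from pathlib import Path


def save_bytes(content, to):
    """Write content to a path (str or Path) or to a binary file-like object."""
    if isinstance(to, (str, Path)):
        # "wb" truncates an existing file and writes in binary mode.
        with open(to, "wb") as handle:
            handle.write(content)
    else:
        # Anything else is assumed to be an open, writable binary stream.
        to.write(content)


csv_bytes = b"name,score\nada,10\n"

# File-like target: written to directly.
buffer = io.BytesIO()
save_bytes(csv_bytes, buffer)

# Path target: any existing contents are replaced.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "report.csv"
    target.write_bytes(b"stale contents that should be replaced")
    save_bytes(csv_bytes, target)
    saved = target.read_bytes()
```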

class activesoup.response.JsonResponse(raw_response: Response)

Bases: Response

A response object representing a JSON page

Parameters

raw_response (requests.Response) – The raw data returned from the server.

JSON data returned by the page will be parsed into a Python object:

>>> raw_content = '{"key": "value"}'
>>> resp = json_page(raw_content)
>>> resp["key"]
'value'

class activesoup.response.Response(raw_response: Response, content_type: Optional[str])

Bases: object

The result of a page load by activesoup.Driver.

Parameters
  • raw_response (requests.Response) – The raw data returned from the server.

  • content_type (str) – The datatype used for interpreting this response object.

This top-level class contains attributes common to all responses. Child classes contain response-type-specific helpers. Check the content_type of this object to determine what data you have (and therefore which methods are available).

Generally, fields of a Response can be accessed directly through the Driver:

>>> import activesoup
>>> d = activesoup.Driver()
>>> page = d.get("https://github.com/jelford/activesoup")
>>> d.content_type
'text/html'
>>> links = d.find_all("a")  # ... etc

property content_type

The type of content contained in this response

e.g. application/csv

Return type

str

property response

The raw requests.Response object returned by the server.

You can use this object to inspect information not directly available through the activesoup API.

Return type

requests.Response

property status_code

Status code from the HTTP response

e.g. 200

Return type

int

property url: str

Which URL was requested that resulted in this response?

Return type

str

exception activesoup.response.UnknownResponseType

Bases: RuntimeError