API reference

Utilities

process.util.wrap(string)[source]

Format a long string as a help message, and return it.
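The reference does not show how the string is formatted. As a minimal sketch only, assuming the helper dedents a triple-quoted string and re-wraps each paragraph with the standard library's textwrap module (the name wrap_help and the width of 78 are hypothetical, not taken from process.util):

```python
import textwrap


def wrap_help(string: str, width: int = 78) -> str:
    """Dedent a triple-quoted string and re-wrap each paragraph to `width` columns.

    Hypothetical stand-in for process.util.wrap; the real formatting rules may differ.
    """
    paragraphs = textwrap.dedent(string).strip().split("\n\n")
    # Collapse each paragraph's internal line breaks, then wrap it again.
    return "\n\n".join(textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs)


help_text = wrap_help(
    """
    Load data into a collection.

    The data can be a file
    or a directory.
    """
)
```

Here help_text keeps the blank line between the two paragraphs and joins the second paragraph's two source lines into one.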

process.util.walk(paths)[source]
process.util.get_publisher()[source]
process.util.consume(*args, **kwargs)[source]
process.util.decorator(decode, callback, state, channel, method, properties, body)[source]

Close the database connections opened by the callback, before returning.

If the callback raises an exception, shut down the client in the main thread, without acknowledgment. For some exceptions, assume that the same message was delivered twice, log an error, and nack the message.
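The control flow described above can be sketched without a message broker or a database. Everything in this block is a stand-in introduced for illustration: DuplicateDelivery, run_callback, and the close_connections and shutdown callables are hypothetical and are not the signatures used by process.util.decorator.

```python
import logging

logger = logging.getLogger(__name__)


class DuplicateDelivery(Exception):
    """Hypothetical stand-in for the exceptions that are treated as a repeated delivery."""


def run_callback(callback, channel, delivery_tag, body, *, close_connections, shutdown):
    """Run `callback`, settle the message on error, and always close connections."""
    try:
        callback(channel, delivery_tag, body)
    except DuplicateDelivery:
        # Assume the same message was delivered twice: log an error and nack it.
        logger.exception("Message %r looks like a duplicate delivery", body)
        channel.nack(delivery_tag)
    except Exception:
        # Any other error: shut down the client, leaving the message unacknowledged.
        logger.exception("Unhandled exception when consuming %r", body)
        shutdown()
    finally:
        # Runs on every path, so that connections opened by the callback are closed.
        close_connections()
```

The finally clause is the design point: cleanup happens whether the callback succeeds, hits an expected error, or hits an unexpected one.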

process.util.get_or_create(model, data)[source]

Get or create a Data or PackageData instance.

process.util.create_note(collection, code, note, **kwargs)[source]
process.util.create_step(name, collection_id, **kwargs)[source]
process.util.deleting_step(*args, **kwargs)[source]

Delete the named step and run any finish callback only if successful or if the error is expected.
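One way to read this behavior is as a context manager. The sketch below is an assumption-laden illustration, not the project's code: deleting_step_sketch, the delete callable and the expected parameter are hypothetical names, and the choice to re-raise an expected error after cleanup is a guess.

```python
from contextlib import contextmanager


@contextmanager
def deleting_step_sketch(delete, finish=None, finish_args=(), expected=()):
    """Call `delete()` and then `finish(*finish_args)` on success or on an expected error."""

    def cleanup():
        delete()
        if finish is not None:
            finish(*finish_args)

    try:
        yield
    except expected:
        # Expected error: the step is still deleted, then the error propagates (assumption).
        cleanup()
        raise
    else:
        # No error: delete the step and run the finish callback.
        cleanup()
    # Any other exception propagates without cleanup, so the step remains.


calls = []
with deleting_step_sketch(lambda: calls.append("delete"), finish=calls.append, finish_args=("finish",)):
    pass  # the work guarded by the step goes here
```

After the with block exits normally, calls holds the deletion followed by the finish callback's argument.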

process.util.delete_step(name, finish=None, finish_args=(), exception=None, **kwargs)[source]
process.util.create_logger_note(collection, name)[source]
process.util.get_extensions(package)[source]

Models

class process.models.Default[source]
class process.models.Collection(*args, **kwargs)[source]

A collection of data from a source.

There should be at most one collection of a given source (source_id) at a given time (data_version) of a given scope (sample or not). A unique constraint therefore covers these fields.

A collection can be a sample of a source. For example, an analyst can load a sample of a bulk download, run manual queries to check whether it serves their needs, and then load the full file. To avoid the overhead of deleting the sample, we instead make sample part of the unique constraint, along with source_id and data_version.
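The effect of including sample in the unique constraint can be illustrated with an in-memory SQLite table. This is a sketch under stated assumptions: the table layout and the values are hypothetical, and the real model may define more columns and a wider constraint than the three fields named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE collection (
        id INTEGER PRIMARY KEY,
        source_id TEXT NOT NULL,
        data_version TEXT NOT NULL,
        sample INTEGER NOT NULL,
        UNIQUE (source_id, data_version, sample)
    )
    """
)
insert = "INSERT INTO collection (source_id, data_version, sample) VALUES (?, ?, ?)"

# A sample and a full load of the same source and data version can coexist.
conn.execute(insert, ("example_source", "2024-01-01 00:00:00", 1))
conn.execute(insert, ("example_source", "2024-01-01 00:00:00", 0))

# A second full load with the same source and data version violates the constraint.
try:
    conn.execute(insert, ("example_source", "2024-01-01 00:00:00", 0))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The sample row does not need to be deleted before the full load, which is the overhead the design avoids.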

class Transform(*values)[source]
clean_fields(exclude=None)[source]

Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.

get_upgraded_collection()[source]

Return the upgraded collection or None.

Return type:

Self | None

get_compiled_collection()[source]

Return the compiled collection or None, traversing the upgraded collection if needed.

Return type:

Self | None

get_root_parent()[source]

Return the “root” ancestor of the collection.

Return type:

Self
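The two traversals described above can be sketched on a plain data structure. Node and its attribute names are hypothetical stand-ins for the model's relations; only the lookup order (a directly compiled collection first, then the one reached through the upgraded collection) and the walk up to the root are illustrated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a collection; only the links between collections are modelled."""

    name: str
    parent: Node | None = None
    upgraded: Node | None = None
    compiled: Node | None = None

    def get_compiled(self) -> Node | None:
        # Prefer a compiled collection derived directly from this one.
        if self.compiled is not None:
            return self.compiled
        # Otherwise, traverse the upgraded collection, if there is one.
        if self.upgraded is not None:
            return self.upgraded.compiled
        return None

    def get_root_parent(self) -> Node:
        # Follow parent links until reaching a collection without a parent.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


root = Node("root")
upgraded = Node("upgraded", parent=root)
compiled = Node("compiled", parent=upgraded)
root.upgraded = upgraded
upgraded.compiled = compiled
```

With this chain, root.get_compiled() reaches compiled through upgraded, and compiled.get_root_parent() walks back to root.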

class process.models.CollectionNote(*args, **kwargs)[source]

A note an analyst made about the collection.

class Level(*values)[source]
class process.models.CollectionFile(*args, **kwargs)[source]

A file within the collection.

class process.models.ProcessingStep(*args, **kwargs)[source]

A step in the lifecycle of a collection file.

class Name(*values)[source]
class process.models.Data(*args, **kwargs)[source]

The contents of a release, record or compiled release.

class process.models.PackageData(*args, **kwargs)[source]

The contents of a package, excluding the releases or records.

class process.models.Release(*args, **kwargs)[source]

A release.

class process.models.Record(*args, **kwargs)[source]

A record.

class process.models.CompiledRelease(*args, **kwargs)[source]

A compiled release.

class process.models.ReleaseCheck(*args, **kwargs)[source]

The result of checking a release.

class process.models.RecordCheck(*args, **kwargs)[source]

The result of checking a record.

Loader

process.processors.loader.file_or_directory(path)[source]

Check whether the path exists. Raise an exception if not.
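A check like this is commonly used as an argument type in a command-line parser. The following is a minimal sketch with the standard library's argparse; existing_path is a hypothetical name, and the exception that the real function raises is not stated in this reference, so argparse.ArgumentTypeError is an assumption.

```python
import argparse
import os


def existing_path(path: str) -> str:
    """Return `path` unchanged if it exists, else raise an error that argparse reports."""
    if not os.path.exists(path):
        raise argparse.ArgumentTypeError(f"No such file or directory {path!r}")
    return path


parser = argparse.ArgumentParser()
# argparse calls existing_path on the raw string before storing the value.
parser.add_argument("path", type=existing_path)
args = parser.parse_args([os.curdir])
```

Validating in the type callable means that a bad path is rejected during argument parsing, before any work starts.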

process.processors.loader.create_collection_file(collection, filename=None, url=None)[source]

Create a file for a collection, and the steps for this file.

Parameters:
  • collection (Collection) – collection

  • filename (str) – path to file data

Returns:

created collection file

Raises:

InvalidFormError – if there is a validation error

Return type:

CollectionFile

process.processors.loader.create_collections(source_id, data_version, *, sample=False, upgrade=False, compile=False, check=False, scrapyd_job='', note='', force=False)[source]

Create the root collection, derived collections and notes.

Parameters:
  • source_id (str) – collection source

  • data_version (str) – data version in ISO format

  • sample (boolean) – whether this is a sample only

  • upgrade (boolean) – whether to plan collection upgrade

  • compile (boolean) – whether to plan collection compile

  • check (boolean) – whether to plan schema-based checks

  • scrapyd_job (str) – Scrapyd job ID

  • note (str) – text description

  • force (boolean) – skip validation of the source_id against the Scrapyd project

Returns:

the root collection, upgraded collection and compiled collection

Return type:

tuple[Collection, Collection, Collection]

Scrapyd

process.scrapyd.configured()[source]

Return whether the connection to Scrapyd is configured.

Return type:

bool

process.scrapyd.spiders()[source]

Return the names of the spiders in the Scrapyd project.

Return type:

list[str]
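This reference does not say how the spider names are retrieved. As a sketch only, assuming they come from Scrapyd's listspiders.json endpoint, the request URL and the response handling might look like the following; listspiders_url, parse_spiders, the base URL and the project name are hypothetical, and no network request is made here.

```python
import json
from urllib.parse import urlencode, urljoin


def listspiders_url(base_url: str, project: str) -> str:
    """Build the URL of Scrapyd's listspiders.json endpoint for `project`."""
    return urljoin(base_url, "listspiders.json") + "?" + urlencode({"project": project})


def parse_spiders(payload: str) -> list[str]:
    """Extract the spider names from a listspiders.json response body."""
    data = json.loads(payload)
    if data.get("status") != "ok":
        raise RuntimeError(f"Scrapyd returned an error: {data.get('message')}")
    return data["spiders"]


url = listspiders_url("http://localhost:6800/", "example_project")
names = parse_spiders('{"status": "ok", "spiders": ["spider_a", "spider_b"]}')
```

In this sketch, an empty base URL would be the natural signal that the connection is not configured.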

Command-line interface

class process.cli.CollectionCommand(stdout=None, stderr=None, no_color=False, force_color=False)[source]
add_arguments(parser)[source]

Add default arguments to the command.

add_collection_arguments(parser)[source]

Add arguments specific to this command.

handle(*args, **options)[source]

Get the collection.

handle_collection(collection, *args, **options)[source]

Run the command.
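The four methods above follow a template-method layout: the base class adds the default arguments and gets the collection, and a subclass adds its own arguments and runs the command. The sketch below shows that layout with argparse alone, so that it runs without Django. CollectionCommandSketch, Describe, run, the collection_id argument and the in-memory lookup are all hypothetical.

```python
import argparse


class CollectionCommandSketch:
    """Hypothetical stand-in: the base class gets the collection, subclasses do the work."""

    collections = {1: "collection 1"}  # stand-in for a database lookup

    def add_arguments(self, parser):
        # Default argument shared by every command, then the subclass's own arguments.
        parser.add_argument("collection_id", type=int)
        self.add_collection_arguments(parser)

    def add_collection_arguments(self, parser):
        """Hook for subclasses to add arguments specific to the command."""

    def handle(self, *args, **options):
        collection_id = options.pop("collection_id")
        try:
            collection = self.collections[collection_id]
        except KeyError:
            raise SystemExit(f"Collection {collection_id} does not exist") from None
        return self.handle_collection(collection, *args, **options)

    def handle_collection(self, collection, *args, **options):
        raise NotImplementedError


class Describe(CollectionCommandSketch):
    def add_collection_arguments(self, parser):
        parser.add_argument("--upper", action="store_true")

    def handle_collection(self, collection, *args, **options):
        return collection.upper() if options["upper"] else collection


def run(command, argv):
    """Parse `argv` with the command's arguments, then dispatch to handle()."""
    parser = argparse.ArgumentParser()
    command.add_arguments(parser)
    return command.handle(**vars(parser.parse_args(argv)))
```

A subclass written this way never repeats the lookup or its error handling; it only overrides the two hooks.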