API reference

Utilities

process.util.wrap(string)

Formats a long string as a help message, and returns it.
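As an illustration of this kind of formatting, here is a minimal sketch built on the standard library's textwrap module. The name wrap_help, the 78-column width and the whitespace handling are assumptions made for this example; the source does not say how process.util.wrap is implemented.

```python
import textwrap


def wrap_help(string, width=78):
    """Hypothetical stand-in for process.util.wrap (not the project's code)."""
    # Collapse runs of whitespace, then break the text into lines of at most
    # `width` characters, without splitting words or hyphenated terms.
    lines = textwrap.wrap(
        " ".join(string.split()),
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )
    return "\n".join(lines)


message = wrap_help("Load data into a collection. " * 6)
print(message)
```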

process.util.walk(paths)
process.util.get_publisher()
process.util.consume(*args, **kwargs)
process.util.decorator(decode, callback, state, channel, method, properties, body)

Close the database connections opened by the callback, before returning.

If the callback raises an exception, shut down the client in the main thread, without acknowledgment. For some exceptions, assume that the same message was delivered twice, log an error, and nack the message.
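The control flow described above can be sketched without a message broker or database. Everything in this example is hypothetical: decorator_sketch, DuplicateDelivery and the recorded action strings are illustrative names, and the real function takes more arguments (decode, state, channel, method, properties) than are modeled here.

```python
import logging

logger = logging.getLogger("decorator_sketch")
logger.addHandler(logging.NullHandler())  # keep the example quiet


class DuplicateDelivery(Exception):
    """Hypothetical marker for errors that suggest a redelivered message."""


def decorator_sketch(callback, body):
    """Return the actions taken, in order, instead of touching a real broker."""
    actions = []
    try:
        callback(body)
    except DuplicateDelivery:
        # Assume the same message was delivered twice: log and nack.
        logger.error("%r was apparently delivered twice", body)
        actions.append("nack")
    except Exception:
        # Any other error: shut down the client, without acknowledgment.
        actions.append("shutdown without ack")
    finally:
        # Always close the connections that the callback opened.
        actions.append("close database connections")
    return actions


def duplicate(body):
    raise DuplicateDelivery


def fails(body):
    raise ValueError("unexpected")


ok_actions = decorator_sketch(lambda body: None, b"{}")
duplicate_actions = decorator_sketch(duplicate, b"{}")
failed_actions = decorator_sketch(fails, b"{}")
```

In all three cases, the connections are closed last, which is the guarantee the first paragraph describes.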

process.util.get_or_create(model, data)
process.util.create_note(collection, code, note, **kwargs)
process.util.create_step(name, collection_id, **kwargs)
process.util.delete_step(*args, **kwargs)

Delete the named step and run any finish callback only if successful or if the error is expected.
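One way to read this behavior is as a context manager that cleans up on success or on an expected error, and leaves the step in place on any other error. The sketch below follows that reading, which is an assumption: the source does not state that delete_step is a context manager, and delete_step_sketch, ExpectedError and the set-based step store are hypothetical.

```python
from contextlib import contextmanager


class ExpectedError(Exception):
    """Hypothetical stand-in for the errors treated as expected."""


def _delete_and_finish(steps, name, finish):
    steps.discard(name)
    if finish is not None:
        finish()


@contextmanager
def delete_step_sketch(steps, name, finish=None):
    try:
        yield
    except ExpectedError:
        # Expected error: still delete the step and run the callback.
        _delete_and_finish(steps, name, finish)
        raise
    else:
        # Success: delete the step and run the callback.
        _delete_and_finish(steps, name, finish)
    # Any other exception propagates, and the step is left in place.


def run(error=None):
    """Run the sketch once and report the remaining steps and callback calls."""
    steps, finished = {"load"}, []
    try:
        with delete_step_sketch(steps, "load", finish=lambda: finished.append(True)):
            if error is not None:
                raise error
    except Exception:
        pass
    return steps, finished
```

Calling run() or run(ExpectedError()) removes the step and records one callback call; run(RuntimeError()) leaves the step untouched.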

process.util.create_warnings_note(collection, category)
process.util.create_logger_note(collection, name)

Models

class process.models.Default
class process.models.Collection(*args, **kwargs)

A collection of data from a source.

There should be at most one collection of a given source (source_id) at a given time (data_version) of a given scope (sample or not). A unique constraint therefore covers these fields.

A collection can be a sample of a source. For example, an analyst can load a sample of a bulk download, run manual queries to check whether it serves their needs, and then load the full file. To avoid the overhead of deleting the sample, we instead make sample part of the unique constraint, along with source_id and data_version.
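The effect of including sample in the unique constraint can be shown with an in-memory SQLite table. This is a sketch only: the table definition, column types and example values are assumptions, and the project's actual schema is not shown in this reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE collection (
        id INTEGER PRIMARY KEY,
        source_id TEXT NOT NULL,
        data_version TEXT NOT NULL,
        sample INTEGER NOT NULL,
        UNIQUE (source_id, data_version, sample)
    )
    """
)

insert = "INSERT INTO collection (source_id, data_version, sample) VALUES (?, ?, ?)"

# A sample and a full load of the same source and version can coexist.
conn.execute(insert, ("example_source", "2024-01-01 00:00:00", 1))
conn.execute(insert, ("example_source", "2024-01-01 00:00:00", 0))

# A second full load of the same source and version is rejected.
try:
    conn.execute(insert, ("example_source", "2024-01-01 00:00:00", 0))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM collection").fetchone()[0]
```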

class Transform(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
clean_fields(exclude=None)

Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.

get_upgraded_collection()

Returns the existing upgraded collection, or None if there is none.

Returns:

upgraded collection

Return type:

Collection

get_compiled_collection()

Returns the existing compiled collection, or None if there is none.

Returns:

compiled collection

Return type:

Collection

get_root_parent()

Returns the “root” parent of the collection, by traversing the tree of parents to the top.

Returns:

root collection

Return type:

Collection
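The traversal can be sketched with a plain dataclass in place of the Django model. Node and its parent attribute are hypothetical, and the sketch assumes that a collection without a parent is its own root, which the source does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a collection with an optional parent."""

    name: str
    parent: Optional["Node"] = None

    def get_root_parent(self) -> "Node":
        # Follow parent links until reaching a node that has no parent.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


original = Node("original")
upgraded = Node("upgraded", parent=original)
compiled = Node("compiled", parent=upgraded)

root = compiled.get_root_parent()
```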

class process.models.CollectionNote(*args, **kwargs)

A note an analyst made about the collection.

class Level(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
class process.models.CollectionFile(*args, **kwargs)

A file within the collection.

class process.models.ProcessingStep(*args, **kwargs)

A step in the lifecycle of a collection file.

class Name(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
class process.models.CollectionFileItem(*args, **kwargs)

An item within a file in the collection.

class process.models.Data(*args, **kwargs)

The contents of a release, record or compiled release.

class process.models.PackageData(*args, **kwargs)

The contents of a package, excluding the releases or records.

class process.models.Release(*args, **kwargs)

A release.

class process.models.Record(*args, **kwargs)

A record.

class process.models.CompiledRelease(*args, **kwargs)

A compiled release.

class process.models.ReleaseCheck(*args, **kwargs)

The result of checking a release.

class process.models.RecordCheck(*args, **kwargs)

The result of checking a record.

Loader

process.processors.loader.file_or_directory(string)

Checks whether the path is an existing file or directory. Raises an exception if it is not.
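A check like this is commonly used as an argparse type callable. The sketch below assumes that usage and assumes the exception is argparse.ArgumentTypeError; the source specifies neither, and file_or_directory_sketch is a hypothetical name.

```python
import argparse
import os
import tempfile


def file_or_directory_sketch(string):
    """Return the path unchanged if it exists, otherwise raise (assumed behavior)."""
    if not os.path.exists(string):
        raise argparse.ArgumentTypeError(f"No such file or directory {string!r}")
    return string


parser = argparse.ArgumentParser()
parser.add_argument("path", type=file_or_directory_sketch)

# The temporary directory exists on any platform, so parsing succeeds.
args = parser.parse_args([tempfile.gettempdir()])
```

With a type callable, argparse reports an ArgumentTypeError as a usage error and exits, so a bad path is rejected before the command does any work.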

process.processors.loader.create_collection_file(collection, filename=None, url=None, errors=None)

Creates a file for a collection, and the steps for this file.

Parameters:
  • collection (Collection) – collection

  • filename (str) – path to file data

  • errors (json) – errors to be stored

Returns:

created collection file

Return type:

CollectionFile

Raises:

InvalidFormError – if there is a validation error

process.processors.loader.create_collections(source_id, data_version, sample=False, upgrade=False, compile=False, check=False, scrapyd_job='', note='', force=False)

Creates the main collection and, based on the provided arguments, the note, the upgraded collection and the compiled collection.

Parameters:
  • source_id (str) – collection source

  • data_version (str) – data version in ISO format

  • sample (boolean) – whether this is a sample only

  • upgrade (boolean) – whether to plan collection upgrade

  • compile (boolean) – whether to plan collection compile

  • check (boolean) – whether to plan schema-based checks

  • scrapyd_job (str) – Scrapyd job ID

  • note (str) – text description

  • force (boolean) – skip validation of the source_id against the Scrapyd project

Returns:

created main collection, upgraded collection, compiled collection

Return type:

Collection, Collection, Collection

Scrapyd

process.scrapyd.configured()
Returns:

whether the connection to Scrapyd is configured

Return type:

bool

process.scrapyd.spiders()
Returns:

the names of the spiders in the Scrapyd project

Return type:

list

Command-line interface

class process.cli.CollectionCommand(stdout=None, stderr=None, no_color=False, force_color=False)
add_arguments(parser)

Adds default arguments to the command.

add_collection_arguments(parser)

Adds arguments specific to this command.

handle(*args, **options)

Gets the collection.

handle_collection(collection, *args, **options)

Runs the command.
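The four methods suggest a template-method pattern: the base class adds the default arguments and looks up the collection, and a subclass supplies its own arguments and logic. The sketch below shows that pattern with argparse alone, without Django. The collection_id argument, the COLLECTIONS lookup and the Describe subclass are hypothetical; the source does not list the default arguments.

```python
import argparse

# Hypothetical stand-in for a database lookup.
COLLECTIONS = {1: "example_source"}


class CollectionCommandSketch:
    def add_arguments(self, parser):
        # Default argument (assumed), then the subclass's own arguments.
        parser.add_argument("collection_id", type=int)
        self.add_collection_arguments(parser)

    def add_collection_arguments(self, parser):
        pass  # subclasses may override

    def handle(self, *args, **options):
        # Get the collection, then delegate to the subclass.
        collection = COLLECTIONS[options.pop("collection_id")]
        return self.handle_collection(collection, *args, **options)

    def handle_collection(self, collection, *args, **options):
        raise NotImplementedError


class Describe(CollectionCommandSketch):
    def add_collection_arguments(self, parser):
        parser.add_argument("--upper", action="store_true")

    def handle_collection(self, collection, *args, **options):
        return collection.upper() if options["upper"] else collection


command = Describe()
parser = argparse.ArgumentParser()
command.add_arguments(parser)
result = command.handle(**vars(parser.parse_args(["1", "--upper"])))
```

A subclass written this way never repeats the lookup: it overrides only add_collection_arguments and handle_collection.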