User Guide¶

Installation¶

Install from source:

$ python setup.py install

or with pip:

$ pip install argschema

Your First Module¶

mymodule.py¶

import argschema

class MySchema(argschema.ArgSchema):
    a = argschema.fields.Int(default=42, description='my first parameter')
                            
if __name__ == '__main__':
    mod = argschema.ArgSchemaParser(schema_type=MySchema)
    print(mod.args)
    

Running this code produces:

$ python mymodule.py
{'a': 42, 'log_level': u'ERROR'}
$ python mymodule.py --a 2
{'a': 2, 'log_level': u'ERROR'}
$ python mymodule.py --a 2 --log_level WARNING
{'a': 2, 'log_level': u'WARNING'}
WARNING:argschema.argschema_parser:this program does nothing useful
$ python mymodule.py -h
usage: mymodule.py [-h] [--a A] [--output_json OUTPUT_JSON]
                [--log_level LOG_LEVEL] [--input_json INPUT_JSON]

optional arguments:
-h, --help            show this help message and exit
--a A                 my first parameter
--output_json OUTPUT_JSON
                        file path to output json file
--log_level LOG_LEVEL
                        set the logging level of the module
--input_json INPUT_JSON
                        file path of input json file

Great, you are thinking, that is basically argparse. Congratulations!

But there is more. You can also give your module a dictionary in an interactive session:

>>> from argschema import ArgSchemaParser
>>> from mymodule import MySchema
>>> d = {'a':5}
>>> mod = ArgSchemaParser(input_data=d,schema_type=MySchema)
>>> print(mod.args)
{'a': 5, 'log_level': u'ERROR'}

Or you can write out a JSON file and pass its path on the command line:

myinput.json¶

{
    "a":99
}

$ python mymodule.py --input_json myinput.json
{'a': 99, 'log_level': u'ERROR', 'input_json': u'myinput.json'}

Or override a parameter if you want:

$ python mymodule.py --input_json myinput.json --a 100
{'a': 100, 'log_level': u'ERROR', 'input_json': u'myinput.json'}

Plus, no matter how you give it parameters, they are always validated before any of your code runs.

Whether from the command line

$ python mymodule.py --input_json myinput.json --a 5!
usage: mymodule.py [-h] [--a A] [--output_json OUTPUT_JSON]
                [--log_level LOG_LEVEL] [--input_json INPUT_JSON]
mymodule.py: error: argument --a: invalid int value: '5!'

or from a dictionary

>>> from argschema import ArgSchemaParser
>>> from mymodule import MySchema
>>> d={'a':'hello'}
>>> mod = ArgSchemaParser(input_data=d,schema_type=MySchema)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/forrestcollman/argschema/argschema/argschema_parser.py", line 159, in __init__
    raise mm.ValidationError(json.dumps(result.errors, indent=2))
marshmallow.exceptions.ValidationError: {
  "a": [
    "Not a valid integer."
  ]
}
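
The command-line error shown earlier has the shape of an ordinary argparse type error, and that part can be reproduced in isolation. The following is a minimal sketch using plain argparse only (not argschema); the parser here is a stand-in built for illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="mymodule.py")
parser.add_argument("--a", type=int, default=42, help="my first parameter")

# A well-formed value is converted to int before any module code would run.
print(parser.parse_args(["--a", "2"]).a)  # 2

status = None
try:
    parser.parse_args(["--a", "5!"])
except SystemExit as exc:
    # argparse prints usage plus the error to stderr, then exits with status 2.
    status = exc.code
print("exit status:", status)  # exit status: 2
```

Because the conversion happens while arguments are parsed, a bad value stops the program before the module body is reached.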

Fields¶

argschema uses marshmallow (http://marshmallow.readthedocs.io/) under the hood to define parameter schemas. It comes with a basic set of fields that you can use to define your schemas. One powerful feature of marshmallow is that you can define custom fields that do arbitrary validation. argschema.fields contains all the built-in marshmallow fields, plus some useful custom ones, such as argschema.fields.InputFile, argschema.fields.OutputFile, and argschema.fields.InputDir, which validate that the paths exist and have the proper permissions to allow files to be read or written.

Other fields, such as argschema.fields.NumpyArray will deserialize ordered lists of lists directly into a numpy array of your choosing.

Finally, an important field to know is argschema.fields.Nested, which allows you to define hierarchical nested structures. Note that if you use Nested schemas, they should subclass argschema.schemas.DefaultSchema so that default values are filled in properly, as marshmallow.Schema does not do that by itself.
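
To see what filling in defaults for a nested structure involves, here is a minimal sketch with plain dictionaries. fill_defaults is a hypothetical helper written for this illustration; it is not part of argschema or marshmallow, which handle this through the schema classes.

```python
def fill_defaults(params, defaults):
    """Return a copy of params with missing keys taken from defaults.

    Hypothetical helper: nested dictionaries in `defaults` are filled
    recursively, so a default deep inside a section is not lost.
    """
    filled = dict(params)
    for key, default in defaults.items():
        if isinstance(default, dict):
            filled[key] = fill_defaults(filled.get(key, {}), default)
        else:
            filled.setdefault(key, default)
    return filled


defaults = {"log_level": "ERROR", "inc": {"write_output": True}}
params = {"inc": {"name": "from_dictionary", "increment": 5}}

result = fill_defaults(params, defaults)
print(result["inc"]["write_output"])  # True
print(result["log_level"])  # ERROR
```

A non-recursive version would leave the inc section without write_output, which is the kind of gap the note about DefaultSchema warns against.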

The template_module example shows how you might combine these features to define a more complex parameter structure.

template_module.py¶

from argschema import ArgSchemaParser, ArgSchema
from argschema.fields import OutputFile, NumpyArray, Boolean, Int, Str, Nested
from argschema.schemas import DefaultSchema
import numpy as np
import json

# these are the core parameters for my module
class MyNestedParameters(DefaultSchema):
    name = Str(required=True, description='name of vector')
    increment = Int(required=True, description='value to increment')
    array = NumpyArray(dtype=float, required=True, description='array to increment')
    write_output = Boolean(required=False, default=True)

# but I'm going to nest them inside a subsection called inc
class MyParameters(ArgSchema):
    inc = Nested(MyNestedParameters)

# this is another schema we will use to validate and deserialize our output
class MyOutputParams(DefaultSchema):
    name = Str(required=True, description='name of vector')
    inc_array = NumpyArray(dtype=float, required=True, description='incremented array')

if __name__ == '__main__':
    
    # this defines a default dictionary that will be used if input_json is not specified
    example_input = {
        "inc": {
            "name": "from_dictionary",
            "increment": 5,
            "array": [0, 2, 5],

            "write_output": True
        },
        "output_json": "output_dictionary.json"
    }

    # here is my ArgSchemaParser that processes my inputs
    mod = ArgSchemaParser(input_data=example_input,
                          schema_type=MyParameters,
                          output_schema_type=MyOutputParams)
                          
    # pull out the inc section of the parameters
    inc_params = mod.args['inc']

    # do my simple addition of the parameters
    inc_array = inc_params['array'] + inc_params['increment']

    # define the output dictionary
    output = {
        'name': inc_params['name'],
        'inc_array': inc_array
    }

    # if the parameters are set as such write the output
    if inc_params['write_output']:
        mod.output(output)

Now we can run the example commands found in run_template.sh. The second command reads this input file:

input.json¶

{
    "inc": {
        "name": "from_json",
        "increment": 1,
        "array": [3, 2, 1],
        "write_output": true
    }
}

$ python template_module.py \
    --output_json output_command.json \
    --inc.name from_command \
    --inc.increment 2
{u'name': u'from_command', u'inc_array': [2.0, 4.0, 7.0]}
$ python template_module.py \
    --input_json input.json \
    --output_json output_fromjson.json
{u'name': u'from_json', u'inc_array': [4.0, 3.0, 2.0]}
$ python template_module.py
{u'name': u'from_dictionary', u'inc_array': [5.0, 7.0, 10.0]}
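
The first command supplies only inc.name and inc.increment, so the array still comes from example_input, which is why its result is [2.0, 4.0, 7.0]. A minimal sketch of that merge with plain dictionaries follows; apply_overrides is a hypothetical helper for illustration and not argschema's implementation.

```python
import copy


def apply_overrides(base, overrides):
    """Return a copy of `base` with dotted-key overrides applied.

    Hypothetical helper: "inc.name" addresses base["inc"]["name"].
    """
    merged = copy.deepcopy(base)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


example_input = {
    "inc": {"name": "from_dictionary", "increment": 5, "array": [0, 2, 5]},
}

# Equivalent of: --inc.name from_command --inc.increment 2
params = apply_overrides(
    example_input, {"inc.name": "from_command", "inc.increment": 2}
)
inc = params["inc"]
inc_array = [float(x + inc["increment"]) for x in inc["array"]]
print(inc["name"], inc_array)  # from_command [2.0, 4.0, 7.0]
```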

Command-Line Specification¶

As mentioned in the section Your First Module, argschema supports setting arguments at the command line, as well as providing them in an input JSON file or passing a dictionary directly as input_data. Values passed at the command line take precedence over those passed to the parser or in the input JSON.
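
This precedence can be pictured as layered lookups. The sketch below uses collections.ChainMap, where earlier maps win; the three dictionaries are made-up stand-ins for the command line, the input JSON, and the schema defaults, and this is only a mental model, not how argschema is implemented.

```python
from collections import ChainMap

schema_defaults = {"a": 42, "log_level": "ERROR"}
from_input_json = {"a": 99}      # contents of myinput.json
from_command_line = {"a": 100}   # --a 100

# ChainMap searches its maps left to right, so the first map has top priority.
args = ChainMap(from_command_line, from_input_json, schema_defaults)
print(args["a"], args["log_level"])  # 100 ERROR

# Without a command-line value, the input JSON value is used.
print(ChainMap(from_input_json, schema_defaults)["a"])  # 99
```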

Arguments are specified with --argument_name <value>, where value is passed by the shell. If there are spaces in the value, it will need to be wrapped in quotes, and any special characters will need to be escaped with a backslash (\). Booleans are set with True or 1 for true and False or 0 for false.
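
To make the quoting and boolean rules concrete, here is a small sketch using only the standard library. shlex.split mimics how a POSIX shell splits a command line, and parse_bool is a hypothetical converter that accepts only the four spellings listed above; argschema's own parsing may behave differently.

```python
import argparse
import shlex


def parse_bool(text):
    """Hypothetical converter: accept True/1 and False/0 only."""
    if text in ("True", "1"):
        return True
    if text in ("False", "0"):
        return False
    raise argparse.ArgumentTypeError("invalid boolean value: %r" % text)


parser = argparse.ArgumentParser()
parser.add_argument("--name")
parser.add_argument("--flag", type=parse_bool)

# Quotes keep a value that contains spaces together as one argument.
argv = shlex.split(r'--name "two words" --flag 0')
args = parser.parse_args(argv)
print(args.name, args.flag)  # two words False

# A backslash makes the shell pass the next character through literally.
print(shlex.split(r"--name cost\$5"))  # ['--name', 'cost$5']
```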

An exception to this rule is list formatting. If a schema contains a List and does not set the cli_as_single_argument keyword argument to True, lists will be parsed as --list_name <value1> <value2> …. In argschema 2.0, lists will be parsed in the same way as other arguments, as this allows more flexibility in list types and more clearly represents the intended data structure.

An example script showing old and new list settings:

deprecated_example.py¶

from argschema import ArgSchema, ArgSchemaParser
from argschema.fields import List, Float


class MySchema(ArgSchema):
    list_old = List(Float, default=[1.1, 2.2, 3.3],
                    description="float list with deprecated cli")
    list_new = List(Float, default=[4.4, 5.5, 6.6],
                    cli_as_single_argument=True,
                    description="float list with supported cli")


if __name__ == '__main__':
    mod = ArgSchemaParser(schema_type=MySchema)
    print(mod.args)

Running this code demonstrates the differences in command-line usage:

$ python deprecated_example.py --help
FutureWarning: '--list_old' is using old-style command-line syntax
with each element as a separate argument. This will not be supported
in argschema after 2.0. See http://argschema.readthedocs.io/en/master/user/intro.html#command-line-specification
for details.
warnings.warn(warn_msg, FutureWarning)
usage: deprecated_example.py [-h] [--input_json INPUT_JSON]
                             [--output_json OUTPUT_JSON]
                             [--log_level LOG_LEVEL]
                             [--list_old [LIST_OLD [LIST_OLD ...]]]
                             [--list_new LIST_NEW]

optional arguments:
  -h, --help            show this help message and exit

MySchema:
  --input_json INPUT_JSON
                        file path of input json file
  --output_json OUTPUT_JSON
                        file path to output json file
  --log_level LOG_LEVEL
                        set the logging level of the module (default=ERROR)
  --list_old [LIST_OLD [LIST_OLD ...]]
                        float list with deprecated cli (default=[1.1, 2.2,
                        3.3])
  --list_new LIST_NEW   float list with supported cli (default=[4.4, 5.5,
                        6.6])
$ python deprecated_example.py --list_old 9.1 8.2 7.3 --list_new [6.4,5.5,4.6]
FutureWarning: '--list_old' is using old-style command-line syntax
with each element as a separate argument. This will not be supported
in argschema after 2.0. See http://argschema.readthedocs.io/en/master/user/intro.html#command-line-specification
for details.
warnings.warn(warn_msg, FutureWarning)
{'log_level': 'ERROR', 'list_new': [6.4, 5.5, 4.6], 'list_old': [9.1, 8.2, 7.3]}
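
The difference between the two styles can be imitated with plain argparse. In this sketch the old style uses nargs="*", and the new style takes one string and evaluates it with ast.literal_eval; that choice of evaluator is an assumption made for illustration, not a statement about argschema's internals.

```python
import argparse
import ast

parser = argparse.ArgumentParser()
# Old style: every element is its own shell argument.
parser.add_argument("--list_old", nargs="*", type=float)
# New style: the whole list arrives as one argument and is parsed afterwards.
parser.add_argument("--list_new", type=ast.literal_eval)

args = parser.parse_args(
    ["--list_old", "9.1", "8.2", "7.3", "--list_new", "[6.4,5.5,4.6]"]
)
print(args.list_old)  # [9.1, 8.2, 7.3]
print(args.list_new)  # [6.4, 5.5, 4.6]

# Nested lists fit naturally into the single-argument form.
nested = ast.literal_eval("[['foo','bar'],['baz','buz']]")
print(nested)  # [['foo', 'bar'], ['baz', 'buz']]
```

The single-argument form carries the full structure of the value, which the element-per-argument form cannot express for lists of lists.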

We can explore some typical examples of command line usage with the following script:

cli_example.py¶

from argschema import ArgSchema, ArgSchemaParser
from argschema.fields import List, NumpyArray, Bool, Int, Nested, Str
from argschema.schemas import DefaultSchema


class MyNestedSchema(DefaultSchema):
    a = Int(default=42, description="my first parameter")
    b = Bool(default=True, description="my boolean")


class MySchema(ArgSchema):
    array = NumpyArray(default=[[1, 2, 3], [4, 5, 6]], dtype="uint8",
                       description="my example array")
    string_list = List(List(Str),
                       default=[["hello", "world"], ["lists!"]],
                       cli_as_single_argument=True,
                       description="list of lists of strings")
    int_list = List(Int, default=[1, 2, 3],
                    cli_as_single_argument=True,
                    description="list of ints")
    nested = Nested(MyNestedSchema, required=True)


if __name__ == '__main__':
    mod = ArgSchemaParser(schema_type=MySchema)
    print(mod.args)

$ python cli_example.py --help
usage: cli_example.py [-h] [--input_json INPUT_JSON]
                      [--output_json OUTPUT_JSON] [--log_level LOG_LEVEL]
                      [--array ARRAY] [--string_list STRING_LIST]
                      [--int_list INT_LIST] [--nested.a NESTED.A]
                      [--nested.b NESTED.B]

optional arguments:
  -h, --help            show this help message and exit

MySchema:
  --input_json INPUT_JSON
                        file path of input json file
  --output_json OUTPUT_JSON
                        file path to output json file
  --log_level LOG_LEVEL
                        set the logging level of the module (default=ERROR)
  --array ARRAY         my example array (default=[[1, 2, 3], [4, 5, 6]])
  --string_list STRING_LIST
                        list of lists of strings (default=[['hello', 'world'],
                        ['lists!']])
  --int_list INT_LIST   list of ints (default=[1, 2, 3])

nested:
  --nested.a NESTED.A   my first parameter (default=42)
  --nested.b NESTED.B   my boolean (default=True)

We can set some values and observe the output:

$ python cli_example.py --nested.b 0 --string_list "[['foo','bar'],['baz','buz']]"
{'int_list': [1, 2, 3], 'string_list': [['foo', 'bar'], ['baz', 'buz']], 'array': array([[1, 2, 3],
   [4, 5, 6]], dtype=uint8), 'log_level': 'ERROR', 'nested': {'a': 42, 'b': False}}

If we set a field to a value the parser cannot cast (for example, an invalid literal), we will see a casting validation error:

$ python cli_example.py --array [1,foo,3]
Traceback (most recent call last):
  File "cli_example.py", line 25, in <module>
    mod = ArgSchemaParser(schema_type=MySchema)
  ...
marshmallow.exceptions.ValidationError: {
  "array": [
    "Command-line argument can't cast to NumpyArray"
  ]
}

argschema does not support setting Dict at the command line.

Alternate Sources/Sinks¶

JSON files are just one way that you might decide to serialize module parameters or outputs. argschema provides JSON support by default because that is what we use most frequently at the Allen Institute. However, we have generalized the concept so that argschema.ArgSchemaParser can plug in alternative “sources” and “sinks” of dictionary inputs and outputs.

For example, YAML is another reasonable choice for storing nested key-value data. argschema.argschema_parser.ArgSchemaYamlParser demonstrates just that functionality, so input_yaml and output_yaml can be specified instead.

Furthermore, you can pass an ArgSchemaParser an argschema.sources.ArgSource object which implements a get_dict method, and any argschema.ArgSchemaParser will get its input parameters from that dictionary. Importantly, this is true even when the original module author didn’t explicitly support passing parameters from that mechanism, and the parameters will still be deserialized and validated in a uniform manner.

Similarly you can pass an argschema.sources.ArgSink object which implements a put_dict method, and argschema.ArgSchemaParser.output will output the dictionary however that argschema.sources.ArgSink specifies it should.

Finally, both argschema.sources.ArgSource and argschema.sources.ArgSink have a property called ConfigSchema, which is a marshmallow.Schema describing how to deserialize the kwargs passed to its constructor.

For example, the default argschema.sources.json_source.JsonSource has one string field, ‘input_json’. This is how argschema.ArgSchemaParser is told what keys and values should be read to initialize an argschema.sources.ArgSource or argschema.sources.ArgSink instance.

So, for example, if you wanted to define an argschema.sources.ArgSource which loaded a dictionary from a particular host, port and url, and a module which had a command line interface for setting that host, port and url, you could do so like this:

from argschema.sources import ArgSource, ArgSink
from argschema.schemas import DefaultSchema
from argschema.fields import Str, Int
from argschema import ArgSchemaParser
from test_classes import MySchema
import requests
try:
    # Python 3
    from urllib.parse import urlunparse
except ImportError:
    # Python 2
    from urlparse import urlunparse

class UrlSourceConfig(DefaultSchema):
    input_host = Str(required=True, description="host of url")
    input_port = Int(required=False, default=None, description="port of url")
    input_url = Str(required=True, description="location on host of input")
    input_protocol = Str(required=False, default='http', description="url protocol to use")

class UrlSource(ArgSource):
    ConfigSchema = UrlSourceConfig

    def get_dict(self):
        if self.input_port is None:
            netloc = self.input_host
        else:
            netloc = "{}:{}".format(self.input_host, self.input_port)
        url = urlunparse((self.input_protocol, netloc, self.input_url, None, None, None))
        response = requests.get(url)
        return response.json()


class UrlArgSchemaParser(ArgSchemaParser):
    default_configurable_sources = [UrlSource]
    default_schema = MySchema

So now a UrlArgSchemaParser would expect the command-line flags ‘--input_host’ and ‘--input_url’, and optionally ‘--input_port’ and ‘--input_protocol’ (or look for them in input_data), and will download the JSON from that URL via requests. In addition, an existing argschema.ArgSchemaParser module could simply be passed a configured UrlSource via input_source, and it would get its parameters from there.
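
A sink mirrors a source: it receives the output dictionary through put_dict. The sketch below uses a stand-in base class, SinkSketch, so that it runs without argschema installed; with argschema you would subclass argschema.sources.ArgSink and declare a ConfigSchema instead. The class names and the output_path parameter are hypothetical.

```python
import json
import os
import tempfile


class SinkSketch(object):
    """Stand-in for a sink base class: stores config kwargs as attributes."""

    def __init__(self, **config):
        for key, value in config.items():
            setattr(self, key, value)

    def put_dict(self, data):
        raise NotImplementedError


class JsonFileSink(SinkSketch):
    """Hypothetical sink that writes the output dictionary to a JSON file."""

    def put_dict(self, data):
        with open(self.output_path, "w") as handle:
            json.dump(data, handle, indent=2)


output_path = os.path.join(tempfile.mkdtemp(), "output.json")
sink = JsonFileSink(output_path=output_path)
sink.put_dict({"name": "from_dictionary", "inc_array": [5.0, 7.0, 10.0]})

# Read the file back to confirm what the sink wrote.
with open(output_path) as handle:
    written = json.load(handle)
print(written["inc_array"])  # [5.0, 7.0, 10.0]
```

Keeping the destination in configuration rather than in put_dict means the module code calling the sink never needs to know where the output goes.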

Sphinx Documentation¶

argschema comes with an autodocumentation feature for Sphinx which will help you automatically add documentation of your Schemas and argschema.ArgSchemaParser classes to your project. This is how the documentation of the test suite included here was generated.

To configure Sphinx to use this function, you must be using the Sphinx autodoc extension and add the following to your conf.py file:

from argschema.autodoc import process_schemas

def setup(app):
    app.connect('autodoc-process-docstring', process_schemas)
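
For orientation, handlers connected to Sphinx's autodoc-process-docstring event are called with (app, what, name, obj, options, lines) and edit lines in place. The sketch below is a hypothetical handler, append_note, that shows the mechanics only; it is not what process_schemas does.

```python
def append_note(app, what, name, obj, options, lines):
    """Hypothetical handler: append a note to every class docstring."""
    if what == "class":
        # Sphinx reads back the same list object, so mutate it in place.
        lines.extend(["", "This class was documented automatically."])


# Simulate one call the way autodoc would make it (app and options unused here).
docstring_lines = ["My schema."]
append_note(None, "class", "mymodule.MySchema", object, None, docstring_lines)
print(docstring_lines)
# ['My schema.', '', 'This class was documented automatically.']
```

Such a handler would be registered the same way as process_schemas, with app.connect('autodoc-process-docstring', append_note) inside setup.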