Quick Start
xp makes it easy to create and update computational workflows (called pipelines) that always retain a connection to the data that the pipeline produced.
Installation
The easiest way to install xp is from PyPI using pip:
pip install xp
Alternatively, install xp by downloading the source and running:
python setup.py install
or:
sudo python setup.py install
depending on permission settings for your python site-packages.
A pipeline is a sequence of steps (called tasks) that manipulates data. On a very practical level, each task takes some data as input and produces other data as output. In the example below, there are four tasks, each involved in a different part of the workflow.
# Pipeline: cluster_data
DATA_URL=http://foobar:8080/data.tsv
NAME_COLUMN=1
COUNT_COLUMN=2
ALPHA=2
download_data:
    code.sh:
        curl $DATA_URL > data.tsv
extract_columns: download_data
    code.sh:
        cut -f $NAME_COLUMN,$COUNT_COLUMN data.tsv | tail -n +2 > xdata.tsv
    code.py:
        from csv import reader
        # Copy both columns, dropping the first and last character of the second field
        with open('xdata.tsv','r') as fin, open('xdata2.tsv','w') as fout:
            for data in reader(fin,delimiter='\t'):
                fout.write('%s\t%s\n' % (data[0],data[1][1:-1]))
cluster_rows: extract_columns
    code.sh:
        ./cluster.sh --alpha $ALPHA xdata2.tsv > clusters.tsv
plot_clusters: cluster_rows
    code.gnuplot:
        plot "clusters.tsv" using 1:2 title 'Cluster quality'
Tasks can depend on other tasks (e.g., extract_columns depends on download_data), either in the same pipeline or in other pipelines. By making tasks depend on tasks in other pipelines, it’s possible to treat pipelines as modular parts of much larger workflows.
Once a task completes without error, it is marked - which flags it as not needing to be run again. In order to re-run a task, one can simply unmark it and run it again.
If you choose to run a task which has unmarked dependencies, these will be run before the task itself is run - in this way, an entire workflow can be run using a single command.
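The marking behaviour described above can be pictured with a small Python sketch. This is only a conceptual model, not xp's actual implementation: the names run_task, deps, marked, and execute are made up for illustration, and the dependency graph is copied from the cluster_data example.

```python
def run_task(name, deps, marked, execute):
    """Run `name` after any of its unmarked dependencies, then mark it."""
    if name in marked:
        return  # already completed without error; nothing to do
    for dep in deps.get(name, []):
        run_task(dep, deps, marked, execute)
    execute(name)     # if this raises, the task is left unmarked
    marked.add(name)


# Dependency graph of the cluster_data example.
deps = {
    "download_data": [],
    "extract_columns": ["download_data"],
    "cluster_rows": ["extract_columns"],
    "plot_clusters": ["cluster_rows"],
}

marked = {"download_data"}  # pretend this task has already completed
ran = []                    # records the order in which tasks execute
run_task("plot_clusters", deps, marked, ran.append)
print(ran)
```

In this model, asking for plot_clusters runs the three unmarked tasks in dependency order and skips download_data. Removing a name from marked and calling run_task on it again corresponds to unmarking and re-running a task.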
A task contains blocks which describe the actual computational steps being taken. As is seen in the example above, blocks can contain code for various different languages - making it possible to stitch together workflows that involve different languages. A single task can even contain multiple blocks for the same or different languages.
Currently, xp supports four block types:
- export (export): allows environment variables to be set and unset within the context of a specific task
- python (code.py)
- shell (code.sh)
- gnuplot (code.gnuplot)
These, of course, require that the appropriate executables are present on the system. To customize the executable used for python and gnuplot blocks, set the environment variables PYTHON_EXEC and GNUPLOT_EXEC, respectively.
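As a rough illustration of this kind of override, the Python sketch below looks up an executable from the environment and falls back to a default. Only the variable names PYTHON_EXEC and GNUPLOT_EXEC come from the text; the fallback values and the helper resolve_executable are assumptions, and xp's real lookup logic may differ.

```python
import os

# Assumed fallbacks; the defaults xp actually uses are not stated here.
DEFAULT_EXECUTABLES = {"PYTHON_EXEC": "python", "GNUPLOT_EXEC": "gnuplot"}


def resolve_executable(var_name, environ=None):
    """Return the executable named by `var_name`, or the assumed default."""
    environ = os.environ if environ is None else environ
    return environ.get(var_name) or DEFAULT_EXECUTABLES[var_name]


# An explicit setting wins; an unset variable falls back to the default.
print(resolve_executable("PYTHON_EXEC", {"PYTHON_EXEC": "/usr/bin/python3"}))
print(resolve_executable("GNUPLOT_EXEC", {}))
```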
Future releases will support additional languages natively and also provide a plugin mechanism for adding new block types.
Once a pipeline has been written, it can be run using the xp command-line tool:
xp run <pipeline_file>
The command-line tool also allows easy marking (mark), unmarking (unmark), and querying task info (tasks) for a pipeline.
Pipeline-specific Data
A common activity that creates a lot of data management issues is running effectively the same or similar pipelines using different parameter settings: files can get overwritten and, more generally, the user typically loses track of exactly which files came from what setting.
In xp, files produced by a pipeline can be easily bound to their pipeline, eliminating this confusion:
DATA_URL=http://foobar:8080/data.tsv
ALPHA=2
...
download_data:
    code.sh:
        curl $DATA_URL > $PLN(data.tsv)
...
In the excerpt above, the file data.tsv is bound to this pipeline using the $PLN(.) function. In effect, the file is given a pipeline-specific location, either by prefixing its name or by placing it in a pipeline-specific directory. Future references to this file via $PLN(data.tsv) will access only this pipeline’s version of the file, even if many pipelines download files with the same name at various times.
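To make the idea concrete, here is a minimal Python sketch of the two naming strategies mentioned above. The function pln_path, its use_directory flag, and the exact separators are hypothetical; the text does not specify how xp actually names bound files.

```python
def pln_path(pipeline_name, filename, use_directory=False):
    """Return a pipeline-specific path for `filename` (hypothetical naming scheme)."""
    if use_directory:
        # Strategy 1: place the file in a directory named after the pipeline.
        return "%s/%s" % (pipeline_name, filename)
    # Strategy 2: prefix the file name with the pipeline name.
    return "%s_%s" % (pipeline_name, filename)


# Two pipelines binding the same file name end up with distinct files.
print(pln_path("cluster_a2", "data.tsv"))
print(pln_path("cluster_a3", "data.tsv"))
print(pln_path("cluster_a2", "data.tsv", use_directory=True))
```

Either way, the point is the same: the pipeline name becomes part of the file's location, so identical file names in different pipelines never collide.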
Extending Pipelines
In some cases, one will want to run exactly the same pipeline over and over with different parameter settings. To support this, xp allows extending pipelines. Much like subclassing, extending a pipeline brings all the content of one pipeline into another one. Assume we are clustering some data using the process here (in pipeline cluster_pln). The process is parameterized by the alpha value.
# Pipeline: cluster_pln
cluster_rows: extract_columns
    code.sh:
        ./cluster.sh --alpha $ALPHA xdata2.tsv > $PLN(clusters.tsv)
plot_clusters: cluster_rows
    code.gnuplot:
        plot "$PLN(clusters.tsv)" using 1:2 title 'Cluster quality at alpha=$ALPHA'
We can extend this pipeline to retain the same workflow, but use different values:
# Pipeline: cluster_a2
extend cluster_pln
ALPHA=2
and again for a different value:
# Pipeline: cluster_a3
extend cluster_pln
ALPHA=3
Note that in each case the cluster data will be stored to $PLN(clusters.tsv), so that each pipeline will have its own separate stored data.
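The subclassing analogy can be sketched in Python as follows. This is an illustrative model only: the dictionary layout and the extend helper are invented here, and string.Template merely stands in for whatever variable substitution xp performs.

```python
from string import Template


def extend(parent, **overrides):
    """Copy the parent's variables and tasks, then apply the child's settings."""
    variables = dict(parent["variables"])
    variables.update(overrides)
    return {"variables": variables, "tasks": dict(parent["tasks"])}


cluster_pln = {
    "variables": {},
    "tasks": {
        "cluster_rows": "./cluster.sh --alpha $ALPHA xdata2.tsv > $PLN(clusters.tsv)",
    },
}

cluster_a2 = extend(cluster_pln, ALPHA="2")
cluster_a3 = extend(cluster_pln, ALPHA="3")

for pipeline in (cluster_a2, cluster_a3):
    command = Template(pipeline["tasks"]["cluster_rows"])
    # safe_substitute fills in $ALPHA and leaves the unknown $PLN(...) untouched.
    print(command.safe_substitute(pipeline["variables"]))
```

Both children share the parent's task definitions and differ only in the value of ALPHA, which mirrors the cluster_a2 and cluster_a3 pipelines above.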
Connecting Pipelines Together
It’s quite reasonable to expect that one pipeline could feed into another pipeline. xp supports this - pipelines can depend on the tasks in other pipelines - and in doing so, create even larger workflows that retain their nice modular organization.
Suppose that the cluster_a2 pipeline given above is actually assembling the data for a classifier. Let’s break this classifier portion of the project into its own workflow:
# Pipeline: lda_classifier
use cluster_a2 as cdata
NEWS_ARTICLES=articles/*.gz
build_lda: cdata.cluster_rows
    export:
        LDA_CLASSIFIER=lda_runner
    code.sh:
        ${LDA_CLASSIFIER} -input $PLN(cdata,clusters.tsv) -output $PLN(lda_model.json)
label_articles: build_lda
    export:
        LDA_LABELER=/opt/bin/lda_labeler
    code.sh:
        ${LDA_LABELER} -model $PLN(lda_model.json) -data "$NEWS_ARTICLES" > $PLN(news.labels)
In the example above, notice how the task build_lda both depends on a task from the cluster_a2 pipeline and also uses data from that pipeline’s namespace, $PLN(cdata,clusters.tsv).
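One way to picture how the two forms of $PLN resolve is the Python sketch below. It is a sketch under assumptions: the expand_pln helper, the regular expression, and the name-prefix scheme are hypothetical and are not taken from xp's source. The one-argument form refers to the current pipeline; the two-argument form looks up the alias declared with use.

```python
import re

# Matches $PLN(file) and $PLN(alias,file).
PLN_REF = re.compile(r"\$PLN\(\s*(?:(\w+)\s*,\s*)?([^)]+?)\s*\)")


def expand_pln(command, pipeline_name, aliases):
    """Replace each $PLN(...) reference with a pipeline-specific file name."""
    def replace(match):
        alias, filename = match.group(1), match.group(2)
        # No alias means the file belongs to the current pipeline.
        owner = aliases[alias] if alias else pipeline_name
        return "%s_%s" % (owner, filename)  # assumed prefix scheme
    return PLN_REF.sub(replace, command)


command = "lda_runner -input $PLN(cdata,clusters.tsv) -output $PLN(lda_model.json)"
expanded = expand_pln(command, "lda_classifier", {"cdata": "cluster_a2"})
print(expanded)
```

Here the input file resolves into the namespace of cluster_a2, while the output file stays in the namespace of lda_classifier.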
Of course, we might want to try multiple classifiers on the same source data, so we can create other pipelines that use cluster_a2, shown next:
# Pipeline: crf_classifier
use cluster_a2 as cdata
NEWS_ARTICLES=articles/*.gz
build_crf_model: cdata.cluster_rows
    code.sh:
        /opt/bin/build_crf -data $PLN(cdata,clusters.tsv) -output $PLN(crf_model.json)
label_articles: build_crf_model
    code.py:
        import crf_model
        model = crf_model.load_model('$PLN(crf_model.json)')
        model.label_documents(docs='$NEWS_ARTICLES',out_file='$PLN(news.labels)')
Examples
See the examples/ directory in the xp root directory to see some real pipelines that demonstrate the core features of the tool.
Command-line usage
The xp command provides several core capabilities:
- xp tasks <pipeline> will print info about one or more tasks in the pipeline, including whether they are marked
- xp run <pipeline> will run a pipeline (or a task within a pipeline)
- xp mark <pipeline> will mark specific tasks or an entire pipeline
- xp unmark <pipeline> will unmark specific tasks or an entire pipeline
All of these commands provide help messages that describe their correct use.