Annotator tutorial

Writing an Annotator

Method developers and researchers can make their results available by packaging them as an OpenCRAVAT annotator and then publishing them to the OpenCRAVAT Store. Annotators typically include a database of annotations for fast high-throughput analysis of large variant files. The preferred storage mechanism for annotator reference data is sqlite databases, but other formats can be used.

This page provides a tutorial for writing an annotator from start to finish. For detailed documentation on each component of an annotator, see the Annotator Reference.

The completed annotator from this tutorial can be seen at the OpenCRAVAT Modules repository, which also contains the code for most other annotators.

Locating The Modules Directory

To begin writing a new annotator, first locate the path to the modules directory using the command oc config md.

$ oc config md
/PythonRoot/lib/site-packages/cravat/modules

The modules directory contains all OpenCRAVAT modules, split into sub-directories by type. Annotator modules will be found in the sub-directory annotators. This is where a developer will create their own annotator.

Developers can change the modules directory by passing the new directory as an argument to oc config md. The command changes the directory that OpenCRAVAT searches for modules. The command does not move currently installed modules to the new directory, which must be done manually if desired.

$ oc config md /my/custom/modules/location
/my/custom/modules/location
$ oc config md
/my/custom/modules/location

Starting from the Template

In this tutorial, an annotator will be created to display SIFT scores for BRCA1. For each variant, the annotator will return the SIFT score, a prediction of Damaging or Tolerated, and the number of sequences SIFT analyzed at the position.

To start, use oc to create a new annotator.

$ oc new annotator example
annotator example created at /md/example

This will create a few files at /md/annotators/example:

example/
    |───example.md
    |───example.yml
    |───example.py
    └───data/
        └───example.sqlite

Getting data

Most of the work of creating an annotator is usually spent creating a database from the original data source. To save time, we’ve premade example.sqlite from the data at SIFT’s website. Replace the existing example.sqlite with this one.

The database consists of a single table, with one row per single nucleotide variant in BRCA1.

chrom   pos        ref   alt   score   nseq
chr17   43045706   A     T     0.001   24
chr17   43045707   T     A     0.012   24
chr17   43045707   T     C     0.965   24
chr17   43045707   T     G     0.012   24
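For readers who want to see how such a database might be assembled, here is a minimal sketch using Python's built-in sqlite3 module. The table name sift matches the query used later in this tutorial. The column types, the index, and the helper name build_sift_db are assumptions made for illustration, not the schema of the premade file.

```python
import sqlite3

# Rows copied from the table above: (chrom, pos, ref, alt, score, nseq)
ROWS = [
    ("chr17", 43045706, "A", "T", 0.001, 24),
    ("chr17", 43045707, "T", "A", 0.012, 24),
    ("chr17", 43045707, "T", "C", 0.965, 24),
    ("chr17", 43045707, "T", "G", 0.012, 24),
]


def build_sift_db(path):
    """Create a SQLite database with a `sift` table and return the connection.

    Hypothetical helper: the column types and index are assumptions.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "create table sift "
        "(chrom text, pos integer, ref text, alt text, score real, nseq integer)"
    )
    # An index on the lookup columns keeps per-variant queries fast.
    conn.execute("create index sift_idx on sift (chrom, pos, ref, alt)")
    conn.executemany("insert into sift values (?, ?, ?, ?, ?, ?)", ROWS)
    conn.commit()
    return conn


# ":memory:" keeps the sketch self-contained; pass a file path to write to disk.
conn = build_sift_db(":memory:")
row_count = conn.execute("select count(*) from sift").fetchone()[0]
print(row_count)
```

Because annotate runs one lookup per variant, an index covering the columns in the where clause is usually worth its disk space.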

Next, we will edit example.py to query the database and return the data we need.

Querying annotations

In example.py there is a Python class, CravatAnnotator, with three methods: setup, annotate, and cleanup.

from cravat import BaseAnnotator

class CravatAnnotator(BaseAnnotator):

    def setup(self):
        # ... setup code will go here ...
        pass

    def annotate(self, input_data, secondary_data=None):
        # ... annotate code will go here ...
        out = {}
        return out

    def cleanup(self):
        # ... cleanup code will go here ...
        pass

In this tutorial, we will only edit annotate. setup and cleanup are each called once, to open and close connections to data sources. They can be left alone here, because OpenCRAVAT automatically connects to example/data/example.sqlite and creates a database connection, self.dbconn, and a cursor, self.cursor. annotate is called once for each variant.

More detailed descriptions of the uses of each of these methods can be found in the annotator.py detailed reference.

annotate will take three general steps for each variant:

1) Accept input data from OpenCRAVAT describing the variant
2) Query the database for annotations
3) Format and return any annotations

Variants are passed to annotate in the input_data dictionary.

{
    # The internal id of this input line. Seldom used.
    'uid' : 1,
    # The chromosome name
    'chrom' : 'chr10',
    # The genomic position of the first affected nucleotide
    'pos' : 87864486,
    # The reference base(s)
    'ref_base' : 'A',
    # The alternate base(s)
    'alt_base' : 'C',
    # coding or non-coding variant
    'coding': 'Yes',
    # HUGO symbol of the gene of the representative transcript (MANE by default)
    'hugo': 'NOC2L',
    # representative mapped transcript
    'transcript': 'ENST00000327044.7',
    # sequence ontology of the variant consequence on the representative transcript
    'so': 'MIS',
    # cDNA change of the variant on the representative transcript
    'cchange': 'c.2104G>A',
    # protein change of the variant on the representative transcript
    'achange': 'p.Asp702Asn',
    # all genes and transcripts mapped to the variant
    'all_mappings': '{"NOC2L": [["Q9Y3T9", "p.Asp702Asn", "MIS", "ENST00000327044.7", "c.2104G>A"]]}'
}

pos is in the 1-based GRCh38 coordinate system. If the original input is in hg19, the position is converted to hg38 before reaching this point.

Also, coding, hugo, transcript, so, cchange, achange, and all_mappings are available only when input_format: crx is set in the annotator's .yml file.
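Note that all_mappings arrives as a JSON string rather than a dictionary, so it needs to be decoded before use. The sketch below assumes each inner list follows the order seen in the example (protein id, protein change, sequence ontology, transcript, cDNA change). That ordering and the helper name transcripts_by_gene are assumptions inferred from the sample value, not a documented contract.

```python
import json

# A well-formed value shaped like the all_mappings example above.
all_mappings = (
    '{"NOC2L": [["Q9Y3T9", "p.Asp702Asn", "MIS", '
    '"ENST00000327044.7", "c.2104G>A"]]}'
)


def transcripts_by_gene(all_mappings_json):
    """Return {gene: [transcript, ...]} from an all_mappings JSON string."""
    mappings = json.loads(all_mappings_json)
    result = {}
    for gene, hits in mappings.items():
        # Index 3 is assumed to hold the transcript id (see lead-in).
        result[gene] = [hit[3] for hit in hits]
    return result


genes = transcripts_by_gene(all_mappings)
print(genes)
```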

Add code to annotate to extract the variables needed.

chrom = input_data['chrom']
pos = input_data['pos']
ref_base = input_data['ref_base']
alt_base = input_data['alt_base']

Next, create a query and select data from the database.

query = f'select score, nseq from sift where chrom="{chrom}" and pos={pos} and ref="{ref_base}" and alt="{alt_base}";'
self.cursor.execute(query)
result = self.cursor.fetchone()
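As an alternative to building the query with an f-string, sqlite3 supports ? placeholders, which leave quoting to the driver and avoid problems with unusual values in the input. The sketch below runs against a small in-memory stand-in for example.sqlite. The lookup function is a hypothetical helper; inside the annotator the same two lines would use self.cursor.

```python
import sqlite3

# In-memory stand-in for example.sqlite, holding one row from the table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table sift "
    "(chrom text, pos integer, ref text, alt text, score real, nseq integer)"
)
conn.execute("insert into sift values ('chr17', 43045707, 'T', 'C', 0.965, 24)")
cursor = conn.cursor()


def lookup(cursor, chrom, pos, ref_base, alt_base):
    """Fetch (score, nseq) for one variant, or None if it is not in the table."""
    query = "select score, nseq from sift where chrom=? and pos=? and ref=? and alt=?"
    # Values are passed separately, so no manual quoting is needed.
    cursor.execute(query, (chrom, pos, ref_base, alt_base))
    return cursor.fetchone()


hit = lookup(cursor, "chr17", 43045707, "T", "C")
miss = lookup(cursor, "chr17", 43045707, "T", "T")
print(hit, miss)
```

fetchone returns None when no row matches, which lines up with the convention of returning None from annotate for variants without data.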

Finally, format and return the data. You must return data as a dictionary with a key for each output column. If there is no data for a variant, return None. In this case, the prediction column was not stored in the database to save space, so we must compute it from the score.

if result is not None:
    score = result[0]
    num_seq = result[1]
    if score <= 0.05:
        prediction = 'Damaging'
    else:
        prediction = 'Tolerated'
    return {
        'score': score,
        'seq_count': num_seq,
        'prediction': prediction,
    }
else:
    return None

At this point, annotate should look like this.

def annotate(self, input_data, secondary_data=None):
    chrom = input_data['chrom']
    pos = input_data['pos']
    ref_base = input_data['ref_base']
    alt_base = input_data['alt_base']
    query = f'select score, nseq from sift where chrom="{chrom}" and pos={pos} and ref="{ref_base}" and alt="{alt_base}";'
    self.cursor.execute(query)
    result = self.cursor.fetchone()
    if result is not None:
        score = result[0]
        num_seq = result[1]
        if score <= 0.05:
            prediction = 'Damaging'
        else:
            prediction = 'Tolerated'
        return {
            'score': score,
            'seq_count': num_seq,
            'prediction': prediction,
        }
    else:
        return None

Before we run the annotator, we need to tell OpenCRAVAT how to interpret and display the results. We do this in the config file example.yml.

Displaying results

The annotator config file tells OpenCRAVAT what columns to expect from the annotate method, and how to display them in the results. It also contains display hints and metadata for the annotator itself, and attribution to the original data source.

The config file is written in YAML, a more readable alternative to JSON that maps naturally onto Python dictionaries.

To start, make a few edits to the parts that describe the annotator itself. Be sure to edit the existing lines in the yml rather than adding new ones.

title: Example (SIFT BRCA1)
version: 1.0.0
description: Example annotator. BRCA1 scores from SIFT, a variant effect predictor.

Next, replace the output_columns section with this.

output_columns:
- name: prediction
  title: Prediction
  type: string
- name: score
  title: Score
  type: float
- name: seq_count
  title: Seqs at Position
  type: int

Three keys are needed to describe each column:

- name is the internal identifier of the column. It must exactly match one of the keys in the dictionary returned from annotate. Column names should only include lowercase letters, numbers, and underscores. Names cannot have two underscores in a row, and cannot start with a number.
- title is the display name of the column. It will be shown in place of the name in reports whenever possible.
- type is the type of the column data. Choose from string, float, or int.
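The naming rules for name are easy to check mechanically before running the annotator. The following is a minimal sketch of such a check. The function is_valid_column_name is a hypothetical helper written for this tutorial, not part of OpenCRAVAT, and it encodes only the rules stated above.

```python
import re

# First character: lowercase letter or underscore (never a digit).
# Remaining characters: lowercase letters, digits, or underscores.
_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")


def is_valid_column_name(name):
    """Check a column name against the rules described in this section."""
    # "__" is rejected separately: names cannot have two underscores in a row.
    return bool(_NAME_RE.fullmatch(name)) and "__" not in name


checks = {
    name: is_valid_column_name(name)
    for name in ["seq_count", "2score", "seq__count", "Score"]
}
print(checks)
```

Of the four names tried here, only seq_count passes: 2score starts with a number, seq__count has two underscores in a row, and Score contains an uppercase letter.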

Many more keys can be added to output columns to change their behavior in reports. Three are worth including in this annotator. Edit the yml again so that it shows:

output_columns:
- name: prediction
  title: Prediction
  type: string
  desc: Tolerated if Score > 0.05. Damaging if Score <= 0.05
  width: 70
- name: score
  title: Score
  type: float
  desc: Ranges from 0 to 1
- name: seq_count
  title: Seqs at Position
  type: int
  desc: Number of sequences scored by SIFT at this position
  width: 60
  hidden: true

The desc key is a longer description of a column. It shows up when the mouse hovers over a column in the GUI. The width key controls the width of the column in the GUI. It is measured in CSS pixels. Finally, hidden: true will hide a column by default in the GUI. To conserve space, most annotators should only show 3 or fewer default columns.

A full list of accepted and required config properties can be found in the annotator.yml section of the Annotator Reference documentation.

Running the annotator

At this point, the annotator should have everything it needs to run. The example input file, input.vcf, contains a few pathogenic and tolerated BRCA1 variants, and one variant not on BRCA1. Run it with oc run input.vcf -a example and view the output with oc gui input.vcf.sqlite.

The sqlite database

After all annotators are finished, OpenCRAVAT aggregates all annotations into a sqlite database. It can be helpful to know how to find your annotator's output in the database.

Variant level annotations are written to a table called variant. The column names are made by combining the annotator name and the column name with a double underscore. So, for our annotator, the database columns are called example__score, example__prediction, and example__seq_count.

The config for each output column is written to the variant_header table, and the config data for the annotator is written to the variant_annotator table.
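The result database can also be read from Python with the built-in sqlite3 module. The sketch below builds the column names using the double underscore convention described above and selects them from the variant table. To stay runnable on its own, it queries a small in-memory stand-in rather than a real input.vcf.sqlite; fetch_annotations is a hypothetical helper and the stand-in row is illustrative.

```python
import sqlite3


def fetch_annotations(conn, annotator="example",
                      columns=("score", "prediction", "seq_count")):
    """Select one annotator's columns from the variant table."""
    # Database column names are <annotator>__<column>.
    names = [f"{annotator}__{col}" for col in columns]
    query = f"select {', '.join(names)} from variant"
    return conn.execute(query).fetchall()


# Stand-in for input.vcf.sqlite; a real result file has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table variant "
    "(example__score real, example__prediction text, example__seq_count integer)"
)
conn.execute("insert into variant values (0.004, 'Damaging', 26)")

rows = fetch_annotations(conn)
print(rows)
```

To read an actual result file, replace the stand-in connection with sqlite3.connect("input.vcf.sqlite").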

DB Browser for SQLite is an excellent cross-platform GUI for reading sqlite files.

Debugging

Finding Errors

When oc runs, two log files are created: input.vcf.log and input.vcf.err. Exceptions raised by example.py will show up in these two places. The traceback is put in .log, and the variant causing the exception is put in .err. If the same exception occurs again, the traceback is not written to .log a second time, but .err still lists every variant that caused an exception.

Raw annotator output

Remove any output files from a previous run, and run oc again with the --temp-files flag. This will keep temporary files around after the job finishes.

rm input.vcf.*
oc run input.vcf -a example --temp-files

There should be a file called input.vcf.crv.example.var. This is the raw output of the example annotator. It includes some header lines with information from the module config, and tab separated data lines.

#name=example
#displayname=Example (SIFT BRCA1)
#version=1.0.0
#column={"index": 0, "name": "uid", "title": "UID", "type": "int",...
#column={"index": 1, "name": "prediction", "title": "Prediction", ...
#column={"index": 2, "name": "score", "title": "Score", "type": ...
#column={"index": 3, "name": "seq_count", "title": "Seqs at ...
#no_aggregate=
#UID    Prediction  Score   Seqs at Position
2   Damaging    0.004   26
3   Tolerated   1.0 18
4   Damaging    0.0 18
5   Damaging    0.0 18
6   Tolerated   0.128   17
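When inspecting raw output, it can be convenient to load the file into Python. The sketch below assumes the layout shown above: header lines start with #, each #column= line carries a JSON object with a name key, and data lines are tab separated. The sample text is a shortened stand-in with simplified column headers (the real ones carry more keys), and parse_var is a hypothetical helper.

```python
import json

# Shortened stand-in for input.vcf.crv.example.var (column JSON simplified).
SAMPLE = "\n".join([
    "#name=example",
    "#version=1.0.0",
    '#column={"index": 0, "name": "uid", "title": "UID", "type": "int"}',
    '#column={"index": 1, "name": "prediction", "title": "Prediction", "type": "string"}',
    '#column={"index": 2, "name": "score", "title": "Score", "type": "float"}',
    '#column={"index": 3, "name": "seq_count", "title": "Seqs at Position", "type": "int"}',
    "#UID\tPrediction\tScore\tSeqs at Position",
    "2\tDamaging\t0.004\t26",
    "3\tTolerated\t1.0\t18",
])


def parse_var(text):
    """Return data lines as dictionaries keyed by column name (values stay strings)."""
    columns, rows = [], []
    for line in text.splitlines():
        if line.startswith("#column="):
            # Each column header is a JSON object; keep its internal name.
            columns.append(json.loads(line[len("#column="):])["name"])
        elif line.startswith("#") or not line.strip():
            continue  # other header lines and blank lines
        else:
            rows.append(dict(zip(columns, line.split("\t"))))
    return rows


rows = parse_var(SAMPLE)
print(rows[0])
```

For a real file, pass the contents of input.vcf.crv.example.var to parse_var instead of SAMPLE.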

Running directly

It’s possible to run an annotator without running all of OpenCRAVAT. Clean the working directory, then run oc but end at the mapper stage.

rm input.vcf.*
oc run input.vcf --endat mapper

At this point, there is a file, input.vcf.crv that contains all of the variants in your input file. You can pass this file to the annotator to create input.vcf.crv.example.var directly.

python3 md/annotators/example/example.py input.vcf.crv

When run this way, the .log and .err files will be input.vcf.crv.log and input.vcf.crv.err.

This method can be used to run annotators with debuggers in most IDEs like VSCode, Spyder, or Jupyter.