Annotator reference

This section provides detailed reference documentation for each component of an annotator. An annotator consists of a python file, a YAML file, a data directory, and a markdown file (optional). The file structure is as follows:

annotator/
    |───annotator.md
    |───annotator.yml
    |───annotator.py
    └───data/

In an actual annotator, the name ‘annotator’ would be substituted for the annotator’s name. Also note that the annotator’s data source (the directiory annotator/data) will be filled with some configuration of data files. More info on the data source can be found at data.

annotator.md

The markdown file describes the module to prospective users. It is not required to run an annotator; however, it is required when publishing an annotator to the OpenCRAVAT store.

annotator.yml

The YAML file defines the input and output interfaces between an annotator and the rest of OpenCRAVAT. The YAML file specifies what data will be fed to annotator.py from OpenCRAVAT, and what data OpenCRAVAT should expect annotator.py to return. Below are the valid keys a developer may use in their YAML file.

Properties

Property

Required

Required to Publish

Description

title

X

X

The name of the module that will be displayed to the user.

version

X

X

The version number of the annotator. It is primarily used when publishing a module, but is required for all modules.

type

X

X

The module type, in this case ‘annotator’.

level

X

X

Either ‘variant’ or ‘gene’.

input_format

The input file type. Accepted values are .crv, .crx, and .crg. When omitted, the default is .crv.

output_columns

X

X

A list of the output columns from the annotator. See Output Columns sub-section for more details.

description

X

A short description of the module’s purpose and use.

developer

X

Information about the developer. Subkeys: name, email, organization, website, citation.

Output Columns

The output_columns property is a YAML list that enumerates the expected keys of the dictionary returned by annotator.py. The preparation of this dictionary is explained in greater detail in annotator.py. Each entry in the output_columns list requires three properties: name, title, and type described in the table below.

Property

Required

Required to Publish

Description

name

X

X

The column’s internal name. Used to identify the column in the output dictionary from annotator.py .

title

X

X

The column’s display name. Used in the final report.

type

X

X

The data-type of that column. Either string, int, or float.

annotator.py

The python module receives input data describing a single variant/gene, and uses it to lookup additional information specific to that annotator. An annotator.py works by extending a provided base class, BaseAnnotator, and implementing three instance methods: setup, annotate, and cleanup.

setup

The setup method executes once before the main loop over the input file. It is normally used to open file-handlers or database connections. More information on accessing the data source can be found in data.

annotate

The annotate method executes once per iteration of the main loop. The annotate method will receive an input_data argument, and possibly an optional secondary_data argument. These arguments represent the input data for a single variant/gene. Both arguments will be python dictionaries, whose format (including presence altogether for secondary_data) is determined by the input_format property of annotate.yml. The following table enumerates the possible keys of input_data, and which keys will be present in relation to the value of input_format.

Key

.crv

.crx

.crg

Description

Example

uid

X

X

An id.

1, 2

chrom

X

X

The chromosome, with prefix ‘chr’. 1-based indexing.

‘chr1’, ‘chr23’, ‘chrX’

pos

X

X

An integer genomic position.

112501307, 104770363

ref_base

X

X

The reference base.

‘A’, ‘GCC’

alt_base

X

X

The alternate base.

‘G’, ‘AT’, ‘-’

hugo

X

X

The gene name

TP53

transcript

X

The predicted primary transcript

ENST00000617 185.4

so

X

X

Most severe sequence ontology

MIS

all_mappings

X

All affected transcripts. Details here

Examples here

num_variants

X

Number of variants on this gene

5

all_so

X

Sequence ontologies and counts for this gene

STL(1),MIS(3 )

OpenCRAVAT expects annotator.py to return a python dictionary. The keys present in this dictionary, and the data-types of their values are both determined by the output_columns property in annotator.yml.

cleanup

The cleanup method executes once after the main loop has finished. It is normally used to close any database connection or file-handlers opened in setup.

data

The sub-directory data contains the data source for the annotator. This can be a flat-data file, a sqlite database, or a combination of multiples data files. To access the data, the developer will open a file-handler or database connection depending on the file type. This should be done in the instance method setup in annotator.py. The developer should then store the opened data-accessor as a self instance property to be accessible during the annotate method.

Note that there is special support for a sqlite database which shares the name of the annotator module. In this case, a database connection and cursor are automatically opened in the BaseConverter of annotate.py. The connection and cursor are stored as self.dbconn and self.cursosr respectively. This functionality is intened to aid a primary use case where the data source is a single sqlite database. A developer can safely overwrite self.dbconn and self.cursor if they wish, albeit at the loss of the automatic functionality.

The developer should close any active database connections or file-handlers during the cleanup method of annotate.py. Automatically opened database connections will also be automatically closed.

Secondary Inputs

Annotators can be piped together so that the output of one annotator can be used in the input of another annotator. This can be useful to create annotators that summarize groups of other annotators, or to use the data from another annotator in a query.

For example, lets say we have data that is indexed on ClinVar IDs. We can make an annotator that depends on the clinvar annotator, then use the ID to lookup our values.

Edit annotator.yml and add a secondary data input.

secondary_inputs:
  clinvar: {}

Now, in the annotate method of annotator.py, the secondary_data argument will sometimes contain data from clinvar.

if secondary_data['clinvar']:
    clinvar_id = secondary_data['clinvar'][0]['id']
else:
    clinvar_id = None

We also want to make sure that users who install our annotator have clinvar installed. Do do this, we need to add an install requirement to our annotator’s config.

requires:
- clinvar

If you need to require certain version of the secondary annotator, you can do so with boolean expressions similar to those in pip install.

clinvar==2.0.0
clinvar>=2.0.0
clinvar<2.0.0

Specifying a version is discouraged unless absolutely needed. OpenCRAVAT has very limited ability to resolve dependency issues between modules.

Table-in-table output

Originally, an output field of an OpenCRAVAT annotator module was supposed to be one of string, integer, and float types. However, from OpenCRAVAT 2.2.1, an output field can contain a table of values. This way, table-in-table output is possible for annotation modules. This feature is useful for organizing complex data. For example, VEST4 annotation module’s “All transcripts” column used to have such a string as “ENST00000612895.4(0.884:0.04118), *ENST00000614428.4(0.928:0.02102), ENST00000617649.4(0.866:0.05418)”. This string contains the VEST score and p-value for three different transcripts for a variant. To get the score and p-value of a specific transcript, parsing the string and extracting the values was necessary. However, the new VEST annotation module which works with OpenCRAVAT 2.2.1 and later has the following data instead of the string: [[ENST00000612895.4, 0.884, 0.04118], [ENST00000614428.4, 0.928, 0.02102], [ENST00000617649.4, 0.866, 0.05418]], which shows the transcript-score-pvaule organization of data much more clearly. This type of data is still stored as string in result databases, but OpenCRAVAT automatically performs the conversion between string and JSON object as it communicates with annotator modules. Thus, in writing an annotation module, the return dictionary of an annotate method can have a dictionary as the value of an output field. No conversion to a JSON string is necessary.

To enable table-in-table output support for an output column, add table: true property to the definition of the column in the module’s configuration yml file. There is another property, table_headers, but this one is optional. With these two new properties, “All annotations” (previously “All transcripts”) column of VEST module is defined as below.

  • name: all

  • title: All annotations

  • type: string

  • table: true

  • table_headers:

    • name: transcript

    • title: Transcript

    • type: string

    • name: score

    • title: Score

    • type: float

  • name: pval

  • title: p-value

  • type: float

When an output column with table data is used by a reporter module, the reporter module will receive a JSON object instead of a string, as OpenCRAVAT does the conversion automatically. In the same way, widget modules will also receive JSON objects instead of strings for output columns with table data. (edited)