_images/gFACs_logo.jpg

Introduction to gFACs

gFACs is a filtering, analysis, and conversion tool to unify genome annotations across alignment and gene prediction frameworks. It was developed by Madison Caballero and Dr. Jill Wegrzyn of the Plant Computational Genomics Lab at the University of Connecticut.

Find gFACs on GitLab.

Version 1.1.1 - 11/12/2019

How to cite:

Caballero M. and Wegrzyn J. 2019. gFACs: Gene Filtering, Analysis, and Conversion to Unify Genome Annotations Across Alignment and Gene Prediction Frameworks. Genomics_ Proteomics & Bioinformatics 17: 305-10

Comments? Questions? Email me at Madison.Caballero@uconn.edu

_images/flow_chart_gFACs.jpg

Installation

Installing gFACs is quite easy since this program operates almost exclusively through perl scripts. So long as the files are copied over, basic gFACs should be operational so long as perl and bioperl libraries are available. gFACs is also very light so feel free to have multiple copies wherever you need it!

You can find gFACs on GitLab.

In order to download gFACs, follow these basic steps:

  1. Obtain the directory for gFACs from GitLab in whatever format you prefer:
_images/Download_1.JPG
  1. Place the entire folder onto your system and extract components. You should have a screen that looks like this:
_images/Download_2.JPG
  1. To test to see if it works, perform the simple command of:

    perl gFACs.pl

The command line manual should appear. If so, congratulations! If not, a common error that occurs is that bioperl libraries are not loaded. If you have issues, feel free to send an issue request on GitLab!

Supported Input Formats

There are many different programs that utilize a gff or gtf style format. Each has its own inclusion rules and formatting guidelines. Therefore, a true universal gff/gff3/gtf file converter that requires no user input may be an impossible task. gFACs requires the user to define the application source and version. If you do not find you input format, look into the support script format_diagnosis.pl. If your file is made up of many different formats merged together, I suggest breaking it apart. If you want a format added (especially if it is a well-known one) let me know and I will likely create one!

The format is specified in the gFACs command by a -f [code] flag. It is a mandatory flag and the code will fail without it. These codes are listed out below with notes and can also be seen in the command line manual. For an example of the command with a proper format flag, see any of the sample runs!

BRAKER:
braker_2.05_gtf
braker_2.05_gff
braker_2.05_gff3
braker_2.0_gff3
braker_2.0_gff
braker_2.0_gtf
braker_2.1.2_gtf

MAKER:
maker_2.31.9_gff

Maker may provide other information such as blastx and protein2genome information. Currently, only maker models of genes and exons will be considered.

PROKKA:
prokka_1.11_gff

GMAP:
gmap_2017_03_17_gff3

GENOMETHREADER:
genomethreader_1.6.6_gff3

STRINGTIE:
stringtie_1.3.4_gtf

GFFREAD:
gffread_0.9.12_gff3

EXONERATE:
exonerate_2.4.0_gff

EVIDENCE MODELER:
EVM_1.1.1_gff3

CoGe:
CoGe_1.0_gff

GFACS:
gFACs_gene_table
gFACs_gtf

You can input a gene table from gFACs, any version. However, the prefix on the input will NOT be retained.

NCBI:
refseq_gff - only CDS taken.

For those who are curious, each format has a special conversion script that transitions the input into the gene table. These are the scripts found in the format_scripts folder that comes along with gFACs. If you are feeling adventurous, you can make your own conversion script that creates the gene table and simply run gFACs with the gene table format code.

_images/Format.JPG

About

gFACs was created during the annotation of the megagenome of loblolly pine. In dealing with gene models created from different softwares and alignment tools, we needed a way to filter and merge these models. Unfortunately, no such system existed so gFACs was designed and developed to fill this niche. The applications that generate gene model evidence include aligners and ab initio gene prediction software. These programs report their predictions and alignments in a similar structured gene transfer format (gtf) or general file format (gff) however there is little consistency across these standards. You can read more about the general structures here. gFACs will filter and select final gene models based upon user provided filters regarding their structural attributes. In addition, gFACs can optionally consider functional annotation from the EnTAP application as an additional filter to define true models.

The flow of gFACs.pl is controlled by the master script gFACs.pl. Flags and input files are processed by the master script, and a series of task-specific scripts are called upon to edit and filter gene or alignment models.

_images/Main_flow.JPG

The primary input is an annotation file, either in gene transfer format (gtf) or general feature format (gff/gff3). Since these file types are variable across the applications that generate them, formats designed to fit a particular software’s output must be created. Specific scripts in the folder format_scripts/ are used to convert the input into a median file type called gene_table.txt.The user is able to input their specific file type but they must inform gFACs about the format of the file. For specific formats and how to provide this information, see the supported input section on it.

_images/Format.JPG

The gene table

gene_table.txt (referred to also as the gene table) is specfiic to gFACs and is created to hold the minimum amount of information needed to uniformly apply the filtering options.

The gene table is the most important file for this program as it is used and edited in every step. Each flag or task in evaluation needs the format of the gene table to work successfully. The gene table will always have the gene models or alignments that are retained. Here is an example of gene table format:

_images/gene_table.png
The columns go:
  1. Gene part
  2. Length
  3. Start
  4. Stop
  5. Strand
  6. ID (8th column from input file)
  7. Scaffold/chromosome (needed for fasta commands)

Further scripts expect the gene table to be in the output directory called gene_table.txt. If you are using a prefix, it will look for the file with the prefix. The task-scripts also create their own files, notably if things are being separated, such as potential splice variants. However, the retained genes will always be renamed or concatenated into the master gene table.

Intron prediction

The gene table provides intron information that is not always found in the input file. Even if the input format does provide introns, they will be recalculated based on the positions of predicted exons. In the format step, a temporary file is created that will terminated upon step completion. The purpose is to add a divider between gene families. The ### line in the gene table allows for a clear break between different genes. The script then revisits the temp file, calculates introns, and makes final formatting shifts.

Here is an example of the temporary file:

_images/temp.png

Introns are the sequence between exons. Lengths and start and stop coordinates are calculated based on exon information. To accomplish this, exon lengths are pushed into an array and called by position. This method is more universally reliable but prone to errors involving overlapping exons. This can be resolved with a flag –splice-rescue which I recommend.

_images/introns.png

Task scripts

Once the formatting has completed, and the gene table has been created, filtering as designated by flags is done cyclically on the gene table until the final set of genes is produced. The order of filtering flags is pre-determined although this does not change the final result. For example, whether or not monoexonic genes are removed first or last will not change how many monoexonics appear in the final iteration of the gene table (spoiler alert: none).

Once all filtering is done, other commands are activated that involve processing or analyzing the final sets. This includes statistics, analysis of splice types, or distributions.

_images/cycle.JPG

The log file

In addition to the gene table, the gFACs_log.txt file is created every time the script is run, no exceptions. If a prefix is included, the log will have this prefix. The log file is reported by the master script and is appended with information regarding filtering at each step. It will also report what flags are being activated and the very specific corresponding system commands.

The log may be helpful for the user to see what is happening and the results of a particular filter. It is also helpful for noticing bugs and verifying script efficacy. gFACs supports readable and understandable log files!

Filter flags

As the title suggests, these flags edit or remove content from the input file. Flag use depends on whether additional resources are provided such as fasta file or EnTAP annotation file.

Basic flags

These are flags that can be run on any input without the addition of a fasta or EnTAP file. Given that, these flags are not capable of performing sequence level analysis. You can include as many flags as you want in any order. However, the order in which the flags are run is predetermined by the gFACs.pl script. This section is designed to tell you what the flags do conceptually. This is not the true order. See the log file for the order as the log is printed in sequence.

-p [prefix]

All files created will have your designated prefix. For example, if you provide the prefix “test”, your gene table will be called test_gene_table.txt. Every file (even temporary files) will have this prefix. However, it is not a mandatory argument.

--false-run

This flag allows for the user to understand the effect of filters on the original input without an additive effect. Before committing to a gFACs run, it may be important to understand what gFACs is planning on removing from the original set of models. When gFACs filters, it removes sequentially from a shrinking pool of models. However, it may be important to know how many gene models would be removed from the original set. How many overall do not have a stop codon? How many overall are of a certain CDS length?

This flag allows gFACs to run normally but always resets to the original set of genes in the gene table. Therefore, the log will print what is removed and retained but never follows through with filtering. Resulting files will be the gene table completely unedited and the log file will reflect resetting numbers.

NOTE: This flag does NOT keep the results of splice rescue nor unique genes only. This is problematic with overlap when isoforms or multiple RNA evidence models are within the same called gene. Sequence pull filters like in-frame stop codon, fasta creation, or analysis can then be wrong. For example, when setting in-frame stops to 0, all genes with isoforms will print one after another in a long sequence string that will have as many stop codons as models presented. This may create the appearance of increased removal of models. To work around this problem, run classic gFACs on your set using --splice-rescue and --unique-genes-only. Then use --false-run on that parsed gene table output. Feel free to contact me if you have problems!

--no-processing

The following occurs by DEFAULT. Use –no-processing to opt out.

Processing involves two scripts: task_scripts/overlapping_exons.pl and task_scripts/splice_variants.pl. The first analyzes the gene table for overlapping exon space and then the second analyzes genes with overlapping exon space for evidence that these are separate transcripts or models.

Take this particular gene for example:

_images/overlap_1.png

Although three exons are called, they overlap as the third is supported by a different parent transcript. The first two claim an intron while the third spans over that particular intron like this:

_images/overlap_2.png

Intron prediction, when not taking into account that these are separate models of the gene space, would be wrong. To solve this, genes that show exon overlap are separated out into their own file. Then they are evaluated.

_images/overlap_3.png

The way splice variants are confirmed is due to the labeling on the 6th column in the gene table. If an exon has a different transcript ID (often labeled t1, t2, t3…) then the exon is separated into its own “batch”. What was originally “gene22” with three separate transcripts then becomes gene22.1, gene22.2, and gene22.3.

_images/overlap_4.png

Introns are then recalculated for each of the separate isoforms. If some of the models are incomplete, they can be filtered out with other flags since they are now treated as separate “genes”. (This is because they are separated by a ### partition).

Passing genes will remain in the gene table. Results of this filter are printed in the log.

To resolve transcripts back to unique genes (selecting largest, if available, or first transcript) can be done with the --unique-genes-only flag.

Following this step, another script called task_scripts/gene_table_fixer.pl is implemented. This is done to check that exons and introns are in positive strand ascending order. Without it, printing the fastas will be problematic. It is done regardless of whether or not you use splice rescue.

--no-gene-redefining

The following occurs by DEFAULT. Use –no-gene-redefining to opt out. Genes that are incomplete in the annotation such that it appears they start or end with an intron are labeled as:

COMP : Complete. Note this does NOT mean there is a start and in-frame stop codon. 3*_*INC : 3’ incomplete 5*_*INC : 5’ incomplete 3*_*INC+5*_*INC : Both gene ends are incomplete

Incompleteness checks do NOT require a fasta and therefore do not judge incompleteness based on codon content. For examples of this, see the filters directly below for removing incompletes.

Genes will be trimmed to reflect available CDS in the first step following format conversion but incomplete labels will remain to identify the modification made. This is opt-out using --no-gene-redefining. All filters will be applied normally regardless of choice to trim. Note that a gene labeled as a 3’ incomplete will always fail a start codon check even if the first three nucleotides of the trimmed gene is an ATG (or alternate). Same for 5’ incompletes and stop codon checks.

--rem-3prime-incompletes

This option is controlled by the script task_scripts/remove_starting_introns.pl. It is designed to pull out genes that start with “intronic” space. This is almost always because of missing evidence due to missing scaffold sequence. Genes are automatically trimmed unless –no-gene-redefining is used.

Take this for example:

_images/rem_start.JPG

The very first exon begins 86 nucleotides into the proposed gene space. You can also see that the gene “begins” at position 1 in super417. This particular model came from BRAKER 2.0.

This script finds genes where gene start and first exon start do not agree. Passing genes will remain in the gene table. Results of this filter are printed in the log.

NOTE: This script takes into account directionality. Meaning a positive strand gene without a start intron would occur at the start of the gene (all gene coordinates are positive stranded). In a negative strand gene, missing the start intron would occur at the end of the gene.

NOTE: This filter does not take into acount sequence. To remove all incompletes with codons in mind, combine with –rem-genes-without-start-codon.

--rem-5prime-incompletes

This option is controlled by the script task_scripts/remove_ending_introns.pl. It is nearly identical to removing start and also takes into account strandedness and directionality. For another example in the exact same BRAKER 2.0 gene table:

_images/rem_end.JPG

Here you can see that the last exon ends at 1887 but the gene claims to end at 4318.

Sometimes this occurs because of the scaffold ending as before, but further fasta involvement can find incomplete genes due to codon evidence.

Passing genes will remain in the gene table. Results of this filter are printed in the log.

NOTE: This filter does not take into acount sequence. To remove all incompletes with codons in mind, combine with –rem-genes-without-stop-codon.

--rem-5prime-3prime-incompletes

Removes genes that are BOTH 5’ and 3’ incomplete (3_INC+5_INC). Genes that are 3’ incomplete but 5’ complete and vice versa are kept in the gene table. Results of this filter are printed in the log.

--rem-all-incompletes

Performs the tasks of --rem-3prime-incompletes and --rem-5prime-incompletes. See above for what each task does individually. Passing genes will remain in the gene table. Results and commands of this filter are printed in the log as if each command was run separately.

--rem-monoexonics

Removes monoexonics based off the presence of introns. All multiexonic genes then remain in the gene table. The script that does this command is task_scripts/remove_monoexonics.pl.

--rem-multiexonics

Remove multiexonics based off the presence of introns. All monoexonics genes then remain in the gene table. The script that does this command is task_scripts/remove_multiexonics.pl.

--min-exon-size [number]

Default: 20

Creates a filter to remove genes with exons below a certain size. The script that performs this command is task_scripts/minimum_exon.pl. If you do not provide a following number, 20 is used as a benchmark for an exon that is suspiciously too small.

Passing genes will remain in the gene table. Results of this filter are printed in the log.

--min-intron-size [number]

Default: 20

Creates a filter to remove genes with introns below a certain size. The script that performs this command is task_scripts/minimum_intron.pl. If you do not provide a following number, 20 is used as a benchmark for an intron that is suspiciously too small. However, it might be technically possible to have introns that are less than 20 nucleotides.

Passing genes will remain in the gene table. Results of this filter are printed in the log.

--min-CDS-size [number]

Default: 74

Creates a filter to remove genes with a coding sequence (CDS) below a certain nucleotide length. Introns do not count, only exon sequence size. The default is based off the smallest known gene and will be used if no input is provided.

Passing genes will remain in the gene table. Results of this filter are printed in the log.

--unique-genes-only

This option will collapse directly overlapping genes and resolve transcripts created using -–splice-rescue.

When using --splice-rescue, multiple transcripts are created that represent the same gene. They may be isoforms of one gene or the exact same gene model repeated due to multiple pieces of evidence. Since separation treats each transcript as if it was its own gene for statistics and file-creation steps, this step will return only unique genes.

This is done by the script task_scripts/unique_genes.pl. It separates out transcripts denoted by their .1, .2, etc… modification. For those representing the same gene, the largest transcript is selected if available. Otherwise it will just take the first one.

When not dealing with transcripts, if two separate genes with different IDs share the exact same space, the first one numerically will be chosen. This only affects genes with 100% overlap where each gene is the same size and starts and ends at the same coordinates.

This step is done before any outputs are created such as statistics, fastas, output tables, or gtf files. Unique genes will remain in the gene table. Results of this filter including genes in, transcripts present, unique transcripts, non-transcript duplicates, lost, and final returned genes are printed in the log.

EnTAP required flags

--entap-annotation /path/to/your/final_annotation.tsv

Provide the path to the output of the protein annotation. The first column should be the name of a gene that matches the gene name in the gene table. It will run with other formats but I encourage using a gFACs format input (see FAQ for EnTAP run details). Use gFACs to filter an original annotation, functionally annotate with EnTAP, then use the gFACs output and EnTAP output to filter again.

All versions of EnTAP (including future versions) should be compatible.

If issues arise, contact me at the gFACs GitLab.

The annotation should look something like this:

_images/entap.png

--annotated-all-genes-only

Only genes that have an associated similarity search OR EggNOG annotation are kept. Done by the script task_scripts/annotated_all_genes_only.pl. Passing genes will remain in the gene table. Results of this filter are printed in the log.

--annotated-ss-genes-only

Only genes that have an associated similarity search annotation are kept. Done by the script task_scripts/annotated_ss_genes_only.pl. Passing genes will remain in the gene table. Results of this filter are printed in the log.

Fasta required flags

These scripts require that there is a fasta, because sequence is being evaluated. The gFACs.pl script will index your fasta, and then task scripts that require sequence will find and use that index. If there is already an index, the indexing step will be skipped.

To specify a fasta:

--fasta /path/to/your/nucleotide/fasta.fasta

Bioperl will create an index with the ending “.fasta.idx”. It is a fairly fast process. The file may end with .fa or .fasta, but no other naming formats can be recognized.

NOTE: This fasta MUST MUST MUST be the same fasta used when making your particular gff3/gtf/gff. Bioperl needs to recognize the name on the fasta info line to the sixth column in the gene table.

--canonical-only

Analyzes introns for a canonical splice sites (GT-AG on the positive strand). The script that performs this task is task_scripts/canonical_only.pl.

To pass, all introns in a gene must have canonical splice sites. Monoexonics will also pass this filter because they do not have the evidence to be pulled out. Of course, monoexonic genes can be removed by –rem-monoexonics filter.

Genes that pass this filter are kept in the gene table and results are printed in the log.

NOTE: Splice sites take into account directionality and reverse compliment.

--rem-genes-without-start-codon

The first three nucleotides of the sequence are analyzed to match ATG. With this flag alone, no alternate start codons are taken into account (use --allow-alternate-starts to include alternate starts). This task is performed by task_scripts/rem_genes_without_start.pl. Again, gene directionality is considered.

NOTE: Genes marked 5’ incomplete are assumed NOT to have a start codon and are removed regardless if the gene starts with a Met.

Genes that pass this filter are kept in the gene table and results are printed in the log.

--allow-alternate-starts

If --rem-genes-without-start-codon is used, the start codons of GTG and TTG will also be included alongside ATG. This may be useful in Prokaryotic annotations. This task is performed by task_scripts/rem_genes_without_start_alternate.pl.

Outputs are identical to --rem-genes-without-start-codon.

--rem-genes-without-stop-codon

The last three nucleotides of the sequence are analyzed to match TAA, TAG, and TGA. Currently, all end codons are assumed to be within the reported gene. This task is performed by task_scripts/rem_genes_without_stop.pl. Again, gene directionality is considered.

Following this step, a script called task_scripts/frame_detection.pl is run. It is designed to pick out any genes that technically have a stop codon as the last three nucleotides, but it is not real because the codon is actually out of frame. These are rare occurrences, often happening on negative strand genes that run into the beginning of a scaffold where the first three nucleotides of the scaffold are a reverse complement stop codon. To solve this, any gene whose CDS is not divisible by 3, is removed.

NOTE: Genes marked 3’ incomplete are assumed NOT to have a stop codon and are removed regardless if the gene has a terminating in-frame stop.

Genes that pass this filter are kept in the gene table and results are printed in the log.

--rem-genes-without-start-and-stop-codon

Removes genes that lack BOTH a start and stop codon. Genes that have a start codon but no stop and vice versa are kept in the gene table. Results of this filter are printed in the log.

--allowed-inframe-stop-codons [number]

Default: 0

Creates a filter that removes genes based on the presence of a stop codon that is not the last codon in the gene. For example, setting this parameter as 1 will allow one other stop codon between the methionine and the terminating stop codon.

If you are not filtering for start and stop codons, this will still work so long as there are stop codons within the amino acid sequence but not necessarily at the end.

Genes that pass this filter are kept in the gene table and results are printed in the log.

--splice-table

To understand splice usage, a splice-site table is printed to the log that tells the frequency of every type of splice site used. This command is performed by task_scripts/splice_table.pl.

The splice table will look something like this:

_images/splice_table.png

This splice table will show you everything present in the file adjusted to lower case letters including N-bases. If you specify canonical genes only, the table will only show you gt_ag counts.

--nt-content

The CDS (all exon sequences) is analyzes for GC, AT, and N content by percent composition. This information is printed to the log. Here is an example of the output:

_images/nt_content.png

Output flags

Basic output flags

These are output flags that do not require the input of any fasta or EnTAP files.

--statistics

Statistics will be run on the gene table and printed to statistics.txt. This command is performed by task_scripts/classic_stats.pl. If a prefix is used, the statistics file will be named accordingly.

These are all the potential statistics in the reported format:

Number of genes:
Number of monoexonic genes:
Number of multiexonic genes:

Number of positive strand genes:
Monoexonic:
Multiexonic:

Number of negative strand genes:
Monoexonic:
Multiexonic:

Average overall gene size:
Median overall gene size:
Average overall CDS size:
Median overall CDS size:
Average overall exon size:
Median overall exon size:

Average size of monoexonic genes:
Median size of monoexonic genes:
Largest monoexonic gene:
Smallest monoexonic gene:

Average size of multiexonic genes:
Median size of multiexonic genes:
Largest multiexonic gene:
Smallest multiexonic gene:

Average size of multiexonic CDS:
Median size of multiexonic CDS:
Largest multiexonic CDS:
Smallest multiexonic CDS:

Average size of multiexonic exons:
Median size of multiexonic exons:
Average size of multiexonic introns:
Median size of multiexonic introns:

Average number of exons per multiexonic gene:
Median number of exons per multiexonic gene:
Largest multiexonic exon:
Smallest multiexonic exon:
Most exons in one gene:

Average number of introns per multiexonic gene:
Median number of introns per multiexonic gene:
Largest intron:
Smallest intron:

The following columns do not involve codons:
Number of complete models:
Number of 5’ only incomplete models:
Number of 3’ only incomplete models:
Number of 5’ and 3’ incomplete models:

If your set is only monoexonics, a smaller version of the statistics will be printed that only contain the categories where monoexonic genes are evaluated.

--statistics-at-every-step

A statistical analysis of the gene table is run following every filtering step. This information is in the same format as regular --statistics but prints to the log following the information line for each flag. To ensure statistics.txt is created at the end, make sure to include -–statistics in your command.

--create-simple-gtf

Identical to --create-gtf, but lacks start and stop codon information. This option is significantly faster.

--create-gff3

An Ensembl v3 gff3 gff3 will be created that contains mRNA, exon, and intron information. ID, Name, and Parent information will be shown.

Fasta required output flags

In order to create fasta required outputs, you will need to provide a fasta input. See how to here. If a proper fasta is provided, you unlock all these flags:

--get-fasta-with-introns

The nucleotide fasta sequence is printed to genes_with_introns.fasta. The genes are always printed on the positive strand. This fasta will contain the intron sequences so number of genes printed will be the same for both with and without introns. The header for each sequence is the fifth column of the gene line in the gene table.

This command is performed by task_scripts/get_fasta_with_introns.pl. If a prefix is specified, the output fasta will be named accordingly.

--get-fasta-without-introns

The nucleotide fasta sequence is printed to genes_without_introns.fasta. The genes are always printed on the positive strand. This fasta will not contain the intron sequences. The header for each sequence is the fifth column of the gene line in the gene table.

This command is performed by task_scripts/get_fasta_with_introns.pl. If a prefix is specified, the output fasta will be named accordingly.

--get-protein-fasta

A protein fasta of the genes is created called genes_without_introns.fasta.faa. Genes never include the introns (because of course not). All genes are printed in the N-terminus to C-terminus orientation (so M would be first) but reverse complementation of the negative strand is considered to choose the correct amino acids. Stop codons are depicted as *. The header for each sequence is the fifth column of the gene line in the gene table.

This command is performed by task_scripts/get_protein_fasta.pl. If a prefix is specified, the output fasta will be named accordingly.

--create-gtf

A gtf file called out.gtf is created. If a prefix is specified, the gtf file will have it. This step is done with two scripts, task_scripts/add_start_stop_to_gene_table.pl and task_scripts/gtf_creator.pl.

Since GTF files (as a general rule) require start and stop codon information, the locations of the start and stop codon (if found) are added to the gene table and the final gtf. CDS scores that correspond to an exon are retrieved from the original input file if found and the “exon” attribute is returned to “CDS”. Introns currently remain.

The source line does say gFACs. Not to steal the credit, it just might be helpful to know where the information is coming from particularly after filtering and rearranging.

Distributions flags

gFACs is capable of reformatting annotations into formats for distributions. It can provide distribution summaries or raw data. To signify distributions, a single flag followed by options may be used:

--distributions [option] [option] …

Activates the ability to create distributions. This task is always done last on the final version of the gene table. If a prefix is specified, all output files will reflect that.

All outputs are printed in a .tsv file that can be opened for viewing on excel or R. The options available for distributions are as follow:

exon_lengths

Creates the file exon_lengths_distributions.tsv. In it, a range of exon lengths and the corresponding representation is printed. In this example, --min-exon size was set to 40, which is reflected in the numbers:

_images/exon_length.png

The above data, when rendered into a histogram using R, looks like this:

_images/exon_dist_1.png

Notice that the curve is bimodal, which is indicative of the mono and multiexonic genes. Utilizing two runs one with --rem-monoexonics (red) and one with --rem-multiexonics (yellow) you can see the curves are indeed the difference in gene type where smaller exon lengths are in multiexonic genes:

_images/exon_dist_2.png

Advanced: Zoom of exon lengths can be controlled with a trailing number. This changes the size of the step. In the example above, the range of values as the cluster for the distribution is 10, but it can be controlled like this:

exon_lengths 5

This would change the above table to:

The default, if no number is chosen, is decided by the maximum exon length of the provided data. For a maximum length that is less than 100 nucleotides, the step is 1. For a maximum value of exon length that is more than 100 but less than 1,000, the step is 10 and so on.

Changing this step number should not drastically change the time it takes to run. However, the file will be larger and have more lines when a smaller number is used!

intron_lengths

Creates the file intron_lengths_distributions.tsv. In it, a range of intron lengths and the corresponding representation is printed. The outputs and applications are identical to exon_lengths.

Advanced: Zoom of intron lengths can be controlled with a trailing number. This changes the size of the step. It can be controlled like this:

intron_lengths 20

The default, if no number is chosen, is decided by the maximum intron length of the provided data. For a maximum length that is less than 100 nucleotides, the step is 1. For a maximum value of intron length that is more than 100 but less than 1,000, the step is 10 and so on.

Changing this step number should not drastically change the time it takes to run. However, the file will be larger and have more lines when a smaller number is used!

CDS_lengths

Creates the file CDS_lengths_distributions.tsv. In it, a range of CDS lengths and the corresponding representation is printed. The outputs and applications are identical to exon_lengths.

Advanced: Zoom of CDS lengths can be controlled with a trailing number. This changes the size of the step. It can be controlled like this:

CDS_lengths 25

The default, if no number is chosen, is decided by the maximum CDS length of the provided data. For a maximum length that is less than 100 nucleotides, the step is 1. For a maximum value of CDS length that is more than 100 but less than 1,000, the step is 10 and so on.

Changing this step number should not drastically change the time it takes to run. However, the file will be larger and have more lines when a smaller number is used!

gene_lengths

Creates the file gene_lengths_distributions.tsv. In it, a range of gene lengths and the corresponding representation is printed. These sequence lengths do include all introns. The outputs and applications are identical to exon_lengths.

Advanced: Zoom of gene lengths can be controlled with a trailing number. This changes the size of the step. It can be controlled like this:

gene_lengths 1000

The default, if no number is chosen, is decided by the maximum gene length of the provided data. For a maximum length that is less than 100 nucleotides, the step is 1. For a maximum value of gene length that is more than 100 but less than 1,000, the step is 10 and so on.

Changing this step number should not drastically change the time it takes to run. However, the file will be larger and have more lines when a smaller number is used!

exon_position

Analyzes and creates an output that evaluates exon position in a gene to its size. Position meaning which exon comes first. In positive strand genes, these are in the order they appear in the gene table. For reverse strand genes, the first exon is the last one to appear in the gene table. Creates the output file exon_position_distributions.tsv. The output looks like this:

_images/exon_position.png

Exon position goes from 1 to whatever the maximum number of exons in one gene is. It will match what a statistics output would say. The second column is how many exons are representative of that position. The first exon support (70,923 above) will always be equal to the overall number of genes because even monoexonics have a first exon. (You can remove those, of course). You can also say there are 70,923 first exons, 42,161 second exons, etc…

The third column is how many genes have the first column number as their maximum number of exons. So, in the last row shown, there are 1,656 genes that have 6 total exons. There are 28,762 monoexonics then as well by this same logic.

The third and fourth columns are average and median size of an exon at that positon. The last two are minimum and maximum. If you use a minimum exon parameter (as I did above) it will be reflected!

exon_position_data

Provides the raw data in data_intron_position_distributions.tsv on exon positions alongside exon_position_distributions.tsv produced from the command above. This set of data can be used to make boxplots.

The data appears like this:

_images/exon_position_raw.png

The first column is the exon position and the following values in the row are the sizes of exons (non-sorted). The row will have as many columns as exon position data points. Notice how 32 (the last visible row) has only 5 numbers, showing there are only 5 genes that have a 32nd exon where the values are the sizes.

intron_position

Intron positioning works identically to exon positions. However, this will only include multiexonic genes! All header names have the same meaning as exon position.

Creates the output file intron_position_distributions.tsv.

intron_position_data

Intron position raw data works identically to exon positions and will also only include multiexonic genes. Creates the output file data_intron_position_distributions.tsv and the default intron_position_distributions.tsv.

A sample of a boxplot that can be created:

_images/position_boxplot.png

For how I created the boxplot, feel free to contact me!

Compatibility flags

Just as the gff3/gff/gtf file formats follow their owns rules, so do other software in needing specific input formats. Although the gFACs gtf is fairly standard, a few modifications must be made before it can safely be used within other programs. This includes transitions to other formats such as gff or gff3!

Several formats are alrady compatible by default. –create-gtf output is compatible with Jbrowse and protein/nucleotide FASTAs are compatible with EnTAP!

These options are still being developed and user input is more than welcome! Do you not see a format you would like added? Let me know!

To specify the compatibility arguments, use this flag:

--compatibility [option] [option] etc…

Allows the creation of software compatible files. Available format options are:

SnpEff

A gff file called snpeff_format.gff will be created that can be used for SnpEff build. This format can be used in the default.

EVM_1.1.1_gene_prediction

A gff file called EVM_1.1.1_gene_prediction_format.gff will be created that can be used as a gene prediction format for EVidence Modeler.

EVM_1.1.1_alignment

A gff file called EVM_1.1.1_alignment_format.gff will be created that can be used as an alignment format for EVidence Modeler.

FAQ

Under construction. See gitlab issues or contact through gitlab with issues you have.

Why are there negative intron/exon lengths?

Coming soon.

Why don’t incompletes match start and stop exon statistics?

Coming soon.

How do I include EnTAP results?

Coming soon.

All models were removed!

Coming soon.

For further issues, report to gitlab.

format_diagnosis.pl

To determine what format you have, if it is ambiguous, format_diagnosis.pl may be able to help. The script will output information that you can compare with the table below to see if another format may work for you. To use the script:

perl format_diagnosis.pl [input_file]

The output will look something like this:

_images/format_diag.png

The data tells you what information is present followed by the observed quality of the feature. In the above output, the line of the “gene” feature, comes up 278 times and matches that with mRNA. This is not always the case. It also has exon lines and CDS lines but NOT at the same frequency. So CDS will be the more important feature.

Given the comparison, you could choose several formats that might work. braker_2.05_gff3, braker_2.05_gtf, gFACs_gtf, and several more.

FORMAT properties:

NOTE: In this table, know that each format example is NOT the same file. The information inside the [brackets] is just an example number and judgement on format should be made from ratios. Your file will not fit the actual numbers in the brackets above, but the ratio between a format’s exon to CDS counts may be the same. These were derived from my own collection of sample files across different species and projects.

_images/Table.png