Output Formats

This page describes all the output formats supported by genbank_to in detail.

FASTA Formats

Nucleotide FASTA (Genome)

Option: -n, --nucleotide

Description: Outputs the complete nucleotide sequence(s) from the GenBank file.

Format: Standard FASTA format with sequence ID from the GenBank LOCUS or ACCESSION field.

Example Output:

>NC_001417.1
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
...

Use Cases:

  • Reference genome for mapping

  • Input for genome assembly comparison

  • Sequence extraction for primers/probes

ORF Nucleotide FASTA

Option: -o, --orfs

Description: Outputs the DNA sequences of all coding sequences (CDS features).

Format: FASTA format with protein IDs as headers (or locus tags if protein IDs are not available).

Example Output:

>NP_040703.1
ATGGTTAGCAAAATCGAACGTGCAAAGATTGATGATATTAATATTTTTATTGAAAATCACCAGAAAGATA
TAGACTATCTTTGGCAACGTATACCGATGAAATCATTAAAGACTTAAAAGTTGAGCGCTTTGATACGAGT
...

Use Cases:

  • Gene cloning

  • Codon usage analysis

  • Primer design

  • Gene synthesis

Protein FASTA

Option: -a, --aminoacids

Description: Outputs amino acid sequences for all coding sequences.

Format: FASTA format with protein IDs. If --complex is specified, includes detailed annotation.

Example Output (Simple):

>NP_040703.1
MVSKIERCKILMIINFLIEIHQKDIDYLWQRIPEIIIKDLKVERFDDTVKVGGYKKGGLVQPGGSLRLYE
VDEKGHFPENVVYDGDTVVADDTLYLVAVLDERKMKGINTRELLESYFDRRGFRLPVGHIDNKPGFNVK
*

Example Output (Complex with --complex ):

>NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905]
gpA DNA replication protein [GeneID:1261050]

Use Cases:

  • Protein homology searches (BLAST)

  • Phylogenetic analysis

  • Functional annotation

  • Structure prediction

Structured Formats

GFF3 (Generic Feature Format)

Option: --gff3

Description: Outputs annotations in GFF3 format, a standardized format for genomic features.

Format: Tab-delimited with 9 columns: seqid, source, type, start, end, score, strand, phase, attributes.

Example Output:

##gff-version 3
##sequence-region NC_001417 1 5386
NC_001417    GenBank region  1       5386    .       +       .       ID=NC_001417:1..5386;...
NC_001417    GenBank gene    51      1905    .       +       .       ID=gene-phiX174p01;...
NC_001417    GenBank CDS     51      1905    .       +       0       ID=cds-NP_040703.1;...

Use Cases:

  • Genome browsers (IGV, JBrowse)

  • Comparative genomics tools

  • Annotation pipelines

  • Data interchange

NCBI PTT Format

Option: -p, --ptt

Description: Protein table format formerly used by NCBI for genome downloads.

Format: Tab-delimited table with columns: Location, Strand, Length, PID, Gene, Synonym, COG, Product.

Example Output:

51..1905     +       617     -       gpA     NP_040703.1     -       DNA replication protein
1906..2079   +       57      -       gpB     NP_040704.1     -       capsid morphogenesis protein
2092..2529   +       145     -       gpD     NP_040705.1     -       capsid morphogenesis protein

Use Cases:

  • Legacy pipeline compatibility

  • Quick protein overview

  • Tab-delimited data processing

Function Table

Option: -f, --functions

Description: Simple two-column table mapping protein IDs to their functional annotations.

Format: Tab-separated values with columns: Protein ID, Function.

Example Output:

NC_001417    NP_040703.1     DNA replication protein
NC_001417    NP_040704.1     capsid morphogenesis protein
NC_001417    NP_040705.1     capsid morphogenesis protein
NC_001417    NP_040706.1     DNA maturase protein B

Use Cases:

  • Functional enrichment analysis

  • Database imports

  • Quick annotation lookup

  • Spreadsheet analysis

Specialized Formats

Bakta JSON

Option: --bakta-json

Description: JSON format compatible with Bakta genome annotation output. Includes comprehensive metadata and feature annotations.

Additional Options:

  • --bakta-version: Version string

  • --db-version: Database version

  • --genus, --species, --strain: Organism information

  • --gram: Gram stain (+/-)

  • --translation-table: Genetic code

Example Output:

{
    "version": "1.0",
    "genome": {
        "genus": "Enterobacteria",
        "species": "phage phiX174",
        "strain": "Sangier",
        "gram": "-"
    },
    "sequences": [
        {
            "id": "NC_001417",
            "length": 5386,
            "gc": 0.447,
            "features": [
                {
                    "type": "cds",
                    "start": 51,
                    "stop": 1905,
                    "strand": "+",
                    "product": "DNA replication protein"
                }
            ]
        }
    ]
}

Use Cases:

  • Bakta pipeline integration

  • Structured data analysis

  • Web applications

  • Database storage

Python API:

For programmatic access, use the genbank_to_json function from the library:

from GenBankToLib import genbank_to_json

genome_info = {'gram': '-', 'translation_table': 11}
json_data = genbank_to_json('genome.gbk', genome_info)

See the API Reference documentation for detailed information on the genbank_to_json function.

AMRFinderPlus Format

Option: --amr

Description: Creates three files required by NCBI’s AMRFinderPlus tool for antimicrobial resistance gene annotation.

Output Files:

  1. BASENAME.gff - Modified GFF3 with Name attributes

  2. BASENAME.faa - Protein sequences

  3. BASENAME.fna - Nucleotide sequences

Special Features:

  • Validates format for AMRFinderPlus compatibility

  • Adds required Name fields to GFF

  • Excludes pseudogenes

Use Cases:

  • AMR gene detection

  • Resistance profile analysis

  • Public health surveillance

  • Clinical microbiology

Phage Finder Format

Option: --phage_finder

Description: Tab-delimited format required by the phage_finder tool for prophage identification.

Format: Tab-separated with columns: Contig ID, Contig Length, Gene ID, Start, End, Function.

Example Output:

NC_001417    5386    NP_040703.1     51      1905    DNA replication protein
NC_001417    5386    NP_040704.1     1906    2079    capsid morphogenesis protein
NC_001417    5386    NP_040705.1     2092    2529    capsid morphogenesis protein

Use Cases:

  • Prophage detection in bacterial genomes

  • Phage-host interaction studies

  • Comparative phage genomics

Output Modifiers

Separate Files

Option: --separate

Description: When working with multi-record GenBank files, creates separate output files for each sequence record.

Naming Convention: BASENAME.SEQID.EXTENSION

Example:

genbank_to -g multi.gbk --separate -n output
# Creates: output.NC_001417.fna, output.NC_001418.fna, etc.

Sequence ID Filtering

Option: -i, --seqid

Description: Filters output to include only specified sequence IDs. Can be used multiple times.

Example:

genbank_to -g multi.gbk -i NC_001417 -i NC_001418 -n output.fna

Complex Headers

Option: --complex

Description: Adds detailed information to FASTA headers including organism, location, product, and database cross-references.

Example:

>NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905]
DNA replication protein [GeneID:1261050]

Compression

Option: -z, --zip

Description: Compresses output files using gzip. Experimental feature.

Example:

genbank_to -g genome.gbk -f functions.tsv --zip
# Creates: functions.tsv.gz

Format Comparison

Output Format Comparison

Format

Type

Primary Use

Tools

Nucleotide FASTA

Sequence

Genome storage

BWA, Bowtie, BLAST

Protein FASTA

Sequence

Protein analysis

BLAST, DIAMOND, InterProScan

ORF FASTA

Sequence

Gene analysis

Gene synthesis, primers

GFF3

Annotation

Feature storage

IGV, JBrowse, bedtools

PTT

Table

Legacy compatibility

Custom scripts

Functions

Table

Annotation lookup

Spreadsheets, R/Python

Bakta JSON

Structured

Data interchange

Bakta, web apps

AMRFinder

Specialized

AMR detection

AMRFinderPlus

Phage Finder

Specialized

Prophage detection

phage_finder