Output Formats
This page describes all the output formats supported by genbank_to in detail.
FASTA Formats
Nucleotide FASTA (Genome)
Option: -n, --nucleotide
Description: Outputs the complete nucleotide sequence(s) from the GenBank file.
Format: Standard FASTA format with sequence ID from the GenBank LOCUS or ACCESSION field.
Example Output:
>NC_001417.1
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
...
Use Cases:
Reference genome for mapping
Input for genome assembly comparison
Sequence extraction for primers/probes
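For downstream scripting, the nucleotide output can be read with a few lines of Python. The following is a minimal sketch, assuming the file is plain FASTA as in the example output; the helper name `read_fasta` is hypothetical and is not part of genbank_to, and the sample text is just the two sequence lines shown above.

```python
import io

def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first whitespace-delimited token after '>' as the ID
            tokens = line[1:].split()
            seq_id, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Sample input copied from the example output (truncated to two lines)
sample = """>NC_001417.1
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
"""
records = dict(read_fasta(io.StringIO(sample)))
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

For real files, replace the `io.StringIO` handle with `open("output.fna")`.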
ORF Nucleotide FASTA
Option: -o, --orfs
Description: Outputs the DNA sequences of all coding sequences (CDS features).
Format: FASTA format with protein IDs as headers (or locus tags if protein IDs are not available).
Example Output:
>NP_040703.1
ATGGTTAGCAAAATCGAACGTGCAAAGATTGATGATATTAATATTTTTATTGAAAATCACCAGAAAGATA
TAGACTATCTTTGGCAACGTATACCGATGAAATCATTAAAGACTTAAAAGTTGAGCGCTTTGATACGAGT
...
Use Cases:
Gene cloning
Codon usage analysis
Primer design
Gene synthesis
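As an illustration of the codon usage use case, the sketch below counts in-frame codons in one ORF sequence. It assumes each record starts at the first base of a codon; `codon_counts` is a hypothetical helper, not a genbank_to function, and the input is the first 21 bases of the example output.

```python
from collections import Counter

def codon_counts(cds):
    """Count complete codons in a coding sequence, reading in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon, if any
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

# First 21 nucleotides of the example ORF
counts = codon_counts("ATGGTTAGCAAAATCGAACGT")
for codon, n in sorted(counts.items()):
    print(codon, n)
```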
Protein FASTA
Option: -a, --aminoacids
Description: Outputs amino acid sequences for all coding sequences.
Format: FASTA format with protein IDs. If --complex is specified, includes detailed annotation.
Example Output (Simple):
>NP_040703.1
MVSKIERCKILMIINFLIEIHQKDIDYLWQRIPEIIIKDLKVERFDDTVKVGGYKKGGLVQPGGSLRLYE
VDEKGHFPENVVYDGDTVVADDTLYLVAVLDERKMKGINTRELLESYFDRRGFRLPVGHIDNKPGFNVK
*
Example Output (Complex, with --complex):
>NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905]
gpA DNA replication protein [GeneID:1261050]
Use Cases:
Protein homology searches (BLAST)
Phylogenetic analysis
Functional annotation
Structure prediction
Structured Formats
GFF3 (Generic Feature Format)
Option: --gff3
Description: Outputs annotations in GFF3 format, a standardized format for genomic features.
Format: Tab-delimited with 9 columns: seqid, source, type, start, end, score, strand, phase, attributes.
Example Output:
##gff-version 3
##sequence-region NC_001417 1 5386
NC_001417 GenBank region 1 5386 . + . ID=NC_001417:1..5386;...
NC_001417 GenBank gene 51 1905 . + . ID=gene-phiX174p01;...
NC_001417 GenBank CDS 51 1905 . + 0 ID=cds-NP_040703.1;...
Use Cases:
Genome browsers (IGV, JBrowse)
Comparative genomics tools
Annotation pipelines
Data interchange
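The nine-column layout described above can be parsed without extra dependencies. This is a minimal sketch, assuming tab-separated feature lines and semicolon-separated `key=value` attributes; `parse_gff3_line` is a hypothetical helper, and the sample line keeps only the `ID` attribute because the remaining attributes are elided in the example output.

```python
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)

def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by the nine column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    # Attributes are semicolon-separated key=value pairs
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if "=" in pair
    )
    return record

line = "NC_001417\tGenBank\tCDS\t51\t1905\t.\t+\t0\tID=cds-NP_040703.1"
cds = parse_gff3_line(line)
print(cds["type"], cds["start"], cds["end"], cds["attributes"]["ID"])
```

Lines beginning with `#` (such as `##gff-version 3`) are directives or comments and should be skipped before calling the helper.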
NCBI PTT Format
Option: -p, --ptt
Description: Protein table format formerly used by NCBI for genome downloads.
Format: Tab-delimited table with columns: Location, Strand, Length, PID, Gene, Synonym, COG, Product.
Example Output:
51..1905 + 617 - gpA NP_040703.1 - DNA replication protein
1906..2079 + 57 - gpB NP_040704.1 - capsid morphogenesis protein
2092..2529 + 145 - gpD NP_040705.1 - capsid morphogenesis protein
Use Cases:
Legacy pipeline compatibility
Quick protein overview
Tab-delimited data processing
Function Table
Option: -f, --functions
Description: Simple table mapping protein IDs to their functional annotations, with the sequence ID in the first column.
Format: Tab-separated values with columns: Sequence ID, Protein ID, Function.
Example Output:
NC_001417 NP_040703.1 DNA replication protein
NC_001417 NP_040704.1 capsid morphogenesis protein
NC_001417 NP_040705.1 capsid morphogenesis protein
NC_001417 NP_040706.1 DNA maturase protein B
Use Cases:
Functional enrichment analysis
Database imports
Quick annotation lookup
Spreadsheet analysis
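For quick annotation lookup, the table can be loaded into a dictionary. The sketch below is one possible approach: it takes the last two tab-separated fields of each row as protein ID and function, so it also tolerates the leading sequence ID column seen in the example output. `load_functions` is a hypothetical helper name.

```python
import csv
import io

def load_functions(handle):
    """Map protein ID -> function using the last two tab-separated fields of each row."""
    functions = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        protein_id, function = row[-2], row[-1]
        functions[protein_id] = function
    return functions

# Two rows copied from the example output, with explicit tab separators
sample = (
    "NC_001417\tNP_040703.1\tDNA replication protein\n"
    "NC_001417\tNP_040704.1\tcapsid morphogenesis protein\n"
)
functions = load_functions(io.StringIO(sample))
print(functions["NP_040703.1"])
```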
Specialized Formats
Bakta JSON
Option: --bakta-json
Description: JSON format compatible with Bakta genome annotation output. Includes comprehensive metadata and feature annotations.
Additional Options:
--bakta-version: Version string
--db-version: Database version
--genus, --species, --strain: Organism information
--gram: Gram stain (+/-)
--translation-table: Genetic code
Example Output:
{
"version": "1.0",
"genome": {
"genus": "Enterobacteria",
"species": "phage phiX174",
"strain": "Sangier",
"gram": "-"
},
"sequences": [
{
"id": "NC_001417",
"length": 5386,
"gc": 0.447,
"features": [
{
"type": "cds",
"start": 51,
"stop": 1905,
"strand": "+",
"product": "DNA replication protein"
}
]
}
]
}
Use Cases:
Bakta pipeline integration
Structured data analysis
Web applications
Database storage
Python API:
For programmatic access, use the genbank_to_json function from the library:
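from GenBankToLib import genbank_to_json
# Organism metadata: Gram stain and translation table (genetic code)
genome_info = {'gram': '-', 'translation_table': 11}
# Convert the GenBank file using the metadata above
json_data = genbank_to_json('genome.gbk', genome_info)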
See the API Reference documentation for detailed information on the genbank_to_json function.
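Once a Bakta JSON file has been written, it can be inspected with the standard `json` module. The following sketch assumes the layout shown in the example output (a top-level `sequences` list whose entries carry a `features` list); the embedded document is a trimmed copy of that example, and `list_cds` is a hypothetical helper.

```python
import json

# Trimmed copy of the example output, used here instead of a file on disk
bakta_text = """
{
  "genome": {"genus": "Enterobacteria", "species": "phage phiX174", "gram": "-"},
  "sequences": [
    {
      "id": "NC_001417",
      "length": 5386,
      "features": [
        {"type": "cds", "start": 51, "stop": 1905, "strand": "+",
         "product": "DNA replication protein"}
      ]
    }
  ]
}
"""

def list_cds(document):
    """Return (sequence id, start, stop, product) for every CDS feature."""
    rows = []
    for sequence in document.get("sequences", []):
        for feature in sequence.get("features", []):
            if feature.get("type") == "cds":
                rows.append((
                    sequence["id"],
                    feature["start"],
                    feature["stop"],
                    feature.get("product", ""),
                ))
    return rows

document = json.loads(bakta_text)
cds_rows = list_cds(document)
for row in cds_rows:
    print(*row, sep="\t")
```

To read a real file, use `json.load(open("genome.json"))` in place of `json.loads(bakta_text)`.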
AMRFinderPlus Format
Option: --amr
Description: Creates three files required by NCBI’s AMRFinderPlus tool for antimicrobial resistance gene annotation.
Output Files:
BASENAME.gff - Modified GFF3 with Name attributes
BASENAME.faa - Protein sequences
BASENAME.fna - Nucleotide sequences
Special Features:
Validates format for AMRFinderPlus compatibility
Adds required Name fields to GFF
Excludes pseudogenes
Use Cases:
AMR gene detection
Resistance profile analysis
Public health surveillance
Clinical microbiology
Phage Finder Format
Option: --phage_finder
Description: Tab-delimited format required by the phage_finder tool for prophage identification.
Format: Tab-separated with columns: Contig ID, Contig Length, Gene ID, Start, End, Function.
Example Output:
NC_001417 5386 NP_040703.1 51 1905 DNA replication protein
NC_001417 5386 NP_040704.1 1906 2079 capsid morphogenesis protein
NC_001417 5386 NP_040705.1 2092 2529 capsid morphogenesis protein
Use Cases:
Prophage detection in bacterial genomes
Phage-host interaction studies
Comparative phage genomics
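When checking this output before passing it to phage_finder, it can help to group genes by contig. The sketch below assumes the six tab-separated columns listed above and no header row; the column keys and the `genes_by_contig` helper are illustrative names, not part of either tool.

```python
import csv
import io
from collections import defaultdict

# Column keys follow the order given in the format description
COLUMNS = ["contig_id", "contig_length", "gene_id", "start", "end", "function"]

def genes_by_contig(handle):
    """Group (gene_id, start, end) tuples by contig ID."""
    contigs = defaultdict(list)
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        contigs[row["contig_id"]].append(
            (row["gene_id"], int(row["start"]), int(row["end"]))
        )
    return dict(contigs)

# Two rows copied from the example output, with explicit tab separators
sample = (
    "NC_001417\t5386\tNP_040703.1\t51\t1905\tDNA replication protein\n"
    "NC_001417\t5386\tNP_040704.1\t1906\t2079\tcapsid morphogenesis protein\n"
)
grouped = genes_by_contig(io.StringIO(sample))
print(len(grouped["NC_001417"]), "genes on NC_001417")
```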
Output Modifiers
Separate Files
Option: --separate
Description: When working with multi-record GenBank files, creates separate output files for each sequence record.
Naming Convention: BASENAME.SEQID.EXTENSION
Example:
genbank_to -g multi.gbk --separate -n output
# Creates: output.NC_001417.fna, output.NC_001418.fna, etc.
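Scripts that consume these files may want to predict their names ahead of time. This small sketch simply applies the BASENAME.SEQID.EXTENSION convention stated above; `separate_filename` is a hypothetical helper and does not reflect how genbank_to builds the names internally.

```python
def separate_filename(basename, seqid, extension):
    """Build a per-record file name following BASENAME.SEQID.EXTENSION."""
    # Accept the extension with or without a leading dot
    return f"{basename}.{seqid}.{extension.lstrip('.')}"

names = [
    separate_filename("output", seqid, "fna")
    for seqid in ("NC_001417", "NC_001418")
]
print(names)
```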
Sequence ID Filtering
Option: -i, --seqid
Description: Filters output to include only specified sequence IDs. Can be used multiple times.
Example:
genbank_to -g multi.gbk -i NC_001417 -i NC_001418 -n output.fna
Complex Headers
Option: --complex
Description: Adds detailed information to FASTA headers including organism, location, product, and database cross-references.
Example:
>NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905]
DNA replication protein [GeneID:1261050]
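The bracketed fields in a complex header can be pulled apart with a regular expression. This is a rough sketch that assumes the whole header sits on a single line in the actual file (it is shown wrapped in the example) and that fields are enclosed in square brackets as displayed; `parse_complex_header` is a hypothetical helper.

```python
import re

def parse_complex_header(header):
    """Split a --complex style header into its ID, bracketed fields, and free text."""
    header = header.lstrip(">").strip()
    protein_id, _, rest = header.partition(" ")
    # Everything inside [...] is treated as one field
    bracketed = re.findall(r"\[([^\]]*)\]", rest)
    # Whatever remains outside the brackets is the free-text description
    free_text = re.sub(r"\s*\[[^\]]*\]\s*", " ", rest).strip()
    return {"id": protein_id, "bracketed": bracketed, "text": free_text}

header = (
    ">NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] "
    "[NC_001417_51_1905] DNA replication protein [GeneID:1261050]"
)
parsed = parse_complex_header(header)
print(parsed["id"], parsed["text"], parsed["bracketed"])
```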
Compression
Option: -z, --zip
Description: Compresses output files using gzip. Experimental feature.
Example:
genbank_to -g genome.gbk -f functions.tsv --zip
# Creates: functions.tsv.gz
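Gzip-compressed output can be read directly in Python without decompressing it first. The sketch below writes a tiny stand-in file to a temporary directory so that it is self-contained, then reads it back with `gzip.open` in text mode; the file content and the `read_gzipped_lines` helper are illustrative only.

```python
import gzip
import os
import tempfile

def read_gzipped_lines(path):
    """Return the lines of a gzip-compressed text file without trailing newlines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "functions.tsv.gz")
    # Stand-in for a file produced with --zip
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("NC_001417\tNP_040703.1\tDNA replication protein\n")
    lines = read_gzipped_lines(path)

print(lines)
```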
Format Comparison
| Format | Type | Primary Use | Tools |
|---|---|---|---|
| Nucleotide FASTA | Sequence | Genome storage | BWA, Bowtie, BLAST |
| Protein FASTA | Sequence | Protein analysis | BLAST, DIAMOND, InterProScan |
| ORF FASTA | Sequence | Gene analysis | Gene synthesis, primers |
| GFF3 | Annotation | Feature storage | IGV, JBrowse, bedtools |
| PTT | Table | Legacy compatibility | Custom scripts |
| Functions | Table | Annotation lookup | Spreadsheets, R/Python |
| Bakta JSON | Structured | Data interchange | Bakta, web apps |
| AMRFinder | Specialized | AMR detection | AMRFinderPlus |
| Phage Finder | Specialized | Prophage detection | phage_finder |