Quick Start

This guide walks through the most common genbank_to commands and library calls.

Basic Usage

The most common use case is converting a GenBank file to FASTA format:

genbank_to -g input.gbk -n output.fna

This reads input.gbk and writes the nucleotide sequence to output.fna.

Common Conversions

Extract Genome Sequence

genbank_to -g genome.gbk -n genome.fna

Extract Protein Sequences

genbank_to -g genome.gbk -a proteins.faa

Extract ORF Sequences (DNA)

genbank_to -g genome.gbk -o orfs.fna

Generate GFF3 File

genbank_to -g genome.gbk --gff3 genome.gff3

Multiple Outputs at Once

You can request multiple output formats in a single command:

genbank_to -g genome.gbk \
    -n genome.fna \
    -a proteins.faa \
    -o orfs.fna \
    --gff3 genome.gff3
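
When the same conversion has to be repeated for many genomes, the command line can be assembled from Python. The following is a minimal sketch, not part of genbank_to itself: build_command is a hypothetical helper, it uses only the flags shown above, and it assumes genbank_to is installed and on your PATH if you uncomment the subprocess call.

```python
import subprocess


def build_command(genbank, nucleotide=None, protein=None, orfs=None, gff3=None):
    """Assemble a genbank_to argument list (hypothetical helper)."""
    cmd = ["genbank_to", "-g", genbank]
    # Pair each optional output file with the flag shown in this guide.
    outputs = (("-n", nucleotide), ("-a", protein), ("-o", orfs), ("--gff3", gff3))
    for flag, path in outputs:
        if path is not None:
            cmd += [flag, path]
    return cmd


cmd = build_command("genome.gbk", nucleotide="genome.fna", protein="proteins.faa")
print(" ".join(cmd))

# To actually run the conversion (requires genbank_to to be installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.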

Using as a Python Library

You can also call genbank_to from your own Python scripts by importing from GenBankToLib.

Extract All Protein Sequences

from GenBankToLib import genbank_to_faa

for seqid, protid, sequence in genbank_to_faa('genome.gbk'):
    print(f">{protid}")
    print(sequence)

Extract Genome Sequences

from GenBankToLib import genbank_to_fna

for seqid, sequence in genbank_to_fna('genome.gbk'):
    print(f">{seqid}")
    print(sequence)
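
The loop above prints each sequence on a single line. If you would rather write a wrapped FASTA file, a small helper along these lines may be useful. This is a sketch under assumptions: write_fasta is a hypothetical function, not part of GenBankToLib, and it only assumes that its input yields (seqid, sequence) pairs as in the example above. The demo uses made-up records so that it runs without a GenBank file.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (seqid, sequence) pairs as FASTA with wrapped sequence lines."""
    for seqid, sequence in records:
        sequence = str(sequence)  # in case the sequence is not a plain string
        handle.write(f">{seqid}\n")
        # Emit the sequence in chunks of at most `width` characters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Demo with an in-memory buffer and a made-up record.
buffer = io.StringIO()
write_fasta([("seq1", "ACGT" * 4)], buffer, width=10)
print(buffer.getvalue(), end="")
```

To write a real file, open it with open('genome.fna', 'w') and pass genbank_to_fna('genome.gbk') as the records argument.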

Extract Functions

from GenBankToLib import genbank_to_functions

for protid, function in genbank_to_functions('genome.gbk'):
    print(f"{protid}\t{function}")
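
A common follow-up is to tally how often each function annotation occurs. The sketch below assumes only that the input yields (protid, function) pairs, as in the loop above; count_functions is a hypothetical helper and the example pairs are invented for illustration.

```python
from collections import Counter


def count_functions(pairs):
    """Count how many proteins carry each function annotation."""
    return Counter(function for _protid, function in pairs)


# Made-up pairs standing in for genbank_to_functions('genome.gbk').
example = [
    ("p1", "hypothetical protein"),
    ("p2", "capsid protein"),
    ("p3", "hypothetical protein"),
]
counts = count_functions(example)
for function, n in counts.most_common():
    print(f"{n}\t{function}")
```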

Convert to JSON Format

from GenBankToLib import genbank_to_json
import json

genome_info = {'gram': None, 'translation_table': 11}
json_data = genbank_to_json('genome.gbk', genome_info)

# Save to file
with open('genome.json', 'w') as f:
    json.dump(json_data, f, indent=2)

Example Workflow

Here’s a complete example workflow for analyzing a phage genome:

  1. Download a phage genome from NCBI:

# Example: Enterobacteria phage MS2 (accession NC_001417)
# (This is the test file in the repository)
wget https://raw.githubusercontent.com/linsalrob/genbank_to/main/test/NC_001417.gbk

  2. Extract multiple formats:

genbank_to -g NC_001417.gbk \
    -n NC_001417.fna \
    -a NC_001417.faa \
    -o NC_001417_orfs.fna \
    -f NC_001417_functions.tsv \
    --gff3 NC_001417.gff3

  3. View the results:

# View genome sequence
head NC_001417.fna

# Count proteins
grep -c ">" NC_001417.faa

# View functions
head NC_001417_functions.tsv
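
If you prefer to check the protein count from Python rather than with grep, the same idea can be sketched as follows. count_fasta_records is a hypothetical helper; it counts lines that start with ">", which is slightly stricter than grep -c ">" (grep counts any line containing the character). The demo reads from an in-memory string so that it runs without the output files.

```python
import io


def count_fasta_records(lines):
    """Count FASTA header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up FASTA content standing in for NC_001417.faa.
demo = io.StringIO(">prot1\nMKV\n>prot2\nMSA\n")
n_records = count_fasta_records(demo)
print(n_records)
```

For a real file, pass an open file handle, for example open('NC_001417.faa').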

Working with Multi-GenBank Files

If your GenBank file contains multiple records (each terminated by a // line), you can split them:

genbank_to -g multi.gbk --separate -n output

This creates separate files: output.seqid1.fna, output.seqid2.fna, etc.
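
Before splitting, it can help to know how many records a file holds. Because each GenBank record ends with a line containing only //, counting those lines gives the number of records. This is a sketch, and count_genbank_records is a hypothetical helper; that the count matches the number of files written by --separate is an assumption, not something this guide confirms.

```python
import io


def count_genbank_records(lines):
    """Count GenBank records by their '//' terminator lines."""
    return sum(1 for line in lines if line.strip() == "//")


# Heavily abbreviated, made-up records standing in for multi.gbk.
demo = io.StringIO(
    "LOCUS       A\nORIGIN\n        1 acgt\n//\n"
    "LOCUS       B\nORIGIN\n        1 ttga\n//\n"
)
n_records = count_genbank_records(demo)
print(n_records)
```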

Filtering by Sequence ID

Extract only specific sequences:

genbank_to -g multi.gbk -i NC_001417 -n output.fna
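
The same kind of filtering can be done on the library side. The sketch below assumes only that records arrive as (seqid, sequence) pairs, as yielded in the genbank_to_fna example; filter_by_id is a hypothetical helper, and the records are invented. Whether the sequence IDs in your file carry a version suffix (for example NC_001417.2) is not specified here, so check the IDs before relying on an exact match.

```python
def filter_by_id(records, wanted):
    """Yield only the (seqid, sequence) pairs whose ID is in `wanted`."""
    wanted = set(wanted)
    for seqid, sequence in records:
        if seqid in wanted:
            yield seqid, sequence


# Made-up records standing in for genbank_to_fna('multi.gbk').
records = [("NC_001417", "ACGT"), ("other_seq", "TTGA")]
kept = list(filter_by_id(records, ["NC_001417"]))
for seqid, sequence in kept:
    print(f">{seqid}")
    print(sequence)
```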

Next Steps