Examples

This page collects worked examples for common genbank_to use cases, from single-file conversion to pipeline integration and library use.

Basic Examples

Convert GenBank to FASTA

The simplest use case is extracting the genome sequence as nucleotide FASTA:

genbank_to -g genome.gbk -n genome.fna

Extract All Sequence Types

Get genome, ORFs, and proteins in one command:

genbank_to -g genome.gbk \
    -n genome.fna \
    -o orfs.fna \
    -a proteins.faa
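
A quick sanity check after a multi-output run is to count the records in each FASTA file. The sketch below uses only the Python standard library. The helper name count_fasta_records is hypothetical and not part of genbank_to; the file names are the ones used in the command above.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # File names follow the command above; files that do not exist are skipped.
    for name in ("genome.fna", "orfs.fna", "proteins.faa"):
        if Path(name).exists():
            print(f"{name}: {count_fasta_records(name)} records")
```

If the ORF and protein counts differ, that may be worth a closer look before using the files downstream.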

Working with Phage Genomes

Complete Phage Analysis

Extract all relevant information from a phage genome:

# Download the example phage genome (accession NC_001417)
wget https://raw.githubusercontent.com/linsalrob/genbank_to/main/test/NC_001417.gbk

# Convert to multiple formats
genbank_to -g NC_001417.gbk \
    -n NC_001417.fna \
    -a NC_001417.faa \
    -f NC_001417_functions.tsv \
    --gff3 NC_001417.gff3 \
    --phage_finder NC_001417.pf

Phage Annotation Comparison

Compare annotations from different sources:

# Extract functions from GenBank
genbank_to -g phage1.gbk -f phage1_functions.tsv
genbank_to -g phage2.gbk -f phage2_functions.tsv

# Compare in Python (run the lines below in a Python session; requires pandas)
import pandas as pd

df1 = pd.read_csv('phage1_functions.tsv', sep='\t',
                  names=['seqid', 'protid', 'function'])
df2 = pd.read_csv('phage2_functions.tsv', sep='\t',
                  names=['seqid', 'protid', 'function'])

# Inner-join on protein ID, then show proteins whose annotations differ
merged = df1.merge(df2, on='protid', suffixes=('_1', '_2'))
print(merged[merged['function_1'] != merged['function_2']])
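
The inner merge above only reports proteins present in both files. As a complementary sketch using only the standard library, the helpers below also list protein IDs found in just one file. The function names load_functions and compare_functions are hypothetical, and the three-column, header-less TSV layout is assumed from the column names used above.

```python
import csv


def load_functions(path):
    """Read a three-column TSV (seqid, protid, function) into {protid: function}."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row[1]: row[2] for row in reader if len(row) >= 3}


def compare_functions(funcs1, funcs2):
    """Return (changed, only_in_1, only_in_2) for two {protid: function} dicts."""
    shared = funcs1.keys() & funcs2.keys()
    # Proteins annotated in both files but with different function text
    changed = {p: (funcs1[p], funcs2[p]) for p in shared if funcs1[p] != funcs2[p]}
    only_in_1 = sorted(funcs1.keys() - funcs2.keys())
    only_in_2 = sorted(funcs2.keys() - funcs1.keys())
    return changed, only_in_1, only_in_2
```

A typical call would be compare_functions(load_functions('phage1_functions.tsv'), load_functions('phage2_functions.tsv')).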

Bacterial Genome Analysis

Complete Bacterial Genome Conversion

genbank_to -g ecoli.gbk \
    -n ecoli_genome.fna \
    -a ecoli_proteins.faa \
    -o ecoli_orfs.fna \
    -f ecoli_functions.tsv \
    --gff3 ecoli.gff3 \
    --bakta-json ecoli.json \
    --genus Escherichia \
    --species coli \
    --strain K-12 \
    --gram -

AMR Analysis Pipeline

Prepare files for antimicrobial resistance analysis:

# Convert to AMRFinder format
genbank_to -g bacteria.gbk --amr bacteria_amr

# Run AMRFinderPlus (if installed)
amrfinder \
    -n bacteria_amr.fna \
    -p bacteria_amr.faa \
    -g bacteria_amr.gff \
    -o amr_results.txt

Multi-Genome Processing

Batch Processing

Process multiple GenBank files:

#!/bin/bash

# Make sure the output directory exists before writing to it
mkdir -p output

for gbk in genomes/*.gbk; do
    base=$(basename "$gbk" .gbk)
    echo "Processing $base..."

    genbank_to -g "$gbk" \
        -n "output/${base}.fna" \
        -a "output/${base}.faa" \
        --gff3 "output/${base}.gff3"
done
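
The same loop can be sketched in Python, which makes it easier to collect failures. This version only uses flags that appear in the shell example (-g, -n, -a, --gff3). The function names build_command and convert_all and the directory names are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def build_command(gbk, outdir):
    """Build the genbank_to argument list for one GenBank file."""
    base = Path(gbk).stem
    out = Path(outdir)
    return [
        "genbank_to", "-g", str(gbk),
        "-n", str(out / f"{base}.fna"),
        "-a", str(out / f"{base}.faa"),
        "--gff3", str(out / f"{base}.gff3"),
    ]


def convert_all(indir="genomes", outdir="output"):
    """Run genbank_to on every .gbk file in indir; return names that failed."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    failed = []
    for gbk in sorted(Path(indir).glob("*.gbk")):
        result = subprocess.run(build_command(gbk, outdir))
        if result.returncode != 0:
            failed.append(gbk.name)
    return failed
```

Calling convert_all() processes genomes/*.gbk and returns the file names whose conversion exited with a non-zero status.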

Processing Multi-Record Files

Handle GenBank files with multiple chromosomes or contigs:

# Separate into individual files
genbank_to -g multi_chromosome.gbk --separate -n chromosome

# This creates:
# chromosome.chr1.fna
# chromosome.chr2.fna
# etc.

Extract Specific Chromosomes

Extract only certain sequences from a multi-record file:

genbank_to -g multi_chromosome.gbk \
    -i NC_000001 \
    -i NC_000002 \
    -n selected_chromosomes.fna
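
To find out which identifiers a multi-record file contains before choosing values for -i, the records can be listed with a few lines of Python. This is a minimal sketch that reads the LOCUS and VERSION lines directly. The helper name list_record_ids is hypothetical, and whether -i expects the LOCUS name or the versioned accession is not stated here, so check both against your file.

```python
def list_record_ids(lines):
    """Yield (locus_name, version) for each record in GenBank flat-file lines."""
    locus, version = None, None
    for line in lines:
        fields = line.split()
        if line.startswith("LOCUS") and len(fields) > 1:
            locus, version = fields[1], None
        elif line.startswith("VERSION") and len(fields) > 1:
            version = fields[1]
        elif line.startswith("//") and locus is not None:
            # '//' terminates a record, so report what was collected
            yield locus, version
            locus, version = None, None
```

Passing an open file handle, for example list_record_ids(open('multi_chromosome.gbk')), yields one pair per record.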

Python Library Usage

Basic Library Import

from GenBankToLib import (
    genbank_to_faa,
    genbank_to_fna,
    genbank_to_orfs,
    genbank_to_functions,
    genbank_to_gff,
    genbank_to_json
)

Extract and Filter Proteins

from GenBankToLib import genbank_to_faa

# Extract proteins longer than 100 amino acids
with open('large_proteins.faa', 'w') as out:
    for seqid, protid, sequence in genbank_to_faa('genome.gbk'):
        if len(sequence) > 100:
            out.write(f">{protid}\n{sequence}\n")

Build Custom Function Database

from GenBankToLib import genbank_to_functions
import json

# Build a dictionary of functions
# (assumes three values per record, matching the seqid/protid/function TSV columns)
func_db = {}
for seqid, protid, function in genbank_to_functions('genome.gbk'):
    func_db[protid] = function

# Save as JSON
with open('function_db.json', 'w') as f:
    json.dump(func_db, f, indent=2)

Calculate Genome Statistics

from GenBankToLib import genbank_to_fna, genbank_to_faa
from Bio.SeqUtils import gc_fraction

# Get genome statistics
for seqid, sequence in genbank_to_fna('genome.gbk'):
    gc_content = gc_fraction(sequence) * 100
    length = len(sequence)
    print(f"{seqid}: {length} bp, {gc_content:.2f}% GC")

# Count proteins
protein_count = sum(1 for _ in genbank_to_faa('genome.gbk'))
print(f"Proteins: {protein_count}")
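
For multi-record files, the per-record lengths can also be summarised as an N50. The sketch below implements the usual definition (the length at which the cumulative sum of lengths, sorted longest first, reaches half the total). The function name n50 is hypothetical, and it is an assumption that any N50 reported by genbank_to itself uses the same definition.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the running sum covers half the total
        if running * 2 >= total:
            return length
    return 0
```

With the loop above, the input would be a list such as [len(sequence) for seqid, sequence in genbank_to_fna('genome.gbk')].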

Export to JSON Format

from GenBankToLib import genbank_to_json
import json

# Convert GenBank to comprehensive JSON format
genome_info = {
    'gram': '-',
    'translation_table': 11
}

json_data = genbank_to_json('genome.gbk', genome_info)

# Access genome metadata
print(f"Organism: {json_data['genome']['genus']} {json_data['genome']['species']}")
print(f"Gram: {json_data['genome']['gram']}")

# Access statistics
stats = json_data['stats']
print(f"Genome size: {stats['size']:,} bp")
print(f"GC content: {stats['gc']:.2%}")
print(f"Coding ratio: {stats['coding_ratio']:.2%}")
print(f"N50: {stats['n50']:,}")

# Filter features by type
cds_count = sum(1 for f in json_data['features'] if f['type'] == 'CDS')
trna_count = sum(1 for f in json_data['features'] if f['type'] == 'tRNA')
rrna_count = sum(1 for f in json_data['features'] if f['type'] == 'rRNA')

print(f"CDS: {cds_count}, tRNA: {trna_count}, rRNA: {rrna_count}")

# Save to file
with open('genome_complete.json', 'w') as f:
    json.dump(json_data, f, indent=2)
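
Counting one feature type at a time scans the feature list once per type. As a small sketch, collections.Counter tallies every type in a single pass. It assumes only what the example above shows: that json_data['features'] is a list of dicts with a 'type' key. The function name count_feature_types is hypothetical.

```python
from collections import Counter


def count_feature_types(json_data):
    """Tally features by their 'type' field in a single pass."""
    return Counter(feature["type"] for feature in json_data["features"])


# Small in-memory stand-in for the structure returned by genbank_to_json
example = {"features": [{"type": "CDS"}, {"type": "CDS"}, {"type": "tRNA"}]}
for feature_type, count in count_feature_types(example).most_common():
    print(f"{feature_type}: {count}")
```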

Merge Multiple Genomes

from GenBankToLib import genbank_to_faa

genbank_files = ['genome1.gbk', 'genome2.gbk', 'genome3.gbk']

with open('all_proteins.faa', 'w') as out:
    for gbk in genbank_files:
        for seqid, protid, sequence in genbank_to_faa(gbk):
            out.write(f">{protid}\n{sequence}\n")
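
When proteins from several genomes end up in one file, it can help to record where each one came from. The sketch below prefixes each protein ID with the GenBank file name (without extension). The helper name tagged_records and the "|" separator are illustrative choices, and the three-value records follow the loop above.

```python
from pathlib import Path


def tagged_records(genome_path, records):
    """Yield (header, sequence), prefixing each protein ID with the file stem."""
    tag = Path(genome_path).stem
    for seqid, protid, sequence in records:
        yield f"{tag}|{protid}", sequence
```

Inside the loop above this would be used as: for header, sequence in tagged_records(gbk, genbank_to_faa(gbk)), followed by the same out.write call with header in place of protid.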

Advanced Examples

Complex Header Format

Generate detailed FASTA headers:

genbank_to -g genome.gbk \
    -a proteins.faa \
    --complex

Output includes organism, location, product, and database IDs:

>NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905] DNA replication protein [GeneID:1261050]
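
Headers in this style can be split back into fields with a regular expression. The pattern below is a sketch derived only from the single example header shown here, read as one line, so other records may need adjustments. The names HEADER_PATTERN and parse_complex_header and the field names are hypothetical.

```python
import re

# ID, three bracketed fields, free-text product, optional trailing bracketed ID
HEADER_PATTERN = re.compile(
    r"^>(?P<protein_id>\S+) \[(?P<seqid>[^\]]*)\] \[(?P<organism>[^\]]*)\] "
    r"\[(?P<location>[^\]]*)\] (?P<product>.*?)(?: \[(?P<db_xref>[^\]]*)\])?$"
)


def parse_complex_header(header):
    """Return a dict of header fields, or None if the header does not match."""
    match = HEADER_PATTERN.match(header.strip())
    return match.groupdict() if match else None
```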

Include Pseudogenes

By default, pseudogenes are skipped. To include them:

genbank_to -g genome.gbk -a proteins.faa --pseudo

Compressed Output

Generate compressed output (experimental):

genbank_to -g genome.gbk \
    -f functions.tsv \
    --zip

Custom Logging

Control logging output:

# Custom log file with debug information
genbank_to -g genome.gbk \
    -n genome.fna \
    --log my_conversion.log \
    --debug

# View the log
tail -f my_conversion.log

Integration Examples

BLAST Database Creation

# Extract proteins
genbank_to -g genome.gbk -a proteins.faa

# Create BLAST database
makeblastdb -in proteins.faa -dbtype prot -out genome_db

Genome Browser Setup

# Generate GFF3 and FASTA
genbank_to -g genome.gbk \
    -n genome.fna \
    --gff3 genome.gff3

# Index for IGV
samtools faidx genome.fna

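# Or sort and index the GFF3 for JBrowse
# (keep '#' header lines first, then sort features by sequence and start position)
(grep "^#" genome.gff3; grep -v "^#" genome.gff3 | sort -t"$(printf '\t')" -k1,1 -k4,4n) > genome.sorted.gff3
bgzip genome.sorted.gff3
tabix -p gff genome.sorted.gff3.gz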
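
If a shell sort is not convenient, the same ordering can be sketched in Python. The helper below keeps comment and directive lines at the top and sorts feature lines by sequence ID and start coordinate. The function name sort_gff3_lines is hypothetical, and dropping everything from a ##FASTA directive onward is a simplifying assumption.

```python
def sort_gff3_lines(lines):
    """Return GFF3 lines with '#' lines first and features sorted by (seqid, start)."""
    headers, features = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence data is not sorted or kept
        if not line:
            continue
        if line.startswith("#"):
            headers.append(line)
        else:
            features.append(line.split("\t"))
    # Column 1 is the sequence ID, column 4 the 1-based start position
    features.sort(key=lambda cols: (cols[0], int(cols[3])))
    return headers + ["\t".join(cols) for cols in features]
```

The returned lines can be written out with "\n".join(...) and then compressed and indexed with bgzip and tabix as above.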

Codon Usage Analysis

# Extract ORFs
genbank_to -g genome.gbk -o orfs.fna

# Analyze with CodonW or similar tool
codonw orfs.fna -all_indices
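
For a quick look at codon usage without an external tool, codons can be tallied directly. This sketch counts in-frame codons for a single ORF sequence and ignores any trailing partial codon. The function name codon_counts is hypothetical; reading the sequences from orfs.fna is left to the FASTA parser of your choice.

```python
from collections import Counter


def codon_counts(sequence):
    """Count in-frame codons, starting at the first base of the sequence."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))
```

Counters from several ORFs can be summed with + to get genome-wide totals.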

Phylogenetic Analysis

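# Extract proteins from each genome
genbank_to -g genome1.gbk -a proteins1.faa
genbank_to -g genome2.gbk -a proteins2.faa

# Extract rpoB sequences (example). grep matches header text, so this assumes
# the product name appears in the header and each sequence sits on one line
grep -A1 "RNA polymerase beta" proteins1.faa > rpoB.faa
grep -A1 "RNA polymerase beta" proteins2.faa >> rpoB.faa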

# Align with MAFFT
mafft rpoB.faa > rpoB_aligned.faa

# Build tree with RAxML (-p sets the random seed for the starting tree)
raxmlHPC -s rpoB_aligned.faa -n tree -m PROTGAMMAAUTO -p 12345
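
Selecting records with grep -A1 depends on each sequence occupying exactly one line. As a sketch of a more tolerant alternative, the generator below selects FASTA records by a case-insensitive keyword in the header and handles wrapped sequences. The function name select_records is hypothetical, and it still assumes the product name is present in the header.

```python
def select_records(lines, keyword):
    """Yield (header, sequence) for FASTA records whose header contains keyword."""
    wanted = keyword.lower()
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None and wanted in header.lower():
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None and wanted in header.lower():
        yield header, "".join(chunks)
```

Each selected pair can be written to rpoB.faa as a ">" header line followed by the sequence.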

Troubleshooting Examples

Check File Format

# Verify it's a GenBank file
head -1 genome.gbk
# Should start with LOCUS

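# Check for multiple records (each record ends with a line starting with "//";
# anchoring with ^ avoids counting URLs inside the file)
grep -c "^//" genome.gbk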
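
The same two checks can be sketched in Python, which also covers gzipped input. The helper names open_text and check_genbank are hypothetical. The checks mirror the shell commands: the first line should start with LOCUS, and each record ends with a line consisting of "//".

```python
import gzip


def open_text(path):
    """Open a plain or gzipped file in text mode, based on the .gz suffix."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def check_genbank(path):
    """Return (starts_with_locus, record_count) for a GenBank flat file."""
    starts_with_locus = False
    records = 0
    with open_text(path) as handle:
        for number, line in enumerate(handle):
            if number == 0:
                starts_with_locus = line.startswith("LOCUS")
            if line.rstrip() == "//":
                records += 1
    return starts_with_locus, records
```

A result such as (True, 3) would indicate a GenBank file with three records.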

Handle Gzipped Input

genbank_to automatically handles gzipped files:

genbank_to -g genome.gbk.gz -n genome.fna

Extract from Failed Conversion

If conversion fails partway through:

# Enable debug logging
genbank_to -g genome.gbk \
    -a proteins.faa \
    --debug \
    --log debug.log

# Check the log for specific errors
grep ERROR debug.log

Performance Optimization

Large Genome Processing

For very large genomes or many files:

# Process one output at a time to save memory
genbank_to -g large_genome.gbk -n genome.fna
genbank_to -g large_genome.gbk -a proteins.faa

# Or use parallel processing for multiple files
parallel -j 4 'genbank_to -g {} -n {.}.fna' ::: genomes/*.gbk

Streaming Processing

from GenBankToLib import genbank_to_faa


def analyze_protein(sequence):
    """Placeholder analysis (hypothetical): return the protein length."""
    return len(sequence)


# Process proteins on-the-fly without loading all into memory
for seqid, protid, sequence in genbank_to_faa('large_genome.gbk'):
    # Process each protein immediately
    result = analyze_protein(sequence)
    print(f"{protid}: {result}")