API Reference
This page documents the Python library API for genbank_to. All functions can be imported from the GenBankToLib package.
from GenBankToLib import (
genbank_to_faa,
genbank_to_fna,
genbank_to_orfs,
genbank_to_functions,
genbank_to_ptt,
genbank_to_gff,
genbank_to_phage_finder,
genbank_to_amrfinder,
genbank_seqio,
genbank_to_json
)
Core Functions
genbank_to_fna
genbank_to_fna(gbkf, include_definition=False)
Extract nucleotide sequences from a GenBank file.
Parameters:
gbkf(str): Path to the GenBank file.include_definition(bool, optional): If True, includes the GenBank definition line with the sequence ID. Default is False.
Yields:
tuple: (sequence_id, sequence) where:sequence_id(str): The sequence identifiersequence(Bio.Seq.Seq): The nucleotide sequence
Example:
from GenBankToLib import genbank_to_fna
for seqid, sequence in genbank_to_fna('genome.gbk'):
print(f">{seqid}")
print(sequence)
genbank_to_faa
genbank_to_faa(gbkf, complexheader=False, skip_pseudo=True)
Extract amino acid (protein) sequences from a GenBank file.
Parameters:
gbkf(str): Path to the GenBank file.complexheader(bool, optional): If True, creates detailed headers with organism, location, product, and database references. Default is False.skip_pseudo(bool, optional): If True, skips pseudogenes. Default is True.
Yields:
tuple: (sequence_id, protein_id, sequence) where:sequence_id(str): The parent sequence identifierprotein_id(str): The protein identifier (protein_id, locus_tag, or db_xref)sequence(str): The amino acid sequence
Example:
from GenBankToLib import genbank_to_faa
# Simple headers
for seqid, protid, sequence in genbank_to_faa('genome.gbk'):
print(f">{protid}")
print(sequence)
# Complex headers
for seqid, protid, sequence in genbank_to_faa('genome.gbk', complexheader=True):
print(f">{protid}")
print(sequence)
genbank_to_orfs
genbank_to_orfs(gbkf, complexheader=False, skip_pseudo=True)
Extract DNA sequences of open reading frames (ORFs) from a GenBank file.
Parameters:
gbkf(str): Path to the GenBank file.complexheader(bool, optional): If True, creates detailed headers. Default is False.skip_pseudo(bool, optional): If True, skips pseudogenes. Default is True.
Yields:
tuple: (sequence_id, protein_id, sequence) where:sequence_id(str): The parent sequence identifierprotein_id(str): The protein/feature identifiersequence(str): The DNA sequence of the ORF
Example:
from GenBankToLib import genbank_to_orfs
for seqid, protid, sequence in genbank_to_orfs('genome.gbk'):
print(f">{protid}")
print(sequence)
genbank_to_functions
genbank_to_functions(gbkf, seqid=False, skip_pseudo=True)
Extract protein functions (products) from a GenBank file.
Parameters:
gbkf(str): Path to the GenBank file.seqid(bool, optional): If True, includes the sequence ID in the output. Default is False.skip_pseudo(bool, optional): If True, skips pseudogenes. Default is True.
Yields:
tuple: If seqid=False: (protein_id, function)tuple: If seqid=True: (sequence_id, protein_id, function)
Example:
from GenBankToLib import genbank_to_functions
# Without sequence ID
for protid, function in genbank_to_functions('genome.gbk'):
print(f"{protid}\t{function}")
# With sequence ID
for seqid, protid, function in genbank_to_functions('genome.gbk', seqid=True):
print(f"{seqid}\t{protid}\t{function}")
genbank_to_ptt
genbank_to_ptt(gbkf, printout=False)
Convert GenBank file to NCBI Protein Table (PTT) format.
Parameters:
gbkf(str): Path to the GenBank file.printout(bool, optional): If True, prints the table to stdout. Default is False.
Returns:
list: List of lists, where each inner list contains: [location, strand, length, gi, gene, synonym, cog, product]
Example:
from GenBankToLib import genbank_to_ptt
table = genbank_to_ptt('genome.gbk')
for row in table:
print('\t'.join(map(str, row)))
genbank_to_gff
genbank_to_gff(gbkf, out_gff)
Convert GenBank file to GFF3 format.
Parameters:
gbkf(str): Path to the GenBank file.out_gff(str): Path to the output GFF3 file.
Returns:
None: Writes output directly to file.
Example:
from GenBankToLib import genbank_to_gff
genbank_to_gff('genome.gbk', 'genome.gff3')
genbank_to_phage_finder
genbank_to_phage_finder(gbkf)
Convert GenBank file to phage_finder format.
Parameters:
gbkf(str): Path to the GenBank file.
Yields:
list: [contig_id, contig_length, gene_id, start, end, function]
Example:
from GenBankToLib import genbank_to_phage_finder
for row in genbank_to_phage_finder('genome.gbk'):
print('\t'.join(map(str, row)))
genbank_to_amrfinder
genbank_to_amrfinder(gbkf, amrout)
Convert GenBank file to AMRFinderPlus format.
Parameters:
gbkf(str): Path to the GenBank file.amrout(str): Base name for output files (creates .gff, .faa, and .fna files).
Returns:
None: Writes output directly to files.
Example:
from GenBankToLib import genbank_to_amrfinder
genbank_to_amrfinder('genome.gbk', 'output')
# Creates: output.gff, output.faa, output.fna
genbank_to_json
genbank_to_json(genbank_path, genome_info)
Convert GenBank file to JSON format with comprehensive metadata.
Parameters:
genbank_path(str): Path to the GenBank file.genome_info(dict): Dictionary containing optional genome information:gram(str or None): Gram stain result (‘+’ for Gram-positive, ‘-’ for Gram-negative, or None to infer)translation_table(int or None): NCBI translation table number (default: 11)
Returns:
dict: Complete JSON data structure containing:genome: Genome metadata (genus, species, strain, gram, translation_table, complete)stats: Genome statistics (no_sequences, size, gc, n_ratio, n50, coding_ratio)features: List of feature dictionaries (CDS, tRNA, rRNA, tmRNA, ncRNA)sequences: List of sequence objects with metadataversion: Version information
Example:
from GenBankToLib import genbank_to_json
import json
# Basic usage with defaults
genome_info = {
'gram': None,
'translation_table': 11
}
json_data = genbank_to_json('genome.gbk', genome_info)
# Save to file
with open('genome.json', 'w') as f:
json.dump(json_data, f, indent=2)
# Specify Gram stain and translation table
genome_info = {
'gram': '-',
'translation_table': 11
}
json_data = genbank_to_json('ecoli.gbk', genome_info)
# Access the data
print(f"Genome: {json_data['genome']['genus']} {json_data['genome']['species']}")
print(f"Size: {json_data['stats']['size']} bp")
print(f"GC content: {json_data['stats']['gc']:.2%}")
print(f"Features: {len(json_data['features'])}")
JSON Output Structure:
The output JSON contains the following structure:
{
"genome": {
"genus": "Escherichia",
"species": "coli",
"strain": "K-12",
"complete": true,
"gram": "-",
"translation_table": 11
},
"stats": {
"no_sequences": 1,
"size": 4641652,
"gc": 0.5079,
"n_ratio": 0.0,
"n50": 4641652,
"coding_ratio": 0.8756
},
"features": [
{
"type": "CDS",
"contig": "NC_000913",
"start": 190,
"stop": 255,
"strand": 1,
"gene": "thrL",
"locus": "b0001",
"product": "thr operon leader peptide",
"nt": "ATGAAACGC...",
"aa": "MKRISTT...",
"aa_hexdigest": "abc123...",
"frame": 0,
"start_type": "ATG",
"rbs_motif": "AGGAG"
}
],
"sequences": [
{
"id": "NC_000913",
"description": "Escherichia coli str. K-12 substr. MG1655, complete genome",
"sequence": "AGCTTTTCATTC...",
"length": 4641652,
"complete": true,
"type": "chromosome",
"topology": "circular"
}
],
"version": {
"genbank_to": "1.2.3"
}
}
Notes:
The JSON format is compatible with Bakta annotation output
Coordinates are 1-based inclusive (start, stop)
Frame is 0-based (0, 1, 2) where GenBank codon_start values 1, 2, 3 are converted to frame values 0, 1, 2 respectively
GC content is calculated as (G+C)/(A+C+G+T), ignoring ambiguous bases
MD5 hexdigest is provided for amino acid sequences in the
aa_hexdigestfieldGram stain can be inferred from genus if not provided
genbank_seqio
genbank_seqio(gbkf)
Get a BioPython SeqIO parser for the GenBank file.
Parameters:
gbkf(str): Path to the GenBank file (can be gzipped).
Returns:
tuple: (parser, file_handle) where:parser: BioPython SeqIO parser objectfile_handle: File handle (should be closed when done)
Example:
from GenBankToLib import genbank_seqio
parser, handle = genbank_seqio('genome.gbk')
for record in parser:
print(f"Processing {record.id}")
# Do something with record
handle.close()
Utility Functions
feature_id
feature_id(seq, feat)
Choose the appropriate identifier for a feature.
Parameters:
seq(Bio.SeqRecord.SeqRecord): The sequence record.feat(Bio.SeqFeature.SeqFeature): The feature.
Returns:
str: The feature ID (protein_id, locus_tag, db_xref, or location-based ID).
Example:
from Bio import SeqIO
from GenBankToLib.genbank import feature_id
for record in SeqIO.parse('genome.gbk', 'genbank'):
for feature in record.features:
if feature.type == 'CDS':
fid = feature_id(record, feature)
print(fid)
is_gzip
is_gzip(gbkf)
Check if a file is gzip compressed.
Parameters:
gbkf(str): Path to the file.
Returns:
bool: True if the file is gzip compressed, False otherwise.
Example:
from GenBankToLib.genbank import is_gzip
if is_gzip('genome.gbk.gz'):
print("File is compressed")
Data Structures
Bacteria Module
The bacteria module contains sets of Gram-positive and Gram-negative bacterial genera.
from GenBankToLib.bacteria import gram_positive, gram_negative
# Check if a genus is Gram-positive
if 'Escherichia' in gram_negative:
print("Escherichia is Gram-negative")
# Get all Gram-positive genera
print(f"Gram-positive genera: {len(gram_positive)}")
Version Information
from GenBankToLib import __version__
print(f"genbank_to version: {__version__}")
Advanced Usage
Processing Large Files
For large GenBank files, the generator-based functions are memory-efficient:
from GenBankToLib import genbank_to_faa
# Process one protein at a time without loading entire file into memory
protein_count = 0
total_length = 0
for seqid, protid, sequence in genbank_to_faa('large_genome.gbk'):
protein_count += 1
total_length += len(sequence)
avg_length = total_length / protein_count
print(f"Processed {protein_count} proteins, avg length: {avg_length:.1f}")
Custom Processing Pipeline
from GenBankToLib import genbank_to_json
import json
# Build a comprehensive genome database with genbank_to_json
genome_info = {
'gram': '-',
'translation_table': 11
}
# Convert to JSON format
genome_data = genbank_to_json('genome.gbk', genome_info)
# Process the data
print(f"Genome: {genome_data['genome']['genus']} {genome_data['genome']['species']}")
print(f"Total size: {genome_data['stats']['size']} bp")
print(f"GC content: {genome_data['stats']['gc']:.2%}")
print(f"Coding ratio: {genome_data['stats']['coding_ratio']:.2%}")
# Extract specific features
cds_features = [f for f in genome_data['features'] if f['type'] == 'CDS']
print(f"CDS features: {len(cds_features)}")
# Save to JSON file
with open('genome_analysis.json', 'w') as f:
json.dump(genome_data, f, indent=2)
Working with BioPython
genbank_to functions integrate seamlessly with BioPython:
from GenBankToLib import genbank_seqio
from Bio import SeqIO
from Bio.SeqUtils import molecular_weight
# Get molecular weights of all proteins
parser, handle = genbank_seqio('genome.gbk')
for record in parser:
for feature in record.features:
if feature.type == 'CDS' and 'translation' in feature.qualifiers:
protein_seq = feature.qualifiers['translation'][0]
mw = molecular_weight(protein_seq, seq_type='protein')
print(f"{feature.qualifiers.get('protein_id', ['Unknown'])[0]}: {mw:.2f} Da")
handle.close()
Error Handling
Example of proper error handling:
from GenBankToLib import genbank_to_faa
import logging
logging.basicConfig(level=logging.INFO)
try:
for seqid, protid, sequence in genbank_to_faa('genome.gbk'):
# Process sequence
pass
except FileNotFoundError:
logging.error("GenBank file not found")
except Exception as e:
logging.error(f"Error processing GenBank file: {e}")
Type Hints
The library uses type hints for better IDE support:
from typing import Iterator, Tuple
from Bio.Seq import Seq
def process_proteins(gbkf: str) -> Iterator[Tuple[str, str, str]]:
"""
Process proteins from a GenBank file.
Args:
gbkf: Path to GenBank file
Yields:
Tuple of (sequence_id, protein_id, sequence)
"""
from GenBankToLib import genbank_to_faa
return genbank_to_faa(gbkf)