API Reference ============= This page documents the Python library API for genbank_to. All functions can be imported from the ``GenBankToLib`` package. .. code-block:: python from GenBankToLib import ( genbank_to_faa, genbank_to_fna, genbank_to_orfs, genbank_to_functions, genbank_to_ptt, genbank_to_gff, genbank_to_phage_finder, genbank_to_amrfinder, genbank_seqio, genbank_to_json ) Core Functions -------------- genbank_to_fna ~~~~~~~~~~~~~~ .. code-block:: python genbank_to_fna(gbkf, include_definition=False) Extract nucleotide sequences from a GenBank file. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``include_definition`` (bool, optional): If True, includes the GenBank definition line with the sequence ID. Default is False. **Yields:** - ``tuple``: (sequence_id, sequence) where: - ``sequence_id`` (str): The sequence identifier - ``sequence`` (Bio.Seq.Seq): The nucleotide sequence **Example:** .. code-block:: python from GenBankToLib import genbank_to_fna for seqid, sequence in genbank_to_fna('genome.gbk'): print(f">{seqid}") print(sequence) genbank_to_faa ~~~~~~~~~~~~~~ .. code-block:: python genbank_to_faa(gbkf, complexheader=False, skip_pseudo=True) Extract amino acid (protein) sequences from a GenBank file. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``complexheader`` (bool, optional): If True, creates detailed headers with organism, location, product, and database references. Default is False. - ``skip_pseudo`` (bool, optional): If True, skips pseudogenes. Default is True. **Yields:** - ``tuple``: (sequence_id, protein_id, sequence) where: - ``sequence_id`` (str): The parent sequence identifier - ``protein_id`` (str): The protein identifier (protein_id, locus_tag, or db_xref) - ``sequence`` (str): The amino acid sequence **Example:** .. code-block:: python from GenBankToLib import genbank_to_faa # Simple headers for seqid, protid, sequence in genbank_to_faa('genome.gbk'): print(f">{protid}") print(sequence) # Complex headers for seqid, protid, sequence in genbank_to_faa('genome.gbk', complexheader=True): print(f">{protid}") print(sequence) genbank_to_orfs ~~~~~~~~~~~~~~~ .. code-block:: python genbank_to_orfs(gbkf, complexheader=False, skip_pseudo=True) Extract DNA sequences of open reading frames (ORFs) from a GenBank file. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``complexheader`` (bool, optional): If True, creates detailed headers. Default is False. - ``skip_pseudo`` (bool, optional): If True, skips pseudogenes. Default is True. **Yields:** - ``tuple``: (sequence_id, protein_id, sequence) where: - ``sequence_id`` (str): The parent sequence identifier - ``protein_id`` (str): The protein/feature identifier - ``sequence`` (str): The DNA sequence of the ORF **Example:** .. code-block:: python from GenBankToLib import genbank_to_orfs for seqid, protid, sequence in genbank_to_orfs('genome.gbk'): print(f">{protid}") print(sequence) genbank_to_functions ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python genbank_to_functions(gbkf, seqid=False, skip_pseudo=True) Extract protein functions (products) from a GenBank file. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``seqid`` (bool, optional): If True, includes the sequence ID in the output. Default is False. - ``skip_pseudo`` (bool, optional): If True, skips pseudogenes. Default is True. **Yields:** - ``tuple``: If seqid=False: (protein_id, function) - ``tuple``: If seqid=True: (sequence_id, protein_id, function) **Example:** .. code-block:: python from GenBankToLib import genbank_to_functions # Without sequence ID for protid, function in genbank_to_functions('genome.gbk'): print(f"{protid}\t{function}") # With sequence ID for seqid, protid, function in genbank_to_functions('genome.gbk', seqid=True): print(f"{seqid}\t{protid}\t{function}") genbank_to_ptt ~~~~~~~~~~~~~~ .. code-block:: python genbank_to_ptt(gbkf, printout=False) Convert GenBank file to NCBI Protein Table (PTT) format. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``printout`` (bool, optional): If True, prints the table to stdout. Default is False. **Returns:** - ``list``: List of lists, where each inner list contains: [location, strand, length, gi, gene, synonym, cog, product] **Example:** .. code-block:: python from GenBankToLib import genbank_to_ptt table = genbank_to_ptt('genome.gbk') for row in table: print('\t'.join(map(str, row))) genbank_to_gff ~~~~~~~~~~~~~~ .. code-block:: python genbank_to_gff(gbkf, out_gff) Convert GenBank file to GFF3 format. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``out_gff`` (str): Path to the output GFF3 file. **Returns:** - ``None``: Writes output directly to file. **Example:** .. code-block:: python from GenBankToLib import genbank_to_gff genbank_to_gff('genome.gbk', 'genome.gff3') genbank_to_phage_finder ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python genbank_to_phage_finder(gbkf) Convert GenBank file to phage_finder format. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. **Yields:** - ``list``: [contig_id, contig_length, gene_id, start, end, function] **Example:** .. code-block:: python from GenBankToLib import genbank_to_phage_finder for row in genbank_to_phage_finder('genome.gbk'): print('\t'.join(map(str, row))) genbank_to_amrfinder ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python genbank_to_amrfinder(gbkf, amrout) Convert GenBank file to AMRFinderPlus format. **Parameters:** - ``gbkf`` (str): Path to the GenBank file. - ``amrout`` (str): Base name for output files (creates .gff, .faa, and .fna files). **Returns:** - ``None``: Writes output directly to files. **Example:** .. code-block:: python from GenBankToLib import genbank_to_amrfinder genbank_to_amrfinder('genome.gbk', 'output') # Creates: output.gff, output.faa, output.fna genbank_to_json ~~~~~~~~~~~~~~~ .. code-block:: python genbank_to_json(genbank_path, genome_info) Convert GenBank file to JSON format with comprehensive metadata. **Parameters:** - ``genbank_path`` (str): Path to the GenBank file. - ``genome_info`` (dict): Dictionary containing optional genome information: - ``gram`` (str or None): Gram stain result ('+' for Gram-positive, '-' for Gram-negative, or None to infer) - ``translation_table`` (int or None): NCBI translation table number (default: 11) **Returns:** - ``dict``: Complete JSON data structure containing: - ``genome``: Genome metadata (genus, species, strain, gram, translation_table, complete) - ``stats``: Genome statistics (no_sequences, size, gc, n_ratio, n50, coding_ratio) - ``features``: List of feature dictionaries (CDS, tRNA, rRNA, tmRNA, ncRNA) - ``sequences``: List of sequence objects with metadata - ``version``: Version information **Example:** .. code-block:: python from GenBankToLib import genbank_to_json import json # Basic usage with defaults genome_info = { 'gram': None, 'translation_table': 11 } json_data = genbank_to_json('genome.gbk', genome_info) # Save to file with open('genome.json', 'w') as f: json.dump(json_data, f, indent=2) # Specify Gram stain and translation table genome_info = { 'gram': '-', 'translation_table': 11 } json_data = genbank_to_json('ecoli.gbk', genome_info) # Access the data print(f"Genome: {json_data['genome']['genus']} {json_data['genome']['species']}") print(f"Size: {json_data['stats']['size']} bp") print(f"GC content: {json_data['stats']['gc']:.2%}") print(f"Features: {len(json_data['features'])}") **JSON Output Structure:** The output JSON contains the following structure: .. code-block:: json { "genome": { "genus": "Escherichia", "species": "coli", "strain": "K-12", "complete": true, "gram": "-", "translation_table": 11 }, "stats": { "no_sequences": 1, "size": 4641652, "gc": 0.5079, "n_ratio": 0.0, "n50": 4641652, "coding_ratio": 0.8756 }, "features": [ { "type": "CDS", "contig": "NC_000913", "start": 190, "stop": 255, "strand": 1, "gene": "thrL", "locus": "b0001", "product": "thr operon leader peptide", "nt": "ATGAAACGC...", "aa": "MKRISTT...", "aa_hexdigest": "abc123...", "frame": 0, "start_type": "ATG", "rbs_motif": "AGGAG" } ], "sequences": [ { "id": "NC_000913", "description": "Escherichia coli str. K-12 substr. MG1655, complete genome", "sequence": "AGCTTTTCATTC...", "length": 4641652, "complete": true, "type": "chromosome", "topology": "circular" } ], "version": { "genbank_to": "1.2.3" } } **Notes:** - The JSON format is compatible with Bakta annotation output - Coordinates are 1-based inclusive (start, stop) - Frame is 0-based (0, 1, 2) where GenBank codon_start values 1, 2, 3 are converted to frame values 0, 1, 2 respectively - GC content is calculated as (G+C)/(A+C+G+T), ignoring ambiguous bases - MD5 hexdigest is provided for amino acid sequences in the ``aa_hexdigest`` field - Gram stain can be inferred from genus if not provided genbank_seqio ~~~~~~~~~~~~~ .. code-block:: python genbank_seqio(gbkf) Get a BioPython SeqIO parser for the GenBank file. **Parameters:** - ``gbkf`` (str): Path to the GenBank file (can be gzipped). **Returns:** - ``tuple``: (parser, file_handle) where: - ``parser``: BioPython SeqIO parser object - ``file_handle``: File handle (should be closed when done) **Example:** .. code-block:: python from GenBankToLib import genbank_seqio parser, handle = genbank_seqio('genome.gbk') for record in parser: print(f"Processing {record.id}") # Do something with record handle.close() Utility Functions ----------------- feature_id ~~~~~~~~~~ .. code-block:: python feature_id(seq, feat) Choose the appropriate identifier for a feature. **Parameters:** - ``seq`` (Bio.SeqRecord.SeqRecord): The sequence record. - ``feat`` (Bio.SeqFeature.SeqFeature): The feature. **Returns:** - ``str``: The feature ID (protein_id, locus_tag, db_xref, or location-based ID). **Example:** .. code-block:: python from Bio import SeqIO from GenBankToLib.genbank import feature_id for record in SeqIO.parse('genome.gbk', 'genbank'): for feature in record.features: if feature.type == 'CDS': fid = feature_id(record, feature) print(fid) is_gzip ~~~~~~~ .. code-block:: python is_gzip(gbkf) Check if a file is gzip compressed. **Parameters:** - ``gbkf`` (str): Path to the file. **Returns:** - ``bool``: True if the file is gzip compressed, False otherwise. **Example:** .. code-block:: python from GenBankToLib.genbank import is_gzip if is_gzip('genome.gbk.gz'): print("File is compressed") Data Structures --------------- Bacteria Module ~~~~~~~~~~~~~~~ The ``bacteria`` module contains sets of Gram-positive and Gram-negative bacterial genera. .. code-block:: python from GenBankToLib.bacteria import gram_positive, gram_negative # Check if a genus is Gram-positive if 'Escherichia' in gram_negative: print("Escherichia is Gram-negative") # Get all Gram-positive genera print(f"Gram-positive genera: {len(gram_positive)}") Version Information ------------------- .. code-block:: python from GenBankToLib import __version__ print(f"genbank_to version: {__version__}") Advanced Usage -------------- Processing Large Files ~~~~~~~~~~~~~~~~~~~~~~ For large GenBank files, the generator-based functions are memory-efficient: .. code-block:: python from GenBankToLib import genbank_to_faa # Process one protein at a time without loading entire file into memory protein_count = 0 total_length = 0 for seqid, protid, sequence in genbank_to_faa('large_genome.gbk'): protein_count += 1 total_length += len(sequence) avg_length = total_length / protein_count print(f"Processed {protein_count} proteins, avg length: {avg_length:.1f}") Custom Processing Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from GenBankToLib import genbank_to_json import json # Build a comprehensive genome database with genbank_to_json genome_info = { 'gram': '-', 'translation_table': 11 } # Convert to JSON format genome_data = genbank_to_json('genome.gbk', genome_info) # Process the data print(f"Genome: {genome_data['genome']['genus']} {genome_data['genome']['species']}") print(f"Total size: {genome_data['stats']['size']} bp") print(f"GC content: {genome_data['stats']['gc']:.2%}") print(f"Coding ratio: {genome_data['stats']['coding_ratio']:.2%}") # Extract specific features cds_features = [f for f in genome_data['features'] if f['type'] == 'CDS'] print(f"CDS features: {len(cds_features)}") # Save to JSON file with open('genome_analysis.json', 'w') as f: json.dump(genome_data, f, indent=2) Working with BioPython ~~~~~~~~~~~~~~~~~~~~~~ genbank_to functions integrate seamlessly with BioPython: .. code-block:: python from GenBankToLib import genbank_seqio from Bio import SeqIO from Bio.SeqUtils import molecular_weight # Get molecular weights of all proteins parser, handle = genbank_seqio('genome.gbk') for record in parser: for feature in record.features: if feature.type == 'CDS' and 'translation' in feature.qualifiers: protein_seq = feature.qualifiers['translation'][0] mw = molecular_weight(protein_seq, seq_type='protein') print(f"{feature.qualifiers.get('protein_id', ['Unknown'])[0]}: {mw:.2f} Da") handle.close() Error Handling -------------- Example of proper error handling: .. code-block:: python from GenBankToLib import genbank_to_faa import logging logging.basicConfig(level=logging.INFO) try: for seqid, protid, sequence in genbank_to_faa('genome.gbk'): # Process sequence pass except FileNotFoundError: logging.error("GenBank file not found") except Exception as e: logging.error(f"Error processing GenBank file: {e}") Type Hints ---------- The library uses type hints for better IDE support: .. code-block:: python from typing import Iterator, Tuple from Bio.Seq import Seq def process_proteins(gbkf: str) -> Iterator[Tuple[str, str, str]]: """ Process proteins from a GenBank file. Args: gbkf: Path to GenBank file Yields: Tuple of (sequence_id, protein_id, sequence) """ from GenBankToLib import genbank_to_faa return genbank_to_faa(gbkf)