Output Formats
==============

This page describes all the output formats supported by genbank_to in detail.

FASTA Formats
-------------

Nucleotide FASTA (Genome)
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Option**: ``-n``, ``--nucleotide``

**Description**: Outputs the complete nucleotide sequence(s) from the GenBank file.

**Format**: Standard FASTA format with sequence ID from the GenBank LOCUS or ACCESSION field.

**Example Output**:

.. code-block:: text

   >NC_001417.1
   GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
   GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
   ...

**Use Cases**:

- Reference genome for mapping
- Input for genome assembly comparison
- Sequence extraction for primers/probes

ORF Nucleotide FASTA
~~~~~~~~~~~~~~~~~~~~

**Option**: ``-o``, ``--orfs``

**Description**: Outputs the DNA sequences of all coding sequences (CDS features).

**Format**: FASTA format with protein IDs as headers (or locus tags if protein IDs are not available).

**Example Output**:

.. code-block:: text

   >NP_040703.1
   ATGGTTAGCAAAATCGAACGTGCAAAGATTGATGATATTAATATTTTTATTGAAAATCACCAGAAAGATA
   TAGACTATCTTTGGCAACGTATACCGATGAAATCATTAAAGACTTAAAAGTTGAGCGCTTTGATACGAGT
   ...

**Use Cases**:

- Gene cloning
- Codon usage analysis
- Primer design
- Gene synthesis

Protein FASTA
~~~~~~~~~~~~~

**Option**: ``-a``, ``--aminoacids``

**Description**: Outputs amino acid sequences for all coding sequences.

**Format**: FASTA format with protein IDs. If ``--complex`` is specified, includes detailed annotation.

**Example Output (Simple)**:

.. code-block:: text

   >NP_040703.1
   MVSKIERCKILMIINFLIEIHQKDIDYLWQRIPEIIIKDLKVERFDDTVKVGGYKKGGLVQPGGSLRLYE
   VDEKGHFPENVVYDGDTVVADDTLYLVAVLDERKMKGINTRELLESYFDRRGFRLPVGHIDNKPGFNVK
   *

**Example Output (Complex with** ``--complex`` **)**:

.. code-block:: text

   >NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905] gpA DNA replication protein [GeneID:1261050]

**Use Cases**:

- Protein homology searches (BLAST)
- Phylogenetic analysis
- Functional annotation
- Structure prediction

Structured Formats
------------------

GFF3 (Generic Feature Format)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Option**: ``--gff3``

**Description**: Outputs annotations in GFF3 format, a standardized format for genomic features.

**Format**: Tab-delimited with 9 columns: seqid, source, type, start, end, score, strand, phase, attributes.

**Example Output**:

.. code-block:: text

   ##gff-version 3
   ##sequence-region NC_001417 1 5386
   NC_001417  GenBank  region  1   5386  .  +  .  ID=NC_001417:1..5386;...
   NC_001417  GenBank  gene    51  1905  .  +  .  ID=gene-phiX174p01;...
   NC_001417  GenBank  CDS     51  1905  .  +  0  ID=cds-NP_040703.1;...

**Use Cases**:

- Genome browsers (IGV, JBrowse)
- Comparative genomics tools
- Annotation pipelines
- Data interchange

NCBI PTT Format
~~~~~~~~~~~~~~~

**Option**: ``-p``, ``--ptt``

**Description**: Protein table format formerly used by NCBI for genome downloads.

**Format**: Tab-delimited table with columns: Location, Strand, Length, PID, Gene, Synonym, COG, Product.

**Example Output**:

.. code-block:: text

   51..1905    +  617  -  gpA  NP_040703.1  -  DNA replication protein
   1906..2079  +  57   -  gpB  NP_040704.1  -  capsid morphogenesis protein
   2092..2529  +  145  -  gpD  NP_040705.1  -  capsid morphogenesis protein

**Use Cases**:

- Legacy pipeline compatibility
- Quick protein overview
- Tab-delimited data processing

Function Table
~~~~~~~~~~~~~~

**Option**: ``-f``, ``--functions``

**Description**: Simple two-column table mapping protein IDs to their functional annotations.

**Format**: Tab-separated values with columns: Protein ID, Function.

**Example Output**:

.. code-block:: text

   NC_001417  NP_040703.1  DNA replication protein
   NC_001417  NP_040704.1  capsid morphogenesis protein
   NC_001417  NP_040705.1  capsid morphogenesis protein
   NC_001417  NP_040706.1  DNA maturase protein B

**Use Cases**:

- Functional enrichment analysis
- Database imports
- Quick annotation lookup
- Spreadsheet analysis

Specialized Formats
-------------------

Bakta JSON
~~~~~~~~~~

**Option**: ``--bakta-json``

**Description**: JSON format compatible with Bakta genome annotation output. Includes comprehensive metadata and feature annotations.

**Additional Options**:

- ``--bakta-version``: Version string
- ``--db-version``: Database version
- ``--genus``, ``--species``, ``--strain``: Organism information
- ``--gram``: Gram stain (+/-)
- ``--translation-table``: Genetic code

**Example Output**:

.. code-block:: json

   {
     "version": "1.0",
     "genome": {
       "genus": "Enterobacteria",
       "species": "phage phiX174",
       "strain": "Sangier",
       "gram": "-"
     },
     "sequences": [
       {
         "id": "NC_001417",
         "length": 5386,
         "gc": 0.447,
         "features": [
           {
             "type": "cds",
             "start": 51,
             "stop": 1905,
             "strand": "+",
             "product": "DNA replication protein"
           }
         ]
       }
     ]
   }

**Use Cases**:

- Bakta pipeline integration
- Structured data analysis
- Web applications
- Database storage

**Python API**:

For programmatic access, use the ``genbank_to_json`` function from the library:

.. code-block:: python

   from GenBankToLib import genbank_to_json

   genome_info = {'gram': '-', 'translation_table': 11}
   json_data = genbank_to_json('genome.gbk', genome_info)

See the :doc:`api` documentation for detailed information on the ``genbank_to_json`` function.

AMRFinderPlus Format
~~~~~~~~~~~~~~~~~~~~

**Option**: ``--amr``

**Description**: Creates three files required by NCBI's AMRFinderPlus tool for antimicrobial resistance gene annotation.

**Output Files**:

1. ``BASENAME.gff`` - Modified GFF3 with Name attributes
2. ``BASENAME.faa`` - Protein sequences
3. ``BASENAME.fna`` - Nucleotide sequences

**Special Features**:

- Validates format for AMRFinderPlus compatibility
- Adds required Name fields to GFF
- Excludes pseudogenes

**Use Cases**:

- AMR gene detection
- Resistance profile analysis
- Public health surveillance
- Clinical microbiology

Phage Finder Format
~~~~~~~~~~~~~~~~~~~

**Option**: ``--phage_finder``

**Description**: Tab-delimited format required by the phage_finder tool for prophage identification.

**Format**: Tab-separated with columns: Contig ID, Contig Length, Gene ID, Start, End, Function.

**Example Output**:

.. code-block:: text

   NC_001417  5386  NP_040703.1  51    1905  DNA replication protein
   NC_001417  5386  NP_040704.1  1906  2079  capsid morphogenesis protein
   NC_001417  5386  NP_040705.1  2092  2529  capsid morphogenesis protein

**Use Cases**:

- Prophage detection in bacterial genomes
- Phage-host interaction studies
- Comparative phage genomics

Output Modifiers
----------------

Separate Files
~~~~~~~~~~~~~~

**Option**: ``--separate``

**Description**: When working with multi-record GenBank files, creates separate output files for each sequence record.

**Naming Convention**: ``BASENAME.SEQID.EXTENSION``

**Example**:

.. code-block:: bash

   genbank_to -g multi.gbk --separate -n output
   # Creates: output.NC_001417.fna, output.NC_001418.fna, etc.

Sequence ID Filtering
~~~~~~~~~~~~~~~~~~~~~

**Option**: ``-i``, ``--seqid``

**Description**: Filters output to include only specified sequence IDs. Can be used multiple times.

**Example**:

.. code-block:: bash

   genbank_to -g multi.gbk -i NC_001417 -i NC_001418 -n output.fna

Complex Headers
~~~~~~~~~~~~~~~

**Option**: ``--complex``

**Description**: Adds detailed information to FASTA headers including organism, location, product, and database cross-references.

**Example**:

.. code-block:: text

   >NP_040703.1 [NC_001417] [Enterobacteria phage phiX174] [NC_001417_51_1905] DNA replication protein [GeneID:1261050]

Compression
~~~~~~~~~~~

**Option**: ``-z``, ``--zip``

**Description**: Compresses output files using gzip. Experimental feature.

**Example**:

.. code-block:: bash

   genbank_to -g genome.gbk -f functions.tsv --zip
   # Creates: functions.tsv.gz

Format Comparison
-----------------

.. list-table:: Output Format Comparison
   :header-rows: 1
   :widths: 20 20 20 40

   * - Format
     - Type
     - Primary Use
     - Tools
   * - Nucleotide FASTA
     - Sequence
     - Genome storage
     - BWA, Bowtie, BLAST
   * - Protein FASTA
     - Sequence
     - Protein analysis
     - BLAST, DIAMOND, InterProScan
   * - ORF FASTA
     - Sequence
     - Gene analysis
     - Gene synthesis, primers
   * - GFF3
     - Annotation
     - Feature storage
     - IGV, JBrowse, bedtools
   * - PTT
     - Table
     - Legacy compatibility
     - Custom scripts
   * - Functions
     - Table
     - Annotation lookup
     - Spreadsheets, R/Python
   * - Bakta JSON
     - Structured
     - Data interchange
     - Bakta, web apps
   * - AMRFinder
     - Specialized
     - AMR detection
     - AMRFinderPlus
   * - Phage Finder
     - Specialized
     - Prophage detection
     - phage_finder
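
The table-style outputs listed above are plain tab-separated text, so they can be read with the Python standard library alone. The following is a minimal sketch, not part of genbank_to: the helper names ``open_text`` and ``read_function_table`` are hypothetical, and the parser assumes the three-column layout shown in the Function Table example output (sequence ID, protein ID, function). If your version of the tool writes a different number of columns, adjust the unpacking accordingly.

.. code-block:: python

   import csv
   import gzip
   import io
   from collections import defaultdict


   def open_text(path):
       """Open a plain or gzip-compressed file for reading as text.

       Hypothetical helper: treats any path ending in ``.gz`` as gzip,
       which matches the ``functions.tsv.gz`` name shown for ``--zip``.
       """
       if str(path).endswith(".gz"):
           return gzip.open(path, "rt", newline="")
       return open(path, "r", newline="")


   def read_function_table(handle):
       """Group (protein ID, function) pairs by sequence ID.

       Assumes three tab-separated columns per row, as in the example
       output: sequence ID, protein ID, function.
       """
       table = defaultdict(list)
       # QUOTE_NONE keeps any quote characters in product names intact.
       reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
       for row in reader:
           if len(row) != 3:
               continue  # skip blank or malformed rows
           seqid, protein_id, function = row
           table[seqid].append((protein_id, function))
       return dict(table)


   # Demonstration on in-memory rows copied from the example output.
   example = (
       "NC_001417\tNP_040703.1\tDNA replication protein\n"
       "NC_001417\tNP_040704.1\tcapsid morphogenesis protein\n"
   )
   functions = read_function_table(io.StringIO(example))
   for protein_id, function in functions["NC_001417"]:
       print(protein_id, function, sep="\t")

To read a real file instead of the in-memory example, pass an open handle, for instance ``with open_text("functions.tsv.gz") as fh: functions = read_function_table(fh)``.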