RiboDB : a prokaryotic ribosomal components DataBase

RiboDB v.3.1, release 16.1, (April 25 2023)
This is the latest build of the database, with more genomes, extended biodiversity, a simpler scheme, and improved final quality control of the ribosomal proteins and of the 16S/23S/5S rDNA sequences.
The aim of this work is primarily to facilitate the use of the ribosomal components in phylogeny.
You can list the genomes you want, count them and extract all the corresponding ribosomal proteins or rDNA.

Since release 16.1, 16S rDNA, 23S rDNA and 5S rDNA are included (proteins from rel. 16.0, April 25 2023). RiboDB enables exploring the protein and RNA composition of the ribosomes.
There is no table of the ribosomal proteins or rDNA content of the genomes here.
RIBODB V.3 CONTENT
RiboDB currently contains nucleic and protein sequences of ribosomal proteins from 212,630 genomes of Bacteria (206,597) and Archaea (5,879). The aim of this work is to facilitate the use of ribosomal proteins in phylogeny.
RiboDB also contains the rDNA sequences when they are available in the genomes. For 16S rDNA, 141,537 bacterial genomes representing 17,637 species-level names are available (Archaea: 2,422 genomes and 802 species-level names).
Because genomes can carry multiple operons, the 16S rDNA content totals 324,883 sequences.

     • Note that "highly populated species" (i.e. species with more than 1,000 genome sequences in RefSeq) are represented by only ~1,000 genomes, selected for representativeness and quality.
QUERY CONSTRUCTION
RiboDB allows two types of queries:
     • The retrieval of information on strains and genomes for which ribosomal proteins and/or rDNA are available in RiboDB. These queries must begin with the tag "@". For instance, "@Cyanobacteria" will return the list of, and information on, all Cyanobacteria (phylum) strains/genomes contained in RiboDB.
           • Adding "%" at the end of a query line will return the number of strains / genomes for the corresponding taxon. For instance, "@Escherichia_coli %" will return the number of Escherichia coli strains / genomes contained in RiboDB.
     • The retrieval of ribosomal protein nucleic and protein sequences and/or rDNA sequences for given sets of taxa or genomes. These queries must begin with the tag "#". For instance, "#Bacillus" will return r-prot sequences of all Bacillus (genus) genomes contained in RiboDB.
The two types of queries are mutually exclusive.
RiboDB allows multiple queries of the same type at once, by listing the queries on separate lines:
     • @Bacteria (@ = list of the genomes)
     • @Archaea
Or
     • #Streptococcus_pneumoniae (# = extraction of the ribosomal proteins)
     • #Bacillus_subtilis
For additional details see (below) "Query construction" and "Structure of the FASTA commentary line"
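The query rules above can be sketched as a small helper that prepares the text to paste into the query box. This is only an illustration: `build_query` and its parameters are hypothetical names, not part of RiboDB, and the replacement of spaces by underscores is inferred from the examples given above.

```python
def build_query(targets, mode="#", count=False):
    """Return query text with one query per line.

    mode  -- "@" to list strains/genomes, "#" to extract sequences.
    count -- append " %" to each line (only described for "@" queries).
    A single mode is applied to every line, so the two query types
    cannot be mixed in one submission.
    """
    if mode not in ("@", "#"):
        raise ValueError("mode must be '@' or '#'")
    if count and mode != "@":
        raise ValueError("'%' counting is only described for '@' queries")
    lines = []
    for target in targets:
        # Taxon names in the examples use underscores instead of spaces.
        name = target.strip().replace(" ", "_")
        lines.append(f"{mode}{name}{' %' if count else ''}")
    return "\n".join(lines)


extraction = build_query(["Streptococcus pneumoniae", "Bacillus subtilis"])
counting = build_query(["Escherichia coli"], mode="@", count=True)
print(extraction)
print(counting)
```

Using one `mode` argument for the whole batch is a deliberate choice here, since the two query types are mutually exclusive.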

Options
Checkboxes can be used to reduce the set of targeted strains / genomes to:
     • type strain material
     • representative/reference strains / genomes
     • genomes from GenBank/RefSeq included in Ensembl! Bacteria
YOUR QUERY
Targeted genomes / taxa (use # to extract r-prots or @ to extract information). If left empty, a random test is launched.
Targeted r-prots and rDNA (delete the unwanted r-prots and the associated semicolons)
R-prots are named according to Ban N, Beckmann R, Cate JHD, et al. A new system for naming ribosomal proteins. Current Opinion in Structural Biology, 2014, vol. 24, p. 165-169 (see also the Ban Lab website).
Options Selection of the subsets




Additional information
Retrieving ribosomal proteins: Queries scan the "FASTA commentary lines" of the ribosomal proteins contained in the database using keywords. The structure of "FASTA commentary lines" is described below.

Most relevant searches target fields corresponding to:
     • Genus, Species, or lineage_report (e.g. #Sodalis_praecaptivus, @Bacillaceae-Bacillus)
     • NCBI_Species_TaxID (e.g. #~1463164)
     • Genome_assembly_number (e.g. #GCF_900890425.1)

To avoid any confusion among taxonomic ranks use "-" at the end of the taxon name when querying RiboDB on lineage report information. Using #Listeria will retrieve both Listeria (genus) and Listeriaceae (family). To retrieve ribosomal proteins from the Listeria genus, use "#Listeria-".
Similarly, use a "~" when querying on TaxID (e.g. "#~1312852")

More generally, any information contained in "FASTA commentary lines" may be queried, but such queries can be risky or return poorly relevant results.
For instance, querying the database with "#Myco" will return information on Mycobacterium, Mycolicibacterium, Mycobacteroides, Mycolicibacter, and other Mycobacteriaceae (Actinobacteria), Mycoplasma (Mycoplasmatales), Mycoplana_dimorpha (an alphaproteobacterium), and Mycoavidus_cysteinexigens (a betaproteobacterium) strains contained in RiboDB.
Similarly, "#myco" will return proteins from Corynebacterium_amycolatum, Amycolatopsis, Streptomyces_antimycoticus, and Actinoplanes_awajinensis_subsp._mycoplanecinus (Actinobacteria), Bacillus_mycoides, Bacillus_paramycoides, Bacillus_pseudomycoides, and Mycoplasma_mycoides (Firmicutes).
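The behaviour described above is what one would expect if a keyword were matched as a case-sensitive substring of the FASTA commentary line. The sketch below assumes exactly that (the actual matching logic of the server is not documented here) and uses shortened, illustrative lineage strings (Family-Genus-Species only); `matches`, `hits` and `lineages` are hypothetical names.

```python
def matches(keyword, header):
    """Case-sensitive substring test (assumed matching rule)."""
    return keyword in header


# Illustrative lineage fragments, shortened to Family-Genus-Species.
lineages = {
    "listeria": "Listeriaceae-Listeria-Listeria_monocytogenes",
    "brochothrix": "Listeriaceae-Brochothrix-Brochothrix_thermosphacta",
    "mycobacterium": "Mycobacteriaceae-Mycobacterium-Mycobacterium_tuberculosis",
    "bacillus": "Bacillaceae-Bacillus-Bacillus_mycoides",
}


def hits(keyword):
    """Return the sorted labels of all lineages containing the keyword."""
    return sorted(label for label, lineage in lineages.items()
                  if matches(keyword, lineage))


print(hits("Listeria"))   # also hits the family name Listeriaceae
print(hits("Listeria-"))  # the trailing "-" anchors the end of the rank name
print(hits("Myco"))
print(hits("myco"))
```

Under this assumption, the trailing "-" works because every rank in the lineage report except the last one is followed by a "-" separator.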

Retrieving information/statistics only: Queries may target the species name (e.g. @Acinetobacter_colistiniresistens), the strain ID (e.g. @NR1165), the genome ID (e.g. @GCF_003227755), the NCBI TaxID (e.g. @TaxId 280145; mind the trailing ";"), or any part of the nomenclature hierarchy (e.g. @-Gammaproteobacteria-, where the "-" may be mandatory in some cases).

Structure of the FASTA commentary line
FASTA commentary lines are built as follows:
>Genus_species|strain_ID#genome_type~genome_assembly_number~contig_number~[position_on_the_genome]~NCBI_Species_TaxID~Genetic_code~Genome_source~Protein_evidence=lineage_report
with:
     • Genus_species: e.g. Pseudomonas_aeruginosa
     • strain_ID: e.g. PAO1
     • genome_type [#T, #R, or #E] with #T = genome tagged as type strain material in RefSeq or GenBank, #R = genome tagged as reference / representative genomes in RefSeq, #E = genome listed in Ensembl! Bacteria
     • genome_quality [#C, #S, #U, or #d] with #C = complete genome, #S = scaffolds, #U = unassembled; #d indicates an origin from metagenomes or another potential loss of assembly quality (e.g. #S#d is a scaffold-level genome from a metagenome).
     • genome_assembly_number: e.g. GCF_000006765.1
     • contig_number: e.g. NC_002516.2
     • position: indicates the position of CDS on the contig, with "C" indicating that the CDS is encoded on the reverse strand, e.g. C[4781985..4782680]
     • NCBI_species_TaxID: corresponds to the species TaxID of the strain
     • Genetic_code: indicates the genetic code for the genome
     • Genome_source [~A. or ~B.] with A = genome from RefSeq, B = genome from GenBank not present in RefSeq
     • Protein_evidence [V or H], written directly after the genome source (e.g. A.V), with V = match between RiboDB and CDS annotations as ribosomal protein, H = if the protein identified by RiboDB is annotated as ribosomal protein.
     • Lineage_report = Domain-Phylum-Class-Order-Family-Genus-Species taxonomic ranks separated by "-": e.g. Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-Pseudomonadaceae-Pseudomonas-Pseudomonas_aeruginosa.
See for example:
>Methanocaldococcus_bathoardescens|JH146#R#T#E#C~GCF_000739065.1~NZ_CP009149.1~C[1571584..1571994]~1301915~11~A.V=Archaea-Euryarchaeota-Methanococci-Methanococcales-Methanocaldococcaceae-Methanocaldococcus-Methanocaldococcus_bathoardescens
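As an illustration of the structure above, the following sketch splits a commentary line into named fields. It is based only on the example line shown here, so several points are assumptions: that strain IDs never contain "|", "#" or "~", that taxon names in the lineage never contain "-", that the genome source and protein evidence are joined by "." as in "A.V", and that rDNA entries follow the same layout. `parse_header` and the dictionary keys are hypothetical names.

```python
import re

# Position field, e.g. "C[1571584..1571994]"; a leading "C" marks the reverse strand.
POSITION = re.compile(r"^(C?)\[(\d+)\.\.(\d+)\]$")


def parse_header(line):
    """Split one FASTA commentary line into a dict of named fields."""
    body, _, lineage = line.strip().lstrip(">").partition("=")
    species, _, rest = body.partition("|")
    fields = rest.split("~")
    if len(fields) != 7:
        raise ValueError(f"expected 7 '~'-separated fields, got {len(fields)}")
    strain_and_flags, assembly, contig, position, taxid, code, evidence = fields
    # The strain ID is followed by "#"-prefixed type and quality flags.
    strain, *flags = strain_and_flags.split("#")
    m = POSITION.match(position)
    if m is None:
        raise ValueError(f"unrecognised position: {position!r}")
    source, _, protein_evidence = evidence.partition(".")
    return {
        "species": species,
        "strain": strain,
        "genome_type": [f for f in flags if f in ("T", "R", "E")],
        "genome_quality": [f for f in flags if f in ("C", "S", "U", "d")],
        "assembly": assembly,
        "contig": contig,
        "reverse_strand": m.group(1) == "C",
        "start": int(m.group(2)),
        "end": int(m.group(3)),
        "taxid": int(taxid),
        "genetic_code": int(code),
        "genome_source": source,
        "protein_evidence": protein_evidence,
        "lineage": lineage.split("-"),
    }


example = (
    ">Methanocaldococcus_bathoardescens|JH146#R#T#E#C~GCF_000739065.1"
    "~NZ_CP009149.1~C[1571584..1571994]~1301915~11~A.V="
    "Archaea-Euryarchaeota-Methanococci-Methanococcales-"
    "Methanocaldococcaceae-Methanocaldococcus-Methanocaldococcus_bathoardescens"
)
record = parse_header(example)
print(record["species"], record["strain"], record["genome_type"], record["lineage"][0])
```

Once headers are parsed this way, downloaded sequences can be filtered locally, for example by keeping only records whose `genome_quality` contains "C".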

Questions: jpdotflandroisatuniv-lyon1dotfr
