RiboDB : a prokaryotic ribosomal proteins DataBase

RiboDB v.3.1, release 12.0, (May 30 2022)
This is the latest database build, with more genomes, extended biodiversity, a simpler scheme, and improved final quality control of the ribosomal proteins.
You can list the genomes you want, count them and extract all the corresponding ribosomal proteins.

This is also a new version of the web interface.
There is no table of the ribosomal proteins content of the genomes here.
RIBODB V.3 CONTENT
RiboDB release 12 contains nucleic and protein sequences of ribosomal proteins from 177,216 genomes of Bacteria and 4,863 of Archaea (182,079 genomes in total). The aim of this work is to facilitate the use of ribosomal proteins in phylogeny:
     • The DB contains all bacterial and archaeal genomes from RefSeq (140,447 Bacteria and 1,242 Archaea).
     • Note that "highly populated species" (i.e. species with more than 1,000 genome sequences in RefSeq) are represented by only 1,000 genomes, selected on the basis of representativity and quality.
QUERY CONSTRUCTION
RiboDB allows two types of queries:
     • The retrieval of information on strains and genomes for which ribosomal proteins are available in RiboDB. These queries must begin with the tag "@". For instance, "@Cyanobacteria" will return the list of, and information on, all Cyanobacteria (phylum) strains/genomes contained in RiboDB.
           • Adding "%" at the end of a query line will return the number of strains/genomes for the corresponding taxon. For instance, "@Escherichia_coli %" will return the number of Escherichia coli strains/genomes contained in RiboDB.
     • The retrieval of ribosomal protein nucleic and protein sequences for given sets of taxa or genomes. These queries must begin with the tag "#". For instance, "#Bacillus" will return the r-prot sequences of all Bacillus (genus) genomes contained in RiboDB.
The two types of queries are mutually exclusive.
RiboDB allows multiple queries of the same type at once, by listing the queries on separate lines:
     • @Bacteria (@ = list of the genomes)
     • @Archaea
Or
     • #Streptococcus_pneumoniae (# = extraction of the ribosomal proteins)
     • #Bacillus_subtilis
For additional details see (below) "Query construction" and "Structure of the FASTA commentary line"
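The rules above (one query per line, every line starting with "@" or "#", and no mixing of the two types) can be checked before a batch is submitted. The following is a minimal sketch only: check_query_batch is a hypothetical helper, not part of RiboDB, and it assumes that blank lines are simply ignored.

```python
def check_query_batch(text):
    """Return 'info' for an '@' batch or 'sequences' for a '#' batch.

    Hypothetical helper, not RiboDB code. Assumption: blank lines are
    ignored; every other line must start with the same tag.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    tags = {ln[0] for ln in lines}
    if not tags <= {"@", "#"}:
        raise ValueError("every query line must start with '@' or '#'")
    if len(tags) > 1:
        raise ValueError("'@' and '#' queries are mutually exclusive")
    return "info" if tags == {"@"} else "sequences"


print(check_query_batch("@Bacteria\n@Archaea"))
print(check_query_batch("#Streptococcus_pneumoniae\n#Bacillus_subtilis"))
```

A batch that mixes the two tags, such as "@Bacteria" followed by "#Bacillus", raises a ValueError in this sketch.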

Options
Checkboxes can be used to reduce the set of strains/genomes targeted by the query to:
     • type strain material
     • representative/reference strains/genomes
     • genomes from GenBank/RefSeq included in Ensembl! Bacteria
YOUR QUERY
Targeted genomes / taxa (use # to extract r-prots or @ to extract information). If empty, launches a random test.
Targeted r-prots (delete the unwanted r-prots and the associated semicolons). R-prots are named according to Ban N., Beckmann R., Cate J.H.D., et al. A new system for naming ribosomal proteins. Current Opinion in Structural Biology, 2014, vol. 24, p. 165-169 (see also the Ban Lab website).
Options Selection of the subsets




Additional information
Retrieving ribosomal proteins: Queries scan the "FASTA commentary lines" of the ribosomal proteins contained in the database for keywords. The structure of "FASTA commentary lines" is described below.

Most relevant searches target fields corresponding to:
     • Genus, Species, or lineage_report (e.g. #Sodalis_praecaptivus, @Bacillaceae-Bacillus)
     • NCBI_Species_TaxID (e.g. #~1463164)
     • Genome_assembly_number (e.g. #GCF_900890425.1)

To avoid any confusion among taxonomic ranks, use "-" at the end of the taxon name when querying RiboDB on lineage report information. Using "#Listeria" will retrieve both Listeria (genus) and Listeriaceae (family). To retrieve ribosomal proteins from the Listeria genus only, use "#Listeria-".
Similarly, use a "~" when querying on TaxID (e.g. "#~1312852").
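The effect of the trailing "-" can be reproduced with plain substring matching. The sketch below assumes that a keyword matches whenever it occurs anywhere in the lineage report; this is an inference from the examples given here, not a description of the RiboDB implementation. The two lineage reports are hand-written for illustration and the helper names are hypothetical.

```python
# Illustrative lineage reports in the Domain-...-Species layout
# (hand-written for this example, not copied from RiboDB).
LINEAGES = [
    "Bacteria-Firmicutes-Bacilli-Bacillales-Listeriaceae-Listeria-Listeria_monocytogenes",
    "Bacteria-Firmicutes-Bacilli-Bacillales-Listeriaceae-Brochothrix-Brochothrix_thermosphacta",
]


def select(keyword, lineages):
    """Assumed rule: keep lineages containing the keyword as a substring."""
    return [lin for lin in lineages if keyword in lin]


def species(lineage):
    """Last rank of a lineage report."""
    return lineage.rsplit("-", 1)[-1]


# "Listeria" also occurs inside "Listeriaceae", so both records match.
print([species(lin) for lin in select("Listeria", LINEAGES)])
# "Listeria-" requires the rank separator right after the genus name.
print([species(lin) for lin in select("Listeria-", LINEAGES)])
```

Under this assumption, the first query keeps the Brochothrix record as well, because its family name contains "Listeria", while the second keeps only the record whose genus is Listeria.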

More generally, any information contained in "FASTA commentary lines" may be queried, but such queries can be risky or return poorly relevant results.
For instance, querying the database with "#Myco" will return information on Mycobacterium, Mycolicibacterium, Mycobacteroides, Mycolicibacter, and other Mycobacteriaceae (Actinobacteria), Mycoplasma (Mycoplasmatales), Mycoplana_dimorpha (an alphaproteobacterium), and Mycoavidus_cysteinexigens (a betaproteobacterium) strains contained in RiboDB.
Similarly, "#myco" will return proteins from Corynebacterium_amycolatum, Amycolatopsis, Streptomyces_antimycoticus, and Actinoplanes_awajinensis_subsp._mycoplanecinus (Actinobacteria), Bacillus_mycoides, Bacillus_paramycoides, Bacillus_pseudomycoides, and Mycoplasma_mycoides (Firmicutes).
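The difference between "#Myco" and "#myco" is what a case-sensitive substring search would produce. The short sketch below illustrates this on a few species names taken from, or consistent with, the examples above; the matching rule is an assumption inferred from those examples and the helper name is hypothetical.

```python
NAMES = [
    "Mycobacterium_tuberculosis",
    "Mycoplasma_mycoides",
    "Bacillus_mycoides",
    "Corynebacterium_amycolatum",
]


def hits(keyword, names):
    """Assumed rule: case-sensitive substring match on the name."""
    return [name for name in names if keyword in name]


for keyword in ("Myco", "myco"):
    print(keyword, hits(keyword, NAMES))
```

Only Mycoplasma_mycoides is returned by both keywords, since it contains both the capitalised and the lower-case form.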

Retrieving information/statistics only: Queries may concern the species name (e.g. @Acinetobacter_colistiniresistens), the strain ID (e.g. @NR1165), the genome ID (e.g. @GCF_003227755), the NCBI TaxID (e.g. "@TaxId 280145;", mind the trailing ";"), and any part of the nomenclature hierarchy (e.g. "@-Gammaproteobacteria-"; note that the "-" may be mandatory in some cases).

Structure of the FASTA commentary line
FASTA commentary lines are built as follows:
>Genus_species|strain_ID#genome_type#genome_quality~genome_assembly_number~contig_number~[position_on_the_genome]~NCBI_Species_TaxID~Genetic_code~Genome_source.Protein_evidence=lineage_report
with:
     • Genus_species: e.g. Pseudomonas_aeruginosa
     • strain_ID: e.g. PAO1
     • genome_type [#T, #R, or #E] with #T = genome tagged as type strain material in RefSeq or GenBank, #R = genome tagged as reference / representative genomes in RefSeq, #E = genome listed in Ensembl! Bacteria
     • genome_quality [#C, #S, #U, and #d] with #C = complete genome, #S = scaffolds, #U = unassembled; #d flags genomes originating from metagenomes and other potential losses of assembly quality (e.g. #S#d is a scaffold-level genome from a metagenome)
     • genome_assembly_number: e.g. GCF_000006765.1
     • contig_number: e.g. NZ_002516.2
     • position: indicates the position of CDS on the contig, with "C" indicating that the CDS is encoded on the reverse strand, e.g. C[4781985..4782680]
     • NCBI_Species_TaxID: corresponds to the species TaxID of the strain
     • Genetic_code: indicates the genetic code for the genome
     • Genome_source [~A. or ~B.] with A = genome from RefSeq, B = genome from GenBank not present in RefSeq
     • Protein_evidence [V or H], written directly after the genome source (e.g. A.V), with V = match between RiboDB and CDS annotations as ribosomal protein, H = if the protein identified by RiboDB is annotated as ribosomal protein.
     • Lineage_report = Domain-Phylum-Class-Order-Family-Genus-Species taxonomic ranks separated by "-": e.g. Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-Pseudomonadaceae-Pseudomonas-Pseudomonas_aeruginosa.
See for example:
>Methanocaldococcus_bathoardescens|JH146#R#T#E#C~GCF_000739065.1~NZ_CP009149.1~C[1571584..1571994]~1301915~11~A.V=Archaea-Euryarchaeota-Methanococci-Methanococcales-Methanocaldococcaceae-Methanocaldococcus-Methanocaldococcus_bathoardescens

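For downstream scripts, a commentary line can be split back into its fields. The sketch below is modelled on the Methanocaldococcus example above, in which the genome type and quality tags follow the strain ID and the genome source and protein evidence appear joined as "A.V". The function parse_commentary_line and its field names are hypothetical, and it assumes that strain IDs contain neither "#" nor "~".

```python
def parse_commentary_line(line):
    """Split one RiboDB-style FASTA commentary line into named fields.

    Hypothetical helper, not RiboDB code. Assumptions: strain IDs contain
    no '#' or '~', and the last '~' field is '<source>.<evidence>'.
    """
    head, lineage = line.strip().lstrip(">").split("=", 1)
    ident, assembly, contig, position, taxid, code, src_ev = head.split("~")
    genus_species, rest = ident.split("|", 1)
    strain_id, *tags = rest.split("#")
    source, evidence = src_ev.split(".", 1)
    # A leading "C" marks a CDS encoded on the reverse strand.
    reverse = position.startswith("C")
    start, end = (int(x) for x in position.lstrip("C").strip("[]").split(".."))
    return {
        "genus_species": genus_species,
        "strain_id": strain_id,
        "genome_type": [t for t in tags if t in {"T", "R", "E"}],
        "genome_quality": [t for t in tags if t in {"C", "S", "U", "d"}],
        "assembly": assembly,
        "contig": contig,
        "reverse_strand": reverse,
        "start": start,
        "end": end,
        "taxid": int(taxid),
        "genetic_code": int(code),
        "source": source,
        "evidence": evidence,
        # Naive split: a taxon name containing "-" would be cut in two.
        "lineage": lineage.split("-"),
    }


example = (
    ">Methanocaldococcus_bathoardescens|JH146#R#T#E#C~GCF_000739065.1"
    "~NZ_CP009149.1~C[1571584..1571994]~1301915~11~A.V="
    "Archaea-Euryarchaeota-Methanococci-Methanococcales-"
    "Methanocaldococcaceae-Methanocaldococcus-"
    "Methanocaldococcus_bathoardescens"
)
record = parse_commentary_line(example)
print(record["strain_id"], record["assembly"], record["lineage"][0])
```

Splitting on the separators keeps the sketch short; a stricter parser would validate each field and handle taxon names that themselves contain a hyphen.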
Questions: jpdotflandroisatuniv-lyon1dotfr
