RBPDB - downloads README
Last updated: 2011-11-23 by Kate

Contents
1) Description of files
2) Description of tables
3) PFMs and PWMs
4) In vivo-bound sequences
5) Domain abbreviations

========================================================================
1) Description of files:

Database contents
    SQL dump
        MySQL dump of the tables in the database.
    All tables - TDT format
        Zipped directory of all tables in the database in tab-delimited
        text format.
    All tables - CSV format
        Zipped directory of all tables in the database in
        comma-separated values format.

Individual tables
    All experiments - TDT format
        All experiment data in tab-delimited text format.
    All proteins - TDT format
        All protein data in tab-delimited text format.
    Experiments-to-proteins mapping - TDT format
        Table linking experiments to proteins in tab-delimited text
        format.
    All experiments - CSV format
        All experiment data in comma-separated values format.
    All proteins - CSV format
        All protein data in comma-separated values format.
    Experiments-to-proteins mapping - CSV format
        Table linking experiments to proteins in comma-separated values
        format.

Motifs
    PWMs
        Zipped directory of Position Weight Matrices
    PFMs
        Zipped directory of Position Frequency Matrices

Sequences from in vivo coimmunoprecipitation experiments
    Zipped directory of sequences and READMEs

Species-specific files are as above.

TDT and CSV files can be read by text editors or loaded into Excel.
SQL files require an SQL server (such as MySQL) to access.

========================================================================
2) Description of tables

a. proteins
    id         - unique id number
    annotID    - annotation ID (Ensembl/Flybase/Wormbase/etc)
    createDate - date the protein was added to the database
    updateDate - last curation date for the protein
    geneName   - official (HGNC/MGI/Flybase) gene symbol
    geneDesc   - gene description
    species    - species name in binomial nomenclature, eg "Homo sapiens"
    taxID      - NCBI taxonomy ID for the species
    domains    - string describing the domains (see section 5 for
                 abbreviations)
    flag       - if 1, protein has been flagged and will not appear on
                 the site. Don't use these proteins in your analyses!
    flagNote   - reason for flag (eg: pseudogene, non RNA-binding, etc)
    aliases    - alternative gene names
    PDBIDs     - Protein Data Bank ID(s) of structures containing
                 RNA-bound protein
    UniProtIDs - UniProt ID(s) for the gene

b. experiments
    id                  - unique id number
    pmID                - PubMed ID for the experiment
    exptype             - type of experiment
    notes               - notes for this experiment
    sequence_motif      - motifs reported (possibly multiple, separated
                          by ; )
    SELEX_file          - raw sequences extracted from publications
    aligned_SELEX_file  - SELEX sequences aligned as in publication,
                          with the reported motif enclosed in brackets
    aligned_motif_file  - motif sequences as reported in the
                          publication, used to construct PFMs
    logo_file           - filename for logo
    PWM_file            - filename for PWM
    PFM_file            - filename for PFM
    invivo_notes        - for in vivo (RIP-chip/seq/etc), where to find
                          the data (may be an NCBI GEO accession number)
    invivo_file         - file name with the sequences bound in vivo
    secondary_structure - 1 if the paper reported a secondary structure
                          for the RNA sequence
    flag                - if 1, experiment has been flagged and will not
                          appear on the site. Don't use these
                          experiments in your analyses!
    flagNote            - reason for flag

c. protExp
    id      - unique id number
    protID  - id number for the protein
    expID   - id number for the experiment
    homolog - 1 if the experiment was performed on a species other than
              human, mouse, fly or worm

d. inparanoid
    id           - unique id number
    geneID1      - id for the first protein
    geneID2      - id for the second protein
    geneAnnotID1 - annotation id for the first protein
    geneAnnotID2 - annotation id for the second protein

========================================================================
3) Using PWMs/PFMs:

For the purposes of RBPDB, PFM refers to a position frequency matrix, in
which each position in the motif has a number for each base,
corresponding to the frequency with which that base appears at that
position. PWMs (position weight matrices) are log2-transformed PFMs
calculated using equiprobable prior base frequencies.

Each protein and experiment in the database is associated with a unique
id number (the 'id' field in all of the tables). Each matrix is based on
a single experiment from the 'experiments' table. The filename
convention for the PFM and PWM files is "_.pfm" for PFMs and "_.pwm" for
PWMs. To link these back to protein IDs, you need a copy of the
experiments-to-proteins mapping (table protExp), which is a table of
experiment and protein ids.

Note: due to a technical issue, the experiment IDs on the website do not
correspond to the experiment IDs in the database.

The matrix files contain 4 lines of whitespace-delimited text,
corresponding to the frequencies/weights for A, C, G and U respectively.

The PWM and PFM sets are in a "flat file" format database that can be
accessed easily using the TFBS perl package ( http://tfbs.genereg.net/ ).

========================================================================
4) In vivo-bound sequences from immunoprecipitation experiments

These files contain RNA sequences that were bound by the protein. The
exact data included depends on the type of experiment (RIP-chip, etc)
performed. README files are included with every data set, along with a
BATCH_README.txt file covering all the experiments; they explain the
data formats and give details of how the data was retrieved.
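
Example (illustrative only, not part of the RBPDB distribution): the
matrix layout described in section 3 (4 lines of whitespace-delimited
numbers for A, C, G and U) can be read without the TFBS package. The
Python sketch below parses such a file and applies a log2 transform
against an equiprobable background of 0.25 per base. The function names,
the optional pseudocount and the toy matrix are assumptions made for
this example; the exact normalisation RBPDB used to build its PWM files
is not stated here, so the values are not guaranteed to match the
distributed PWMs.

```python
import math

BASES = ("A", "C", "G", "U")  # row order stated in section 3


def parse_matrix(text):
    """Parse 4 lines of whitespace-delimited numbers into {base: [values]}."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError("expected 4 rows (A, C, G, U), got %d" % len(rows))
    matrix = {base: [float(v) for v in row] for base, row in zip(BASES, rows)}
    if len({len(values) for values in matrix.values()}) != 1:
        raise ValueError("rows have different lengths")
    return matrix


def pfm_to_pwm(pfm, pseudocount=0.0, background=0.25):
    """Log2-transform a PFM against a uniform background.

    `pseudocount` is a hypothetical knob added for this sketch; with the
    default of 0, bases never seen at a position get a weight of -inf.
    """
    pwm = {base: [] for base in BASES}
    for i in range(len(pfm["A"])):
        total = sum(pfm[base][i] for base in BASES) + 4 * pseudocount
        if total <= 0:
            raise ValueError("column %d has no observations" % i)
        for base in BASES:
            p = (pfm[base][i] + pseudocount) / total
            pwm[base].append(math.log2(p / background) if p > 0 else float("-inf"))
    return pwm


# Toy 3-position matrix, made up for illustration (not an RBPDB motif).
example_pfm = """\
10 0 5
0 10 5
0 0 5
0 0 5
"""

pfm = parse_matrix(example_pfm)
pwm = pfm_to_pwm(pfm)
for base in BASES:
    print(base, pwm[base])
```

To use it on a downloaded file, pass the file contents to parse_matrix,
for example parse_matrix(open(path).read()), where path is whichever
.pfm file you extracted from the PFM archive.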
========================================================================
5) Domain abbreviations

    RRM      - RNA recognition motif
    KH       - K homology
    Lsm      - Like Sm
    Znf_CCCH - CCCH zinc finger
    Znf_C2H2 - C2H2 zinc finger
    CSD      - Cold-shock domain
    PUA      - Pseudouridine synthase and archaeosine transglycosylase
    S1       - Ribosomal protein S1-like
    Surp     - Surp module/SWAP
    La       - Lupus La RNA-binding
    PWI      - PWI domain
    YTH      - YTH domain
    Pum      - Pumilio-like repeat
    THUMP    - THUMP domain
    SAM      - Sterile alpha motif
    TROVE    - TROVE module
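
Example (illustrative only): linking experiments back to proteins, as
described in sections 2c and 3, amounts to grouping the protExp rows by
expID and skipping anything whose flag is 1. The Python sketch below
does this with the standard csv module on small in-memory tables. It
assumes that the TDT files begin with a header row carrying the field
names listed in section 2 and that flags are stored as the text "1";
check both against your download before relying on it. The helper names
and all sample rows are made up for this example.

```python
import csv
import io
from collections import defaultdict


def load_rows(handle, delimiter="\t"):
    """Read a delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def flagged_ids(rows):
    """Return the ids of rows whose 'flag' field is 1."""
    return {row["id"] for row in rows if row["flag"] == "1"}


def proteins_by_experiment(prot_exp_rows, flagged_proteins=(), flagged_experiments=()):
    """Group protein ids by experiment id, skipping flagged entries."""
    mapping = defaultdict(list)
    for row in prot_exp_rows:
        if row["expID"] in flagged_experiments:
            continue
        if row["protID"] in flagged_proteins:
            continue
        mapping[row["expID"]].append(row["protID"])
    return dict(mapping)


# Made-up sample tables; real files have more columns and real ids.
prot_exp_tdt = (
    "id\tprotID\texpID\thomolog\n"
    "1\t10\t100\t0\n"
    "2\t11\t100\t0\n"
    "3\t12\t101\t1\n"
)
proteins_tdt = (
    "id\tgeneName\tflag\n"
    "10\tGENE_A\t0\n"
    "11\tGENE_B\t1\n"
    "12\tGENE_C\t0\n"
)

prot_exp = load_rows(io.StringIO(prot_exp_tdt))
proteins = load_rows(io.StringIO(proteins_tdt))

flagged = flagged_ids(proteins)
mapping = proteins_by_experiment(prot_exp, flagged_proteins=flagged)
print(mapping)
```

For the CSV downloads, the same helpers should work with
load_rows(handle, delimiter=","). With real files, open them with
newline="" as the csv module documentation recommends.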