FILES

For each experiment, one or more tab-delimited files are produced to represent the results of the study.

Each file has a header line: <localID> \t <sequence> \t <geneID (or transcriptID)> \t <Description of sample1> etc.

The columns are formatted as follows: localID \t sequence \t geneID (or transcriptID) \t affinity_in_sample1 \t affinity_in_sample2 etc.

The file name is chosen as follows: Exp<ExpID>_RBP<RBPID>.txt
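
As an illustration only, a data file with this layout could be read with a short Python sketch like the one below. It assumes that <ExpID> and <RBPID> are numeric, that the first three columns are localID, sequence and geneID, and that every affinity column holds a number. The names parse_filename and read_results are hypothetical and are not part of the dataset; the example row is made up.

```python
import csv
import io
import re

# Assumption: ExpID and RBPID are runs of digits.
FILENAME_RE = re.compile(r"^Exp(?P<exp_id>\d+)_RBP(?P<rbp_id>\d+)\.txt$")


def parse_filename(name):
    """Return (ExpID, RBPID) as strings, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    return (m.group("exp_id"), m.group("rbp_id")) if m else None


def read_results(handle):
    """Yield one dict per data row, using the header line for sample names."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[3:]  # everything after localID, sequence, geneID
    for row in reader:
        local_id, sequence, gene_id = row[:3]
        # Assumption: every affinity value parses as a float.
        affinities = dict(zip(sample_names, (float(v) for v in row[3:])))
        yield {
            "localID": local_id,
            "sequence": sequence,
            "geneID": gene_id,
            "affinities": affinities,
        }


# Made-up two-sample file used only to exercise the reader.
example = (
    "localID\tsequence\tgeneID\tsample1\tsample2\n"
    "1\tACGU\tNM_000000\t0.5\t1.25\n"
)
rows = list(read_results(io.StringIO(example)))
```

Files that contain non-numeric entries in the affinity columns would need extra handling; the README of the experiment describes what each column holds.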

Each experiment has an associated README file that contains more information about its data file(s).

The README file contains the following information:

- the source of the data (Table 1, Supplementary Figure 1, GEO XXX, etc.)

- a description of the columns and, in some cases, background information about the study

- how the sequences corresponding to the given IDs were retrieved (which database and which ID type were used, etc.)


SEQUENCE RETRIEVAL

GEO datasets

For experiments with Gene Expression Omnibus (GEO) accession numbers, the corresponding data set is downloaded
from the GEO website in series matrix format (http://www.ncbi.nlm.nih.gov/geo/). The given gene (or transcript)
IDs are used to retrieve the corresponding sequences. Signal intensities (or log ratios, p-values, etc.) and
sequences are then combined in a tab-delimited text file, which compactly represents the experimental results.
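
The combining step can be sketched in Python as follows. This is a minimal illustration, not the script used to build the files: the function name combine is hypothetical, localIDs are assumed to be consecutive integers, and IDs without a retrieved sequence are assumed to be skipped.

```python
def combine(intensities, sequences, sample_names):
    """Build the lines of a tab-delimited results file.

    intensities:  dict mapping gene ID -> list of values, one per sample
    sequences:    dict mapping gene ID -> sequence string
    sample_names: list of sample descriptions for the header line
    """
    lines = ["\t".join(["localID", "sequence", "geneID"] + list(sample_names))]
    local_id = 0
    for gene_id, values in intensities.items():
        seq = sequences.get(gene_id)
        if seq is None:
            # Assumption: IDs with no retrievable sequence are left out.
            continue
        local_id += 1  # Assumption: localIDs are consecutive integers.
        fields = [str(local_id), seq, gene_id] + [str(v) for v in values]
        lines.append("\t".join(fields))
    return lines
```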

ArrayExpress datasets

Similar to the procedure above, the array design and intensities are downloaded from the associated website. Gene IDs
(GenBank or RefSeq) are used to retrieve the corresponding sequences. If there is any quantitative data on
these genes, it is included in the tab-delimited text file. If the genes are known to be enriched in a condition
relative to the control, but no other quantitative information is provided, then a default value of 1
is reported as the affinity. If there are control genes, they are used as the background and a value of 0
is reported as their affinity.
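
The default values can be illustrated with a small Python sketch. The function name default_affinities is hypothetical, and the handling of an ID that appears in both lists (it is treated as enriched here) is an assumption, since that case is not described above.

```python
def default_affinities(enriched_ids, control_ids):
    """Assign 1 to enriched genes and 0 to control (background) genes."""
    affinity = {gene_id: 0 for gene_id in control_ids}
    # Assumption: an ID listed in both sets is reported as enriched.
    affinity.update({gene_id: 1 for gene_id in enriched_ids})
    return affinity
```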

Papers that give a set of genes

Gene IDs are used to retrieve the corresponding sequences. If the IDs are GenBank accession numbers, RefSeq IDs,
or GI numbers, BioPerl is used to download the sequences from NCBI.
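
Before downloading, it can help to check which kind of ID a paper provides. The Python sketch below is only an illustration and is not the BioPerl code used for this dataset. The function name classify_id is hypothetical, and the patterns cover the common nucleotide accession shapes only (RefSeq: two letters, an underscore and digits; GenBank: one letter plus five digits, or two letters plus six or eight digits; GI: digits only).

```python
import re

# Optional ".N" suffix is the sequence version.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")
GENBANK_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(\.\d+)?$")
GI_RE = re.compile(r"^\d+$")


def classify_id(identifier):
    """Return 'refseq', 'genbank', 'gi', or 'unknown' for a sequence ID."""
    s = identifier.strip()
    if REFSEQ_RE.match(s):
        return "refseq"
    if GENBANK_RE.match(s):
        return "genbank"
    if GI_RE.match(s):
        return "gi"
    return "unknown"
```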


Final Step

For human and mouse experiments, all human and mouse transcripts are downloaded from NCBI. The previously retrieved sequences for all the datasets are then aligned with BLAST against this reference set of transcripts, and the transcript with the largest match is retrieved for each sequence.
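
Selecting one transcript per query can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: it assumes BLAST results in the standard 12-column tabular format (-outfmt 6), interprets "largest match" as the longest alignment length, and breaks ties by bit score. The function name best_hits is hypothetical.

```python
def best_hits(blast_lines):
    """Map each query ID to the subject ID of its longest alignment.

    Expects tabular BLAST lines with the default columns:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        # Assumption: rank by alignment length, then by bit score.
        key = (int(fields[3]), float(fields[11]))
        if query not in best or key > best[query][0]:
            best[query] = (key, subject)
    return {query: subject for query, (_, subject) in best.items()}
```

If "largest match" instead refers to percent identity or bit score alone, only the ranking key needs to change.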


For more information about a specific experiment, please refer to its README file.