UMR 7238 CNRS
Université Pierre et Marie Curie

NuST@Genomique des Microorganismes

Getting Started with NuST / Learn by example

1 Introduction
2 How to use and create data sets
        2.1 Upload a data set
        2.2 Explore the database
            2.2.1 Common and personal data sets
            2.2.2 Add a data set of genes from file
            2.2.3 Create a new data set from intersection or union of data sets
3 Tools
        3.1 Starting a data analysis
        3.2 Linear aggregation analysis
            3.2.1 Tool documentation
            3.2.2 Output format
        3.3 Multiple sliding window histograms
            3.3.1 Tool documentation
            3.3.2 Output format
        3.4 Compare histograms using local Pearson correlation
            3.4.1 Tool documentation
            3.4.2 Output format
4 Downloads

5 Credits

1 Introduction

NuST (Nucleoid Survey Tools) is a set of tools that can be used for the analysis of the aggregation of specific gene sets along the chromosome, at different observation scales. The main engine analyzes the spatial distribution of a gene list against a shuffling null model and produces a plot with the significant linear-aggregation clusters at different scales of analysis. It can also produce a sliding-window histogram of the data and a sketch of the cluster arrangements of the circular genome.

Different sliding-window histograms can be overlaid and compared using the local Pearson correlation coefficient.

Users can add data sets in the form of gene lists from their own experiments or bioinformatic analyses, or use our database of data sets from published studies. This collection is not only a benchmark for testing the tools implemented in the web server; it is also a valuable database for comparison with user-defined gene lists.

The web server is currently based on Escherichia coli, since it is the best studied model organism in the bacteria kingdom, for which detailed information on both gene expression and chromosomal organization is available. We are planning to extend the web server to other bacterial species in the near future.

This help page guides the user step-by-step through the analyses that can be performed using the NuST web server.


2 How to use and create data sets

2.1 Upload a data set

A direct link on the NuST Home page lets the user Upload a data set to be analyzed.

The NuST web server takes as input a single-column text file with one gene ID per row.
Three sample data sets are provided as examples of valid input. The sample zip archive contains the three gene lists and a README text file describing them. These data sets can be uploaded to start an analysis and to test all the available analysis tools.
The standard gene ID is the gene name given by the RegulonDB database (http://regulondb.ccg.unam.mx). If the user list contains a different kind of gene ID (such as a Blattner ID), or if a gene ID appears more than once, the web server proposes the possible synonyms for each non-standard ID and lets the user remove possible redundancies. Here is an example of the output for a list containing eight Blattner IDs and a repeated gene name:

The complete list of gene IDs with their chromosomal coordinates and the list of possible synonyms that the web server can recognize can be downloaded in the Downloads section.
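Before uploading, a gene list can be checked locally for the two situations described above. The following Python sketch is only an illustration, not part of NuST: the function name `check_gene_list` and the Blattner ID pattern ("b" followed by four digits) are assumptions made for this example, and the example IDs are arbitrary.

```python
import re
from collections import Counter

# Assumed pattern for Blattner IDs: "b" followed by four digits (e.g. b0001).
BLATTNER = re.compile(r"^b\d{4}$")

def check_gene_list(lines):
    """Inspect a one-ID-per-row gene list.

    Returns (unique_ids, duplicates, blattner_like).
    """
    ids = [ln.strip() for ln in lines if ln.strip()]   # drop empty rows
    counts = Counter(ids)
    unique_ids = list(dict.fromkeys(ids))               # first occurrence, order kept
    duplicates = sorted(g for g, c in counts.items() if c > 1)
    blattner_like = [g for g in unique_ids if BLATTNER.match(g)]
    return unique_ids, duplicates, blattner_like

# Arbitrary example rows, as they would be read from a text file.
example = ["crp", "fis", "b0001", "fis", "", "hns"]
unique_ids, duplicates, blattner_like = check_gene_list(example)
```

The web server performs its own synonym resolution on upload; this check only flags rows that are likely to trigger it.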

After an input file has been correctly uploaded, the analysis can be directly started by clicking the links that automatically appear.

The uploaded data sets are stored as personal data sets in the internal database described in section 2.2 and can be accessed during the session from the Explore page (menu at the top of each page) for further analysis. Personal data sets are deleted at the end of each anonymous session. Users who need to keep their data across multiple sessions need a login, which can be obtained by sending an email to the administrator (see the "Credits" section).

2.2 Explore the database

3 Tools

3.1 Starting a data analysis

The different tools that perform data analysis can be accessed in two ways.

The user can select a data set in the database and choose one of the three main analysis tools proposed by clicking on it.
As an example, the figure below shows the page corresponding to a data set in the common database.

The three tools available for the analysis are connected to the links at the bottom of the page.
All the implemented tools can be used from the Tools page (top menu). The figure below is a snapshot of the Tools page.

The following sections describe the different types of analysis and their output formats.

3.2 Linear aggregation analysis

3.2.1 Tool documentation

The linear aggregation analysis is a statistical method for identifying sets of genes belonging to a data set that show significant aggregation along the genomic coordinate. This method considers the density of genes at different scales on the genome using grids of different bin sizes, and compares empirical data with results from random null models. To avoid spurious binning effects, a density histogram is built for each gene list using a sliding window of a given bin size. The resulting plot of the averaged gene density at every point of the circular chromosome is considered at different observation scales of the genome, i.e. at different bin sizes b_s in {L/2, L/4, ..., L/2^n}, where L is the length of the chromosome. We chose n=10, since bin sizes b_s < L/1024 would approach the scale of the typical gene length.
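The set of bin sizes and the sliding-window density can be illustrated with a short Python sketch. It is not the NuST implementation: the function names, the evaluation grid of 1024 points and the choice of a window centred on each point are assumptions made for this example. The genome length is the 4.63965 Mb value quoted in section 3.2.2.

```python
import numpy as np

L = 4.63965e6      # genome length in bp (value quoted in this help page)
N_SCALES = 10      # n = 10 observation scales

def bin_sizes(genome_length=L, n=N_SCALES):
    """Bin sizes b_s = L / 2^k for k = 1..n."""
    return [genome_length / 2**k for k in range(1, n + 1)]

def sliding_density(positions, bin_size, genome_length=L, n_points=1024):
    """Number of genes inside a window of width bin_size centred on each
    grid point of a circular genome (grid resolution is an assumption)."""
    positions = np.asarray(positions, dtype=float)
    grid = np.arange(n_points) * genome_length / n_points
    # Circular distance between every grid point and every gene.
    d = np.abs(grid[:, None] - positions[None, :])
    d = np.minimum(d, genome_length - d)
    return grid, (d <= bin_size / 2).sum(axis=1)

# Two genes on either side of the origin fall in the same window.
grid, density = sliding_density([100, L - 100], bin_size=1000)
```

Using the circular distance ensures that windows spanning the origin of the coordinate system are counted correctly.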

Density peaks with a significantly high number of genes are identified by comparing empirical data with 10,000 realizations of a null model. For every bin size, the null model considers the density histogram from a random list of the same length as the empirical one. The number of genes for every bin in the empirical histogram is compared to the distribution of global maxima of the null model, giving a P-value for the value of the empirical histogram in each bin. This procedure enables the extraction of a list of statistically significant (P < 0.01) bin positions. To avoid unnecessary computation, the web server computes the null model realizations only the first time a data set is seen. For each bin size (or observation scale), clusters are defined as connected intervals containing a significantly high proportion of the genes in the list. The lowest P-value among the merged bins is assigned to each cluster. The algorithm was previously presented in ref. (Scolari et al. Molecular BioSystems 7, 878-888 2011), where additional information can be found.
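The comparison with the distribution of global maxima can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the server code: random lists are assumed to be drawn without replacement from the positions of all genes, the function names are hypothetical, and the number of realizations is reduced from 10,000 to 200 to keep the toy example fast.

```python
import numpy as np

def window_counts(positions, bin_size, genome_length, n_points):
    """Genes within a window of width bin_size around each grid point (circular)."""
    positions = np.asarray(positions, dtype=float)
    grid = np.arange(n_points) * genome_length / n_points
    d = np.abs(grid[:, None] - positions[None, :])
    d = np.minimum(d, genome_length - d)
    return (d <= bin_size / 2).sum(axis=1)

def pvalues_from_null(list_positions, all_positions, bin_size, genome_length,
                      n_points=100, n_real=200, seed=0):
    """P-value at each grid point: fraction of null realizations whose
    global density maximum is at least the empirical value."""
    rng = np.random.default_rng(seed)
    empirical = window_counts(list_positions, bin_size, genome_length, n_points)
    maxima = np.empty(n_real)
    for i in range(n_real):
        # Assumption: a random list is a random subset of all genes.
        random_list = rng.choice(all_positions, size=len(list_positions), replace=False)
        maxima[i] = window_counts(random_list, bin_size, genome_length, n_points).max()
    return (maxima[None, :] >= empirical[:, None]).mean(axis=1)

# Toy genome: 100 evenly spaced genes, list of 10 adjacent genes.
all_positions = np.arange(0, 1000, 10)
list_positions = np.arange(500, 600, 10)
pvals = pvalues_from_null(list_positions, all_positions,
                          bin_size=100, genome_length=1000)
significant = np.flatnonzero(pvals < 0.01)   # grid indices with P < 0.01
```

Comparing each bin with the global maximum of the randomized histograms, rather than with the same bin, makes the test conservative with respect to the many positions examined along the chromosome.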

The procedure for detecting one dimensional aggregation of genes can be summarized as follows.

1) A data set (gene list) loaded by the user is taken as input.
2) Using a sliding-window of fixed size, the gene density is evaluated along the chromosome.
3) A sliding-window density histogram associates with every coordinate on the circular genome the number of genes of the empirical list found in an interval surrounding that point and spanning the fixed bin size. The density at each chromosomal coordinate is compared with the P-value thresholds from the null model to obtain the significant positions, which are in turn merged with a compatibility threshold of size b_s to define the clusters. The null model calculates the absolute peaks in the gene density of randomized gene sets of the same size as the empirical one.

The three steps of the procedure are repeated for the different bin sizes b_s in {L/2,L/4,. . .,L/2^n} in order to obtain the significant clusters at multiple observation scales. The results are reported in different formats as described in the following section (3.2.2).
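The final merging step of the procedure can be sketched as below. The sketch assumes that "compatibility threshold of size b_s" means that significant positions closer than b_s are joined into the same cluster; the function name `merge_clusters` and the toy coordinates are hypothetical. A cluster whose start is larger than its stop wraps around the origin of the circular genome.

```python
def merge_clusters(sig_positions, pvalues, bin_size, genome_length):
    """Merge significant coordinates into clusters on a circular genome.

    Returns a list of (start, stop, lowest_p_value) tuples.
    """
    if not sig_positions:
        return []
    pairs = sorted(zip(sig_positions, pvalues))
    clusters = [[pairs[0][0], pairs[0][0], pairs[0][1]]]
    for x, p in pairs[1:]:
        if x - clusters[-1][1] <= bin_size:
            # Close enough to the current cluster: extend it.
            clusters[-1][1] = x
            clusters[-1][2] = min(clusters[-1][2], p)
        else:
            clusters.append([x, x, p])
    # Join the last and first clusters if they are close across the origin.
    if len(clusters) > 1:
        gap = clusters[0][0] + genome_length - clusters[-1][1]
        if gap <= bin_size:
            last = clusters.pop()
            first = clusters[0]
            clusters[0] = [last[0], first[1], min(last[2], first[2])]
    return [tuple(c) for c in clusters]

# Toy example on a genome of length 1000 with b_s = 30.
clusters = merge_clusters([10, 20, 500, 990],
                          [0.005, 0.001, 0.009, 0.002],
                          bin_size=30, genome_length=1000)
```

In this toy example the positions 990, 10 and 20 form a single cluster spanning the origin, and it receives the lowest P-value among its merged positions.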

3.2.2 Output format

The output of the linear aggregation analysis is directly visualized on the web site as two bitmap pictures and a table. The top panel shows the significant clusters for the different observation scales, with a color code representing their P-value. The plot can be saved in two alternative formats (pdf file or file in the grace plotting program format). The figure below is an example of cluster diagram resulting from the linear aggregation analysis, for a data set in the common database. The x axis identifies the given observation scale (identified by the bin size of the grid) and the y axis draws the clusters as boxes, as a function of the genomic coordinate, with color coded P-values. The larger bars indicate confidence intervals, and the right panel reports the positions of chromosomal macrodomains and segments, and of a few important genes.

The cluster positions can be compared with the location of nucleoid macrodomains. The macrodomain locations are reported in the first vertical bar, as defined in the following references: (i) Valens et al. EMBO J 23, 4330-4341 2004; (ii) Boccard et al. Mol Microbiol 57, 9-16 2005; (iii) Espeli et al. J Struct Biol 156, 304-10 2006.
The exact positions used here are:

The second vertical bar indicates the coordinates of the chromosome sectors defined by Mathelier and Carbone (Mol Syst Biol 6, 366 2010), with positions:

The position of well-studied genes (such as "crp" or "fis") is also shown for reference.

A second graphical representation of results is presented in the bottom panel of the page, together with a table summarizing the results:

The statistically significant clusters are represented on the circular chromosome as colored wedges whose transparency increases with size. The outer colored circle represents macrodomains, while the inner colored circle contains the chromosome sectors defined by Mathelier and Carbone.
The user can change on-line the range of bin-sizes and update the image and the corresponding table.

In the table, the five columns report respectively:

an ID associated to the cluster (first column);
the scale of observation (given by the number of bins) at which it is found (second column);
the cluster start (third column) and stop (fourth column) coordinates in megabases (modulo the genome length, 4.63965 Mb);
the P-value associated to the cluster (fifth column).

The results can be downloaded as a PDF image, an SVG image (also in a black and white version), and a text file. The text file is tab-separated, with five columns containing the results reported in the table shown on-line.
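The downloaded text file can be read with a few lines of Python. The sketch below assumes the column order given in the table description (cluster ID, number of bins, start, stop, P-value); whether the file carries a header row is not stated here, so rows that do not parse as numbers are skipped. The function name and the two sample rows are made up for illustration only.

```python
import csv
import io

GENOME_MB = 4.63965   # genome length in Mb, as quoted above

def read_clusters(handle):
    """Parse a tab-separated cluster table into a list of dictionaries."""
    clusters = []
    for rec in csv.reader(handle, delimiter="\t"):
        if len(rec) != 5:
            continue
        try:
            n_bins = int(rec[1])
            start, stop, pval = float(rec[2]), float(rec[3]), float(rec[4])
        except ValueError:
            continue   # e.g. a header row, if present
        clusters.append({
            "id": rec[0],
            "n_bins": n_bins,
            "start_mb": start,
            "stop_mb": stop,
            # Coordinates are modulo the genome length, so stop < start
            # means the cluster spans the origin.
            "length_mb": (stop - start) % GENOME_MB,
            "p_value": pval,
        })
    return clusters

# Made-up rows; with a real file use: open("results.txt", newline="")
sample = io.StringIO("c1\t16\t1.2\t1.5\t0.004\nc2\t64\t4.5\t0.1\t0.001\n")
table = read_clusters(sample)
```

Taking the length modulo the genome length gives the correct size for clusters whose stop coordinate is smaller than their start coordinate.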

Follow this link for help choosing the parameters and interpreting the results through an example.
