Since this is a large-scale bias, it becomes less clearly visible
when a higher number of bins is used.
In other cases, increasing the number of bins can highlight additional
useful information. As discussed in (Scolari et al. Molecular
BioSystems 7, 878-888 2011), clusters of genes whose expression is
affected by deletion of either Fis or H-NS and upon changes in
negative supercoiling, identified in different experiments, appear to
overlap well with the Ter macrodomain at large scale, while the
analysis at smaller scales shows a preferential localization towards
the edges of the macrodomain. This is the type of result that can be
extracted by making full use of the multi-scale analysis offered by the
linear aggregation tool. For example, performing the linear
aggregation analysis on the common dataset "H-NS knockout sensitive
genes under negative-supercoiling (from Blot et al 2006)" and
representing the significant clusters from a large scale (4 bins) to
a smaller scale (128 bins) shows both the statistical clustering in
the Ter macrodomain and the preference for the Ter macrodomain edges:
3 Learn by example: multiple sliding window histograms
Sliding window histograms allow a graphical visualization of gene
densities along the genome. The resulting representation depends on
the observation scale selected by the user by changing the number of
bins with the sliding bar in the output page (described in
detail here). While one can be
guided by the linear aggregation clusters found for a gene set, there
is no general rule for choosing an appropriate observation scale for
studying and comparing gene density profiles from different input
lists. The selected observation scale should reflect the specific
biological question of the user and the results need to be interpreted
accordingly. However, as a first qualitative criterion, the number of
bins should be sufficiently low to avoid a mere representation of gene
and operon positions. In fact, if the plot appears as "spikes"
alternating with zero-density regions, and the density peaks contain
just a few genes or even a single one (provided the "Normalize?"
option is not selected), the plot is just showing single gene
positions and their operon organization, as in this example:
Decreasing the number of bins can give a more relevant
representation of possible gene aggregation at larger scales, for the
same data set:
A lower bin number makes the size of the sliding window used to
evaluate the gene density larger. Hence, the plot typically looks
smoother.
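The relation between bin number and window size can be illustrated with a minimal sketch. It assumes that the window width equals the genome length divided by the number of bins, and that the window slides around the circular chromosome in fixed steps. The function name, the n_steps parameter and its default value are hypothetical and are not taken from the NuST implementation; the gene positions are toy values.

```python
def sliding_window_density(positions, genome_length, n_bins, n_steps=None):
    """Count genes in a circular sliding window of width genome_length / n_bins.

    Assumption (not from NuST): the window start advances in n_steps equal
    steps around the chromosome; by default four steps per window width.
    """
    if n_steps is None:
        n_steps = 4 * n_bins
    window = genome_length / n_bins
    step = genome_length / n_steps
    counts = []
    for k in range(n_steps):
        start = k * step
        # Distance of each gene from the window start, wrapped around
        # the circular chromosome.
        inside = sum(1 for p in positions
                     if (p - start) % genome_length < window)
        counts.append(inside)
    return counts


# Toy data: three neighbouring genes (one "operon") and one isolated gene.
toy_positions = [100, 150, 200, 5000]
coarse = sliding_window_density(toy_positions, 10000, n_bins=4)
fine = sliding_window_density(toy_positions, 10000, n_bins=128)
print(max(coarse), max(fine))
```

With 4 bins the three neighbouring genes fall into a single wide window, while with 128 bins no window is wide enough to contain all three, so the fine-scale profile reduces to isolated spikes of one or two genes.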
Similar considerations hold if multiple sliding window histograms are
used to compare gene density profiles of different gene lists. A
number of bins that is too large could show the mere co-occurrence of
the same gene/operon in the two lists. If the user is interested in
such small-scale information, it could be more appropriate to use the
hypergeometric test implemented in NuST, which immediately gives the
statistical significance of the intersection between the two lists
(the hypergeometric test in NuST is
described here).
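As an illustration of what such a test computes, the following minimal sketch evaluates a one-sided (upper-tail) hypergeometric p-value for the overlap between two gene lists. The function name and arguments are hypothetical, and the choice of a one-sided test is an assumption; the exact procedure used by NuST is the one described in its own documentation.

```python
from math import comb


def hypergeom_pvalue(n_genome, n_list_a, n_list_b, n_shared):
    """Probability of observing at least n_shared common genes by chance.

    Model (an assumption of this sketch): list B is a random draw of
    n_list_b genes, without replacement, from a genome of n_genome genes,
    n_list_a of which belong to list A.
    """
    total = comb(n_genome, n_list_b)
    upper = min(n_list_a, n_list_b)
    tail = sum(
        comb(n_list_a, k) * comb(n_genome - n_list_a, n_list_b - k)
        for k in range(n_shared, upper + 1)
    )
    return tail / total


# Toy numbers: a 10-gene genome and two 5-gene lists that coincide completely.
print(hypergeom_pvalue(10, 5, 5, 5))  # 1/252, about 0.004
```

A small p-value indicates that the two lists share more genes than expected for independent random lists of the same sizes.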
However, two gene lists can have a null intersection while presenting
similar gene density profiles if observed at the right scale. In this
case, the multiple sliding window histograms can be a useful tool to
explore a range of possible observation scales (changing the number of
bins), visually inspecting the results for possible profile
similarities. In order to perform a more quantitative analysis of the
correlation between two gene density profiles at a certain observation
scale, the Pearson correlation analysis described in the following
section can be used.
4 Learn by example: local Pearson correlation
The Pearson correlation analysis allows the user to calculate the
correlation between gene density profiles along the chromosome of two
input gene lists. In particular, the global Pearson correlation
(reported on the top-right of the output plot) and its local version
(the curve in the output plot) can be evaluated (a description of the
output format can be
found here).
This analysis requires care: both quantities are heavily influenced
by the observation scale, because the gene density profiles
depend on the number of bins. For instance, two lists can show a
relatively high global correlation on a large scale, but almost no
correlation on smaller scales. This is the case for the two lists
"Genes regulated by the sigma factor 70 (from Regulon DB)" and "Genes
regulated by the sigma factor 38 (from Regulon DB)" present in the
internal database. As recently discussed (Sobetzko et al PNAS 109,
E42-50 2012), the target genes of these two sigma factors are
preferentially located in a large region proximal to the replication
origin (Ori) and the replication terminus (Ter)
respectively. Therefore, anticorrelation between their density
profiles is expected at large scales. However, since in these broad
regions their genes are not positioned in mutually exclusive clusters,
almost no correlation is observed on smaller scales. As a result, the
global Pearson (given by the top-right legend in the plots) shows
anticorrelation using 4 bins, but the signal is lost when increasing the
bin number:
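The same scale dependence can be reproduced on synthetic data. In the minimal sketch below, the two count profiles are invented toy values (not the sigma 70 and sigma 38 lists), the bins are fixed and non-overlapping rather than sliding windows, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Standard global Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coarsen(counts, n_bins):
    """Merge consecutive fine bins into n_bins larger bins (sum of counts)."""
    size = len(counts) // n_bins
    return [sum(counts[i * size:(i + 1) * size]) for i in range(n_bins)]


# Toy profiles on 16 fine bins: list A sits in the first half of the
# chromosome, list B in the second half, each in a few sparse bins.
profile_a = [3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
profile_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3]

r_coarse = pearson(coarsen(profile_a, 4), coarsen(profile_b, 4))
r_fine = pearson(profile_a, profile_b)
print(r_coarse, r_fine)  # about -1 with 4 bins, about -0.14 with 16 bins
```

At 4 bins the two profiles occupy opposite halves and are fully anticorrelated; at 16 bins most bins are empty in both lists and the anticorrelation is largely lost.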
The local Pearson correlation represented in the plots shows the
relative contributions of the different chromosomal regions to the
value of the global Pearson correlation between the density
profiles. Note that its absolute value does not have a statistical
meaning on its own and, unlike the global Pearson correlation, it is
not bounded between -1 and 1. However, it is a quantitative way to
compare different chromosomal sectors and to identify the locally
highly (anti)correlated regions where the gene densities concurrently
deviate from their mean values, thus considerably contributing to the
global correlation or anticorrelation depending on whether the
deviations are in the same or in opposite directions.
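One way to make this decomposition concrete is sketched below. It assumes that the local value in each bin is the product of the two deviations from the mean, divided by the product of the two standard deviations, so that the average of the local values equals the global Pearson correlation. This normalization and the function names are assumptions made for illustration; the exact definition used by NuST is not given in this section.

```python
from math import sqrt


def local_pearson(x, y):
    """Per-bin contributions to the Pearson correlation of profiles x and y.

    Assumed definition: (x_i - mean_x) * (y_i - mean_y) / (sd_x * sd_y),
    with population standard deviations.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return [(a - mean_x) * (b - mean_y) / (sd_x * sd_y)
            for a, b in zip(x, y)]


def global_pearson(x, y):
    """Global Pearson correlation as the mean of the local contributions."""
    local = local_pearson(x, y)
    return sum(local) / len(local)


# Toy density profiles on four bins; the second is twice the first.
density_1 = [1, 2, 3, 10]
density_2 = [2, 4, 6, 20]
print(local_pearson(density_1, density_2))   # about [0.72, 0.32, 0.08, 2.88]
print(global_pearson(density_1, density_2))  # about 1.0
```

In this toy case the global correlation is 1, yet the last bin alone contributes a local value above 1, while the third bin, whose densities are close to their means, contributes almost nothing.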
As a second example of application of the local Pearson correlation
analysis, we consider two sets of putative H-NS targets obtained by
different groups using different experimental techniques: "Putative
H-NS target genes in stationary phase from ChIP-seq
experiments (from Kahramanoglou et al 2010)" and "Putative H-NS target
genes from ChIP-chip experiments (from Oshima et al 2006)" in the
common database. As expected, the two lists present both a significant
intersection (tested using the hypergeometric test
described here) and globally
correlated gene density profiles:
The chromosomal regions with high local Pearson coefficient correspond
to genomic sectors where both experiments consistently found a
higher-than-average or a lower-than-average density of targets; these
are the regions that predominantly determine the global positive
correlation. Note however that within a region with good correlation,
but where the values of the two densities that are compared are both
close to their global average, the local Pearson will be smaller than
in equally correlated regions that are far from the global averages.
5 Credits
Vittore F Scolari (vittore.scolari_at_upmc.fr)
Mina Zarei
Matteo Osella
Marco Cosentino Lagomarsino
Genophysique / Genomic Physics Group
UMR 7238 CNRS Universite Pierre et Marie Curie
Paris, France
Grant RGY-0069/2009-C