PyModulon: Analyzing iModulons in Python
What is an iModulon?
Gene expression datasets are highly complex. Differential expression analysis is the most common analytical tool used to analyze these datasets, but often result in large sets of differentially genes, many of which have little to no functional annotation. Recent studies of large transcriptomic datasets have introduced Independent Component Analysis (ICA) as a scalable alternative that produces easily-interpretable results [SGS+19].
When applied to a gene expression dataset, ICA extracts independently modulated groups of genes, called iModulons. In bacteria, iModulons recapitulate known regulation with remarkable accuracy, and often encode complete metabolic pathways [cite].
In addition, ICA simultaneously computes iModulon activities, which represents the effect of the associated transcriptional regulator on gene expression.
For more information on iModulons, see the iModulonDB About Page.
What is PyModulon?
PyModulon contains modules for the analysis, visualization, and dissemination of iModulons. PyModulon is designed with Jupyter notebooks in mind. In addition, it enables users to easily create their own iModulonDB pages for any analyzed dataset.
Citation
The pymodulon manuscript is currently in preparation. For now, please cite [SGS+19]. If you use the custom iModulonDB pages, please cite [RDS+20].
Installation
With Docker
The easiest way to get started with PyModulon is using the Docker container.
Install Docker
Open terminal and navigate to your work folder
Run the following commands to start a Jupyter Notebook server:
docker run -p 8888:8888 -v "${PWD}":/home/jovyan/work sbrg/pymodulon
Copy the URL from terminal to connect to the Jupyter notebook
Navigate to the
work
folder, which has your current directory mounted.To close the notebook, press
Ctrl+C
in terminal. All changes made to files in your current directory are saved to your local machine.
With Conda
Alternatively, you can install using Conda:
conda install -c conda-forge pymodulon
We recommend installing through a conda environment:
conda create -n pymodulon -c conda-forge pymodulon
conda activate pymodulon
Optional Dependencies
Some features of PyModulon require additional dependencies. Follow the links below for installation instructions.
This step is not necessary if you use the Docker container.
Example Workflow
We have provided an example workflow here illustrating how to compile and process all publicly available RNA-seq data for Bacillus subtilis, and subsequently compute and characterize its iModulons. This tutorial was designed to be easily applied to any other bacteria.
Introduction to the IcaData
object
The pymodulon.core.IcaData
object is at the core of the PyModulon package. This object holds all of the data related to the expression dataset, the iModulons, and their annotations.
[1]:
from pymodulon.core import IcaData
from pymodulon import example_data
from pymodulon.io import save_to_json, load_json_model
Minimum requirements
The IcaData
object only requires two matrices, which are the results of performing Independent Component Analysis (ICA) on an expression dataset. For more information about ICA, see the iModulonDB about page
M
: The iModulon matrix contains the Independent Components (ICs) themselves. Each column represents an IC, and each row contains the gene weights for each gene across each IC.
[2]:
M = example_data.M
M.head()
[2]:
AllR/AraC/FucR | ArcA-1 | ArcA-2 | ArgR | AtoC | BW25113 | Cbl+CysB | CdaR | CecR | Copper | ... | thrA-KO | translation | uncharacterized-1 | uncharacterized-2 | uncharacterized-3 | uncharacterized-4 | uncharacterized-5 | uncharacterized-6 | ydcI-KO | yheO-KO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b0002 | -0.010888 | -0.007717 | -0.008502 | -0.012186 | -0.061489 | -0.005599 | -0.007377 | -0.000795 | 0.004331 | 0.001845 | ... | 0.479209 | 0.035685 | 0.024778 | -0.010660 | -0.002123 | -0.004416 | -0.005428 | -0.009219 | -0.004345 | -0.007838 |
b0003 | -0.011467 | 0.003042 | 0.011448 | -0.003685 | -0.006106 | 0.006680 | -0.043512 | 0.005107 | 0.000474 | 0.007650 | ... | 0.011420 | 0.040811 | 0.003324 | -0.008424 | -0.004415 | -0.016126 | -0.016476 | -0.003497 | -0.003583 | 0.003381 |
b0004 | -0.008693 | 0.003944 | 0.012347 | -0.008104 | 0.000585 | 0.003245 | -0.041283 | 0.006390 | 0.004260 | 0.007109 | ... | 0.011339 | 0.036244 | 0.003710 | -0.005212 | 0.000700 | -0.011096 | -0.006140 | -0.003155 | -0.008418 | 0.000129 |
b0005 | 0.006565 | -0.001099 | 0.009415 | -0.008507 | 0.005399 | 0.014748 | -0.009249 | -0.003058 | -0.012649 | -0.002370 | ... | -0.015324 | 0.028972 | 0.023969 | 0.000150 | 0.018497 | 0.009428 | 0.001255 | -0.006890 | -0.028069 | 0.021534 |
b0006 | -0.006011 | 0.009889 | -0.005555 | -0.000152 | -0.002454 | 0.009678 | -0.003456 | 0.002160 | -0.001924 | -0.000628 | ... | -0.005661 | 0.000700 | -0.002538 | -0.006103 | -0.002506 | -0.005077 | -0.004616 | -0.003585 | 0.001607 | 0.001285 |
5 rows × 92 columns
A
: The Activity matrix contains the condition-specific activities. Each column represents a sample, and each row contains the activity of each iModulon across all samples.
[3]:
A = example_data.A
A.head()
[3]:
control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | 0.378690 | -0.378690 | 2.457678 | 2.248678 | -0.327344 | -0.259164 | 1.777251 | 2.690655 | 0.656937 | 0.319583 | ... | 1.041336 | 2.203940 | 3.698292 | 0.856998 | 1.557323 | 0.337806 | 0.943742 | 1.736640 | 0.499461 | 1.581476 |
ArcA-1 | -0.440210 | 0.440210 | -5.367360 | -5.684301 | 0.131174 | 0.348843 | -4.436389 | -4.770469 | -1.799113 | -1.474222 | ... | -6.471714 | -6.549861 | -3.109145 | -2.716183 | -2.531192 | -1.461022 | -0.408849 | -0.210397 | -5.700321 | -6.237836 |
ArcA-2 | 0.762258 | -0.762258 | 2.619623 | 2.900696 | 3.120724 | 2.743634 | 1.989803 | 1.555835 | 1.782500 | 1.530811 | ... | 2.789653 | 3.959650 | 1.585147 | 0.811182 | 0.300414 | 2.537535 | 1.061408 | 2.634524 | 0.125513 | 1.178747 |
ArgR | -0.289630 | 0.289630 | -10.085719 | -13.187916 | 2.371129 | 1.861918 | -8.708701 | -7.881588 | -1.237027 | -1.235604 | ... | -11.263744 | -10.366813 | -0.289217 | 0.389228 | -5.142768 | -5.014526 | -3.648777 | -4.125952 | -4.286326 | -5.475940 |
AtoC | 0.250770 | -0.250770 | 1.844767 | 2.055052 | 0.299345 | 0.425502 | 1.801217 | 1.790987 | 0.921254 | 1.410026 | ... | 3.821909 | 3.306573 | 2.652394 | 1.910173 | 0.927772 | 1.327549 | 1.846321 | 0.909667 | 2.064662 | 2.371405 |
5 rows × 278 columns
To create the IcaData
object, the M
and A
datasets can be entered as either filenames or as a Pandas DataFrame
[4]:
ica_data = IcaData(M,A)
ica_data
[4]:
<pymodulon.core.IcaData at 0x7fe93df4d990>
Once loaded, the M
and A
matrices can be accessed directly from the object
[5]:
ica_data.M.head()
[5]:
AllR/AraC/FucR | ArcA-1 | ArcA-2 | ArgR | AtoC | BW25113 | Cbl+CysB | CdaR | CecR | Copper | ... | thrA-KO | translation | uncharacterized-1 | uncharacterized-2 | uncharacterized-3 | uncharacterized-4 | uncharacterized-5 | uncharacterized-6 | ydcI-KO | yheO-KO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b0002 | -0.010888 | -0.007717 | -0.008502 | -0.012186 | -0.061489 | -0.005599 | -0.007377 | -0.000795 | 0.004331 | 0.001845 | ... | 0.479209 | 0.035685 | 0.024778 | -0.010660 | -0.002123 | -0.004416 | -0.005428 | -0.009219 | -0.004345 | -0.007838 |
b0003 | -0.011467 | 0.003042 | 0.011448 | -0.003685 | -0.006106 | 0.006680 | -0.043512 | 0.005107 | 0.000474 | 0.007650 | ... | 0.011420 | 0.040811 | 0.003324 | -0.008424 | -0.004415 | -0.016126 | -0.016476 | -0.003497 | -0.003583 | 0.003381 |
b0004 | -0.008693 | 0.003944 | 0.012347 | -0.008104 | 0.000585 | 0.003245 | -0.041283 | 0.006390 | 0.004260 | 0.007109 | ... | 0.011339 | 0.036244 | 0.003710 | -0.005212 | 0.000700 | -0.011096 | -0.006140 | -0.003155 | -0.008418 | 0.000129 |
b0005 | 0.006565 | -0.001099 | 0.009415 | -0.008507 | 0.005399 | 0.014748 | -0.009249 | -0.003058 | -0.012649 | -0.002370 | ... | -0.015324 | 0.028972 | 0.023969 | 0.000150 | 0.018497 | 0.009428 | 0.001255 | -0.006890 | -0.028069 | 0.021534 |
b0006 | -0.006011 | 0.009889 | -0.005555 | -0.000152 | -0.002454 | 0.009678 | -0.003456 | 0.002160 | -0.001924 | -0.000628 | ... | -0.005661 | 0.000700 | -0.002538 | -0.006103 | -0.002506 | -0.005077 | -0.004616 | -0.003585 | 0.001607 | 0.001285 |
5 rows × 92 columns
[6]:
ica_data.A.head()
[6]:
control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | 0.378690 | -0.378690 | 2.457678 | 2.248678 | -0.327344 | -0.259164 | 1.777251 | 2.690655 | 0.656937 | 0.319583 | ... | 1.041336 | 2.203940 | 3.698292 | 0.856998 | 1.557323 | 0.337806 | 0.943742 | 1.736640 | 0.499461 | 1.581476 |
ArcA-1 | -0.440210 | 0.440210 | -5.367360 | -5.684301 | 0.131174 | 0.348843 | -4.436389 | -4.770469 | -1.799113 | -1.474222 | ... | -6.471714 | -6.549861 | -3.109145 | -2.716183 | -2.531192 | -1.461022 | -0.408849 | -0.210397 | -5.700321 | -6.237836 |
ArcA-2 | 0.762258 | -0.762258 | 2.619623 | 2.900696 | 3.120724 | 2.743634 | 1.989803 | 1.555835 | 1.782500 | 1.530811 | ... | 2.789653 | 3.959650 | 1.585147 | 0.811182 | 0.300414 | 2.537535 | 1.061408 | 2.634524 | 0.125513 | 1.178747 |
ArgR | -0.289630 | 0.289630 | -10.085719 | -13.187916 | 2.371129 | 1.861918 | -8.708701 | -7.881588 | -1.237027 | -1.235604 | ... | -11.263744 | -10.366813 | -0.289217 | 0.389228 | -5.142768 | -5.014526 | -3.648777 | -4.125952 | -4.286326 | -5.475940 |
AtoC | 0.250770 | -0.250770 | 1.844767 | 2.055052 | 0.299345 | 0.425502 | 1.801217 | 1.790987 | 0.921254 | 1.410026 | ... | 3.821909 | 3.306573 | 2.652394 | 1.910173 | 0.927772 | 1.327549 | 1.846321 | 0.909667 | 2.064662 | 2.371405 |
5 rows × 278 columns
If the M
and A
datasets have row or column names, these will be saved as the sample/gene/iModulon names. Since genes are often re-named when characterized, the locus tag is the preferred identifier.
[7]:
print('Gene names:',ica_data.gene_names[:5])
print('Sample names:',ica_data.sample_names[:5])
print('iModulon names:',ica_data.imodulon_names[:5])
Gene names: ['b0002', 'b0003', 'b0004', 'b0005', 'b0006']
Sample names: ['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1', 'fur__wt_dpd__2', 'fur__wt_fe__1']
iModulon names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Adding the Expression Matrix
The X
matrix contains eXpression data and is primarily used for plotting functions. The column names of the X
matrix are the sample names, and the row names are the gene identifiers.
[8]:
X = example_data.X
X.head()
[8]:
control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b0002 | -0.061772 | 0.061772 | 0.636527 | 0.819793 | -0.003615 | -0.289353 | -1.092023 | -0.777289 | 0.161343 | 0.145641 | ... | -0.797097 | -0.791859 | 0.080114 | 0.102154 | 0.608180 | 0.657673 | 0.813105 | 0.854813 | 0.427986 | 0.484338 |
b0003 | -0.053742 | 0.053742 | 0.954439 | 1.334385 | 0.307588 | 0.128414 | -0.872563 | -0.277893 | 0.428542 | 0.391761 | ... | -0.309105 | -0.352535 | -0.155074 | -0.077145 | 0.447030 | 0.439881 | 0.554528 | 0.569030 | 0.154905 | 0.294799 |
b0004 | -0.065095 | 0.065095 | -0.202697 | 0.119195 | -0.264995 | -0.546017 | -1.918349 | -1.577736 | -0.474815 | -0.495312 | ... | -0.184898 | -0.225615 | 0.019575 | 0.063986 | 0.483343 | 0.452754 | 0.524828 | 0.581878 | 0.293239 | 0.341040 |
b0005 | 0.028802 | -0.028802 | -0.865171 | -0.951179 | 0.428769 | 0.123564 | -1.660351 | -1.531147 | 0.240353 | -0.151132 | ... | -0.308221 | -0.581714 | 0.018820 | 0.004040 | -1.228763 | -1.451750 | -0.839203 | -0.529349 | -0.413336 | -0.478682 |
b0006 | 0.009087 | -0.009087 | -0.131039 | -0.124079 | -0.144870 | -0.090152 | -0.219917 | -0.046648 | -0.044537 | -0.089204 | ... | 1.464603 | 1.415706 | 1.230831 | 1.165153 | 0.447447 | 0.458852 | 0.421417 | 0.408077 | 1.151066 | 1.198529 |
5 rows × 278 columns
[9]:
ica_data.X = X
Adding annotation tables
You may load in additional data tables with information about your samples, genes, or iModulons.
These tables are originally empty, but can be altered like any Pandas DataFrame
.
[10]:
ica_data.gene_table.head()
[10]:
b0002 |
---|
b0003 |
b0004 |
b0005 |
b0006 |
Annotation tables contain one sample/gene/iModulon per row, and information about the respective item in columns. For example, a gene_table
may include the gene function, genomic position, or Cluster of Orthologous Groups (COG) Category. See the Creating the Gene Table tutorial for a step-by-step example on how to contruct this table. Gene names must match the gene names in the M
matrix.
[11]:
gene_table = example_data.gene_table
gene_table.head()
[11]:
start | end | strand | gene_name | length | operon | COG | accession | |
---|---|---|---|---|---|---|---|---|
b0001 | 189 | 255 | + | thrL | 66 | thrLABC | No COG Annotation | NC_000913.3 |
b0002 | 336 | 2799 | + | thrA | 2463 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0003 | 2800 | 3733 | + | thrB | 933 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0004 | 3733 | 5020 | + | thrC | 1287 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0005 | 5233 | 5530 | + | yaaX | 297 | yaaX | Function unknown | NC_000913.3 |
The sample_table
contains detailed experimental metadata about each sample. This must be manually created, and can contain information related to the strains or experimental conditions used in the study.
[12]:
sample_table = example_data.sample_table
sample_table.head()
[12]:
Study | project | condition | Replicate # | Strain Description | Strain | Base Media | Carbon Source (g/L) | Nitrogen Source (g/L) | Electron Acceptor | ... | Growth Rate (1/hr) | Evolved Sample | Isolate Type | Sequencing Machine | ALEdb sample | Additional Details | Biological Replicates | Alignment | DOI | GEO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample ID | |||||||||||||||||||||
control__wt_glc__1 | Control | control | wt_glc | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.33 | doi.org/10.1101/080929 | GSE65643 |
control__wt_glc__2 | Control | control | wt_glc | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.24 | doi.org/10.1101/080929 | GSE65643 |
fur__wt_dpd__1 | Fur | fur | wt_dpd | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.04 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_dpd__2 | Fur | fur | wt_dpd | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.30 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_fe__1 | Fur | fur | wt_fe | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 1.06 | No | NaN | MiSeq | NaN | NaN | 2 | 93.35 | doi.org/10.1038/ncomms5910 | GSE54900 |
5 rows × 26 columns
The project
and condition
columns in the sample_table
will be useful for the plotting functions described in the Plotting Functions tutorial.
The imodulon_table
contains information about each iModulon, such as regulator enrichments or iModulon size.
[13]:
imodulon_table = example_data.imodulon_table
imodulon_table.head()
[13]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
name | ||||||||||
AllR/AraC/FucR | allR/araC/fucR | 0.750000 | 1.190000e-41 | 1.000000 | 0.600000 | 18.0 | 18 | 3 | Carbon Source Utilization | 0.086996 |
ArcA-1 | arcA | 0.130952 | 6.420000e-20 | 0.660000 | 0.072687 | 33.0 | 50 | 1 | Energy Metabolism | 0.058051 |
ArcA-2 | arcA | 0.087683 | 1.150000e-16 | 0.840000 | 0.046256 | 21.0 | 25 | 1 | Energy Metabolism | 0.081113 |
ArgR | argR | 0.177778 | 6.030000e-18 | 0.923077 | 0.098361 | 12.0 | 13 | 1 | Amino Acid and Nucleotide Biosynthesis | 0.080441 |
AtoC | atoC | 0.800000 | 1.520000e-12 | 0.666667 | 1.000000 | 4.0 | 6 | 1 | Miscellaneous Metabolism | 0.105756 |
The tables can be loaded into the IcaData
object as either filenames or as a Pandas DataFrame
[14]:
ica_data.gene_table = gene_table
ica_data.sample_table = sample_table
ica_data.imodulon_table = imodulon_table
[15]:
ica_data.sample_table.head()
[15]:
Study | project | condition | Replicate # | Strain Description | Strain | Base Media | Carbon Source (g/L) | Nitrogen Source (g/L) | Electron Acceptor | ... | Growth Rate (1/hr) | Evolved Sample | Isolate Type | Sequencing Machine | ALEdb sample | Additional Details | Biological Replicates | Alignment | DOI | GEO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
control__wt_glc__1 | Control | control | wt_glc | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.33 | doi.org/10.1101/080929 | GSE65643 |
control__wt_glc__2 | Control | control | wt_glc | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.24 | doi.org/10.1101/080929 | GSE65643 |
fur__wt_dpd__1 | Fur | fur | wt_dpd | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.04 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_dpd__2 | Fur | fur | wt_dpd | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.30 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_fe__1 | Fur | fur | wt_fe | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 1.06 | No | NaN | MiSeq | NaN | NaN | 2 | 93.35 | doi.org/10.1038/ncomms5910 | GSE54900 |
5 rows × 26 columns
[16]:
ica_data.gene_table.head()
[16]:
start | end | strand | gene_name | length | operon | COG | accession | |
---|---|---|---|---|---|---|---|---|
b0002 | 336 | 2799 | + | thrA | 2463 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0003 | 2800 | 3733 | + | thrB | 933 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0004 | 3733 | 5020 | + | thrC | 1287 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0005 | 5233 | 5530 | + | yaaX | 297 | yaaX | Function unknown | NC_000913.3 |
b0006 | 5682 | 6459 | - | yaaA | 777 | yaaA | Function unknown | NC_000913.3 |
[17]:
ica_data.imodulon_table.head()
[17]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | allR/araC/fucR | 0.750000 | 1.190000e-41 | 1.000000 | 0.600000 | 18.0 | 18 | 3 | Carbon Source Utilization | 0.086996 |
ArcA-1 | arcA | 0.130952 | 6.420000e-20 | 0.660000 | 0.072687 | 33.0 | 50 | 1 | Energy Metabolism | 0.058051 |
ArcA-2 | arcA | 0.087683 | 1.150000e-16 | 0.840000 | 0.046256 | 21.0 | 25 | 1 | Energy Metabolism | 0.081113 |
ArgR | argR | 0.177778 | 6.030000e-18 | 0.923077 | 0.098361 | 12.0 | 13 | 1 | Amino Acid and Nucleotide Biosynthesis | 0.080441 |
AtoC | atoC | 0.800000 | 1.520000e-12 | 0.666667 | 1.000000 | 4.0 | 6 | 1 | Miscellaneous Metabolism | 0.105756 |
Converting between gene names and locus tags
If the gene_table
contains a gene_name
columns, the name2num
and num2name
methods can convert between locus tags and gene names.
[18]:
ica_data.num2name('b0002')
[18]:
'thrA'
[19]:
ica_data.name2num('thrA')
[19]:
'b0002'
Adding the TRN
Adding the transcriptional regulatory network (TRN) to the IcaData
object enables automated calculation of regulon enrichments. Each row of the TRN file represents a regulatory interaction. The TRN must contain the following columns:
regulator
: Name of the regulator (/
or+
characters will be converted to;
)gene_id
: Locus tag of the target gene (must be inica_data.gene_names
)
The following columns are optional, but are helpful to have:
regulator_id
- Locus tag of regulatorgene_name
- Name of gene (can automatically update this usingname2num
)direction
- Direction of regulation (+
for activation,-
for repression,?
orNaN
for unknown)evidence
- Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)PMID
- Reference for regulatory interaction
[20]:
trn = example_data.trn
trn.head()
[20]:
regulator | gene_id | effect | |
---|---|---|---|
0 | FMN | b3041 | - |
1 | L-tryptophan | b3708 | + |
2 | L-tryptophan | b3709 | + |
3 | TPP | b0066 | - |
4 | TPP | b0067 | - |
Again, this table can be passed in as either a filename or a Pandas DataFrame
.
[21]:
ica_data.trn = trn
ica_data.trn.head()
[21]:
regulator | gene_id | effect | |
---|---|---|---|
0 | FMN | b3041 | - |
1 | L-tryptophan | b3708 | + |
2 | L-tryptophan | b3709 | + |
3 | TPP | b0066 | - |
4 | TPP | b0067 | - |
Inspecting iModulons
view_imodulon
shows the information about each gene in the iModulon. Most information is retrieved from the gene_table
, but the regulator
column comes from the trn
.
[22]:
ica_data.view_imodulon('GlpR')
[22]:
gene_weight | start | end | strand | gene_name | length | operon | COG | accession | regulator | |
---|---|---|---|---|---|---|---|---|---|---|
b2239 | 0.211384 | 2349934 | 2351011 | - | glpQ | 1077 | glpTQ | Energy production and conversion | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2240 | 0.306134 | 2351015 | 2352374 | - | glpT | 1359 | glpTQ | Carbohydrate transport and metabolism | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2241 | 0.375662 | 2352646 | 2354275 | + | glpA | 1629 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2242 | 0.328961 | 2354264 | 2355524 | + | glpB | 1260 | glpABC | Amino acid transport and metabolism | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2243 | 0.315752 | 2355520 | 2356711 | + | glpC | 1191 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b3426 | 0.350034 | 3562012 | 3563518 | + | glpD | 1506 | glpD | Energy production and conversion | NC_000913.3 | arcA,crp,glpR,rpoD,yieP |
b3926 | 0.290235 | 4115713 | 4117222 | - | glpK | 1509 | glpFKX | Energy production and conversion | NC_000913.3 | crp,glpR,rpoD |
b3927 | 0.312307 | 4117244 | 4118090 | - | glpF | 846 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
Searching for genes in iModulons
To find which iModulons contain a specific gene, use the imodulons_with
method.
[23]:
ica_data.imodulons_with('b2239')
[23]:
['GlpR']
If the gene_table
contains a gene_name
columns, this function will work with either the locus tag or the gene name.
[24]:
ica_data.imodulons_with('carA')
[24]:
['PurR-2']
Renaming iModulons
Individual iModulons can be renamed using the rename_imodulons
method
[25]:
print('Original iModulon Names:', ica_data.imodulon_names[:5])
ica_data.rename_imodulons({'AllR/AraC/FucR':'AllR'})
print('Renamed iModulon Names:', ica_data.imodulon_names[:5])
Original iModulon Names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Renamed iModulon Names: ['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
These changes are reflected throughout the IcaData
object.
[26]:
print('M matrix columns:', ica_data.M.columns[:5])
M matrix columns: Index(['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC'], dtype='object')
iModulon names can be updated all at once as well.
[27]:
print('Original iModulon Names:', ica_data.imodulon_names[:5])
new_names = ['AllR/AraC/FucR']+ica_data.imodulon_names[1:]
print('New iModulon names:', new_names[:5])
ica_data.imodulon_names = new_names
print('Renamed iModulon Names:', ica_data.imodulon_names[:5])
Original iModulon Names: ['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
New iModulon names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Renamed iModulon Names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Copying IcaData
objects
The copy
method creates a new IcaData
object identical to the old one.
[28]:
ica_data.copy()
[28]:
<pymodulon.core.IcaData at 0x7fe93ded5d10>
Saving and Loading IcaData
Objects
To facilitate data sharing, you can save IcaData
objects as json files that can be easily re-loaded
[29]:
from pymodulon.io import *
from os import path
[30]:
filepath = path.join('tmp','ecoli_data.json')
save_to_json(ica_data,filepath)
[31]:
ica_data = load_json_model(filepath)
[32]:
ica_data.imodulon_table.head()
[32]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | allR/araC/fucR | 0.750000 | 1.190000e-41 | 1.000000 | 0.600000 | 18.0 | 18 | 3 | Carbon Source Utilization | 0.086996 |
ArcA-1 | arcA | 0.130952 | 6.420000e-20 | 0.660000 | 0.072687 | 33.0 | 50 | 1 | Energy Metabolism | 0.058051 |
ArcA-2 | arcA | 0.087683 | 1.150000e-16 | 0.840000 | 0.046256 | 21.0 | 25 | 1 | Energy Metabolism | 0.081113 |
ArgR | argR | 0.177778 | 6.030000e-18 | 0.923077 | 0.098361 | 12.0 | 13 | 1 | Amino Acid and Nucleotide Biosynthesis | 0.080441 |
AtoC | atoC | 0.800000 | 1.520000e-12 | 0.666667 | 1.000000 | 4.0 | 6 | 1 | Miscellaneous Metabolism | 0.105756 |
[ ]:
Plotting Functions
PyModulon contains a suite of plotting functions for data visualization and iModulon characterization.
[1]:
from pymodulon.plotting import *
from pymodulon.example_data import load_ecoli_data
[2]:
ica_data = load_ecoli_data()
Bar plots
Gene expression and iModulon activities are easily viewed as bar plots. Use the plot_expression
and plot_activities
functions, respectively. Any numeric metadata for your experiments can be plotted using the plot_metadata
function.
Optional arguments:
projects
: Only show specific project(s)highlight
: Show individiual conditions for specific project(s)ax
: Use a pre-existing matplotlib axis (helpful if you want to manually determine the plot size).legend_args
: Dictionary of arguments to pass to the matplotlib legend (e.g.{'fontsize':12, 'loc':0, 'ncol':2}
)
Plot Gene Expression
You can plot the compendium-wide expression of a gene using either the locus tag or gene name. Each bar represents an experimental condition, and points are overlaid on the bars to show the gene expression of individual replicates. The plot is subdivided into projects, as stored in the sample_table
.
[3]:
plot_expression(ica_data,'b0002');
The plot can be limited to specific projects using the projects
argument. The highlight
argument shows individual conditions in the legend.
[4]:
plot_expression(ica_data,'thrA',projects=['ica','fps'],highlight='fps');
Plot iModulon Activities
The plot_activities
function mirrors the plot_expression
function, but shows iModulon activities instead of gene expression.
[5]:
plot_activities(ica_data,'GlpR',highlight='crp');
[6]:
plot_activities(ica_data,'GlpR',projects='crp');
Plot sample metadata
If the sample_table
contains numerical values in a column, this column can be graphed using the plot_metadata
function.
[7]:
plot_metadata(ica_data,'Growth Rate (1/hr)');
Scatterplots
Gene expression and iModulon activities can be compared with a scatter plot. Use the compare_expression
and compare_activities
functions, respectively. In addition, compare_values
can be used to compare any compendium-wide value against another, including gene expression, iModulon activity, and sample metadata.
Optional arguments:
groups
: Mapping of samples to specific groupscolors
: Color of points, list of colors to use for different groups, or dictionary mapping groups to colorsshow_labels
: Show labels for points. (default:False
)adjust_labels
: Automatically avoid label overlapfit_metric
: Correlation metric of'pearson'
,'spearman'
, or'r2adj'
(default:'pearson'
)ax
: Use a pre-existing axis (helpful if you want to manually determine the plot size)
Formatting arguments:
ax_font_args
: Arguments for x-axis labels and y-axis labels (e.g.{'fontsize':16'}
)scatter_args
: Arguments for matplotlib scatterplot (e.g.{'s'=10}
)label_font_args
: Arguments for matplotlib text labels (e.g.{'fontsize':8}
)legend_args
: Arguments to pass to the matplotlib legend (e.g.{'fontsize':12, 'loc':0, 'ncol':2}
)
Plot gene weights
plot_gene_weights
will plot an iModulon’s gene weights against its genomic position. If the number of genes in the iModulon is fewer than 20, it will also show the gene names (or locus tags, if gene name is unavailable). If the gene_table
contains a COG
column, genes will be colored by their Cluster of Orthologous Genes category.
[8]:
plot_gene_weights(ica_data,'GlpR')
[8]:
<AxesSubplot:xlabel='Gene Start', ylabel='GlpR Gene Weight'>
If there are more than 20 genes, gene names will not be shown by default.
[9]:
plot_gene_weights(ica_data,'Fnr')
[9]:
<AxesSubplot:xlabel='Gene Start', ylabel='Fnr Gene Weight'>
Use show_labels=True
show gene labels. It is advisable to turn of auto-adjustment of gene labels (adjust_labels=False
), as this may take a while with many genes.
[10]:
plot_gene_weights(ica_data,'Fnr',show_labels=True,adjust_labels=False)
[10]:
<AxesSubplot:xlabel='Gene Start', ylabel='Fnr Gene Weight'>
Compare two gene expression profiles
The compare_expression
function plots two iModulon activities against each other. Groups of samples can be highlighted to visualize the effects of experimental conditions.
[11]:
groups = {'minspan__wt_glc_anaero__1':'Anaerobic',
'minspan__wt_glc_anaero__2':'Anaerobic'}
[12]:
compare_expression(ica_data,'arcA','fnr',groups=groups)
[12]:
<AxesSubplot:xlabel='$arcA$ Expression', ylabel='$fnr$ Expression'>
Compare two iModulon activities
The compare_activities
function mirrors the compare_expression
function.
[13]:
compare_activities(ica_data,'Fnr','ArcA-1',groups=groups)
[13]:
<AxesSubplot:xlabel='Fnr iModulon Activity', ylabel='ArcA-1 iModulon Activity'>
[14]:
compare_activities(ica_data,'Fnr','ArcA-1',groups=groups,colors=['green','purple'])
[14]:
<AxesSubplot:xlabel='Fnr iModulon Activity', ylabel='ArcA-1 iModulon Activity'>
Compare iModulon gene weights
compare_gene_weights
plots the gene weights of two iModulons against each other. Dashed lines indicate iModulon thresholds. Genes outside both thresholds are highlighted in red and labelled.
[15]:
compare_gene_weights(ica_data,'CysB','Cbl+CysB')
[15]:
<AxesSubplot:xlabel='CysB Gene Weight', ylabel='Cbl+CysB Gene Weight'>
Differential iModulon activity
A Differential iModulon Activity plot, or DiMA plot shows iModulons that have significantly different iModulon activities between two experimental conditions. The plot_dima
function can either compare two conditions using project:condition
keys, or using a list of sample names.
[16]:
plot_dima(ica_data,'fur:wt_fe','fur:delfur_fe2')
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
_diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
[16]:
<AxesSubplot:xlabel='fur:wt_fe', ylabel='fur:delfur_fe2'>
table=True
adjust=False
[17]:
ax, table = plot_dima(ica_data,
['fur__wt_fe__1','fur__wt_fe__2'],
['fur__delfur_fe2__1','fur__delfur_fe2__2'],
table=True,
adjust=False)
table
[17]:
difference | pvalue | qvalue | 0 | 1 | |
---|---|---|---|---|---|
Fur-1 | 27.598965 | 7.320594e-04 | 0.011225 | -6.701019 | 20.897946 |
fur-KO | 14.918007 | 4.338664e-08 | 0.000004 | 3.516092 | 18.434098 |
RpoH | 5.145175 | 3.091958e-03 | 0.031607 | 0.008680 | 5.153855 |
flu-yeeRS | -5.020887 | 2.246638e-04 | 0.006612 | 5.299051 | 0.278164 |
membrane | -6.005727 | 5.390713e-05 | 0.002480 | -0.163181 | -6.168908 |
PuuR | -6.026142 | 1.397577e-03 | 0.017874 | -3.040374 | -9.066517 |
NarL | -6.873706 | 4.898118e-03 | 0.040966 | 6.520503 | -0.353203 |
iron-related | -14.209552 | 2.874857e-04 | 0.006612 | 1.004311 | -13.205241 |
Zinc | -15.497242 | 5.238034e-04 | 0.009638 | 21.181402 | 5.684160 |
Compare iModulon gene weights across organisms
In order to compare gene weights across organisms, you will need the IcaData
objects for both organisms, as well as a file mapping the genes between the organisms. To generate this file, see the Comparing iModulons tutorial.
[18]:
from pymodulon.example_data import load_staph_data, load_example_bbh
staph_data = load_staph_data()
bbh = load_example_bbh()
[19]:
compare_gene_weights(ica_data,'PurR-2',
ica_data2 = staph_data,
imodulon2 = 'PyrR',
ortho_file = bbh)
[19]:
<AxesSubplot:xlabel='PurR-2 Gene Weight', ylabel='PyrR Gene Weight'>
Use use_org1_names
to switch which organism’s gene names are shown.
[20]:
compare_gene_weights(ica_data,'PurR-2',
ica_data2 = staph_data,
imodulon2 = 'PyrR',
ortho_file = bbh,
use_org1_names = False)
[20]:
<AxesSubplot:xlabel='PurR-2 Gene Weight', ylabel='PyrR Gene Weight'>
Automated metadata classification
The function metadata_boxplot
automatically classify iModulon activities given metadata information. Optional arguments include:
show_points
: Overlay individual points on top of the boxplot. By default, this is True, and uses a stripplot.show_points='swarm'
will use a swarmplot.n_boxes
: Number of boxes to createsample
: Subset of samples to analyzestrip_conc
: Remove concentrations from metadata (e.g. “glucose(2g/L)” would be interpreted as just “glucose”). Default isTrue
.ignore_cols
: List of columns to ignore. If empty, onlyproject
andcondition
are ignored.use_cols
: List of columns to use. This supercedesignore_cols
.return_results
: Return a DataFrame describing the classifications.
[21]:
metadata_boxplot(ica_data,"EvgA",ignore_cols=['GEO','study','project','DOI']);
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/sklearn/tree/_classes.py:370: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
FutureWarning,
The use_cols
argument selects which columns to use from the sample_table
.
To turn off individual points, set show_points=False
. Alternatively, show_points='swarm'
will use a swarmplot.
[22]:
metadata_boxplot(ica_data,"RpoS",n_boxes=5,
use_cols=['Base Media','Carbon Source (g/L)',
'Nitrogen Source (g/L)','Supplement',
'Evolved Sample','pH','Growth Rate (1/hr)'],
show_points=False);
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/sklearn/tree/_classes.py:370: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
FutureWarning,
The samples
argument selects specific samples to plot
The strip_kwargs
argument can be used to customize the stripplot
[23]:
metadata_boxplot(ica_data,"RpoS",
n_boxes=3,
samples=ica_data.sample_names[:26],
ignore_cols = ['DOI','GEO'],
strip_kwargs={'size':10, 'edgecolor':'white', 'linewidth':1});
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/sklearn/tree/_classes.py:370: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
FutureWarning,
iModulon histograms
iModulon gene weights can be visualized in a histogram. If you wish to highlight genes in a regulon, it can be visualized either as overlapping bars, or side-by-side bars.
[24]:
plot_regulon_histogram(ica_data,'Fur-1','fur')
[24]:
<AxesSubplot:xlabel='Fur-1 Gene Weight', ylabel='Number of Genes'>
[25]:
plot_regulon_histogram(ica_data,'Fur-1','fur',kind='side')
[25]:
<AxesSubplot:xlabel='Fur-1 Gene Weight', ylabel='Number of Genes'>
Overview plots
PyModulon contains two plots to help gain a general overview of your IcaData object.
Compare iModulons vs Regulons
compare_imodulon_vs_regulon
creates a scatter plot that compares the overlap between iModulons and their linked regulators. The dashed lines creates four quadrants:
Top-right: iModulons that are nearly identical to known regulons (“Well-matched”)
Top-left: iModulons contain a subset of the total genes thought to be regulated by the linked regulator (“Subset”)
Bottom-right: iModulons contain most of the genes thought to be regulated by the linked regulator. However, these iModulons contain a high fraction of genes that are not yet confirmed to be regulated by the regulator (“Unknown-containing”).
Bottom-left: iModulons have a small, but statistically significant, overlap with the associated regulator (“Closest match”)
Additional options include:
imodulons
: List of iModulons to plotcat_column
: Column in the `imodulon_table
that stores the category of each iModulonsize_column
: Column in theimodulon_table
that stores the size of each iModulonscale
: Value used to scale the size of each pointreg_only
: Only plot iModulons with an entry in theregulator
column of theimodulon_table
xlabel
: Custom x-axis label (default: “# shared genes/Regulon size”)ylabel
: Custom y-axis label (default: “# shared genes/iModulon size”)vline
: Draw a dashed vertical linehline
: Draw a dashed horizontal line
[26]:
compare_imodulons_vs_regulons(ica_data,
size_column='n_genes',
cat_column='Category',
scale=3);
Plot explained variance
plot_explained_variance
gives an idea of how much of the expression variation is captured by iModulons
[27]:
plot_explained_variance(ica_data);
Activity Clustering
The iModulon activites in the A
matrix can be clustered based on correlation between their activities across conditions in the compendium.
Use the cluster_activities
function to prepare a clustermap; the minimal input is simply your IcaData
object.
[28]:
cluster_activities(ica_data);
Using different correlation metrics
You can use multiple correlation metrics, including "pearson"
,"spearman"
, and "mutual_info"
. Mutual information is most likely to identify biologically similar iModulons, but can be more difficult to interpret as it finds both linear and non-linear correlations.
[29]:
cluster_activities(ica_data,correlation_method='mutual_info');
Automatic Distance Thresholding
Agglomerative (hierarchical) clustering is used under the hood. Thus, a distance threshold for defining “flat” clusters from the hierarchical structure must be determined. By default, this distance threshold is automagically calculated using a sensitivity analysis.
Different distance thresholds (this value is between 0 and 1) are tried, and the resulting clustering is assessed using a silhouette score (a measure of how separate the clusters are). The distance threshold yielding the maximum silhouette score is automatically chosen.
To see the result of this sensitivity analysis, use the show_thresholding
option.
[30]:
cluster_activities(ica_data, show_thresholding=True);
Manual Distance Threshold
You may also determine that you’re interested in manually varying the distance threshold to see what happens to the iModulon clusters. Setting this threshold manually (with the distance_threshold
option) will override the automatic thresholding shown above.
Note: distance_threshold
must be set to a value between 0 and 1; larger values will generally yield smaller numbers of larger clusters, as shown in the thresholding plot above.
[31]:
cluster_activities(ica_data, distance_threshold=0.95);
Displaying Best Clusters
The above clustermaps do not allow you to see which iModulons are actually being clustered together; use the show_best_clusters
option to call out an additional plot that shows such clusters.
By default, the clusters whose individual silhouette scores are greater than the mean silhouette score (indicating their separation from the other clusters is above-average) will be shown.
The cluster numbers come from the scikit-learn AgglomerativeClustering
estimator that actually performs the clustering; these labels don’t have any special significance and are just unique identifiers for each cluster. You can also access an iModulon-index-matched list of these labels by accessing the labels_
attribute of the AgglomerativeClustering
object (which is returned by cluster_activities
).
[32]:
cluster_activities(ica_data, show_best_clusters=True);
So we can see here that the clustering method does seem to capture some biologically-relevant groups of iModulons: Cluster 15 contains 2 flagella regulators, Cluster 8 is membrane-related, Cluster 13 is stress-related, Cluster 25 is iron-related, Cluster 12 is carbon metabolism related, etc.
NOTE: the parenthesized numbers next to the cluster names are the clusters’ silhouette scores (a 0 to 1 measure of a cluster’s separation from the pack).
Perhaps you’re only interested in seeing a specific number of the top clusters; use the n_best_clusters
argument to specify this preference:
[33]:
cluster_activities(ica_data, show_best_clusters=True, n_best_clusters=5);
Naming Clusters
After performing an initial clustering and manually mapping knowledge onto your best clusters, you may decide on a new, more descriptive name for some of your clusters. To generate pretty figures that use these names instead of the soulless integer IDs, use the cluster_names
option to map the integer IDs to names. You don’t have to rename all clusters.
[34]:
cluster_activities(
ica_data, show_best_clusters=True, n_best_clusters=5,
cluster_names={15: 'Flagella', 8: 'Membrane', 13: 'Stress'}
);
DIMCA
Differential iModulon Cluster Activity (DIMCA) analysis, a sister method to the differential iModulon activity (DIMA) analysis, allows you to compare all iModulon activities between 2 or more conditions.
cluster_activities
itself has the capability to perform DIMCA analyses, exposing a series of dimca_
-prefixed arguments that correspond with the arguments to plot_dima
(which is in fact used under the hood).
For DIMCA, a “cluster activity” will be calculated for each of your best clusters (however many you ask for) and then plotted between your 2 conditions, instead of that cluster’s constituent iModulons. In this way, a DIMA plot can be rendered even simpler.
Cluster activities are simply the averages of the activities of the constituent iModulons, EXCEPT that for iModulons that are generally anti-correlated with the others in a cluster (see purR-KO from the above plot, for example), the sign of the activities is first switched.
[35]:
plot_dima(ica_data, 'fur:wt_dpd', 'fur:wt_fe');
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
_diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
This DIMA plot is fairly busy; we can use DIMCA to reduce the number of points even further:
[36]:
cluster_activities(ica_data, show_best_clusters=True, dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe');
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
_diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
Thus, we have a somewhat simpler picture of this comparison. We can increase the number of best clusters we ask for to yield only cluster points; be careful with this though, as some of the worse-scoring clusters are actually “singleton” clusters with just a single iModulon in them (for these iModulons’ activities are generally uncorrelated with the other iModulons in the dataset).
[37]:
cluster_activities(ica_data, n_best_clusters=50, dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe');
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
_diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
And as before, if we name clusters, those names will propagate to the DIMCA plot (and DIMCA table if requested):
[38]:
cluster_obj, dimca_ax, table = cluster_activities(
ica_data, show_best_clusters=True,
cluster_names={25: 'Iron', 13: 'Stress', 8: 'Membrane'},
dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe', dimca_table=True
);
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
_diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
[39]:
table
[39]:
difference | pvalue | qvalue | 0 | 1 | |
---|---|---|---|---|---|
iron-related | 16.153595 | 0.000168 | 0.001971 | -15.149284 | 1.004311 |
ArgR | 13.753341 | 0.002044 | 0.005877 | -11.636817 | 2.116524 |
NarL | 8.227232 | 0.002884 | 0.007045 | -1.706729 | 6.520503 |
Membrane [Clst] | 7.906679 | 0.001441 | 0.004514 | -6.433855 | 1.472824 |
GcvA | 7.412801 | 0.000035 | 0.000823 | -6.265427 | 1.147374 |
Cluster 28 | 7.156066 | 0.000250 | 0.002347 | -6.943347 | 0.212719 |
Cluster 9 | -5.126362 | 0.000456 | 0.002925 | 4.132141 | -0.994221 |
SoxS | -5.226992 | 0.000665 | 0.003127 | 5.043639 | -0.183353 |
Thiamine | -5.558917 | 0.003923 | 0.007682 | 5.193298 | -0.365619 |
FecI | -5.760403 | 0.001402 | 0.004514 | 5.871660 | 0.111257 |
uncharacterized-1 | -7.566460 | 0.000498 | 0.002925 | 8.833006 | 1.266546 |
Stress [Clst] | -7.639391 | 0.000619 | 0.003127 | 7.081778 | -0.557613 |
Cluster 2 | -8.318706 | 0.000114 | 0.001783 | -1.638389 | -9.957095 |
Iron [Clst] | -8.611146 | 0.003289 | 0.007045 | 6.395635 | -2.215511 |
Leu/Ile | -8.702352 | 0.000850 | 0.003633 | 8.712151 | 0.009800 |
MetJ | -9.857605 | 0.000403 | 0.002925 | 9.533284 | -0.324320 |
nitrate-related | -11.292903 | 0.000002 | 0.000097 | 13.926154 | 2.633251 |
Note that [Clst]
is added to cluster names in the DIMCA plot to avoid confusion with unclustered iModulons.
Inferring iModulon activities for new data
Re-computing the complete set of iModulons can be computationally intensive for every new dataset. However, once a dataset reaches a critical size, you can use a pre-computed IcaData
object to infer the iModulon activities of a new dataset. iModulon activities are relative measures; every dataset must have a reference condition to which all other samples are compared against.
To compute the new iModulon activities, first load the pre-computed IcaData
object.
[1]:
from pymodulon.example_data import load_ecoli_data
ica_data = load_ecoli_data()
Next, load your expression profiles. This should be normalized using whichever read mapping pipeline you use, as Transcripts per Million (TPM) or log-TPM.
[2]:
from pymodulon.example_data import load_example_log_tpm
log_tpm = load_example_log_tpm()
log_tpm.head()
[2]:
Reference_1 | Reference_2 | Test_1 | Test_2 | |
---|---|---|---|---|
Geneid | ||||
b0001 | 10.473721 | 10.271944 | 10.315476 | 10.808135 |
b0002 | 10.260569 | 10.368555 | 10.735874 | 10.726916 |
b0003 | 9.920277 | 10.044224 | 10.528432 | 10.503092 |
b0004 | 9.936694 | 10.010638 | 9.739519 | 9.722997 |
b0005 | 7.027515 | 7.237449 | 6.745798 | 6.497823 |
Next, make sure your dataset uses similar gene names as the target IcaData
object.
[3]:
from matplotlib_venn import venn2
venn2((set(ica_data.gene_names),set(log_tpm.index)), set_labels=['IcaData genes','Dataset genes'])
[3]:
<matplotlib_venn._common.VennDiagram at 0x7f674e99c410>
Only genes shared between your IcaData
object and the new expression profiles will be used to project your data. All other genes will be ignored.
Then, center your dataset on a reference condition, taking the average of replicates.
[4]:
centered_log_tpm = log_tpm.sub(log_tpm[['Reference_1','Reference_2']].mean(axis=1),axis=0)
centered_log_tpm.head()
[4]:
Reference_1 | Reference_2 | Test_1 | Test_2 | |
---|---|---|---|---|
Geneid | ||||
b0001 | 0.100889 | -0.100889 | -0.057356 | 0.435303 |
b0002 | -0.053993 | 0.053993 | 0.421312 | 0.412354 |
b0003 | -0.061973 | 0.061973 | 0.546181 | 0.520841 |
b0004 | -0.036972 | 0.036972 | -0.234147 | -0.250669 |
b0005 | -0.104967 | 0.104967 | -0.386684 | -0.634659 |
Finally, use the pymodulon.util.infer_activities
function to infer the relative iModulon activities of your dataset.
[5]:
from pymodulon.util import infer_activities
[6]:
activities = infer_activities(ica_data,centered_log_tpm)
activities.head()
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:327: FutureWarning: Index.__and__ operating as a set operation is deprecated, in the future this will be a logical operation matching Series.__and__. Use index.intersection(other) instead
shared_genes = ica_data.M.index & data.index
[6]:
Reference_1 | Reference_2 | Test_1 | Test_2 | |
---|---|---|---|---|
AllR/AraC/FucR | 0.243143 | -0.243143 | 1.028044 | 0.848571 |
ArcA-1 | -0.157687 | 0.157687 | -2.644027 | -2.418106 |
ArcA-2 | 0.038248 | -0.038248 | 0.182260 | 0.039267 |
ArgR | -0.150147 | 0.150147 | -1.456806 | -1.293399 |
AtoC | 0.344893 | -0.344893 | 0.632130 | 1.075412 |
All of the plotting functions in pymodulon.plotting
can be used on your inferred activities once you add it to a new IcaData
object. It is advisable to create a new sample_table
with project
and condition
columns.
[7]:
from pymodulon.core import IcaData
import pandas as pd
[8]:
new_sample_table = pd.DataFrame([['new_data','reference']]*2+[['new_data','test']]*2,columns=['project','condition'],index=log_tpm.columns)
new_sample_table
[8]:
project | condition | |
---|---|---|
Reference_1 | new_data | reference |
Reference_2 | new_data | reference |
Test_1 | new_data | test |
Test_2 | new_data | test |
[9]:
new_data = IcaData(ica_data.M,
activities,
gene_table = ica_data.gene_table,
sample_table = new_sample_table,
imodulon_table = ica_data.imodulon_table)
[10]:
from pymodulon.plotting import *
[11]:
plot_activities(new_data,'Fur-1',highlight='new_data')
[11]:
<AxesSubplot:ylabel='Fur-1 iModulon\nActivity'>
[ ]:
[ ]:
[ ]:
Gene enrichment analysis
[1]:
from pymodulon.enrichment import *
from pymodulon.example_data import load_ecoli_data, trn
ica_data = load_ecoli_data()
General functions
To perform a basic enrichment test between two gene sets, use the compute_enrichment
function. Optional arguments:
label
: Label for your target gene set (e.g. regulator name or GO term)
[2]:
gene_set = ['b0002','b0003','b0004'] # e.g. an iModulon or other gene set
target_genes = ['b0002','b0005','b0007'] # e.g. a regulon or genes in a COG category
all_genes = ica_data.gene_names # List of all genes in your expression dataset
[3]:
compute_enrichment(gene_set,target_genes,all_genes)
[3]:
pvalue 0.002293
precision 0.333333
recall 0.333333
f1score 0.333333
TP 1.000000
target_set_size 3.000000
gene_set_size 3.000000
dtype: float64
To perform an enrichment test against a regulon in a TRN table, use the compute_regulon
function.
[4]:
compute_regulon_enrichment(gene_set,'lrp',all_genes,trn)
[4]:
pvalue 0.000131
precision 1.000000
recall 0.015000
f1score 0.029557
TP 3.000000
regulon_size 200.000000
gene_set_size 3.000000
n_regs 1.000000
Name: lrp, dtype: float64
Each row of the TRN table represents a regulatory interaction. The table requires the columns regulator
and gene_id
.
[5]:
trn.head()
[5]:
regulator | gene_id | effect | |
---|---|---|---|
0 | FMN | b3041 | - |
1 | L-tryptophan | b3708 | + |
2 | L-tryptophan | b3709 | + |
3 | TPP | b0066 | - |
4 | TPP | b0067 | - |
To perform an enrichment against all regulons in a TRN table, use the compute_trn_enrichment
function. This function includes false detection correction using the Benjamini-Hochberg procedure to compute Q-values. By default, the false detection rate is 1e-5
, and can be updated using the fdr
argument.
[6]:
compute_trn_enrichment(gene_set,all_genes,trn)
[6]:
pvalue | precision | recall | f1score | TP | regulon_size | gene_set_size | n_regs | qvalue | |
---|---|---|---|---|---|---|---|---|---|
thr-tRNA | 0.000000e+00 | 1.0 | 1.000000 | 1.000000 | 3.0 | 3.0 | 3.0 | 1.0 | 0.000000e+00 |
ile-tRNA | 8.354256e-09 | 1.0 | 0.333333 | 0.500000 | 3.0 | 9.0 | 3.0 | 1.0 | 2.923990e-08 |
lrp | 1.306248e-04 | 1.0 | 0.015000 | 0.029557 | 3.0 | 200.0 | 3.0 | 1.0 | 3.047911e-04 |
arcA | 1.540883e-03 | 1.0 | 0.006608 | 0.013129 | 3.0 | 454.0 | 3.0 | 1.0 | 2.157237e-03 |
fnr | 1.411995e-03 | 1.0 | 0.006803 | 0.013514 | 3.0 | 441.0 | 3.0 | 1.0 | 2.157237e-03 |
crp | 3.335561e-03 | 1.0 | 0.005111 | 0.010169 | 3.0 | 587.0 | 3.0 | 1.0 | 3.891487e-03 |
compute_trn_enrichment
can perform enrichment against complex regulons in boolean combinations. The max_regs
argument determines the number of regulators to include in the complex regulon, and the method
argument determines how to combine complex regulons. method='or'
computes enrichments against the union of regulons, 'and'
computes enrichment against intersection of regulons, and 'both'
performs both tests (default).
[7]:
compute_trn_enrichment(gene_set,all_genes,trn,max_regs=2).head()
[7]:
pvalue | precision | recall | f1score | TP | regulon_size | gene_set_size | n_regs | qvalue | |
---|---|---|---|---|---|---|---|---|---|
fnr+ile-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 0.0 |
arcA+ile-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
lrp+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
arcA+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
Increasing the number of regulators for complex regulons beyond three can lead to a combinatorial explosion. If you are sure you want to search for larger complex regulons, use force=True
[8]:
compute_trn_enrichment(gene_set,all_genes,trn,max_regs=3,force=True).head()
[8]:
pvalue | precision | recall | f1score | TP | regulon_size | gene_set_size | n_regs | qvalue | |
---|---|---|---|---|---|---|---|---|---|
arcA+ile-tRNA+lrp | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
arcA+ile-tRNA+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
arcA+lrp+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
arcA+rpoD+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
crp+fnr+ile-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
Using the IcaData
object
[9]:
# Compare a single iModulon to a simple regulon
ica_data.compute_regulon_enrichment('GlpR',regulator='glpR')
[9]:
pvalue 0
precision 1
recall 1
f1score 1
TP 9
regulon_size 9
imodulon_size 9
n_regs 1
Name: glpR, dtype: int64
[10]:
# Compare a single iModulon to a complex regulon
ica_data.compute_regulon_enrichment('Copper','cusR/cueR')
[10]:
pvalue 5.471346e-20
precision 1.000000e+00
recall 4.210526e-01
f1score 5.925926e-01
TP 8.000000e+00
regulon_size 1.900000e+01
imodulon_size 8.000000e+00
n_regs 2.000000e+00
Name: cusR/cueR, dtype: float64
[11]:
# Get all significant enrichments for all iModulons.
# Optional parameters are same as the global compute_trn_enrichment function
ica_data.compute_trn_enrichment()
[11]:
imodulon | regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AllR/AraC/FucR | araC | 3.804040e-14 | 3.423636e-13 | 0.388889 | 0.500000 | 0.437500 | 7.0 | 14.0 | 18.0 | 1.0 |
1 | AllR/AraC/FucR | allR | 3.067669e-13 | 1.380451e-12 | 0.333333 | 0.666667 | 0.444444 | 6.0 | 9.0 | 18.0 | 1.0 |
2 | AllR/AraC/FucR | fucR | 2.316812e-11 | 6.950436e-11 | 0.277778 | 0.714286 | 0.400000 | 5.0 | 7.0 | 18.0 | 1.0 |
3 | AllR/AraC/FucR | rpoD | 5.383749e-07 | 1.211343e-06 | 0.944444 | 0.011668 | 0.023051 | 17.0 | 1457.0 | 18.0 | 1.0 |
4 | AllR/AraC/FucR | crp | 8.866740e-07 | 1.596013e-06 | 0.666667 | 0.020443 | 0.039669 | 12.0 | 587.0 | 18.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
181 | nitrate-related | narP | 9.268494e-13 | 6.024521e-12 | 0.571429 | 0.163265 | 0.253968 | 8.0 | 49.0 | 14.0 | 1.0 |
182 | nitrate-related | nsrR | 3.762939e-09 | 1.630607e-08 | 0.500000 | 0.086420 | 0.147368 | 7.0 | 81.0 | 14.0 | 1.0 |
183 | nitrate-related | fnr | 8.576894e-09 | 2.787490e-08 | 0.785714 | 0.024943 | 0.048352 | 11.0 | 441.0 | 14.0 | 1.0 |
184 | translation | fnr | 2.865424e-08 | 4.584679e-07 | 0.583333 | 0.031746 | 0.060215 | 14.0 | 441.0 | 24.0 | 1.0 |
185 | translation | arcA | 4.244932e-07 | 3.395946e-06 | 0.541667 | 0.028634 | 0.054393 | 13.0 | 454.0 | 24.0 | 1.0 |
186 rows × 11 columns
Save to sample table
You can directly save iModulon enrichments to the sample table using either the ica_data.compute_regulon_enrichment
function or ica_data.compute_trn_enrichment
function. Columns unrelated to regulon enrichments (e.g. Function
and Category
) are left alone.
[12]:
ica_data.imodulon_table.loc[['GlpR']]
[12]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
GlpR | glpR | 1.0 | 0.0 | 1.0 | 1.0 | 9.0 | 9 | 1 | Carbon Source Utilization | 0.066801 |
[13]:
ica_data.compute_regulon_enrichment('GlpR',regulator='crp',save=True)
ica_data.imodulon_table.loc[['GlpR']]
[13]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | regulon_size | imodulon_size | n_regs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GlpR | crp | 0.030201 | 3.567813e-08 | 1.0 | 0.015332 | 9.0 | 9 | 1 | Carbon Source Utilization | 0.066801 | 587.0 | 9.0 | 1.0 |
[14]:
# The enrichment with the lowest qvalue is saved as the official enrichment
ica_data.compute_trn_enrichment(save=True)
ica_data.imodulon_table
[14]:
regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs | threshold | Category | n_tf | n_genes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | araC | 3.804040e-14 | 3.423636e-13 | 0.388889 | 0.500000 | 0.437500 | 7.0 | 14.0 | 18.0 | 1.0 | 0.086996 | Carbon Source Utilization | 3 | 18 |
ArcA-1 | arcA | 6.418412e-20 | 2.374813e-18 | 0.660000 | 0.072687 | 0.130952 | 33.0 | 454.0 | 50.0 | 1.0 | 0.058051 | Energy Metabolism | 1 | 50 |
ArcA-2 | appY | 7.019326e-18 | 1.895218e-16 | 0.320000 | 0.888889 | 0.470588 | 8.0 | 9.0 | 25.0 | 1.0 | 0.081113 | Energy Metabolism | 1 | 25 |
ArgR | argR | 6.026074e-18 | 7.833897e-17 | 0.923077 | 0.098361 | 0.177778 | 12.0 | 122.0 | 13.0 | 1.0 | 0.080441 | Amino Acid and Nucleotide Biosynthesis | 1 | 13 |
AtoC | atoC | 1.522277e-12 | 1.217822e-11 | 0.666667 | 1.000000 | 0.800000 | 4.0 | 4.0 | 6.0 | 1.0 | 0.105756 | Miscellaneous Metabolism | 1 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
uncharacterized-4 | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.074349 | Uncharacterized | 0 | 10 |
uncharacterized-5 | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.079179 | Uncharacterized | 0 | 4 |
uncharacterized-6 | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.052578 | Uncharacterized | 0 | 54 |
ydcI-KO | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.099229 | Genomic Alterations | 0 | 3 |
yheO-KO | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.078511 | Genomic Alterations | 0 | 16 |
92 rows × 14 columns
[ ]:
Searching for motifs
Using MEME to find motifs
MEME must be installed to use this functionality. find_motifs
searches for motifs upstream of all genes in an iModulon. The gene_table
must contain the columns accession
and operon
for this function to work (see notebooks/gene_annotation.ipynb
).
find_motifs
supports many of the command-line options for MEME:
outdir
: Directory for output filespalindrome
: If True, limit search to palindromic motifs (default: False)nmotifs
: Number of motifs to search for (default: 5)upstream
: Number of basepairs upstream from first gene in operon to include in motif search (default: 500)downstream
: Number of basepairs upstream from first gene in operon to include in motif search (default: 100)verbose
: Show steps in verbose output (default: True)force
: Force execution of MEME even if output already exists (default: False)evt
: E-value threshold (default: 0.001)cores
Number of cores to use (default: 8)minw
: Minimum motif width in basepairs (default: 6)maxw
: Maximum motif width in basepairs (default: 40)minsites
: Minimum number of sites required for a motif. Default is the number of operons divided by 3.
[1]:
from pymodulon.motif import *
from pymodulon.example_data import load_ecoli_data, ecoli_fasta
ica_data = load_ecoli_data()
[4]:
motifs = find_motifs(ica_data, 'ArgR', ecoli_fasta, outdir='tmp')
Finding motifs for 11 sequences
Running command: meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3
Found 2 motifs across 15 sites
find_motifs
creates a MotifInfo
object
[5]:
motifs
[5]:
<MotifInfo with 2 motifs across 15 sites>
The MotifInfo
object contains the following properties:
motifs
: Information about the motifs (e.g. E-value, width, consensus sequence)
[6]:
motifs.motifs
[6]:
e_value | sites | width | consensus | motif_frac | |
---|---|---|---|---|---|
motif | |||||
MEME-1 | 1.7e-30 | 10 | 39 | TGAATWWAHATKCAVTWWWTWTGMATAAAWATWCAHTRR | 0.909091 |
MEME-2 | 0.00015 | 5 | 39 | GMAYASGCWSVTYGYGGVTGHCHGMDGCNACGCTGGCGY | 0.454545 |
sites
: Information about predicted binding sites (e.g. location, p-value)
[7]:
motifs.sites
[7]:
rel_position | pvalue | site_seq | genes | locus_tags | start_pos | strand | ||
---|---|---|---|---|---|---|---|---|
motif | operon | |||||||
MEME-1 | argA | 453 | 2.78e-12 | AGAATAAAAATACACTAATTTCGAATAATCATGCAAAGA | argA | b2818 | 2948741 | + |
argCBH | 375 | 1.89e-16 | TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + | |
argD | 132 | 6.68e-14 | TGAAATTATAACCACAAAATATGCATAAAAAATCACTAA | argD | b3359 | 3490080 | - | |
argE | 128 | 1.89e-16 | TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA | argE | b3957 | 4154747 | - | |
argF | 129 | 6.79e-16 | TGAATTAAAATTCACTTTATATGTGTAATTATTCATTTG | argF | b0273 | 290205 | - | |
argG | 412 | 2.07e-13 | TAAATGAAAACTCATTTATTTTGCATAAAAATTCAGTGA | argG | b3172 | 3318136 | + | |
argI | 127 | 9.75e-16 | TGAATTAAAATTCAATTTATATGGATGATTATTCATTTG | argI | b4254 | 4478211 | - | |
artJ | 171 | 1.63e-12 | TGAATTTATATGCAATAAACATGATTAAATAATTTAAAT | artJ | b0860 | 900475 | - | |
carAB | 453 | 3.02e-14 | TGAATTAATATGCAAATAAAGTGAGTGAATATTCTCTGG | carA | b0032 | 29150 | + | |
dtpD | 192 | 1.11e-11 | TTATGCAATATGCTGTCTGATTGCATAAATATACATTAG | dtpD | b0709 | 742456 | - | |
xynR | NaN | NaN | NaN | yagI | b0272 | 289062 | - | |
MEME-2 | argA | 151 | 1.64e-13 | GCACCGCCACGACCTGGCTGACCGCAGCCAGACTGACCT | argA | b2818 | 2948741 | + |
argCBH | 505 | 4.24e-22 | GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + | |
argD | NaN | NaN | NaN | argD | b3359 | 3490080 | - | |
argE | 258 | 4.24e-22 | GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC | argE | b3957 | 4154747 | - | |
argF | NaN | NaN | NaN | argF | b0273 | 290205 | - | |
argG | NaN | NaN | NaN | argG | b3172 | 3318136 | + | |
argI | NaN | NaN | NaN | argI | b4254 | 4478211 | - | |
artJ | 305 | 1.00e-15 | GCACAGGCACCGTGCTGATGTCTGATGCGACGCTGGCGC | artJ | b0860 | 900475 | - | |
carAB | NaN | NaN | NaN | carA | b0032 | 29150 | + | |
dtpD | NaN | NaN | NaN | dtpD | b0709 | 742456 | - | |
xynR | 301 | 2.76e-13 | GAACACGATGCTCGCCGCCGACTCAAACACCTCGTCCGT | yagI | b0272 | 289062 | - |
cmd
: The original command used to run MEME
[8]:
motifs.cmd
[8]:
'meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3'
file
: Path to MEME output. This is the relative path of the file from the notebook that ranfind_motifs
[9]:
motifs.file
[9]:
'tmp/ArgR/meme.txt'
This MotifInfo
object is automatically stored as a dictionary in the IcaData object. It will persist after saving and re-loading the IcaData object.
[10]:
ica_data.motif_info
[10]:
{'ArgR': <MotifInfo with 2 motifs across 15 sites>}
Using TOMTOM to compare motifs against external databases
Once you have a motif from MEME, you can use TOMTOM to compare your motif against external databases. The compare_motifs
function makes this process simple.
The MotifInfo
object generated in the find_motifs
function contains the MEME file location, which is the primary input for compare_motifs
.
[17]:
compare_motifs(ica_data.motif_info['ArgR'], verbose=False)
[17]:
motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MEME-1 | argR2 | argR2 | 2.648790e-28 | 1.565440e-25 | 3.020170e-25 | 37 | 2 | - | dpinteract |
1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 4.221120e-16 | 2.494680e-13 | 2.406470e-13 | 26 | 0 | - | SwissRegulon_e_coli |
2 | MEME-1 | argR | argR | 2.456950e-13 | 1.452060e-10 | 9.338110e-11 | 18 | -19 | - | dpinteract |
3 | MEME-1 | ArgR | MX000116 | 3.377890e-07 | 1.996330e-04 | 6.419150e-05 | 14 | -21 | + | prodoric |
4 | MEME-1 | AhrC | MX000043 | 4.871210e-07 | 2.878890e-04 | 6.942730e-05 | 16 | -20 | - | prodoric |
This information is saved in the matches
attribute of the ArgR MotifInfo
. Again, this information will be maintained if you save the file to json.
[13]:
ica_data.motif_info['ArgR'].matches
[13]:
motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MEME-1 | argR2 | argR2 | 2.648790e-28 | 1.565440e-25 | 3.020170e-25 | 37 | 2 | - | dpinteract |
1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 4.221120e-16 | 2.494680e-13 | 2.406470e-13 | 26 | 0 | - | SwissRegulon_e_coli |
2 | MEME-1 | argR | argR | 2.456950e-13 | 1.452060e-10 | 9.338110e-11 | 18 | -19 | - | dpinteract |
3 | MEME-1 | ArgR | MX000116 | 3.377890e-07 | 1.996330e-04 | 6.419150e-05 | 14 | -21 | + | prodoric |
4 | MEME-1 | AhrC | MX000043 | 4.871210e-07 | 2.878890e-04 | 6.942730e-05 | 16 | -20 | - | prodoric |
By default, compare_motifs
compares your motifs to 5 prokaryotic databases. You can also use the path to a local motif file or limit the search to a single database, namely:
'prodoric'
'collectf'
'dpinteract'
'regtransbase'
'SwissRegulon_e_coli'
To re-run compare_motifs
with different arguments, set force=True
[18]:
compare_motifs(ica_data.motif_info['ArgR'], motif_db = 'prodoric', verbose=False, force=True)
[18]:
motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MEME-1 | ArgR | MX000116 | 2.901120e-07 | 0.000058 | 0.000055 | 14 | -21 | + | prodoric |
1 | MEME-1 | AhrC | MX000043 | 4.308400e-07 | 0.000087 | 0.000055 | 16 | -20 | - | prodoric |
[ ]:
iModulon Thresholds
iModulon thresholds are computed using an outlier test (See Methods section “Determination of the gene coefficient threshold” in Sastry et al 2019). This outlier test relies on the D’Agostino test statistic.
However, if you are working with an understudied organism, we recommend setting the thresholds using the K-means method when creating the :class:~pymodulon.core.IcaData
object.
[1]:
from pymodulon.example_data import load_ecoli_data
from pymodulon.core import IcaData
[2]:
ica_data = load_ecoli_data()
Setting the threshold method
You can choose between the D’Agostino or K-means method of thresholding when creating the :class:~pymodulon.core.IcaData
object.
[3]:
test_data = IcaData(ica_data.M,
ica_data.A,
trn=ica_data.trn,
threshold_method='dagostino')
test_data.thresholds['ArgR']
WARNING:root:Using the default dagostino_cutoff of 550. This may not be optimal for your dataset. Use ica_data.reoptimize_thresholds() to find the optimal threshold.
[3]:
0.0804412723
A higher D’Agostino cutoff leads to higher thresholds. The default is 550.
[4]:
test_data = IcaData(ica_data.M,
ica_data.A,
trn=ica_data.trn,
threshold_method='dagostino',
dagostino_cutoff=5000)
test_data.thresholds['ArgR']
[4]:
0.26879279035000003
To search for the optimal D’Agostino cutoff, set optimize_cutoff=True
. To demonstrate this, we created a mini IcaData object since optimizing the cutoff can take 2-3 minutes.
[5]:
test_data = IcaData(ica_data.M.iloc[:,:10],
ica_data.A.iloc[:10,:],
trn=ica_data.trn,
threshold_method='dagostino',
optimize_cutoff=True)
test_data.thresholds['ArgR']
WARNING:root:Optimizing iModulon thresholds, may take 2-3 minutes...
[5]:
0.0945705206
The K-means method is a non-parametric threshold estimation method and is the default if no TRN is supplied. Although the thresholds are fairly close to the D’Agostino method, the D’Agostino method produces the highest accuracy iModulons.
[6]:
test_data = IcaData(ica_data.M,
ica_data.A,
threshold_method='kmeans')
test_data.thresholds['ArgR']
[6]:
0.1375202039
Changing the iModulon thresholds
If the TRN file is updated, the D’Agostino test statistic can be re-optimized using the reoptimize_thresholds
function. To manually adjust this statistic, use the recompute_thresholds
function. To change a single iModulon threshold, use the change_threshold
function.
Re-optimizing the threshold
The reoptimize_threshold
function may take a few minutes to complete. Fortunately, this only needs to be performed once. You can turn off the progress bar or plot using the progress
and plot
arguments, respectively.
[7]:
# ica_data.reoptimize_thresholds()
Changing the D’agostino statistic
To test out different D’agostino test statistics, use the recompute_thresholds
function
[8]:
ica_data.view_imodulon('GlpR')
[8]:
gene_weight | start | end | strand | gene_name | length | operon | COG | accession | regulator | |
---|---|---|---|---|---|---|---|---|---|---|
b2239 | 0.211384 | 2349934 | 2351011 | - | glpQ | 1077 | glpTQ | Energy production and conversion | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2240 | 0.306134 | 2351015 | 2352374 | - | glpT | 1359 | glpTQ | Carbohydrate transport and metabolism | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2241 | 0.375662 | 2352646 | 2354275 | + | glpA | 1629 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2242 | 0.328961 | 2354264 | 2355524 | + | glpB | 1260 | glpABC | Amino acid transport and metabolism | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2243 | 0.315752 | 2355520 | 2356711 | + | glpC | 1191 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b3426 | 0.350034 | 3562012 | 3563518 | + | glpD | 1506 | glpD | Energy production and conversion | NC_000913.3 | arcA,crp,glpR,rpoD,yieP |
b3925 | 0.084278 | 4114568 | 4115579 | - | glpX | 1011 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
b3926 | 0.290235 | 4115713 | 4117222 | - | glpK | 1509 | glpFKX | Energy production and conversion | NC_000913.3 | crp,glpR,rpoD |
b3927 | 0.312307 | 4117244 | 4118090 | - | glpF | 846 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
[9]:
ica_data.recompute_thresholds(300)
[10]:
ica_data.view_imodulon('GlpR')
[10]:
gene_weight | start | end | strand | gene_name | length | operon | COG | accession | regulator | |
---|---|---|---|---|---|---|---|---|---|---|
b1199 | 0.045686 | 1249124 | 1249757 | - | dhaL | 633 | dhaKLM | Carbohydrate transport and metabolism | NC_000913.3 | arcA,dhaR,fnr,lrp,rpoS |
b1200 | 0.045491 | 1249767 | 1250838 | - | dhaK | 1071 | dhaKLM | Carbohydrate transport and metabolism | NC_000913.3 | arcA,dhaR,fnr,lrp,rpoS |
b2239 | 0.211384 | 2349934 | 2351011 | - | glpQ | 1077 | glpTQ | Energy production and conversion | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2240 | 0.306134 | 2351015 | 2352374 | - | glpT | 1359 | glpTQ | Carbohydrate transport and metabolism | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2241 | 0.375662 | 2352646 | 2354275 | + | glpA | 1629 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2242 | 0.328961 | 2354264 | 2355524 | + | glpB | 1260 | glpABC | Amino acid transport and metabolism | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2243 | 0.315752 | 2355520 | 2356711 | + | glpC | 1191 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b3011 | 0.049324 | 3155354 | 3156518 | + | yqhD | 1164 | yqhD-dkgA | Energy production and conversion | NC_000913.3 | rpoD,yqhC |
b3426 | 0.350034 | 3562012 | 3563518 | + | glpD | 1506 | glpD | Energy production and conversion | NC_000913.3 | arcA,crp,glpR,rpoD,yieP |
b3925 | 0.084278 | 4114568 | 4115579 | - | glpX | 1011 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
b3926 | 0.290235 | 4115713 | 4117222 | - | glpK | 1509 | glpFKX | Energy production and conversion | NC_000913.3 | crp,glpR,rpoD |
b3927 | 0.312307 | 4117244 | 4118090 | - | glpF | 846 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
Manually changing thresholds
To manually change a threshold, use the change_threshold
function.
[11]:
ica_data.change_threshold('GlpR',0.2)
[12]:
ica_data.view_imodulon('GlpR')
[12]:
gene_weight | start | end | strand | gene_name | length | operon | COG | accession | regulator | |
---|---|---|---|---|---|---|---|---|---|---|
b2239 | 0.211384 | 2349934 | 2351011 | - | glpQ | 1077 | glpTQ | Energy production and conversion | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2240 | 0.306134 | 2351015 | 2352374 | - | glpT | 1359 | glpTQ | Carbohydrate transport and metabolism | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2241 | 0.375662 | 2352646 | 2354275 | + | glpA | 1629 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2242 | 0.328961 | 2354264 | 2355524 | + | glpB | 1260 | glpABC | Amino acid transport and metabolism | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2243 | 0.315752 | 2355520 | 2356711 | + | glpC | 1191 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b3426 | 0.350034 | 3562012 | 3563518 | + | glpD | 1506 | glpD | Energy production and conversion | NC_000913.3 | arcA,crp,glpR,rpoD,yieP |
b3926 | 0.290235 | 4115713 | 4117222 | - | glpK | 1509 | glpFKX | Energy production and conversion | NC_000913.3 | crp,glpR,rpoD |
b3927 | 0.312307 | 4117244 | 4118090 | - | glpF | 846 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
[ ]:
Comparing iModulons
compare_ica
is a function specifically built to compare the components between different ICA runs. The function will return a Digraph plot which shows connections between ICA components of different runs, as well as Dictionary List of the one-to-one connections.
The function requires a minimum of two inputs if running a comparison beween two ICA runs for the same organism:
S1: Pandas Dataframe of M matrix 1
S2: Pandas Dataframe of M Matrix 2
compare_ica
can be found in pymodulon.compare
[1]:
from pymodulon import example_data
from pymodulon.compare import *
Comparsion within the same organism
[2]:
ecoli_data = example_data.load_ecoli_data()
We will create a copy of the E. coli M matrix that has slightly different iModulon names.
[3]:
new_M = ecoli_data.M.copy()
new_M.columns = ['run2_'+str(i) for i in new_M.columns]
For this example, we will only compare the first 5 iModulons.
[4]:
links,dots = compare_ica(ecoli_data.M.iloc[:,:5],new_M.iloc[:,:5])
The comparison data are stored in links
. The first two values in each element of links
are the iModulon name from the first and second datasets respectively. The final value is the Pearson R correlation between the Independent Component gene weights.
[5]:
links
[5]:
[('AllR/AraC/FucR', 'run2_AllR/AraC/FucR', 1.0),
('ArcA-1', 'run2_ArcA-1', 1.0),
('ArcA-2', 'run2_ArcA-2', 1.0),
('ArgR', 'run2_ArgR', 1.0),
('AtoC', 'run2_AtoC', 1.0)]
The graphical comparison is saved as dots
. The thickness of the arrows represents how similar the iModulons in the first dataset are to the dataset in the second dataset. Since we are comparing two copies of the same dataset, all arrows are at maximum thickness.
[6]:
dots
[6]:
Comparison across organisms
compare_ica
can also be used to compare ICA components between different organisms. To do so, you will need to provide an addition input, which is a string path to a Bidirectional Best Hit (BBH) orthology between your two organisms of interest.
[7]:
example_bbh = example_data.load_example_bbh()
example_bbh.head()
[7]:
gene | subject | PID | alnLength | mismatchCount | gapOpenCount | queryStart | queryEnd | subjectStart | subjectEnd | eVal | bitScore | gene_length | COV | BBH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | b0003 | USA300HOU_RS06760 | 26.897 | 290 | 183 | 8 | 1 | 279 | 4 | 275 | 2.060000e-25 | 99.8 | 310 | 0.935484 | <=> |
6 | b0007 | USA300HOU_RS06905 | 40.638 | 470 | 257 | 6 | 1 | 455 | 1 | 463 | 3.650000e-114 | 342.0 | 476 | 0.987395 | <=> |
9 | b0014 | USA300HOU_RS08425 | 55.818 | 636 | 251 | 4 | 1 | 636 | 1 | 606 | 0.000000e+00 | 678.0 | 638 | 0.996865 | <=> |
10 | b0015 | USA300HOU_RS08420 | 46.982 | 381 | 187 | 5 | 1 | 371 | 1 | 376 | 8.040000e-110 | 323.0 | 376 | 1.013298 | <=> |
15 | b0023 | USA300HOU_RS08455 | 42.353 | 85 | 36 | 3 | 1 | 79 | 1 | 78 | 1.530000e-09 | 46.6 | 87 | 0.977011 | <=> |
You can also provide a path to your own orthology file, as long as it contains the columns gene
and target
It is important that the order of the organism matches the order of the orthology file. Specifically, the organism for the M1
file must be the organism that fills the gene
column in the orthology file.
[8]:
staph_data = example_data.load_staph_data()
[9]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
ortho_file = example_bbh)
dots
[9]:
Here it is much more clear that some iModulons are more similar between species, whereas others are loosely connected. Arrows can also indicate if an iModulon is composed of multiple iModulons from another dataset.
[10]:
links[:5]
[10]:
[('AllR/AraC/FucR', 'PyrR', 0.261519782054789),
('Cbl+CysB', 'CymR', 0.2694541420808301),
('Crp-1', 'CcpA-1', 0.2576532196155267),
('CsqR', 'LacR', 0.4172511239115126),
('CysB', 'CymR', 0.31057785046755)]
Additionally, you can adjust the metric type and cutoff values for correlations by utilizing the method
and cutoff
parameters. The default for method
is "pearson"
, but can be changed to `"spearman"
.
cutoff
takes a float value between 0 and 1 and will only show correlations greater than the cutoff
value (default: 0.3).
[11]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
ortho_file = example_bbh,
cutoff = 0.6)
dots
[11]:
[12]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
ortho_file = example_bbh,
cutoff = 0.15, method = 'spearman')
dots
[12]:
Create a BBH orthology file
To create your own BBH orthology file, pymodulon.compare contains a set of functions that will allow you to create the desired bbh.csv file for you specific comparison. This only needs to be done once per comparison.
In order to make the BBH orthology file, you need to obtain the Genbank Full Genome files for your organisms. These can be found at https://www.ncbi.nlm.nih.gov/genome/. Search for both of your organisms/strains and download the “Genbank (full)” file. Once saved into a known location, you can use the make_prots
function to convert the Genbank files into protein FASTA files.
make_prots
recieves two parameters:
gbk
: Path to the genbank file (one file per function call)out_path
: Path to the protein FASTA files
[13]:
# ecoli_gb= "../pymodulon/test/data/ecoli.gb"
# mtb_gb = "../pymodulon/test/data/mtb.gb"
# ecoli_out = "../pymodulon/test/data/ecoli_prot.fasta"
# mtb_out = "../pymodulon/test/data/mtb_prot.fasta"
# make_prots(ecoli_gb,ecoli_out)
# make_prots(mtb_gb,mtb_out)
Create Genbank database files
In order to run Bidirectional Best Hits between two organisms, it is required to build the necessary Genbank Database files from the newly created protein FASTA files. In order to do so, use the make_prots_db
function. You will do this for both organims.
make_prots_db
recieves one parameter:
fasta_file
: String path to protein FASTA file
If there is an error, the function will print out error message. The function will also not execut if the necessary Genbank DB files already exist.
[14]:
# make_prot_db(ecoli_out)
# make_prot_db(mtb_out)
Create BBH CSV
Once the previous steps have been completed, you can now create the BBH CSV file. In order to do so, you can use the get_bbh
function.
get_bbh
has the following parameters:
db1
: String path to protein FASTA file (output of make_prots function) for organism 1db2
: String path to protein FASTA file (output of make_prots function) for organism 2outdir
: String path to output directory, default is “bbh” and will create the directory if it does not existoutname
: Default db1_vs_db2_parsed.csv where db[1-2] are the passed arguments name of the csv file where that will save the resultsmincov
: Minimum coverage to call hits in BLAST, must be between 0 and 1evalue
: evalue thershold for BLAST hits, Default .001threads
: Number of threads to run BLAST, Default 1force
: Whether to overwrite existing files or notsavefiles
: Whether to save files to outdir
The function will return a Pandas DataFrame containing all the BLAST hits between the two genes. The function will also save the DataFrame to a csv file, which you can then use in compare_ica
between the two organisms.
[15]:
# get_bbh(ecoli_out,mtb_out, outdir = "../pymodulon/test/data/")
[ ]:
Creating the Gene Table
A copy of this notebook is available on the PyModulon GitHub page: Creating the Gene Table Notebook
First, download the FASTA and GFF files for your organism and its plasmids from NCBI. We will use the E. coli K-12 MG1655 strain as an example, which does not have plasmids.
[2]:
from pymodulon import example_data
Get information from GFF files
[3]:
from pymodulon.gene_util import *
Enter the location of all your GFF files here:
[4]:
gff_files = [example_data.ecoli_gff]
The following cell will convert all the GFF files into a single Pandas DataFrame for easy manipulation. Pseudogenes have multiple rows in a GFF file (one for each fragment), but only the first fragment will be kept.
[5]:
keep_cols = ['accession','start','end','strand','gene_name','old_locus_tag','gene_product','ncbi_protein']
DF_annot = gff2pandas(gff_files,index='locus_tag')
DF_annot = DF_annot[keep_cols]
DF_annot.head()
WARNING:root:Duplicate locus_tag detected. Dropping duplicates.
[5]:
accession | start | end | strand | gene_name | old_locus_tag | gene_product | ncbi_protein | |
---|---|---|---|---|---|---|---|---|
locus_tag | ||||||||
b0001 | NC_000913.3 | 190 | 255 | + | thrL | None | thr operon leader peptide | NP_414542.1 |
b0002 | NC_000913.3 | 337 | 2799 | + | thrA | None | fused aspartate kinase/homoserine dehydrogenase 1 | NP_414543.1 |
b0003 | NC_000913.3 | 2801 | 3733 | + | thrB | None | homoserine kinase | NP_414544.1 |
b0004 | NC_000913.3 | 3734 | 5020 | + | thrC | None | threonine synthase | NP_414545.1 |
b0005 | NC_000913.3 | 5234 | 5530 | + | yaaX | None | DUF2502 domain-containing protein YaaX | NP_414546.1 |
To ensure that the gene index used is identical to the expression matrix, load in your data.
[6]:
DF_log_tpm = example_data.load_example_log_tpm()
DF_log_tpm.head()
[6]:
ecoli_00001 | ecoli_00002 | ecoli_00009 | ecoli_00010 | |
---|---|---|---|---|
Geneid | ||||
b0001 | 10.473721 | 10.271944 | 10.315476 | 10.808135 |
b0002 | 10.260569 | 10.368555 | 10.735874 | 10.726916 |
b0003 | 9.920277 | 10.044224 | 10.528432 | 10.503092 |
b0004 | 9.936694 | 10.010638 | 9.739519 | 9.722997 |
b0005 | 7.027515 | 7.237449 | 6.745798 | 6.497823 |
Check that the genes are the same in the expression dataset as in the annotation dataframe. Mismatched genes are listed below.
[7]:
test = DF_annot.sort_index().index == DF_log_tpm.sort_index().index
DF_annot[~test]
[7]:
accession | start | end | strand | gene_name | old_locus_tag | gene_product | ncbi_protein | |
---|---|---|---|---|---|---|---|---|
locus_tag |
(Optional) KEGG and COGs
Generate nucleotide fasta files for CDS
Enter the location of all your fasta files here:
[8]:
fasta_files = [example_data.ecoli_fasta]
The following code generates CDS files using your FASTA and GFF3 files
[9]:
from Bio import SeqIO
cds_list = []
for fasta in fasta_files:
seq = SeqIO.read(fasta,'fasta')
# Get gene information for genes in this fasta file
df_genes = DF_annot[DF_annot.accession == seq.id]
for i,row in df_genes.iterrows():
cds = seq[row.start-1:row.end]
if row.strand == '-':
cds = seq[row.start-1:row.end].reverse_complement()
cds.id = row.name
cds.description = row.gene_name if pd.notnull(row.gene_name) else row.name
cds_list.append(cds)
[10]:
cds_list[:5]
[10]:
[SeqRecord(seq=Seq('ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAAC...TGA', SingleLetterAlphabet()), id='b0001', name='NC_000913.3', description='thrL', dbxrefs=[]),
SeqRecord(seq=Seq('ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTG...TGA', SingleLetterAlphabet()), id='b0002', name='NC_000913.3', description='thrA', dbxrefs=[]),
SeqRecord(seq=Seq('ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGAT...TAA', SingleLetterAlphabet()), id='b0003', name='NC_000913.3', description='thrB', dbxrefs=[]),
SeqRecord(seq=Seq('ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCC...TAA', SingleLetterAlphabet()), id='b0004', name='NC_000913.3', description='thrC', dbxrefs=[]),
SeqRecord(seq=Seq('GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCC...TAA', SingleLetterAlphabet()), id='b0005', name='NC_000913.3', description='yaaX', dbxrefs=[])]
Uncomment the line below to write the CDS file
[11]:
# SeqIO.write(cds_list,'CDS.fna','fasta')
Run EggNOG Mapper
Upload the CDS.fna file from your organism directory (within the sequence_files folder)
Make sure to limit the taxonomy to the correct level
After the job is submitted, you must follow the link in your email to run the job.
Once the job completes (after ~4 hrs), download the annotations file.
Save the annotation file
Get KEGG IDs
Once you have the EggNOG annotations, load the annotation file
[12]:
eggnog_file = example_data.ecoli_eggnog
[13]:
DF_eggnog = pd.read_csv(eggnog_file,sep='\t',skiprows=4,header=None)
eggnog_cols = ['query_name','seed eggNOG ortholog','seed ortholog evalue','seed ortholog score',
'Predicted taxonomic group','Predicted protein name','Gene Ontology terms',
'EC number','KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction',
'KEGG_rclass','BRITE','KEGG_TC','CAZy','BiGG Reaction','tax_scope',
'eggNOG OGs','bestOG_deprecated','COG','eggNOG free text description']
DF_eggnog.columns = eggnog_cols
# Strip last three rows as they are comments
DF_eggnog = DF_eggnog.iloc[:-3]
# Set locus tag as index
DF_eggnog = DF_eggnog.set_index('query_name')
DF_eggnog.index.name = 'locus_tag'
DF_eggnog.head()
[13]:
seed eggNOG ortholog | seed ortholog evalue | seed ortholog score | Predicted taxonomic group | Predicted protein name | Gene Ontology terms | EC number | KEGG_orth | KEGG_pathway | KEGG_module | ... | KEGG_rclass | BRITE | KEGG_TC | CAZy | BiGG Reaction | tax_scope | eggNOG OGs | bestOG_deprecated | COG | eggNOG free text description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
locus_tag | |||||||||||||||||||||
b0002 | 316407.85674276 | 0.000000e+00 | 1600.5 | Escherichia | thrA | GO:0003674,GO:0003824,GO:0004072,GO:0004412,GO... | 1.1.1.3,2.7.2.4 | ko:K12524 | ko00260,ko00261,ko00270,ko00300,ko01100,ko0111... | M00016,M00017,M00018,M00526,M00527 | ... | RC00002,RC00043,RC00087 | ko00000,ko00001,ko00002,ko01000 | NaN | NaN | iECDH1ME8569_1439.ECDH1ME8569_0002,iEcDH1_1363... | Escherichia | 1MW3H@1224,1RN1G@1236,3XN2A@561,COG0460@1,COG0... | NA|NA|NA | E | homoserine dehydrogenase I |
b0003 | 316407.85674277 | 3.000000e-178 | 630.9 | Escherichia | thrB | GO:0000096,GO:0000097,GO:0003674,GO:0003824,GO... | 2.7.1.39 | ko:K00872 | ko00260,ko01100,ko01110,ko01120,ko01230,map002... | M00018 | ... | RC00002,RC00017 | ko00000,ko00001,ko00002,ko01000 | NaN | NaN | iECSE_1348.ECSE_0003 | Escherichia | 1MW8I@1224,1RMYR@1236,3XN01@561,COG0083@1,COG0... | NA|NA|NA | F | Catalyzes the ATP-dependent phosphorylation of... |
b0004 | 316407.21321894 | 1.200000e-246 | 858.6 | Escherichia | thrC | GO:0003674,GO:0003824,GO:0004795,GO:0005575,GO... | 4.2.3.1 | ko:K01733 | ko00260,ko00750,ko01100,ko01110,ko01120,ko0123... | M00018 | ... | RC00017,RC00526 | ko00000,ko00001,ko00002,ko01000 | NaN | NaN | iLF82_1304.LF82_2261,iNRG857_1313.NRG857_00025 | Escherichia | 1MUWQ@1224,1RQ0H@1236,3XP31@561,COG0498@1,COG0... | NA|NA|NA | E | Catalyzes the gamma-elimination of phosphate f... |
b0005 | 316407.21321895 | 1.400000e-23 | 115.5 | Escherichia | yaaX | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Escherichia | 1N4SV@1224,1S9KR@1236,2B58A@1,32XJS@2,3XRGB@561 | NA|NA|NA | S | Protein of unknown function (DUF2502) |
b0006 | 198214.SF0006 | 1.600000e-143 | 515.4 | Gammaproteobacteria | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Escherichia | 1MUAF@1224,1RMTD@1236,COG3022@1,COG3022@2 | NA|NA|NA | S | Belongs to the UPF0246 family |
5 rows × 21 columns
Now we will pull the KEGG information from the eggNOG file, including orthology, pathway, module, and reactions for each gene.
[14]:
DF_kegg = DF_eggnog[['KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction']]
# Melt dataframe
DF_kegg = DF_kegg.reset_index().melt(id_vars='locus_tag')
# Remove null values
DF_kegg = DF_kegg[DF_kegg.value.notnull()]
# Split comma-separated values into their own rows
list2struct = []
for name,row in DF_kegg.iterrows():
for val in row.value.split(','):
list2struct.append([row.locus_tag,row.variable,val])
DF_kegg = pd.DataFrame(list2struct,columns=['gene_id','database','kegg_id'])
# Remove ko entries, as only map entries are searchable in KEGG pathway
DF_kegg = DF_kegg[~DF_kegg.kegg_id.str.startswith('ko')]
DF_kegg.head()
[14]:
gene_id | database | kegg_id | |
---|---|---|---|
2750 | b0002 | KEGG_pathway | map00260 |
2751 | b0002 | KEGG_pathway | map00261 |
2752 | b0002 | KEGG_pathway | map00270 |
2753 | b0002 | KEGG_pathway | map00300 |
2754 | b0002 | KEGG_pathway | map01100 |
Save KEGG information
Uncomment the line below to save the KEGG file
[15]:
# DF_kegg.to_csv('kegg_mapping.csv')
Save COGs to annotation dataframe
[16]:
DF_annot['COG'] = DF_eggnog.COG
# Make sure COG only has one entry per gene
DF_annot['COG'] = [item[0] if isinstance(item,str) else item for item in DF_annot['COG']]
Uniprot ID mapping
The uniprot_id_mapping
function is a python wrapper for the Uniprot ID mapping tool. Use input_id=P_REFSEQ_AC
if the FASTA/GFFf files are from RefSeq, and input_id=EMBL
if the files are from Genbank.
[17]:
mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='P_REFSEQ_AC',output_id='ACC',
input_name='ncbi_protein',output_name='uniprot')
mapping_uniprot.head()
[17]:
ncbi_protein | uniprot | |
---|---|---|
2090 | NP_416515.1 | A0A0G3HGF9 |
4381 | NP_418733.1 | A0A0N9YI43 |
3651 | NP_418014.1 | A0A223DQZ5 |
26 | NP_414563.1 | A0A385XJ53 |
1978 | NP_416408.1 | A0A385XJ53 |
[18]:
# Merge with current annotation
DF_annot = pd.merge(DF_annot.reset_index(),mapping_uniprot,how='left',on='ncbi_protein')
DF_annot.set_index('locus_tag',inplace=True)
DF_annot.head()
[18]:
accession | start | end | strand | gene_name | old_locus_tag | gene_product | ncbi_protein | COG | uniprot | |
---|---|---|---|---|---|---|---|---|---|---|
locus_tag | ||||||||||
b0001 | NC_000913.3 | 190 | 255 | + | thrL | None | thr operon leader peptide | NP_414542.1 | NaN | P0AD86 |
b0002 | NC_000913.3 | 337 | 2799 | + | thrA | None | fused aspartate kinase/homoserine dehydrogenase 1 | NP_414543.1 | E | P00561 |
b0003 | NC_000913.3 | 2801 | 3733 | + | thrB | None | homoserine kinase | NP_414544.1 | F | P00547 |
b0004 | NC_000913.3 | 3734 | 5020 | + | thrC | None | threonine synthase | NP_414545.1 | E | P00934 |
b0005 | NC_000913.3 | 5234 | 5530 | + | yaaX | None | DUF2502 domain-containing protein YaaX | NP_414546.1 | S | P75616 |
Add Biocyc Operon information
To obtain operon information from Biocyc, follow the steps below
Go to Biocyc.org (you may need to create an account and/or login)
Change the organism database to your organism/strain
Select SmartTables -> Special SmartTables
Select “All genes of <organism>”
Select the “Gene Name” column
Under “ADD TRANSFORM COLUMN” select “Genes in same transcription unit”
Select the “Genes in same transcription unit” column
Under “ADD PROPERTY COLUMN” select “Accession-1”
Under OPERATIONS, select “Export” -> “to Spreadsheet File…”
Select “common names” and click “Export smarttable”
Add file location below and run the code cell
[19]:
biocyc_file = example_data.ecoli_biocyc
DF_biocyc = pd.read_csv(biocyc_file,sep='\t')
# Remove genes with no accession
DF_biocyc = DF_biocyc[DF_biocyc['Accession-1'].notnull()]
# Set the accession (i.e. locus tag) as index
DF_biocyc = DF_biocyc.set_index('Accession-1').sort_values('Left-End-Position')
# Only keep genes in the final annotation file
DF_biocyc = DF_biocyc.reindex(DF_annot.index)
# Reformat transcription units
DF_biocyc['operon_list'] = DF_biocyc['Accession-1.1'].apply(reformat_biocyc_tu)
DF_biocyc.head()
[19]:
Gene Name | Left-End-Position | Right-End-Position | Product | Genes in same transcription unit | Accession-1.1 | operon_list | |
---|---|---|---|---|---|---|---|
locus_tag | |||||||
b0001 | thrL | 190.0 | 255.0 | <i>thr</i> operon leader peptide | thrL // thrA // thrB // thrC | b0001 // b0002 // b0003 // b0004 | b0001;b0002;b0003;b0004 |
b0002 | thrA | 337.0 | 2799.0 | fused aspartate kinase/homoserine dehydrogenase 1 | thrA // thrB // thrC // thrL | b0002 // b0003 // b0004 // b0001 | b0001;b0002;b0003;b0004 |
b0003 | thrB | 2801.0 | 3733.0 | homoserine kinase | thrA // thrB // thrC // thrL | b0002 // b0003 // b0004 // b0001 | b0001;b0002;b0003;b0004 |
b0004 | thrC | 3734.0 | 5020.0 | threonine synthase | thrA // thrB // thrC // thrL | b0002 // b0003 // b0004 // b0001 | b0001;b0002;b0003;b0004 |
b0005 | yaaX | 5234.0 | 5530.0 | DUF2502 domain-containing protein YaaX | yaaX | b0005 | b0005 |
Assign unique IDs to operons
The following code assigns unique names to each operon
[20]:
# Get all operons
operons = DF_biocyc['operon_list'].unique()
# Map each operon to a unique string
operon_dict = {operon: "Op"+str(i) for i, operon in enumerate(operons)}
# Add names to dataframe
DF_biocyc['operon'] = [operon_dict[op] for op in DF_biocyc["operon_list"]]
DF_biocyc.head()
[20]:
Gene Name | Left-End-Position | Right-End-Position | Product | Genes in same transcription unit | Accession-1.1 | operon_list | operon | |
---|---|---|---|---|---|---|---|---|
locus_tag | ||||||||
b0001 | thrL | 190.0 | 255.0 | <i>thr</i> operon leader peptide | thrL // thrA // thrB // thrC | b0001 // b0002 // b0003 // b0004 | b0001;b0002;b0003;b0004 | Op0 |
b0002 | thrA | 337.0 | 2799.0 | fused aspartate kinase/homoserine dehydrogenase 1 | thrA // thrB // thrC // thrL | b0002 // b0003 // b0004 // b0001 | b0001;b0002;b0003;b0004 | Op0 |
b0003 | thrB | 2801.0 | 3733.0 | homoserine kinase | thrA // thrB // thrC // thrL | b0002 // b0003 // b0004 // b0001 | b0001;b0002;b0003;b0004 | Op0 |
b0004 | thrC | 3734.0 | 5020.0 | threonine synthase | thrA // thrB // thrC // thrL | b0002 // b0003 // b0004 // b0001 | b0001;b0002;b0003;b0004 | Op0 |
b0005 | yaaX | 5234.0 | 5530.0 | DUF2502 domain-containing protein YaaX | yaaX | b0005 | b0005 | Op1 |
Finally, merge the Biocyc information with the main annotation DataFrame
[21]:
DF_annot['operon'] = DF_biocyc['operon']
Clean up and save annotation
First, we will re-order the annotation columns
[22]:
if 'old_locus_tag' in DF_annot.columns:
order = ['gene_name','accession','old_locus_tag','start','end','strand','gene_product','COG','uniprot','operon']
else:
order = ['gene_name','accession','start','end','strand','gene_product','COG','uniprot','operon']
DF_annot = DF_annot[order]
[23]:
DF_annot.head()
[23]:
gene_name | accession | old_locus_tag | start | end | strand | gene_product | COG | uniprot | operon | |
---|---|---|---|---|---|---|---|---|---|---|
locus_tag | ||||||||||
b0001 | thrL | NC_000913.3 | None | 190 | 255 | + | thr operon leader peptide | NaN | P0AD86 | Op0 |
b0002 | thrA | NC_000913.3 | None | 337 | 2799 | + | fused aspartate kinase/homoserine dehydrogenase 1 | E | P00561 | Op0 |
b0003 | thrB | NC_000913.3 | None | 2801 | 3733 | + | homoserine kinase | F | P00547 | Op0 |
b0004 | thrC | NC_000913.3 | None | 3734 | 5020 | + | threonine synthase | E | P00934 | Op0 |
b0005 | yaaX | NC_000913.3 | None | 5234 | 5530 | + | DUF2502 domain-containing protein YaaX | S | P75616 | Op1 |
Final statistics
The following graphs show how much information is available for the organism.
[24]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('ticks')
[25]:
fig,ax = plt.subplots()
DF_annot.count().plot(kind='bar',ax=ax)
ax.set_ylabel('# of Values',fontsize=18)
ax.tick_params(labelsize=16)
Fill missing values
Some organisms are missing gene names, so these will be filled with locus tag gene names.
[26]:
# Fill in missing gene names with locus tag names
DF_annot['tmp_name'] = DF_annot.copy().index.tolist()
DF_annot.gene_name.fillna(DF_annot.tmp_name,inplace=True)
DF_annot.drop('tmp_name',axis=1,inplace=True)
COG letters will also be converted to the full name.
[27]:
# Fill missing COGs with X
DF_annot['COG'].fillna('X',inplace=True)
# Change single letter COG annotation to full description
DF_annot['COG'] = DF_annot.COG.apply(cog2str)
counts = DF_annot.COG.value_counts()
plt.pie(counts.values,labels=counts.index);
Uncomment the following line to save the gene annotation dataset
[28]:
# DF_annot.to_csv('gene_info.csv')
GO Annotations
To start, download the GO Annotations for your organism from AmiGO 2
Go to AmiGO 2
Filter for your organism
Click
CustomDL
Drag
GO class (direct)
to the end of your Selected FieldsEnter the location of your GO annotation file below and run the following code block
[29]:
go_file = example_data.ecoli_go_example
[30]:
DF_GO = pd.read_csv(go_file,sep='\t',header=None,usecols=[2,10,17])
DF_GO.columns = ['gene_name','gene_id','gene_ontology']
DF_GO.gene_id.fillna(DF_GO.gene_name,inplace=True)
DF_GO = DF_GO[['gene_id','gene_ontology']]
DF_GO.head()
[30]:
gene_id | gene_ontology | |
---|---|---|
0 | b0312|ECK0310 | betaine-aldehyde dehydrogenase activity |
1 | b0312|ECK0310 | betaine-aldehyde dehydrogenase activity |
2 | b0312|ECK0310 | betaine-aldehyde dehydrogenase activity |
3 | b0312|ECK0310 | betaine-aldehyde dehydrogenase activity |
4 | b0312|ECK0310 | response to osmotic stress |
Take a look at the gene_id
column: 1. Make sure there are no null entries 2. Check which naming convention is used (locus tag, new locus tag, or gene name)
If it looks like it uses the old locus tag, set old_locus_tag to True
.
In this example, the gene_id
column needs to be altered slightly to match with the gene name. This may not be necessary for your organism.
[31]:
def format_go_gene_id(gene_id):
locus_tags = [item for item in gene_id.split('|') if re.match('b\d{4}',item)]
if len(locus_tags) > 0:
return locus_tags[0]
else:
return None
DF_GO.gene_id = DF_GO.gene_id.apply(format_go_gene_id)
Now we remove null entries
[32]:
DF_GO = DF_GO[DF_GO.gene_id.notnull()]
[33]:
naming = 'locus_tag' # Can be "gene_name" or "old_locus_tag"
if naming != 'locus_tag':
convert_tags = {value:key for key,value in DF_annot[naming].items()}
DF_GO.gene_id = DF_GO.gene_id.apply(lambda x: convert_tags[x])
[34]:
DF_GO.head()
[34]:
gene_id | gene_ontology | |
---|---|---|
0 | b0312 | betaine-aldehyde dehydrogenase activity |
1 | b0312 | betaine-aldehyde dehydrogenase activity |
2 | b0312 | betaine-aldehyde dehydrogenase activity |
3 | b0312 | betaine-aldehyde dehydrogenase activity |
4 | b0312 | response to osmotic stress |
Uncomment the line below to save the annotations
[35]:
# DF_GO[['gene_id','gene_ontology']].to_csv('GO_annotations.csv')
Additional Functions
Identifying “single-gene” iModulons
Some iModulons are dominated by a single, high-coefficient gene. These iModulons may result from:
Overdecomposition of the dataset to identify noisy genes
Artificial knock-out of single genes
Regulons with only one target gene
No matter what causes these iModulons, it is important to be aware of them. The find_single_gene_imodulons
function identifies iModulons that are likely dominated by a single gene.
The iModulons identified by find_single_gene_imodulons
may contain more than one gene, since a threshold-agnostic method is used to identify these iModulons.
[1]:
from pymodulon.example_data import load_ecoli_data
ica_data = load_ecoli_data()
[2]:
from pymodulon.plotting import *
[3]:
fig,ax= plt.subplots(figsize=(3,3))
plot_regulon_histogram(ica_data,'Cra',ax=ax);
[4]:
ica_data.find_single_gene_imodulons()
[4]:
['Pyruvate', 'fur-KO', 'sgrT', 'thrA-KO', 'ydcI-KO']
Single-gene iModulons can be automatically saved in the iModulon table.
[5]:
ica_data.find_single_gene_imodulons()
ica_data.imodulon_table.loc[['fur-KO']]
[5]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
fur-KO | None | NaN | NaN | NaN | NaN | NaN | 13 | 0 | Genomic Alterations | 0.078193 |
Explained variance
To compute the total amount of expression variation explained by iModulons, use the explained_variance
function.
[6]:
from pymodulon.util import explained_variance
[7]:
explained_variance(ica_data)
[7]:
0.6946207831724602
Explained variance can be computed for individual genes, samples, or iModulons
[8]:
print('iModulons explain {:.0f}% of the expression variation in thrA'.format(explained_variance(ica_data,genes='thrA')*100))
print('The RpoH iModulon explains {:.0f}% of the expression variation in the dnaKJ transcription unit'.format(explained_variance(ica_data,imodulons=['RpoH'],genes=['dnaK','dnaJ'])*100))
print('iModulons explain {:.0f}% of the expression variation induced by iron starvation'.format(explained_variance(ica_data,samples=['fur__wt_dpd__1','fur__wt_dpd__2'])*100))
print('The RpoS iModulon explains {:.1f}% of the global expression variation'.format(explained_variance(ica_data,imodulons=['RpoS'])*100))
iModulons explain 93% of the expression variation in thrA
The RpoH iModulon explains 69% of the expression variation in the dnaKJ transcription unit
iModulons explain 70% of the expression variation induced by iron starvation
The RpoS iModulon explains 4.3% of the global expression variation
The following code shows how to identify iModulons that explain the largest fraction of variance in the dataset. These iModulons tend to represent the effects of global regulators.
[9]:
import pandas as pd
rec_var = {}
for k in ica_data.imodulon_names:
rec_var[k] = explained_variance(ica_data,imodulons=k)
df_rec_var = pd.Series(rec_var)
df_rec_var = df_rec_var.sort_values(ascending=False)
df_rec_var.head(10)
[9]:
RpoS 0.043242
FlhDC 0.042180
Fnr 0.034855
FliA 0.034259
uncharacterized-5 0.032859
Fur-1 0.020544
Crp-1 0.018618
ArcA-1 0.017906
PurR-2 0.017560
NtrC+RpoN 0.016785
dtype: float64
[10]:
import matplotlib.pyplot as plt
[11]:
plt.barh(range(10,0,-1),df_rec_var.head(10),tick_label = df_rec_var.head(10).index)
plt.xlabel('Fraction of Explained Variance',fontsize=16)
# Rename uncharacterized-5 to ppGpp
new_labels = [x if x != 'uncharacterized-5' else 'ppGpp' for x in df_rec_var.head(10).index]
plt.yticks(range(10,0,-1),labels=new_labels,fontsize=14);
Plotting Explained Variance
[12]:
from pymodulon.plotting import plot_explained_variance
[13]:
plot_explained_variance(ica_data, pc=True);
[ ]:
Creating an iModulonDB Dashboard
After an iModulon analysis is completed and published, dashboards for every iModulon and gene are made available on the iModulonDB website. The PyModulon package enables generation of all necessary files for any IcaData
object that meets a few simple requirements.
iModulonDB Site Overview
For general information about iModulonDB, visit the about page. Familiarize yourself with the main page types:
Splash page: The main entry point for the site
Dataset page: After choosing a dataset on the splash page, you are directed to the dataset page, which contains gray dataset metadata box and a table of iModulons.
iModulon page: The iModulon dashboards, which have a gray iModulon metadata box, a gene table, and several plots.
Gene page: Gene dashboards, which have an iModulon table connecting them to iModulons and an expression plot.
It’s helpful to understand the two parts to most websites:
Front End: The front end runs in a web user’s browser. It requests and receives files from the back end, and contains JavaScript functions which can create interactive plots and other features. iModulonDB’s front end code is available at this GitHub repository.
Back End: The back end runs in a server, which is usually operated by the site owner. It sends the necessary files to the front end when requested. With iModulonDB, all back end data is precomputed by the PyModulon package using the instructions provided here. Output files are uploaded to GitHub Pages, which serves the files to users. If you host your own version of the site, the local server will be the back end.
To generate and view iModulonDB dashboards for a custom project, follow these steps:
Follow the instructions in this document to generate your own files for the new
IcaData
object of interest. Inimodulondb_export
, specify the path to the cloned repository so that PyModulon can place new files in the appropriate locations.Host a local server from the repository folder. This will act as the back end in place of GitHub Pages.
In your browser, visit
http://localhost:8000/dataset.html?organism=new_organism&dataset=new_dataset
. The organism and dataset names in the URL are customizable.
[1]:
from pymodulon.core import IcaData
from pymodulon.imodulondb import *
from pymodulon.example_data import load_ecoli_data
[2]:
# Increase the maximum column width
pd.set_option('display.max_colwidth', None)
[3]:
ica_data = load_ecoli_data()
The imodulondb_compatibility
Function
imodulondb_compatibility
is essentially a convenient checklist for iModulonDB-relevant metadata. It has four outputs, which we will go through one-by-one.
Optional arguments:
inplace
: Whether to modify the object. It is not recommended to change this from its default (False
).tfcomplex_to_gene
: Dictionary mapping transcription factor complexes to gene names. See the section ontf_issues.has_gene
for more information.
Returns:
table_issues
: Dataframe describing missing columns from the major annotation tables and the site’s behavior if they are not updated. Three elements in this table are “CRITICAL”, meaning the site cannot be generated without them. If those elements are missing, they will be entries in this table as well as warnings.tf_issues
: Dataframe describing each regulator from theimodulon_table
that can’t be mapped to thetrn
,tf_links
, and/orgene_table
.missing_gene_links
: Series listing each gene that doesn’t have agene_link
to an external database.missing_dois
: Index listing each sample that doesn’t have an associated DOI in thesample_table
. Clicking on these samples in the activity plots on the table will not bring you to a relevant paper.
Call this function and browse its output. Not all issues it lists will be important to fix. In some cases, the default values will be fine for your purposes, or the match it is seeking might not exist.
[4]:
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
For demonstration purposes, we will also create a minimal IcaData
object called blank_data
containing only the M and A matrices. If we call imodulondb_compatibility
with this object, the table_issues
output will contain all possible issues.
Note the three warnings which appear when this happens; seeing any of these indicates that iModulonDB export cannot be completed. The possible CRITICAL issues are:
No
X
matrix: Each gene page contains a plot of gene expression, so you must provide the X matrix.No
project
column insample_table
: See the section ontable_issues.sample_table
.No
condition
column insample_table
: See the section ontable_issues.sample_table
.
[5]:
blank_data = IcaData(ica_data.M, ica_data.A)
all_table_issues, b_tf_issues, b_missing_gene_links, b_missing_dois = imodulondb_compatibility(blank_data)
WARNING:root:Critical issue: No X matrix
WARNING:root:Critical issue: No project column in sample_table.
WARNING:root:Critical issue: No condition column in sample_table.
The table_issues
Output
The first output of imodulondb_compatibility
is the table_issues
. Each row corresponds to an issue with one of the main class elements. The columns are:
Table
: which table or other variable the issue is inMissing Column
: the column of the Table with the issue (not case sensitive; capitalization is ignored).Solution
: Unless “CRITICAL” is in this cell, the site behavior if the issue remained is described here.
If the X
matrix is missing, it will be listed in the first row as a CRITICAL issue. After that, the imodulondb_table
issues represent missing information from a dictionary specific to iModulonDB. The rest of the rows may relate to the gene_table
, sample_table
, or imodulon_table
. These issues arise from parsing the column names of the annotation tables.
You may not be able to provide the information to fill in the missing columns (for example, if no database of gene functions exist for your organism). If you’d like, you can partially fill in any column and leave np.nan
values for the information you do not know. In most cases, you can continue to omit the column - the Solution
column of table_issues
will describe what happens in those cases.
Column names are case insensitive. Be sure not to have multiple columns with matching names, as that will cause errors later.
[6]:
table_issues
[6]:
Table | Missing Column | Solution | |
---|---|---|---|
0 | iModulonDB | organism | The default, "New Organism", will be used. |
1 | iModulonDB | dataset | The default, "New Dataset", will be used. |
2 | iModulonDB | strain | The default, "Unspecified", will be used. |
3 | iModulonDB | publication_name | The default, "Unpublished Study", will be used. |
4 | iModulonDB | publication_link | The publication name will not be a hyperlink. |
5 | iModulonDB | gene_link_db | The default, "External Database", will be used. |
6 | iModulonDB | organism_folder | The default, "new_organism", will be used. |
7 | iModulonDB | dataset_folder | The default, "new_dataset", will be used. |
8 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
9 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
10 | Sample | n_replicates | This column will be generated for you. |
11 | iModulon | name | imodulon_table.index will be used. |
12 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
13 | iModulon | exp_var | This column will be left blank. |
The imodulondb_table
Variable
Unless you are missing the X matrix, the first set of table_issues will be relating to the iModulonDB table (imodulondb_table
), which is a dictionary of details about the dataset in general. As you can see by reading the Solution
column of table_issues
for these entries, most entries have a default value (and the default behavior for a missing link is to not create a link).
Each entry affects the following parts of the site:
organism
: Appears in the dataset page metadata box, and gets programmatically changed into a short form (“Eshcerichia coli” –> “E. coli”) to appear in the dataset title on iModulon and Gene pages. Be sure to use the form “Genus species”.dataset
: The specific dataset title, which could distinguish it from other datasets in this organism. Appears in the dataset page metadata box, and gets appended to the shortened organism name in the dataset title on iModulon and Gene pages.strain
: Appears on the dataset pages in the metadata box.publication_name
: Appears on the dataset pages in the metadata box. Typically we use the form “Smith, et al., year”.publication_link
: The publication_name will be turned into a hyperlink if this entry is not blank. DOIs preferred. PyModulon does not test the validity of any links, so be sure this is a valid link.gene_link_db
: Appears on Gene pages in the metadata box if agene_link
is also available for the gene. PyModulon expects all gene links to go to specific gene pages from the same database, and for the database name to be the one shown here. If gene links are not included, then “Not Available” will appear in place of thegene_link_db
name. Examples of gene link databases include EcoCyc, SubtiWiki, and AureoWiki.organism_folder
: Determines output file location names. Also used in the URLs. Cannot contain spaces or slashes.dataset_folder
: Determines output file location names. Also used in URLs. Cannot contain spaces or slashes.
Any files with matching organism_folder
and dataset_folder
will be overwritten.
If you would like to italicize any part of these entries, you can use HTML tags: <i>italic</i>. These are automatically applied to organism names in the site.
[7]:
# Here are ALL possible issues with the imodulondb_table
all_table_issues.loc[all_table_issues.Table == 'iModulonDB']
[7]:
Table | Missing Column | Solution | |
---|---|---|---|
1 | iModulonDB | organism | The default, "New Organism", will be used. |
2 | iModulonDB | dataset | The default, "New Dataset", will be used. |
3 | iModulonDB | strain | The default, "Unspecified", will be used. |
4 | iModulonDB | publication_name | The default, "Unpublished Study", will be used. |
5 | iModulonDB | publication_link | The publication name will not be a hyperlink. |
6 | iModulonDB | gene_link_db | The default, "External Database", will be used. |
7 | iModulonDB | organism_folder | The default, "new_organism", will be used. |
8 | iModulonDB | dataset_folder | The default, "new_dataset", will be used. |
[8]:
# here are the issues relevant to ica_data.imodulondb_table
# in this case, all issues appear
table_issues.loc[table_issues.Table == 'iModulonDB']
[8]:
Table | Missing Column | Solution | |
---|---|---|---|
0 | iModulonDB | organism | The default, "New Organism", will be used. |
1 | iModulonDB | dataset | The default, "New Dataset", will be used. |
2 | iModulonDB | strain | The default, "Unspecified", will be used. |
3 | iModulonDB | publication_name | The default, "Unpublished Study", will be used. |
4 | iModulonDB | publication_link | The publication name will not be a hyperlink. |
5 | iModulonDB | gene_link_db | The default, "External Database", will be used. |
6 | iModulonDB | organism_folder | The default, "new_organism", will be used. |
7 | iModulonDB | dataset_folder | The default, "new_dataset", will be used. |
[9]:
# issues 0-7
# complete the imodulondb_table
ica_data.imodulondb_table = {
'organism': 'Escherichia coli',
'dataset': 'PRECISE 1',
'strain': 'K-12 MG1655 and BW25113',
'publication_name': 'Sastry, et al., 2019',
'publication_link': 'https://doi.org/10.1038/s41467-019-13483-w',
'gene_link_db': 'EcoCyc',
'organism_folder': 'e_coli',
'dataset_folder': 'precise1'
}
[10]:
# if we now re-call the function, the issues we fixed are removed.
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
table_issues.loc[table_issues.Table == 'iModulonDB']
[10]:
Table | Missing Column | Solution |
---|
The gene_table
Variable
The gene_table
contains details about each gene and shares an index with the M
and X
matrices. Its information is included in the following parts of the site:
Gene Tables on the iModulon Pages
Metadata box on the Gene Pages
Search results
name
,cog
, andgene_start
are shown when a gene is hovered over in the gene scatter plot (middle right of iModulon Pages).cog
is also used to color this scatter plot, andgene_start
defines the x axis.
Note that column names are case-insensitive.
Some additional considerations for each column are described below:
gene_name
: Try to use the most up to date names. Also, if a gene encodes a transcription factor (regulator) in the TRN, make sure that the names match (e.g. the transcription factor listed in the trn as ‘RpoS’ is encoded by the gene withgene_name
‘rpoS’). See the section ontf_issues.has_gene
for more information.gene_product
: This field is meant to be a concise but specific description of the function/product of the gene, such as ‘homoserine kinase’. You can use html tags in this field if superscripts or subscripts are desired.cog
: This stands for ‘cluster of orthologous groups’, and should be a larger category of genes such as ‘Carbohydrate transport and metabolism’. In addition to being displayed everywhere listed above, each COG is assigned a random color in the gene scatter plot, so it is especially useful to have them.gene_start
: This is not displayed anywhere, but it is used as the x axis value for each gene in the scatter plots on the iModulon Pages. If a few gene_starts are missing, those genes will have a value of 0 on the x axis. If allgene_start
s are not provided, then the order of thegene_table
will be used to assign integer values to each gene for the x axis of this plot.operon
: Preferably, this field will be the shortest string listing all genes, such as ‘artPIQM’, ‘argT-hisJQMP’, or ‘pdxB-usg-truA-dedA’. For very long operons with names that don’t share letters, you can list the first and last gene as in ‘yitB->yisZ’. iModulonDB is not picky about this column, however, so operons as provided by another database are acceptable.regulator
: Adding a TRN will automatically generate this column for you. It should contain a comma-separated list of all regulators that regulate this gene, e.g. ‘ppGpp,Lrp,Nac’.
If any of these columns are listed in your table_issues
, be sure to check the gene_table for the appropriate information under other column names. Rename those columns so that they match the names above.
[11]:
# here are all possible issues with the gene_table
all_table_issues.loc[all_table_issues.Table == 'Gene']
[11]:
Table | Missing Column | Solution | |
---|---|---|---|
9 | Gene | gene_name | Locus tags (gene_table.index) will be used. |
10 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
11 | Gene | cog | COG info will not display & the gene scatter plot will not have color. |
12 | Gene | start | The x axis of the scatter plot will be a numerical value instead of a genome location. |
13 | Gene | operon | Operon info will not display. |
14 | Gene | regulator | Regulator info will not display. If you have a TRN, add it to the model to auto-generate this column. |
[12]:
# here are the issues relevant to ica_data.gene_table
table_issues.loc[table_issues.Table == 'Gene']
[12]:
Table | Missing Column | Solution | |
---|---|---|---|
0 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
[13]:
# issue 0
# if gene_product information is available, add it to the gene_table.
The sample_table
Variable
The sample_table
describes each sample and shares its index with the columns of X
and A
. Its information is very important for generating the activity and expression bar graphs on the iModulon and gene pages. Familiarize yourself with the activity plot.
Three columns in the sample_table
define progressively smaller groupings of samples. It is CRITICAL that these columns exist so that the activity plots can be generated. Short, human readable, and specific names are preferred. Formatting should be consistent within your dataset, but is not necessarily constant throughout iModulonDB.
project
: This defines the largest grouping of samples, which share a common theme and/or were featured in the same paper. Examples include: ‘biofilm’, ‘acid’, and ‘crp_KO’. In the activity bar graph on the iModulon Pages, the vertical lines separate each project and the names are displayed across the bottom.condition
: This is the smallest grouping of samples, used for experimental conditions. Samples have matchingproject
andcondition
names if and only if they are biological replicates. Examples include: ‘biofilm_t0’, ‘biofilm_t10’, ‘wt_ctrl’, ‘del_crp’. In the activity bar graph on the iModulon Pages, each bar corresponds to a condition, and its height is the average activity value across all samples in the condition. Hovering over a bar provides additional details from the sample table, and zooming in switches the x labels to be conditions instead of projects.sample
: This must be unique for each sample. If it is not a column, it is assumed that it is the index of thesample_table
. Use a human readable sample name, preferably of the form ‘project__condition__#’, where ‘#’ indicates a replicate number. Examples: ‘carbon__wt_ctrl__1’, ‘carbon__wt_ctrl__2’, ‘carbon__del_crp__1’, ‘biofilm__biofilm_t0__1’. In the activity bar graph on the iModulon Pages, each dot floating near the bar is a sample. Also, the regulon scatter plots contain points corresponding to each sample; sample names will appear when points are hovered over in that plot as well.
Finally, two other columns are checked:
n_replicates
: For each condition, the samples all must have a matchingn_replicates
value that is used by the code. You can ignore this output, however, since a missingn_replicates
column will automatically be generated. If you would like to remove this issue from the table, you can run the commandgenerate_n_replicates_column(ica_data)
; otherwise, it will be automatically called on export.doi
: If you click on the bars of the activity plot and the first sample of the corresponding condition has a link in thedoi
column, then you will go to that link. This could be useful to learn more about a given sample. DOIs are preferred, but any link is allowed. PyModulon does not test the validity of any links, so be sure this is a valid link. See the section onmissing_dois
for more details on this.
Note that in addition to the columns that PyModulon searches for, all columns of the ``sample_table`` can be used to annotate or color the activity plots. The more informative the sample_table is, the more useful the activity plots will be. You can access all the columns in any iModulon or Gene Page by clicking the wrench symbol next to the word ‘Activity’. Checkboxes next to each column name indicate whether its value will be displayed on hovering over the sample, and the ink button allows
you to recolor the activity plot by the values of that feature. For example, ‘pH’ is not a required sample_table
column, but it could be useful to color or label each sample by that variable. Other useful but completely optional columns include:
Strain Description
Carbon Source (g/L)
Supplement
Temperature (C)
Time (min)
Growth Phase
[14]:
# here are all possible issues with the sample_table
all_table_issues.loc[all_table_issues.Table == 'Sample']
[14]:
Table | Missing Column | Solution | |
---|---|---|---|
15 | Sample | project | This is a CRITICAL column defining the largest grouping of samples. Vertical bars in the activity plot will separate projects. |
16 | Sample | condition | This is an CRITICAL column defining the smallest grouping of samples. Biological replicates must have matching projects and conditions, and they will appear as single bars with averaged activities. |
17 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
18 | Sample | n_replicates | This column will be generated for you. |
19 | Sample | doi | Clicking on activity plot bars will not link to relevant papers for the samples. |
[15]:
# here are the issues relevant to ica_data.sample_table
table_issues.loc[table_issues.Table == 'Sample']
[15]:
Table | Missing Column | Solution | |
---|---|---|---|
1 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
2 | Sample | n_replicates | This column will be generated for you. |
[16]:
# issue 1
# check to make sure that the sample_table.index contains good names
ica_data.sample_table.index
[16]:
Index(['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1',
'fur__wt_dpd__2', 'fur__wt_fe__1', 'fur__wt_fe__2',
'fur__delfur_dpd__1', 'fur__delfur_dpd__2', 'fur__delfur_fe2__1',
'fur__delfur_fe2__2',
...
'efeU__menFentC_ale29__1', 'efeU__menFentC_ale29__2',
'efeU__menFentC_ale30__1', 'efeU__menFentC_ale30__2',
'efeU__menFentCubiC_ale36__1', 'efeU__menFentCubiC_ale36__2',
'efeU__menFentCubiC_ale37__1', 'efeU__menFentCubiC_ale37__2',
'efeU__menFentCubiC_ale38__1', 'efeU__menFentCubiC_ale38__2'],
dtype='object', length=278)
[17]:
# the sample names above are good; we can now ignore issue 1
# issue 2
# optionally add the n_replicates column
generate_n_replicates_column(ica_data)
The imodulon_table
Variable
The imodulon_table
describes each iModulon and shares its index with the columns of M
and the index of A
. Its data is featured on iModulonDB in the following places:
Dataset Page, main section
iModulon tables on the Gene Pages (
regulator
,function
, andcategory
columns)Metadata box on the iModulon Pages
Search results
Some additional considerations for each column are described below:
name
: Usually, this issue can be ignored because theimodulon_table.index
always matches theimodulon_names
. If the names are all integers, then the word “iModulon” will be added (“0” –> “iModulon 0”).regulator
: This column is very important. Use regulator names that match the TRN. Join regulators using either ‘+’ to represent the intersection of regulons or ‘/’ to represent the union of regulons. Regulator links will be added in the iModulon Page metadata boxes according to thetf_links
variable. The content of this column affects the behavior of the gene table, gene histogram, regulon venn diagram, and regulon scatter plots. See the section ontf_issues
.function
: This is meant to be a specific description of each iModulon’s function, such as “Histidine biosynthesis”.category
: This column provides larger groupings of iModulons, such as “Amino Acid Biosynthesis”.n_genes
: The number of genes in the iModulon. This can be ignored since it will be computed for you.precision
: The overlap between the iModulon and its regulon divided by the size of the iModulon. See the tutorial on gene enrichment analysis.recall
: The overlap between the iModulon and its regulon divided by the size of the regulon. See the tutorial on gene enrichment analysis.exp_var
: The explained variance when this iModulon alone is used to reconstruct the originalX
matrix. Use decimal values; they will be converted to percentages in iModulonDB. These can be computed using theexplained_variance
function; see the tutorial on additional functions.
[18]:
# here are all possible issues with the imodulon_table
all_table_issues.loc[all_table_issues.Table == 'iModulon']
[18]:
Table | Missing Column | Solution | |
---|---|---|---|
20 | iModulon | name | imodulon_table.index will be used. |
21 | iModulon | regulator | The regulator details will be left blank. |
22 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
23 | iModulon | category | The categories will be filled in as "Uncharacterized". |
24 | iModulon | n_genes | This column will be computed for you. |
25 | iModulon | precision | This column will be left blank. |
26 | iModulon | recall | This column will be left blank. |
27 | iModulon | exp_var | This column will be left blank. |
[19]:
# here are the issues relevant to ica_data.imodulon_table
table_issues.loc[table_issues.Table == 'iModulon']
[19]:
Table | Missing Column | Solution | |
---|---|---|---|
3 | iModulon | name | imodulon_table.index will be used. |
4 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
5 | iModulon | exp_var | This column will be left blank. |
[20]:
# issue 3
# check that the imodulon_table.index contains good names
ica_data.imodulon_table.index
[20]:
Index(['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC', 'BW25113',
'Cbl+CysB', 'CdaR', 'CecR', 'Copper', 'CpxR', 'Cra', 'Crp-1', 'Crp-2',
'CsqR', 'CysB', 'DhaR/Mlc', 'EvgA', 'ExuR/FucR', 'FadR', 'FecI',
'FlhDC', 'FliA', 'Fnr', 'Fur-1', 'Fur-2', 'GadEWX', 'GadWX', 'GcvA',
'GlcC', 'GlpR', 'GntR/TyrR', 'His-tRNA', 'Leu/Ile', 'Lrp', 'MalT',
'MetJ', 'Nac', 'NagC/TyrR', 'NarL', 'NikR', 'NtrC+RpoN', 'OxyR', 'PrpR',
'PurR-1', 'PurR-2', 'PuuR', 'Pyruvate', 'RbsR', 'RcsAB', 'RpoH', 'RpoS',
'SoxS', 'SrlR+GutM', 'Thiamine', 'Tryptophan', 'XylR', 'YgbI', 'YiaJ',
'YieP', 'YneJ', 'Zinc', 'crp-KO', 'curli', 'deletion-1', 'deletion-2',
'duplication-1', 'e14-deletion', 'efeU-repair', 'entC-menF-KO',
'fimbriae', 'flu-yeeRS', 'fur-KO', 'gadWX-KO', 'insertion',
'iron-related', 'lipopolysaccharide', 'membrane', 'nitrate-related',
'proVWX', 'purR-KO', 'sgrT', 'thrA-KO', 'translation',
'uncharacterized-1', 'uncharacterized-2', 'uncharacterized-3',
'uncharacterized-4', 'uncharacterized-5', 'uncharacterized-6',
'ydcI-KO', 'yheO-KO'],
dtype='object')
[21]:
# the names are good; we can ignore issue 3
# issue 4
# write a function column if desired
# (part of iModulon characterization)
# issue 5
# compute the explained variance for each iModulon
from pymodulon.util import explained_variance
for k in ica_data.imodulon_table.index:
ica_data.imodulon_table.loc[k, 'exp_var'] = explained_variance(
ica_data, imodulons=k)
The tf_issues
Output
The next output is the tf_issues
dataframe. Each row corresponds to a regulator that is used in the imodulon_table
. Regulators that satisfy all requirements are omitted. False
values in this table represent potentially missing information.
The columns refer to the following:
in_trn
: whether the regulator is in the model.trn. Regulators not in the TRN will be ignored in the site’s histograms and gene tables. It is highly recommended that all regulators be in the TRN. Any ``False`` values in this column should be fixed by adding rows to thetrn
and/or ensuring that names all match up betweenimodulon_table.regulator
andtrn.regulator
.has_link
: whether the regulator has a link inica_data.tf_links
. If it does, its name will be clickable in the iModulon Page metadata box. If not, the regulator name will not appear as a hyperlink. Any regulators without dedicated pages in other databases should be ignored.has_gene
: whether the regulator can be matched to a gene in the model. This is used to generate the regulon scatter plot which appears at the bottom of the iModulon Page, comparing iModulon activity to the expression of the regulators. If your regulator is a gene, try to ensure that its name matches thegene_table.gene_name
. If your regulator doesn’t correspond to a gene (e.g. the small molecule regulator ppGpp), then aFalse
value in this column is acceptable.
Note that in our minimal example, the tf_issues
will be empty because there is no regulator column in the imodulon_table
.
[22]:
tf_issues
[22]:
in_trn | has_link | has_gene | |
---|---|---|---|
allR | True | False | True |
fucR | True | False | True |
araC | True | False | True |
arcA | True | False | True |
argR | True | False | True |
... | ... | ... | ... |
trpR | True | False | True |
xylR | True | False | True |
yiaJ | True | False | True |
zur | True | False | True |
zntR | True | False | True |
70 rows × 3 columns
The in_trn
Column
As described above, any False
values in this column should definitely be fixed by ensuring name match-ups between imodulon_table.regulator
and trn.regulator
columns, or by adding rows to trn
. True
values and omitted regulators indicate matchable regulators.
[23]:
# here is a list of missing regulators from the TRN
tf_issues.index[~tf_issues.in_trn.astype(bool)]
[23]:
Index([], dtype='object')
[24]:
# it is empty; no issues
The has_link
Column
This column encourages you to find links for each transcription factor and put them in the dictionary ica_data.tf_links
. Find a database for your regulators (e.g. RegulonDB), and either make a file full of links or programmatically generate links by finding patterns in the databases URLs.
In the example below, a file of regulator links has alreay been made.
[25]:
# here is a list of missing tf_links
tf_issues.index[~tf_issues.has_link.astype(bool)]
[25]:
Index(['allR', 'fucR', 'araC', 'arcA', 'argR', 'atoC', 'cbl', 'cysB', 'cdaR',
'cecR', 'cusR', 'hprR', 'cueR', 'cpxR', 'cra', 'crp', 'csqR', 'dhaR',
'mlc', 'evgA', 'exuR', 'fadR', 'fecI', 'flhD;flhC', 'fliA', 'fnr',
'fur', 'gadW', 'gadE', 'gadX', 'gcvA', 'glcC', 'glpR', 'tyrR', 'gntR',
'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'ilvY', 'lrp', 'malT', 'metJ',
'nac', 'nagC', 'narL', 'nikR', 'ntrC', 'rpoN', 'oxyR', 'prpR', 'purR',
'puuR', 'btsR', 'ypdB', 'pdhR', 'rbsR', 'rcsA;rcsB', 'rpoH', 'rpoS',
'soxS', 'gutM', 'srlR', 'TPP', 'L-tryptophan', 'trp-tRNA', 'trpR',
'xylR', 'yiaJ', 'zur', 'zntR'],
dtype='object')
[26]:
# read in a file of links
import pandas as pd
file_location = '../../src/pymodulon/data/imodulondb/e_coli_tf_links.csv'
tf_links = pd.read_csv(file_location, header = None, index_col = 0)
# convert to dictionary
tf_links = tf_links.to_dict()[1]
# add to ica_data
ica_data.tf_links = tf_links
# display a few for demonstration purposes
{k:tf_links[k] for k in list(tf_links.keys())[0:5]}
[26]:
{'glpR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120012730&organism=ECK12&format=jsp&type=regulon',
'dhaR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120015690&organism=ECK12&format=jsp&type=regulon',
'mlc': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011240&organism=ECK12&format=jsp&type=regulon',
'argR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011670&organism=ECK12&format=jsp&type=regulon',
'narL': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011502&organism=ECK12&format=jsp&type=regulon'}
[27]:
# let's re-check the tf_issues now
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
tf_issues.index[~tf_issues.has_link.astype(bool)]
[27]:
Index(['his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan', 'trp-tRNA'], dtype='object')
[28]:
# none of those regulators have pages on RegulonDB
# so they can be ignored
The has_gene
Column
As described above, any False
values in this column indicate an inability to match between imodulon_table.regulator
and gene_table.gene_name
. Some regulators will not have genes, so some False
values are acceptable.
In some cases, the regulator name will be a complex of several genes. PyModulon supports an additional input, tfcomplex_to_gene
, which is a dictionary mapping those regulators to their preferred gene. This variable will need to be passed to both imodulondb_compatibility
and imodulondb_export
.
[29]:
tf_issues.index[~tf_issues.has_gene.astype(bool)]
[29]:
Index(['cecR', 'flhD;flhC', 'glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'btsR',
'rcsA;rcsB', 'gutM', 'TPP', 'L-tryptophan', 'trp-tRNA'],
dtype='object')
[30]:
# three of these are name mismatches
# find the old names and replace them
ica_data.gene_table = ica_data.gene_table.replace({
'ybiH':'cecR',
'yehT':'btsR',
'srlM':'gutM'
})
# two are complexes
tfcomplex_to_gene = {
'flhD;flhC':'flhD',
'rcsA;rcsB':'rcsB'
}
[31]:
# let's re-check the tf_issues now
# use the tfcomplex_to_gene input
table_issues, tf_issues, missing_gene_links, missing_dois = \
imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)
tf_issues.index[~tf_issues.has_gene.astype(bool)]
[31]:
Index(['glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan',
'trp-tRNA'],
dtype='object')
[32]:
# the rest are pseudogene or non-gene regulators; ignore.
The missing_gene_links
Output
The next output is missing_gene_links
. Similar to tf_issues.has_link
, this output shows all the genes that are missing links in the gene_links
dictionary. The gene_links
dictionary is indexed by locus tag (gene_table.index
).
Gene links are featured on the Gene Pages in the metadata box. The text of the hyperlink is determined by imodulondb_table.gene_link_db
. Note that this does not have to be the same database as the one used for tf_links
.
As with the tf_links
, gene_links
can be filled out programmatically or using an external file. In the example below, we use an external file.
[33]:
missing_gene_links
[33]:
0 b0002
1 b0003
2 b0004
3 b0005
4 b0006
...
3918 b4688
3919 b4693
3920 b4696_1
3921 b4696_2
3922 b4705
Name: missing_gene_links, Length: 3923, dtype: object
[34]:
# read in a file of links
import pandas as pd
file_location = '../../src/pymodulon/data/imodulondb/e_coli_gene_links.csv'
gene_links = pd.read_csv(file_location, header = None, index_col = 0)
# convert to dictionary
gene_links = gene_links.to_dict()[1]
# add to ica_data
ica_data.gene_links = gene_links
# display a few for demonstration purposes
{k:gene_links[k] for k in list(gene_links.keys())[0:5]}
[34]:
{'b0002': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10998',
'b0003': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10999',
'b0004': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG11000',
'b0005': 'https://ecocyc.org/gene?orgid=ECOLI&id=G6081',
'b0006': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10011'}
[35]:
# let's re-check the tf_issues now
table_issues, tf_issues, missing_gene_links, missing_dois = \
imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)
missing_gene_links.values
[35]:
array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)
[36]:
# all the above are pseudogenes without available links
# they can be ignored
The missing_dois
Output
The final output is the missing_dois
output. This output searches for completed entries in sample_table.doi
, and lists the sample_table.index
associated with any missing entries.
It may be helpful to think of the DOIs as “sample_links”. Clicking on the bar of an activity plot on the iModulon Pages or an expression plot on the Gene Pages will take users to the DOI of the corresponding sample if it exists. If this project is part of the modulome, in which samples from many publications are pooled, it is especially useful to fill in all of the DOIs.
In the case where no doi
column exists in the sample_table
, all samples will be listed as missing their dois.
[37]:
missing_dois
[37]:
Index(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
'omics__wt_glu__1', 'omics__wt_glu__2', 'minspan__wt_glc__4',
'minspan__bw_delcra_glc__2', 'ica__wt_glc__1', 'ica__wt_glc__2',
'ica__wt_glc__3', 'ica__wt_glc__4', 'ica__arg_sbt__1',
'ica__arg_sbt__2', 'ica__cytd_rib__1', 'ica__cytd_rib__2',
'ica__gth__1', 'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
dtype='object', name='missing_DOIs')
[38]:
# a couple of the missing dois do exist
dois = {
'minspan__wt_glc__4': 'doi.org/10.15252/msb.20145243',
'minspan__bw_delcra_glc__2': 'doi.org/10.15252/msb.20145243',
'omics__wt_glu__1': 'doi.org/10.1038/ncomms13091',
'omics__wt_glu__2': 'doi.org/10.1038/ncomms13091'
}
# add them to the sample_table
for sample, doi in dois.items():
ica_data.sample_table.loc[sample, 'DOI'] = doi
Double-check Compatibility and Save
Once all the outputs have been checked, it is encouraged to re-run the imodulondb_compatibility
check to make sure that you are comfortable ignoring all missing information. If so, save your ica_data
object so that you don’t need to repeat any of these steps!
[39]:
table_issues, tf_issues, missing_gene_links, missing_dois = \
imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)
print('--Table Issues--')
display(table_issues)
print('--TF Issues--')
display(tf_issues)
print('--Missing Gene Links--')
display(missing_gene_links.values)
print('--Missing DOIs--')
display(missing_dois.values)
--Table Issues--
Table | Missing Column | Solution | |
---|---|---|---|
0 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
1 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
2 | iModulon | name | imodulon_table.index will be used. |
3 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
--TF Issues--
in_trn | has_link | has_gene | |
---|---|---|---|
glpR | True | True | False |
his-tRNA | True | False | False |
leu-tRNA | True | False | False |
ile-tRNA | True | False | False |
TPP | True | False | False |
L-tryptophan | True | False | False |
trp-tRNA | True | False | False |
--Missing Gene Links--
array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)
--Missing DOIs--
array(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
'ica__wt_glc__1', 'ica__wt_glc__2', 'ica__wt_glc__3',
'ica__wt_glc__4', 'ica__arg_sbt__1', 'ica__arg_sbt__2',
'ica__cytd_rib__1', 'ica__cytd_rib__2', 'ica__gth__1',
'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
dtype=object)
[40]:
# the above output is all acceptable
# save your results
from pymodulon.io import *
save_to_json(ica_data, 'e_coli_imdb.json')
The imodulondb_export
Function
After iModulonDB compatibility has been assured, it is time to use the imodulondb_export
function.
Arguments:
model
: theica_data
objectpath
: the path to the main folder of iModulonDB. This is where you cloned the iModulonDB GitHub repository, and where you should host your local server from. This function will create new folders and files inside this repository. The default is the current working directory.cat_order
: a list of each of theimodulon_table.category
s in order. When sorting by category in the Dataset Pages, the categories will appear in this order. Otherwise, they will appear in alphabetical order. Note that the default sorting is now byimodulon_table.exp_var
(or by iModulon name if no explained variance is provided), so this input is not very important.tfcomplex_to_gene
: a dictionary relating regulatory complexes in theimodulon_table.regulator
s to gene names ingene_table.gene_name
s. See the secion ontf_issues.has_gene
.
[41]:
cat_order = ['Carbon Source Utilization',
'Amino Acid and Nucleotide Biosynthesis',
'Energy Metabolism',
'Miscellaneous Metabolism',
'Structural Components',
'Metal Homeostasis',
'Stress Response',
'Regulator Discovery',
'Biological Enrichment',
'Genomic Alterations',
'Uncharacterized']
imodulondb_export(ica_data,
'../iModulonDB',
cat_order = cat_order,
tfcomplex_to_gene = tfcomplex_to_gene)
Writing main site files...
Done writing main site files. Writing plot files...
Two progress bars will appear below. The second will take significantly longer than the first.
Writing iModulon page files (1/2)
Writing Gene page files (2/2)
Complete! (Organism = e_coli; Dataset = precise1)
Now that the files have been exported, you can access them in your browser using the instructions at the beginning of this tutorial.
Adding a Link to the Splash Page
If you would like to be able to access your data from the splash page of your locally hosted site, you will have to edit the HTML code in index.html
, which is in the main folder.
Copy the contents of organisms/new_organism/new_dataset/html_for_splash.html
Open index.html using a text editor or IDE.
Find “<!– INSERT NEW DATASETS BELOW THIS LINE –>”, which should be near line 297
Paste the contents according to the comments in the file
If desired, increase the indentation of the pasted lines.
Now, if you reload localhost:8000, the new dataset will appear in the left hand side of the splash page.
API Reference
This page contains auto-generated API reference documentation 1.
pymodulon
Submodules
pymodulon.compare
|
Given two M matrices, returns the dot graph and name links of the various |
|
Given a set of links between M matrices, generates a dot graph of the various |
|
Reorganizes and renames genes in a dataframe to be consistent with |
|
Compares two M matrices between a single organism or across organisms and |
|
Makes protein files for all the genes in the genbank file |
|
Creates GenBank Databases from Protein FASTA of an organism |
|
Runs Bidirectional Best Hit BLAST to find orthologs utilizing two protein |
|
Computes gene lengths |
|
Runs BLASTP between two organisms |
|
Checks inputs are acceptable |
|
Checks that inputs are the same |
-
pymodulon.compare.
_get_orthologous_imodulons
(M1, M2, method, cutoff)[source] Given two M matrices, returns the dot graph and name links of the various connected ICA components
- Parameters
- Returns
links – Links and distances of connected iModulons
- Return type
-
pymodulon.compare.
_make_dot_graph
(links, show_all, names1, names2)[source] Given a set of links between M matrices, generates a dot graph of the various connected iModulons
- Parameters
- Returns
dot – Dot graph of connected iModulons
- Return type
Digraph
-
pymodulon.compare.
convert_gene_index
(df1, df2, ortho_file=None, keep_locus=False)[source] Reorganizes and renames genes in a dataframe to be consistent with another object/organism
- Parameters
- Returns
df1_new (~pandas.DataFrame) – M matrix for organism 1 with indexes translated into orthologs
df2_new (~pandas.DataFrame) – M matrix for organism 2 with indexes translated into orthologs
-
pymodulon.compare.
compare_ica
(M1, M2, ortho_file=None, cutoff=0.25, method='pearson', plot=True, show_all=False)[source] Compares two M matrices between a single organism or across organisms and returns the connected iModulons
- Parameters
M1 (DataFrame) – M matrix from the first organism
M2 (DataFrame) – M matrix from the second organism
ortho_file (str, optional) – Path to orthology file between organisms (default: None)
cutoff (float) – Cut off value for correlation metric (default: .25)
method (str or Callable) – Correlation metric to use from {‘pearson’, ‘kendall’, ‘spearman’} or callable (see
corr()
)plot (bool) – Create dot plot of matches (default: True)
show_all (bool) – Show all iModulons regardless of their linkage (default: False)
- Returns
matches (list) – Links and distances of connected iModulons
dot (Digraph) – Dot graph of connected iModulons
-
pymodulon.compare.
make_prots
(gbk, out_path, lt_key='locus_tag')[source] Makes protein files for all the genes in the genbank file
-
pymodulon.compare.
make_prot_db
(fasta_file, outname=None, combined='combined.fa')[source] Creates GenBank Databases from Protein FASTA of an organism
- Parameters
- Returns
None
- Return type
-
pymodulon.compare.
get_bbh
(db1, db2, outdir='bbh', outname=None, mincov=0.8, evalue=0.001, threads=1, force=False, savefiles=True)[source] Runs Bidirectional Best Hit BLAST to find orthologs utilizing two protein FASTA files. Outputs a CSV file of all orthologous genes.
- Parameters
db1 (str) – Path to protein FASTA file for organism 1
db2 (str) – Path to protein FASTA file for organism 2
outdir (str) – Path to output directory (default: “bbh”)
outname (str) – Name of output CSV file (default: <db1>_vs_<db2>_parsed.csv)
mincov (float) – Minimum coverage to call hits in BLAST, must be between 0 and 1 (default: 0.8)
evalue (float) – E-value threshold for BlAST hist (default: .001)
threads (int) – Number of threads to use for BLAST (default: 1)
force (bool) – If True, overwrite existing files (default: False)
savefiles (bool) – If True, save files to ‘outdir’ (default: True)
- Returns
out – Table of bi-directional BLAST hits between the two organisms
- Return type
pymodulon.core
Core functions for the IcaData object
Class representation of all iModulon-related data |
|
-
class
pymodulon.core.
IcaData
(M, A, X=None, log_tpm=None, gene_table=None, sample_table=None, imodulon_table=None, trn=None, dagostino_cutoff=None, optimize_cutoff=False, thresholds=None, threshold_method='dagostino', motif_info=None, imodulondb_table=None, gene_links=None, tf_links=None)[source] Bases:
object
Class representation of all iModulon-related data
-
view_imodulon
(self, imodulon)[source] View genes in an iModulon and show relevant information about each gene.
-
find_single_gene_imodulons
(self, save=False)[source] A simple function that returns the names of all likely single-gene iModulons. Checks if the largest iModulon gene weight is more than twice the weight of the second highest iModulon gene weight.
-
_update_imodulon_table
(self, enrichment)[source] Update iModulon table given new iModulon enrichments
-
compute_regulon_enrichment
(self, imodulon, regulator, save=False, evidence=None)[source] Compare an iModulon against a regulon. (Note: q-values cannot be computed for single enrichments)
- Parameters
regulator (str) – TF name, or complex regulon, where “/” uses genes in any regulon and “+” uses genes in all regulons
save (bool) – If true, save enrichment score to the imodulon_table (default: True)
evidence (list or str) – ‘Evidence’ level of TRN interactions to include during TRN enrichment (default: None)
- Returns
enrich – Table containing enrichment statistics
- Return type
-
compute_trn_enrichment
(self, imodulons=None, fdr=1e-05, max_regs=1, save=False, method='both', force=False, evidence=None)[source] Compare iModulons against all regulons in the TRN
- Parameters
imodulons (int or str) – Name of iModulon(s). If none given, compute enrichments for all ‘iModulons’ (default: None)
fdr (float) – False detection rate (default: 1e-5)
max_regs (int) – Maximum number of regulators to include in complex regulon ( default: 1)
save (bool) – Save regulons with highest enrichment scores to the imodulon_table (default: False)
method (str) – How to combine multiple regulators (default: ‘both’). ‘or’ computes enrichment against union of regulons, ‘and’ computes enrichment against intersection of regulons, and ‘both’ performs both tests
force (bool) – If false, prevents computation of >2 regulators (default: False)
evidence (list or str) – Evidence level of TRN interactions to include during TRN enrichment
- Returns
df_enriched – Table of statistically significant enrichments
- Return type
-
compute_annotation_enrichment
(self, annotation, column, imodulons=None, fdr=0.1)[source] Compare iModulons against a gene annotation table
- Parameters
annotation (DataFrame) – Table containing two columns: the gene locus tag, and its appropriate annotation
column (str) – Name of the column containing the annotation
imodulons (list or str or int) – Name of iModulon(s). If none given, compute enrichments for all iModulons (default: None)
fdr (float) – False detection rate (default: 0.1)
- Returns
DF_enriched – Table of statistically significant enrichments
- Return type
-
recompute_thresholds
(self, dagostino_cutoff)[source] Re-computes iModulon thresholds using a new D’Agostino cutoff
-
reoptimize_thresholds
(self, progress=True, plot=True)[source] Re-optimizes the D’Agostino statistic cutoff for defining iModulon thresholds if the TRN has been updated
-
_optimize_dagostino_cutoff
(self, progress, plot)[source] Computes an abridged version of the TRN enrichments for the 20 highest-weighted genes in order to determine a global minimum for the D’Agostino cutoff ultimately used to threshold and define the genes “in” an iModulon
-
compute_kmeans_thresholds
(self)[source] Computes iModulon thresholds using K-means clustering
- Returns
None
- Return type
-
copy
(self)[source] Make a deep copy of an IcaData object
- Returns
IcaData – Copy of IcaData object
- Return type
-
pymodulon.enrichment
Contains functions for gene set enrichment analysis
|
Creates contingency table for gene enrichment |
|
Computes enrichment statistic for gene_set in target_genes. |
|
Runs false detection correction for a table of statistics |
|
Converts a complex regulon (regulon_str) into a list of genes |
|
Computes enrichment statistics for a gene_set in a regulon |
|
Compare a gene set against an entire TRN |
|
Compare a gene set against a dataframe of gene annotations |
-
pymodulon.enrichment.
contingency
(set1, set2, all_genes)[source] Creates contingency table for gene enrichment
-
pymodulon.enrichment.
compute_enrichment
(gene_set, target_genes, all_genes, label=None)[source] Computes enrichment statistic for gene_set in target_genes.
- Parameters
- Returns
Table containing statistically significant enrichments
- Return type
pd.Series
-
pymodulon.enrichment.
FDR
(p_values, fdr, total=None)[source] Runs false detection correction for a table of statistics
-
pymodulon.enrichment.
parse_regulon_str
(regulon_str, trn)[source] Converts a complex regulon (regulon_str) into a list of genes
-
pymodulon.enrichment.
compute_regulon_enrichment
(gene_set, regulon_str, all_genes, trn)[source] Computes enrichment statistics for a gene_set in a regulon
- Parameters
- Returns
result – Table containing statistically significant enrichments
- Return type
-
pymodulon.enrichment.
compute_trn_enrichment
(gene_set, all_genes, trn, max_regs=1, fdr=0.01, method='both', force=False)[source] Compare a gene set against an entire TRN
- Parameters
gene_set (set) – Gene set for enrichment (e.g. genes in iModulon)
all_genes (set) – Set of all genes
trn (DataFrame) – Table containing transcriptional regulatory network
max_regs (int) – Maximum number of regulators to include in complex regulon (default: 1)
fdr (float) – False detection rate (default = .01)
method (str) – How to combine complex regulons. (default: ‘both’) “or” computes enrichment against union of regulons “and” computes enrichment against intersection of regulons “both” performs both tests
force (bool) – Allows computation of >2 regulators (default = False)
- Returns
Table containing statistically significant enrichments
- Return type
pymodulon.example_data
Pre-loaded example dataset for PyModulon tutorials.
Load Escherichia coli |
|
Load Staphylococcus aureus |
|
Load Bacillus subtilis |
|
Load an example bi-directional blast best hit (BBH) file |
|
Load an example expression dataset in units log-TPM |
pymodulon.gene_util
Utility functions for gene annotation
|
Get the full description for a COG category letter |
|
Helper function for parsing GFF annotations |
|
Converts GFF file(s) to a Pandas DataFrame |
|
|
|
Python wrapper for the uniprot ID mapping tool (See |
-
pymodulon.gene_util.
_get_attr
(attributes, attr_id, ignore=False)[source] Helper function for parsing GFF annotations
-
pymodulon.gene_util.
gff2pandas
(gff_file, feature='CDS', index=None)[source] Converts GFF file(s) to a Pandas DataFrame :param gff_file: Path(s) to GFF file :type gff_file: str or list :param feature: Name(s) of features to keep (default = “CDS”) :type feature: str or list :param index: Column or attribute to use as index :type index: str, optional
- Returns
df_gff – GFF formatted as a DataFrame
- Return type
-
pymodulon.gene_util.
uniprot_id_mapping
(prot_list, input_id='ACC+ID', output_id='P_REFSEQ_AC', input_name='input_id', output_name='output_id')[source] Python wrapper for the uniprot ID mapping tool (See https://www.uniprot.org/uploadlists/)
- Parameters
- Returns
mapping – Table containing two columns, one listing the inputs, and one listing the mapped outputs. Column names are defined by input_name and output_name.
- Return type
pymodulon.imodulondb
Functions for writing a directory for iModulonDB webpages
|
Checks for all issues and missing information prior to exporting to iModulonDB. |
|
Generates the iModulonDB page for the data and exports to the path. |
|
Converts the model’s imodulondb_table into dataset metadata |
|
Reformats the iModulon table according |
|
Generates the two versions of the gene presence file, one as a binary |
|
Generates all parts of the site that do not require large iteration loops |
|
Generates all files for all iModulons in data |
|
Generates all files for all iModulons in IcaData object |
|
Returns a list of relevant tfs from a string. Will ignore TFs not in the |
|
Creates the gene table dataframe for iModulonDB |
|
Helper function for imdb_gene_hist_df |
|
Creates a formatted string for the histogram legends. Helper function for |
|
Sorts TF strings for the legend of the histogram. Helper function for |
|
Creates the gene histogram for an iModulon |
|
Helper function to match genes to colors based on COG. Used by |
|
Generates a dataframe for the gene scatter plot in iModulonDB |
|
Generates the “n_replicates” column of the sample_table for iModulonDB. |
|
Generates a dataframe for the activity bar graph of iModulon k |
|
The Bacillus microarray dataset uses [] to create unusually complicated |
|
Finds the set of genes regulated by the boolean combination of regulators |
|
Generates a dataframe for the regulon venn diagram of iModulon k. Returns |
|
|
|
|
|
Adds links to the regulator string |
|
Adds links to the regulator string |
|
Generates a dataframe for the metadata of iModulon k |
|
|
|
|
|
Generates a dataframe for the iModulon table of gene g |
|
|
|
Generates all data for gene g, stores it in a subfolder of path_prefix |
-
pymodulon.imodulondb.
imodulondb_compatibility
(model, inplace=False, tfcomplex_to_gene=None)[source] Checks for all issues and missing information prior to exporting to iModulonDB. If inplace = True, modifies the model (not recommended for main model variables).
- Parameters
model (
IcaData
) – IcaData object to checkinplace (bool, optional) – If true, modifies the model to prepare for export. Not recommended for use with your main model variable.
tfcomplex_to_gene (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table (ex: {“FlhDC”:”flhD”})
- Returns
table_issues (pd.DataFrame) – Each row corresponds to an issue with one of the main class elements. Columns: * Table: which table or other variable the issue is in * Missing Column: the column of the Table with the issue (not case sensitive; capitalization is ignored). * Solution: Unless “CRITICAL” is in this cell, the site behavior if the issue remained is described here.
tf_issues (pd.DataFrame) – Each row corresponds to a regulator that is used in the imodulon_table. Columns: * in_trn: whether the regulator is in the model.trn. Regulators not in the TRN will be ignored in the site’s histograms and gene tables. * has_link: whether the regulator has a link in tf_links. If not, no link to external regulator databases will be shown. * has_gene: whether the regulator can be matched to a gene in the model. If this is false, then there will be no regulator scatter plot on the site. You can link TF complexes to one of their genes using the tfcomplex_to_gene input.
missing_g_links (pd.Series) – The genes on this list don’t have links in the gene_links. Their gene pages for these genes will not display links.
missing_DOIs (pd.Series) – The samples listed here don’t have DOIs in the sample_table. Clicking on their associated bars in the activity plots will not link to relevant papers.
-
pymodulon.imodulondb.
imodulondb_export
(model, path='.', cat_order=None, tfcomplex_to_gene=None, skip_iMs=False, skip_genes=False)[source] Generates the iModulonDB page for the data and exports to the path. If certain columns are unavailable but can be filled in automatically, they will be.
- Parameters
model (
IcaData
) – IcaData object to exportpath (str, optional) – Path to iModulonDB main hosting folder (default = “.”)
cat_order (list, optional) – List of categories in the imodulon_table, ordered as you would like them to appear in the dataset table (default = None)
tfcomplex_to_gene (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}
skip_iMs (bool, optional) – If this is True, do not output iModulon files (to save time)
skip_genes (bool, optional) – If this is True, do not output gene files (to save time)
- Returns
None
- Return type
-
pymodulon.imodulondb.
imdb_dataset_table
(model)[source] Converts the model’s imodulondb_table into dataset metadata for the gray box on the left side of the dataset page
-
pymodulon.imodulondb.
imdb_iM_table
(imodulon_table, cat_order=None)[source] Reformats the iModulon table according
-
pymodulon.imodulondb.
imdb_gene_presence
(model)[source] Generates the two versions of the gene presence file, one as a binary matrix, and one as a DataFrame
- Parameters
model (
IcaData
) – An IcaData object- Returns
mbin (~pandas.DataFrame) – Binarized M matrix
mbin_list (~pandas.DataFrame) – Table mapping genes to iModulons
-
pymodulon.imodulondb.
imodulondb_main_site_files
(model, path_prefix='.', rewrite_annotations=True, cat_order=None)[source] Generates all parts of the site that do not require large iteration loops
- Parameters
model (
IcaData
) – IcaData objectpath_prefix (str, optional) – Main folder for iModulonDB files (default = “.”)
rewrite_annotations (bool, optional) – Set to False if the gene_table and trn are unchanged (default = True)
cat_order (list, optional) – list of categories in data.imodulon_table.category, ordered as you want them to appear on the dataset page (default = None)
- Returns
main_folder – Dataset folder, for use as the path_prefix in imdb_generate_im_files()
- Return type
-
pymodulon.imodulondb.
imdb_generate_im_files
(model, path_prefix='.', gene_scatter_x='start', tfcomplex_to_gene=None)[source] Generates all files for all iModulons in data
- Parameters
model (
IcaData
) – IcaData objectpath_prefix (str, optional) – Dataset folder in which to store the files (default = “.”)
gene_scatter_x (str) – Column from the gene table that specificies what to use on the X-axis of the gene scatter plot (default = “start”)
tfcomplex_to_gene (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}
-
pymodulon.imodulondb.
imdb_generate_gene_files
(model, path_prefix='.')[source] Generates all files for all iModulons in IcaData object
-
pymodulon.imodulondb.
parse_tf_string
(model, tf_str, verbose=False)[source] Returns a list of relevant tfs from a string. Will ignore TFs not in the trn file. iModulonDB helper function.
-
pymodulon.imodulondb.
imdb_gene_table_df
(model, k)[source] Creates the gene table dataframe for iModulonDB :param model: IcaData object :type model:
IcaData
:param k: iModulon name :type k: int or str- Returns
res – DataFrame of the gene table that is compatible with iModulonDB
- Return type
-
pymodulon.imodulondb.
_component_DF
(model, k, tfs=None)[source] Helper function for imdb_gene_hist_df
-
pymodulon.imodulondb.
_tf_combo_string
(row)[source] Creates a formatted string for the histogram legends. Helper function for imdb_gene_hist_df.
-
pymodulon.imodulondb.
_sort_tf_strings
(tfs, unique_elts)[source] Sorts TF strings for the legend of the histogram. Helper function for imdb_gene_hist_df.
-
pymodulon.imodulondb.
imdb_gene_hist_df
(model, k, bins=20, tol=0.001)[source] Creates the gene histogram for an iModulon
- Parameters
- Returns
gene_hist_table – A dataframe for producing the histogram that is compatible with iModulonDB
- Return type
-
pymodulon.imodulondb.
_gene_color_dict
(model)[source] Helper function to match genes to colors based on COG. Used by imdb_gene_scatter_df.
-
pymodulon.imodulondb.
imdb_gene_scatter_df
(model, k, gene_scatter_x='start')[source] Generates a dataframe for the gene scatter plot in iModulonDB
-
pymodulon.imodulondb.
generate_n_replicates_column
(model)[source] Generates the “n_replicates” column of the sample_table for iModulonDB.
-
pymodulon.imodulondb.
imdb_activity_bar_df
(model, k)[source] Generates a dataframe for the activity bar graph of iModulon k
-
pymodulon.imodulondb.
_parse_regulon_string
(model, s)[source] The Bacillus microarray dataset uses [] to create unusually complicated TF strings. This function parses those, as a helper to _get_reg_genes for imdb_regulon_venn_df.
-
pymodulon.imodulondb.
_get_reg_genes
(model, tf)[source] Finds the set of genes regulated by the boolean combination of regulators in a TF string
-
pymodulon.imodulondb.
imdb_regulon_venn_df
(model, k)[source] Generates a dataframe for the regulon venn diagram of iModulon k. Returns None if there is no diagram to draw
-
pymodulon.imodulondb.
get_tfs_to_scatter
(model, tf_string, tfcomplex_to_genename=None, verbose=False)[source] - Parameters
- Returns
res – List of gene loci
- Return type
-
pymodulon.imodulondb.
imdb_regulon_scatter_df
(model, k, tfcomplex_to_genename=None)[source] - Parameters
- Returns
res – A dataframe for producing the regulon scatter plots in iModulonDB
- Return type
-
pymodulon.imodulondb.
tf_with_links_brackets
(model, tf_str)[source] Adds links to the regulator string Used with the complicated bracket system in Bacillus Microarray
-
pymodulon.imodulondb.
imdb_imodulon_basics_df
(model, k, reg_venn, reg_scatter)[source] Generates a dataframe for the metadata of iModulon k
- Parameters
- Returns
res – A dataframe of metadata for iModulon k in iModulonDB
- Return type
-
pymodulon.imodulondb.
make_im_directory
(model, k, path_prefix='.', gene_scatter_x='start', tfcomplex_to_genename=None)[source] - Parameters
model (
IcaData
) – IcaData objectpath_prefix (str, optional) – Path to the dataset folder. This function creates an ‘iModulon_files/k/’ subdirectory there to store everything. (default = “.”)
gene_scatter_x (str) – Passed to imdb_gene_scatter_df() to indicate the x axis type of that plot (default = “start”)
tfcomplex_to_genename (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}
- Returns
None
- Return type
-
pymodulon.imodulondb.
imdb_gene_im_table_df
(model, g, im_table, m_bin)[source] Generates a dataframe for the iModulon table of gene g
- Parameters
- Returns
perGene_table – A dataframe for the iModulon table of gene g in iModulonDB
- Return type
pymodulon.io
Functions for reading and writing data into files.
|
Save |
|
Load |
-
pymodulon.io.
save_to_json
(data, filename, compress=False)[source] Save
IcaData
object to a json file
pymodulon.motif
|
Get upstream sequences for a table of operons |
|
Finds motifs upstream of genes in an iModulon |
|
|
|
|
|
Compare a MEME motif against external motif databases |
-
pymodulon.motif.
_get_upstream_seqs
(ica_data, imodulon, seq_dict, upstream, downstream)[source] Get upstream sequences for a table of operons
- Parameters
ica_data (IcaData) – IcaData object
seq_dict (Dict) – Dictionary mapping accession numbers to Biopython SeqRecords
upstream (int) – Number of basepairs upstream from first gene in operon to include in motif search
downstream (int) – Number of basepairs upstream from first gene in operon to include in motif search
- Returns
pd.DataFrame – DataFrame containing operon information
List[SeqRecord] – List of SeqRecords containing upstream sequences
-
pymodulon.motif.
find_motifs
(ica_data, imodulon, fasta_file, outdir=None, palindrome=False, nmotifs=5, upstream=500, downstream=100, verbose=True, force=False, evt=0.001, cores=8, minw=6, maxw=40, minsites=None)[source] Finds motifs upstream of genes in an iModulon
- Parameters
ica_data (IcaData) – IcaData object
fasta_file (Union[os.PathLike, List[os.PathLike]]) – Path or list of paths to fasta file(s) for organism
outdir (os.PathLike) – Path to output directory
palindrome (bool) – If True, limit search to palindromic motifs (default: False)
nmotifs (int) – Number of motifs to search for (default: 5)
upstream (int) – Number of basepairs upstream from first gene in operon to include in motif search (default: 500)
downstream (int) – Number of basepairs upstream from first gene in operon to include in motif search (default: 100)
verbose (bool) – Show steps in verbose output (default: True)
force (bool) – Force execution of MEME even if output already exists (default: False)
evt (float) – E-value threshold (default: 0.001)
cores (int) – Number of cores to use (default: 8)
minw (int) – Minimum motif width in basepairs (default: 6)
maxw (int) – Maximum motif width in basepairs (default: 40)
minsites (Optional[int]) – Minimum number of sites required for a motif. Default is the number of operons divided by 3.
- Returns
# TODO
- Return type
add documentation of return
-
pymodulon.motif.
compare_motifs
(motif_info=None, motif_file=None, motif_db=None, outdir=None, force=False, verbose=True, evt=0.001)[source] Compare a MEME motif against external motif databases
- Parameters
motif_info (Optional[MotifInfo]) – MotifInfo object. Either ‘motif_info’ or ‘motif_file’ must be provided. ‘motif_info’ takes precedence over ‘motif_file’.
motif_file (Optional[os.PathLike]) – Txt file generated by MEME from a motif search.
motif_db (Optional[Union[List,str]]) – Name or path to MEME formatted databases
outdir (Optional[os.PathLike]) – Output directory for TOMTOM comparisons
force (bool) – Force execution of TOMTOM even if output already exists (default: False)
verbose (bool) – Show steps in verbose output (default: True)
evt (float) – E-value threshold (default: 0.001)
pymodulon.plotting
Plotting functions for iModulons
|
Creates an overlaid scatter and barplot for a set of values (either gene |
|
Creates a barplot showing an gene’s expression across the compendium |
|
Creates a barplot showing an iModulon’s activity across the compendium |
|
Creates a barplot for values in the sample table |
|
Plots a histogram of regulon vs non-regulon genes by iModulon weighting. |
|
Generates a scatter-plot of the data given, with options for coloring by |
|
Plot gene weights on a scatter plot. |
|
Create a scatterplot comparing the gene weights between two iModulons |
|
Create a scatterplot comparing the compendium-wide expression profiles of two genes |
|
Create a scatterplot comparing the compendium-wide activities of two iModulons |
|
Plots a DiMA plot between two projects or two sets of samples |
|
Plots the cumulative explained variance for independent components and, |
|
Compare the overlaps between iModulons and their linked regulons |
|
Cluster all iModulon activity profiles using hierarchical clustering and display |
|
Uses a decision tree regressor to automatically cluster iModulon activities |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Generates bins using optimal bin width estimate using the Freedman-Diaconis rule |
|
Implements experimental xaxis param from plot_gene_weights. This |
-
pymodulon.plotting.
barplot
(values, sample_table, ylabel='', projects=None, highlight=None, ax=None, legend_kwargs=None)[source] Creates an overlaid scatter and barplot for a set of values (either gene expression levels or iModulon activities)
- Parameters
values (Series) – List of values to plot
ylabel (str, optional) – Y-axis label
projects (list or str, optional) – Name(s) of projects to show (default: show all)
highlight (list or str, optional) – Project(s) to highlight (default: None)
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the barplot- Return type
-
pymodulon.plotting.
plot_expression
(ica_data, gene, projects=None, highlight=None, ax=None, legend_kwargs=None)[source] Creates a barplot showing an gene’s expression across the compendium
- Parameters
gene (str) – Gene locus tag or name
projects (str or list, optional) – Name(s) of projects to show (default: show all)
highlight (str or list, optional) – Name(s) of projects to highlight (default: None)
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the barplot- Return type
-
pymodulon.plotting.
plot_activities
(ica_data, imodulon, projects=None, highlight=None, ax=None, legend_kwargs=None)[source] Creates a barplot showing an iModulon’s activity across the compendium
- Parameters
projects (list or str, optional) – Name(s) of projects to show (default: show all)
highlight (str or list, optional) – Name(s) of projects to highlight (default: None)
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the barplot- Return type
-
pymodulon.plotting.
plot_metadata
(ica_data, column, projects=None, highlight=None, ax=None, legend_kwargs=None)[source] Creates a barplot for values in the sample table
- Parameters
column (str) – Column name to plot
projects (list or str, optional) – Name(s) of projects to show (default: show all)
highlight (str or list, optional) – Name(s) of projects to highlight (default: None)
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the barplot- Return type
-
pymodulon.plotting.
plot_regulon_histogram
(ica_data, imodulon, regulator=None, bins=None, kind='overlap', ax=None, hist_label=('Not regulated', 'Regulon Genes'), color=('#aaaaaa', 'salmon'), alpha=0.7, ax_font_kwargs=None, legend_kwargs=None)[source] Plots a histogram of regulon vs non-regulon genes by iModulon weighting.
- Parameters
regulator (str, optional) – Name of regulator to compare enrichment against. If no regulator is given, choose top enrichment from
imodulon_table
bins (int, str or list, optional) – The bins to use when generating the histogram. Passed on to
matplotlib.pyplot.hist()
kind ('overlap' or 'side', optional) – Whether to plot an overlapping or side-by-side comparison histogram ( default: ‘overlap’)
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
hist_label (tuple, optional) – The label to use when plotting the regulon and non-regulon genes. Takes into a tuple of 2 values (first for non-regulon genes, second for regulon genes). Passed on to
matplotlib.pyplot.hist()
color (list or str, optional) – The colors to use for regulon and non-regulon genes. Takes a Sequence of 2 values (first for non-regulon genes, second for regulon genes). Passed on to
matplotlib.pyplot.hist()
alpha (float, optional) – Sets the opacity of the histogram (0 = transparent, 1 = opaque). Passed on to
matplotlib.pyplot.hist()
ax_font_kwargs (dict, optional) – Additional keyword arguments for axes labels
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the histogram- Return type
-
pymodulon.plotting.
scatterplot
(x, y, groups=None, colors=None, show_labels='auto', adjust_labels=True, line45=False, line45_margin=0, fit_line=False, fit_metric='pearson', xlabel='', ylabel='', ax=None, legend=True, ax_font_kwargs=None, scatter_kwargs=None, label_font_kwargs=None, legend_kwargs=None)[source] Generates a scatter-plot of the data given, with options for coloring by group, adding labels, adding lines, and generating correlation or determination coefficients.
- Parameters
x (Series) – X-axis data
y (Series) – Y-axis data
groups (dict, optional) – A dictionary mapping samples to group names
colors (str, list or dict, optional) – Color of points, list of colors for different groups, or dictionary mapping groups to colors
show_labels (bool or 'auto') – Show labels for data points
adjust_labels (bool) – Auto-adjust labels for data points
line45 (bool) – Show 45-degree line of equal values
line45_margin (float) – Show 45-degreen lines offset by a margin
fit_line (bool) – Draw a line of best fit on the scatterplot
fit_metric ('pearson', 'spearman', 'r2', or None) – Metric to report in legend for line of best fit
xlabel (str) – X-axis label
ylabel (str) – Y-axis label
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
legend (bool) – Show legend
ax_font_kwargs (dict, optional) – Additional keyword arguments for axis labels
scatter_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.scatter()
label_font_kwargs (dict, optional) – Additional keyword arguments for labels passed to
matplotlib.pyplot.text()
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the scatterplot- Return type
-
pymodulon.plotting.
plot_gene_weights
(ica_data, imodulon, by='start', xaxis=None, xname='', **kwargs)[source] Plot gene weights on a scatter plot.
- Parameters
by ('log-tpm-norm', 'length', or 'start') – Property to plot on x-axis. Superceded by xaxis
xaxis (list, dict or Series, optional) – Values on custom x-axis
xname (str, optional) – Name of x-axis if using custom x-axis
**kwargs – Additional keyword arguments passed to
pymodulon.plotting.scatterplot()
- Returns
ax –
Axes
containing the scatterplot- Return type
-
pymodulon.plotting.
compare_gene_weights
(ica_data, imodulon1, imodulon2, ica_data2=None, ortho_file=None, use_org1_names=True, **kwargs)[source] Create a scatterplot comparing the gene weights between two iModulons
- Parameters
ica_data2 (IcaData, optional) –
IcaData
object for second iModulon (if comparing iModulons across objects)ortho_file (str, optional) – Path to orthology file between organisms
use_org1_names (bool) – If true, use gene names from first organism. If false, use gene names from second organism (default: True)
**kwargs – Additional keyword arguments passed to
pymodulon.plotting.scatterplot()
- Returns
ax –
Axes
containing the scatterplot- Return type
-
pymodulon.plotting.
compare_expression
(ica_data, gene1, gene2, **kwargs)[source] Create a scatterplot comparing the compendium-wide expression profiles of two genes
-
pymodulon.plotting.
compare_activities
(ica_data, imodulon1, imodulon2, **kwargs)[source] Create a scatterplot comparing the compendium-wide activities of two iModulons
- Parameters
- Returns
ax –
Axes
containing the scatterplot- Return type
-
pymodulon.plotting.
plot_dima
(ica_data, sample1, sample2, threshold=5, fdr=0.1, label=True, adjust=True, table=False, **kwargs)[source] Plots a DiMA plot between two projects or two sets of samples
- Parameters
sample1 (list or str) – List of sample IDs or name of “project:condition” for x-axis
sample2 (list or str) – List of sample IDs or name of “project:condition” for y-axis
threshold (float) – Minimum activity difference to determine DiMAs (default: 5)
fdr (float) – False detection rate (default: 0.1)
label (bool) – Label differentially activated iModulons (default: True)
adjust (bool) – Automatically adjust labels (default: True)
table (bool) – Return differential iModulon activity table (default: False)
**kwargs – Additional keyword arguments passed to
pymodulon.plotting.scatterplot()
- Returns
ax (~matplotlib.axes.Axes) –
Axes
containing the scatterplotdf_diff (~pandas.DataFrame, optional) – Table reporting differentially activated iModulons
-
pymodulon.plotting.
plot_explained_variance
(ica_data, pc=True, ax=None)[source] Plots the cumulative explained variance for independent components and, optionally, principal components
-
pymodulon.plotting.
compare_imodulons_vs_regulons
(ica_data, imodulons=None, cat_column=None, size_column=None, scale=1, reg_only=True, xlabel=None, ylabel=None, vline=0.6, hline=0.6, ax=None, scatter_kwargs=None, ax_font_kwargs=None, legend_kwargs=None)[source] Compare the overlaps between iModulons and their linked regulons
- Parameters
imodulons (list, optional) – List of iModulons to plot
cat_column (str, optional) – Column in the imodulon_table that stores the category of each iModulon
size_column (str, optional) – Column in the imodulon_table that stores the size of each iModulon
scale (float, optional (default: 1)) – Value used to scale the size of each point
reg_only (bool (default: True)) – Only plot iModulons with an entry in the regulator column of the imodulon_table
xlabel (str, optional) – Custom x-axis label (default: “# shared genes/Regulon size”)
ylabel (str, optional) – Custom y-axis label (default: “# shared genes/iModulon size”)
vline (float, optional (default: 0.6)) – Draw a dashed vertical line
hline (float, optional (default: 0.6)) – Draw a dashed horizontal line
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
scatter_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.scatter()
ax_font_kwargs (dict, optional) – Additional keyword arguments for axes labels
legend_kwargs (dict, optional) – Additional keyword arguments passed to
matplotlib.pyplot.legend()
- Returns
ax –
Axes
containing the line plot- Return type
-
pymodulon.plotting.
cluster_activities
(ica_data, correlation_method='spearman', distance_threshold=None, show_thresholding=False, show_clustermap=True, show_best_clusters=False, n_best_clusters='auto', cluster_names=None, return_clustermap=False, dimca_sample1=None, dimca_sample2=None, dimca_threshold=5, dimca_fdr=0.1, dimca_label=True, dimca_adjust=True, dimca_table=False, **dima_kwargs)[source] Cluster all iModulon activity profiles using hierarchical clustering and display the results using
seaborn.clustermap()
- Parameters
correlation_method ('pearson', 'spearman', 'kendall', 'mutual_info' or callable) – Method for computing correlations between iModulon activities. See
pandas.DataFrame.corr()
Default is ‘spearman’.distance_threshold (float, optional) – A distance from 0 to 1 to define flat clusters from the hierarchical clustering. Larger values yield fewer clusters. If None, automatic selection of optimal threshold will occur by maximizing the silhouette score across iModulons. (default: None)
show_thresholding (bool) – Show the plot of distance thresholds vs. silhouette scores (default: False)
show_clustermap (bool) – Show the clustermap (default: True)
show_best_clusters (bool) – Show individual clusters below complete clustermap
n_best_clusters (int or str) – Number of best clusters to show. If ‘auto’, only clusters with above-average silhouette scores are shown
cluster_names (dict, optional) – A dictionary mapping cluster indices to names, usually used after clustering has been performed previously.
return_clustermap (bool) – Return the axis containing the clustermap
dimca_sample1 (list or str) – List of sample IDs or name of “project:condition” of reference samples for Differential iModulon Cluster Analysis (DiMCA)
dimca_sample2 (list or str) – List of sample IDs or name of “project:condition” of target samples for DiMCA
dimca_threshold (float) – Minimum activity difference to determine DiMCAs
dimca_fdr (float) – False detection rate for DiMCA
dimca_label (bool) – Label differentially activated imodulon clusters (default: True)
dimca_adjust (bool) – Auto-adjust DiMCA cluster labels (default: True)
dimca_table (bool) – Return DiMCA table (default: False)
**dima_kwargs (dict, optional) – Additional keyword arguments passed to
pymodulon.plotting.dima()
- Returns
clusters (~sklearn.cluster.AgglomerativeClustering) –
sklearn.cluster.AgglomerativeClustering
of activity matrixcg (~seaborn.matrix.ClusterGrid, optional) –
ClusterGrid
containing the clusterplotdimca_ax (~matplotlib.axes.Axes, optional) –
Axes
containing the DiMCA scatterplotdimca_table (~pandas.DataFrame, optional) – Table of differentially activated iModulon clusters
-
pymodulon.plotting.
metadata_boxplot
(ica_data, imodulon, show_points=True, n_boxes=3, samples=None, strip_conc=True, ignore_cols=None, use_cols=None, return_results=False, ax=None, box_kwargs=None, strip_kwargs=None, swarm_kwargs=None)[source] Uses a decision tree regressor to automatically cluster iModulon activities using metadata. Displays results as a box plot.
- Parameters
show_points (bool or str (default: True)) – Overlay individual points on top of the boxplot. By default, this uses
seaborn.stripplot()
. show_points=’swarm’ will use :func`seaborn.swarmplot`.n_boxes (int) – Number of boxes to create
samples (list, optional) – Subset of samples to analyze
strip_conc (bool) – Remove concentrations from metadata (e.g. “glucose(2g/L)” would be interpreted as just “glucose”)
ignore_cols (list, optional) – List of columns to ignore. If None, only “project” and “condition” are ignored
use_cols (list, optional) – List of columns to use. This supercedes ignore_cols.
return_results (bool) – Return a dataframe describing the classifications
ax (Axes, optional) – Axes object to plot on, otherwise use current Axes
box_kwargs (dict, optional) – Additional keyword arguments passed to
seaborn.boxplot()
strip_kwargs (dict, optional) – Additional keyword arguments passed to
seaborn.stripplot()
swarm_kwargs (dict, optional) – Additional keyword arguments passed to
seaborn.swarmplot()
- Returns
ax (~matplotlib.axes.Axes) –
Axes
containing the boxplotdf_classes (~pd.DataFrame) – Metadata classifications of the samples
-
pymodulon.plotting.
_encode_metadata
(ica_data, samples, use_cols=None, ignore_cols=None, strip_conc=True)[source]
-
pymodulon.plotting.
_mod_freedman_diaconis
(ica_data, imodulon)[source] Generates bins using optimal bin width estimate using the Freedman-Diaconis rule
- Parameters
- Returns
bins – Numpy array of bins
- Return type
See also
-
pymodulon.plotting.
_set_xaxis
(xaxis, y)[source] Implements experimental xaxis param from plot_gene_weights. This allows for users to generate a scatterplot comparing gene weight on the y-axis with any collection of numbers on the x-axis (as long as the lengths match).
- Parameters
- Returns
x – Returns a pd.Series to be used as the x-axis data-points for generating the plot_gene_weights scatter-plot.
- Return type
pd.Series
pymodulon.util
General utility functions for the pymodulon package
|
|
|
|
|
|
|
Computes D’agostino-test-based threshold for a component of an M matrix |
|
Creates DIMA table of differentially expressed iModulons |
|
Parses sample inputs into a list of sample IDs |
|
Computes the fraction of variance explained by iModulons (from 0 to 1) |
|
Infer iModulon activities for external data |
|
|
|
Mutual information of x and y (conditioned on z if z is not None) |
|
The classic K-L k-nearest neighbor continuous entropy estimator |
|
Small noise to break degeneracy, see doc. |
|
|
|
|
|
This part finds number of neighbors in some radius in the marginal space |
|
|
|
-
pymodulon.util.
compute_threshold
(ic, dagostino_cutoff)[source] Computes D’agostino-test-based threshold for a component of an M matrix
-
pymodulon.util.
dima
(ica_data, sample1, sample2, threshold=5, fdr=0.1, alternate_A=None)[source] Creates DIMA table of differentially expressed iModulons
- Parameters
sample1 (str or list) – List of sample IDs or name of “project:condition”
sample2 (str or list) – List of sample IDs or name of “project:condition”
threshold (float) – Minimum activity difference to determine DiMAs (default = 5)
fdr (float) – False Detection Rate (default = .1)
alternate_A (DataFrame) – Alternate A to use (default = None)
- Returns
results – Table of differentially expressed iModulons
- Return type
DataFrame
-
pymodulon.util.
_parse_sample
(ica_data, sample)[source] Parses sample inputs into a list of sample IDs
-
pymodulon.util.
explained_variance
(ica_data, genes=None, samples=None, imodulons=None, reference=None)[source] Computes the fraction of variance explained by iModulons (from 0 to 1)
- Parameters
genes (str or list, optional) – List of genes to use (default: all genes)
samples (str or list, optional) – List of samples to use (default: all samples)
imodulons (int or str or list, optional) – List of iModulons to use (default: all iModulons)
reference (list, optional) – List of samples that represent the reference condition for the set. If none are provided, uses the dataset-specific reference condition.
- Returns
Fraction of variance explained by selected iModulons for selected genes/samples
- Return type
-
pymodulon.util.
infer_activities
(ica_data, data)[source] Infer iModulon activities for external data
-
pymodulon.util.
mi
(x, y, z=None, k=3, base=2, alpha=0)[source] Mutual information of x and y (conditioned on z if z is not None) x, y should be a list of vectors, e.g. x = [[1.3], [3.7], [5.1], [2.4]] if x is a one-dimensional scalar and we have four samples
-
pymodulon.util.
entropy
(x, k=3, base=2)[source] The classic K-L k-nearest neighbor continuous entropy estimator x should be a list of vectors, e.g. x = [[1.3], [3.7], [5.1], [2.4]] if x is a one-dimensional scalar and we have four samples
- 1
Created with sphinx-autoapi
Works Cited
The following is a list of all references cited throughout the documentation.
- PTS+20
Saugat Poudel, Hannah Tsunemoto, Yara Seif, Anand V. Sastry, Richard Szubin, Sibei Xu, Henrique Machado, Connor A. Olson, Amitesh Anand, Joe Pogliano, Victor Nizet, and Bernhard O. Palsson. Revealing 29 sets of independently modulated genes inStaphylococcus aureus, their regulators, and role in key physiological response. Proceedings of the National Academy of Sciences, 117(29):17228–17239, July 2020. URL: https://doi.org/10.1073/pnas.2008413117, doi:10.1073/pnas.2008413117.
- RDS+20
Kevin Rychel, Katherine Decker, Anand V Sastry, Patrick V Phaneuf, Saugat Poudel, and Bernhard O Palsson. iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning. Nucleic Acids Research, 49(D1):D112–D120, October 2020. URL: https://doi.org/10.1093/nar/gkaa810, doi:10.1093/nar/gkaa810.
- RSP20
Kevin Rychel, Anand V. Sastry, and Bernhard O. Palsson. Machine learning uncovers independently regulated modules in the bacillus subtilis transcriptome. Nature Communications, December 2020. URL: https://doi.org/10.1038/s41467-020-20153-9, doi:10.1038/s41467-020-20153-9.
- SGS+19
Anand V. Sastry, Ye Gao, Richard Szubin, Ying Hefner, Sibei Xu, Donghyuk Kim, Kumari Sonal Choudhary, Laurence Yang, Zachary A. King, and Bernhard O. Palsson. The escherichia coli transcriptome mostly consists of independently regulated modules. Nature Communications, December 2019. URL: https://doi.org/10.1038/s41467-019-13483-w, doi:10.1038/s41467-019-13483-w.