PyModulon: Analyzing iModulons in Python

PyPI - Python Version PyPI Conda installation Docker container Documentation Status Black code style MIT License

What is an iModulon?

Gene expression datasets are highly complex. Differential expression analysis is the most common analytical tool used to analyze these datasets, but often result in large sets of differentially genes, many of which have little to no functional annotation. Recent studies of large transcriptomic datasets have introduced Independent Component Analysis (ICA) as a scalable alternative that produces easily-interpretable results [SGS+19].

When applied to a gene expression dataset, ICA extracts independently modulated groups of genes, called iModulons. In bacteria, iModulons recapitulate known regulation with remarkable accuracy, and often encode complete metabolic pathways [cite].

In addition, ICA simultaneously computes iModulon activities, which represents the effect of the associated transcriptional regulator on gene expression.

For more information on iModulons, see the iModulonDB About Page.

What is PyModulon?

PyModulon contains modules for the analysis, visualization, and dissemination of iModulons. PyModulon is designed with Jupyter notebooks in mind. In addition, it enables users to easily create their own iModulonDB pages for any analyzed dataset.

Citation

The pymodulon manuscript is currently in preparation. For now, please cite [SGS+19]. If you use the custom iModulonDB pages, please cite [RDS+20].

Installation

With Docker

The easiest way to get started with PyModulon is using the Docker container.

  1. Install Docker

  2. Open terminal and navigate to your work folder

  3. Run the following commands to start a Jupyter Notebook server:

    docker run -p 8888:8888 -v "${PWD}":/home/jovyan/work sbrg/pymodulon
    
  4. Copy the URL from terminal to connect to the Jupyter notebook

  5. Navigate to the work folder, which has your current directory mounted.

  6. To close the notebook, press Ctrl+C in terminal. All changes made to files in your current directory are saved to your local machine.

With Pip

You can install PyModulon from PyPI using pip as follows:

python -m pip install pymodulon

With Conda

Alternatively, you can install using Conda:

conda install -c conda-forge pymodulon

We recommend installing through a conda environment:

conda create -n pymodulon -c conda-forge pymodulon
conda activate pymodulon

Optional Dependencies

Some features of PyModulon require additional dependencies. Follow the links below for installation instructions.

  1. GraphViz

  2. NCBI BLAST

  3. MEME Suite

This step is not necessary if you use the Docker container.

Example Workflow

We have provided an example workflow here illustrating how to compile and process all publicly available RNA-seq data for Bacillus subtilis, and subsequently compute and characterize its iModulons. This tutorial was designed to be easily applied to any other bacteria.

Introduction to the IcaData object

The pymodulon.core.IcaData object is at the core of the PyModulon package. This object holds all of the data related to the expression dataset, the iModulons, and their annotations.

[1]:
from pymodulon.core import IcaData
from pymodulon import example_data
from pymodulon.io import save_to_json, load_json_model

Minimum requirements

The IcaData object only requires two matrices, which are the results of performing Independent Component Analysis (ICA) on an expression dataset. For more information about ICA, see the iModulonDB about page

  • M: The iModulon matrix contains the Independent Components (ICs) themselves. Each column represents an IC, and each row contains the gene weights for each gene across each IC.

[2]:
M = example_data.M
M.head()
[2]:
AllR/AraC/FucR ArcA-1 ArcA-2 ArgR AtoC BW25113 Cbl+CysB CdaR CecR Copper ... thrA-KO translation uncharacterized-1 uncharacterized-2 uncharacterized-3 uncharacterized-4 uncharacterized-5 uncharacterized-6 ydcI-KO yheO-KO
b0002 -0.010888 -0.007717 -0.008502 -0.012186 -0.061489 -0.005599 -0.007377 -0.000795 0.004331 0.001845 ... 0.479209 0.035685 0.024778 -0.010660 -0.002123 -0.004416 -0.005428 -0.009219 -0.004345 -0.007838
b0003 -0.011467 0.003042 0.011448 -0.003685 -0.006106 0.006680 -0.043512 0.005107 0.000474 0.007650 ... 0.011420 0.040811 0.003324 -0.008424 -0.004415 -0.016126 -0.016476 -0.003497 -0.003583 0.003381
b0004 -0.008693 0.003944 0.012347 -0.008104 0.000585 0.003245 -0.041283 0.006390 0.004260 0.007109 ... 0.011339 0.036244 0.003710 -0.005212 0.000700 -0.011096 -0.006140 -0.003155 -0.008418 0.000129
b0005 0.006565 -0.001099 0.009415 -0.008507 0.005399 0.014748 -0.009249 -0.003058 -0.012649 -0.002370 ... -0.015324 0.028972 0.023969 0.000150 0.018497 0.009428 0.001255 -0.006890 -0.028069 0.021534
b0006 -0.006011 0.009889 -0.005555 -0.000152 -0.002454 0.009678 -0.003456 0.002160 -0.001924 -0.000628 ... -0.005661 0.000700 -0.002538 -0.006103 -0.002506 -0.005077 -0.004616 -0.003585 0.001607 0.001285

5 rows × 92 columns

  • A: The Activity matrix contains the condition-specific activities. Each column represents a sample, and each row contains the activity of each iModulon across all samples.

[3]:
A = example_data.A
A.head()
[3]:
control__wt_glc__1 control__wt_glc__2 fur__wt_dpd__1 fur__wt_dpd__2 fur__wt_fe__1 fur__wt_fe__2 fur__delfur_dpd__1 fur__delfur_dpd__2 fur__delfur_fe2__1 fur__delfur_fe2__2 ... efeU__menFentC_ale29__1 efeU__menFentC_ale29__2 efeU__menFentC_ale30__1 efeU__menFentC_ale30__2 efeU__menFentCubiC_ale36__1 efeU__menFentCubiC_ale36__2 efeU__menFentCubiC_ale37__1 efeU__menFentCubiC_ale37__2 efeU__menFentCubiC_ale38__1 efeU__menFentCubiC_ale38__2
AllR/AraC/FucR 0.378690 -0.378690 2.457678 2.248678 -0.327344 -0.259164 1.777251 2.690655 0.656937 0.319583 ... 1.041336 2.203940 3.698292 0.856998 1.557323 0.337806 0.943742 1.736640 0.499461 1.581476
ArcA-1 -0.440210 0.440210 -5.367360 -5.684301 0.131174 0.348843 -4.436389 -4.770469 -1.799113 -1.474222 ... -6.471714 -6.549861 -3.109145 -2.716183 -2.531192 -1.461022 -0.408849 -0.210397 -5.700321 -6.237836
ArcA-2 0.762258 -0.762258 2.619623 2.900696 3.120724 2.743634 1.989803 1.555835 1.782500 1.530811 ... 2.789653 3.959650 1.585147 0.811182 0.300414 2.537535 1.061408 2.634524 0.125513 1.178747
ArgR -0.289630 0.289630 -10.085719 -13.187916 2.371129 1.861918 -8.708701 -7.881588 -1.237027 -1.235604 ... -11.263744 -10.366813 -0.289217 0.389228 -5.142768 -5.014526 -3.648777 -4.125952 -4.286326 -5.475940
AtoC 0.250770 -0.250770 1.844767 2.055052 0.299345 0.425502 1.801217 1.790987 0.921254 1.410026 ... 3.821909 3.306573 2.652394 1.910173 0.927772 1.327549 1.846321 0.909667 2.064662 2.371405

5 rows × 278 columns

To create the IcaData object, the M and A datasets can be entered as either filenames or as a Pandas DataFrame

[4]:
ica_data = IcaData(M,A)
ica_data
[4]:
<pymodulon.core.IcaData at 0x7fe93df4d990>

Once loaded, the M and A matrices can be accessed directly from the object

[5]:
ica_data.M.head()
[5]:
AllR/AraC/FucR ArcA-1 ArcA-2 ArgR AtoC BW25113 Cbl+CysB CdaR CecR Copper ... thrA-KO translation uncharacterized-1 uncharacterized-2 uncharacterized-3 uncharacterized-4 uncharacterized-5 uncharacterized-6 ydcI-KO yheO-KO
b0002 -0.010888 -0.007717 -0.008502 -0.012186 -0.061489 -0.005599 -0.007377 -0.000795 0.004331 0.001845 ... 0.479209 0.035685 0.024778 -0.010660 -0.002123 -0.004416 -0.005428 -0.009219 -0.004345 -0.007838
b0003 -0.011467 0.003042 0.011448 -0.003685 -0.006106 0.006680 -0.043512 0.005107 0.000474 0.007650 ... 0.011420 0.040811 0.003324 -0.008424 -0.004415 -0.016126 -0.016476 -0.003497 -0.003583 0.003381
b0004 -0.008693 0.003944 0.012347 -0.008104 0.000585 0.003245 -0.041283 0.006390 0.004260 0.007109 ... 0.011339 0.036244 0.003710 -0.005212 0.000700 -0.011096 -0.006140 -0.003155 -0.008418 0.000129
b0005 0.006565 -0.001099 0.009415 -0.008507 0.005399 0.014748 -0.009249 -0.003058 -0.012649 -0.002370 ... -0.015324 0.028972 0.023969 0.000150 0.018497 0.009428 0.001255 -0.006890 -0.028069 0.021534
b0006 -0.006011 0.009889 -0.005555 -0.000152 -0.002454 0.009678 -0.003456 0.002160 -0.001924 -0.000628 ... -0.005661 0.000700 -0.002538 -0.006103 -0.002506 -0.005077 -0.004616 -0.003585 0.001607 0.001285

5 rows × 92 columns

[6]:
ica_data.A.head()
[6]:
control__wt_glc__1 control__wt_glc__2 fur__wt_dpd__1 fur__wt_dpd__2 fur__wt_fe__1 fur__wt_fe__2 fur__delfur_dpd__1 fur__delfur_dpd__2 fur__delfur_fe2__1 fur__delfur_fe2__2 ... efeU__menFentC_ale29__1 efeU__menFentC_ale29__2 efeU__menFentC_ale30__1 efeU__menFentC_ale30__2 efeU__menFentCubiC_ale36__1 efeU__menFentCubiC_ale36__2 efeU__menFentCubiC_ale37__1 efeU__menFentCubiC_ale37__2 efeU__menFentCubiC_ale38__1 efeU__menFentCubiC_ale38__2
AllR/AraC/FucR 0.378690 -0.378690 2.457678 2.248678 -0.327344 -0.259164 1.777251 2.690655 0.656937 0.319583 ... 1.041336 2.203940 3.698292 0.856998 1.557323 0.337806 0.943742 1.736640 0.499461 1.581476
ArcA-1 -0.440210 0.440210 -5.367360 -5.684301 0.131174 0.348843 -4.436389 -4.770469 -1.799113 -1.474222 ... -6.471714 -6.549861 -3.109145 -2.716183 -2.531192 -1.461022 -0.408849 -0.210397 -5.700321 -6.237836
ArcA-2 0.762258 -0.762258 2.619623 2.900696 3.120724 2.743634 1.989803 1.555835 1.782500 1.530811 ... 2.789653 3.959650 1.585147 0.811182 0.300414 2.537535 1.061408 2.634524 0.125513 1.178747
ArgR -0.289630 0.289630 -10.085719 -13.187916 2.371129 1.861918 -8.708701 -7.881588 -1.237027 -1.235604 ... -11.263744 -10.366813 -0.289217 0.389228 -5.142768 -5.014526 -3.648777 -4.125952 -4.286326 -5.475940
AtoC 0.250770 -0.250770 1.844767 2.055052 0.299345 0.425502 1.801217 1.790987 0.921254 1.410026 ... 3.821909 3.306573 2.652394 1.910173 0.927772 1.327549 1.846321 0.909667 2.064662 2.371405

5 rows × 278 columns

If the M and A datasets have row or column names, these will be saved as the sample/gene/iModulon names. Since genes are often re-named when characterized, the locus tag is the preferred identifier.

[7]:
print('Gene names:',ica_data.gene_names[:5])
print('Sample names:',ica_data.sample_names[:5])
print('iModulon names:',ica_data.imodulon_names[:5])
Gene names: ['b0002', 'b0003', 'b0004', 'b0005', 'b0006']
Sample names: ['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1', 'fur__wt_dpd__2', 'fur__wt_fe__1']
iModulon names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']

Adding the Expression Matrix

The X matrix contains eXpression data and is primarily used for plotting functions. The column names of the X matrix are the sample names, and the row names are the gene identifiers.

[8]:
X = example_data.X
X.head()
[8]:
control__wt_glc__1 control__wt_glc__2 fur__wt_dpd__1 fur__wt_dpd__2 fur__wt_fe__1 fur__wt_fe__2 fur__delfur_dpd__1 fur__delfur_dpd__2 fur__delfur_fe2__1 fur__delfur_fe2__2 ... efeU__menFentC_ale29__1 efeU__menFentC_ale29__2 efeU__menFentC_ale30__1 efeU__menFentC_ale30__2 efeU__menFentCubiC_ale36__1 efeU__menFentCubiC_ale36__2 efeU__menFentCubiC_ale37__1 efeU__menFentCubiC_ale37__2 efeU__menFentCubiC_ale38__1 efeU__menFentCubiC_ale38__2
b0002 -0.061772 0.061772 0.636527 0.819793 -0.003615 -0.289353 -1.092023 -0.777289 0.161343 0.145641 ... -0.797097 -0.791859 0.080114 0.102154 0.608180 0.657673 0.813105 0.854813 0.427986 0.484338
b0003 -0.053742 0.053742 0.954439 1.334385 0.307588 0.128414 -0.872563 -0.277893 0.428542 0.391761 ... -0.309105 -0.352535 -0.155074 -0.077145 0.447030 0.439881 0.554528 0.569030 0.154905 0.294799
b0004 -0.065095 0.065095 -0.202697 0.119195 -0.264995 -0.546017 -1.918349 -1.577736 -0.474815 -0.495312 ... -0.184898 -0.225615 0.019575 0.063986 0.483343 0.452754 0.524828 0.581878 0.293239 0.341040
b0005 0.028802 -0.028802 -0.865171 -0.951179 0.428769 0.123564 -1.660351 -1.531147 0.240353 -0.151132 ... -0.308221 -0.581714 0.018820 0.004040 -1.228763 -1.451750 -0.839203 -0.529349 -0.413336 -0.478682
b0006 0.009087 -0.009087 -0.131039 -0.124079 -0.144870 -0.090152 -0.219917 -0.046648 -0.044537 -0.089204 ... 1.464603 1.415706 1.230831 1.165153 0.447447 0.458852 0.421417 0.408077 1.151066 1.198529

5 rows × 278 columns

[9]:
ica_data.X = X

Adding annotation tables

You may load in additional data tables with information about your samples, genes, or iModulons.

These tables are originally empty, but can be altered like any Pandas DataFrame.

[10]:
ica_data.gene_table.head()
[10]:
b0002
b0003
b0004
b0005
b0006

Annotation tables contain one sample/gene/iModulon per row, and information about the respective item in columns. For example, a gene_table may include the gene function, genomic position, or Cluster of Orthologous Groups (COG) Category. See the Creating the Gene Table tutorial for a step-by-step example on how to contruct this table. Gene names must match the gene names in the M matrix.

[11]:
gene_table = example_data.gene_table
gene_table.head()
[11]:
start end strand gene_name length operon COG accession
b0001 189 255 + thrL 66 thrLABC No COG Annotation NC_000913.3
b0002 336 2799 + thrA 2463 thrLABC Amino acid transport and metabolism NC_000913.3
b0003 2800 3733 + thrB 933 thrLABC Amino acid transport and metabolism NC_000913.3
b0004 3733 5020 + thrC 1287 thrLABC Amino acid transport and metabolism NC_000913.3
b0005 5233 5530 + yaaX 297 yaaX Function unknown NC_000913.3

The sample_table contains detailed experimental metadata about each sample. This must be manually created, and can contain information related to the strains or experimental conditions used in the study.

[12]:
sample_table = example_data.sample_table
sample_table.head()
[12]:
Study project condition Replicate # Strain Description Strain Base Media Carbon Source (g/L) Nitrogen Source (g/L) Electron Acceptor ... Growth Rate (1/hr) Evolved Sample Isolate Type Sequencing Machine ALEdb sample Additional Details Biological Replicates Alignment DOI GEO
Sample ID
control__wt_glc__1 Control control wt_glc 1 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... NaN No NaN MiSeq NaN NaN 2 94.33 doi.org/10.1101/080929 GSE65643
control__wt_glc__2 Control control wt_glc 2 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... NaN No NaN MiSeq NaN NaN 2 94.24 doi.org/10.1101/080929 GSE65643
fur__wt_dpd__1 Fur fur wt_dpd 1 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... 0.00 No NaN MiSeq NaN NaN 2 98.04 doi.org/10.1038/ncomms5910 GSE54900
fur__wt_dpd__2 Fur fur wt_dpd 2 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... 0.00 No NaN MiSeq NaN NaN 2 98.30 doi.org/10.1038/ncomms5910 GSE54900
fur__wt_fe__1 Fur fur wt_fe 1 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... 1.06 No NaN MiSeq NaN NaN 2 93.35 doi.org/10.1038/ncomms5910 GSE54900

5 rows × 26 columns

The project and condition columns in the sample_table will be useful for the plotting functions described in the Plotting Functions tutorial.

The imodulon_table contains information about each iModulon, such as regulator enrichments or iModulon size.

[13]:
imodulon_table = example_data.imodulon_table
imodulon_table.head()
[13]:
regulator f1score pvalue precision recall TP n_genes n_tf Category threshold
name
AllR/AraC/FucR allR/araC/fucR 0.750000 1.190000e-41 1.000000 0.600000 18.0 18 3 Carbon Source Utilization 0.086996
ArcA-1 arcA 0.130952 6.420000e-20 0.660000 0.072687 33.0 50 1 Energy Metabolism 0.058051
ArcA-2 arcA 0.087683 1.150000e-16 0.840000 0.046256 21.0 25 1 Energy Metabolism 0.081113
ArgR argR 0.177778 6.030000e-18 0.923077 0.098361 12.0 13 1 Amino Acid and Nucleotide Biosynthesis 0.080441
AtoC atoC 0.800000 1.520000e-12 0.666667 1.000000 4.0 6 1 Miscellaneous Metabolism 0.105756

The tables can be loaded into the IcaData object as either filenames or as a Pandas DataFrame

[14]:
ica_data.gene_table = gene_table
ica_data.sample_table = sample_table
ica_data.imodulon_table = imodulon_table
[15]:
ica_data.sample_table.head()
[15]:
Study project condition Replicate # Strain Description Strain Base Media Carbon Source (g/L) Nitrogen Source (g/L) Electron Acceptor ... Growth Rate (1/hr) Evolved Sample Isolate Type Sequencing Machine ALEdb sample Additional Details Biological Replicates Alignment DOI GEO
control__wt_glc__1 Control control wt_glc 1 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... NaN No NaN MiSeq NaN NaN 2 94.33 doi.org/10.1101/080929 GSE65643
control__wt_glc__2 Control control wt_glc 2 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... NaN No NaN MiSeq NaN NaN 2 94.24 doi.org/10.1101/080929 GSE65643
fur__wt_dpd__1 Fur fur wt_dpd 1 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... 0.00 No NaN MiSeq NaN NaN 2 98.04 doi.org/10.1038/ncomms5910 GSE54900
fur__wt_dpd__2 Fur fur wt_dpd 2 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... 0.00 No NaN MiSeq NaN NaN 2 98.30 doi.org/10.1038/ncomms5910 GSE54900
fur__wt_fe__1 Fur fur wt_fe 1 Escherichia coli K-12 MG1655 MG1655 M9 glucose(2) NH4Cl(1) O2 ... 1.06 No NaN MiSeq NaN NaN 2 93.35 doi.org/10.1038/ncomms5910 GSE54900

5 rows × 26 columns

[16]:
ica_data.gene_table.head()
[16]:
start end strand gene_name length operon COG accession
b0002 336 2799 + thrA 2463 thrLABC Amino acid transport and metabolism NC_000913.3
b0003 2800 3733 + thrB 933 thrLABC Amino acid transport and metabolism NC_000913.3
b0004 3733 5020 + thrC 1287 thrLABC Amino acid transport and metabolism NC_000913.3
b0005 5233 5530 + yaaX 297 yaaX Function unknown NC_000913.3
b0006 5682 6459 - yaaA 777 yaaA Function unknown NC_000913.3
[17]:
ica_data.imodulon_table.head()
[17]:
regulator f1score pvalue precision recall TP n_genes n_tf Category threshold
AllR/AraC/FucR allR/araC/fucR 0.750000 1.190000e-41 1.000000 0.600000 18.0 18 3 Carbon Source Utilization 0.086996
ArcA-1 arcA 0.130952 6.420000e-20 0.660000 0.072687 33.0 50 1 Energy Metabolism 0.058051
ArcA-2 arcA 0.087683 1.150000e-16 0.840000 0.046256 21.0 25 1 Energy Metabolism 0.081113
ArgR argR 0.177778 6.030000e-18 0.923077 0.098361 12.0 13 1 Amino Acid and Nucleotide Biosynthesis 0.080441
AtoC atoC 0.800000 1.520000e-12 0.666667 1.000000 4.0 6 1 Miscellaneous Metabolism 0.105756

Converting between gene names and locus tags

If the gene_table contains a gene_name columns, the name2num and num2name methods can convert between locus tags and gene names.

[18]:
ica_data.num2name('b0002')
[18]:
'thrA'
[19]:
ica_data.name2num('thrA')
[19]:
'b0002'

Adding the TRN

Adding the transcriptional regulatory network (TRN) to the IcaData object enables automated calculation of regulon enrichments. Each row of the TRN file represents a regulatory interaction. The TRN must contain the following columns:

  • regulator: Name of the regulator (/ or + characters will be converted to ;)

  • gene_id: Locus tag of the target gene (must be in ica_data.gene_names)

The following columns are optional, but are helpful to have:

  • regulator_id - Locus tag of regulator

  • gene_name - Name of gene (can automatically update this using name2num)

  • direction - Direction of regulation (+ for activation, - for repression, ? or NaN for unknown)

  • evidence - Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)

  • PMID - Reference for regulatory interaction

[20]:
trn = example_data.trn
trn.head()
[20]:
regulator gene_id effect
0 FMN b3041 -
1 L-tryptophan b3708 +
2 L-tryptophan b3709 +
3 TPP b0066 -
4 TPP b0067 -

Again, this table can be passed in as either a filename or a Pandas DataFrame.

[21]:
ica_data.trn = trn
ica_data.trn.head()
[21]:
regulator gene_id effect
0 FMN b3041 -
1 L-tryptophan b3708 +
2 L-tryptophan b3709 +
3 TPP b0066 -
4 TPP b0067 -

Inspecting iModulons

view_imodulon shows the information about each gene in the iModulon. Most information is retrieved from the gene_table, but the regulator column comes from the trn.

[22]:
ica_data.view_imodulon('GlpR')
[22]:
gene_weight start end strand gene_name length operon COG accession regulator
b2239 0.211384 2349934 2351011 - glpQ 1077 glpTQ Energy production and conversion NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2240 0.306134 2351015 2352374 - glpT 1359 glpTQ Carbohydrate transport and metabolism NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2241 0.375662 2352646 2354275 + glpA 1629 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2242 0.328961 2354264 2355524 + glpB 1260 glpABC Amino acid transport and metabolism NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2243 0.315752 2355520 2356711 + glpC 1191 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b3426 0.350034 3562012 3563518 + glpD 1506 glpD Energy production and conversion NC_000913.3 arcA,crp,glpR,rpoD,yieP
b3926 0.290235 4115713 4117222 - glpK 1509 glpFKX Energy production and conversion NC_000913.3 crp,glpR,rpoD
b3927 0.312307 4117244 4118090 - glpF 846 glpFKX Carbohydrate transport and metabolism NC_000913.3 crp,glpR,rpoD

Searching for genes in iModulons

To find which iModulons contain a specific gene, use the imodulons_with method.

[23]:
ica_data.imodulons_with('b2239')
[23]:
['GlpR']

If the gene_table contains a gene_name columns, this function will work with either the locus tag or the gene name.

[24]:
ica_data.imodulons_with('carA')
[24]:
['PurR-2']

Renaming iModulons

Individual iModulons can be renamed using the rename_imodulons method

[25]:
print('Original iModulon Names:', ica_data.imodulon_names[:5])
ica_data.rename_imodulons({'AllR/AraC/FucR':'AllR'})
print('Renamed iModulon Names:', ica_data.imodulon_names[:5])
Original iModulon Names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Renamed iModulon Names: ['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']

These changes are reflected throughout the IcaData object.

[26]:
print('M matrix columns:', ica_data.M.columns[:5])
M matrix columns: Index(['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC'], dtype='object')

iModulon names can be updated all at once as well.

[27]:
print('Original iModulon Names:', ica_data.imodulon_names[:5])

new_names = ['AllR/AraC/FucR']+ica_data.imodulon_names[1:]

print('New iModulon names:', new_names[:5])

ica_data.imodulon_names = new_names

print('Renamed iModulon Names:', ica_data.imodulon_names[:5])
Original iModulon Names: ['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
New iModulon names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Renamed iModulon Names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']

Copying IcaData objects

The copy method creates a new IcaData object identical to the old one.

[28]:
ica_data.copy()
[28]:
<pymodulon.core.IcaData at 0x7fe93ded5d10>

Saving and Loading IcaData Objects

To facilitate data sharing, you can save IcaData objects as json files that can be easily re-loaded

[29]:
from pymodulon.io import *
from os import path
[30]:
filepath = path.join('tmp','ecoli_data.json')
save_to_json(ica_data,filepath)
[31]:
ica_data = load_json_model(filepath)
[32]:
ica_data.imodulon_table.head()
[32]:
regulator f1score pvalue precision recall TP n_genes n_tf Category threshold
AllR/AraC/FucR allR/araC/fucR 0.750000 1.190000e-41 1.000000 0.600000 18.0 18 3 Carbon Source Utilization 0.086996
ArcA-1 arcA 0.130952 6.420000e-20 0.660000 0.072687 33.0 50 1 Energy Metabolism 0.058051
ArcA-2 arcA 0.087683 1.150000e-16 0.840000 0.046256 21.0 25 1 Energy Metabolism 0.081113
ArgR argR 0.177778 6.030000e-18 0.923077 0.098361 12.0 13 1 Amino Acid and Nucleotide Biosynthesis 0.080441
AtoC atoC 0.800000 1.520000e-12 0.666667 1.000000 4.0 6 1 Miscellaneous Metabolism 0.105756
[ ]:

Plotting Functions

PyModulon contains a suite of plotting functions for data visualization and iModulon characterization.

[1]:
from pymodulon.plotting import *
from pymodulon.example_data import load_ecoli_data
[2]:
ica_data = load_ecoli_data()

Bar plots

Gene expression and iModulon activities are easily viewed as bar plots. Use the plot_expression and plot_activities functions, respectively. Any numeric metadata for your experiments can be plotted using the plot_metadata function.

Optional arguments:

  • projects: Only show specific project(s)

  • highlight: Show individiual conditions for specific project(s)

  • ax: Use a pre-existing matplotlib axis (helpful if you want to manually determine the plot size).

  • legend_args: Dictionary of arguments to pass to the matplotlib legend (e.g. {'fontsize':12, 'loc':0, 'ncol':2})

Plot Gene Expression

You can plot the compendium-wide expression of a gene using either the locus tag or gene name. Each bar represents an experimental condition, and points are overlaid on the bars to show the gene expression of individual replicates. The plot is subdivided into projects, as stored in the sample_table.

[3]:
plot_expression(ica_data,'b0002');
_images/tutorials_plotting_functions_7_0.png

The plot can be limited to specific projects using the projects argument. The highlight argument shows individual conditions in the legend.

[4]:
plot_expression(ica_data,'thrA',projects=['ica','fps'],highlight='fps');
_images/tutorials_plotting_functions_9_0.png
Plot iModulon Activities

The plot_activities function mirrors the plot_expression function, but shows iModulon activities instead of gene expression.

[5]:
plot_activities(ica_data,'GlpR',highlight='crp');
_images/tutorials_plotting_functions_12_0.png
[6]:
plot_activities(ica_data,'GlpR',projects='crp');
_images/tutorials_plotting_functions_13_0.png
Plot sample metadata

If the sample_table contains numerical values in a column, this column can be graphed using the plot_metadata function.

[7]:
plot_metadata(ica_data,'Growth Rate (1/hr)');
_images/tutorials_plotting_functions_16_0.png

Scatterplots

Gene expression and iModulon activities can be compared with a scatter plot. Use the compare_expression and compare_activities functions, respectively. In addition, compare_values can be used to compare any compendium-wide value against another, including gene expression, iModulon activity, and sample metadata.

Optional arguments:

  • groups: Mapping of samples to specific groups

  • colors: Color of points, list of colors to use for different groups, or dictionary mapping groups to colors

  • show_labels: Show labels for points. (default: False)

  • adjust_labels: Automatically avoid label overlap

  • fit_metric: Correlation metric of 'pearson','spearman', or 'r2adj' (default: 'pearson')

  • ax: Use a pre-existing axis (helpful if you want to manually determine the plot size)

Formatting arguments:

Plot gene weights

plot_gene_weights will plot an iModulon’s gene weights against its genomic position. If the number of genes in the iModulon is fewer than 20, it will also show the gene names (or locus tags, if gene name is unavailable). If the gene_table contains a COG column, genes will be colored by their Cluster of Orthologous Genes category.

[8]:
plot_gene_weights(ica_data,'GlpR')
[8]:
<AxesSubplot:xlabel='Gene Start', ylabel='GlpR Gene Weight'>
_images/tutorials_plotting_functions_19_1.png

If there are more than 20 genes, gene names will not be shown by default.

[9]:
plot_gene_weights(ica_data,'Fnr')
[9]:
<AxesSubplot:xlabel='Gene Start', ylabel='Fnr Gene Weight'>
_images/tutorials_plotting_functions_21_1.png

Use show_labels=True show gene labels. It is advisable to turn of auto-adjustment of gene labels (adjust_labels=False), as this may take a while with many genes.

[10]:
plot_gene_weights(ica_data,'Fnr',show_labels=True,adjust_labels=False)
[10]:
<AxesSubplot:xlabel='Gene Start', ylabel='Fnr Gene Weight'>
_images/tutorials_plotting_functions_23_1.png
Compare two gene expression profiles

The compare_expression function plots two iModulon activities against each other. Groups of samples can be highlighted to visualize the effects of experimental conditions.

[11]:
groups = {'minspan__wt_glc_anaero__1':'Anaerobic',
          'minspan__wt_glc_anaero__2':'Anaerobic'}
[12]:
compare_expression(ica_data,'arcA','fnr',groups=groups)
[12]:
<AxesSubplot:xlabel='$arcA$ Expression', ylabel='$fnr$ Expression'>
_images/tutorials_plotting_functions_27_1.png
Compare two iModulon activities

The compare_activities function mirrors the compare_expression function.

[13]:
compare_activities(ica_data,'Fnr','ArcA-1',groups=groups)
[13]:
<AxesSubplot:xlabel='Fnr iModulon Activity', ylabel='ArcA-1 iModulon Activity'>
_images/tutorials_plotting_functions_30_1.png
[14]:
compare_activities(ica_data,'Fnr','ArcA-1',groups=groups,colors=['green','purple'])
[14]:
<AxesSubplot:xlabel='Fnr iModulon Activity', ylabel='ArcA-1 iModulon Activity'>
_images/tutorials_plotting_functions_31_1.png
Compare iModulon gene weights

compare_gene_weights plots the gene weights of two iModulons against each other. Dashed lines indicate iModulon thresholds. Genes outside both thresholds are highlighted in red and labelled.

[15]:
compare_gene_weights(ica_data,'CysB','Cbl+CysB')
[15]:
<AxesSubplot:xlabel='CysB Gene Weight', ylabel='Cbl+CysB Gene Weight'>
_images/tutorials_plotting_functions_34_1.png
Differential iModulon activity

A Differential iModulon Activity plot, or DiMA plot shows iModulons that have significantly different iModulon activities between two experimental conditions. The plot_dima function can either compare two conditions using project:condition keys, or using a list of sample names.

[16]:
plot_dima(ica_data,'fur:wt_fe','fur:delfur_fe2')
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  _diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
[16]:
<AxesSubplot:xlabel='fur:wt_fe', ylabel='fur:delfur_fe2'>
_images/tutorials_plotting_functions_37_2.png
To show the DiMA table, use table=True
Adjusting labels can be turned off using adjust=False
[17]:
ax, table = plot_dima(ica_data,
          ['fur__wt_fe__1','fur__wt_fe__2'],
          ['fur__delfur_fe2__1','fur__delfur_fe2__2'],
          table=True,
          adjust=False)
table
[17]:
difference pvalue qvalue 0 1
Fur-1 27.598965 7.320594e-04 0.011225 -6.701019 20.897946
fur-KO 14.918007 4.338664e-08 0.000004 3.516092 18.434098
RpoH 5.145175 3.091958e-03 0.031607 0.008680 5.153855
flu-yeeRS -5.020887 2.246638e-04 0.006612 5.299051 0.278164
membrane -6.005727 5.390713e-05 0.002480 -0.163181 -6.168908
PuuR -6.026142 1.397577e-03 0.017874 -3.040374 -9.066517
NarL -6.873706 4.898118e-03 0.040966 6.520503 -0.353203
iron-related -14.209552 2.874857e-04 0.006612 1.004311 -13.205241
Zinc -15.497242 5.238034e-04 0.009638 21.181402 5.684160
_images/tutorials_plotting_functions_39_1.png
Compare iModulon gene weights across organisms

In order to compare gene weights across organisms, you will need the IcaData objects for both organisms, as well as a file mapping the genes between the organisms. To generate this file, see the Comparing iModulons tutorial.

[18]:
from pymodulon.example_data import load_staph_data, load_example_bbh
staph_data = load_staph_data()
bbh = load_example_bbh()
[19]:
compare_gene_weights(ica_data,'PurR-2',
                     ica_data2 = staph_data,
                     imodulon2 = 'PyrR',
                     ortho_file = bbh)
[19]:
<AxesSubplot:xlabel='PurR-2 Gene Weight', ylabel='PyrR Gene Weight'>
_images/tutorials_plotting_functions_43_1.png

Use use_org1_names to switch which organism’s gene names are shown.

[20]:
compare_gene_weights(ica_data,'PurR-2',
                     ica_data2 = staph_data,
                     imodulon2 = 'PyrR',
                     ortho_file = bbh,
                     use_org1_names = False)
[20]:
<AxesSubplot:xlabel='PurR-2 Gene Weight', ylabel='PyrR Gene Weight'>
_images/tutorials_plotting_functions_45_1.png

Automated metadata classification

The function metadata_boxplot automatically classify iModulon activities given metadata information. Optional arguments include:

  • show_points: Overlay individual points on top of the boxplot. By default, this is True, and uses a stripplot.

  • show_points='swarm' will use a swarmplot.

  • n_boxes: Number of boxes to create

  • sample: Subset of samples to analyze

  • strip_conc: Remove concentrations from metadata (e.g. “glucose(2g/L)” would be interpreted as just “glucose”). Default is True.

  • ignore_cols: List of columns to ignore. If empty, only project and condition are ignored.

  • use_cols: List of columns to use. This supercedes ignore_cols.

  • return_results: Return a DataFrame describing the classifications.

[21]:
metadata_boxplot(ica_data,"EvgA",ignore_cols=['GEO','study','project','DOI']);
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/sklearn/tree/_classes.py:370: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
  FutureWarning,
_images/tutorials_plotting_functions_47_1.png

The use_cols argument selects which columns to use from the sample_table.

To turn off individual points, set show_points=False. Alternatively, show_points='swarm' will use a swarmplot.

[22]:
metadata_boxplot(ica_data,"RpoS",n_boxes=5,
                 use_cols=['Base Media','Carbon Source (g/L)',
                           'Nitrogen Source (g/L)','Supplement',
                           'Evolved Sample','pH','Growth Rate (1/hr)'],
                 show_points=False);
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/sklearn/tree/_classes.py:370: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
  FutureWarning,
_images/tutorials_plotting_functions_49_1.png

The samples argument selects specific samples to plot

The strip_kwargs argument can be used to customize the stripplot

[23]:
metadata_boxplot(ica_data,"RpoS",
                 n_boxes=3,
                 samples=ica_data.sample_names[:26],
                 ignore_cols = ['DOI','GEO'],
                 strip_kwargs={'size':10, 'edgecolor':'white', 'linewidth':1});
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/sklearn/tree/_classes.py:370: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.
  FutureWarning,
_images/tutorials_plotting_functions_51_1.png

iModulon histograms

iModulon gene weights can be visualized in a histogram. If you wish to highlight genes in a regulon, it can be visualized either as overlapping bars, or side-by-side bars.

[24]:
plot_regulon_histogram(ica_data,'Fur-1','fur')
[24]:
<AxesSubplot:xlabel='Fur-1 Gene Weight', ylabel='Number of Genes'>
_images/tutorials_plotting_functions_53_1.png
[25]:
plot_regulon_histogram(ica_data,'Fur-1','fur',kind='side')
[25]:
<AxesSubplot:xlabel='Fur-1 Gene Weight', ylabel='Number of Genes'>
_images/tutorials_plotting_functions_54_1.png

Overview plots

PyModulon contains two plots to help gain a general overview of your IcaData object.

Compare iModulons vs Regulons

compare_imodulon_vs_regulon creates a scatter plot that compares the overlap between iModulons and their linked regulators. The dashed lines creates four quadrants:

  1. Top-right: iModulons that are nearly identical to known regulons (“Well-matched”)

  2. Top-left: iModulons contain a subset of the total genes thought to be regulated by the linked regulator (“Subset”)

  3. Bottom-right: iModulons contain most of the genes thought to be regulated by the linked regulator. However, these iModulons contain a high fraction of genes that are not yet confirmed to be regulated by the regulator (“Unknown-containing”).

  4. Bottom-left: iModulons have a small, but statistically significant, overlap with the associated regulator (“Closest match”)

Additional options include:

  • imodulons: List of iModulons to plot

  • cat_column: Column in the `imodulon_table that stores the category of each iModulon

  • size_column: Column in the imodulon_table that stores the size of each iModulon

  • scale: Value used to scale the size of each point

  • reg_only: Only plot iModulons with an entry in the regulator column of the imodulon_table

  • xlabel: Custom x-axis label (default: “# shared genes/Regulon size”)

  • ylabel: Custom y-axis label (default: “# shared genes/iModulon size”)

  • vline: Draw a dashed vertical line

  • hline: Draw a dashed horizontal line

[26]:
compare_imodulons_vs_regulons(ica_data,
                              size_column='n_genes',
                              cat_column='Category',
                              scale=3);
_images/tutorials_plotting_functions_57_0.png
Plot explained variance

plot_explained_variance gives an idea of how much of the expression variation is captured by iModulons

[27]:
plot_explained_variance(ica_data);
_images/tutorials_plotting_functions_60_0.png

Activity Clustering

The iModulon activites in the A matrix can be clustered based on correlation between their activities across conditions in the compendium.

Use the cluster_activities function to prepare a clustermap; the minimal input is simply your IcaData object.

[28]:
cluster_activities(ica_data);
_images/tutorials_plotting_functions_62_0.png
Using different correlation metrics

You can use multiple correlation metrics, including "pearson","spearman", and "mutual_info". Mutual information is most likely to identify biologically similar iModulons, but can be more difficult to interpret as it finds both linear and non-linear correlations.

[29]:
cluster_activities(ica_data,correlation_method='mutual_info');
_images/tutorials_plotting_functions_64_0.png
Automatic Distance Thresholding

Agglomerative (hierarchical) clustering is used under the hood. Thus, a distance threshold for defining “flat” clusters from the hierarchical structure must be determined. By default, this distance threshold is automagically calculated using a sensitivity analysis.

Different distance thresholds (this value is between 0 and 1) are tried, and the resulting clustering is assessed using a silhouette score (a measure of how separate the clusters are). The distance threshold yielding the maximum silhouette score is automatically chosen.

To see the result of this sensitivity analysis, use the show_thresholding option.

[30]:
cluster_activities(ica_data, show_thresholding=True);
_images/tutorials_plotting_functions_66_0.png
_images/tutorials_plotting_functions_66_1.png
Manual Distance Threshold

You may also determine that you’re interested in manually varying the distance threshold to see what happens to the iModulon clusters. Setting this threshold manually (with the distance_threshold option) will override the automatic thresholding shown above.

Note: distance_threshold must be set to a value between 0 and 1; larger values will generally yield smaller numbers of larger clusters, as shown in the thresholding plot above.

[31]:
cluster_activities(ica_data, distance_threshold=0.95);
_images/tutorials_plotting_functions_68_0.png
Displaying Best Clusters

The above clustermaps do not allow you to see which iModulons are actually being clustered together; use the show_best_clusters option to call out an additional plot that shows such clusters.

By default, the clusters whose individual silhouette scores are greater than the mean silhouette score (indicating their separation from the other clusters is above-average) will be shown.

The cluster numbers come from the scikit-learn AgglomerativeClustering estimator that actually performs the clustering; these labels don’t have any special significance and are just unique identifiers for each cluster. You can also access an iModulon-index-matched list of these labels by accessing the labels_ attribute of the AgglomerativeClustering object (which is returned by cluster_activities).

[32]:
cluster_activities(ica_data, show_best_clusters=True);
_images/tutorials_plotting_functions_70_0.png
_images/tutorials_plotting_functions_70_1.png

So we can see here that the clustering method does seem to capture some biologically-relevant groups of iModulons: Cluster 15 contains 2 flagella regulators, Cluster 8 is membrane-related, Cluster 13 is stress-related, Cluster 25 is iron-related, Cluster 12 is carbon metabolism related, etc.

NOTE: the parenthesized numbers next to the cluster names are the clusters’ silhouette scores (a 0 to 1 measure of a cluster’s separation from the pack).

Perhaps you’re only interested in seeing a specific number of the top clusters; use the n_best_clusters argument to specify this preference:

[33]:
cluster_activities(ica_data, show_best_clusters=True, n_best_clusters=5);
_images/tutorials_plotting_functions_72_0.png
_images/tutorials_plotting_functions_72_1.png
Naming Clusters

After performing an initial clustering and manually mapping knowledge onto your best clusters, you may decide on a new, more descriptive name for some of your clusters. To generate pretty figures that use these names instead of the soulless integer IDs, use the cluster_names option to map the integer IDs to names. You don’t have to rename all clusters.

[34]:
cluster_activities(
    ica_data, show_best_clusters=True, n_best_clusters=5,
    cluster_names={15: 'Flagella', 8: 'Membrane', 13: 'Stress'}
);
_images/tutorials_plotting_functions_74_0.png
_images/tutorials_plotting_functions_74_1.png
DIMCA

Differential iModulon Cluster Activity (DIMCA) analysis, a sister method to the differential iModulon activity (DIMA) analysis, allows you to compare all iModulon activities between 2 or more conditions.

cluster_activities itself has the capability to perform DIMCA analyses, exposing a series of dimca_-prefixed arguments that correspond with the arguments to plot_dima (which is in fact used under the hood).

For DIMCA, a “cluster activity” will be calculated for each of your best clusters (however many you ask for) and then plotted between your 2 conditions, instead of that cluster’s constituent iModulons. In this way, a DIMA plot can be rendered even simpler.

Cluster activities are simply the averages of the activities of the constituent iModulons, EXCEPT that for iModulons that are generally anti-correlated with the others in a cluster (see purR-KO from the above plot, for example), the sign of the activities is first switched.

[35]:
plot_dima(ica_data, 'fur:wt_dpd', 'fur:wt_fe');
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  _diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
_images/tutorials_plotting_functions_76_1.png

This DIMA plot is fairly busy; we can use DIMCA to reduce the number of points even further:

[36]:
cluster_activities(ica_data, show_best_clusters=True, dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe');
_images/tutorials_plotting_functions_78_0.png
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  _diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
_images/tutorials_plotting_functions_78_2.png
_images/tutorials_plotting_functions_78_3.png

Thus, we have a somewhat simpler picture of this comparison. We can increase the number of best clusters we ask for to yield only cluster points; be careful with this though, as some of the worse-scoring clusters are actually “singleton” clusters with just a single iModulon in them (for these iModulons’ activities are generally uncorrelated with the other iModulons in the dataset).

[37]:
cluster_activities(ica_data, n_best_clusters=50, dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe');
_images/tutorials_plotting_functions_80_0.png
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  _diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
_images/tutorials_plotting_functions_80_2.png

And as before, if we name clusters, those names will propagate to the DIMCA plot (and DIMCA table if requested):

[38]:
cluster_obj, dimca_ax, table = cluster_activities(
    ica_data, show_best_clusters=True,
    cluster_names={25: 'Iron', 13: 'Stress', 8: 'Membrane'},
    dimca_sample1='fur:wt_dpd', dimca_sample2='fur:wt_fe', dimca_table=True
);
_images/tutorials_plotting_functions_82_0.png
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:174: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  _diff[":".join(name)] = abs(A_to_use[i1] - A_to_use[i2])
_images/tutorials_plotting_functions_82_2.png
_images/tutorials_plotting_functions_82_3.png
[39]:
table
[39]:
difference pvalue qvalue 0 1
iron-related 16.153595 0.000168 0.001971 -15.149284 1.004311
ArgR 13.753341 0.002044 0.005877 -11.636817 2.116524
NarL 8.227232 0.002884 0.007045 -1.706729 6.520503
Membrane [Clst] 7.906679 0.001441 0.004514 -6.433855 1.472824
GcvA 7.412801 0.000035 0.000823 -6.265427 1.147374
Cluster 28 7.156066 0.000250 0.002347 -6.943347 0.212719
Cluster 9 -5.126362 0.000456 0.002925 4.132141 -0.994221
SoxS -5.226992 0.000665 0.003127 5.043639 -0.183353
Thiamine -5.558917 0.003923 0.007682 5.193298 -0.365619
FecI -5.760403 0.001402 0.004514 5.871660 0.111257
uncharacterized-1 -7.566460 0.000498 0.002925 8.833006 1.266546
Stress [Clst] -7.639391 0.000619 0.003127 7.081778 -0.557613
Cluster 2 -8.318706 0.000114 0.001783 -1.638389 -9.957095
Iron [Clst] -8.611146 0.003289 0.007045 6.395635 -2.215511
Leu/Ile -8.702352 0.000850 0.003633 8.712151 0.009800
MetJ -9.857605 0.000403 0.002925 9.533284 -0.324320
nitrate-related -11.292903 0.000002 0.000097 13.926154 2.633251

Note that [Clst] is added to cluster names in the DIMCA plot to avoid confusion with unclustered iModulons.

Inferring iModulon activities for new data

Re-computing the complete set of iModulons can be computationally intensive for every new dataset. However, once a dataset reaches a critical size, you can use a pre-computed IcaData object to infer the iModulon activities of a new dataset. iModulon activities are relative measures; every dataset must have a reference condition to which all other samples are compared against.

To compute the new iModulon activities, first load the pre-computed IcaData object.

[1]:
from pymodulon.example_data import load_ecoli_data
ica_data = load_ecoli_data()

Next, load your expression profiles. This should be normalized using whichever read mapping pipeline you use, as Transcripts per Million (TPM) or log-TPM.

[2]:
from pymodulon.example_data import load_example_log_tpm
log_tpm = load_example_log_tpm()
log_tpm.head()
[2]:
Reference_1 Reference_2 Test_1 Test_2
Geneid
b0001 10.473721 10.271944 10.315476 10.808135
b0002 10.260569 10.368555 10.735874 10.726916
b0003 9.920277 10.044224 10.528432 10.503092
b0004 9.936694 10.010638 9.739519 9.722997
b0005 7.027515 7.237449 6.745798 6.497823

Next, make sure your dataset uses similar gene names as the target IcaData object.

[3]:
from matplotlib_venn import venn2
venn2((set(ica_data.gene_names),set(log_tpm.index)), set_labels=['IcaData genes','Dataset genes'])
[3]:
<matplotlib_venn._common.VennDiagram at 0x7f674e99c410>
_images/tutorials_inferring_imodulon_activities_for_new_data_7_1.png

Only genes shared between your IcaData object and the new expression profiles will be used to project your data. All other genes will be ignored.

Then, center your dataset on a reference condition, taking the average of replicates.

[4]:
centered_log_tpm = log_tpm.sub(log_tpm[['Reference_1','Reference_2']].mean(axis=1),axis=0)
centered_log_tpm.head()
[4]:
Reference_1 Reference_2 Test_1 Test_2
Geneid
b0001 0.100889 -0.100889 -0.057356 0.435303
b0002 -0.053993 0.053993 0.421312 0.412354
b0003 -0.061973 0.061973 0.546181 0.520841
b0004 -0.036972 0.036972 -0.234147 -0.250669
b0005 -0.104967 0.104967 -0.386684 -0.634659

Finally, use the pymodulon.util.infer_activities function to infer the relative iModulon activities of your dataset.

[5]:
from pymodulon.util import infer_activities
[6]:
activities = infer_activities(ica_data,centered_log_tpm)
activities.head()
/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:327: FutureWarning: Index.__and__ operating as a set operation is deprecated, in the future this will be a logical operation matching Series.__and__.  Use index.intersection(other) instead
  shared_genes = ica_data.M.index & data.index
[6]:
Reference_1 Reference_2 Test_1 Test_2
AllR/AraC/FucR 0.243143 -0.243143 1.028044 0.848571
ArcA-1 -0.157687 0.157687 -2.644027 -2.418106
ArcA-2 0.038248 -0.038248 0.182260 0.039267
ArgR -0.150147 0.150147 -1.456806 -1.293399
AtoC 0.344893 -0.344893 0.632130 1.075412

All of the plotting functions in pymodulon.plotting can be used on your inferred activities once you add it to a new IcaData object. It is advisable to create a new sample_table with project and condition columns.

[7]:
from pymodulon.core import IcaData
import pandas as pd
[8]:
new_sample_table = pd.DataFrame([['new_data','reference']]*2+[['new_data','test']]*2,columns=['project','condition'],index=log_tpm.columns)
new_sample_table
[8]:
project condition
Reference_1 new_data reference
Reference_2 new_data reference
Test_1 new_data test
Test_2 new_data test
[9]:
new_data = IcaData(ica_data.M,
                   activities,
                   gene_table = ica_data.gene_table,
                   sample_table = new_sample_table,
                   imodulon_table = ica_data.imodulon_table)
[10]:
from pymodulon.plotting import *
[11]:
plot_activities(new_data,'Fur-1',highlight='new_data')
[11]:
<AxesSubplot:ylabel='Fur-1 iModulon\nActivity'>
_images/tutorials_inferring_imodulon_activities_for_new_data_19_1.png
[ ]:

[ ]:

[ ]:

Gene enrichment analysis

[1]:
from pymodulon.enrichment import *
from pymodulon.example_data import load_ecoli_data, trn
ica_data = load_ecoli_data()

General functions

To perform a basic enrichment test between two gene sets, use the compute_enrichment function. Optional arguments:

  • label: Label for your target gene set (e.g. regulator name or GO term)

[2]:
gene_set = ['b0002','b0003','b0004'] # e.g. an iModulon or other gene set
target_genes = ['b0002','b0005','b0007'] # e.g. a regulon or genes in a COG category
all_genes = ica_data.gene_names # List of all genes in your expression dataset
[3]:
compute_enrichment(gene_set,target_genes,all_genes)
[3]:
pvalue             0.002293
precision          0.333333
recall             0.333333
f1score            0.333333
TP                 1.000000
target_set_size    3.000000
gene_set_size      3.000000
dtype: float64

To perform an enrichment test against a regulon in a TRN table, use the compute_regulon function.

[4]:
compute_regulon_enrichment(gene_set,'lrp',all_genes,trn)
[4]:
pvalue             0.000131
precision          1.000000
recall             0.015000
f1score            0.029557
TP                 3.000000
regulon_size     200.000000
gene_set_size      3.000000
n_regs             1.000000
Name: lrp, dtype: float64

Each row of the TRN table represents a regulatory interaction. The table requires the columns regulator and gene_id.

[5]:
trn.head()
[5]:
regulator gene_id effect
0 FMN b3041 -
1 L-tryptophan b3708 +
2 L-tryptophan b3709 +
3 TPP b0066 -
4 TPP b0067 -

To perform an enrichment against all regulons in a TRN table, use the compute_trn_enrichment function. This function includes false detection correction using the Benjamini-Hochberg procedure to compute Q-values. By default, the false detection rate is 1e-5, and can be updated using the fdr argument.

[6]:
compute_trn_enrichment(gene_set,all_genes,trn)
[6]:
pvalue precision recall f1score TP regulon_size gene_set_size n_regs qvalue
thr-tRNA 0.000000e+00 1.0 1.000000 1.000000 3.0 3.0 3.0 1.0 0.000000e+00
ile-tRNA 8.354256e-09 1.0 0.333333 0.500000 3.0 9.0 3.0 1.0 2.923990e-08
lrp 1.306248e-04 1.0 0.015000 0.029557 3.0 200.0 3.0 1.0 3.047911e-04
arcA 1.540883e-03 1.0 0.006608 0.013129 3.0 454.0 3.0 1.0 2.157237e-03
fnr 1.411995e-03 1.0 0.006803 0.013514 3.0 441.0 3.0 1.0 2.157237e-03
crp 3.335561e-03 1.0 0.005111 0.010169 3.0 587.0 3.0 1.0 3.891487e-03

compute_trn_enrichment can perform enrichment against complex regulons in boolean combinations. The max_regs argument determines the number of regulators to include in the complex regulon, and the method argument determines how to combine complex regulons. method='or' computes enrichments against the union of regulons, 'and' computes enrichment against intersection of regulons, and 'both' performs both tests (default).

[7]:
compute_trn_enrichment(gene_set,all_genes,trn,max_regs=2).head()
[7]:
pvalue precision recall f1score TP regulon_size gene_set_size n_regs qvalue
fnr+ile-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 2.0 0.0
thr-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 1.0 0.0
arcA+ile-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 2.0 0.0
lrp+thr-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 2.0 0.0
arcA+thr-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 2.0 0.0

Increasing the number of regulators for complex regulons beyond three can lead to a combinatorial explosion. If you are sure you want to search for larger complex regulons, use force=True

[8]:
compute_trn_enrichment(gene_set,all_genes,trn,max_regs=3,force=True).head()
[8]:
pvalue precision recall f1score TP regulon_size gene_set_size n_regs qvalue
arcA+ile-tRNA+lrp 0.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 0.0
arcA+ile-tRNA+thr-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 0.0
arcA+lrp+thr-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 0.0
arcA+rpoD+thr-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 0.0
crp+fnr+ile-tRNA 0.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 0.0

Using the IcaData object

[9]:
# Compare a single iModulon to a simple regulon
ica_data.compute_regulon_enrichment('GlpR',regulator='glpR')
[9]:
pvalue           0
precision        1
recall           1
f1score          1
TP               9
regulon_size     9
imodulon_size    9
n_regs           1
Name: glpR, dtype: int64
[10]:
# Compare a single iModulon to a complex regulon
ica_data.compute_regulon_enrichment('Copper','cusR/cueR')
[10]:
pvalue           5.471346e-20
precision        1.000000e+00
recall           4.210526e-01
f1score          5.925926e-01
TP               8.000000e+00
regulon_size     1.900000e+01
imodulon_size    8.000000e+00
n_regs           2.000000e+00
Name: cusR/cueR, dtype: float64
[11]:
# Get all significant enrichments for all iModulons.
# Optional parameters are same as the global compute_trn_enrichment function
ica_data.compute_trn_enrichment()
[11]:
imodulon regulator pvalue qvalue precision recall f1score TP regulon_size imodulon_size n_regs
0 AllR/AraC/FucR araC 3.804040e-14 3.423636e-13 0.388889 0.500000 0.437500 7.0 14.0 18.0 1.0
1 AllR/AraC/FucR allR 3.067669e-13 1.380451e-12 0.333333 0.666667 0.444444 6.0 9.0 18.0 1.0
2 AllR/AraC/FucR fucR 2.316812e-11 6.950436e-11 0.277778 0.714286 0.400000 5.0 7.0 18.0 1.0
3 AllR/AraC/FucR rpoD 5.383749e-07 1.211343e-06 0.944444 0.011668 0.023051 17.0 1457.0 18.0 1.0
4 AllR/AraC/FucR crp 8.866740e-07 1.596013e-06 0.666667 0.020443 0.039669 12.0 587.0 18.0 1.0
... ... ... ... ... ... ... ... ... ... ... ...
181 nitrate-related narP 9.268494e-13 6.024521e-12 0.571429 0.163265 0.253968 8.0 49.0 14.0 1.0
182 nitrate-related nsrR 3.762939e-09 1.630607e-08 0.500000 0.086420 0.147368 7.0 81.0 14.0 1.0
183 nitrate-related fnr 8.576894e-09 2.787490e-08 0.785714 0.024943 0.048352 11.0 441.0 14.0 1.0
184 translation fnr 2.865424e-08 4.584679e-07 0.583333 0.031746 0.060215 14.0 441.0 24.0 1.0
185 translation arcA 4.244932e-07 3.395946e-06 0.541667 0.028634 0.054393 13.0 454.0 24.0 1.0

186 rows × 11 columns

Save to sample table

You can directly save iModulon enrichments to the sample table using either the ica_data.compute_regulon_enrichment function or ica_data.compute_trn_enrichment function. Columns unrelated to regulon enrichments (e.g. Function and Category) are left alone.

[12]:
ica_data.imodulon_table.loc[['GlpR']]
[12]:
regulator f1score pvalue precision recall TP n_genes n_tf Category threshold
GlpR glpR 1.0 0.0 1.0 1.0 9.0 9 1 Carbon Source Utilization 0.066801
[13]:
ica_data.compute_regulon_enrichment('GlpR',regulator='crp',save=True)
ica_data.imodulon_table.loc[['GlpR']]
[13]:
regulator f1score pvalue precision recall TP n_genes n_tf Category threshold regulon_size imodulon_size n_regs
GlpR crp 0.030201 3.567813e-08 1.0 0.015332 9.0 9 1 Carbon Source Utilization 0.066801 587.0 9.0 1.0
[14]:
# The enrichment with the lowest qvalue is saved as the official enrichment
ica_data.compute_trn_enrichment(save=True)
ica_data.imodulon_table
[14]:
regulator pvalue qvalue precision recall f1score TP regulon_size imodulon_size n_regs threshold Category n_tf n_genes
AllR/AraC/FucR araC 3.804040e-14 3.423636e-13 0.388889 0.500000 0.437500 7.0 14.0 18.0 1.0 0.086996 Carbon Source Utilization 3 18
ArcA-1 arcA 6.418412e-20 2.374813e-18 0.660000 0.072687 0.130952 33.0 454.0 50.0 1.0 0.058051 Energy Metabolism 1 50
ArcA-2 appY 7.019326e-18 1.895218e-16 0.320000 0.888889 0.470588 8.0 9.0 25.0 1.0 0.081113 Energy Metabolism 1 25
ArgR argR 6.026074e-18 7.833897e-17 0.923077 0.098361 0.177778 12.0 122.0 13.0 1.0 0.080441 Amino Acid and Nucleotide Biosynthesis 1 13
AtoC atoC 1.522277e-12 1.217822e-11 0.666667 1.000000 0.800000 4.0 4.0 6.0 1.0 0.105756 Miscellaneous Metabolism 1 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
uncharacterized-4 None NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.074349 Uncharacterized 0 10
uncharacterized-5 None NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.079179 Uncharacterized 0 4
uncharacterized-6 None NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.052578 Uncharacterized 0 54
ydcI-KO None NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.099229 Genomic Alterations 0 3
yheO-KO None NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.078511 Genomic Alterations 0 16

92 rows × 14 columns

[ ]:

Searching for motifs

Using MEME to find motifs

MEME must be installed to use this functionality. find_motifs searches for motifs upstream of all genes in an iModulon. The gene_table must contain the columns accession and operon for this function to work (see notebooks/gene_annotation.ipynb).

find_motifs supports many of the command-line options for MEME:

  • outdir: Directory for output files

  • palindrome: If True, limit search to palindromic motifs (default: False)

  • nmotifs: Number of motifs to search for (default: 5)

  • upstream: Number of basepairs upstream from first gene in operon to include in motif search (default: 500)

  • downstream: Number of basepairs upstream from first gene in operon to include in motif search (default: 100)

  • verbose: Show steps in verbose output (default: True)

  • force: Force execution of MEME even if output already exists (default: False)

  • evt: E-value threshold (default: 0.001)

  • cores Number of cores to use (default: 8)

  • minw: Minimum motif width in basepairs (default: 6)

  • maxw: Maximum motif width in basepairs (default: 40)

  • minsites: Minimum number of sites required for a motif. Default is the number of operons divided by 3.

[1]:
from pymodulon.motif import *
from pymodulon.example_data import load_ecoli_data, ecoli_fasta
ica_data = load_ecoli_data()
[4]:
motifs = find_motifs(ica_data, 'ArgR', ecoli_fasta, outdir='tmp')
Finding motifs for 11 sequences
Running command: meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3
Found 2 motifs across 15 sites

find_motifs creates a MotifInfo object

[5]:
motifs
[5]:
<MotifInfo with 2 motifs across 15 sites>

The MotifInfo object contains the following properties:

  • motifs: Information about the motifs (e.g. E-value, width, consensus sequence)

[6]:
motifs.motifs
[6]:
e_value sites width consensus motif_frac
motif
MEME-1 1.7e-30 10 39 TGAATWWAHATKCAVTWWWTWTGMATAAAWATWCAHTRR 0.909091
MEME-2 0.00015 5 39 GMAYASGCWSVTYGYGGVTGHCHGMDGCNACGCTGGCGY 0.454545
  • sites: Information about predicted binding sites (e.g. location, p-value)

[7]:
motifs.sites
[7]:
rel_position pvalue site_seq genes locus_tags start_pos strand
motif operon
MEME-1 argA 453 2.78e-12 AGAATAAAAATACACTAATTTCGAATAATCATGCAAAGA argA b2818 2948741 +
argCBH 375 1.89e-16 TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA argC,argB,argH b3958,b3959,b3960 4154500 +
argD 132 6.68e-14 TGAAATTATAACCACAAAATATGCATAAAAAATCACTAA argD b3359 3490080 -
argE 128 1.89e-16 TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA argE b3957 4154747 -
argF 129 6.79e-16 TGAATTAAAATTCACTTTATATGTGTAATTATTCATTTG argF b0273 290205 -
argG 412 2.07e-13 TAAATGAAAACTCATTTATTTTGCATAAAAATTCAGTGA argG b3172 3318136 +
argI 127 9.75e-16 TGAATTAAAATTCAATTTATATGGATGATTATTCATTTG argI b4254 4478211 -
artJ 171 1.63e-12 TGAATTTATATGCAATAAACATGATTAAATAATTTAAAT artJ b0860 900475 -
carAB 453 3.02e-14 TGAATTAATATGCAAATAAAGTGAGTGAATATTCTCTGG carA b0032 29150 +
dtpD 192 1.11e-11 TTATGCAATATGCTGTCTGATTGCATAAATATACATTAG dtpD b0709 742456 -
xynR NaN NaN NaN yagI b0272 289062 -
MEME-2 argA 151 1.64e-13 GCACCGCCACGACCTGGCTGACCGCAGCCAGACTGACCT argA b2818 2948741 +
argCBH 505 4.24e-22 GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC argC,argB,argH b3958,b3959,b3960 4154500 +
argD NaN NaN NaN argD b3359 3490080 -
argE 258 4.24e-22 GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC argE b3957 4154747 -
argF NaN NaN NaN argF b0273 290205 -
argG NaN NaN NaN argG b3172 3318136 +
argI NaN NaN NaN argI b4254 4478211 -
artJ 305 1.00e-15 GCACAGGCACCGTGCTGATGTCTGATGCGACGCTGGCGC artJ b0860 900475 -
carAB NaN NaN NaN carA b0032 29150 +
dtpD NaN NaN NaN dtpD b0709 742456 -
xynR 301 2.76e-13 GAACACGATGCTCGCCGCCGACTCAAACACCTCGTCCGT yagI b0272 289062 -
  • cmd: The original command used to run MEME

[8]:
motifs.cmd
[8]:
'meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3'
  • file: Path to MEME output. This is the relative path of the file from the notebook that ran find_motifs

[9]:
motifs.file
[9]:
'tmp/ArgR/meme.txt'

This MotifInfo object is automatically stored as a dictionary in the IcaData object. It will persist after saving and re-loading the IcaData object.

[10]:
ica_data.motif_info
[10]:
{'ArgR': <MotifInfo with 2 motifs across 15 sites>}

Using TOMTOM to compare motifs against external databases

Once you have a motif from MEME, you can use TOMTOM to compare your motif against external databases. The compare_motifs function makes this process simple.

The MotifInfo object generated in the find_motifs function contains the MEME file location, which is the primary input for compare_motifs.

[17]:
compare_motifs(ica_data.motif_info['ArgR'], verbose=False)
[17]:
motif target target_id pvalue Evalue qvalue overlap optimal_offset orientation database
0 MEME-1 argR2 argR2 2.648790e-28 1.565440e-25 3.020170e-25 37 2 - dpinteract
1 MEME-1 ArgR_26-5 ArgR_26-5 4.221120e-16 2.494680e-13 2.406470e-13 26 0 - SwissRegulon_e_coli
2 MEME-1 argR argR 2.456950e-13 1.452060e-10 9.338110e-11 18 -19 - dpinteract
3 MEME-1 ArgR MX000116 3.377890e-07 1.996330e-04 6.419150e-05 14 -21 + prodoric
4 MEME-1 AhrC MX000043 4.871210e-07 2.878890e-04 6.942730e-05 16 -20 - prodoric

This information is saved in the matches attribute of the ArgR MotifInfo. Again, this information will be maintained if you save the file to json.

[13]:
ica_data.motif_info['ArgR'].matches
[13]:
motif target target_id pvalue Evalue qvalue overlap optimal_offset orientation database
0 MEME-1 argR2 argR2 2.648790e-28 1.565440e-25 3.020170e-25 37 2 - dpinteract
1 MEME-1 ArgR_26-5 ArgR_26-5 4.221120e-16 2.494680e-13 2.406470e-13 26 0 - SwissRegulon_e_coli
2 MEME-1 argR argR 2.456950e-13 1.452060e-10 9.338110e-11 18 -19 - dpinteract
3 MEME-1 ArgR MX000116 3.377890e-07 1.996330e-04 6.419150e-05 14 -21 + prodoric
4 MEME-1 AhrC MX000043 4.871210e-07 2.878890e-04 6.942730e-05 16 -20 - prodoric

By default, compare_motifs compares your motifs to 5 prokaryotic databases. You can also use the path to a local motif file or limit the search to a single database, namely:

  • 'prodoric'

  • 'collectf'

  • 'dpinteract'

  • 'regtransbase'

  • 'SwissRegulon_e_coli'

To re-run compare_motifs with different arguments, set force=True

[18]:
compare_motifs(ica_data.motif_info['ArgR'], motif_db = 'prodoric', verbose=False, force=True)
[18]:
motif target target_id pvalue Evalue qvalue overlap optimal_offset orientation database
0 MEME-1 ArgR MX000116 2.901120e-07 0.000058 0.000055 14 -21 + prodoric
1 MEME-1 AhrC MX000043 4.308400e-07 0.000087 0.000055 16 -20 - prodoric
[ ]:

iModulon Thresholds

iModulon thresholds are computed using an outlier test (See Methods section “Determination of the gene coefficient threshold” in Sastry et al 2019). This outlier test relies on the D’Agostino test statistic.

However, if you are working with an understudied organism, we recommend setting the thresholds using the K-means method when creating the :class:~pymodulon.core.IcaData object.

[1]:
from pymodulon.example_data import load_ecoli_data
from pymodulon.core import IcaData
[2]:
ica_data = load_ecoli_data()

Setting the threshold method

You can choose between the D’Agostino or K-means method of thresholding when creating the :class:~pymodulon.core.IcaData object.

[3]:
test_data = IcaData(ica_data.M,
                    ica_data.A,
                    trn=ica_data.trn,
                    threshold_method='dagostino')

test_data.thresholds['ArgR']
WARNING:root:Using the default dagostino_cutoff of 550. This may not be optimal for your dataset. Use ica_data.reoptimize_thresholds() to find the optimal threshold.
[3]:
0.0804412723

A higher D’Agostino cutoff leads to higher thresholds. The default is 550.

[4]:
test_data = IcaData(ica_data.M,
                    ica_data.A,
                    trn=ica_data.trn,
                    threshold_method='dagostino',
                    dagostino_cutoff=5000)

test_data.thresholds['ArgR']
[4]:
0.26879279035000003

To search for the optimal D’Agostino cutoff, set optimize_cutoff=True. To demonstrate this, we created a mini IcaData object since optimizing the cutoff can take 2-3 minutes.

[5]:
test_data = IcaData(ica_data.M.iloc[:,:10],
                    ica_data.A.iloc[:10,:],
                    trn=ica_data.trn,
                    threshold_method='dagostino',
                    optimize_cutoff=True)

test_data.thresholds['ArgR']
WARNING:root:Optimizing iModulon thresholds, may take 2-3 minutes...
[5]:
0.0945705206

The K-means method is a non-parametric threshold estimation method and is the default if no TRN is supplied. Although the thresholds are fairly close to the D’Agostino method, the D’Agostino method produces the highest accuracy iModulons.

[6]:
test_data = IcaData(ica_data.M,
                    ica_data.A,
                    threshold_method='kmeans')

test_data.thresholds['ArgR']
[6]:
0.1375202039

Changing the iModulon thresholds

If the TRN file is updated, the D’Agostino test statistic can be re-optimized using the reoptimize_thresholds function. To manually adjust this statistic, use the recompute_thresholds function. To change a single iModulon threshold, use the change_threshold function.

Re-optimizing the threshold

The reoptimize_threshold function may take a few minutes to complete. Fortunately, this only needs to be performed once. You can turn off the progress bar or plot using the progress and plot arguments, respectively.

[7]:
# ica_data.reoptimize_thresholds()
Changing the D’agostino statistic

To test out different D’agostino test statistics, use the recompute_thresholds function

[8]:
ica_data.view_imodulon('GlpR')
[8]:
gene_weight start end strand gene_name length operon COG accession regulator
b2239 0.211384 2349934 2351011 - glpQ 1077 glpTQ Energy production and conversion NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2240 0.306134 2351015 2352374 - glpT 1359 glpTQ Carbohydrate transport and metabolism NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2241 0.375662 2352646 2354275 + glpA 1629 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2242 0.328961 2354264 2355524 + glpB 1260 glpABC Amino acid transport and metabolism NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2243 0.315752 2355520 2356711 + glpC 1191 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b3426 0.350034 3562012 3563518 + glpD 1506 glpD Energy production and conversion NC_000913.3 arcA,crp,glpR,rpoD,yieP
b3925 0.084278 4114568 4115579 - glpX 1011 glpFKX Carbohydrate transport and metabolism NC_000913.3 crp,glpR,rpoD
b3926 0.290235 4115713 4117222 - glpK 1509 glpFKX Energy production and conversion NC_000913.3 crp,glpR,rpoD
b3927 0.312307 4117244 4118090 - glpF 846 glpFKX Carbohydrate transport and metabolism NC_000913.3 crp,glpR,rpoD
[9]:
ica_data.recompute_thresholds(300)
[10]:
ica_data.view_imodulon('GlpR')
[10]:
gene_weight start end strand gene_name length operon COG accession regulator
b1199 0.045686 1249124 1249757 - dhaL 633 dhaKLM Carbohydrate transport and metabolism NC_000913.3 arcA,dhaR,fnr,lrp,rpoS
b1200 0.045491 1249767 1250838 - dhaK 1071 dhaKLM Carbohydrate transport and metabolism NC_000913.3 arcA,dhaR,fnr,lrp,rpoS
b2239 0.211384 2349934 2351011 - glpQ 1077 glpTQ Energy production and conversion NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2240 0.306134 2351015 2352374 - glpT 1359 glpTQ Carbohydrate transport and metabolism NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2241 0.375662 2352646 2354275 + glpA 1629 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2242 0.328961 2354264 2355524 + glpB 1260 glpABC Amino acid transport and metabolism NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2243 0.315752 2355520 2356711 + glpC 1191 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b3011 0.049324 3155354 3156518 + yqhD 1164 yqhD-dkgA Energy production and conversion NC_000913.3 rpoD,yqhC
b3426 0.350034 3562012 3563518 + glpD 1506 glpD Energy production and conversion NC_000913.3 arcA,crp,glpR,rpoD,yieP
b3925 0.084278 4114568 4115579 - glpX 1011 glpFKX Carbohydrate transport and metabolism NC_000913.3 crp,glpR,rpoD
b3926 0.290235 4115713 4117222 - glpK 1509 glpFKX Energy production and conversion NC_000913.3 crp,glpR,rpoD
b3927 0.312307 4117244 4118090 - glpF 846 glpFKX Carbohydrate transport and metabolism NC_000913.3 crp,glpR,rpoD
Manually changing thresholds

To manually change a threshold, use the change_threshold function.

[11]:
ica_data.change_threshold('GlpR',0.2)
[12]:
ica_data.view_imodulon('GlpR')
[12]:
gene_weight start end strand gene_name length operon COG accession regulator
b2239 0.211384 2349934 2351011 - glpQ 1077 glpTQ Energy production and conversion NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2240 0.306134 2351015 2352374 - glpT 1359 glpTQ Carbohydrate transport and metabolism NC_000913.3 crp,fis,fnr,glpR,ihf,nac,rpoD
b2241 0.375662 2352646 2354275 + glpA 1629 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2242 0.328961 2354264 2355524 + glpB 1260 glpABC Amino acid transport and metabolism NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b2243 0.315752 2355520 2356711 + glpC 1191 glpABC Energy production and conversion NC_000913.3 arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD
b3426 0.350034 3562012 3563518 + glpD 1506 glpD Energy production and conversion NC_000913.3 arcA,crp,glpR,rpoD,yieP
b3926 0.290235 4115713 4117222 - glpK 1509 glpFKX Energy production and conversion NC_000913.3 crp,glpR,rpoD
b3927 0.312307 4117244 4118090 - glpF 846 glpFKX Carbohydrate transport and metabolism NC_000913.3 crp,glpR,rpoD
[ ]:

Comparing iModulons

compare_ica is a function specifically built to compare the components between different ICA runs. The function will return a Digraph plot which shows connections between ICA components of different runs, as well as Dictionary List of the one-to-one connections.

The function requires a minimum of two inputs if running a comparison beween two ICA runs for the same organism:

  • S1: Pandas Dataframe of M matrix 1

  • S2: Pandas Dataframe of M Matrix 2

compare_ica can be found in pymodulon.compare

[1]:
from pymodulon import example_data
from pymodulon.compare import *

Comparsion within the same organism

[2]:
ecoli_data = example_data.load_ecoli_data()

We will create a copy of the E. coli M matrix that has slightly different iModulon names.

[3]:
new_M = ecoli_data.M.copy()
new_M.columns = ['run2_'+str(i) for i in new_M.columns]

For this example, we will only compare the first 5 iModulons.

[4]:
links,dots = compare_ica(ecoli_data.M.iloc[:,:5],new_M.iloc[:,:5])

The comparison data are stored in links. The first two values in each element of links are the iModulon name from the first and second datasets respectively. The final value is the Pearson R correlation between the Independent Component gene weights.

[5]:
links
[5]:
[('AllR/AraC/FucR', 'run2_AllR/AraC/FucR', 1.0),
 ('ArcA-1', 'run2_ArcA-1', 1.0),
 ('ArcA-2', 'run2_ArcA-2', 1.0),
 ('ArgR', 'run2_ArgR', 1.0),
 ('AtoC', 'run2_AtoC', 1.0)]

The graphical comparison is saved as dots. The thickness of the arrows represents how similar the iModulons in the first dataset are to the dataset in the second dataset. Since we are comparing two copies of the same dataset, all arrows are at maximum thickness.

[6]:
dots
[6]:
_images/tutorials_comparing_imodulons_11_0.svg

Comparison across organisms

compare_ica can also be used to compare ICA components between different organisms. To do so, you will need to provide an addition input, which is a string path to a Bidirectional Best Hit (BBH) orthology between your two organisms of interest.

[7]:
example_bbh = example_data.load_example_bbh()
example_bbh.head()
[7]:
gene subject PID alnLength mismatchCount gapOpenCount queryStart queryEnd subjectStart subjectEnd eVal bitScore gene_length COV BBH
3 b0003 USA300HOU_RS06760 26.897 290 183 8 1 279 4 275 2.060000e-25 99.8 310 0.935484 <=>
6 b0007 USA300HOU_RS06905 40.638 470 257 6 1 455 1 463 3.650000e-114 342.0 476 0.987395 <=>
9 b0014 USA300HOU_RS08425 55.818 636 251 4 1 636 1 606 0.000000e+00 678.0 638 0.996865 <=>
10 b0015 USA300HOU_RS08420 46.982 381 187 5 1 371 1 376 8.040000e-110 323.0 376 1.013298 <=>
15 b0023 USA300HOU_RS08455 42.353 85 36 3 1 79 1 78 1.530000e-09 46.6 87 0.977011 <=>

You can also provide a path to your own orthology file, as long as it contains the columns gene and target

It is important that the order of the organism matches the order of the orthology file. Specifically, the organism for the M1 file must be the organism that fills the gene column in the orthology file.

[8]:
staph_data = example_data.load_staph_data()
[9]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
                         ortho_file = example_bbh)
dots
[9]:
_images/tutorials_comparing_imodulons_16_0.svg

Here it is much more clear that some iModulons are more similar between species, whereas others are loosely connected. Arrows can also indicate if an iModulon is composed of multiple iModulons from another dataset.

[10]:
links[:5]
[10]:
[('AllR/AraC/FucR', 'PyrR', 0.261519782054789),
 ('Cbl+CysB', 'CymR', 0.2694541420808301),
 ('Crp-1', 'CcpA-1', 0.2576532196155267),
 ('CsqR', 'LacR', 0.4172511239115126),
 ('CysB', 'CymR', 0.31057785046755)]

Additionally, you can adjust the metric type and cutoff values for correlations by utilizing the method and cutoff parameters. The default for method is "pearson", but can be changed to `"spearman".

cutoff takes a float value between 0 and 1 and will only show correlations greater than the cutoff value (default: 0.3).

[11]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
                         ortho_file = example_bbh,
                         cutoff = 0.6)
dots
[11]:
_images/tutorials_comparing_imodulons_20_0.svg
[12]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
                         ortho_file = example_bbh,
                         cutoff = 0.15, method = 'spearman')
dots
[12]:
_images/tutorials_comparing_imodulons_21_0.svg

Create a BBH orthology file

To create your own BBH orthology file, pymodulon.compare contains a set of functions that will allow you to create the desired bbh.csv file for you specific comparison. This only needs to be done once per comparison.

In order to make the BBH orthology file, you need to obtain the Genbank Full Genome files for your organisms. These can be found at https://www.ncbi.nlm.nih.gov/genome/. Search for both of your organisms/strains and download the “Genbank (full)” file. Once saved into a known location, you can use the make_prots function to convert the Genbank files into protein FASTA files.

make_prots recieves two parameters:

  • gbk: Path to the genbank file (one file per function call)

  • out_path: Path to the protein FASTA files

[13]:
# ecoli_gb= "../pymodulon/test/data/ecoli.gb"
# mtb_gb = "../pymodulon/test/data/mtb.gb"
# ecoli_out = "../pymodulon/test/data/ecoli_prot.fasta"
# mtb_out = "../pymodulon/test/data/mtb_prot.fasta"

# make_prots(ecoli_gb,ecoli_out)
# make_prots(mtb_gb,mtb_out)
Create Genbank database files

In order to run Bidirectional Best Hits between two organisms, it is required to build the necessary Genbank Database files from the newly created protein FASTA files. In order to do so, use the make_prots_db function. You will do this for both organims.

make_prots_db recieves one parameter:

  • fasta_file: String path to protein FASTA file

If there is an error, the function will print out error message. The function will also not execut if the necessary Genbank DB files already exist.

[14]:
# make_prot_db(ecoli_out)
# make_prot_db(mtb_out)
Create BBH CSV

Once the previous steps have been completed, you can now create the BBH CSV file. In order to do so, you can use the get_bbh function.

get_bbh has the following parameters:

  • db1: String path to protein FASTA file (output of make_prots function) for organism 1

  • db2: String path to protein FASTA file (output of make_prots function) for organism 2

  • outdir: String path to output directory, default is “bbh” and will create the directory if it does not exist

  • outname: Default db1_vs_db2_parsed.csv where db[1-2] are the passed arguments name of the csv file where that will save the results

  • mincov: Minimum coverage to call hits in BLAST, must be between 0 and 1

  • evalue: evalue thershold for BLAST hits, Default .001

  • threads: Number of threads to run BLAST, Default 1

  • force: Whether to overwrite existing files or not

  • savefiles: Whether to save files to outdir

The function will return a Pandas DataFrame containing all the BLAST hits between the two genes. The function will also save the DataFrame to a csv file, which you can then use in compare_ica between the two organisms.

[15]:
# get_bbh(ecoli_out,mtb_out, outdir = "../pymodulon/test/data/")
[ ]:

Creating the Gene Table

A copy of this notebook is available on the PyModulon GitHub page: Creating the Gene Table Notebook

First, download the FASTA and GFF files for your organism and its plasmids from NCBI. We will use the E. coli K-12 MG1655 strain as an example, which does not have plasmids.

[2]:
from pymodulon import example_data

Get information from GFF files

[3]:
from pymodulon.gene_util import *

Enter the location of all your GFF files here:

[4]:
gff_files = [example_data.ecoli_gff]

The following cell will convert all the GFF files into a single Pandas DataFrame for easy manipulation. Pseudogenes have multiple rows in a GFF file (one for each fragment), but only the first fragment will be kept.

[5]:
keep_cols = ['accession','start','end','strand','gene_name','old_locus_tag','gene_product','ncbi_protein']

DF_annot = gff2pandas(gff_files,index='locus_tag')
DF_annot = DF_annot[keep_cols]

DF_annot.head()
WARNING:root:Duplicate locus_tag detected. Dropping duplicates.
[5]:
accession start end strand gene_name old_locus_tag gene_product ncbi_protein
locus_tag
b0001 NC_000913.3 190 255 + thrL None thr operon leader peptide NP_414542.1
b0002 NC_000913.3 337 2799 + thrA None fused aspartate kinase/homoserine dehydrogenase 1 NP_414543.1
b0003 NC_000913.3 2801 3733 + thrB None homoserine kinase NP_414544.1
b0004 NC_000913.3 3734 5020 + thrC None threonine synthase NP_414545.1
b0005 NC_000913.3 5234 5530 + yaaX None DUF2502 domain-containing protein YaaX NP_414546.1

To ensure that the gene index used is identical to the expression matrix, load in your data.

[6]:
DF_log_tpm = example_data.load_example_log_tpm()
DF_log_tpm.head()
[6]:
ecoli_00001 ecoli_00002 ecoli_00009 ecoli_00010
Geneid
b0001 10.473721 10.271944 10.315476 10.808135
b0002 10.260569 10.368555 10.735874 10.726916
b0003 9.920277 10.044224 10.528432 10.503092
b0004 9.936694 10.010638 9.739519 9.722997
b0005 7.027515 7.237449 6.745798 6.497823

Check that the genes are the same in the expression dataset as in the annotation dataframe. Mismatched genes are listed below.

[7]:
test = DF_annot.sort_index().index == DF_log_tpm.sort_index().index
DF_annot[~test]
[7]:
accession start end strand gene_name old_locus_tag gene_product ncbi_protein
locus_tag

(Optional) KEGG and COGs

Generate nucleotide fasta files for CDS

Enter the location of all your fasta files here:

[8]:
fasta_files = [example_data.ecoli_fasta]

The following code generates CDS files using your FASTA and GFF3 files

[9]:
from Bio import SeqIO

cds_list = []
for fasta in fasta_files:
    seq = SeqIO.read(fasta,'fasta')

    # Get gene information for genes in this fasta file
    df_genes = DF_annot[DF_annot.accession == seq.id]

    for i,row in df_genes.iterrows():
        cds = seq[row.start-1:row.end]
        if row.strand == '-':
            cds = seq[row.start-1:row.end].reverse_complement()
        cds.id = row.name
        cds.description = row.gene_name if pd.notnull(row.gene_name) else row.name
        cds_list.append(cds)
[10]:
cds_list[:5]
[10]:
[SeqRecord(seq=Seq('ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAAC...TGA', SingleLetterAlphabet()), id='b0001', name='NC_000913.3', description='thrL', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTG...TGA', SingleLetterAlphabet()), id='b0002', name='NC_000913.3', description='thrA', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGAT...TAA', SingleLetterAlphabet()), id='b0003', name='NC_000913.3', description='thrB', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCC...TAA', SingleLetterAlphabet()), id='b0004', name='NC_000913.3', description='thrC', dbxrefs=[]),
 SeqRecord(seq=Seq('GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCC...TAA', SingleLetterAlphabet()), id='b0005', name='NC_000913.3', description='yaaX', dbxrefs=[])]

Uncomment the line below to write the CDS file

[11]:
# SeqIO.write(cds_list,'CDS.fna','fasta')
Run EggNOG Mapper
  1. Go to http://eggnog-mapper.embl.de/.

  2. Upload the CDS.fna file from your organism directory (within the sequence_files folder)

  3. Make sure to limit the taxonomy to the correct level

  4. After the job is submitted, you must follow the link in your email to run the job.

  5. Once the job completes (after ~4 hrs), download the annotations file.

  6. Save the annotation file

Get KEGG IDs

Once you have the EggNOG annotations, load the annotation file

[12]:
eggnog_file = example_data.ecoli_eggnog
[13]:
DF_eggnog = pd.read_csv(eggnog_file,sep='\t',skiprows=4,header=None)
eggnog_cols = ['query_name','seed eggNOG ortholog','seed ortholog evalue','seed ortholog score',
               'Predicted taxonomic group','Predicted protein name','Gene Ontology terms',
               'EC number','KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction',
               'KEGG_rclass','BRITE','KEGG_TC','CAZy','BiGG Reaction','tax_scope',
               'eggNOG OGs','bestOG_deprecated','COG','eggNOG free text description']

DF_eggnog.columns = eggnog_cols

# Strip last three rows as they are comments
DF_eggnog = DF_eggnog.iloc[:-3]

# Set locus tag as index
DF_eggnog = DF_eggnog.set_index('query_name')
DF_eggnog.index.name = 'locus_tag'

DF_eggnog.head()
[13]:
seed eggNOG ortholog seed ortholog evalue seed ortholog score Predicted taxonomic group Predicted protein name Gene Ontology terms EC number KEGG_orth KEGG_pathway KEGG_module ... KEGG_rclass BRITE KEGG_TC CAZy BiGG Reaction tax_scope eggNOG OGs bestOG_deprecated COG eggNOG free text description
locus_tag
b0002 316407.85674276 0.000000e+00 1600.5 Escherichia thrA GO:0003674,GO:0003824,GO:0004072,GO:0004412,GO... 1.1.1.3,2.7.2.4 ko:K12524 ko00260,ko00261,ko00270,ko00300,ko01100,ko0111... M00016,M00017,M00018,M00526,M00527 ... RC00002,RC00043,RC00087 ko00000,ko00001,ko00002,ko01000 NaN NaN iECDH1ME8569_1439.ECDH1ME8569_0002,iEcDH1_1363... Escherichia 1MW3H@1224,1RN1G@1236,3XN2A@561,COG0460@1,COG0... NA|NA|NA E homoserine dehydrogenase I
b0003 316407.85674277 3.000000e-178 630.9 Escherichia thrB GO:0000096,GO:0000097,GO:0003674,GO:0003824,GO... 2.7.1.39 ko:K00872 ko00260,ko01100,ko01110,ko01120,ko01230,map002... M00018 ... RC00002,RC00017 ko00000,ko00001,ko00002,ko01000 NaN NaN iECSE_1348.ECSE_0003 Escherichia 1MW8I@1224,1RMYR@1236,3XN01@561,COG0083@1,COG0... NA|NA|NA F Catalyzes the ATP-dependent phosphorylation of...
b0004 316407.21321894 1.200000e-246 858.6 Escherichia thrC GO:0003674,GO:0003824,GO:0004795,GO:0005575,GO... 4.2.3.1 ko:K01733 ko00260,ko00750,ko01100,ko01110,ko01120,ko0123... M00018 ... RC00017,RC00526 ko00000,ko00001,ko00002,ko01000 NaN NaN iLF82_1304.LF82_2261,iNRG857_1313.NRG857_00025 Escherichia 1MUWQ@1224,1RQ0H@1236,3XP31@561,COG0498@1,COG0... NA|NA|NA E Catalyzes the gamma-elimination of phosphate f...
b0005 316407.21321895 1.400000e-23 115.5 Escherichia yaaX NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN Escherichia 1N4SV@1224,1S9KR@1236,2B58A@1,32XJS@2,3XRGB@561 NA|NA|NA S Protein of unknown function (DUF2502)
b0006 198214.SF0006 1.600000e-143 515.4 Gammaproteobacteria NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN Escherichia 1MUAF@1224,1RMTD@1236,COG3022@1,COG3022@2 NA|NA|NA S Belongs to the UPF0246 family

5 rows × 21 columns

Now we will pull the KEGG information from the eggNOG file, including orthology, pathway, module, and reactions for each gene.

[14]:
DF_kegg = DF_eggnog[['KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction']]

# Melt dataframe
DF_kegg = DF_kegg.reset_index().melt(id_vars='locus_tag')

# Remove null values
DF_kegg = DF_kegg[DF_kegg.value.notnull()]

# Split comma-separated values into their own rows
list2struct = []
for name,row in DF_kegg.iterrows():
    for val in row.value.split(','):
        list2struct.append([row.locus_tag,row.variable,val])

DF_kegg = pd.DataFrame(list2struct,columns=['gene_id','database','kegg_id'])

# Remove ko entries, as only map entries are searchable in KEGG pathway
DF_kegg = DF_kegg[~DF_kegg.kegg_id.str.startswith('ko')]

DF_kegg.head()
[14]:
gene_id database kegg_id
2750 b0002 KEGG_pathway map00260
2751 b0002 KEGG_pathway map00261
2752 b0002 KEGG_pathway map00270
2753 b0002 KEGG_pathway map00300
2754 b0002 KEGG_pathway map01100
Save KEGG information

Uncomment the line below to save the KEGG file

[15]:
# DF_kegg.to_csv('kegg_mapping.csv')
Save COGs to annotation dataframe
[16]:
DF_annot['COG'] = DF_eggnog.COG

# Make sure COG only has one entry per gene
DF_annot['COG'] = [item[0] if isinstance(item,str) else item for item in DF_annot['COG']]

Uniprot ID mapping

The uniprot_id_mapping function is a python wrapper for the Uniprot ID mapping tool. Use input_id=P_REFSEQ_AC if the FASTA/GFFf files are from RefSeq, and input_id=EMBL if the files are from Genbank.

[17]:
mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='P_REFSEQ_AC',output_id='ACC',
                             input_name='ncbi_protein',output_name='uniprot')
mapping_uniprot.head()
[17]:
ncbi_protein uniprot
2090 NP_416515.1 A0A0G3HGF9
4381 NP_418733.1 A0A0N9YI43
3651 NP_418014.1 A0A223DQZ5
26 NP_414563.1 A0A385XJ53
1978 NP_416408.1 A0A385XJ53
[18]:
# Merge with current annotation
DF_annot = pd.merge(DF_annot.reset_index(),mapping_uniprot,how='left',on='ncbi_protein')
DF_annot.set_index('locus_tag',inplace=True)
DF_annot.head()
[18]:
accession start end strand gene_name old_locus_tag gene_product ncbi_protein COG uniprot
locus_tag
b0001 NC_000913.3 190 255 + thrL None thr operon leader peptide NP_414542.1 NaN P0AD86
b0002 NC_000913.3 337 2799 + thrA None fused aspartate kinase/homoserine dehydrogenase 1 NP_414543.1 E P00561
b0003 NC_000913.3 2801 3733 + thrB None homoserine kinase NP_414544.1 F P00547
b0004 NC_000913.3 3734 5020 + thrC None threonine synthase NP_414545.1 E P00934
b0005 NC_000913.3 5234 5530 + yaaX None DUF2502 domain-containing protein YaaX NP_414546.1 S P75616

Add Biocyc Operon information

To obtain operon information from Biocyc, follow the steps below

  1. Go to Biocyc.org (you may need to create an account and/or login)

  2. Change the organism database to your organism/strain

  3. Select SmartTables -> Special SmartTables

  4. Select “All genes of <organism>”

  5. Select the “Gene Name” column

  6. Under “ADD TRANSFORM COLUMN” select “Genes in same transcription unit”

  7. Select the “Genes in same transcription unit” column

  8. Under “ADD PROPERTY COLUMN” select “Accession-1”

  9. Under OPERATIONS, select “Export” -> “to Spreadsheet File…”

  10. Select “common names” and click “Export smarttable”

  11. Add file location below and run the code cell

[19]:
biocyc_file = example_data.ecoli_biocyc

DF_biocyc = pd.read_csv(biocyc_file,sep='\t')

# Remove genes with no accession
DF_biocyc = DF_biocyc[DF_biocyc['Accession-1'].notnull()]

# Set the accession (i.e. locus tag) as index
DF_biocyc = DF_biocyc.set_index('Accession-1').sort_values('Left-End-Position')

# Only keep genes in the final annotation file
DF_biocyc = DF_biocyc.reindex(DF_annot.index)

# Reformat transcription units
DF_biocyc['operon_list'] = DF_biocyc['Accession-1.1'].apply(reformat_biocyc_tu)

DF_biocyc.head()
[19]:
Gene Name Left-End-Position Right-End-Position Product Genes in same transcription unit Accession-1.1 operon_list
locus_tag
b0001 thrL 190.0 255.0 <i>thr</i> operon leader peptide thrL // thrA // thrB // thrC b0001 // b0002 // b0003 // b0004 b0001;b0002;b0003;b0004
b0002 thrA 337.0 2799.0 fused aspartate kinase/homoserine dehydrogenase 1 thrA // thrB // thrC // thrL b0002 // b0003 // b0004 // b0001 b0001;b0002;b0003;b0004
b0003 thrB 2801.0 3733.0 homoserine kinase thrA // thrB // thrC // thrL b0002 // b0003 // b0004 // b0001 b0001;b0002;b0003;b0004
b0004 thrC 3734.0 5020.0 threonine synthase thrA // thrB // thrC // thrL b0002 // b0003 // b0004 // b0001 b0001;b0002;b0003;b0004
b0005 yaaX 5234.0 5530.0 DUF2502 domain-containing protein YaaX yaaX b0005 b0005
Assign unique IDs to operons

The following code assigns unique names to each operon

[20]:
# Get all operons
operons = DF_biocyc['operon_list'].unique()

# Map each operon to a unique string
operon_dict = {operon: "Op"+str(i) for i, operon in enumerate(operons)}

# Add names to dataframe
DF_biocyc['operon'] = [operon_dict[op] for op in DF_biocyc["operon_list"]]

DF_biocyc.head()
[20]:
Gene Name Left-End-Position Right-End-Position Product Genes in same transcription unit Accession-1.1 operon_list operon
locus_tag
b0001 thrL 190.0 255.0 <i>thr</i> operon leader peptide thrL // thrA // thrB // thrC b0001 // b0002 // b0003 // b0004 b0001;b0002;b0003;b0004 Op0
b0002 thrA 337.0 2799.0 fused aspartate kinase/homoserine dehydrogenase 1 thrA // thrB // thrC // thrL b0002 // b0003 // b0004 // b0001 b0001;b0002;b0003;b0004 Op0
b0003 thrB 2801.0 3733.0 homoserine kinase thrA // thrB // thrC // thrL b0002 // b0003 // b0004 // b0001 b0001;b0002;b0003;b0004 Op0
b0004 thrC 3734.0 5020.0 threonine synthase thrA // thrB // thrC // thrL b0002 // b0003 // b0004 // b0001 b0001;b0002;b0003;b0004 Op0
b0005 yaaX 5234.0 5530.0 DUF2502 domain-containing protein YaaX yaaX b0005 b0005 Op1

Finally, merge the Biocyc information with the main annotation DataFrame

[21]:
DF_annot['operon'] = DF_biocyc['operon']

Clean up and save annotation

First, we will re-order the annotation columns

[22]:
if 'old_locus_tag' in DF_annot.columns:
    order = ['gene_name','accession','old_locus_tag','start','end','strand','gene_product','COG','uniprot','operon']
else:
    order = ['gene_name','accession','start','end','strand','gene_product','COG','uniprot','operon']

DF_annot = DF_annot[order]
[23]:
DF_annot.head()
[23]:
gene_name accession old_locus_tag start end strand gene_product COG uniprot operon
locus_tag
b0001 thrL NC_000913.3 None 190 255 + thr operon leader peptide NaN P0AD86 Op0
b0002 thrA NC_000913.3 None 337 2799 + fused aspartate kinase/homoserine dehydrogenase 1 E P00561 Op0
b0003 thrB NC_000913.3 None 2801 3733 + homoserine kinase F P00547 Op0
b0004 thrC NC_000913.3 None 3734 5020 + threonine synthase E P00934 Op0
b0005 yaaX NC_000913.3 None 5234 5530 + DUF2502 domain-containing protein YaaX S P75616 Op1

Final statistics

The following graphs show how much information is available for the organism.

[24]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('ticks')
[25]:
fig,ax = plt.subplots()
DF_annot.count().plot(kind='bar',ax=ax)
ax.set_ylabel('# of Values',fontsize=18)
ax.tick_params(labelsize=16)
_images/tutorials_creating_the_gene_table_54_0.png

Fill missing values

Some organisms are missing gene names, so these will be filled with locus tag gene names.

[26]:
# Fill in missing gene names with locus tag names
DF_annot['tmp_name'] = DF_annot.copy().index.tolist()
DF_annot.gene_name.fillna(DF_annot.tmp_name,inplace=True)
DF_annot.drop('tmp_name',axis=1,inplace=True)

COG letters will also be converted to the full name.

[27]:
# Fill missing COGs with X
DF_annot['COG'].fillna('X',inplace=True)

# Change single letter COG annotation to full description
DF_annot['COG'] = DF_annot.COG.apply(cog2str)

counts = DF_annot.COG.value_counts()
plt.pie(counts.values,labels=counts.index);
_images/tutorials_creating_the_gene_table_59_0.png

Uncomment the following line to save the gene annotation dataset

[28]:
# DF_annot.to_csv('gene_info.csv')

GO Annotations

To start, download the GO Annotations for your organism from AmiGO 2

  1. Go to AmiGO 2

  2. Filter for your organism

  3. Click CustomDL

  4. Drag GO class (direct) to the end of your Selected Fields

  5. Enter the location of your GO annotation file below and run the following code block

[29]:
go_file = example_data.ecoli_go_example
[30]:
DF_GO = pd.read_csv(go_file,sep='\t',header=None,usecols=[2,10,17])
DF_GO.columns = ['gene_name','gene_id','gene_ontology']
DF_GO.gene_id.fillna(DF_GO.gene_name,inplace=True)
DF_GO = DF_GO[['gene_id','gene_ontology']]
DF_GO.head()
[30]:
gene_id gene_ontology
0 b0312|ECK0310 betaine-aldehyde dehydrogenase activity
1 b0312|ECK0310 betaine-aldehyde dehydrogenase activity
2 b0312|ECK0310 betaine-aldehyde dehydrogenase activity
3 b0312|ECK0310 betaine-aldehyde dehydrogenase activity
4 b0312|ECK0310 response to osmotic stress

Take a look at the gene_id column: 1. Make sure there are no null entries 2. Check which naming convention is used (locus tag, new locus tag, or gene name)

If it looks like it uses the old locus tag, set old_locus_tag to True.

In this example, the gene_id column needs to be altered slightly to match with the gene name. This may not be necessary for your organism.

[31]:
def format_go_gene_id(gene_id):
    locus_tags = [item for item in gene_id.split('|') if re.match('b\d{4}',item)]
    if len(locus_tags) > 0:
        return locus_tags[0]
    else:
        return None

DF_GO.gene_id = DF_GO.gene_id.apply(format_go_gene_id)

Now we remove null entries

[32]:
DF_GO = DF_GO[DF_GO.gene_id.notnull()]
[33]:
naming = 'locus_tag' # Can be "gene_name" or "old_locus_tag"

if naming != 'locus_tag':
    convert_tags = {value:key for key,value in DF_annot[naming].items()}
    DF_GO.gene_id = DF_GO.gene_id.apply(lambda x: convert_tags[x])
[34]:
DF_GO.head()
[34]:
gene_id gene_ontology
0 b0312 betaine-aldehyde dehydrogenase activity
1 b0312 betaine-aldehyde dehydrogenase activity
2 b0312 betaine-aldehyde dehydrogenase activity
3 b0312 betaine-aldehyde dehydrogenase activity
4 b0312 response to osmotic stress

Uncomment the line below to save the annotations

[35]:
# DF_GO[['gene_id','gene_ontology']].to_csv('GO_annotations.csv')

Additional Functions

Identifying “single-gene” iModulons

Some iModulons are dominated by a single, high-coefficient gene. These iModulons may result from:

  1. Overdecomposition of the dataset to identify noisy genes

  2. Artificial knock-out of single genes

  3. Regulons with only one target gene

No matter what causes these iModulons, it is important to be aware of them. The find_single_gene_imodulons function identifies iModulons that are likely dominated by a single gene.

The iModulons identified by find_single_gene_imodulons may contain more than one gene, since a threshold-agnostic method is used to identify these iModulons.

[1]:
from pymodulon.example_data import load_ecoli_data
ica_data = load_ecoli_data()
[2]:
from pymodulon.plotting import *
[3]:
fig,ax= plt.subplots(figsize=(3,3))
plot_regulon_histogram(ica_data,'Cra',ax=ax);
_images/tutorials_additional_functions_4_0.png
[4]:
ica_data.find_single_gene_imodulons()
[4]:
['Pyruvate', 'fur-KO', 'sgrT', 'thrA-KO', 'ydcI-KO']

Single-gene iModulons can be automatically saved in the iModulon table.

[5]:
ica_data.find_single_gene_imodulons()
ica_data.imodulon_table.loc[['fur-KO']]
[5]:
regulator f1score pvalue precision recall TP n_genes n_tf Category threshold
fur-KO None NaN NaN NaN NaN NaN 13 0 Genomic Alterations 0.078193

Explained variance

To compute the total amount of expression variation explained by iModulons, use the explained_variance function.

[6]:
from pymodulon.util import explained_variance
[7]:
explained_variance(ica_data)
[7]:
0.6946207831724602

Explained variance can be computed for individual genes, samples, or iModulons

[8]:
print('iModulons explain {:.0f}% of the expression variation in thrA'.format(explained_variance(ica_data,genes='thrA')*100))
print('The RpoH iModulon explains {:.0f}% of the expression variation in the dnaKJ transcription unit'.format(explained_variance(ica_data,imodulons=['RpoH'],genes=['dnaK','dnaJ'])*100))
print('iModulons explain {:.0f}% of the expression variation induced by iron starvation'.format(explained_variance(ica_data,samples=['fur__wt_dpd__1','fur__wt_dpd__2'])*100))
print('The RpoS iModulon explains {:.1f}% of the global expression variation'.format(explained_variance(ica_data,imodulons=['RpoS'])*100))
iModulons explain 93% of the expression variation in thrA
The RpoH iModulon explains 69% of the expression variation in the dnaKJ transcription unit
iModulons explain 70% of the expression variation induced by iron starvation
The RpoS iModulon explains 4.3% of the global expression variation

The following code shows how to identify iModulons that explain the largest fraction of variance in the dataset. These iModulons tend to represent the effects of global regulators.

[9]:
import pandas as pd

rec_var = {}
for k in ica_data.imodulon_names:
    rec_var[k] = explained_variance(ica_data,imodulons=k)
df_rec_var = pd.Series(rec_var)
df_rec_var = df_rec_var.sort_values(ascending=False)
df_rec_var.head(10)
[9]:
RpoS                 0.043242
FlhDC                0.042180
Fnr                  0.034855
FliA                 0.034259
uncharacterized-5    0.032859
Fur-1                0.020544
Crp-1                0.018618
ArcA-1               0.017906
PurR-2               0.017560
NtrC+RpoN            0.016785
dtype: float64
[10]:
import matplotlib.pyplot as plt
[11]:
plt.barh(range(10,0,-1),df_rec_var.head(10),tick_label = df_rec_var.head(10).index)
plt.xlabel('Fraction of Explained Variance',fontsize=16)
# Rename uncharacterized-5 to ppGpp
new_labels = [x if x != 'uncharacterized-5' else 'ppGpp' for x in df_rec_var.head(10).index]
plt.yticks(range(10,0,-1),labels=new_labels,fontsize=14);
_images/tutorials_additional_functions_16_0.png
Plotting Explained Variance
[12]:
from pymodulon.plotting import plot_explained_variance
[13]:
plot_explained_variance(ica_data, pc=True);
_images/tutorials_additional_functions_19_0.png
[ ]:

Creating an iModulonDB Dashboard

After an iModulon analysis is completed and published, dashboards for every iModulon and gene are made available on the iModulonDB website. The PyModulon package enables generation of all necessary files for any IcaData object that meets a few simple requirements.

iModulonDB Site Overview

For general information about iModulonDB, visit the about page. Familiarize yourself with the main page types:

  • Splash page: The main entry point for the site

  • Dataset page: After choosing a dataset on the splash page, you are directed to the dataset page, which contains gray dataset metadata box and a table of iModulons.

  • iModulon page: The iModulon dashboards, which have a gray iModulon metadata box, a gene table, and several plots.

  • Gene page: Gene dashboards, which have an iModulon table connecting them to iModulons and an expression plot.

It’s helpful to understand the two parts to most websites:

  • Front End: The front end runs in a web user’s browser. It requests and receives files from the back end, and contains JavaScript functions which can create interactive plots and other features. iModulonDB’s front end code is available at this GitHub repository.

  • Back End: The back end runs in a server, which is usually operated by the site owner. It sends the necessary files to the front end when requested. With iModulonDB, all back end data is precomputed by the PyModulon package using the instructions provided here. Output files are uploaded to GitHub Pages, which serves the files to users. If you host your own version of the site, the local server will be the back end.

To generate and view iModulonDB dashboards for a custom project, follow these steps:

  1. Clone the iModulonDB GitHub repository.

  2. Follow the instructions in this document to generate your own files for the new IcaData object of interest. In imodulondb_export, specify the path to the cloned repository so that PyModulon can place new files in the appropriate locations.

  3. Host a local server from the repository folder. This will act as the back end in place of GitHub Pages.

  4. In your browser, visit http://localhost:8000/dataset.html?organism=new_organism&dataset=new_dataset. The organism and dataset names in the URL are customizable.

[1]:
from pymodulon.core import IcaData
from pymodulon.imodulondb import *
from pymodulon.example_data import load_ecoli_data
[2]:
# Increase the maximum column width
pd.set_option('display.max_colwidth', None)
[3]:
ica_data = load_ecoli_data()

The imodulondb_compatibility Function

imodulondb_compatibility is essentially a convenient checklist for iModulonDB-relevant metadata. It has four outputs, which we will go through one-by-one.

Optional arguments:

  • inplace: Whether to modify the object. It is not recommended to change this from its default (False).

  • tfcomplex_to_gene: Dictionary mapping transcription factor complexes to gene names. See the section on tf_issues.has_gene for more information.

Returns:

  • table_issues: Dataframe describing missing columns from the major annotation tables and the site’s behavior if they are not updated. Three elements in this table are “CRITICAL”, meaning the site cannot be generated without them. If those elements are missing, they will be entries in this table as well as warnings.

  • tf_issues: Dataframe describing each regulator from the imodulon_table that can’t be mapped to the trn, tf_links, and/or gene_table.

  • missing_gene_links: Series listing each gene that doesn’t have a gene_link to an external database.

  • missing_dois: Index listing each sample that doesn’t have an associated DOI in the sample_table. Clicking on these samples in the activity plots on the table will not bring you to a relevant paper.

Call this function and browse its output. Not all issues it lists will be important to fix. In some cases, the default values will be fine for your purposes, or the match it is seeking might not exist.

[4]:
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)

For demonstration purposes, we will also create a minimal IcaData object called blank_data containing only the M and A matrices. If we call imodulondb_compatibility with this object, the table_issues output will contain all possible issues.

Note the three warnings which appear when this happens; seeing any of these indicates that iModulonDB export cannot be completed. The possible CRITICAL issues are:

  • No X matrix: Each gene page contains a plot of gene expression, so you must provide the X matrix.

  • No project column in sample_table: See the section on table_issues.sample_table.

  • No condition column in sample_table: See the section on table_issues.sample_table.

[5]:
blank_data = IcaData(ica_data.M, ica_data.A)
all_table_issues, b_tf_issues, b_missing_gene_links, b_missing_dois = imodulondb_compatibility(blank_data)
WARNING:root:Critical issue: No X matrix
WARNING:root:Critical issue: No project column in sample_table.
WARNING:root:Critical issue: No condition column in sample_table.
The table_issues Output

The first output of imodulondb_compatibility is the table_issues. Each row corresponds to an issue with one of the main class elements. The columns are:

  • Table: which table or other variable the issue is in

  • Missing Column: the column of the Table with the issue (not case sensitive; capitalization is ignored).

  • Solution: Unless “CRITICAL” is in this cell, the site behavior if the issue remained is described here.

If the X matrix is missing, it will be listed in the first row as a CRITICAL issue. After that, the imodulondb_table issues represent missing information from a dictionary specific to iModulonDB. The rest of the rows may relate to the gene_table, sample_table, or imodulon_table. These issues arise from parsing the column names of the annotation tables.

You may not be able to provide the information to fill in the missing columns (for example, if no database of gene functions exist for your organism). If you’d like, you can partially fill in any column and leave np.nan values for the information you do not know. In most cases, you can continue to omit the column - the Solution column of table_issues will describe what happens in those cases.

Column names are case insensitive. Be sure not to have multiple columns with matching names, as that will cause errors later.

[6]:
table_issues
[6]:
Table Missing Column Solution
0 iModulonDB organism The default, "New Organism", will be used.
1 iModulonDB dataset The default, "New Dataset", will be used.
2 iModulonDB strain The default, "Unspecified", will be used.
3 iModulonDB publication_name The default, "Unpublished Study", will be used.
4 iModulonDB publication_link The publication name will not be a hyperlink.
5 iModulonDB gene_link_db The default, "External Database", will be used.
6 iModulonDB organism_folder The default, "new_organism", will be used.
7 iModulonDB dataset_folder The default, "new_dataset", will be used.
8 Gene gene_product Locus tags (gene_table.index) will be used.
9 Sample sample The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
10 Sample n_replicates This column will be generated for you.
11 iModulon name imodulon_table.index will be used.
12 iModulon function The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
13 iModulon exp_var This column will be left blank.
The imodulondb_table Variable

Unless you are missing the X matrix, the first set of table_issues will be relating to the iModulonDB table (imodulondb_table), which is a dictionary of details about the dataset in general. As you can see by reading the Solution column of table_issues for these entries, most entries have a default value (and the default behavior for a missing link is to not create a link).

Each entry affects the following parts of the site:

  • organism: Appears in the dataset page metadata box, and gets programmatically changed into a short form (“Eshcerichia coli” –> “E. coli”) to appear in the dataset title on iModulon and Gene pages. Be sure to use the form “Genus species”.

  • dataset: The specific dataset title, which could distinguish it from other datasets in this organism. Appears in the dataset page metadata box, and gets appended to the shortened organism name in the dataset title on iModulon and Gene pages.

  • strain: Appears on the dataset pages in the metadata box.

  • publication_name: Appears on the dataset pages in the metadata box. Typically we use the form “Smith, et al., year”.

  • publication_link: The publication_name will be turned into a hyperlink if this entry is not blank. DOIs preferred. PyModulon does not test the validity of any links, so be sure this is a valid link.

  • gene_link_db: Appears on Gene pages in the metadata box if a gene_link is also available for the gene. PyModulon expects all gene links to go to specific gene pages from the same database, and for the database name to be the one shown here. If gene links are not included, then “Not Available” will appear in place of the gene_link_db name. Examples of gene link databases include EcoCyc, SubtiWiki, and AureoWiki.

  • organism_folder: Determines output file location names. Also used in the URLs. Cannot contain spaces or slashes.

  • dataset_folder: Determines output file location names. Also used in URLs. Cannot contain spaces or slashes.

Any files with matching organism_folder and dataset_folder will be overwritten.

If you would like to italicize any part of these entries, you can use HTML tags: <i>italic</i>. These are automatically applied to organism names in the site.

[7]:
# Here are ALL possible issues with the imodulondb_table
all_table_issues.loc[all_table_issues.Table == 'iModulonDB']
[7]:
Table Missing Column Solution
1 iModulonDB organism The default, "New Organism", will be used.
2 iModulonDB dataset The default, "New Dataset", will be used.
3 iModulonDB strain The default, "Unspecified", will be used.
4 iModulonDB publication_name The default, "Unpublished Study", will be used.
5 iModulonDB publication_link The publication name will not be a hyperlink.
6 iModulonDB gene_link_db The default, "External Database", will be used.
7 iModulonDB organism_folder The default, "new_organism", will be used.
8 iModulonDB dataset_folder The default, "new_dataset", will be used.
[8]:
# here are the issues relevant to ica_data.imodulondb_table
# in this case, all issues appear
table_issues.loc[table_issues.Table == 'iModulonDB']
[8]:
Table Missing Column Solution
0 iModulonDB organism The default, "New Organism", will be used.
1 iModulonDB dataset The default, "New Dataset", will be used.
2 iModulonDB strain The default, "Unspecified", will be used.
3 iModulonDB publication_name The default, "Unpublished Study", will be used.
4 iModulonDB publication_link The publication name will not be a hyperlink.
5 iModulonDB gene_link_db The default, "External Database", will be used.
6 iModulonDB organism_folder The default, "new_organism", will be used.
7 iModulonDB dataset_folder The default, "new_dataset", will be used.
[9]:
# issues 0-7
# complete the imodulondb_table
ica_data.imodulondb_table = {
     'organism': 'Escherichia coli',
     'dataset': 'PRECISE 1',
     'strain': 'K-12 MG1655 and BW25113',
     'publication_name': 'Sastry, et al., 2019',
     'publication_link': 'https://doi.org/10.1038/s41467-019-13483-w',
     'gene_link_db': 'EcoCyc',
     'organism_folder': 'e_coli',
     'dataset_folder': 'precise1'
}
[10]:
# if we now re-call the function, the issues we fixed are removed.
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
table_issues.loc[table_issues.Table == 'iModulonDB']
[10]:
Table Missing Column Solution
The gene_table Variable

The gene_table contains details about each gene and shares an index with the M and X matrices. Its information is included in the following parts of the site:

  • Gene Tables on the iModulon Pages

  • Metadata box on the Gene Pages

  • Search results

  • name, cog, and gene_start are shown when a gene is hovered over in the gene scatter plot (middle right of iModulon Pages). cog is also used to color this scatter plot, and gene_start defines the x axis.

Note that column names are case-insensitive.

Some additional considerations for each column are described below:

  • gene_name: Try to use the most up to date names. Also, if a gene encodes a transcription factor (regulator) in the TRN, make sure that the names match (e.g. the transcription factor listed in the trn as ‘RpoS’ is encoded by the gene with gene_name ‘rpoS’). See the section on tf_issues.has_gene for more information.

  • gene_product: This field is meant to be a concise but specific description of the function/product of the gene, such as ‘homoserine kinase’. You can use html tags in this field if superscripts or subscripts are desired.

  • cog: This stands for ‘cluster of orthologous groups’, and should be a larger category of genes such as ‘Carbohydrate transport and metabolism’. In addition to being displayed everywhere listed above, each COG is assigned a random color in the gene scatter plot, so it is especially useful to have them.

  • gene_start: This is not displayed anywhere, but it is used as the x axis value for each gene in the scatter plots on the iModulon Pages. If a few gene_starts are missing, those genes will have a value of 0 on the x axis. If all gene_starts are not provided, then the order of the gene_table will be used to assign integer values to each gene for the x axis of this plot.

  • operon: Preferably, this field will be the shortest string listing all genes, such as ‘artPIQM’, ‘argT-hisJQMP’, or ‘pdxB-usg-truA-dedA’. For very long operons with names that don’t share letters, you can list the first and last gene as in ‘yitB->yisZ’. iModulonDB is not picky about this column, however, so operons as provided by another database are acceptable.

  • regulator: Adding a TRN will automatically generate this column for you. It should contain a comma-separated list of all regulators that regulate this gene, e.g. ‘ppGpp,Lrp,Nac’.

If any of these columns are listed in your table_issues, be sure to check the gene_table for the appropriate information under other column names. Rename those columns so that they match the names above.

[11]:
# here are all possible issues with the gene_table
all_table_issues.loc[all_table_issues.Table == 'Gene']
[11]:
Table Missing Column Solution
9 Gene gene_name Locus tags (gene_table.index) will be used.
10 Gene gene_product Locus tags (gene_table.index) will be used.
11 Gene cog COG info will not display & the gene scatter plot will not have color.
12 Gene start The x axis of the scatter plot will be a numerical value instead of a genome location.
13 Gene operon Operon info will not display.
14 Gene regulator Regulator info will not display. If you have a TRN, add it to the model to auto-generate this column.
[12]:
# here are the issues relevant to ica_data.gene_table
table_issues.loc[table_issues.Table == 'Gene']
[12]:
Table Missing Column Solution
0 Gene gene_product Locus tags (gene_table.index) will be used.
[13]:
# issue 0
# if gene_product information is available, add it to the gene_table.
The sample_table Variable

The sample_table describes each sample and shares its index with the columns of X and A. Its information is very important for generating the activity and expression bar graphs on the iModulon and gene pages. Familiarize yourself with the activity plot.

Three columns in the sample_table define progressively smaller groupings of samples. It is CRITICAL that these columns exist so that the activity plots can be generated. Short, human readable, and specific names are preferred. Formatting should be consistent within your dataset, but is not necessarily constant throughout iModulonDB.

  • project: This defines the largest grouping of samples, which share a common theme and/or were featured in the same paper. Examples include: ‘biofilm’, ‘acid’, and ‘crp_KO’. In the activity bar graph on the iModulon Pages, the vertical lines separate each project and the names are displayed across the bottom.

  • condition: This is the smallest grouping of samples, used for experimental conditions. Samples have matching project and condition names if and only if they are biological replicates. Examples include: ‘biofilm_t0’, ‘biofilm_t10’, ‘wt_ctrl’, ‘del_crp’. In the activity bar graph on the iModulon Pages, each bar corresponds to a condition, and its height is the average activity value across all samples in the condition. Hovering over a bar provides additional details from the sample table, and zooming in switches the x labels to be conditions instead of projects.

  • sample: This must be unique for each sample. If it is not a column, it is assumed that it is the index of the sample_table. Use a human readable sample name, preferably of the form ‘project__condition__#’, where ‘#’ indicates a replicate number. Examples: ‘carbon__wt_ctrl__1’, ‘carbon__wt_ctrl__2’, ‘carbon__del_crp__1’, ‘biofilm__biofilm_t0__1’. In the activity bar graph on the iModulon Pages, each dot floating near the bar is a sample. Also, the regulon scatter plots contain points corresponding to each sample; sample names will appear when points are hovered over in that plot as well.

Finally, two other columns are checked:

  • n_replicates: For each condition, the samples all must have a matching n_replicates value that is used by the code. You can ignore this output, however, since a missing n_replicates column will automatically be generated. If you would like to remove this issue from the table, you can run the command generate_n_replicates_column(ica_data); otherwise, it will be automatically called on export.

  • doi: If you click on the bars of the activity plot and the first sample of the corresponding condition has a link in the doi column, then you will go to that link. This could be useful to learn more about a given sample. DOIs are preferred, but any link is allowed. PyModulon does not test the validity of any links, so be sure this is a valid link. See the section on missing_dois for more details on this.

Note that in addition to the columns that PyModulon searches for, all columns of the ``sample_table`` can be used to annotate or color the activity plots. The more informative the sample_table is, the more useful the activity plots will be. You can access all the columns in any iModulon or Gene Page by clicking the wrench symbol next to the word ‘Activity’. Checkboxes next to each column name indicate whether its value will be displayed on hovering over the sample, and the ink button allows you to recolor the activity plot by the values of that feature. For example, ‘pH’ is not a required sample_table column, but it could be useful to color or label each sample by that variable. Other useful but completely optional columns include:

  • Strain Description

  • Carbon Source (g/L)

  • Supplement

  • Temperature (C)

  • Time (min)

  • Growth Phase

[14]:
# here are all possible issues with the sample_table
all_table_issues.loc[all_table_issues.Table == 'Sample']
[14]:
Table Missing Column Solution
15 Sample project This is a CRITICAL column defining the largest grouping of samples. Vertical bars in the activity plot will separate projects.
16 Sample condition This is an CRITICAL column defining the smallest grouping of samples. Biological replicates must have matching projects and conditions, and they will appear as single bars with averaged activities.
17 Sample sample The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
18 Sample n_replicates This column will be generated for you.
19 Sample doi Clicking on activity plot bars will not link to relevant papers for the samples.
[15]:
# here are the issues relevant to ica_data.sample_table
table_issues.loc[table_issues.Table == 'Sample']
[15]:
Table Missing Column Solution
1 Sample sample The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
2 Sample n_replicates This column will be generated for you.
[16]:
# issue 1
# check to make sure that the sample_table.index contains good names
ica_data.sample_table.index
[16]:
Index(['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1',
       'fur__wt_dpd__2', 'fur__wt_fe__1', 'fur__wt_fe__2',
       'fur__delfur_dpd__1', 'fur__delfur_dpd__2', 'fur__delfur_fe2__1',
       'fur__delfur_fe2__2',
       ...
       'efeU__menFentC_ale29__1', 'efeU__menFentC_ale29__2',
       'efeU__menFentC_ale30__1', 'efeU__menFentC_ale30__2',
       'efeU__menFentCubiC_ale36__1', 'efeU__menFentCubiC_ale36__2',
       'efeU__menFentCubiC_ale37__1', 'efeU__menFentCubiC_ale37__2',
       'efeU__menFentCubiC_ale38__1', 'efeU__menFentCubiC_ale38__2'],
      dtype='object', length=278)
[17]:
# the sample names above are good; we can now ignore issue 1

# issue 2
# optionally add the n_replicates column
generate_n_replicates_column(ica_data)
The imodulon_table Variable

The imodulon_table describes each iModulon and shares its index with the columns of M and the index of A. Its data is featured on iModulonDB in the following places:

  • Dataset Page, main section

  • iModulon tables on the Gene Pages (regulator, function, and category columns)

  • Metadata box on the iModulon Pages

  • Search results

Some additional considerations for each column are described below:

  • name: Usually, this issue can be ignored because the imodulon_table.index always matches the imodulon_names. If the names are all integers, then the word “iModulon” will be added (“0” –> “iModulon 0”).

  • regulator: This column is very important. Use regulator names that match the TRN. Join regulators using either ‘+’ to represent the intersection of regulons or ‘/’ to represent the union of regulons. Regulator links will be added in the iModulon Page metadata boxes according to the tf_links variable. The content of this column affects the behavior of the gene table, gene histogram, regulon venn diagram, and regulon scatter plots. See the section on tf_issues.

  • function: This is meant to be a specific description of each iModulon’s function, such as “Histidine biosynthesis”.

  • category: This column provides larger groupings of iModulons, such as “Amino Acid Biosynthesis”.

  • n_genes: The number of genes in the iModulon. This can be ignored since it will be computed for you.

  • precision: The overlap between the iModulon and its regulon divided by the size of the iModulon. See the tutorial on gene enrichment analysis.

  • recall: The overlap between the iModulon and its regulon divided by the size of the regulon. See the tutorial on gene enrichment analysis.

  • exp_var: The explained variance when this iModulon alone is used to reconstruct the original X matrix. Use decimal values; they will be converted to percentages in iModulonDB. These can be computed using the explained_variance function; see the tutorial on additional functions.

[18]:
# here are all possible issues with the imodulon_table
all_table_issues.loc[all_table_issues.Table == 'iModulon']
[18]:
Table Missing Column Solution
20 iModulon name imodulon_table.index will be used.
21 iModulon regulator The regulator details will be left blank.
22 iModulon function The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
23 iModulon category The categories will be filled in as "Uncharacterized".
24 iModulon n_genes This column will be computed for you.
25 iModulon precision This column will be left blank.
26 iModulon recall This column will be left blank.
27 iModulon exp_var This column will be left blank.
[19]:
# here are the issues relevant to ica_data.imodulon_table
table_issues.loc[table_issues.Table == 'iModulon']
[19]:
Table Missing Column Solution
3 iModulon name imodulon_table.index will be used.
4 iModulon function The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
5 iModulon exp_var This column will be left blank.
[20]:
# issue 3
# check that the imodulon_table.index contains good names
ica_data.imodulon_table.index
[20]:
Index(['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC', 'BW25113',
       'Cbl+CysB', 'CdaR', 'CecR', 'Copper', 'CpxR', 'Cra', 'Crp-1', 'Crp-2',
       'CsqR', 'CysB', 'DhaR/Mlc', 'EvgA', 'ExuR/FucR', 'FadR', 'FecI',
       'FlhDC', 'FliA', 'Fnr', 'Fur-1', 'Fur-2', 'GadEWX', 'GadWX', 'GcvA',
       'GlcC', 'GlpR', 'GntR/TyrR', 'His-tRNA', 'Leu/Ile', 'Lrp', 'MalT',
       'MetJ', 'Nac', 'NagC/TyrR', 'NarL', 'NikR', 'NtrC+RpoN', 'OxyR', 'PrpR',
       'PurR-1', 'PurR-2', 'PuuR', 'Pyruvate', 'RbsR', 'RcsAB', 'RpoH', 'RpoS',
       'SoxS', 'SrlR+GutM', 'Thiamine', 'Tryptophan', 'XylR', 'YgbI', 'YiaJ',
       'YieP', 'YneJ', 'Zinc', 'crp-KO', 'curli', 'deletion-1', 'deletion-2',
       'duplication-1', 'e14-deletion', 'efeU-repair', 'entC-menF-KO',
       'fimbriae', 'flu-yeeRS', 'fur-KO', 'gadWX-KO', 'insertion',
       'iron-related', 'lipopolysaccharide', 'membrane', 'nitrate-related',
       'proVWX', 'purR-KO', 'sgrT', 'thrA-KO', 'translation',
       'uncharacterized-1', 'uncharacterized-2', 'uncharacterized-3',
       'uncharacterized-4', 'uncharacterized-5', 'uncharacterized-6',
       'ydcI-KO', 'yheO-KO'],
      dtype='object')
[21]:
# the names are good; we can ignore issue 3

# issue 4
# write a function column if desired
# (part of iModulon characterization)

# issue 5
# compute the explained variance for each iModulon
from pymodulon.util import explained_variance

for k in ica_data.imodulon_table.index:
    ica_data.imodulon_table.loc[k, 'exp_var'] = explained_variance(
        ica_data, imodulons=k)
The tf_issues Output

The next output is the tf_issues dataframe. Each row corresponds to a regulator that is used in the imodulon_table. Regulators that satisfy all requirements are omitted. False values in this table represent potentially missing information.

The columns refer to the following:

  • in_trn: whether the regulator is in the model.trn. Regulators not in the TRN will be ignored in the site’s histograms and gene tables. It is highly recommended that all regulators be in the TRN. Any ``False`` values in this column should be fixed by adding rows to the trn and/or ensuring that names all match up between imodulon_table.regulator and trn.regulator.

  • has_link: whether the regulator has a link in ica_data.tf_links. If it does, its name will be clickable in the iModulon Page metadata box. If not, the regulator name will not appear as a hyperlink. Any regulators without dedicated pages in other databases should be ignored.

  • has_gene: whether the regulator can be matched to a gene in the model. This is used to generate the regulon scatter plot which appears at the bottom of the iModulon Page, comparing iModulon activity to the expression of the regulators. If your regulator is a gene, try to ensure that its name matches the gene_table.gene_name. If your regulator doesn’t correspond to a gene (e.g. the small molecule regulator ppGpp), then a False value in this column is acceptable.

Note that in our minimal example, the tf_issues will be empty because there is no regulator column in the imodulon_table.

[22]:
tf_issues
[22]:
in_trn has_link has_gene
allR True False True
fucR True False True
araC True False True
arcA True False True
argR True False True
... ... ... ...
trpR True False True
xylR True False True
yiaJ True False True
zur True False True
zntR True False True

70 rows × 3 columns

The in_trn Column

As described above, any False values in this column should definitely be fixed by ensuring name match-ups between imodulon_table.regulator and trn.regulator columns, or by adding rows to trn. True values and omitted regulators indicate matchable regulators.

[23]:
# here is a list of missing regulators from the TRN
tf_issues.index[~tf_issues.in_trn.astype(bool)]
[23]:
Index([], dtype='object')
[24]:
# it is empty; no issues
The has_gene Column

As described above, any False values in this column indicate an inability to match between imodulon_table.regulator and gene_table.gene_name. Some regulators will not have genes, so some False values are acceptable.

In some cases, the regulator name will be a complex of several genes. PyModulon supports an additional input, tfcomplex_to_gene, which is a dictionary mapping those regulators to their preferred gene. This variable will need to be passed to both imodulondb_compatibility and imodulondb_export.

[29]:
tf_issues.index[~tf_issues.has_gene.astype(bool)]
[29]:
Index(['cecR', 'flhD;flhC', 'glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'btsR',
       'rcsA;rcsB', 'gutM', 'TPP', 'L-tryptophan', 'trp-tRNA'],
      dtype='object')
[30]:
# three of these are name mismatches
# find the old names and replace them
ica_data.gene_table = ica_data.gene_table.replace({
    'ybiH':'cecR',
    'yehT':'btsR',
    'srlM':'gutM'
})

# two are complexes
tfcomplex_to_gene = {
    'flhD;flhC':'flhD',
    'rcsA;rcsB':'rcsB'
}
[31]:
# let's re-check the tf_issues now
# use the tfcomplex_to_gene input
table_issues, tf_issues, missing_gene_links, missing_dois = \
    imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)

tf_issues.index[~tf_issues.has_gene.astype(bool)]
[31]:
Index(['glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan',
       'trp-tRNA'],
      dtype='object')
[32]:
# the rest are pseudogene or non-gene regulators; ignore.
The missing_dois Output

The final output is the missing_dois output. This output searches for completed entries in sample_table.doi, and lists the sample_table.index associated with any missing entries.

It may be helpful to think of the DOIs as “sample_links”. Clicking on the bar of an activity plot on the iModulon Pages or an expression plot on the Gene Pages will take users to the DOI of the corresponding sample if it exists. If this project is part of the modulome, in which samples from many publications are pooled, it is especially useful to fill in all of the DOIs.

In the case where no doi column exists in the sample_table, all samples will be listed as missing their dois.

[37]:
missing_dois
[37]:
Index(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
       'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
       'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
       'omics__wt_glu__1', 'omics__wt_glu__2', 'minspan__wt_glc__4',
       'minspan__bw_delcra_glc__2', 'ica__wt_glc__1', 'ica__wt_glc__2',
       'ica__wt_glc__3', 'ica__wt_glc__4', 'ica__arg_sbt__1',
       'ica__arg_sbt__2', 'ica__cytd_rib__1', 'ica__cytd_rib__2',
       'ica__gth__1', 'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
       'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
       'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
       'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
       'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
       'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
       'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
      dtype='object', name='missing_DOIs')
[38]:
# a couple of the missing dois do exist
dois = {
    'minspan__wt_glc__4': 'doi.org/10.15252/msb.20145243',
    'minspan__bw_delcra_glc__2': 'doi.org/10.15252/msb.20145243',
    'omics__wt_glu__1': 'doi.org/10.1038/ncomms13091',
    'omics__wt_glu__2': 'doi.org/10.1038/ncomms13091'
}

# add them to the sample_table
for sample, doi in dois.items():
    ica_data.sample_table.loc[sample, 'DOI'] = doi
Double-check Compatibility and Save

Once all the outputs have been checked, it is encouraged to re-run the imodulondb_compatibility check to make sure that you are comfortable ignoring all missing information. If so, save your ica_data object so that you don’t need to repeat any of these steps!

[39]:
table_issues, tf_issues, missing_gene_links, missing_dois = \
    imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)

print('--Table Issues--')
display(table_issues)
print('--TF Issues--')
display(tf_issues)
print('--Missing Gene Links--')
display(missing_gene_links.values)
print('--Missing DOIs--')
display(missing_dois.values)
--Table Issues--
Table Missing Column Solution
0 Gene gene_product Locus tags (gene_table.index) will be used.
1 Sample sample The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
2 iModulon name imodulon_table.index will be used.
3 iModulon function The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
--TF Issues--
in_trn has_link has_gene
glpR True True False
his-tRNA True False False
leu-tRNA True False False
ile-tRNA True False False
TPP True False False
L-tryptophan True False False
trp-tRNA True False False
--Missing Gene Links--
array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
       'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
       'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
       'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
       'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
       'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
       'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
       'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
       'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
       'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)
--Missing DOIs--
array(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
       'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
       'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
       'ica__wt_glc__1', 'ica__wt_glc__2', 'ica__wt_glc__3',
       'ica__wt_glc__4', 'ica__arg_sbt__1', 'ica__arg_sbt__2',
       'ica__cytd_rib__1', 'ica__cytd_rib__2', 'ica__gth__1',
       'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
       'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
       'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
       'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
       'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
       'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
       'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
      dtype=object)
[40]:
# the above output is all acceptable

# save your results
from pymodulon.io import *
save_to_json(ica_data, 'e_coli_imdb.json')

The imodulondb_export Function

After iModulonDB compatibility has been assured, it is time to use the imodulondb_export function.

Arguments:

  • model: the ica_data object

  • path: the path to the main folder of iModulonDB. This is where you cloned the iModulonDB GitHub repository, and where you should host your local server from. This function will create new folders and files inside this repository. The default is the current working directory.

  • cat_order: a list of each of the imodulon_table.categorys in order. When sorting by category in the Dataset Pages, the categories will appear in this order. Otherwise, they will appear in alphabetical order. Note that the default sorting is now by imodulon_table.exp_var (or by iModulon name if no explained variance is provided), so this input is not very important.

  • tfcomplex_to_gene: a dictionary relating regulatory complexes in the imodulon_table.regulators to gene names in gene_table.gene_names. See the secion on tf_issues.has_gene.

[41]:
cat_order = ['Carbon Source Utilization',
             'Amino Acid and Nucleotide Biosynthesis',
             'Energy Metabolism',
             'Miscellaneous Metabolism',
             'Structural Components',
             'Metal Homeostasis',
             'Stress Response',
             'Regulator Discovery',
             'Biological Enrichment',
             'Genomic Alterations',
             'Uncharacterized']

imodulondb_export(ica_data,
                  '../iModulonDB',
                  cat_order = cat_order,
                  tfcomplex_to_gene = tfcomplex_to_gene)
Writing main site files...
Done writing main site files. Writing plot files...
Two progress bars will appear below. The second will take significantly longer than the first.
Writing iModulon page files (1/2)

Writing Gene page files (2/2)

Complete! (Organism = e_coli; Dataset = precise1)

Now that the files have been exported, you can access them in your browser using the instructions at the beginning of this tutorial.

API Reference

This page contains auto-generated API reference documentation 1.

pymodulon

Submodules
pymodulon.compare
Module Contents
Functions

_get_orthologous_imodulons(M1, M2, method, cutoff)

Given two M matrices, returns the dot graph and name links of the various

_make_dot_graph(links, show_all, names1, names2)

Given a set of links between M matrices, generates a dot graph of the various

convert_gene_index(df1, df2, ortho_file=None, keep_locus=False)

Reorganizes and renames genes in a dataframe to be consistent with

compare_ica(M1, M2, ortho_file=None, cutoff=0.25, method='pearson', plot=True, show_all=False)

Compares two M matrices between a single organism or across organisms and

make_prots(gbk, out_path, lt_key='locus_tag')

Makes protein files for all the genes in the genbank file

make_prot_db(fasta_file, outname=None, combined='combined.fa')

Creates GenBank Databases from Protein FASTA of an organism

get_bbh(db1, db2, outdir='bbh', outname=None, mincov=0.8, evalue=0.001, threads=1, force=False, savefiles=True)

Runs Bidirectional Best Hit BLAST to find orthologs utilizing two protein

_get_gene_lens(file_in)

Computes gene lengths

_run_blastp(db1, db2, out, evalue, threads, force)

Runs BLASTP between two organisms

_all_clear(db1, db2, outdir, mincov)

Checks inputs are acceptable

_same_output(df1, df2)

Checks that inputs are the same

pymodulon.compare._get_orthologous_imodulons(M1, M2, method, cutoff)[source]

Given two M matrices, returns the dot graph and name links of the various connected ICA components

Parameters
  • M1 (DataFrame) – M matrix from the first organism

  • M2 (DataFrame) – M matrix from the second organism

  • method (int or str) – Correlation metric to use from {‘pearson’, ‘kendall’, ‘spearman’} or callable (see corr())

  • cutoff (float) – Cut off value for correlation metric

Returns

links – Links and distances of connected iModulons

Return type

list

pymodulon.compare._make_dot_graph(links, show_all, names1, names2)[source]

Given a set of links between M matrices, generates a dot graph of the various connected iModulons

Parameters
  • links (list) – Names and distances of connected iModulons

  • show_all (bool) – Show all iModulons regardless of their linkage (default: False)

  • names1 (list) – List of names in dataset 1

  • names2 (list) – List of names in dataset 2

Returns

dot – Dot graph of connected iModulons

Return type

Digraph

pymodulon.compare.convert_gene_index(df1, df2, ortho_file=None, keep_locus=False)[source]

Reorganizes and renames genes in a dataframe to be consistent with another object/organism

Parameters
  • df1 (DataFrame) – Dataframe from the first object/organism

  • df2 (DataFrame) – Dataframe from the second object/organism

  • ortho_file (str or DataFrame, optional) – Path to orthology file between organisms (default: None)

  • keep_locus (bool) – If True, keep old locus tags as a column (default: False)

Returns

  • df1_new (~pandas.DataFrame) – M matrix for organism 1 with indexes translated into orthologs

  • df2_new (~pandas.DataFrame) – M matrix for organism 2 with indexes translated into orthologs

pymodulon.compare.compare_ica(M1, M2, ortho_file=None, cutoff=0.25, method='pearson', plot=True, show_all=False)[source]

Compares two M matrices between a single organism or across organisms and returns the connected iModulons

Parameters
  • M1 (DataFrame) – M matrix from the first organism

  • M2 (DataFrame) – M matrix from the second organism

  • ortho_file (str, optional) – Path to orthology file between organisms (default: None)

  • cutoff (float) – Cut off value for correlation metric (default: .25)

  • method (str or Callable) – Correlation metric to use from {‘pearson’, ‘kendall’, ‘spearman’} or callable (see corr())

  • plot (bool) – Create dot plot of matches (default: True)

  • show_all (bool) – Show all iModulons regardless of their linkage (default: False)

Returns

  • matches (list) – Links and distances of connected iModulons

  • dot (Digraph) – Dot graph of connected iModulons

pymodulon.compare.make_prots(gbk, out_path, lt_key='locus_tag')[source]

Makes protein files for all the genes in the genbank file

Parameters
  • gbk (str) – Path to input genbank file

  • out_path (str) – Path to the output FASTA file

  • lt_key (str) – Key to search for locus_tag. Should be either ‘locus_tag’ or ‘old_locus_tag’

Returns

None

Return type

None

pymodulon.compare.make_prot_db(fasta_file, outname=None, combined='combined.fa')[source]

Creates GenBank Databases from Protein FASTA of an organism

Parameters
  • fasta_file (str or list) – Path to protein FASTA file or list of paths to protein fasta files

  • outname (str) – Name of BLAST database to be created. If None, it uses fasta_file name

  • combined (str) – Path to combined fasta file; only used if multiple fasta files are passed

Returns

None

Return type

None

pymodulon.compare.get_bbh(db1, db2, outdir='bbh', outname=None, mincov=0.8, evalue=0.001, threads=1, force=False, savefiles=True)[source]

Runs Bidirectional Best Hit BLAST to find orthologs utilizing two protein FASTA files. Outputs a CSV file of all orthologous genes.

Parameters
  • db1 (str) – Path to protein FASTA file for organism 1

  • db2 (str) – Path to protein FASTA file for organism 2

  • outdir (str) – Path to output directory (default: “bbh”)

  • outname (str) – Name of output CSV file (default: <db1>_vs_<db2>_parsed.csv)

  • mincov (float) – Minimum coverage to call hits in BLAST, must be between 0 and 1 (default: 0.8)

  • evalue (float) – E-value threshold for BlAST hist (default: .001)

  • threads (int) – Number of threads to use for BLAST (default: 1)

  • force (bool) – If True, overwrite existing files (default: False)

  • savefiles (bool) – If True, save files to ‘outdir’ (default: True)

Returns

out – Table of bi-directional BLAST hits between the two organisms

Return type

DataFrame

pymodulon.compare._get_gene_lens(file_in)[source]

Computes gene lengths

Parameters

file_in (str) – Input file path

Returns

out – Table of gene lengths

Return type

DataFrame

pymodulon.compare._run_blastp(db1, db2, out, evalue, threads, force)[source]

Runs BLASTP between two organisms

Parameters
  • db1 (str) – Path to protein FASTA file for organism 1

  • db2 (str) – Path to protein FASTA file for organism 2

  • out (str) – Path for BLASTP output

  • evalue (float) – E-value threshold for BlAST hits

  • threads (int) – Number of threads to use for BLAST

  • force (bool) – If True, overwrite existing files

Returns

out – Path of BLASTP output

Return type

str

pymodulon.compare._all_clear(db1, db2, outdir, mincov)[source]

Checks inputs are acceptable

pymodulon.compare._same_output(df1, df2)[source]

Checks that inputs are the same

pymodulon.core

Core functions for the IcaData object

Module Contents
Classes

IcaData

Class representation of all iModulon-related data

MotifInfo

class pymodulon.core.IcaData(M, A, X=None, log_tpm=None, gene_table=None, sample_table=None, imodulon_table=None, trn=None, dagostino_cutoff=None, optimize_cutoff=False, thresholds=None, threshold_method='dagostino', motif_info=None, imodulondb_table=None, gene_links=None, tf_links=None)[source]

Bases: object

Class representation of all iModulon-related data

property M(self)[source]

Get M matrix

property M_binarized(self)[source]

Get binarized version of M matrix based on current thresholds

property A(self)[source]

Get A matrix

property X(self)[source]

Get X matrix

property log_tpm(self)[source]

Get ‘log_tpm’ matrix

property imodulon_names(self)[source]

Get iModulon names

property sample_names(self)[source]

Get sample names

property gene_names(self)[source]

Get gene names

property gene_table(self)[source]

Get gene table

property sample_table(self)[source]

Get sample table

property imodulon_table(self)[source]

Get table of iModulons

property trn(self)[source]

Get table with TRN information

property motif_info(self)[source]

Get motif info

_update_imodulon_names(self, new_names)[source]

Iterates and updates iModulon names

Parameters

new_names (list) – New names to update iModulons

Returns

None

Return type

None

rename_imodulons(self, name_dict=None, column=None)[source]

Rename an iModulon.

Parameters
  • name_dict (dict) – Dictionary mapping old iModulon names to new names (e.g. {old_name:new_name}) (default: None)

  • column (str) – Uses a column from the iModulon table to rename iModulons ( default: None)

Returns

None

Return type

None

view_imodulon(self, imodulon)[source]

View genes in an iModulon and show relevant information about each gene.

Parameters

imodulon (int or str) – Name of iModulon

Returns

final_rows – Table showing iModulon gene information

Return type

DataFrame

find_single_gene_imodulons(self, save=False)[source]

A simple function that returns the names of all likely single-gene iModulons. Checks if the largest iModulon gene weight is more than twice the weight of the second highest iModulon gene weight.

Parameters

save (bool) – If true, save output to imodulon_table (default: False)

Returns

single_genes_imodulons – List of single-gene iModulons

Return type

list

_update_imodulon_table(self, enrichment)[source]

Update iModulon table given new iModulon enrichments

Parameters

enrichment (Series or DataFrame iModulon enrichment) –

Returns

None

Return type

None

compute_regulon_enrichment(self, imodulon, regulator, save=False, evidence=None)[source]

Compare an iModulon against a regulon. (Note: q-values cannot be computed for single enrichments)

Parameters
  • imodulon (int or str) – Name of ‘iModulon’

  • regulator (str) – TF name, or complex regulon, where “/” uses genes in any regulon and “+” uses genes in all regulons

  • save (bool) – If true, save enrichment score to the imodulon_table (default: True)

  • evidence (list or str) – ‘Evidence’ level of TRN interactions to include during TRN enrichment (default: None)

Returns

enrich – Table containing enrichment statistics

Return type

Series

compute_trn_enrichment(self, imodulons=None, fdr=1e-05, max_regs=1, save=False, method='both', force=False, evidence=None)[source]

Compare iModulons against all regulons in the TRN

Parameters
  • imodulons (int or str) – Name of iModulon(s). If none given, compute enrichments for all ‘iModulons’ (default: None)

  • fdr (float) – False detection rate (default: 1e-5)

  • max_regs (int) – Maximum number of regulators to include in complex regulon ( default: 1)

  • save (bool) – Save regulons with highest enrichment scores to the imodulon_table (default: False)

  • method (str) – How to combine multiple regulators (default: ‘both’). ‘or’ computes enrichment against union of regulons, ‘and’ computes enrichment against intersection of regulons, and ‘both’ performs both tests

  • force (bool) – If false, prevents computation of >2 regulators (default: False)

  • evidence (list or str) – Evidence level of TRN interactions to include during TRN enrichment

Returns

df_enriched – Table of statistically significant enrichments

Return type

DataFrame

compute_annotation_enrichment(self, annotation, column, imodulons=None, fdr=0.1)[source]

Compare iModulons against a gene annotation table

Parameters
  • annotation (DataFrame) – Table containing two columns: the gene locus tag, and its appropriate annotation

  • column (str) – Name of the column containing the annotation

  • imodulons (list or str or int) – Name of iModulon(s). If none given, compute enrichments for all iModulons (default: None)

  • fdr (float) – False detection rate (default: 0.1)

Returns

DF_enriched – Table of statistically significant enrichments

Return type

DataFrame

property dagostino_cutoff(self)[source]

Get D’agostino cutoff

property cutoff_optimized(self)[source]
property thresholds(self)[source]

Get thresholds

change_threshold(self, imodulon, value)[source]

Set threshold for an iModulon

Parameters
  • imodulon (int or str) – Name of iModulon

  • value (float) – New threshold

Returns

None

Return type

None

recompute_thresholds(self, dagostino_cutoff)[source]

Re-computes iModulon thresholds using a new D’Agostino cutoff

Parameters

dagostino_cutoff (int) – New D’agostino cutoff statistic

Returns

None

Return type

None

_update_thresholds(self, dagostino_cutoff)[source]
Parameters

dagostino_cutoff (int) –

reoptimize_thresholds(self, progress=True, plot=True)[source]

Re-optimizes the D’Agostino statistic cutoff for defining iModulon thresholds if the TRN has been updated

Parameters
  • progress (bool) – Show a progress bar (default: True)

  • plot (bool) – Show the sensitivity analysis plot (default: True)

Returns

New D’agostino cutoff

Return type

int

_optimize_dagostino_cutoff(self, progress, plot)[source]

Computes an abridged version of the TRN enrichments for the 20 highest-weighted genes in order to determine a global minimum for the D’Agostino cutoff ultimately used to threshold and define the genes “in” an iModulon

Parameters
  • progress (bool) – Show a progress bar (default: True)

  • plot (bool) – Show the sensitivity analysis plot (default: True)

Returns

New D’agostino cutoff

Return type

int

_kmeans_cluster(self, imodulon)[source]
compute_kmeans_thresholds(self)[source]

Computes iModulon thresholds using K-means clustering

Returns

None

Return type

None

copy(self)[source]

Make a deep copy of an IcaData object

Returns

IcaData – Copy of IcaData object

Return type

IcaData

imodulons_with(self, gene)[source]

Lists the iModulons containing a gene

Parameters

gene (str) – Gene name or locus tag

Returns

List of iModulons containing the gene

Return type

list

name2num(self, gene)[source]

Convert a gene name to the locus tag

Parameters

gene (list or str) – Gene name or list of gene names

Returns

final_list – Locus tag or list of locus tags

Return type

list or str

num2name(self, gene)[source]

Get the name of a gene from its locus tag

Parameters

gene (list or str) – Locus tag or list of locus tags

Returns

result – Gene name or list of gene names

Return type

list or str

property imodulondb_table(self)[source]
class pymodulon.core.MotifInfo(motifs, sites, cmd, file, matches=None)[source]
__repr__(self)[source]

Return repr(self).

property motifs(self)[source]
property sites(self)[source]
property cmd(self)[source]
property file(self)[source]
property matches(self)[source]
pymodulon.enrichment

Contains functions for gene set enrichment analysis

Module Contents
Functions

contingency(set1, set2, all_genes)

Creates contingency table for gene enrichment

compute_enrichment(gene_set, target_genes, all_genes, label=None)

Computes enrichment statistic for gene_set in target_genes.

FDR(p_values, fdr, total=None)

Runs false detection correction for a table of statistics

parse_regulon_str(regulon_str, trn)

Converts a complex regulon (regulon_str) into a list of genes

compute_regulon_enrichment(gene_set, regulon_str, all_genes, trn)

Computes enrichment statistics for a gene_set in a regulon

compute_trn_enrichment(gene_set, all_genes, trn, max_regs=1, fdr=0.01, method='both', force=False)

Compare a gene set against an entire TRN

compute_annotation_enrichment(gene_set, all_genes, annotation, column, fdr=0.01)

Compare a gene set against a dataframe of gene annotations

pymodulon.enrichment.contingency(set1, set2, all_genes)[source]

Creates contingency table for gene enrichment

Parameters
  • set1 (set) – Set of genes (e.g. iModulon)

  • set2 (set) – Set of genes (e.g. regulon)

  • all_genes (set) – Set of all genes

Returns

Contingency table

Return type

np.ndarray

pymodulon.enrichment.compute_enrichment(gene_set, target_genes, all_genes, label=None)[source]

Computes enrichment statistic for gene_set in target_genes.

Parameters
  • gene_set (list) – Gene set for enrichment (e.g. genes in iModulon)

  • target_genes (list) –

    Genes to be enriched against (e.g. genes in regulon or

    GO term)

  • all_genes (list) – Set of all genes

  • label (list) – Label for target_genes (e.g. regulator name or GO term)

Returns

Table containing statistically significant enrichments

Return type

pd.Series

pymodulon.enrichment.FDR(p_values, fdr, total=None)[source]

Runs false detection correction for a table of statistics

Parameters
  • p_values (DataFrame) – DataFrame with a ‘pvalue’ column

  • fdr (float) – False detection rate

  • total (int) – Total number of tests (for multi-enrichment)

Returns

Table containing entries that passed multiple hypothesis correction

Return type

DataFrame

pymodulon.enrichment.parse_regulon_str(regulon_str, trn)[source]

Converts a complex regulon (regulon_str) into a list of genes

Parameters
  • regulon_str (str) – Complex regulon, where “/” uses genes in any regulon and “+” uses genes in all regulons

  • trn (DataFrame) – Table containing transcriptional regulatory network

Returns

reg_genes – Set of genes regulated by regulon_str

Return type

set

pymodulon.enrichment.compute_regulon_enrichment(gene_set, regulon_str, all_genes, trn)[source]

Computes enrichment statistics for a gene_set in a regulon

Parameters
  • gene_set (set) – Gene set for enrichment (e.g. genes in iModulon)

  • regulon_str (str) – Complex regulon, where “/” uses genes in any regulon and “+” uses genes in all regulons

  • all_genes (set) – Set of all genes

  • trn (DataFrame) – Table containing transcriptional regulatory network

Returns

result – Table containing statistically significant enrichments

Return type

DataFrame

pymodulon.enrichment.compute_trn_enrichment(gene_set, all_genes, trn, max_regs=1, fdr=0.01, method='both', force=False)[source]

Compare a gene set against an entire TRN

Parameters
  • gene_set (set) – Gene set for enrichment (e.g. genes in iModulon)

  • all_genes (set) – Set of all genes

  • trn (DataFrame) – Table containing transcriptional regulatory network

  • max_regs (int) – Maximum number of regulators to include in complex regulon (default: 1)

  • fdr (float) – False detection rate (default = .01)

  • method (str) – How to combine complex regulons. (default: ‘both’) “or” computes enrichment against union of regulons “and” computes enrichment against intersection of regulons “both” performs both tests

  • force (bool) – Allows computation of >2 regulators (default = False)

Returns

Table containing statistically significant enrichments

Return type

DataFrame

pymodulon.enrichment.compute_annotation_enrichment(gene_set, all_genes, annotation, column, fdr=0.01)[source]

Compare a gene set against a dataframe of gene annotations

Parameters
  • gene_set (set) – Gene set for enrichment (e.g. genes in iModulon)

  • all_genes (set) – Set of all genes

  • annotation (DataFrame) – Table containing gene annotations

  • column (str) – Name of column in the annotation DataFrame (default: ‘annotation’)

  • fdr (float) – False detection rate (default: 0.01)

Returns

Table containing statistically significant enrichments

Return type

pandas.DataFrame

pymodulon.example_data

Pre-loaded example dataset for PyModulon tutorials.

Module Contents
Functions

load_ecoli_data()

Load Escherichia coli IcaData object from

load_staph_data()

Load Staphylococcus aureus IcaData object from

load_bsub_data()

Load Bacillus subtilis IcaData object from

load_example_bbh()

Load an example bi-directional blast best hit (BBH) file

load_example_log_tpm()

Load an example expression dataset in units log-TPM

Attributes

_data_dir

_ecoli_dir

M

A

X

gene_table

sample_table

imodulon_table

trn

ecoli_fasta

ecoli_gff

ecoli_eggnog

ecoli_biocyc

ecoli_go_example

pymodulon.example_data._data_dir[source]
pymodulon.example_data._ecoli_dir[source]
pymodulon.example_data.M[source]
pymodulon.example_data.A[source]
pymodulon.example_data.X[source]
pymodulon.example_data.gene_table[source]
pymodulon.example_data.sample_table[source]
pymodulon.example_data.imodulon_table[source]
pymodulon.example_data.trn[source]
pymodulon.example_data.ecoli_fasta[source]
pymodulon.example_data.ecoli_gff[source]
pymodulon.example_data.ecoli_eggnog[source]
pymodulon.example_data.ecoli_biocyc[source]
pymodulon.example_data.ecoli_go_example[source]
pymodulon.example_data.load_ecoli_data()[source]

Load Escherichia coli IcaData object from [SGS+19]

Returns

ecoli_dataE. coli IcaData object

Return type

IcaData

pymodulon.example_data.load_staph_data()[source]

Load Staphylococcus aureus IcaData object from [PTS+20]

Returns

staph_dataS. aureus IcaData object

Return type

IcaData

pymodulon.example_data.load_bsub_data()[source]

Load Bacillus subtilis IcaData object from [RSP20]

Returns

bsub_dataB. subtilis IcaData object

Return type

IcaData

pymodulon.example_data.load_example_bbh()[source]

Load an example bi-directional blast best hit (BBH) file

Returns

example_bbh – Example BBH file

Return type

DataFrame

pymodulon.example_data.load_example_log_tpm()[source]

Load an example expression dataset in units log-TPM

Returns

example_tpm – Example expression dataset

Return type

DataFrame

pymodulon.gene_util

Utility functions for gene annotation

Module Contents
Functions

cog2str(cog)

Get the full description for a COG category letter

_get_attr(attributes, attr_id, ignore=False)

Helper function for parsing GFF annotations

gff2pandas(gff_file, feature='CDS', index=None)

Converts GFF file(s) to a Pandas DataFrame

reformat_biocyc_tu(tu)

param tu

Biocyc-formatted transcription unit (i.e. ‘thrL // thrA // thrB //

uniprot_id_mapping(prot_list, input_id='ACC+ID', output_id='P_REFSEQ_AC', input_name='input_id', output_name='output_id')

Python wrapper for the uniprot ID mapping tool (See

pymodulon.gene_util.cog2str(cog)[source]

Get the full description for a COG category letter

Parameters

cog (str) – COG category letter

Returns

Description of COG category

Return type

str

pymodulon.gene_util._get_attr(attributes, attr_id, ignore=False)[source]

Helper function for parsing GFF annotations

Parameters
  • attributes (str) – Attribute string

  • attr_id (str) – Attribute ID

  • ignore (bool) – If true, ignore errors if ID is not in attributes (default: False)

Returns

Value of attribute

Return type

str, optional

pymodulon.gene_util.gff2pandas(gff_file, feature='CDS', index=None)[source]

Converts GFF file(s) to a Pandas DataFrame :param gff_file: Path(s) to GFF file :type gff_file: str or list :param feature: Name(s) of features to keep (default = “CDS”) :type feature: str or list :param index: Column or attribute to use as index :type index: str, optional

Returns

df_gff – GFF formatted as a DataFrame

Return type

DataFrame

pymodulon.gene_util.reformat_biocyc_tu(tu)[source]
Parameters

tu (str) – Biocyc-formatted transcription unit (i.e. ‘thrL // thrA // thrB // thrC’)

Returns

formatted_tu – Semicolon-separated sorted gene list

Return type

str

pymodulon.gene_util.uniprot_id_mapping(prot_list, input_id='ACC+ID', output_id='P_REFSEQ_AC', input_name='input_id', output_name='output_id')[source]

Python wrapper for the uniprot ID mapping tool (See https://www.uniprot.org/uploadlists/)

Parameters
  • prot_list (list) – List of proteins to be mapped

  • input_id (str) – ID type for the mapping input (default: “ACC+ID”)

  • output_id (str) – ID type for the mapping output (default: “P_REFSEQ_AC”)

  • input_name (str) – Column name for input IDs

  • output_name (str) – Column name for output IDs

Returns

mapping – Table containing two columns, one listing the inputs, and one listing the mapped outputs. Column names are defined by input_name and output_name.

Return type

DataFrame

pymodulon.imodulondb

Functions for writing a directory for iModulonDB webpages

Module Contents
Functions

imodulondb_compatibility(model, inplace=False, tfcomplex_to_gene=None)

Checks for all issues and missing information prior to exporting to iModulonDB.

imodulondb_export(model, path='.', cat_order=None, tfcomplex_to_gene=None, skip_iMs=False, skip_genes=False)

Generates the iModulonDB page for the data and exports to the path.

imdb_dataset_table(model)

Converts the model’s imodulondb_table into dataset metadata

imdb_iM_table(imodulon_table, cat_order=None)

Reformats the iModulon table according

imdb_gene_presence(model)

Generates the two versions of the gene presence file, one as a binary

imodulondb_main_site_files(model, path_prefix='.', rewrite_annotations=True, cat_order=None)

Generates all parts of the site that do not require large iteration loops

imdb_generate_im_files(model, path_prefix='.', gene_scatter_x='start', tfcomplex_to_gene=None)

Generates all files for all iModulons in data

imdb_generate_gene_files(model, path_prefix='.')

Generates all files for all iModulons in IcaData object

parse_tf_string(model, tf_str, verbose=False)

Returns a list of relevant tfs from a string. Will ignore TFs not in the

imdb_gene_table_df(model, k)

Creates the gene table dataframe for iModulonDB

_component_DF(model, k, tfs=None)

Helper function for imdb_gene_hist_df

_tf_combo_string(row)

Creates a formatted string for the histogram legends. Helper function for

_sort_tf_strings(tfs, unique_elts)

Sorts TF strings for the legend of the histogram. Helper function for

imdb_gene_hist_df(model, k, bins=20, tol=0.001)

Creates the gene histogram for an iModulon

_gene_color_dict(model)

Helper function to match genes to colors based on COG. Used by

imdb_gene_scatter_df(model, k, gene_scatter_x='start')

Generates a dataframe for the gene scatter plot in iModulonDB

generate_n_replicates_column(model)

Generates the “n_replicates” column of the sample_table for iModulonDB.

imdb_activity_bar_df(model, k)

Generates a dataframe for the activity bar graph of iModulon k

_parse_regulon_string(model, s)

The Bacillus microarray dataset uses [] to create unusually complicated

_get_reg_genes(model, tf)

Finds the set of genes regulated by the boolean combination of regulators

imdb_regulon_venn_df(model, k)

Generates a dataframe for the regulon venn diagram of iModulon k. Returns

get_tfs_to_scatter(model, tf_string, tfcomplex_to_genename=None, verbose=False)

param model

IcaData object

imdb_regulon_scatter_df(model, k, tfcomplex_to_genename=None)

param model

IcaData object

tf_with_links(model, tf_str)

Adds links to the regulator string

tf_with_links_brackets(model, tf_str)

Adds links to the regulator string

imdb_imodulon_basics_df(model, k, reg_venn, reg_scatter)

Generates a dataframe for the metadata of iModulon k

make_im_directory(model, k, path_prefix='.', gene_scatter_x='start', tfcomplex_to_genename=None)

param model

IcaData object

imdb_gene_activity_bar_df(model, gene_id)

param model

IcaData object

imdb_gene_im_table_df(model, g, im_table, m_bin)

Generates a dataframe for the iModulon table of gene g

imdb_gene_basics_df(model, g)

param model

IcaData object

make_gene_directory(model, g, path_prefix='.')

Generates all data for gene g, stores it in a subfolder of path_prefix

pymodulon.imodulondb.imodulondb_compatibility(model, inplace=False, tfcomplex_to_gene=None)[source]

Checks for all issues and missing information prior to exporting to iModulonDB. If inplace = True, modifies the model (not recommended for main model variables).

Parameters
  • model (IcaData) – IcaData object to check

  • inplace (bool, optional) – If true, modifies the model to prepare for export. Not recommended for use with your main model variable.

  • tfcomplex_to_gene (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table (ex: {“FlhDC”:”flhD”})

Returns

  • table_issues (pd.DataFrame) – Each row corresponds to an issue with one of the main class elements. Columns: * Table: which table or other variable the issue is in * Missing Column: the column of the Table with the issue (not case sensitive; capitalization is ignored). * Solution: Unless “CRITICAL” is in this cell, the site behavior if the issue remained is described here.

  • tf_issues (pd.DataFrame) – Each row corresponds to a regulator that is used in the imodulon_table. Columns: * in_trn: whether the regulator is in the model.trn. Regulators not in the TRN will be ignored in the site’s histograms and gene tables. * has_link: whether the regulator has a link in tf_links. If not, no link to external regulator databases will be shown. * has_gene: whether the regulator can be matched to a gene in the model. If this is false, then there will be no regulator scatter plot on the site. You can link TF complexes to one of their genes using the tfcomplex_to_gene input.

  • missing_g_links (pd.Series) – The genes on this list don’t have links in the gene_links. Their gene pages for these genes will not display links.

  • missing_DOIs (pd.Series) – The samples listed here don’t have DOIs in the sample_table. Clicking on their associated bars in the activity plots will not link to relevant papers.

pymodulon.imodulondb.imodulondb_export(model, path='.', cat_order=None, tfcomplex_to_gene=None, skip_iMs=False, skip_genes=False)[source]

Generates the iModulonDB page for the data and exports to the path. If certain columns are unavailable but can be filled in automatically, they will be.

Parameters
  • model (IcaData) – IcaData object to export

  • path (str, optional) – Path to iModulonDB main hosting folder (default = “.”)

  • cat_order (list, optional) – List of categories in the imodulon_table, ordered as you would like them to appear in the dataset table (default = None)

  • tfcomplex_to_gene (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}

  • skip_iMs (bool, optional) – If this is True, do not output iModulon files (to save time)

  • skip_genes (bool, optional) – If this is True, do not output gene files (to save time)

Returns

None

Return type

None

pymodulon.imodulondb.imdb_dataset_table(model)[source]

Converts the model’s imodulondb_table into dataset metadata for the gray box on the left side of the dataset page

Parameters

model (IcaData) – An IcaData object

Returns

res – A series of formatted metadata

Return type

Series

pymodulon.imodulondb.imdb_iM_table(imodulon_table, cat_order=None)[source]

Reformats the iModulon table according

Parameters
  • imodulon_table (DataFrame) – Table formatted similar to IcaData.imodulon_table

  • cat_order (list, optional) – List of categories in imodulon_table.category, ordered as desired

Returns

im_table – New iModulon table with the columns expected by iModulonDB

Return type

DataFrame

pymodulon.imodulondb.imdb_gene_presence(model)[source]

Generates the two versions of the gene presence file, one as a binary matrix, and one as a DataFrame

Parameters

model (IcaData) – An IcaData object

Returns

  • mbin (~pandas.DataFrame) – Binarized M matrix

  • mbin_list (~pandas.DataFrame) – Table mapping genes to iModulons

pymodulon.imodulondb.imodulondb_main_site_files(model, path_prefix='.', rewrite_annotations=True, cat_order=None)[source]

Generates all parts of the site that do not require large iteration loops

Parameters
  • model (IcaData) – IcaData object

  • path_prefix (str, optional) – Main folder for iModulonDB files (default = “.”)

  • rewrite_annotations (bool, optional) – Set to False if the gene_table and trn are unchanged (default = True)

  • cat_order (list, optional) – list of categories in data.imodulon_table.category, ordered as you want them to appear on the dataset page (default = None)

Returns

main_folder – Dataset folder, for use as the path_prefix in imdb_generate_im_files()

Return type

str

pymodulon.imodulondb.imdb_generate_im_files(model, path_prefix='.', gene_scatter_x='start', tfcomplex_to_gene=None)[source]

Generates all files for all iModulons in data

Parameters
  • model (IcaData) – IcaData object

  • path_prefix (str, optional) – Dataset folder in which to store the files (default = “.”)

  • gene_scatter_x (str) – Column from the gene table that specificies what to use on the X-axis of the gene scatter plot (default = “start”)

  • tfcomplex_to_gene (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}

pymodulon.imodulondb.imdb_generate_gene_files(model, path_prefix='.')[source]

Generates all files for all iModulons in IcaData object

Parameters
  • model (IcaData) – IcaData object

  • path_prefix (str, optional) – Dataset folder in which to store the files (default = “.”)

Returns

Return type

None

pymodulon.imodulondb.parse_tf_string(model, tf_str, verbose=False)[source]

Returns a list of relevant tfs from a string. Will ignore TFs not in the trn file. iModulonDB helper function.

Parameters
  • model (IcaData) – IcaData object

  • tf_str (str) – String of tfs joined by ‘+’ and ‘/’ operators

  • verbose (bool, optional) – Whether or nor to print outputs

Returns

tfs – List of relevant TFs

Return type

list

pymodulon.imodulondb.imdb_gene_table_df(model, k)[source]

Creates the gene table dataframe for iModulonDB :param model: IcaData object :type model: IcaData :param k: iModulon name :type k: int or str

Returns

res – DataFrame of the gene table that is compatible with iModulonDB

Return type

DataFrame

pymodulon.imodulondb._component_DF(model, k, tfs=None)[source]

Helper function for imdb_gene_hist_df

Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

  • tfs (list) – List of TFs (default = None)

Returns

gene_table – Gene table for the iModulon

Return type

DataFrame

pymodulon.imodulondb._tf_combo_string(row)[source]

Creates a formatted string for the histogram legends. Helper function for imdb_gene_hist_df.

Parameters

row (Series) – Boolean series indexed by TFs for a given gene

Returns

A string formatted for display (i.e. “Regulated by …”)

Return type

str

pymodulon.imodulondb._sort_tf_strings(tfs, unique_elts)[source]

Sorts TF strings for the legend of the histogram. Helper function for imdb_gene_hist_df.

Parameters
  • tfs (list[str]) – Sequence of TFs in the desired order

  • unique_elts (list[str]) – All combination strings made by _tf_combo_string

Returns

A sorted list of combination strings that have a consistent ordering

Return type

list[str]

pymodulon.imodulondb.imdb_gene_hist_df(model, k, bins=20, tol=0.001)[source]

Creates the gene histogram for an iModulon

Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

  • bins (int) – Number of bins in the histogram (default = 20)

  • tol (float) – Distance to threshold for deciding if a bar is in the iModulon (default = .001)

Returns

gene_hist_table – A dataframe for producing the histogram that is compatible with iModulonDB

Return type

DataFrame

pymodulon.imodulondb._gene_color_dict(model)[source]

Helper function to match genes to colors based on COG. Used by imdb_gene_scatter_df.

Parameters

model (IcaData) – IcaData object

Returns

Dictionary associating gene names to colors

Return type

dict

pymodulon.imodulondb.imdb_gene_scatter_df(model, k, gene_scatter_x='start')[source]

Generates a dataframe for the gene scatter plot in iModulonDB

Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

  • gene_scatter_x (str) – Determines x-axis of the scatterplot

Returns

res – A dataframe for producing the scatterplot

Return type

DataFrame

pymodulon.imodulondb.generate_n_replicates_column(model)[source]

Generates the “n_replicates” column of the sample_table for iModulonDB.

Parameters

model (IcaData) – IcaData object. Will overwrite the existing column if it exists.

Returns

None

Return type

None

pymodulon.imodulondb.imdb_activity_bar_df(model, k)[source]

Generates a dataframe for the activity bar graph of iModulon k

Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

Returns

res – A dataframe for producing the activity bar graph for iModulonDB

Return type

DataFrame

pymodulon.imodulondb._parse_regulon_string(model, s)[source]

The Bacillus microarray dataset uses [] to create unusually complicated TF strings. This function parses those, as a helper to _get_reg_genes for imdb_regulon_venn_df.

Parameters
  • model (IcaData) – IcaData object

  • s (str) – TF string

Returns

res – Set of genes regulated by this string

Return type

set

pymodulon.imodulondb._get_reg_genes(model, tf)[source]

Finds the set of genes regulated by the boolean combination of regulators in a TF string

Parameters
  • model (IcaData) – IcaData object

  • tf (str) – string of TFs separated by +, /, and/or []

Returns

reg_genes – Set of regulated genes

Return type

set[str]

pymodulon.imodulondb.imdb_regulon_venn_df(model, k)[source]

Generates a dataframe for the regulon venn diagram of iModulon k. Returns None if there is no diagram to draw

Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

Returns

res – A DataFrame for producing the venn diagram in iModulonDB

Return type

DataFrame

pymodulon.imodulondb.get_tfs_to_scatter(model, tf_string, tfcomplex_to_genename=None, verbose=False)[source]
Parameters
  • model (IcaData) – IcaData object

  • tf_string (str or nan) – String of TFs, or np.nan

  • tfcomplex_to_genename (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}

  • verbose (bool) – Show verbose output (default: False)

Returns

res – List of gene loci

Return type

list

pymodulon.imodulondb.imdb_regulon_scatter_df(model, k, tfcomplex_to_genename=None)[source]
Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

  • tfcomplex_to_genename (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}

Returns

res – A dataframe for producing the regulon scatter plots in iModulonDB

Return type

DataFrame

Adds links to the regulator string

Parameters
  • model (IcaData) – IcaData object

  • tf_str (str or float) – Regulator string for a given iModulon, or np.nan

Returns

res – String with links added

Return type

str

Adds links to the regulator string Used with the complicated bracket system in Bacillus Microarray

Parameters
  • model (IcaData) – IcaData object

  • tf_str (str or float) – Regulator string for a given iModulon, or np.nan

Returns

res – String with links added

Return type

str

pymodulon.imodulondb.imdb_imodulon_basics_df(model, k, reg_venn, reg_scatter)[source]

Generates a dataframe for the metadata of iModulon k

Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

  • reg_venn (DataFrame or None) – Output of imdb_regulon_venn_df(data, k)

  • reg_scatter (DataFrame or None) – Output of imdb_regulon_scatter_df(data, k)

Returns

res – A dataframe of metadata for iModulon k in iModulonDB

Return type

DataFrame

pymodulon.imodulondb.make_im_directory(model, k, path_prefix='.', gene_scatter_x='start', tfcomplex_to_genename=None)[source]
Parameters
  • model (IcaData) – IcaData object

  • k (int or str) – iModulon name

  • path_prefix (str, optional) – Path to the dataset folder. This function creates an ‘iModulon_files/k/’ subdirectory there to store everything. (default = “.”)

  • gene_scatter_x (str) – Passed to imdb_gene_scatter_df() to indicate the x axis type of that plot (default = “start”)

  • tfcomplex_to_genename (dict, optional) – dictionary pointing complex TRN entries to matching gene names in the gene table ex: {“FlhDC”:”flhD”}

Returns

None

Return type

None

pymodulon.imodulondb.imdb_gene_activity_bar_df(model, gene_id)[source]
Parameters
  • model (IcaData) – IcaData object

  • gene_id (str) – Locus tag of gene

Returns

res – A dataframe for the activity bar of gene in iModulonDB

Return type

DataFrame

pymodulon.imodulondb.imdb_gene_im_table_df(model, g, im_table, m_bin)[source]

Generates a dataframe for the iModulon table of gene g

Parameters
  • model (IcaData) – IcaData object

  • g (str) – Gene locus tag

  • im_table (DataFrame) – Pre-cleaned version of data.imodulon_table

  • m_bin (DataFrame) – Boolean transpose version of data.M_binarized

Returns

perGene_table – A dataframe for the iModulon table of gene g in iModulonDB

Return type

DataFrame

pymodulon.imodulondb.imdb_gene_basics_df(model, g)[source]
Parameters
  • model (IcaData) – IcaData object

  • g (str) – Gene locus

Returns

res – A dataframe for the metadata of gene g in iModulonDB

Return type

DataFrame

pymodulon.imodulondb.make_gene_directory(model, g, path_prefix='.')[source]

Generates all data for gene g, stores it in a subfolder of path_prefix

Parameters
  • model (IcaData) – IcaData object

  • g (str) – Gene locus

  • path_prefix (str, optional) – Path to the dataset folder. This function creates a ‘gene_page_files/k/’ subdirectory there to store everything. (default = “.”)

Returns

im_df – Table containing iModulon information for the gene

Return type

DataFrame

pymodulon.io

Functions for reading and writing data into files.

Module Contents
Functions

save_to_json(data, filename, compress=False)

Save IcaData object to a json file

load_json_model(filename)

Load IcaData object from a file in JSON format.

pymodulon.io.save_to_json(data, filename, compress=False)[source]

Save IcaData object to a json file

Parameters
  • data (IcaData) – ICA dataset to be saved to json file

  • filename (str) – Path to json file where the data will be saved

  • compress (bool) – Indicates if the JSON file should be compressed into a gzip archive

Returns

None

Return type

None

pymodulon.io.load_json_model(filename)[source]

Load IcaData object from a file in JSON format.

Parameters

filename (str or StringIO) – File path or descriptor that contains the JSON document describing the ICA dataset.

Returns

IcaData – The IcaData object as represented in the JSON document.

Return type

IcaData

pymodulon.motif
Module Contents
Functions

_get_upstream_seqs(ica_data, imodulon, seq_dict, upstream, downstream)

Get upstream sequences for a table of operons

find_motifs(ica_data, imodulon, fasta_file, outdir = None, palindrome = False, nmotifs = 5, upstream = 500, downstream = 100, verbose = True, force = False, evt = 0.001, cores = 8, minw = 6, maxw = 40, minsites = None)

Finds motifs upstream of genes in an iModulon

_parse_meme(directory, DF_seqs, verbose, evt)

_parse_tomtom(tomtom_dir)

compare_motifs(motif_info = None, motif_file = None, motif_db = None, outdir = None, force=False, verbose=True, evt=0.001)

Compare a MEME motif against external motif databases

pymodulon.motif._get_upstream_seqs(ica_data, imodulon, seq_dict, upstream, downstream)[source]

Get upstream sequences for a table of operons

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon (Union[str,int]) – Name of iModulon

  • seq_dict (Dict) – Dictionary mapping accession numbers to Biopython SeqRecords

  • upstream (int) – Number of basepairs upstream from first gene in operon to include in motif search

  • downstream (int) – Number of basepairs upstream from first gene in operon to include in motif search

Returns

  • pd.DataFrame – DataFrame containing operon information

  • List[SeqRecord] – List of SeqRecords containing upstream sequences

pymodulon.motif.find_motifs(ica_data, imodulon, fasta_file, outdir=None, palindrome=False, nmotifs=5, upstream=500, downstream=100, verbose=True, force=False, evt=0.001, cores=8, minw=6, maxw=40, minsites=None)[source]

Finds motifs upstream of genes in an iModulon

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon (Union[int, str]) – iModulon name

  • fasta_file (Union[os.PathLike, List[os.PathLike]]) – Path or list of paths to fasta file(s) for organism

  • outdir (os.PathLike) – Path to output directory

  • palindrome (bool) – If True, limit search to palindromic motifs (default: False)

  • nmotifs (int) – Number of motifs to search for (default: 5)

  • upstream (int) – Number of basepairs upstream from first gene in operon to include in motif search (default: 500)

  • downstream (int) – Number of basepairs upstream from first gene in operon to include in motif search (default: 100)

  • verbose (bool) – Show steps in verbose output (default: True)

  • force (bool) – Force execution of MEME even if output already exists (default: False)

  • evt (float) – E-value threshold (default: 0.001)

  • cores (int) – Number of cores to use (default: 8)

  • minw (int) – Minimum motif width in basepairs (default: 6)

  • maxw (int) – Maximum motif width in basepairs (default: 40)

  • minsites (Optional[int]) – Minimum number of sites required for a motif. Default is the number of operons divided by 3.

Returns

# TODO

Return type

add documentation of return

pymodulon.motif._parse_meme(directory, DF_seqs, verbose, evt)[source]
pymodulon.motif._parse_tomtom(tomtom_dir)[source]
pymodulon.motif.compare_motifs(motif_info=None, motif_file=None, motif_db=None, outdir=None, force=False, verbose=True, evt=0.001)[source]

Compare a MEME motif against external motif databases

Parameters
  • motif_info (Optional[MotifInfo]) – MotifInfo object. Either ‘motif_info’ or ‘motif_file’ must be provided. ‘motif_info’ takes precedence over ‘motif_file’.

  • motif_file (Optional[os.PathLike]) – Txt file generated by MEME from a motif search.

  • motif_db (Optional[Union[List,str]]) – Name or path to MEME formatted databases

  • outdir (Optional[os.PathLike]) – Output directory for TOMTOM comparisons

  • force (bool) – Force execution of TOMTOM even if output already exists (default: False)

  • verbose (bool) – Show steps in verbose output (default: True)

  • evt (float) – E-value threshold (default: 0.001)

pymodulon.plotting

Plotting functions for iModulons

Module Contents
Functions

barplot(values, sample_table, ylabel='', projects=None, highlight=None, ax=None, legend_kwargs=None)

Creates an overlaid scatter and barplot for a set of values (either gene

plot_expression(ica_data, gene, projects=None, highlight=None, ax=None, legend_kwargs=None)

Creates a barplot showing an gene’s expression across the compendium

plot_activities(ica_data, imodulon, projects=None, highlight=None, ax=None, legend_kwargs=None)

Creates a barplot showing an iModulon’s activity across the compendium

plot_metadata(ica_data, column, projects=None, highlight=None, ax=None, legend_kwargs=None)

Creates a barplot for values in the sample table

plot_regulon_histogram(ica_data, imodulon, regulator=None, bins=None, kind='overlap', ax=None, hist_label=('Not regulated', 'Regulon Genes'), color=('#aaaaaa', 'salmon'), alpha=0.7, ax_font_kwargs=None, legend_kwargs=None)

Plots a histogram of regulon vs non-regulon genes by iModulon weighting.

scatterplot(x, y, groups=None, colors=None, show_labels='auto', adjust_labels=True, line45=False, line45_margin=0, fit_line=False, fit_metric='pearson', xlabel='', ylabel='', ax=None, legend=True, ax_font_kwargs=None, scatter_kwargs=None, label_font_kwargs=None, legend_kwargs=None)

Generates a scatter-plot of the data given, with options for coloring by

plot_gene_weights(ica_data, imodulon, by='start', xaxis=None, xname='', **kwargs)

Plot gene weights on a scatter plot.

compare_gene_weights(ica_data, imodulon1, imodulon2, ica_data2=None, ortho_file=None, use_org1_names=True, **kwargs)

Create a scatterplot comparing the gene weights between two iModulons

compare_expression(ica_data, gene1, gene2, **kwargs)

Create a scatterplot comparing the compendium-wide expression profiles of two genes

compare_activities(ica_data, imodulon1, imodulon2, **kwargs)

Create a scatterplot comparing the compendium-wide activities of two iModulons

plot_dima(ica_data, sample1, sample2, threshold=5, fdr=0.1, label=True, adjust=True, table=False, **kwargs)

Plots a DiMA plot between two projects or two sets of samples

plot_explained_variance(ica_data, pc=True, ax=None)

Plots the cumulative explained variance for independent components and,

compare_imodulons_vs_regulons(ica_data, imodulons=None, cat_column=None, size_column=None, scale=1, reg_only=True, xlabel=None, ylabel=None, vline=0.6, hline=0.6, ax=None, scatter_kwargs=None, ax_font_kwargs=None, legend_kwargs=None)

Compare the overlaps between iModulons and their linked regulons

cluster_activities(ica_data, correlation_method='spearman', distance_threshold=None, show_thresholding=False, show_clustermap=True, show_best_clusters=False, n_best_clusters='auto', cluster_names=None, return_clustermap=False, dimca_sample1=None, dimca_sample2=None, dimca_threshold=5, dimca_fdr=0.1, dimca_label=True, dimca_adjust=True, dimca_table=False, **dima_kwargs)

Cluster all iModulon activity profiles using hierarchical clustering and display

metadata_boxplot(ica_data, imodulon, show_points=True, n_boxes=3, samples=None, strip_conc=True, ignore_cols=None, use_cols=None, return_results=False, ax=None, box_kwargs=None, strip_kwargs=None, swarm_kwargs=None)

Uses a decision tree regressor to automatically cluster iModulon activities

_encode_metadata(ica_data, samples, use_cols=None, ignore_cols=None, strip_conc=True)

_train_classifier(component, features, max_leaf_nodes=3)

_get_labels_from_tree(clf, encoding)

_get_sample_leaves(clf, features, labels, component)

_fit_line(x, y, ax, metric)

_get_fit(x, y)

_broken_line(x, A, B, C)

_solid_line(x, A, B)

_adj_r2(f, x, y, params)

_mod_freedman_diaconis(ica_data, imodulon)

Generates bins using optimal bin width estimate using the Freedman-Diaconis rule

_set_xaxis(xaxis, y)

Implements experimental xaxis param from plot_gene_weights. This

pymodulon.plotting.barplot(values, sample_table, ylabel='', projects=None, highlight=None, ax=None, legend_kwargs=None)[source]

Creates an overlaid scatter and barplot for a set of values (either gene expression levels or iModulon activities)

Parameters
  • values (Series) – List of values to plot

  • sample_table (DataFrame) – Sample table from IcaData object

  • ylabel (str, optional) – Y-axis label

  • projects (list or str, optional) – Name(s) of projects to show (default: show all)

  • highlight (list or str, optional) – Project(s) to highlight (default: None)

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the barplot

Return type

Axes

pymodulon.plotting.plot_expression(ica_data, gene, projects=None, highlight=None, ax=None, legend_kwargs=None)[source]

Creates a barplot showing an gene’s expression across the compendium

Parameters
  • ica_data (IcaData) – IcaData object

  • gene (str) – Gene locus tag or name

  • projects (str or list, optional) – Name(s) of projects to show (default: show all)

  • highlight (str or list, optional) – Name(s) of projects to highlight (default: None)

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the barplot

Return type

Axes

pymodulon.plotting.plot_activities(ica_data, imodulon, projects=None, highlight=None, ax=None, legend_kwargs=None)[source]

Creates a barplot showing an iModulon’s activity across the compendium

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon (int or str) – iModulon name

  • projects (list or str, optional) – Name(s) of projects to show (default: show all)

  • highlight (str or list, optional) – Name(s) of projects to highlight (default: None)

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the barplot

Return type

Axes

pymodulon.plotting.plot_metadata(ica_data, column, projects=None, highlight=None, ax=None, legend_kwargs=None)[source]

Creates a barplot for values in the sample table

Parameters
  • ica_data (IcaData) – IcaData object

  • column (str) – Column name to plot

  • projects (list or str, optional) – Name(s) of projects to show (default: show all)

  • highlight (str or list, optional) – Name(s) of projects to highlight (default: None)

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the barplot

Return type

Axes

pymodulon.plotting.plot_regulon_histogram(ica_data, imodulon, regulator=None, bins=None, kind='overlap', ax=None, hist_label=('Not regulated', 'Regulon Genes'), color=('#aaaaaa', 'salmon'), alpha=0.7, ax_font_kwargs=None, legend_kwargs=None)[source]

Plots a histogram of regulon vs non-regulon genes by iModulon weighting.

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon (int or str) – iModulon name

  • regulator (str, optional) – Name of regulator to compare enrichment against. If no regulator is given, choose top enrichment from imodulon_table

  • bins (int, str or list, optional) – The bins to use when generating the histogram. Passed on to matplotlib.pyplot.hist()

  • kind ('overlap' or 'side', optional) – Whether to plot an overlapping or side-by-side comparison histogram ( default: ‘overlap’)

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • hist_label (tuple, optional) – The label to use when plotting the regulon and non-regulon genes. Takes into a tuple of 2 values (first for non-regulon genes, second for regulon genes). Passed on to matplotlib.pyplot.hist()

  • color (list or str, optional) – The colors to use for regulon and non-regulon genes. Takes a Sequence of 2 values (first for non-regulon genes, second for regulon genes). Passed on to matplotlib.pyplot.hist()

  • alpha (float, optional) – Sets the opacity of the histogram (0 = transparent, 1 = opaque). Passed on to matplotlib.pyplot.hist()

  • ax_font_kwargs (dict, optional) – Additional keyword arguments for axes labels

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the histogram

Return type

Axes

pymodulon.plotting.scatterplot(x, y, groups=None, colors=None, show_labels='auto', adjust_labels=True, line45=False, line45_margin=0, fit_line=False, fit_metric='pearson', xlabel='', ylabel='', ax=None, legend=True, ax_font_kwargs=None, scatter_kwargs=None, label_font_kwargs=None, legend_kwargs=None)[source]

Generates a scatter-plot of the data given, with options for coloring by group, adding labels, adding lines, and generating correlation or determination coefficients.

Parameters
  • x (Series) – X-axis data

  • y (Series) – Y-axis data

  • groups (dict, optional) – A dictionary mapping samples to group names

  • colors (str, list or dict, optional) – Color of points, list of colors for different groups, or dictionary mapping groups to colors

  • show_labels (bool or 'auto') – Show labels for data points

  • adjust_labels (bool) – Auto-adjust labels for data points

  • line45 (bool) – Show 45-degree line of equal values

  • line45_margin (float) – Show 45-degreen lines offset by a margin

  • fit_line (bool) – Draw a line of best fit on the scatterplot

  • fit_metric ('pearson', 'spearman', 'r2', or None) – Metric to report in legend for line of best fit

  • xlabel (str) – X-axis label

  • ylabel (str) – Y-axis label

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • legend (bool) – Show legend

  • ax_font_kwargs (dict, optional) – Additional keyword arguments for axis labels

  • scatter_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.scatter()

  • label_font_kwargs (dict, optional) – Additional keyword arguments for labels passed to matplotlib.pyplot.text()

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the scatterplot

Return type

Axes

pymodulon.plotting.plot_gene_weights(ica_data, imodulon, by='start', xaxis=None, xname='', **kwargs)[source]

Plot gene weights on a scatter plot.

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon (int or str) – iModulon name

  • by ('log-tpm-norm', 'length', or 'start') – Property to plot on x-axis. Superceded by xaxis

  • xaxis (list, dict or Series, optional) – Values on custom x-axis

  • xname (str, optional) – Name of x-axis if using custom x-axis

  • **kwargs – Additional keyword arguments passed to pymodulon.plotting.scatterplot()

Returns

axAxes containing the scatterplot

Return type

Axes

pymodulon.plotting.compare_gene_weights(ica_data, imodulon1, imodulon2, ica_data2=None, ortho_file=None, use_org1_names=True, **kwargs)[source]

Create a scatterplot comparing the gene weights between two iModulons

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon1 (int or str) – Name of iModulon on X-axis

  • imodulon2 (int or str) – Name of iModulon on X-axis

  • ica_data2 (IcaData, optional) – IcaData object for second iModulon (if comparing iModulons across objects)

  • ortho_file (str, optional) – Path to orthology file between organisms

  • use_org1_names (bool) – If true, use gene names from first organism. If false, use gene names from second organism (default: True)

  • **kwargs – Additional keyword arguments passed to pymodulon.plotting.scatterplot()

Returns

axAxes containing the scatterplot

Return type

Axes

pymodulon.plotting.compare_expression(ica_data, gene1, gene2, **kwargs)[source]

Create a scatterplot comparing the compendium-wide expression profiles of two genes

Parameters
Returns

axAxes containing the scatterplot

Return type

Axes

pymodulon.plotting.compare_activities(ica_data, imodulon1, imodulon2, **kwargs)[source]

Create a scatterplot comparing the compendium-wide activities of two iModulons

Parameters
Returns

axAxes containing the scatterplot

Return type

Axes

pymodulon.plotting.plot_dima(ica_data, sample1, sample2, threshold=5, fdr=0.1, label=True, adjust=True, table=False, **kwargs)[source]

Plots a DiMA plot between two projects or two sets of samples

Parameters
  • ica_data (IcaData) – IcaData object

  • sample1 (list or str) – List of sample IDs or name of “project:condition” for x-axis

  • sample2 (list or str) – List of sample IDs or name of “project:condition” for y-axis

  • threshold (float) – Minimum activity difference to determine DiMAs (default: 5)

  • fdr (float) – False detection rate (default: 0.1)

  • label (bool) – Label differentially activated iModulons (default: True)

  • adjust (bool) – Automatically adjust labels (default: True)

  • table (bool) – Return differential iModulon activity table (default: False)

  • **kwargs – Additional keyword arguments passed to pymodulon.plotting.scatterplot()

Returns

  • ax (~matplotlib.axes.Axes) – Axes containing the scatterplot

  • df_diff (~pandas.DataFrame, optional) – Table reporting differentially activated iModulons

pymodulon.plotting.plot_explained_variance(ica_data, pc=True, ax=None)[source]

Plots the cumulative explained variance for independent components and, optionally, principal components

Parameters
  • ica_data (IcaData) – IcaData object

  • pc (bool) – If True, plot cumulative explained variance of independent components

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

Returns

axAxes containing the line plot

Return type

Axes

pymodulon.plotting.compare_imodulons_vs_regulons(ica_data, imodulons=None, cat_column=None, size_column=None, scale=1, reg_only=True, xlabel=None, ylabel=None, vline=0.6, hline=0.6, ax=None, scatter_kwargs=None, ax_font_kwargs=None, legend_kwargs=None)[source]

Compare the overlaps between iModulons and their linked regulons

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulons (list, optional) – List of iModulons to plot

  • cat_column (str, optional) – Column in the imodulon_table that stores the category of each iModulon

  • size_column (str, optional) – Column in the imodulon_table that stores the size of each iModulon

  • scale (float, optional (default: 1)) – Value used to scale the size of each point

  • reg_only (bool (default: True)) – Only plot iModulons with an entry in the regulator column of the imodulon_table

  • xlabel (str, optional) – Custom x-axis label (default: “# shared genes/Regulon size”)

  • ylabel (str, optional) – Custom y-axis label (default: “# shared genes/iModulon size”)

  • vline (float, optional (default: 0.6)) – Draw a dashed vertical line

  • hline (float, optional (default: 0.6)) – Draw a dashed horizontal line

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • scatter_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.scatter()

  • ax_font_kwargs (dict, optional) – Additional keyword arguments for axes labels

  • legend_kwargs (dict, optional) – Additional keyword arguments passed to matplotlib.pyplot.legend()

Returns

axAxes containing the line plot

Return type

Axes

pymodulon.plotting.cluster_activities(ica_data, correlation_method='spearman', distance_threshold=None, show_thresholding=False, show_clustermap=True, show_best_clusters=False, n_best_clusters='auto', cluster_names=None, return_clustermap=False, dimca_sample1=None, dimca_sample2=None, dimca_threshold=5, dimca_fdr=0.1, dimca_label=True, dimca_adjust=True, dimca_table=False, **dima_kwargs)[source]

Cluster all iModulon activity profiles using hierarchical clustering and display the results using seaborn.clustermap()

Parameters
  • ica_data (IcaData) – IcaData object

  • correlation_method ('pearson', 'spearman', 'kendall', 'mutual_info' or callable) – Method for computing correlations between iModulon activities. See pandas.DataFrame.corr() Default is ‘spearman’.

  • distance_threshold (float, optional) – A distance from 0 to 1 to define flat clusters from the hierarchical clustering. Larger values yield fewer clusters. If None, automatic selection of optimal threshold will occur by maximizing the silhouette score across iModulons. (default: None)

  • show_thresholding (bool) – Show the plot of distance thresholds vs. silhouette scores (default: False)

  • show_clustermap (bool) – Show the clustermap (default: True)

  • show_best_clusters (bool) – Show individual clusters below complete clustermap

  • n_best_clusters (int or str) – Number of best clusters to show. If ‘auto’, only clusters with above-average silhouette scores are shown

  • cluster_names (dict, optional) – A dictionary mapping cluster indices to names, usually used after clustering has been performed previously.

  • return_clustermap (bool) – Return the axis containing the clustermap

  • dimca_sample1 (list or str) – List of sample IDs or name of “project:condition” of reference samples for Differential iModulon Cluster Analysis (DiMCA)

  • dimca_sample2 (list or str) – List of sample IDs or name of “project:condition” of target samples for DiMCA

  • dimca_threshold (float) – Minimum activity difference to determine DiMCAs

  • dimca_fdr (float) – False detection rate for DiMCA

  • dimca_label (bool) – Label differentially activated imodulon clusters (default: True)

  • dimca_adjust (bool) – Auto-adjust DiMCA cluster labels (default: True)

  • dimca_table (bool) – Return DiMCA table (default: False)

  • **dima_kwargs (dict, optional) – Additional keyword arguments passed to pymodulon.plotting.dima()

Returns

  • clusters (~sklearn.cluster.AgglomerativeClustering) – sklearn.cluster.AgglomerativeClustering of activity matrix

  • cg (~seaborn.matrix.ClusterGrid, optional) – ClusterGrid containing the clusterplot

  • dimca_ax (~matplotlib.axes.Axes, optional) – Axes containing the DiMCA scatterplot

  • dimca_table (~pandas.DataFrame, optional) – Table of differentially activated iModulon clusters

pymodulon.plotting.metadata_boxplot(ica_data, imodulon, show_points=True, n_boxes=3, samples=None, strip_conc=True, ignore_cols=None, use_cols=None, return_results=False, ax=None, box_kwargs=None, strip_kwargs=None, swarm_kwargs=None)[source]

Uses a decision tree regressor to automatically cluster iModulon activities using metadata. Displays results as a box plot.

Parameters
  • ica_data (IcaData) – IcaData object

  • imodulon (int or str) – iModulon name

  • show_points (bool or str (default: True)) – Overlay individual points on top of the boxplot. By default, this uses seaborn.stripplot(). show_points=’swarm’ will use :func`seaborn.swarmplot`.

  • n_boxes (int) – Number of boxes to create

  • samples (list, optional) – Subset of samples to analyze

  • strip_conc (bool) – Remove concentrations from metadata (e.g. “glucose(2g/L)” would be interpreted as just “glucose”)

  • ignore_cols (list, optional) – List of columns to ignore. If None, only “project” and “condition” are ignored

  • use_cols (list, optional) – List of columns to use. This supercedes ignore_cols.

  • return_results (bool) – Return a dataframe describing the classifications

  • ax (Axes, optional) – Axes object to plot on, otherwise use current Axes

  • box_kwargs (dict, optional) – Additional keyword arguments passed to seaborn.boxplot()

  • strip_kwargs (dict, optional) – Additional keyword arguments passed to seaborn.stripplot()

  • swarm_kwargs (dict, optional) – Additional keyword arguments passed to seaborn.swarmplot()

Returns

  • ax (~matplotlib.axes.Axes) – Axes containing the boxplot

  • df_classes (~pd.DataFrame) – Metadata classifications of the samples

pymodulon.plotting._encode_metadata(ica_data, samples, use_cols=None, ignore_cols=None, strip_conc=True)[source]
pymodulon.plotting._train_classifier(component, features, max_leaf_nodes=3)[source]
pymodulon.plotting._get_labels_from_tree(clf, encoding)[source]
pymodulon.plotting._get_sample_leaves(clf, features, labels, component)[source]
pymodulon.plotting._fit_line(x, y, ax, metric)[source]
pymodulon.plotting._get_fit(x, y)[source]
pymodulon.plotting._broken_line(x, A, B, C)[source]
pymodulon.plotting._solid_line(x, A, B)[source]
pymodulon.plotting._adj_r2(f, x, y, params)[source]
pymodulon.plotting._mod_freedman_diaconis(ica_data, imodulon)[source]

Generates bins using optimal bin width estimate using the Freedman-Diaconis rule

Parameters
Returns

bins – Numpy array of bins

Return type

ndarray

pymodulon.plotting._set_xaxis(xaxis, y)[source]

Implements experimental xaxis param from plot_gene_weights. This allows for users to generate a scatterplot comparing gene weight on the y-axis with any collection of numbers on the x-axis (as long as the lengths match).

Parameters
  • xaxis (list, set, tuple, dict, np.array, pd.Series) – Any collection or mapping of numbers (plots on x-axis)

  • y (pd.Series) – pandas Series of Gene Weights to be plotted on the y-axis of plot_gene_weights

Returns

x – Returns a pd.Series to be used as the x-axis data-points for generating the plot_gene_weights scatter-plot.

Return type

pd.Series

pymodulon.util

General utility functions for the pymodulon package

Module Contents
Functions

_check_table(table, name, index=None, index_col=0)

_check_table_helper(table, index, name)

_check_dict(table, index_col=0)

compute_threshold(ic, dagostino_cutoff)

Computes D’agostino-test-based threshold for a component of an M matrix

dima(ica_data, sample1, sample2, threshold=5, fdr=0.1, alternate_A=None)

Creates DIMA table of differentially expressed iModulons

_parse_sample(ica_data, sample)

Parses sample inputs into a list of sample IDs

explained_variance(ica_data, genes=None, samples=None, imodulons=None, reference=None)

Computes the fraction of variance explained by iModulons (from 0 to 1)

infer_activities(ica_data, data)

Infer iModulon activities for external data

mutual_info_distance(x, y)

mi(x, y, z=None, k=3, base=2, alpha=0)

Mutual information of x and y (conditioned on z if z is not None)

entropy(x, k=3, base=2)

The classic K-L k-nearest neighbor continuous entropy estimator

add_noise(x, intens=1e-10)

Small noise to break degeneracy, see doc.

build_tree(points)

query_neighbors(tree, x, k)

avgdigamma(points, dvec)

This part finds number of neighbors in some radius in the marginal space

lnc_correction(tree, points, k, alpha)

count_neighbors(tree, x, r)

pymodulon.util._check_table(table, name, index=None, index_col=0)[source]
pymodulon.util._check_table_helper(table, index, name)[source]
pymodulon.util._check_dict(table, index_col=0)[source]
pymodulon.util.compute_threshold(ic, dagostino_cutoff)[source]

Computes D’agostino-test-based threshold for a component of an M matrix

Parameters
  • ic (Series) – Pandas Series containing an independent component

  • dagostino_cutoff (int) – Minimum D’agostino test statistic value to determine threshold

Returns

iModulon threshold – List of thresholds for each iModulon

Return type

list

pymodulon.util.dima(ica_data, sample1, sample2, threshold=5, fdr=0.1, alternate_A=None)[source]

Creates DIMA table of differentially expressed iModulons

Parameters
  • ica_data (IcaData) – IcaData data object

  • sample1 (str or list) – List of sample IDs or name of “project:condition”

  • sample2 (str or list) – List of sample IDs or name of “project:condition”

  • threshold (float) – Minimum activity difference to determine DiMAs (default = 5)

  • fdr (float) – False Detection Rate (default = .1)

  • alternate_A (DataFrame) – Alternate A to use (default = None)

Returns

results – Table of differentially expressed iModulons

Return type

DataFrame

pymodulon.util._parse_sample(ica_data, sample)[source]

Parses sample inputs into a list of sample IDs

Parameters
  • ica_data (IcaData) – IcaData data object

  • sample (list) – Sequence of sample IDs or “project:condition”

Returns

samples – A list of samples

Return type

list

pymodulon.util.explained_variance(ica_data, genes=None, samples=None, imodulons=None, reference=None)[source]

Computes the fraction of variance explained by iModulons (from 0 to 1)

Parameters
  • ica_data (IcaData) – IcaData data object

  • genes (str or list, optional) – List of genes to use (default: all genes)

  • samples (str or list, optional) – List of samples to use (default: all samples)

  • imodulons (int or str or list, optional) – List of iModulons to use (default: all iModulons)

  • reference (list, optional) – List of samples that represent the reference condition for the set. If none are provided, uses the dataset-specific reference condition.

Returns

Fraction of variance explained by selected iModulons for selected genes/samples

Return type

float

pymodulon.util.infer_activities(ica_data, data)[source]

Infer iModulon activities for external data

Parameters
  • ica_data (IcaData) – IcaData data object

  • data (DataFrame) – External expression profiles (must be centered to a reference)

Returns

new_activities – Inferred activities for the expression profiles

Return type

DataFrame

pymodulon.util.mutual_info_distance(x, y)[source]
pymodulon.util.mi(x, y, z=None, k=3, base=2, alpha=0)[source]

Mutual information of x and y (conditioned on z if z is not None) x, y should be a list of vectors, e.g. x = [[1.3], [3.7], [5.1], [2.4]] if x is a one-dimensional scalar and we have four samples

pymodulon.util.entropy(x, k=3, base=2)[source]

The classic K-L k-nearest neighbor continuous entropy estimator x should be a list of vectors, e.g. x = [[1.3], [3.7], [5.1], [2.4]] if x is a one-dimensional scalar and we have four samples

pymodulon.util.add_noise(x, intens=1e-10)[source]

Small noise to break degeneracy, see doc.

pymodulon.util.build_tree(points)[source]
pymodulon.util.query_neighbors(tree, x, k)[source]
pymodulon.util.avgdigamma(points, dvec)[source]

This part finds number of neighbors in some radius in the marginal space returns expectation value of <psi(nx)>

pymodulon.util.lnc_correction(tree, points, k, alpha)[source]
pymodulon.util.count_neighbors(tree, x, r)[source]
Package Contents
pymodulon.__version__ = 0.2.1[source]
1

Created with sphinx-autoapi

Works Cited

The following is a list of all references cited throughout the documentation.

PTS+20

Saugat Poudel, Hannah Tsunemoto, Yara Seif, Anand V. Sastry, Richard Szubin, Sibei Xu, Henrique Machado, Connor A. Olson, Amitesh Anand, Joe Pogliano, Victor Nizet, and Bernhard O. Palsson. Revealing 29 sets of independently modulated genes inStaphylococcus aureus, their regulators, and role in key physiological response. Proceedings of the National Academy of Sciences, 117(29):17228–17239, July 2020. URL: https://doi.org/10.1073/pnas.2008413117, doi:10.1073/pnas.2008413117.

RDS+20

Kevin Rychel, Katherine Decker, Anand V Sastry, Patrick V Phaneuf, Saugat Poudel, and Bernhard O Palsson. iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning. Nucleic Acids Research, 49(D1):D112–D120, October 2020. URL: https://doi.org/10.1093/nar/gkaa810, doi:10.1093/nar/gkaa810.

RSP20

Kevin Rychel, Anand V. Sastry, and Bernhard O. Palsson. Machine learning uncovers independently regulated modules in the bacillus subtilis transcriptome. Nature Communications, December 2020. URL: https://doi.org/10.1038/s41467-020-20153-9, doi:10.1038/s41467-020-20153-9.

SGS+19

Anand V. Sastry, Ye Gao, Richard Szubin, Ying Hefner, Sibei Xu, Donghyuk Kim, Kumari Sonal Choudhary, Laurence Yang, Zachary A. King, and Bernhard O. Palsson. The escherichia coli transcriptome mostly consists of independently regulated modules. Nature Communications, December 2019. URL: https://doi.org/10.1038/s41467-019-13483-w, doi:10.1038/s41467-019-13483-w.

Indices and tables