1. Introduction to the IcaData object
The pymodulon.core.IcaData object is at the core of the PyModulon package. This object holds all of the data related to the expression dataset, the iModulons, and their annotations.
[1]:
from pymodulon.core import IcaData
from pymodulon import example_data
from pymodulon.io import save_to_json, load_json_model
1.1. Minimum requirements
The IcaData object only requires two matrices, which are the results of performing Independent Component Analysis (ICA) on an expression dataset. For more information about ICA, see the iModulonDB about page.
M: The iModulon matrix contains the Independent Components (ICs) themselves. Each column represents an IC, and each row contains the weights of one gene across all ICs.
[2]:
M = example_data.M
M.head()
[2]:
AllR/AraC/FucR | ArcA-1 | ArcA-2 | ArgR | AtoC | BW25113 | Cbl+CysB | CdaR | CecR | Copper | ... | thrA-KO | translation | uncharacterized-1 | uncharacterized-2 | uncharacterized-3 | uncharacterized-4 | uncharacterized-5 | uncharacterized-6 | ydcI-KO | yheO-KO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b0002 | -0.010888 | -0.007717 | -0.008502 | -0.012186 | -0.061489 | -0.005599 | -0.007377 | -0.000795 | 0.004331 | 0.001845 | ... | 0.479209 | 0.035685 | 0.024778 | -0.010660 | -0.002123 | -0.004416 | -0.005428 | -0.009219 | -0.004345 | -0.007838 |
b0003 | -0.011467 | 0.003042 | 0.011448 | -0.003685 | -0.006106 | 0.006680 | -0.043512 | 0.005107 | 0.000474 | 0.007650 | ... | 0.011420 | 0.040811 | 0.003324 | -0.008424 | -0.004415 | -0.016126 | -0.016476 | -0.003497 | -0.003583 | 0.003381 |
b0004 | -0.008693 | 0.003944 | 0.012347 | -0.008104 | 0.000585 | 0.003245 | -0.041283 | 0.006390 | 0.004260 | 0.007109 | ... | 0.011339 | 0.036244 | 0.003710 | -0.005212 | 0.000700 | -0.011096 | -0.006140 | -0.003155 | -0.008418 | 0.000129 |
b0005 | 0.006565 | -0.001099 | 0.009415 | -0.008507 | 0.005399 | 0.014748 | -0.009249 | -0.003058 | -0.012649 | -0.002370 | ... | -0.015324 | 0.028972 | 0.023969 | 0.000150 | 0.018497 | 0.009428 | 0.001255 | -0.006890 | -0.028069 | 0.021534 |
b0006 | -0.006011 | 0.009889 | -0.005555 | -0.000152 | -0.002454 | 0.009678 | -0.003456 | 0.002160 | -0.001924 | -0.000628 | ... | -0.005661 | 0.000700 | -0.002538 | -0.006103 | -0.002506 | -0.005077 | -0.004616 | -0.003585 | 0.001607 | 0.001285 |
5 rows × 92 columns
A: The Activity matrix contains the condition-specific activities. Each column represents a sample, and each row contains the activity of one iModulon across all samples.
[3]:
A = example_data.A
A.head()
[3]:
control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | 0.378690 | -0.378690 | 2.457678 | 2.248678 | -0.327344 | -0.259164 | 1.777251 | 2.690655 | 0.656937 | 0.319583 | ... | 1.041336 | 2.203940 | 3.698292 | 0.856998 | 1.557323 | 0.337806 | 0.943742 | 1.736640 | 0.499461 | 1.581476 |
ArcA-1 | -0.440210 | 0.440210 | -5.367360 | -5.684301 | 0.131174 | 0.348843 | -4.436389 | -4.770469 | -1.799113 | -1.474222 | ... | -6.471714 | -6.549861 | -3.109145 | -2.716183 | -2.531192 | -1.461022 | -0.408849 | -0.210397 | -5.700321 | -6.237836 |
ArcA-2 | 0.762258 | -0.762258 | 2.619623 | 2.900696 | 3.120724 | 2.743634 | 1.989803 | 1.555835 | 1.782500 | 1.530811 | ... | 2.789653 | 3.959650 | 1.585147 | 0.811182 | 0.300414 | 2.537535 | 1.061408 | 2.634524 | 0.125513 | 1.178747 |
ArgR | -0.289630 | 0.289630 | -10.085719 | -13.187916 | 2.371129 | 1.861918 | -8.708701 | -7.881588 | -1.237027 | -1.235604 | ... | -11.263744 | -10.366813 | -0.289217 | 0.389228 | -5.142768 | -5.014526 | -3.648777 | -4.125952 | -4.286326 | -5.475940 |
AtoC | 0.250770 | -0.250770 | 1.844767 | 2.055052 | 0.299345 | 0.425502 | 1.801217 | 1.790987 | 0.921254 | 1.410026 | ... | 3.821909 | 3.306573 | 2.652394 | 1.910173 | 0.927772 | 1.327549 | 1.846321 | 0.909667 | 2.064662 | 2.371405 |
5 rows × 278 columns
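The two matrices fit together by shape: M is genes × iModulons and A is iModulons × samples, so in the usual ICA formulation their product has the same shape as the expression matrix (genes × samples). The sketch below illustrates this with tiny made-up numbers and plain Python lists. The values and the helper `matmul` are illustrative assumptions, not PyModulon code and not the example dataset.

```python
# Toy shapes: 3 genes, 2 iModulons, 4 samples (all values are made up).
M = [  # genes x iModulons (gene weights)
    [0.5, 0.0],
    [0.4, 0.1],
    [0.0, 0.3],
]
A = [  # iModulons x samples (activities)
    [1.0, -1.0, 2.0, 0.0],
    [0.0, 3.0, 1.0, -2.0],
]


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    if len(left[0]) != len(right):
        raise ValueError("columns of the left matrix must match rows of the right")
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


X_approx = matmul(M, A)  # genes x samples
print(len(X_approx), "genes x", len(X_approx[0]), "samples")
```

The shape check in `matmul` mirrors the practical requirement that the columns of M and the rows of A refer to the same iModulons.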
To create the IcaData object, the M and A datasets can be entered either as filenames or as Pandas DataFrames.
[4]:
ica_data = IcaData(M,A)
ica_data
[4]:
<pymodulon.core.IcaData at 0x7fc18620f9d0>
Once loaded, the M and A matrices can be accessed directly from the object.
[5]:
ica_data.M.head()
[5]:
AllR/AraC/FucR | ArcA-1 | ArcA-2 | ArgR | AtoC | BW25113 | Cbl+CysB | CdaR | CecR | Copper | ... | thrA-KO | translation | uncharacterized-1 | uncharacterized-2 | uncharacterized-3 | uncharacterized-4 | uncharacterized-5 | uncharacterized-6 | ydcI-KO | yheO-KO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b0002 | -0.010888 | -0.007717 | -0.008502 | -0.012186 | -0.061489 | -0.005599 | -0.007377 | -0.000795 | 0.004331 | 0.001845 | ... | 0.479209 | 0.035685 | 0.024778 | -0.010660 | -0.002123 | -0.004416 | -0.005428 | -0.009219 | -0.004345 | -0.007838 |
b0003 | -0.011467 | 0.003042 | 0.011448 | -0.003685 | -0.006106 | 0.006680 | -0.043512 | 0.005107 | 0.000474 | 0.007650 | ... | 0.011420 | 0.040811 | 0.003324 | -0.008424 | -0.004415 | -0.016126 | -0.016476 | -0.003497 | -0.003583 | 0.003381 |
b0004 | -0.008693 | 0.003944 | 0.012347 | -0.008104 | 0.000585 | 0.003245 | -0.041283 | 0.006390 | 0.004260 | 0.007109 | ... | 0.011339 | 0.036244 | 0.003710 | -0.005212 | 0.000700 | -0.011096 | -0.006140 | -0.003155 | -0.008418 | 0.000129 |
b0005 | 0.006565 | -0.001099 | 0.009415 | -0.008507 | 0.005399 | 0.014748 | -0.009249 | -0.003058 | -0.012649 | -0.002370 | ... | -0.015324 | 0.028972 | 0.023969 | 0.000150 | 0.018497 | 0.009428 | 0.001255 | -0.006890 | -0.028069 | 0.021534 |
b0006 | -0.006011 | 0.009889 | -0.005555 | -0.000152 | -0.002454 | 0.009678 | -0.003456 | 0.002160 | -0.001924 | -0.000628 | ... | -0.005661 | 0.000700 | -0.002538 | -0.006103 | -0.002506 | -0.005077 | -0.004616 | -0.003585 | 0.001607 | 0.001285 |
5 rows × 92 columns
[6]:
ica_data.A.head()
[6]:
control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | 0.378690 | -0.378690 | 2.457678 | 2.248678 | -0.327344 | -0.259164 | 1.777251 | 2.690655 | 0.656937 | 0.319583 | ... | 1.041336 | 2.203940 | 3.698292 | 0.856998 | 1.557323 | 0.337806 | 0.943742 | 1.736640 | 0.499461 | 1.581476 |
ArcA-1 | -0.440210 | 0.440210 | -5.367360 | -5.684301 | 0.131174 | 0.348843 | -4.436389 | -4.770469 | -1.799113 | -1.474222 | ... | -6.471714 | -6.549861 | -3.109145 | -2.716183 | -2.531192 | -1.461022 | -0.408849 | -0.210397 | -5.700321 | -6.237836 |
ArcA-2 | 0.762258 | -0.762258 | 2.619623 | 2.900696 | 3.120724 | 2.743634 | 1.989803 | 1.555835 | 1.782500 | 1.530811 | ... | 2.789653 | 3.959650 | 1.585147 | 0.811182 | 0.300414 | 2.537535 | 1.061408 | 2.634524 | 0.125513 | 1.178747 |
ArgR | -0.289630 | 0.289630 | -10.085719 | -13.187916 | 2.371129 | 1.861918 | -8.708701 | -7.881588 | -1.237027 | -1.235604 | ... | -11.263744 | -10.366813 | -0.289217 | 0.389228 | -5.142768 | -5.014526 | -3.648777 | -4.125952 | -4.286326 | -5.475940 |
AtoC | 0.250770 | -0.250770 | 1.844767 | 2.055052 | 0.299345 | 0.425502 | 1.801217 | 1.790987 | 0.921254 | 1.410026 | ... | 3.821909 | 3.306573 | 2.652394 | 1.910173 | 0.927772 | 1.327549 | 1.846321 | 0.909667 | 2.064662 | 2.371405 |
5 rows × 278 columns
If the M and A datasets have row or column names, these will be saved as the sample/gene/iModulon names. Since genes are often renamed when characterized, the locus tag is the preferred identifier.
[7]:
print('Gene names:',ica_data.gene_names[:5])
print('Sample names:',ica_data.sample_names[:5])
print('iModulon names:',ica_data.imodulon_names[:5])
Gene names: ['b0002', 'b0003', 'b0004', 'b0005', 'b0006']
Sample names: ['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1', 'fur__wt_dpd__2', 'fur__wt_fe__1']
iModulon names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
1.2. Adding the Expression Matrix
The X matrix contains eXpression data and is primarily used for plotting functions. The column names of the X matrix are the sample names, and the row names are the gene identifiers.
[8]:
X = example_data.X
X.head()
[8]:
control__wt_glc__1 | control__wt_glc__2 | fur__wt_dpd__1 | fur__wt_dpd__2 | fur__wt_fe__1 | fur__wt_fe__2 | fur__delfur_dpd__1 | fur__delfur_dpd__2 | fur__delfur_fe2__1 | fur__delfur_fe2__2 | ... | efeU__menFentC_ale29__1 | efeU__menFentC_ale29__2 | efeU__menFentC_ale30__1 | efeU__menFentC_ale30__2 | efeU__menFentCubiC_ale36__1 | efeU__menFentCubiC_ale36__2 | efeU__menFentCubiC_ale37__1 | efeU__menFentCubiC_ale37__2 | efeU__menFentCubiC_ale38__1 | efeU__menFentCubiC_ale38__2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b0002 | -0.061772 | 0.061772 | 0.636527 | 0.819793 | -0.003615 | -0.289353 | -1.092023 | -0.777289 | 0.161343 | 0.145641 | ... | -0.797097 | -0.791859 | 0.080114 | 0.102154 | 0.608180 | 0.657673 | 0.813105 | 0.854813 | 0.427986 | 0.484338 |
b0003 | -0.053742 | 0.053742 | 0.954439 | 1.334385 | 0.307588 | 0.128414 | -0.872563 | -0.277893 | 0.428542 | 0.391761 | ... | -0.309105 | -0.352535 | -0.155074 | -0.077145 | 0.447030 | 0.439881 | 0.554528 | 0.569030 | 0.154905 | 0.294799 |
b0004 | -0.065095 | 0.065095 | -0.202697 | 0.119195 | -0.264995 | -0.546017 | -1.918349 | -1.577736 | -0.474815 | -0.495312 | ... | -0.184898 | -0.225615 | 0.019575 | 0.063986 | 0.483343 | 0.452754 | 0.524828 | 0.581878 | 0.293239 | 0.341040 |
b0005 | 0.028802 | -0.028802 | -0.865171 | -0.951179 | 0.428769 | 0.123564 | -1.660351 | -1.531147 | 0.240353 | -0.151132 | ... | -0.308221 | -0.581714 | 0.018820 | 0.004040 | -1.228763 | -1.451750 | -0.839203 | -0.529349 | -0.413336 | -0.478682 |
b0006 | 0.009087 | -0.009087 | -0.131039 | -0.124079 | -0.144870 | -0.090152 | -0.219917 | -0.046648 | -0.044537 | -0.089204 | ... | 1.464603 | 1.415706 | 1.230831 | 1.165153 | 0.447447 | 0.458852 | 0.421417 | 0.408077 | 1.151066 | 1.198529 |
5 rows × 278 columns
[9]:
ica_data.X = X
1.3. Adding annotation tables
You may load in additional data tables with information about your samples, genes, or iModulons.
These tables are initially empty, but can be altered like any Pandas DataFrame.
[10]:
ica_data.gene_table.head()
[10]:
b0002 |
---|
b0003 |
b0004 |
b0005 |
b0006 |
Annotation tables contain one sample/gene/iModulon per row, and information about the respective item in columns. For example, a gene_table may include the gene function, genomic position, or Cluster of Orthologous Groups (COG) category. See the Creating the Gene Table tutorial for a step-by-step example of how to construct this table. The gene identifiers (row names) must match the gene names in the M matrix.
[11]:
gene_table = example_data.gene_table
gene_table.head()
[11]:
start | end | strand | gene_name | length | operon | COG | accession | |
---|---|---|---|---|---|---|---|---|
b0001 | 189 | 255 | + | thrL | 66 | thrLABC | No COG Annotation | NC_000913.3 |
b0002 | 336 | 2799 | + | thrA | 2463 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0003 | 2800 | 3733 | + | thrB | 933 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0004 | 3733 | 5020 | + | thrC | 1287 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0005 | 5233 | 5530 | + | yaaX | 297 | yaaX | Function unknown | NC_000913.3 |
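Because the gene_table rows must line up with the genes in the M matrix, it can help to compare the two sets of identifiers before loading the table. The following is a minimal sketch using plain dictionaries in place of DataFrames. The helper `align_annotations` and the toy inputs are hypothetical, and how PyModulon itself treats mismatched rows is not covered here.

```python
# Hypothetical inputs: locus tags from M's row names and a small annotation table.
m_genes = ["b0002", "b0003", "b0004"]
annotations = {
    "b0001": {"gene_name": "thrL"},
    "b0002": {"gene_name": "thrA"},
    "b0003": {"gene_name": "thrB"},
}


def align_annotations(genes, table):
    """Return annotations in M's gene order plus the mismatched identifiers."""
    aligned = {gene: table.get(gene, {}) for gene in genes}  # keeps M's order
    extra = sorted(set(table) - set(genes))  # annotated but absent from M
    missing = [gene for gene in genes if gene not in table]  # in M, unannotated
    return aligned, extra, missing


aligned, extra, missing = align_annotations(m_genes, annotations)
print("Not in M:", extra)
print("No annotation:", missing)
```

Reporting both lists makes it easy to spot identifier typos or a gene table built from a different genome annotation.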
The sample_table contains detailed experimental metadata about each sample. This table must be created manually, and can contain information related to the strains or experimental conditions used in the study.
[12]:
sample_table = example_data.sample_table
sample_table.head()
[12]:
Study | project | condition | Replicate # | Strain Description | Strain | Base Media | Carbon Source (g/L) | Nitrogen Source (g/L) | Electron Acceptor | ... | Growth Rate (1/hr) | Evolved Sample | Isolate Type | Sequencing Machine | ALEdb sample | Additional Details | Biological Replicates | Alignment | DOI | GEO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample ID | |||||||||||||||||||||
control__wt_glc__1 | Control | control | wt_glc | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.33 | doi.org/10.1101/080929 | GSE65643 |
control__wt_glc__2 | Control | control | wt_glc | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.24 | doi.org/10.1101/080929 | GSE65643 |
fur__wt_dpd__1 | Fur | fur | wt_dpd | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.04 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_dpd__2 | Fur | fur | wt_dpd | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.30 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_fe__1 | Fur | fur | wt_fe | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 1.06 | No | NaN | MiSeq | NaN | NaN | 2 | 93.35 | doi.org/10.1038/ncomms5910 | GSE54900 |
5 rows × 26 columns
The project and condition columns in the sample_table will be useful for the plotting functions described in the Plotting Functions tutorial.
The imodulon_table contains information about each iModulon, such as regulator enrichments or iModulon size.
[13]:
imodulon_table = example_data.imodulon_table
imodulon_table.head()
[13]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
name | ||||||||||
AllR/AraC/FucR | allR/araC/fucR | 0.750000 | 1.190000e-41 | 1.000000 | 0.600000 | 18.0 | 18 | 3 | Carbon Source Utilization | 0.086996 |
ArcA-1 | arcA | 0.130952 | 6.420000e-20 | 0.660000 | 0.072687 | 33.0 | 50 | 1 | Energy Metabolism | 0.058051 |
ArcA-2 | arcA | 0.087683 | 1.150000e-16 | 0.840000 | 0.046256 | 21.0 | 25 | 1 | Energy Metabolism | 0.081113 |
ArgR | argR | 0.177778 | 6.030000e-18 | 0.923077 | 0.098361 | 12.0 | 13 | 1 | Amino Acid and Nucleotide Biosynthesis | 0.080441 |
AtoC | atoC | 0.800000 | 1.520000e-12 | 0.666667 | 1.000000 | 4.0 | 6 | 1 | Miscellaneous Metabolism | 0.105756 |
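The enrichment columns in this table are numerically related. The sketch below recomputes f1score as the harmonic mean of precision and recall, and precision as TP divided by n_genes, for three rows copied from the table above. Reading TP as the number of true positives and n_genes as the iModulon size is an assumption based on the column names, although it agrees with the values shown.

```python
# Rows copied from the imodulon_table shown above.
rows = {
    "AllR/AraC/FucR": {"precision": 1.0, "recall": 0.6, "TP": 18.0, "n_genes": 18},
    "ArcA-1": {"precision": 0.66, "recall": 0.072687, "TP": 33.0, "n_genes": 50},
    "AtoC": {"precision": 0.666667, "recall": 1.0, "TP": 4.0, "n_genes": 6},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


computed = {}
for name, row in rows.items():
    f1 = f1_score(row["precision"], row["recall"])
    precision_from_counts = row["TP"] / row["n_genes"]
    computed[name] = (f1, precision_from_counts)
    print(name, round(f1, 6), round(precision_from_counts, 6))
```

A low f1score with high precision, as for ArcA-1, indicates an iModulon whose genes mostly belong to the regulon but cover only a small part of it.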
The tables can be loaded into the IcaData object either as filenames or as Pandas DataFrames.
[14]:
ica_data.gene_table = gene_table
ica_data.sample_table = sample_table
ica_data.imodulon_table = imodulon_table
[15]:
ica_data.sample_table.head()
[15]:
Study | project | condition | Replicate # | Strain Description | Strain | Base Media | Carbon Source (g/L) | Nitrogen Source (g/L) | Electron Acceptor | ... | Growth Rate (1/hr) | Evolved Sample | Isolate Type | Sequencing Machine | ALEdb sample | Additional Details | Biological Replicates | Alignment | DOI | GEO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
control__wt_glc__1 | Control | control | wt_glc | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.33 | doi.org/10.1101/080929 | GSE65643 |
control__wt_glc__2 | Control | control | wt_glc | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | NaN | No | NaN | MiSeq | NaN | NaN | 2 | 94.24 | doi.org/10.1101/080929 | GSE65643 |
fur__wt_dpd__1 | Fur | fur | wt_dpd | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.04 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_dpd__2 | Fur | fur | wt_dpd | 2 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 0.00 | No | NaN | MiSeq | NaN | NaN | 2 | 98.30 | doi.org/10.1038/ncomms5910 | GSE54900 |
fur__wt_fe__1 | Fur | fur | wt_fe | 1 | Escherichia coli K-12 MG1655 | MG1655 | M9 | glucose(2) | NH4Cl(1) | O2 | ... | 1.06 | No | NaN | MiSeq | NaN | NaN | 2 | 93.35 | doi.org/10.1038/ncomms5910 | GSE54900 |
5 rows × 26 columns
[16]:
ica_data.gene_table.head()
[16]:
start | end | strand | gene_name | length | operon | COG | accession | |
---|---|---|---|---|---|---|---|---|
b0002 | 336 | 2799 | + | thrA | 2463 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0003 | 2800 | 3733 | + | thrB | 933 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0004 | 3733 | 5020 | + | thrC | 1287 | thrLABC | Amino acid transport and metabolism | NC_000913.3 |
b0005 | 5233 | 5530 | + | yaaX | 297 | yaaX | Function unknown | NC_000913.3 |
b0006 | 5682 | 6459 | - | yaaA | 777 | yaaA | Function unknown | NC_000913.3 |
[17]:
ica_data.imodulon_table.head()
[17]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | allR/araC/fucR | 0.750000 | 1.190000e-41 | 1.000000 | 0.600000 | 18.0 | 18 | 3 | Carbon Source Utilization | 0.086996 |
ArcA-1 | arcA | 0.130952 | 6.420000e-20 | 0.660000 | 0.072687 | 33.0 | 50 | 1 | Energy Metabolism | 0.058051 |
ArcA-2 | arcA | 0.087683 | 1.150000e-16 | 0.840000 | 0.046256 | 21.0 | 25 | 1 | Energy Metabolism | 0.081113 |
ArgR | argR | 0.177778 | 6.030000e-18 | 0.923077 | 0.098361 | 12.0 | 13 | 1 | Amino Acid and Nucleotide Biosynthesis | 0.080441 |
AtoC | atoC | 0.800000 | 1.520000e-12 | 0.666667 | 1.000000 | 4.0 | 6 | 1 | Miscellaneous Metabolism | 0.105756 |
1.4. Converting between gene names and locus tags
If the gene_table contains a gene_name column, the name2num and num2name methods can convert between locus tags and gene names.
[18]:
ica_data.num2name('b0002')
[18]:
'thrA'
[19]:
ica_data.name2num('thrA')
[19]:
'b0002'
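Conceptually, the two conversions are a pair of lookups built from the gene_name column. The sketch below shows that idea with plain dictionaries filled from the gene table rows shown earlier. The helper `build_lookups` is hypothetical and is not how PyModulon necessarily implements these methods.

```python
# Subset of the gene_table shown earlier: locus tag -> gene_name.
gene_names = {"b0002": "thrA", "b0003": "thrB", "b0004": "thrC", "b0005": "yaaX"}


def build_lookups(tag_to_name):
    """Return (tag -> name, name -> tag) dictionaries."""
    name_to_tag = {}
    for tag, name in tag_to_name.items():
        if name in name_to_tag:
            # A name shared by two locus tags cannot be converted back uniquely.
            raise ValueError(f"gene name {name!r} maps to several locus tags")
        name_to_tag[name] = tag
    return dict(tag_to_name), name_to_tag


num2name, name2num = build_lookups(gene_names)
print(num2name["b0002"], name2num["thrA"])
```

The duplicate check reflects why locus tags are the preferred identifier: they are unique, while gene names may not be.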
1.5. Adding the TRN
Adding the transcriptional regulatory network (TRN) to the IcaData object enables automated calculation of regulon enrichments. Each row of the TRN file represents a regulatory interaction. The TRN must contain the following columns:
regulator: Name of the regulator (/ or + characters will be converted to ;)
gene_id: Locus tag of the target gene (must be in ica_data.gene_names)
The following columns are optional, but are helpful to have:
regulator_id: Locus tag of regulator
gene_name: Name of gene (can automatically update this using name2num)
direction: Direction of regulation (+ for activation, - for repression, ? or NaN for unknown)
evidence: Evidence of regulation (e.g. ChIP-exo, qRT-PCR, SELEX, Motif search)
PMID: Reference for regulatory interaction
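The column requirements above can be checked on a TRN before it is attached. The following is a minimal sketch that treats the TRN as a list of dictionaries. The helpers `normalize_regulator` and `check_trn_rows` are hypothetical, and dropping rows whose gene_id is unknown is an assumption made for this illustration rather than documented PyModulon behavior.

```python
import re

REQUIRED_COLUMNS = {"regulator", "gene_id"}


def normalize_regulator(name):
    """Replace '/' and '+' with ';', as described for the regulator column."""
    return re.sub(r"[/+]", ";", name)


def check_trn_rows(rows, gene_names):
    """Validate TRN rows (a list of dicts) and return cleaned copies."""
    cleaned = []
    for row in rows:
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"TRN row is missing columns: {sorted(missing)}")
        if row["gene_id"] not in gene_names:
            continue  # assumption: skip targets that are not in the dataset
        cleaned.append({**row, "regulator": normalize_regulator(row["regulator"])})
    return cleaned


rows = [
    {"regulator": "flhD/flhC", "gene_id": "b2241"},
    {"regulator": "TPP", "gene_id": "b9999"},  # hypothetical locus tag not in the data
]
print(check_trn_rows(rows, gene_names={"b2241"}))
```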
[20]:
trn = example_data.trn
trn.head()
[20]:
regulator | gene_id | effect | |
---|---|---|---|
0 | FMN | b3041 | - |
1 | L-tryptophan | b3708 | + |
2 | L-tryptophan | b3709 | + |
3 | TPP | b0066 | - |
4 | TPP | b0067 | - |
Again, this table can be passed in as either a filename or a Pandas DataFrame.
[21]:
ica_data.trn = trn
ica_data.trn.head()
[21]:
regulator | gene_id | effect | |
---|---|---|---|
0 | FMN | b3041 | - |
1 | L-tryptophan | b3708 | + |
2 | L-tryptophan | b3709 | + |
3 | TPP | b0066 | - |
4 | TPP | b0067 | - |
1.6. Inspecting iModulons
view_imodulon shows the information about each gene in the iModulon. Most information is retrieved from the gene_table, but the regulator column comes from the trn.
[22]:
ica_data.view_imodulon('GlpR')
[22]:
gene_weight | start | end | strand | gene_name | length | operon | COG | accession | regulator | |
---|---|---|---|---|---|---|---|---|---|---|
b2239 | 0.211384 | 2349934 | 2351011 | - | glpQ | 1077 | glpTQ | Energy production and conversion | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2240 | 0.306134 | 2351015 | 2352374 | - | glpT | 1359 | glpTQ | Carbohydrate transport and metabolism | NC_000913.3 | crp,fis,fnr,glpR,ihf,nac,rpoD |
b2241 | 0.375662 | 2352646 | 2354275 | + | glpA | 1629 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2242 | 0.328961 | 2354264 | 2355524 | + | glpB | 1260 | glpABC | Amino acid transport and metabolism | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b2243 | 0.315752 | 2355520 | 2356711 | + | glpC | 1191 | glpABC | Energy production and conversion | NC_000913.3 | arcA,crp,fis,flhD;flhC,fnr,glpR,rpoD |
b3426 | 0.350034 | 3562012 | 3563518 | + | glpD | 1506 | glpD | Energy production and conversion | NC_000913.3 | arcA,crp,glpR,rpoD,yieP |
b3926 | 0.290235 | 4115713 | 4117222 | - | glpK | 1509 | glpFKX | Energy production and conversion | NC_000913.3 | crp,glpR,rpoD |
b3927 | 0.312307 | 4117244 | 4118090 | - | glpF | 846 | glpFKX | Carbohydrate transport and metabolism | NC_000913.3 | crp,glpR,rpoD |
1.7. Searching for genes in iModulons
To find which iModulons contain a specific gene, use the imodulons_with method.
[23]:
ica_data.imodulons_with('b2239')
[23]:
['GlpR']
If the gene_table contains a gene_name column, this function will work with either the locus tag or the gene name.
[24]:
ica_data.imodulons_with('carA')
[24]:
['PurR-2']
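One way to picture this search: a gene belongs to an iModulon when the absolute value of its weight exceeds that iModulon's threshold, and a gene name is first translated to its locus tag. The sketch below assumes that membership rule, which the threshold column of the imodulon_table suggests but this section does not state. The function `find_imodulons` is hypothetical, the two large weights are taken from the tables above, and the small weights and both thresholds are made up.

```python
# iModulon -> {locus tag: gene weight}; small weights are made up for illustration.
weights = {
    "GlpR": {"b2239": 0.211384, "b0002": -0.004},
    "thrA-KO": {"b2239": 0.010, "b0002": 0.479209},
}
thresholds = {"GlpR": 0.09, "thrA-KO": 0.08}  # hypothetical cutoffs
name_to_tag = {"glpQ": "b2239", "thrA": "b0002"}


def find_imodulons(gene):
    """List iModulons in which the gene's absolute weight exceeds the cutoff."""
    tag = name_to_tag.get(gene, gene)  # accept a gene name or a locus tag
    return [
        imodulon
        for imodulon, column in weights.items()
        if abs(column.get(tag, 0.0)) > thresholds[imodulon]
    ]


print(find_imodulons("b2239"))
print(find_imodulons("thrA"))
```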
1.8. Renaming iModulons
Individual iModulons can be renamed using the rename_imodulons method.
[25]:
print('Original iModulon Names:', ica_data.imodulon_names[:5])
ica_data.rename_imodulons({'AllR/AraC/FucR':'AllR'})
print('Renamed iModulon Names:', ica_data.imodulon_names[:5])
Original iModulon Names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Renamed iModulon Names: ['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
These changes are reflected throughout the IcaData object.
[26]:
print('M matrix columns:', ica_data.M.columns[:5])
M matrix columns: Index(['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC'], dtype='object')
iModulon names can be updated all at once as well.
[27]:
print('Original iModulon Names:', ica_data.imodulon_names[:5])
new_names = ['AllR/AraC/FucR']+ica_data.imodulon_names[1:]
print('New iModulon names:', new_names[:5])
ica_data.imodulon_names = new_names
print('Renamed iModulon Names:', ica_data.imodulon_names[:5])
Original iModulon Names: ['AllR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
New iModulon names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
Renamed iModulon Names: ['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC']
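Renaming has to be applied consistently wherever iModulon names act as labels, for example the columns of M and the rows of A. The sketch below shows that bookkeeping on plain lists. The helper `rename_labels` is hypothetical and only illustrates the idea, including two checks (unknown names and duplicate results) that seem sensible for any such rename.

```python
def rename_labels(labels, mapping):
    """Apply a partial {old: new} mapping to a list of labels."""
    unknown = sorted(set(mapping) - set(labels))
    if unknown:
        raise KeyError(f"names not found: {unknown}")
    renamed = [mapping.get(label, label) for label in labels]
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming would create duplicate names")
    return renamed


m_columns = ["AllR/AraC/FucR", "ArcA-1", "ArcA-2"]  # columns of M
a_rows = list(m_columns)  # rows of A carry the same labels
mapping = {"AllR/AraC/FucR": "AllR"}

m_columns = rename_labels(m_columns, mapping)
a_rows = rename_labels(a_rows, mapping)
print(m_columns == a_rows, m_columns)
```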
1.9. Copying IcaData objects
The copy method creates a new IcaData object identical to the old one.
[28]:
ica_data.copy()
[28]:
<pymodulon.core.IcaData at 0x7fc186195650>
1.10. Saving and Loading IcaData Objects
To facilitate data sharing, you can save IcaData objects as JSON files that can be easily reloaded.
[29]:
from pymodulon.io import save_to_json, load_json_model
from os import path
[30]:
filepath = path.join('tmp','ecoli_data.json')
save_to_json(ica_data,filepath)
[31]:
ica_data = load_json_model(filepath)
[32]:
ica_data.imodulon_table.head()
[32]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | allR/araC/fucR | 0.750000 | 1.190000e-41 | 1.000000 | 0.600000 | 18.0 | 18 | 3 | Carbon Source Utilization | 0.086996 |
ArcA-1 | arcA | 0.130952 | 6.420000e-20 | 0.660000 | 0.072687 | 33.0 | 50 | 1 | Energy Metabolism | 0.058051 |
ArcA-2 | arcA | 0.087683 | 1.150000e-16 | 0.840000 | 0.046256 | 21.0 | 25 | 1 | Energy Metabolism | 0.081113 |
ArgR | argR | 0.177778 | 6.030000e-18 | 0.923077 | 0.098361 | 12.0 | 13 | 1 | Amino Acid and Nucleotide Biosynthesis | 0.080441 |
AtoC | atoC | 0.800000 | 1.520000e-12 | 0.666667 | 1.000000 | 4.0 | 6 | 1 | Miscellaneous Metabolism | 0.105756 |
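The save and reload steps follow the usual JSON round-trip pattern. The sketch below shows that pattern with the standard library only, including creating the target folder first, which may or may not be needed with save_to_json. The helpers `save_json` and `load_json` and the toy dictionary are assumptions for illustration. The layout of the file written by PyModulon is not described here.

```python
import json
import os
import tempfile


def save_json(obj, filepath):
    """Write obj as JSON, creating the parent folder if needed."""
    folder = os.path.dirname(filepath)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(filepath, "w") as handle:
        json.dump(obj, handle)


def load_json(filepath):
    """Read a JSON file back into Python objects."""
    with open(filepath) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    filepath = os.path.join(workdir, "tmp", "toy_data.json")
    data = {
        "imodulon_names": ["AllR/AraC/FucR", "ArcA-1"],
        "M": {"b0002": [-0.010888, -0.007717]},
    }
    save_json(data, filepath)
    restored = load_json(filepath)

print(restored == data)
```

Comparing the reloaded object with the original, as in the last line, is a quick check that nothing was lost in the round trip.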