4. Gene enrichment analysis
[1]:
from pymodulon.enrichment import *
from pymodulon.example_data import load_ecoli_data, trn
ica_data = load_ecoli_data()
4.1. General functions
To perform a basic enrichment test between two gene sets, use the compute_enrichment
function. Optional arguments:
label
: Label for your target gene set (e.g. regulator name or GO term)
[2]:
gene_set = ['b0002','b0003','b0004'] # e.g. an iModulon or other gene set
target_genes = ['b0002','b0005','b0007'] # e.g. a regulon or genes in a COG category
all_genes = ica_data.gene_names # List of all genes in your expression dataset
[3]:
compute_enrichment(gene_set,target_genes,all_genes)
[3]:
pvalue 0.002293
precision 0.333333
recall 0.333333
f1score 0.333333
TP 1.000000
target_set_size 3.000000
gene_set_size 3.000000
dtype: float64
To perform an enrichment test against a regulon in a TRN table, use the compute_regulon
function.
[4]:
compute_regulon_enrichment(gene_set,'lrp',all_genes,trn)
[4]:
pvalue 0.000131
precision 1.000000
recall 0.015000
f1score 0.029557
TP 3.000000
regulon_size 200.000000
gene_set_size 3.000000
n_regs 1.000000
Name: lrp, dtype: float64
Each row of the TRN table represents a regulatory interaction. The table requires the columns regulator
and gene_id
.
[5]:
trn.head()
[5]:
regulator | gene_id | effect | |
---|---|---|---|
0 | FMN | b3041 | - |
1 | L-tryptophan | b3708 | + |
2 | L-tryptophan | b3709 | + |
3 | TPP | b0066 | - |
4 | TPP | b0067 | - |
To perform an enrichment against all regulons in a TRN table, use the compute_trn_enrichment
function. This function includes false detection correction using the Benjamini-Hochberg procedure to compute Q-values. By default, the false detection rate is 1e-5
, and can be updated using the fdr
argument.
[6]:
compute_trn_enrichment(gene_set,all_genes,trn)
[6]:
pvalue | precision | recall | f1score | TP | regulon_size | gene_set_size | n_regs | qvalue | |
---|---|---|---|---|---|---|---|---|---|
thr-tRNA | 0.000000e+00 | 1.0 | 1.000000 | 1.000000 | 3.0 | 3.0 | 3.0 | 1.0 | 0.000000e+00 |
ile-tRNA | 8.354256e-09 | 1.0 | 0.333333 | 0.500000 | 3.0 | 9.0 | 3.0 | 1.0 | 2.923990e-08 |
lrp | 1.306248e-04 | 1.0 | 0.015000 | 0.029557 | 3.0 | 200.0 | 3.0 | 1.0 | 3.047911e-04 |
arcA | 1.540883e-03 | 1.0 | 0.006608 | 0.013129 | 3.0 | 454.0 | 3.0 | 1.0 | 2.157237e-03 |
fnr | 1.411995e-03 | 1.0 | 0.006803 | 0.013514 | 3.0 | 441.0 | 3.0 | 1.0 | 2.157237e-03 |
crp | 3.335561e-03 | 1.0 | 0.005111 | 0.010169 | 3.0 | 587.0 | 3.0 | 1.0 | 3.891487e-03 |
compute_trn_enrichment
can perform enrichment against complex regulons in boolean combinations. The max_regs
argument determines the number of regulators to include in the complex regulon, and the method
argument determines how to combine complex regulons. method='or'
computes enrichments against the union of regulons, 'and'
computes enrichment against intersection of regulons, and 'both'
performs both tests (default).
[7]:
compute_trn_enrichment(gene_set,all_genes,trn,max_regs=2).head()
[7]:
pvalue | precision | recall | f1score | TP | regulon_size | gene_set_size | n_regs | qvalue | |
---|---|---|---|---|---|---|---|---|---|
fnr+ile-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 0.0 |
arcA+ile-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
lrp+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
arcA+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 |
Increasing the number of regulators for complex regulons beyond three can lead to a combinatorial explosion. If you are sure you want to search for larger complex regulons, use force=True
[8]:
compute_trn_enrichment(gene_set,all_genes,trn,max_regs=3,force=True).head()
[8]:
pvalue | precision | recall | f1score | TP | regulon_size | gene_set_size | n_regs | qvalue | |
---|---|---|---|---|---|---|---|---|---|
arcA+ile-tRNA+lrp | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
arcA+ile-tRNA+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
arcA+lrp+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
arcA+rpoD+thr-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
crp+fnr+ile-tRNA | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 |
4.2. Using the IcaData
object
[9]:
# Compare a single iModulon to a simple regulon
ica_data.compute_regulon_enrichment('GlpR',regulator='glpR')
[9]:
pvalue 0
precision 1
recall 1
f1score 1
TP 9
regulon_size 9
imodulon_size 9
n_regs 1
Name: glpR, dtype: int64
[10]:
# Compare a single iModulon to a complex regulon
ica_data.compute_regulon_enrichment('Copper','cusR/cueR')
[10]:
pvalue 5.471346e-20
precision 1.000000e+00
recall 4.210526e-01
f1score 5.925926e-01
TP 8.000000e+00
regulon_size 1.900000e+01
imodulon_size 8.000000e+00
n_regs 2.000000e+00
Name: cusR/cueR, dtype: float64
[11]:
# Get all significant enrichments for all iModulons.
# Optional parameters are same as the global compute_trn_enrichment function
ica_data.compute_trn_enrichment()
[11]:
imodulon | regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AllR/AraC/FucR | araC | 3.804040e-14 | 3.423636e-13 | 0.388889 | 0.500000 | 0.437500 | 7.0 | 14.0 | 18.0 | 1.0 |
1 | AllR/AraC/FucR | allR | 3.067669e-13 | 1.380451e-12 | 0.333333 | 0.666667 | 0.444444 | 6.0 | 9.0 | 18.0 | 1.0 |
2 | AllR/AraC/FucR | fucR | 2.316812e-11 | 6.950436e-11 | 0.277778 | 0.714286 | 0.400000 | 5.0 | 7.0 | 18.0 | 1.0 |
3 | AllR/AraC/FucR | rpoD | 5.383749e-07 | 1.211343e-06 | 0.944444 | 0.011668 | 0.023051 | 17.0 | 1457.0 | 18.0 | 1.0 |
4 | AllR/AraC/FucR | crp | 8.866740e-07 | 1.596013e-06 | 0.666667 | 0.020443 | 0.039669 | 12.0 | 587.0 | 18.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
181 | nitrate-related | narP | 9.268494e-13 | 6.024521e-12 | 0.571429 | 0.163265 | 0.253968 | 8.0 | 49.0 | 14.0 | 1.0 |
182 | nitrate-related | nsrR | 3.762939e-09 | 1.630607e-08 | 0.500000 | 0.086420 | 0.147368 | 7.0 | 81.0 | 14.0 | 1.0 |
183 | nitrate-related | fnr | 8.576894e-09 | 2.787490e-08 | 0.785714 | 0.024943 | 0.048352 | 11.0 | 441.0 | 14.0 | 1.0 |
184 | translation | fnr | 2.865424e-08 | 4.584679e-07 | 0.583333 | 0.031746 | 0.060215 | 14.0 | 441.0 | 24.0 | 1.0 |
185 | translation | arcA | 4.244932e-07 | 3.395946e-06 | 0.541667 | 0.028634 | 0.054393 | 13.0 | 454.0 | 24.0 | 1.0 |
186 rows × 11 columns
4.2.1. Save to sample table
You can directly save iModulon enrichments to the sample table using either the ica_data.compute_regulon_enrichment
function or ica_data.compute_trn_enrichment
function. Columns unrelated to regulon enrichments (e.g. Function
and Category
) are left alone.
[12]:
ica_data.imodulon_table.loc[['GlpR']]
[12]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | |
---|---|---|---|---|---|---|---|---|---|---|
GlpR | glpR | 1.0 | 0.0 | 1.0 | 1.0 | 9.0 | 9 | 1 | Carbon Source Utilization | 0.066801 |
[13]:
ica_data.compute_regulon_enrichment('GlpR',regulator='crp',save=True)
ica_data.imodulon_table.loc[['GlpR']]
[13]:
regulator | f1score | pvalue | precision | recall | TP | n_genes | n_tf | Category | threshold | regulon_size | imodulon_size | n_regs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GlpR | crp | 0.030201 | 3.567813e-08 | 1.0 | 0.015332 | 9.0 | 9 | 1 | Carbon Source Utilization | 0.066801 | 587.0 | 9.0 | 1.0 |
[14]:
# The enrichment with the lowest qvalue is saved as the official enrichment
ica_data.compute_trn_enrichment(save=True)
ica_data.imodulon_table
[14]:
regulator | pvalue | qvalue | precision | recall | f1score | TP | regulon_size | imodulon_size | n_regs | Category | threshold | n_tf | n_genes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AllR/AraC/FucR | araC | 3.804040e-14 | 3.423636e-13 | 0.388889 | 0.500000 | 0.437500 | 7.0 | 14.0 | 18.0 | 1.0 | Carbon Source Utilization | 0.086996 | 3 | 18 |
ArcA-1 | arcA | 6.418412e-20 | 2.374813e-18 | 0.660000 | 0.072687 | 0.130952 | 33.0 | 454.0 | 50.0 | 1.0 | Energy Metabolism | 0.058051 | 1 | 50 |
ArcA-2 | appY | 7.019326e-18 | 1.895218e-16 | 0.320000 | 0.888889 | 0.470588 | 8.0 | 9.0 | 25.0 | 1.0 | Energy Metabolism | 0.081113 | 1 | 25 |
ArgR | argR | 6.026074e-18 | 7.833897e-17 | 0.923077 | 0.098361 | 0.177778 | 12.0 | 122.0 | 13.0 | 1.0 | Amino Acid and Nucleotide Biosynthesis | 0.080441 | 1 | 13 |
AtoC | atoC | 1.522277e-12 | 1.217822e-11 | 0.666667 | 1.000000 | 0.800000 | 4.0 | 4.0 | 6.0 | 1.0 | Miscellaneous Metabolism | 0.105756 | 1 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
uncharacterized-4 | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uncharacterized | 0.074349 | 0 | 10 |
uncharacterized-5 | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uncharacterized | 0.079179 | 0 | 4 |
uncharacterized-6 | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Uncharacterized | 0.052578 | 0 | 54 |
ydcI-KO | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Genomic Alterations | 0.099229 | 0 | 3 |
yheO-KO | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Genomic Alterations | 0.078511 | 0 | 16 |
92 rows × 14 columns
[ ]: