3. Inferring iModulon activities for new data

Re-computing the complete set of iModulons for every new dataset can be computationally intensive. However, once the dataset behind an IcaData object has reached a critical size, you can use that pre-computed IcaData object to infer the iModulon activities of a new dataset. iModulon activities are relative measures; every dataset must have a reference condition against which all other samples are compared.

To compute the new iModulon activities, first load the pre-computed IcaData object.

[1]:

from pymodulon.example_data import load_ecoli_data
ica_data = load_ecoli_data()

Next, load your expression profiles. These should be normalized as Transcripts per Million (TPM) or log-TPM, using whichever read mapping pipeline you prefer.

[2]:

from pymodulon.example_data import load_example_log_tpm
log_tpm = load_example_log_tpm()
log_tpm.head()

[2]:

	Reference_1	Reference_2	Test_1	Test_2
Geneid
b0001	10.473721	10.271944	10.315476	10.808135
b0002	10.260569	10.368555	10.735874	10.726916
b0003	9.920277	10.044224	10.528432	10.503092
b0004	9.936694	10.010638	9.739519	9.722997
b0005	7.027515	7.237449	6.745798	6.497823

Next, make sure your dataset uses the same gene identifiers as the target IcaData object.

[3]:

from matplotlib_venn import venn2
venn2((set(ica_data.gene_names),set(log_tpm.index)), set_labels=['IcaData genes','Dataset genes'])

[3]:

<matplotlib_venn._common.VennDiagram at 0x7f6f9c174650>

../_images/tutorials_inferring_imodulon_activities_for_new_data_7_1.png

Only genes shared between your IcaData object and the new expression profiles will be used to project your data. All other genes will be ignored.
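The overlap shown in the Venn diagram can also be checked numerically. The following is a minimal sketch using plain pandas indexes; the gene lists are hypothetical stand-ins for ica_data.gene_names and log_tpm.index, so swap in your own objects when running it on real data.

```python
import pandas as pd

# Hypothetical identifiers standing in for ica_data.gene_names and log_tpm.index
ica_genes = pd.Index(["b0001", "b0002", "b0003", "b0004"])
dataset_genes = pd.Index(["b0002", "b0003", "b0004", "b0005"])

# Genes present in both sets: only these contribute to the projection
shared = ica_genes.intersection(dataset_genes)

# Genes that would be ignored on either side
missing_from_dataset = ica_genes.difference(dataset_genes)
extra_in_dataset = dataset_genes.difference(ica_genes)

print(f"shared: {len(shared)}")
print(f"only in IcaData: {list(missing_from_dataset)}")
print(f"only in dataset: {list(extra_in_dataset)}")
```

A small shared set relative to the IcaData gene list usually points to a naming mismatch (for example, gene names versus locus tags) rather than genuinely missing genes.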

Then center your dataset on the reference condition by subtracting the mean of the reference replicates from every sample.

[4]:

centered_log_tpm = log_tpm.sub(log_tpm[['Reference_1','Reference_2']].mean(axis=1),axis=0)
centered_log_tpm.head()

[4]:

	Reference_1	Reference_2	Test_1	Test_2
Geneid
b0001	0.100889	-0.100889	-0.057356	0.435303
b0002	-0.053993	0.053993	0.421312	0.412354
b0003	-0.061973	0.061973	0.546181	0.520841
b0004	-0.036972	0.036972	-0.234147	-0.250669
b0005	-0.104967	0.104967	-0.386684	-0.634659
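To make the centering step concrete, here is a small sketch on a toy table. The gene names and values are made up for illustration; the only point is that, after centering, the reference replicates average to zero for every gene.

```python
import pandas as pd

# Toy log-TPM table (hypothetical values): two reference replicates, one test sample
log_tpm_toy = pd.DataFrame(
    {
        "Reference_1": [10.0, 8.0],
        "Reference_2": [12.0, 8.5],
        "Test_1": [11.5, 7.0],
    },
    index=["geneA", "geneB"],
)

reference_cols = ["Reference_1", "Reference_2"]

# Per-gene mean across the reference replicates
reference_mean = log_tpm_toy[reference_cols].mean(axis=1)

# Subtract that mean from every sample, row by row (axis=0 aligns on the gene index)
centered_toy = log_tpm_toy.sub(reference_mean, axis=0)

# Sanity check: the reference replicates now average to zero for each gene
assert centered_toy[reference_cols].mean(axis=1).abs().max() < 1e-12
print(centered_toy)
```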

Finally, use the pymodulon.util.infer_activities function to infer the relative iModulon activities of your dataset.

[5]:

from pymodulon.util import infer_activities

[6]:

activities = infer_activities(ica_data,centered_log_tpm)
activities.head()

/home/docs/checkouts/readthedocs.org/user_builds/pymodulon/envs/stable/lib/python3.7/site-packages/pymodulon/util.py:327: FutureWarning: Index.__and__ operating as a set operation is deprecated, in the future this will be a logical operation matching Series.__and__.  Use index.intersection(other) instead
  shared_genes = ica_data.M.index & data.index

[6]:

	Reference_1	Reference_2	Test_1	Test_2
AllR/AraC/FucR	0.243143	-0.243143	1.028044	0.848571
ArcA-1	-0.157687	0.157687	-2.644027	-2.418106
ArcA-2	0.038248	-0.038248	0.182260	0.039267
ArgR	-0.150147	0.150147	-1.456806	-1.293399
AtoC	0.344893	-0.344893	0.632130	1.075412

All of the plotting functions in pymodulon.plotting can be used on your inferred activities once you add them to a new IcaData object. It is advisable to create a new sample_table with project and condition columns.

[7]:

from pymodulon.core import IcaData
import pandas as pd

[8]:

new_sample_table = pd.DataFrame([['new_data', 'reference']] * 2 + [['new_data', 'test']] * 2,
                                columns=['project', 'condition'],
                                index=log_tpm.columns)
new_sample_table

[8]:

	project	condition
Reference_1	new_data	reference
Reference_2	new_data	reference
Test_1	new_data	test
Test_2	new_data	test
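For datasets with many samples, writing out the rows by hand becomes error-prone. One possible alternative, sketched below, derives the condition from each sample name. It assumes a hypothetical naming convention of the form condition_replicate, as in the example columns above; adapt the parsing to your own sample names.

```python
import pandas as pd

# Sample names following the assumed "<condition>_<replicate>" convention
sample_names = ["Reference_1", "Reference_2", "Test_1", "Test_2"]

# Drop the replicate suffix and lowercase what remains to get the condition label
conditions = [name.rsplit("_", 1)[0].lower() for name in sample_names]

# The scalar project name is broadcast to every row
sample_table_toy = pd.DataFrame(
    {"project": "new_data", "condition": conditions},
    index=sample_names,
)
print(sample_table_toy)
```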

[9]:

new_data = IcaData(ica_data.M,
                   activities,
                   gene_table = ica_data.gene_table,
                   sample_table = new_sample_table,
                   imodulon_table = ica_data.imodulon_table)

[10]:

from pymodulon.plotting import plot_activities

[11]:

plot_activities(new_data,'Fur-1',highlight='new_data')

[11]:

<AxesSubplot:ylabel='Fur-1 iModulon\nActivity'>

../_images/tutorials_inferring_imodulon_activities_for_new_data_19_1.png
