7. Comparing iModulons

compare_ica is a function specifically built to compare the components between different ICA runs. The function will return a Digraph plot which shows connections between ICA components of different runs, as well as Dictionary List of the one-to-one connections.

The function requires a minimum of two inputs if running a comparison beween two ICA runs for the same organism:

  • S1: Pandas Dataframe of M matrix 1

  • S2: Pandas Dataframe of M Matrix 2

compare_ica can be found in pymodulon.compare

[1]:
from pymodulon import example_data
from pymodulon.compare import *

7.1. Comparsion within the same organism

[2]:
ecoli_data = example_data.load_ecoli_data()

We will create a copy of the E. coli M matrix that has slightly different iModulon names.

[3]:
new_M = ecoli_data.M.copy()
new_M.columns = ['run2_'+str(i) for i in new_M.columns]

For this example, we will only compare the first 5 iModulons.

[4]:
links,dots = compare_ica(ecoli_data.M.iloc[:,:5],new_M.iloc[:,:5])

The comparison data are stored in links. The first two values in each element of links are the iModulon name from the first and second datasets respectively. The final value is the Pearson R correlation between the Independent Component gene weights.

[5]:
links
[5]:
[('AllR/AraC/FucR', 'run2_AllR/AraC/FucR', 1.0),
 ('ArcA-1', 'run2_ArcA-1', 1.0),
 ('ArcA-2', 'run2_ArcA-2', 1.0),
 ('ArgR', 'run2_ArgR', 1.0),
 ('AtoC', 'run2_AtoC', 1.0)]

The graphical comparison is saved as dots. The thickness of the arrows represents how similar the iModulons in the first dataset are to the dataset in the second dataset. Since we are comparing two copies of the same dataset, all arrows are at maximum thickness.

[6]:
dots
[6]:
../_images/tutorials_comparing_imodulons_11_0.svg

7.2. Comparison across organisms

compare_ica can also be used to compare ICA components between different organisms. To do so, you will need to provide an addition input, which is a string path to a Bidirectional Best Hit (BBH) orthology between your two organisms of interest.

[7]:
example_bbh = example_data.load_example_bbh()
example_bbh.head()
[7]:
gene subject PID alnLength mismatchCount gapOpenCount queryStart queryEnd subjectStart subjectEnd eVal bitScore gene_length COV BBH
3 b0003 USA300HOU_RS06760 26.897 290 183 8 1 279 4 275 2.060000e-25 99.8 310 0.935484 <=>
6 b0007 USA300HOU_RS06905 40.638 470 257 6 1 455 1 463 3.650000e-114 342.0 476 0.987395 <=>
9 b0014 USA300HOU_RS08425 55.818 636 251 4 1 636 1 606 0.000000e+00 678.0 638 0.996865 <=>
10 b0015 USA300HOU_RS08420 46.982 381 187 5 1 371 1 376 8.040000e-110 323.0 376 1.013298 <=>
15 b0023 USA300HOU_RS08455 42.353 85 36 3 1 79 1 78 1.530000e-09 46.6 87 0.977011 <=>

You can also provide a path to your own orthology file, as long as it contains the columns gene and target

It is important that the order of the organism matches the order of the orthology file. Specifically, the organism for the M1 file must be the organism that fills the gene column in the orthology file.

[8]:
staph_data = example_data.load_staph_data()
[9]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
                         ortho_file = example_bbh)
dots
[9]:
../_images/tutorials_comparing_imodulons_16_0.svg

Here it is much more clear that some iModulons are more similar between species, whereas others are loosely connected. Arrows can also indicate if an iModulon is composed of multiple iModulons from another dataset.

[10]:
links[:5]
[10]:
[('AllR/AraC/FucR', 'PyrR', 0.2615197820547893),
 ('Cbl+CysB', 'CymR', 0.26945414208083),
 ('Crp-1', 'CcpA-1', 0.2576532196155268),
 ('CsqR', 'LacR', 0.4172511239115131),
 ('CysB', 'CymR', 0.3105778504675501)]

Additionally, you can adjust the metric type and cutoff values for correlations by utilizing the method and cutoff parameters. The default for method is "pearson", but can be changed to `"spearman".

cutoff takes a float value between 0 and 1 and will only show correlations greater than the cutoff value (default: 0.3).

[11]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
                         ortho_file = example_bbh,
                         cutoff = 0.6)
dots
[11]:
../_images/tutorials_comparing_imodulons_20_0.svg
[12]:
links,dots = compare_ica(ecoli_data.M,staph_data.M,
                         ortho_file = example_bbh,
                         cutoff = 0.15, method = 'spearman')
dots
[12]:
../_images/tutorials_comparing_imodulons_21_0.svg

7.3. Create a BBH orthology file

To create your own BBH orthology file, pymodulon.compare contains a set of functions that will allow you to create the desired bbh.csv file for you specific comparison. This only needs to be done once per comparison.

In order to make the BBH orthology file, you need to obtain the Genbank Full Genome files for your organisms. These can be found at https://www.ncbi.nlm.nih.gov/genome/. Search for both of your organisms/strains and download the “Genbank (full)” file. Once saved into a known location, you can use the make_prots function to convert the Genbank files into protein FASTA files.

make_prots recieves two parameters:

  • gbk: Path to the genbank file (one file per function call)

  • out_path: Path to the protein FASTA files

[13]:
# ecoli_gb= "../pymodulon/test/data/ecoli.gb"
# mtb_gb = "../pymodulon/test/data/mtb.gb"
# ecoli_out = "../pymodulon/test/data/ecoli_prot.fasta"
# mtb_out = "../pymodulon/test/data/mtb_prot.fasta"

# make_prots(ecoli_gb,ecoli_out)
# make_prots(mtb_gb,mtb_out)

7.3.1. Create Genbank database files

In order to run Bidirectional Best Hits between two organisms, it is required to build the necessary Genbank Database files from the newly created protein FASTA files. In order to do so, use the make_prots_db function. You will do this for both organims.

make_prots_db recieves one parameter:

  • fasta_file: String path to protein FASTA file

If there is an error, the function will print out error message. The function will also not execut if the necessary Genbank DB files already exist.

[14]:
# make_prot_db(ecoli_out)
# make_prot_db(mtb_out)

7.3.2. Create BBH CSV

Once the previous steps have been completed, you can now create the BBH CSV file. In order to do so, you can use the get_bbh function.

get_bbh has the following parameters:

  • db1: String path to protein FASTA file (output of make_prots function) for organism 1

  • db2: String path to protein FASTA file (output of make_prots function) for organism 2

  • outdir: String path to output directory, default is “bbh” and will create the directory if it does not exist

  • outname: Default db1_vs_db2_parsed.csv where db[1-2] are the passed arguments name of the csv file where that will save the results

  • mincov: Minimum coverage to call hits in BLAST, must be between 0 and 1

  • evalue: evalue thershold for BLAST hits, Default .001

  • threads: Number of threads to run BLAST, Default 1

  • force: Whether to overwrite existing files or not

  • savefiles: Whether to save files to outdir

The function will return a Pandas DataFrame containing all the BLAST hits between the two genes. The function will also save the DataFrame to a csv file, which you can then use in compare_ica between the two organisms.

[15]:
# get_bbh(ecoli_out,mtb_out, outdir = "../pymodulon/test/data/")
[ ]: