10. Creating an iModulonDB Dashboard

After an iModulon analysis is completed and published, dashboards for every iModulon and gene are made available on the iModulonDB website. The PyModulon package enables generation of all necessary files for any IcaData object that meets a few simple requirements.

10.1. iModulonDB Site Overview

For general information about iModulonDB, visit the about page. Familiarize yourself with the main page types:

Splash page: The main entry point for the site
Dataset page: After choosing a dataset on the splash page, you are directed to the dataset page, which contains a gray dataset metadata box and a table of iModulons.
iModulon page: The iModulon dashboards, which have a gray iModulon metadata box, a gene table, and several plots.
Gene page: Gene dashboards, which have an iModulon table connecting them to iModulons and an expression plot.

It’s helpful to understand the two parts of most websites:

Front End: The front end runs in a web user’s browser. It requests and receives files from the back end, and contains JavaScript functions which can create interactive plots and other features. iModulonDB’s front end code is available at this GitHub repository.
Back End: The back end runs in a server, which is usually operated by the site owner. It sends the necessary files to the front end when requested. With iModulonDB, all back end data is precomputed by the PyModulon package using the instructions provided here. Output files are uploaded to GitHub Pages, which serves the files to users. If you host your own version of the site, the local server will be the back end.

To generate and view iModulonDB dashboards for a custom project, follow these steps:

Clone the iModulonDB GitHub repository.
Follow the instructions in this document to generate your own files for the new IcaData object of interest. In imodulondb_export, specify the path to the cloned repository so that PyModulon can place new files in the appropriate locations.
Host a local server from the repository folder. This will act as the back end in place of GitHub Pages.
In your browser, visit http://localhost:8000/dataset.html?organism=new_organism&dataset=new_dataset. The organism and dataset names in the URL are customizable.
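
The last two steps can be sketched with the Python standard library. One option for the local server (an assumption; any static file server should work) is to run `python -m http.server 8000` from the repository folder. The helper below, `dashboard_url`, is a hypothetical convenience function and not part of PyModulon; it only assembles the URL shown above from the folder names.

```python
from urllib.parse import urlencode


def dashboard_url(organism_folder, dataset_folder, host="localhost", port=8000):
    """Build the local dataset-page URL for an organism and dataset folder.

    Hypothetical helper: the host and port defaults assume a local server
    started on port 8000 from the cloned repository folder.
    """
    query = urlencode({"organism": organism_folder, "dataset": dataset_folder})
    return f"http://{host}:{port}/dataset.html?{query}"


print(dashboard_url("new_organism", "new_dataset"))
```

The folder names passed here should match the organism_folder and dataset_folder entries used during export.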

[1]:

from pymodulon.core import IcaData
from pymodulon.imodulondb import *
from pymodulon.example_data import load_ecoli_data

[2]:

import pandas as pd
# Increase the maximum column width
pd.set_option('display.max_colwidth', None)

[3]:

ica_data = load_ecoli_data()

10.2. The `imodulondb_compatibility` Function

imodulondb_compatibility is essentially a convenient checklist for iModulonDB-relevant metadata. It has four outputs, which we will go through one-by-one.

Optional arguments:

inplace: Whether to modify the object. It is not recommended to change this from its default (False).
tfcomplex_to_gene: Dictionary mapping transcription factor complexes to gene names. See the section on tf_issues.has_gene for more information.

Returns:

table_issues: Dataframe describing missing columns from the major annotation tables and the site’s behavior if they are not updated. Three elements in this table are “CRITICAL”, meaning the site cannot be generated without them. If any of those elements are missing, they appear as entries in this table and also trigger warnings.
tf_issues: Dataframe describing each regulator from the imodulon_table that can’t be mapped to the trn, tf_links, and/or gene_table.
missing_gene_links: Series listing each gene that doesn’t have a gene_link to an external database.
missing_dois: Index listing each sample that doesn’t have an associated DOI in the sample_table. Clicking on these samples in the activity plots on the site will not bring you to a relevant paper.

Call this function and browse its output. Not all issues it lists will be important to fix. In some cases, the default values will be fine for your purposes, or the match it is seeking might not exist.

[4]:

table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)

For demonstration purposes, we will also create a minimal IcaData object called blank_data containing only the M and A matrices. If we call imodulondb_compatibility with this object, the table_issues output will contain all possible issues.

Note the three warnings which appear when this happens; seeing any of these indicates that iModulonDB export cannot be completed. The possible CRITICAL issues are:

No X matrix: Each gene page contains a plot of gene expression, so you must provide the X matrix.
No project column in sample_table: See the section on table_issues.sample_table.
No condition column in sample_table: See the section on table_issues.sample_table.

[5]:

blank_data = IcaData(ica_data.M, ica_data.A)
all_table_issues, b_tf_issues, b_missing_gene_links, b_missing_dois = imodulondb_compatibility(blank_data)

WARNING:root:Critical issue: No X matrix
WARNING:root:Critical issue: No project column in sample_table.
WARNING:root:Critical issue: No condition column in sample_table.

10.2.1. The `table_issues` Output

The first output of imodulondb_compatibility is table_issues. Each row corresponds to an issue with one of the main class elements. The columns are:

Table: which table or other variable the issue is in
Missing Column: the column of the Table with the issue (not case sensitive).
Solution: Unless “CRITICAL” is in this cell, it describes how the site will behave if the issue is left unresolved.

If the X matrix is missing, it will be listed in the first row as a CRITICAL issue. After that, the imodulondb_table issues represent missing information from a dictionary specific to iModulonDB. The rest of the rows may relate to the gene_table, sample_table, or imodulon_table. These issues arise from parsing the column names of the annotation tables.

You may not be able to provide the information to fill in the missing columns (for example, if no database of gene functions exists for your organism). If you’d like, you can partially fill in any column and leave np.nan values for the information you do not know. In most cases, you can continue to omit the column; the Solution column of table_issues describes what happens in those cases.

Column names are case insensitive. Be sure not to have multiple columns with matching names, as that will cause errors later.
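
Because matching ignores capitalization, two columns such as `gene_name` and `Gene_Name` would collide. A minimal sketch for spotting such collisions before calling imodulondb_compatibility is shown below. The function name `case_insensitive_duplicates` and the example column names are hypothetical, and this is not PyModulon's own check.

```python
from collections import defaultdict


def case_insensitive_duplicates(columns):
    """Group column names that collide once capitalization is ignored."""
    groups = defaultdict(list)
    for col in columns:
        groups[str(col).lower()].append(col)
    # Keep only the lowercase keys shared by more than one column
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical column names; in practice pass list(ica_data.gene_table.columns)
columns = ["gene_name", "Gene_Name", "cog", "operon"]
print(case_insensitive_duplicates(columns))
```

An empty dictionary means no two columns share a name after lowercasing.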

[6]:

table_issues

[6]:

	Table	Missing Column	Solution
0	iModulonDB	organism	The default, "New Organism", will be used.
1	iModulonDB	dataset	The default, "New Dataset", will be used.
2	iModulonDB	strain	The default, "Unspecified", will be used.
3	iModulonDB	publication_name	The default, "Unpublished Study", will be used.
4	iModulonDB	publication_link	The publication name will not be a hyperlink.
5	iModulonDB	gene_link_db	The default, "External Database", will be used.
6	iModulonDB	organism_folder	The default, "new_organism", will be used.
7	iModulonDB	dataset_folder	The default, "new_dataset", will be used.
8	Gene	gene_product	Locus tags (gene_table.index) will be used.
9	Sample	sample	The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
10	Sample	n_replicates	This column will be generated for you.
11	iModulon	name	imodulon_table.index will be used.
12	iModulon	function	The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
13	iModulon	exp_var	This column will be left blank.

10.2.1.1. The `imodulondb_table` Variable

Unless you are missing the X matrix, the first set of table_issues relates to the iModulonDB table (imodulondb_table), which is a dictionary of details about the dataset in general. As you can see by reading the Solution column of table_issues for these entries, most entries have a default value (and the default behavior for a missing link is to not create a link).

Each entry affects the following parts of the site:

organism: Appears in the dataset page metadata box, and gets programmatically changed into a short form (“Escherichia coli” –> “E. coli”) to appear in the dataset title on iModulon and Gene pages. Be sure to use the form “Genus species”.
dataset: The specific dataset title, which could distinguish it from other datasets in this organism. Appears in the dataset page metadata box, and gets appended to the shortened organism name in the dataset title on iModulon and Gene pages.
strain: Appears on the dataset pages in the metadata box.
publication_name: Appears on the dataset pages in the metadata box. Typically we use the form “Smith, et al., year”.
publication_link: The publication_name will be turned into a hyperlink if this entry is not blank. DOIs preferred. PyModulon does not test the validity of any links, so be sure this is a valid link.
gene_link_db: Appears on Gene pages in the metadata box if a gene_link is also available for the gene. PyModulon expects all gene links to go to specific gene pages from the same database, and for the database name to be the one shown here. If gene links are not included, then “Not Available” will appear in place of the gene_link_db name. Examples of gene link databases include EcoCyc, SubtiWiki, and AureoWiki.
organism_folder: Determines output file location names. Also used in the URLs. Cannot contain spaces or slashes.
dataset_folder: Determines output file location names. Also used in URLs. Cannot contain spaces or slashes.

Any files with matching organism_folder and dataset_folder will be overwritten.
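
Since the folder entries end up in file paths and URLs, it can help to check them before export. The sketch below uses a hypothetical helper, `check_folder_name`, that is not part of PyModulon. It is slightly stricter than the rule stated above, as it also rejects backslashes and any whitespace character, which is an assumption on our part.

```python
def check_folder_name(name):
    """Raise ValueError if a folder name is empty or has whitespace or slashes."""
    forbidden = [ch for ch in name if ch.isspace() or ch in "/\\"]
    if not name or forbidden:
        raise ValueError(f"Invalid folder name: {name!r}")
    return name


# The folder names used later in this tutorial pass the check
for folder in ("e_coli", "precise1"):
    check_folder_name(folder)
```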

If you would like to italicize any part of these entries, you can use HTML tags: <i>italic</i>. Italics are applied automatically to organism names on the site.

[7]:

# Here are ALL possible issues with the imodulondb_table
all_table_issues.loc[all_table_issues.Table == 'iModulonDB']

[7]:

	Table	Missing Column	Solution
1	iModulonDB	organism	The default, "New Organism", will be used.
2	iModulonDB	dataset	The default, "New Dataset", will be used.
3	iModulonDB	strain	The default, "Unspecified", will be used.
4	iModulonDB	publication_name	The default, "Unpublished Study", will be used.
5	iModulonDB	publication_link	The publication name will not be a hyperlink.
6	iModulonDB	gene_link_db	The default, "External Database", will be used.
7	iModulonDB	organism_folder	The default, "new_organism", will be used.
8	iModulonDB	dataset_folder	The default, "new_dataset", will be used.

[8]:

# here are the issues relevant to ica_data.imodulondb_table
# in this case, all issues appear
table_issues.loc[table_issues.Table == 'iModulonDB']

[8]:

	Table	Missing Column	Solution
0	iModulonDB	organism	The default, "New Organism", will be used.
1	iModulonDB	dataset	The default, "New Dataset", will be used.
2	iModulonDB	strain	The default, "Unspecified", will be used.
3	iModulonDB	publication_name	The default, "Unpublished Study", will be used.
4	iModulonDB	publication_link	The publication name will not be a hyperlink.
5	iModulonDB	gene_link_db	The default, "External Database", will be used.
6	iModulonDB	organism_folder	The default, "new_organism", will be used.
7	iModulonDB	dataset_folder	The default, "new_dataset", will be used.

[9]:

# issues 0-7
# complete the imodulondb_table
ica_data.imodulondb_table = {
     'organism': 'Escherichia coli',
     'dataset': 'PRECISE 1',
     'strain': 'K-12 MG1655 and BW25113',
     'publication_name': 'Sastry, et al., 2019',
     'publication_link': 'https://doi.org/10.1038/s41467-019-13483-w',
     'gene_link_db': 'EcoCyc',
     'organism_folder': 'e_coli',
     'dataset_folder': 'precise1'
}

[10]:

# if we now re-call the function, the issues we fixed are removed.
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
table_issues.loc[table_issues.Table == 'iModulonDB']

[10]:

	Table	Missing Column	Solution

10.2.1.2. The `gene_table` Variable

The gene_table contains details about each gene and shares an index with the M and X matrices. Its information is included in the following parts of the site:

Gene Tables on the iModulon Pages
Metadata box on the Gene Pages
Search results
name, cog, and gene_start are shown when a gene is hovered over in the gene scatter plot (middle right of iModulon Pages). cog is also used to color this scatter plot, and gene_start defines the x axis.

Note that column names are case-insensitive.

Some additional considerations for each column are described below:

gene_name: Try to use the most up to date names. Also, if a gene encodes a transcription factor (regulator) in the TRN, make sure that the names match (e.g. the transcription factor listed in the trn as ‘RpoS’ is encoded by the gene with gene_name ‘rpoS’). See the section on tf_issues.has_gene for more information.
gene_product: This field is meant to be a concise but specific description of the function/product of the gene, such as ‘homoserine kinase’. You can use html tags in this field if superscripts or subscripts are desired.
cog: This stands for ‘cluster of orthologous groups’, and should be a larger category of genes such as ‘Carbohydrate transport and metabolism’. In addition to being displayed everywhere listed above, each COG is assigned a random color in the gene scatter plot, so it is especially useful to have them.
gene_start: This is not displayed anywhere, but it is used as the x axis value for each gene in the scatter plots on the iModulon Pages. This column is listed as start in the table_issues output. If a few gene_starts are missing, those genes will have a value of 0 on the x axis. If no gene_starts are provided, then the order of the gene_table will be used to assign integer values to each gene for the x axis of this plot.
operon: Preferably, this field will be the shortest string listing all genes, such as ‘artPIQM’, ‘argT-hisJQMP’, or ‘pdxB-usg-truA-dedA’. For very long operons with names that don’t share letters, you can list the first and last gene as in ‘yitB->yisZ’. iModulonDB is not picky about this column, however, so operons as provided by another database are acceptable.
regulator: Adding a TRN will automatically generate this column for you. It should contain a comma-separated list of all regulators that regulate this gene, e.g. ‘ppGpp,Lrp,Nac’.
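
The fallback rules described for gene_start can be illustrated with a short sketch. This is only a reading of the behavior described above, not the site's actual code; the function name `scatter_x_values` is hypothetical, and numbering the genes from 0 when no positions exist is an assumption.

```python
def scatter_x_values(gene_starts):
    """Return x-axis values for a list of start positions (None = missing)."""
    if all(start is None for start in gene_starts):
        # No positions at all: fall back to the order of the table
        # (0-based numbering is an assumption of this sketch)
        return list(range(len(gene_starts)))
    # Some positions known: missing ones are placed at 0
    return [0 if start is None else start for start in gene_starts]


print(scatter_x_values([190, None, 2801]))
print(scatter_x_values([None, None, None]))
```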

If any of these columns are listed in your table_issues, be sure to check the gene_table for the appropriate information under other column names. Rename those columns so that they match the names above.
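
One way to approach the renaming is to build a mapping from your existing column names to the expected ones and pass it to pandas. The sketch below builds such a mapping with plain Python. The alias names (`product`, `COG category`, `symbol`) and the helper `build_rename_map` are hypothetical examples, so adjust them to whatever your annotation source provides.

```python
def build_rename_map(columns, aliases):
    """Map existing column names to expected names, ignoring capitalization."""
    lookup = {alias.lower(): target for alias, target in aliases.items()}
    return {
        col: lookup[col.lower()]
        for col in columns
        if col.lower() in lookup and col != lookup[col.lower()]
    }


# Hypothetical aliases and columns, for illustration only
aliases = {"product": "gene_product", "COG category": "cog", "symbol": "gene_name"}
columns = ["Symbol", "Product", "operon"]
rename_map = build_rename_map(columns, aliases)
print(rename_map)

# With a real object, the mapping could then be applied with pandas:
# ica_data.gene_table = ica_data.gene_table.rename(columns=rename_map)
```

Columns without an alias, such as `operon` here, are left untouched.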

[11]:

# here are all possible issues with the gene_table
all_table_issues.loc[all_table_issues.Table == 'Gene']

[11]:

	Table	Missing Column	Solution
9	Gene	gene_name	Locus tags (gene_table.index) will be used.
10	Gene	gene_product	Locus tags (gene_table.index) will be used.
11	Gene	cog	COG info will not display & the gene scatter plot will not have color.
12	Gene	start	The x axis of the scatter plot will be a numerical value instead of a genome location.
13	Gene	operon	Operon info will not display.
14	Gene	regulator	Regulator info will not display. If you have a TRN, add it to the model to auto-generate this column.

[12]:

# here are the issues relevant to ica_data.gene_table
table_issues.loc[table_issues.Table == 'Gene']

[12]:

	Table	Missing Column	Solution
0	Gene	gene_product	Locus tags (gene_table.index) will be used.

[13]:

# issue 0
# if gene_product information is available, add it to the gene_table.

10.2.1.3. The `sample_table` Variable

The sample_table describes each sample and shares its index with the columns of X and A. Its information is very important for generating the activity and expression bar graphs on the iModulon and gene pages. Familiarize yourself with the activity plot.

Three columns in the sample_table define progressively smaller groupings of samples. The first two, project and condition, are CRITICAL: without them the activity plots cannot be generated. Short, human-readable, and specific names are preferred. Formatting should be consistent within your dataset, but it need not match other datasets on iModulonDB.

project: This defines the largest grouping of samples, which share a common theme and/or were featured in the same paper. Examples include: ‘biofilm’, ‘acid’, and ‘crp_KO’. In the activity bar graph on the iModulon Pages, the vertical lines separate each project and the names are displayed across the bottom.
condition: This is the smallest grouping of samples, used for experimental conditions. Samples have matching project and condition names if and only if they are biological replicates. Examples include: ‘biofilm_t0’, ‘biofilm_t10’, ‘wt_ctrl’, ‘del_crp’. In the activity bar graph on the iModulon Pages, each bar corresponds to a condition, and its height is the average activity value across all samples in the condition. Hovering over a bar provides additional details from the sample table, and zooming in switches the x labels to be conditions instead of projects.
sample: This must be unique for each sample. If it is not a column, it is assumed that it is the index of the sample_table. Use a human readable sample name, preferably of the form ‘project__condition__#’, where ‘#’ indicates a replicate number. Examples: ‘carbon__wt_ctrl__1’, ‘carbon__wt_ctrl__2’, ‘carbon__del_crp__1’, ‘biofilm__biofilm_t0__1’. In the activity bar graph on the iModulon Pages, each dot floating near the bar is a sample. Also, the regulon scatter plots contain points corresponding to each sample; sample names will appear when points are hovered over in that plot as well.
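
The preferred ‘project__condition__#’ naming can be checked with a small parser. This is a minimal sketch under the assumption that the three parts are separated by double underscores and that the replicate is a plain integer; `parse_sample_name` and `find_name_problems` are hypothetical helpers, not PyModulon functions.

```python
def parse_sample_name(name):
    """Split 'project__condition__#' into (project, condition, replicate)."""
    parts = name.split("__")
    if len(parts) != 3 or not all(parts) or not parts[2].isdigit():
        raise ValueError(f"Expected 'project__condition__#', got {name!r}")
    project, condition, replicate = parts
    return project, condition, int(replicate)


def find_name_problems(names):
    """Return names that are duplicated or do not follow the preferred form."""
    problems = []
    seen = set()
    for name in names:
        if name in seen:
            problems.append(name)  # sample names must be unique
            continue
        seen.add(name)
        try:
            parse_sample_name(name)
        except ValueError:
            problems.append(name)
    return problems


names = ["carbon__wt_ctrl__1", "carbon__wt_ctrl__2", "biofilm__biofilm_t0__1"]
print([parse_sample_name(n) for n in names])
print(find_name_problems(names))
```

Names that do not follow this form are still accepted by the site as long as they are unique; the check only flags deviations from the preferred syntax.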

Finally, two other columns are checked:

n_replicates: For each condition, the samples all must have a matching n_replicates value that is used by the code. You can ignore this output, however, since a missing n_replicates column will automatically be generated. If you would like to remove this issue from the table, you can run the command generate_n_replicates_column(ica_data); otherwise, it will be automatically called on export.
doi: If you click on the bars of the activity plot and the first sample of the corresponding condition has a link in the doi column, then you will go to that link. This could be useful to learn more about a given sample. DOIs are preferred, but any link is allowed. PyModulon does not test the validity of any links, so be sure this is a valid link. See the section on missing_dois for more details on this.
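
To make the n_replicates requirement concrete, the sketch below counts how many samples share each project and condition pair and assigns that count to every sample in the group. It assumes that n_replicates is simply this count; the actual generate_n_replicates_column function may be implemented differently, and the sample records here are made up for illustration.

```python
from collections import Counter


def count_replicates(samples):
    """Return one n_replicates value per sample, in the original order."""
    counts = Counter((s["project"], s["condition"]) for s in samples)
    return [counts[(s["project"], s["condition"])] for s in samples]


# Hypothetical sample records with the two CRITICAL columns
samples = [
    {"project": "carbon", "condition": "wt_ctrl"},
    {"project": "carbon", "condition": "wt_ctrl"},
    {"project": "carbon", "condition": "del_crp"},
]
print(count_replicates(samples))
```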

Note that in addition to the columns that PyModulon searches for, all columns of the sample_table can be used to annotate or color the activity plots. The more informative the sample_table is, the more useful the activity plots will be. You can access all the columns in any iModulon or Gene Page by clicking the wrench symbol next to the word ‘Activity’. Checkboxes next to each column name indicate whether its value will be displayed on hovering over the sample, and the ink button allows you to recolor the activity plot by the values of that feature. For example, ‘pH’ is not a required sample_table column, but it could be useful to color or label each sample by that variable. Other useful but completely optional columns include:

Strain Description
Carbon Source (g/L)
Supplement
Temperature (C)
Time (min)
Growth Phase

[14]:

# here are all possible issues with the sample_table
all_table_issues.loc[all_table_issues.Table == 'Sample']

[14]:

	Table	Missing Column	Solution
15	Sample	project	This is a CRITICAL column defining the largest grouping of samples. Vertical bars in the activity plot will separate projects.
16	Sample	condition	This is an CRITICAL column defining the smallest grouping of samples. Biological replicates must have matching projects and conditions, and they will appear as single bars with averaged activities.
17	Sample	sample	The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
18	Sample	n_replicates	This column will be generated for you.
19	Sample	doi	Clicking on activity plot bars will not link to relevant papers for the samples.

[15]:

# here are the issues relevant to ica_data.sample_table
table_issues.loc[table_issues.Table == 'Sample']

[15]:

	Table	Missing Column	Solution
1	Sample	sample	The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
2	Sample	n_replicates	This column will be generated for you.

[16]:

# issue 1
# check to make sure that the sample_table.index contains good names
ica_data.sample_table.index

[16]:

Index(['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1',
       'fur__wt_dpd__2', 'fur__wt_fe__1', 'fur__wt_fe__2',
       'fur__delfur_dpd__1', 'fur__delfur_dpd__2', 'fur__delfur_fe2__1',
       'fur__delfur_fe2__2',
       ...
       'efeU__menFentC_ale29__1', 'efeU__menFentC_ale29__2',
       'efeU__menFentC_ale30__1', 'efeU__menFentC_ale30__2',
       'efeU__menFentCubiC_ale36__1', 'efeU__menFentCubiC_ale36__2',
       'efeU__menFentCubiC_ale37__1', 'efeU__menFentCubiC_ale37__2',
       'efeU__menFentCubiC_ale38__1', 'efeU__menFentCubiC_ale38__2'],
      dtype='object', length=278)

[17]:

# the sample names above are good; we can now ignore issue 1

# issue 2
# optionally add the n_replicates column
generate_n_replicates_column(ica_data)

10.2.1.4. The `imodulon_table` Variable

The imodulon_table describes each iModulon and shares its index with the columns of M and the index of A. Its data is featured on iModulonDB in the following places:

Dataset Page, main section
iModulon tables on the Gene Pages (regulator, function, and category columns)
Metadata box on the iModulon Pages
Search results

Some additional considerations for each column are described below:

name: Usually, this issue can be ignored because the imodulon_table.index always matches the imodulon_names. If the names are all integers, then the word “iModulon” will be added (“0” –> “iModulon 0”).
regulator: This column is very important. Use regulator names that match the TRN. Join regulators using either ‘+’ to represent the intersection of regulons or ‘/’ to represent the union of regulons. Regulator links will be added in the iModulon Page metadata boxes according to the tf_links variable. The content of this column affects the behavior of the gene table, gene histogram, regulon Venn diagram, and regulon scatter plots. See the section on tf_issues.
function: This is meant to be a specific description of each iModulon’s function, such as “Histidine biosynthesis”.
category: This column provides larger groupings of iModulons, such as “Amino Acid Biosynthesis”.
n_genes: The number of genes in the iModulon. This can be ignored since it will be computed for you.
precision: The overlap between the iModulon and its regulon divided by the size of the iModulon. See the tutorial on gene enrichment analysis.
recall: The overlap between the iModulon and its regulon divided by the size of the regulon. See the tutorial on gene enrichment analysis.
exp_var: The explained variance when this iModulon alone is used to reconstruct the original X matrix. Use decimal values; they will be converted to percentages in iModulonDB. These can be computed using the explained_variance function; see the tutorial on additional functions.
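
The regulator syntax and the precision and recall definitions above can be sketched with Python sets. The toy TRN, the gene names, and both helper functions are hypothetical; the sketch does not handle strings that mix ‘+’ and ‘/’, and it assumes non-empty gene sets.

```python
def combined_regulon(regulator, trn):
    """Resolve 'A+B' (intersection) or 'A/B' (union) into a set of genes."""
    if "+" in regulator and "/" in regulator:
        raise ValueError("Mixed '+' and '/' are not handled in this sketch")
    if "+" in regulator:
        return set.intersection(*(set(trn[r]) for r in regulator.split("+")))
    return set.union(*(set(trn[r]) for r in regulator.split("/")))


def precision_recall(imodulon_genes, regulon_genes):
    """Overlap divided by iModulon size (precision) and regulon size (recall)."""
    imodulon_genes, regulon_genes = set(imodulon_genes), set(regulon_genes)
    overlap = len(imodulon_genes & regulon_genes)
    return overlap / len(imodulon_genes), overlap / len(regulon_genes)


# Toy TRN: regulator name -> set of regulated genes
trn = {"regA": {"g1", "g2", "g3"}, "regB": {"g2", "g3", "g4"}}
imodulon = {"g2", "g3", "g5"}

both = combined_regulon("regA+regB", trn)    # genes regulated by both
either = combined_regulon("regA/regB", trn)  # genes regulated by either
precision, recall = precision_recall(imodulon, both)
print(sorted(both), sorted(either), precision, recall)
```

In this toy case the intersection regulon is fully contained in the iModulon, so recall is 1.0 while precision is 2/3.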

[18]:

# here are all possible issues with the imodulon_table
all_table_issues.loc[all_table_issues.Table == 'iModulon']

[18]:

	Table	Missing Column	Solution
20	iModulon	name	imodulon_table.index will be used.
21	iModulon	regulator	The regulator details will be left blank.
22	iModulon	function	The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
23	iModulon	category	The categories will be filled in as "Uncharacterized".
24	iModulon	n_genes	This column will be computed for you.
25	iModulon	precision	This column will be left blank.
26	iModulon	recall	This column will be left blank.
27	iModulon	exp_var	This column will be left blank.

[19]:

# here are the issues relevant to ica_data.imodulon_table
table_issues.loc[table_issues.Table == 'iModulon']

[19]:

	Table	Missing Column	Solution
3	iModulon	name	imodulon_table.index will be used.
4	iModulon	function	The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard
5	iModulon	exp_var	This column will be left blank.

[20]:

# issue 3
# check that the imodulon_table.index contains good names
ica_data.imodulon_table.index

[20]:

Index(['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC', 'BW25113',
       'Cbl+CysB', 'CdaR', 'CecR', 'Copper', 'CpxR', 'Cra', 'Crp-1', 'Crp-2',
       'CsqR', 'CysB', 'DhaR/Mlc', 'EvgA', 'ExuR/FucR', 'FadR', 'FecI',
       'FlhDC', 'FliA', 'Fnr', 'Fur-1', 'Fur-2', 'GadEWX', 'GadWX', 'GcvA',
       'GlcC', 'GlpR', 'GntR/TyrR', 'His-tRNA', 'Leu/Ile', 'Lrp', 'MalT',
       'MetJ', 'Nac', 'NagC/TyrR', 'NarL', 'NikR', 'NtrC+RpoN', 'OxyR', 'PrpR',
       'PurR-1', 'PurR-2', 'PuuR', 'Pyruvate', 'RbsR', 'RcsAB', 'RpoH', 'RpoS',
       'SoxS', 'SrlR+GutM', 'Thiamine', 'Tryptophan', 'XylR', 'YgbI', 'YiaJ',
       'YieP', 'YneJ', 'Zinc', 'crp-KO', 'curli', 'deletion-1', 'deletion-2',
       'duplication-1', 'e14-deletion', 'efeU-repair', 'entC-menF-KO',
       'fimbriae', 'flu-yeeRS', 'fur-KO', 'gadWX-KO', 'insertion',
       'iron-related', 'lipopolysaccharide', 'membrane', 'nitrate-related',
       'proVWX', 'purR-KO', 'sgrT', 'thrA-KO', 'translation',
       'uncharacterized-1', 'uncharacterized-2', 'uncharacterized-3',
       'uncharacterized-4', 'uncharacterized-5', 'uncharacterized-6',
       'ydcI-KO', 'yheO-KO'],
      dtype='object')

[21]:

# the names are good; we can ignore issue 3

# issue 4
# write a function column if desired
# (part of iModulon characterization)

# issue 5
# compute the explained variance for each iModulon
from pymodulon.util import explained_variance

for k in ica_data.imodulon_table.index:
    ica_data.imodulon_table.loc[k, 'exp_var'] = explained_variance(
        ica_data, imodulons=k)

10.2.2. The `tf_issues` Output

The next output is the tf_issues dataframe. Each row corresponds to a regulator that is used in the imodulon_table. Regulators that satisfy all requirements are omitted. False values in this table represent potentially missing information.

The columns refer to the following:

in_trn: whether the regulator is in the model.trn. Regulators not in the TRN will be ignored in the site’s histograms and gene tables. It is highly recommended that all regulators be in the TRN. Any False values in this column should be fixed by adding rows to the trn and/or ensuring that names all match up between imodulon_table.regulator and trn.regulator.
has_link: whether the regulator has a link in ica_data.tf_links. If it does, its name will be clickable in the iModulon Page metadata box. If not, the regulator name will not appear as a hyperlink. False values for regulators without dedicated pages in other databases can be ignored.
has_gene: whether the regulator can be matched to a gene in the model. This is used to generate the regulon scatter plot which appears at the bottom of the iModulon Page, comparing iModulon activity to the expression of the regulators. If your regulator is a gene, try to ensure that its name matches the gene_table.gene_name. If your regulator doesn’t correspond to a gene (e.g. the small molecule regulator ppGpp), then a False value in this column is acceptable.

Note that in our minimal example (blank_data), the tf_issues will be empty because there is no regulator column in the imodulon_table.

[22]:

tf_issues

[22]:

	in_trn	has_link	has_gene
allR	True	False	True
fucR	True	False	True
araC	True	False	True
arcA	True	False	True
argR	True	False	True
...	...	...	...
trpR	True	False	True
xylR	True	False	True
yiaJ	True	False	True
zur	True	False	True
zntR	True	False	True

70 rows × 3 columns

10.2.2.1. The `in_trn` Column

As described above, any False values in this column should definitely be fixed by ensuring name match-ups between imodulon_table.regulator and trn.regulator columns, or by adding rows to trn. True values and omitted regulators indicate matchable regulators.

[23]:

# here is a list of missing regulators from the TRN
tf_issues.index[~tf_issues.in_trn.astype(bool)]

[23]:

Index([], dtype='object')

[24]:

# it is empty; no issues

10.2.2.2. The `has_link` Column

This column encourages you to find links for each transcription factor and put them in the dictionary ica_data.tf_links. Find a database for your regulators (e.g. RegulonDB), and either make a file full of links or programmatically generate links by finding patterns in the database’s URLs.

In the example below, a file of regulator links has already been made.

[25]:

# here is a list of missing tf_links
tf_issues.index[~tf_issues.has_link.astype(bool)]

[25]:

Index(['allR', 'fucR', 'araC', 'arcA', 'argR', 'atoC', 'cbl', 'cysB', 'cdaR',
       'cecR', 'cusR', 'hprR', 'cueR', 'cpxR', 'cra', 'crp', 'csqR', 'dhaR',
       'mlc', 'evgA', 'exuR', 'fadR', 'fecI', 'flhD;flhC', 'fliA', 'fnr',
       'fur', 'gadW', 'gadE', 'gadX', 'gcvA', 'glcC', 'glpR', 'tyrR', 'gntR',
       'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'ilvY', 'lrp', 'malT', 'metJ',
       'nac', 'nagC', 'narL', 'nikR', 'ntrC', 'rpoN', 'oxyR', 'prpR', 'purR',
       'puuR', 'btsR', 'ypdB', 'pdhR', 'rbsR', 'rcsA;rcsB', 'rpoH', 'rpoS',
       'soxS', 'gutM', 'srlR', 'TPP', 'L-tryptophan', 'trp-tRNA', 'trpR',
       'xylR', 'yiaJ', 'zur', 'zntR'],
      dtype='object')

[26]:

# read in a file of links
import pandas as pd
file_location = '../../src/pymodulon/data/imodulondb/e_coli_tf_links.csv'
tf_links = pd.read_csv(file_location, header = None, index_col = 0)

# convert to dictionary
tf_links = tf_links.to_dict()[1]

# add to ica_data
ica_data.tf_links = tf_links

# display a few for demonstration purposes
{k:tf_links[k] for k in list(tf_links.keys())[0:5]}

[26]:

{'glpR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120012730&organism=ECK12&format=jsp&type=regulon',
 'dhaR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120015690&organism=ECK12&format=jsp&type=regulon',
 'mlc': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011240&organism=ECK12&format=jsp&type=regulon',
 'argR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011670&organism=ECK12&format=jsp&type=regulon',
 'narL': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011502&organism=ECK12&format=jsp&type=regulon'}

[27]:

# let's re-check the tf_issues now
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
tf_issues.index[~tf_issues.has_link.astype(bool)]

[27]:

Index(['his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan', 'trp-tRNA'], dtype='object')

[28]:

# none of those regulators have pages on RegulonDB
# so they can be ignored
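
As an alternative to a file of links, links can be generated from a URL pattern. The sketch below reuses the pattern visible in the RegulonDB links displayed above and the two identifiers copied from them. The helper `regulondb_link` is hypothetical, the pattern is inferred from those example links only, and the database may change its URL scheme, so check a few generated links by hand.

```python
from urllib.parse import urlencode

# Base address taken from the example links shown in this section
REGULONDB_BASE = "http://regulondb.ccg.unam.mx/regulon"


def regulondb_link(regulon_id):
    """Build a regulon link from a RegulonDB identifier (pattern is inferred)."""
    query = urlencode(
        {"term": regulon_id, "organism": "ECK12", "format": "jsp", "type": "regulon"}
    )
    return f"{REGULONDB_BASE}?{query}"


# Identifiers copied from the links displayed earlier in this section
regulon_ids = {"glpR": "ECK120012730", "argR": "ECK120011670"}
generated_links = {tf: regulondb_link(rid) for tf, rid in regulon_ids.items()}
print(generated_links["glpR"])
```

The resulting dictionary has the same shape as the one assigned to ica_data.tf_links.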

10.2.2.3. The `has_gene` Column

As described above, any False values in this column indicate an inability to match between imodulon_table.regulator and gene_table.gene_name. Some regulators will not have genes, so some False values are acceptable.

In some cases, the regulator name will be a complex of several genes. PyModulon supports an additional input, tfcomplex_to_gene, which is a dictionary mapping those regulators to their preferred gene. This variable will need to be passed to both imodulondb_compatibility and imodulondb_export.

[29]:

tf_issues.index[~tf_issues.has_gene.astype(bool)]

[29]:

Index(['cecR', 'flhD;flhC', 'glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'btsR',
       'rcsA;rcsB', 'gutM', 'TPP', 'L-tryptophan', 'trp-tRNA'],
      dtype='object')

[30]:

# three of these are name mismatches
# find the old names and replace them
ica_data.gene_table = ica_data.gene_table.replace({
    'ybiH':'cecR',
    'yehT':'btsR',
    'srlM':'gutM'
})

# two are complexes
tfcomplex_to_gene = {
    'flhD;flhC':'flhD',
    'rcsA;rcsB':'rcsB'
}

[31]:

# let's re-check the tf_issues now
# use the tfcomplex_to_gene input
table_issues, tf_issues, missing_gene_links, missing_dois = \
    imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)

tf_issues.index[~tf_issues.has_gene.astype(bool)]

[31]:

Index(['glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan',
       'trp-tRNA'],
      dtype='object')

[32]:

# the rest are pseudogene or non-gene regulators; ignore.
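
When many regulators are complexes, it may help to list the candidate genes for each one before choosing the tfcomplex_to_gene entries by hand. The sketch below is a hypothetical helper (complex_candidates is not part of PyModulon) and assumes complexes are written with a ";" separator, as in 'flhD;flhC'. It only reports which components appear in the gene names; picking the preferred gene remains a manual decision.

```python
def complex_candidates(regulators, gene_names, sep=";"):
    """Map each complex regulator to its components found in gene_names.

    Hypothetical helper for illustration; not a PyModulon function.
    """
    known = set(gene_names)
    candidates = {}
    for reg in regulators:
        if sep not in reg:
            continue  # not a complex, so there is nothing to choose
        parts = [part.strip() for part in reg.split(sep)]
        candidates[reg] = [part for part in parts if part in known]
    return candidates


# Toy inputs standing in for tf_issues.index and gene_table.gene_name
regulators = ["cecR", "flhD;flhC", "rcsA;rcsB", "TPP"]
gene_names = ["flhD", "flhC", "rcsB", "cecR"]

candidates = complex_candidates(regulators, gene_names)
print(candidates)
```

A complex with an empty candidate list has no component in the gene table and cannot be resolved with tfcomplex_to_gene.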

10.2.3. The `missing_gene_links` Output

The next output is missing_gene_links. Similar to tf_issues.has_link, this output shows all the genes that are missing links in the gene_links dictionary. The gene_links dictionary is indexed by locus tag (gene_table.index).

Gene links are featured on the Gene Pages in the metadata box. The text of the hyperlink is determined by imodulondb_table.gene_link_db. Note that this does not have to be the same database as the one used for tf_links.

As with the tf_links, gene_links can be filled out programmatically or using an external file. In the example below, we use an external file.

[33]:

missing_gene_links

[33]:

0         b0002
1         b0003
2         b0004
3         b0005
4         b0006
         ...
3918      b4688
3919      b4693
3920    b4696_1
3921    b4696_2
3922      b4705
Name: missing_gene_links, Length: 3923, dtype: object

[34]:

# read in a file of links
import pandas as pd
file_location = '../../src/pymodulon/data/imodulondb/e_coli_gene_links.csv'
gene_links = pd.read_csv(file_location, header = None, index_col = 0)

# convert to dictionary
gene_links = gene_links.to_dict()[1]

# add to ica_data
ica_data.gene_links = gene_links

# display a few for demonstration purposes
{k:gene_links[k] for k in list(gene_links.keys())[0:5]}

[34]:

{'b0002': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10998',
 'b0003': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10999',
 'b0004': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG11000',
 'b0005': 'https://ecocyc.org/gene?orgid=ECOLI&id=G6081',
 'b0006': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10011'}

[35]:

# let's re-check the missing_gene_links now
table_issues, tf_issues, missing_gene_links, missing_dois = \
    imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)

missing_gene_links.values

[35]:

array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
       'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
       'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
       'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
       'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
       'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
       'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
       'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
       'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
       'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)

[36]:

# all the above are pseudogenes without available links
# they can be ignored
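
If no link file is available, the programmatic route could look like the sketch below. It assumes the target database can be addressed directly by locus tag; the URL template is a placeholder, and build_gene_links is a hypothetical helper, not a PyModulon function. Note that the EcoCyc links shown above use EcoCyc's own identifiers, so a locus-tag template would not reproduce them.

```python
def build_gene_links(locus_tags, url_template):
    """Return {locus_tag: url} by filling url_template for every tag.

    Hypothetical helper for illustration; not a PyModulon function.
    """
    return {tag: url_template.format(locus_tag=tag) for tag in locus_tags}


# Placeholder template; replace it with your database's real URL scheme
template = "https://example.org/gene/{locus_tag}"

# Toy locus tags standing in for ica_data.gene_table.index
gene_links = build_gene_links(["b0002", "b0003"], template)
print(gene_links)

# In a real session the result would be assigned as in the cell above:
# ica_data.gene_links = build_gene_links(ica_data.gene_table.index, template)
```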

10.2.4. The `missing_dois` Output

The final output is missing_dois. The check looks for completed entries in sample_table.doi and lists the sample_table.index values of any samples without one.

It may be helpful to think of the DOIs as “sample_links”. Clicking on the bar of an activity plot on the iModulon Pages or an expression plot on the Gene Pages will take users to the DOI of the corresponding sample if it exists. If this project is part of the modulome, in which samples from many publications are pooled, it is especially useful to fill in all of the DOIs.

If the sample_table has no doi column, every sample will be listed as missing its DOI.

[37]:

missing_dois

[37]:

Index(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
       'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
       'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
       'omics__wt_glu__1', 'omics__wt_glu__2', 'minspan__wt_glc__4',
       'minspan__bw_delcra_glc__2', 'ica__wt_glc__1', 'ica__wt_glc__2',
       'ica__wt_glc__3', 'ica__wt_glc__4', 'ica__arg_sbt__1',
       'ica__arg_sbt__2', 'ica__cytd_rib__1', 'ica__cytd_rib__2',
       'ica__gth__1', 'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
       'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
       'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
       'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
       'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
       'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
       'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
      dtype='object', name='missing_DOIs')

[38]:

# a couple of the missing dois do exist
dois = {
    'minspan__wt_glc__4': 'doi.org/10.15252/msb.20145243',
    'minspan__bw_delcra_glc__2': 'doi.org/10.15252/msb.20145243',
    'omics__wt_glu__1': 'doi.org/10.1038/ncomms13091',
    'omics__wt_glu__2': 'doi.org/10.1038/ncomms13091'
}

# add them to the sample_table
for sample, doi in dois.items():
    ica_data.sample_table.loc[sample, 'DOI'] = doi
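
For larger datasets, typing one entry per sample becomes tedious. Because sample names follow the "project__condition__#" syntax, the dois dictionary could instead be derived from the project prefix. The sketch below assumes that every sample in a project comes from the same publication, which you should verify for your data; dois_from_projects is a hypothetical helper, not part of PyModulon.

```python
def dois_from_projects(sample_names, project_dois, sep="__"):
    """Return {sample: doi} for samples whose project prefix has a known DOI.

    Hypothetical helper; assumes names look like project__condition__#.
    """
    new_dois = {}
    for name in sample_names:
        project = name.split(sep)[0]
        if project in project_dois:
            new_dois[name] = project_dois[project]
    return new_dois


# A few names taken from the missing_dois output above
missing = ["minspan__wt_glc__4", "omics__wt_glu__1", "ica__wt_glc__1"]

# One DOI per project (assumption: one publication per project)
project_dois = {
    "minspan": "doi.org/10.15252/msb.20145243",
    "omics": "doi.org/10.1038/ncomms13091",
}

dois = dois_from_projects(missing, project_dois)
print(dois)
```

The resulting dictionary has the same shape as the hand-written one, so it can be written to the sample_table with the same loop. Samples whose project has no known DOI, such as ica__wt_glc__1 here, are left out.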

10.2.5. Double-check Compatibility and Save

Once all the outputs have been checked, we encourage you to re-run the imodulondb_compatibility check and confirm that you are comfortable ignoring all remaining missing information. If so, save your ica_data object so that you don’t need to repeat any of these steps.

[39]:

table_issues, tf_issues, missing_gene_links, missing_dois = \
    imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)

print('--Table Issues--')
display(table_issues)
print('--TF Issues--')
display(tf_issues)
print('--Missing Gene Links--')
display(missing_gene_links.values)
print('--Missing DOIs--')
display(missing_dois.values)

--Table Issues--

	Table	Missing Column	Solution
0	Gene	gene_product	Locus tags (gene_table.index) will be used.
1	Sample	sample	The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#."
2	iModulon	name	imodulon_table.index will be used.
3	iModulon	function	The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard

--TF Issues--

	in_trn	has_link	has_gene
glpR	True	True	False
his-tRNA	True	False	False
leu-tRNA	True	False	False
ile-tRNA	True	False	False
TPP	True	False	False
L-tryptophan	True	False	False
trp-tRNA	True	False	False

--Missing Gene Links--

array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
       'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
       'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
       'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
       'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
       'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
       'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
       'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
       'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
       'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)

--Missing DOIs--

array(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
       'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
       'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
       'ica__wt_glc__1', 'ica__wt_glc__2', 'ica__wt_glc__3',
       'ica__wt_glc__4', 'ica__arg_sbt__1', 'ica__arg_sbt__2',
       'ica__cytd_rib__1', 'ica__cytd_rib__2', 'ica__gth__1',
       'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
       'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
       'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
       'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
       'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
       'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
       'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
      dtype=object)

[40]:

# the above output is all acceptable

# save your results
from pymodulon.io import save_to_json
save_to_json(ica_data, 'e_coli_imdb.json')

10.3. The `imodulondb_export` Function

Once iModulonDB compatibility has been confirmed, it is time to use the imodulondb_export function.

Arguments:

model: the ica_data object
path: the path to the main folder of iModulonDB. This is where you cloned the iModulonDB GitHub repository, and where you should host your local server from. This function will create new folders and files inside this repository. The default is the current working directory.
cat_order: a list of the imodulon_table.category values in order. When sorting by category in the Dataset Pages, the categories will appear in this order. Otherwise, they will appear in alphabetical order. Note that the default sorting is now by imodulon_table.exp_var (or by iModulon name if no explained variance is provided), so this input is not very important.
tfcomplex_to_gene: a dictionary relating regulatory complexes in imodulon_table.regulator to gene names in gene_table.gene_name. See the section on tf_issues.has_gene.
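
Before exporting, it can be worth checking that cat_order and the category column agree. How imodulondb_export treats a category that is absent from cat_order is not described here, so the sketch below simply reports mismatches in both directions. check_cat_order is a hypothetical helper, not part of PyModulon, and the inputs are toy values.

```python
def check_cat_order(categories, cat_order):
    """Compare the categories in use with the requested ordering.

    Returns (missing, unused): categories in use but absent from
    cat_order, and cat_order entries that no iModulon uses.
    Hypothetical helper for illustration; not a PyModulon function.
    """
    present = set(categories)
    listed = set(cat_order)
    missing = sorted(present - listed)
    unused = sorted(listed - present)
    return missing, unused


# Toy values standing in for ica_data.imodulon_table.category
categories = ["Stress Response", "Metal Homeostasis",
              "Stress Response", "Uncharacterized"]
cat_order = ["Metal Homeostasis", "Stress Response", "Energy Metabolism"]

missing, unused = check_cat_order(categories, cat_order)
print("Not in cat_order:", missing)
print("Unused cat_order entries:", unused)
```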

[41]:

cat_order = ['Carbon Source Utilization',
             'Amino Acid and Nucleotide Biosynthesis',
             'Energy Metabolism',
             'Miscellaneous Metabolism',
             'Structural Components',
             'Metal Homeostasis',
             'Stress Response',
             'Regulator Discovery',
             'Biological Enrichment',
             'Genomic Alterations',
             'Uncharacterized']

imodulondb_export(ica_data,
                  '../iModulonDB',
                  cat_order = cat_order,
                  tfcomplex_to_gene = tfcomplex_to_gene)

Writing main site files...
Done writing main site files. Writing plot files...
Two progress bars will appear below. The second will take significantly longer than the first.
Writing iModulon page files (1/2)


Writing Gene page files (2/2)


Complete! (Organism = e_coli; Dataset = precise1)

Now that the files have been exported, you can access them in your browser using the instructions at the beginning of this tutorial.

10.3.1. Adding a Link to the Splash Page

If you would like to be able to access your data from the splash page of your locally hosted site, you will have to edit the HTML code in index.html, which is in the main folder.

Copy the contents of organisms/new_organism/new_dataset/html_for_splash.html.
Open index.html using a text editor or IDE.
Find “<!-- INSERT NEW DATASETS BELOW THIS LINE -->”, which should be near line 297.
Paste the contents according to the comments in the file.
If desired, increase the indentation of the pasted lines.

Now, if you reload localhost:8000, the new dataset will appear on the left-hand side of the splash page.
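
The copy-and-paste steps above could also be scripted. The sketch below is a minimal illustration, not part of PyModulon: it assumes the marker is an ordinary HTML comment spelled exactly as in MARKER, and it inserts the snippet directly after the marker line without adjusting indentation or following any other comments in index.html. It works on strings so that it can be tried without touching the real files.

```python
# Assumed spelling of the marker comment in index.html
MARKER = "<!-- INSERT NEW DATASETS BELOW THIS LINE -->"


def insert_after_marker(index_html, snippet, marker=MARKER):
    """Return index_html with snippet inserted on the line after marker."""
    lines = index_html.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            if not line.endswith("\n"):
                lines[i] = line + "\n"  # marker was the last line of the file
            block = snippet if snippet.endswith("\n") else snippet + "\n"
            return "".join(lines[: i + 1]) + block + "".join(lines[i + 1:])
    raise ValueError("marker not found; index.html was left unchanged")


# Toy page standing in for the real index.html
page = "<ul>\n" + MARKER + "\n</ul>\n"
updated = insert_after_marker(page, "<li>new dataset</li>")
print(updated)

# With real files (paths as given in the steps above), one could write:
# from pathlib import Path
# snippet = Path("organisms/new_organism/new_dataset/html_for_splash.html").read_text()
# index = Path("index.html")
# index.write_text(insert_after_marker(index.read_text(), snippet))
```

Raising an error when the marker is missing avoids silently writing an unchanged or malformed page.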
