10. Creating an iModulonDB Dashboard
After an iModulon analysis is completed and published, dashboards for every iModulon and gene are made available on the iModulonDB website. The PyModulon package enables generation of all necessary files for any IcaData
object that meets a few simple requirements.
10.1. iModulonDB Site Overview
For general information about iModulonDB, visit the about page. Familiarize yourself with the main page types:
Splash page: The main entry point for the site
Dataset page: After choosing a dataset on the splash page, you are directed to the dataset page, which contains a gray dataset metadata box and a table of iModulons.
iModulon page: The iModulon dashboards, which have a gray iModulon metadata box, a gene table, and several plots.
Gene page: Gene dashboards, which have an iModulon table connecting them to iModulons and an expression plot.
It’s helpful to understand the two parts to most websites:
Front End: The front end runs in a web user’s browser. It requests and receives files from the back end, and contains JavaScript functions which can create interactive plots and other features. iModulonDB’s front end code is available at this GitHub repository.
Back End: The back end runs in a server, which is usually operated by the site owner. It sends the necessary files to the front end when requested. With iModulonDB, all back end data is precomputed by the PyModulon package using the instructions provided here. Output files are uploaded to GitHub Pages, which serves the files to users. If you host your own version of the site, the local server will be the back end.
To generate and view iModulonDB dashboards for a custom project, follow these steps:
Follow the instructions in this document to generate your own files for the new IcaData object of interest. In imodulondb_export, specify the path to the cloned repository so that PyModulon can place new files in the appropriate locations.
Host a local server from the repository folder. This will act as the back end in place of GitHub Pages.
In your browser, visit http://localhost:8000/dataset.html?organism=new_organism&dataset=new_dataset. The organism and dataset names in the URL are customizable.
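The last two steps can be sketched in Python. This is a minimal illustration, not part of PyModulon: the helper name dataset_url is hypothetical, and it assumes the server listens on port 8000 as in the URL above. One way to host the files (an assumption; any static file server should work) is to run python -m http.server 8000 from the repository folder.

```python
from urllib.parse import urlencode


def dataset_url(organism_folder, dataset_folder, host="localhost", port=8000):
    """Build the dataset page URL for a locally hosted copy of the site.

    Hypothetical helper: the query keys 'organism' and 'dataset' are taken
    from the example URL in this tutorial.
    """
    query = urlencode({"organism": organism_folder, "dataset": dataset_folder})
    return f"http://{host}:{port}/dataset.html?{query}"


print(dataset_url("new_organism", "new_dataset"))
# → http://localhost:8000/dataset.html?organism=new_organism&dataset=new_dataset
```

The two arguments would normally be the organism_folder and dataset_folder values used during export.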
[1]:
from pymodulon.core import IcaData
from pymodulon.imodulondb import *
from pymodulon.example_data import load_ecoli_data
[2]:
import pandas as pd
# Increase the maximum column width
pd.set_option('display.max_colwidth', None)
[3]:
ica_data = load_ecoli_data()
10.2. The imodulondb_compatibility Function
imodulondb_compatibility is essentially a convenient checklist for iModulonDB-relevant metadata. It has four outputs, which we will go through one-by-one.
Optional arguments:
inplace: Whether to modify the object. It is not recommended to change this from its default (False).
tfcomplex_to_gene: Dictionary mapping transcription factor complexes to gene names. See the section on tf_issues.has_gene for more information.
Returns:
table_issues: Dataframe describing missing columns from the major annotation tables and the site’s behavior if they are not updated. Three elements in this table are “CRITICAL”, meaning the site cannot be generated without them. If those elements are missing, they will appear as entries in this table and also raise warnings.
tf_issues: Dataframe describing each regulator from the imodulon_table that can’t be mapped to the trn, tf_links, and/or gene_table.
missing_gene_links: Series listing each gene that doesn’t have a gene_link to an external database.
missing_dois: Index listing each sample that doesn’t have an associated DOI in the sample_table. Clicking on these samples in the activity plots will not bring you to a relevant paper.
Call this function and browse its output. Not all issues it lists will be important to fix. In some cases, the default values will be fine for your purposes, or the match it is seeking might not exist.
[4]:
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
For demonstration purposes, we will also create a minimal IcaData object called blank_data containing only the M and A matrices. If we call imodulondb_compatibility with this object, the table_issues output will contain all possible issues.
Note the three warnings which appear when this happens; seeing any of these indicates that iModulonDB export cannot be completed. The possible CRITICAL issues are:
No X matrix: Each gene page contains a plot of gene expression, so you must provide the X matrix.
No project column in sample_table: See the section on table_issues.sample_table.
No condition column in sample_table: See the section on table_issues.sample_table.
[5]:
blank_data = IcaData(ica_data.M, ica_data.A)
all_table_issues, b_tf_issues, b_missing_gene_links, b_missing_dois = imodulondb_compatibility(blank_data)
WARNING:root:Critical issue: No X matrix
WARNING:root:Critical issue: No project column in sample_table.
WARNING:root:Critical issue: No condition column in sample_table.
10.2.1. The table_issues Output
The first output of imodulondb_compatibility is table_issues. Each row corresponds to an issue with one of the main class elements. The columns are:
Table: The table or other variable that has the issue.
Missing Column: The column of the Table with the issue (not case sensitive; capitalization is ignored).
Solution: Unless “CRITICAL” is in this cell, it describes how the site will behave if the issue is left unresolved.
If the X matrix is missing, it will be listed in the first row as a CRITICAL issue. After that, the imodulondb_table issues represent missing information from a dictionary specific to iModulonDB. The rest of the rows may relate to the gene_table, sample_table, or imodulon_table. These issues arise from parsing the column names of the annotation tables.
You may not be able to provide the information to fill in the missing columns (for example, if no database of gene functions exists for your organism). If you’d like, you can partially fill in any column and leave np.nan values for the information you do not know. In most cases, you can continue to omit the column; the Solution column of table_issues describes what happens in those cases.
Column names are case insensitive. Be sure not to have multiple columns with matching names, as that will cause errors later.
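Because matching ignores capitalization, two columns such as “COG” and “cog” would collide. A quick check along these lines can catch that before export; the helper name case_insensitive_duplicates is hypothetical and not part of PyModulon.

```python
from collections import Counter


def case_insensitive_duplicates(columns):
    """Return lower-cased column names that occur more than once."""
    counts = Counter(str(col).lower() for col in columns)
    return sorted(name for name, n in counts.items() if n > 1)


# Example column names; 'COG' and 'cog' clash once case is ignored.
columns = ["gene_name", "COG", "cog", "operon"]
print(case_insensitive_duplicates(columns))
# → ['cog']
```

In practice you could pass ica_data.gene_table.columns (or the columns of any other annotation table) instead of the example list.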
[6]:
table_issues
[6]:
 | Table | Missing Column | Solution |
---|---|---|---|
0 | iModulonDB | organism | The default, "New Organism", will be used. |
1 | iModulonDB | dataset | The default, "New Dataset", will be used. |
2 | iModulonDB | strain | The default, "Unspecified", will be used. |
3 | iModulonDB | publication_name | The default, "Unpublished Study", will be used. |
4 | iModulonDB | publication_link | The publication name will not be a hyperlink. |
5 | iModulonDB | gene_link_db | The default, "External Database", will be used. |
6 | iModulonDB | organism_folder | The default, "new_organism", will be used. |
7 | iModulonDB | dataset_folder | The default, "new_dataset", will be used. |
8 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
9 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
10 | Sample | n_replicates | This column will be generated for you. |
11 | iModulon | name | imodulon_table.index will be used. |
12 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
13 | iModulon | exp_var | This column will be left blank. |
10.2.1.1. The imodulondb_table Variable
Unless you are missing the X matrix, the first set of table_issues will relate to the iModulonDB table (imodulondb_table), which is a dictionary of details about the dataset in general. As you can see by reading the Solution column of table_issues for these entries, most entries have a default value (and the default behavior for a missing link is to not create a link).
Each entry affects the following parts of the site:
organism: Appears in the dataset page metadata box, and gets programmatically changed into a short form (“Escherichia coli” –> “E. coli”) to appear in the dataset title on iModulon and Gene pages. Be sure to use the form “Genus species”.
dataset: The specific dataset title, which could distinguish it from other datasets in this organism. Appears in the dataset page metadata box, and gets appended to the shortened organism name in the dataset title on iModulon and Gene pages.
strain: Appears on the dataset pages in the metadata box.
publication_name: Appears on the dataset pages in the metadata box. Typically we use the form “Smith, et al., year”.
publication_link: The publication_name will be turned into a hyperlink if this entry is not blank. DOIs preferred. PyModulon does not test the validity of any links, so be sure this is a valid link.
gene_link_db: Appears on Gene pages in the metadata box if a gene_link is also available for the gene. PyModulon expects all gene links to go to specific gene pages from the same database, and for the database name to be the one shown here. If gene links are not included, then “Not Available” will appear in place of the gene_link_db name. Examples of gene link databases include EcoCyc, SubtiWiki, and AureoWiki.
organism_folder: Determines output file location names. Also used in the URLs. Cannot contain spaces or slashes.
dataset_folder: Determines output file location names. Also used in URLs. Cannot contain spaces or slashes.
Any files with matching organism_folder and dataset_folder will be overwritten.
If you would like to italicize any part of these entries, you can use HTML tags: <i>italic</i>. These are automatically applied to organism names in the site.
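Two of the rules above are easy to check by hand before export. The sketch below illustrates the “Genus species” shortening and the folder-name restriction; both helpers are hypothetical illustrations and are not how PyModulon or the site implements these steps.

```python
def shorten_organism(name):
    """Turn 'Genus species' into 'G. species'; leave other forms unchanged."""
    parts = name.split(maxsplit=1)
    if len(parts) < 2:
        return name
    genus, species = parts
    return f"{genus[0]}. {species}"


def is_valid_folder(name):
    """Folder names must be non-empty and free of spaces and slashes."""
    return bool(name) and not any(ch in name for ch in (" ", "/", "\\"))


print(shorten_organism("Escherichia coli"))
# → E. coli
print(is_valid_folder("e_coli"), is_valid_folder("e coli"))
# → True False
```

Treating backslashes as slashes is an extra assumption made here for safety on Windows paths.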
[7]:
# Here are ALL possible issues with the imodulondb_table
all_table_issues.loc[all_table_issues.Table == 'iModulonDB']
[7]:
 | Table | Missing Column | Solution |
---|---|---|---|
1 | iModulonDB | organism | The default, "New Organism", will be used. |
2 | iModulonDB | dataset | The default, "New Dataset", will be used. |
3 | iModulonDB | strain | The default, "Unspecified", will be used. |
4 | iModulonDB | publication_name | The default, "Unpublished Study", will be used. |
5 | iModulonDB | publication_link | The publication name will not be a hyperlink. |
6 | iModulonDB | gene_link_db | The default, "External Database", will be used. |
7 | iModulonDB | organism_folder | The default, "new_organism", will be used. |
8 | iModulonDB | dataset_folder | The default, "new_dataset", will be used. |
[8]:
# here are the issues relevant to ica_data.imodulondb_table
# in this case, all issues appear
table_issues.loc[table_issues.Table == 'iModulonDB']
[8]:
 | Table | Missing Column | Solution |
---|---|---|---|
0 | iModulonDB | organism | The default, "New Organism", will be used. |
1 | iModulonDB | dataset | The default, "New Dataset", will be used. |
2 | iModulonDB | strain | The default, "Unspecified", will be used. |
3 | iModulonDB | publication_name | The default, "Unpublished Study", will be used. |
4 | iModulonDB | publication_link | The publication name will not be a hyperlink. |
5 | iModulonDB | gene_link_db | The default, "External Database", will be used. |
6 | iModulonDB | organism_folder | The default, "new_organism", will be used. |
7 | iModulonDB | dataset_folder | The default, "new_dataset", will be used. |
[9]:
# issues 0-7
# complete the imodulondb_table
ica_data.imodulondb_table = {
    'organism': 'Escherichia coli',
    'dataset': 'PRECISE 1',
    'strain': 'K-12 MG1655 and BW25113',
    'publication_name': 'Sastry, et al., 2019',
    'publication_link': 'https://doi.org/10.1038/s41467-019-13483-w',
    'gene_link_db': 'EcoCyc',
    'organism_folder': 'e_coli',
    'dataset_folder': 'precise1'
}
[10]:
# if we now re-call the function, the issues we fixed are removed.
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
table_issues.loc[table_issues.Table == 'iModulonDB']
[10]:
 | Table | Missing Column | Solution |
---|---|---|---|
10.2.1.2. The gene_table Variable
The gene_table contains details about each gene and shares an index with the M and X matrices. Its information is included in the following parts of the site:
Gene Tables on the iModulon Pages
Metadata box on the Gene Pages
Search results
name, cog, and gene_start are shown when a gene is hovered over in the gene scatter plot (middle right of iModulon Pages). cog is also used to color this scatter plot, and gene_start defines the x axis.
Note that column names are case-insensitive.
Some additional considerations for each column are described below:
gene_name: Try to use the most up-to-date names. Also, if a gene encodes a transcription factor (regulator) in the TRN, make sure that the names match (e.g. the transcription factor listed in the trn as ‘RpoS’ is encoded by the gene with gene_name ‘rpoS’). See the section on tf_issues.has_gene for more information.
gene_product: This field is meant to be a concise but specific description of the function/product of the gene, such as ‘homoserine kinase’. You can use HTML tags in this field if superscripts or subscripts are desired.
cog: This stands for ‘cluster of orthologous groups’, and should be a larger category of genes such as ‘Carbohydrate transport and metabolism’. In addition to being displayed everywhere listed above, each COG is assigned a random color in the gene scatter plot, so it is especially useful to have them.
gene_start: This is not displayed anywhere, but it is used as the x axis value for each gene in the scatter plots on the iModulon Pages. If a few gene_start values are missing, those genes will have a value of 0 on the x axis. If no gene_start values are provided, then the order of the gene_table will be used to assign integer values to each gene for the x axis of this plot.
operon: Preferably, this field will be the shortest string listing all genes, such as ‘artPIQM’, ‘argT-hisJQMP’, or ‘pdxB-usg-truA-dedA’. For very long operons with names that don’t share letters, you can list the first and last gene as in ‘yitB->yisZ’. iModulonDB is not picky about this column, however, so operons as provided by another database are acceptable.
regulator: Adding a TRN will automatically generate this column for you. It should contain a comma-separated list of all regulators that regulate this gene, e.g. ‘ppGpp,Lrp,Nac’.
If any of these columns are listed in your table_issues, be sure to check the gene_table for the appropriate information under other column names. Rename those columns so that they match the names above.
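The fallback behavior described for gene_start can be summarized in a few lines. This is only a sketch of the rule as stated above, not PyModulon’s code: the helper name scatter_x_values is hypothetical, missing values are represented by None, and starting the order-based numbering at 0 is an assumption.

```python
def scatter_x_values(gene_starts):
    """Choose x axis values for the gene scatter plot.

    - No start positions at all: use the row order (0, 1, 2, ...).
    - Some start positions missing: those genes are placed at 0.
    """
    if all(start is None for start in gene_starts):
        return list(range(len(gene_starts)))
    return [0 if start is None else start for start in gene_starts]


print(scatter_x_values([190, None, 5234]))
# → [190, 0, 5234]
print(scatter_x_values([None, None, None]))
# → [0, 1, 2]
```

Genes placed at 0 will pile up at the left edge of the plot, so filling in as many start positions as possible keeps the scatter plot readable.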
[11]:
# here are all possible issues with the gene_table
all_table_issues.loc[all_table_issues.Table == 'Gene']
[11]:
 | Table | Missing Column | Solution |
---|---|---|---|
9 | Gene | gene_name | Locus tags (gene_table.index) will be used. |
10 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
11 | Gene | cog | COG info will not display & the gene scatter plot will not have color. |
12 | Gene | start | The x axis of the scatter plot will be a numerical value instead of a genome location. |
13 | Gene | operon | Operon info will not display. |
14 | Gene | regulator | Regulator info will not display. If you have a TRN, add it to the model to auto-generate this column. |
[12]:
# here are the issues relevant to ica_data.gene_table
table_issues.loc[table_issues.Table == 'Gene']
[12]:
 | Table | Missing Column | Solution |
---|---|---|---|
0 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
[13]:
# issue 0
# if gene_product information is available, add it to the gene_table.
10.2.1.3. The sample_table Variable
The sample_table describes each sample and shares its index with the columns of X and A. Its information is very important for generating the activity and expression bar graphs on the iModulon and gene pages. Familiarize yourself with the activity plot.
Three columns in the sample_table define progressively smaller groupings of samples. It is CRITICAL that project and condition exist so that the activity plots can be generated; if sample is missing, the sample_table index is used instead. Short, human-readable, and specific names are preferred. Formatting should be consistent within your dataset, but it does not need to match other datasets in iModulonDB.
project: This defines the largest grouping of samples, which share a common theme and/or were featured in the same paper. Examples include: ‘biofilm’, ‘acid’, and ‘crp_KO’. In the activity bar graph on the iModulon Pages, the vertical lines separate each project and the names are displayed across the bottom.
condition: This is the smallest grouping of samples, used for experimental conditions. Samples have matching project and condition names if and only if they are biological replicates. Examples include: ‘biofilm_t0’, ‘biofilm_t10’, ‘wt_ctrl’, ‘del_crp’. In the activity bar graph on the iModulon Pages, each bar corresponds to a condition, and its height is the average activity value across all samples in the condition. Hovering over a bar provides additional details from the sample table, and zooming in switches the x labels to be conditions instead of projects.
sample: This must be unique for each sample. If it is not a column, it is assumed that it is the index of the sample_table. Use a human-readable sample name, preferably of the form ‘project__condition__#’, where ‘#’ indicates a replicate number. Examples: ‘carbon__wt_ctrl__1’, ‘carbon__wt_ctrl__2’, ‘carbon__del_crp__1’, ‘biofilm__biofilm_t0__1’. In the activity bar graph on the iModulon Pages, each dot floating near the bar is a sample. Also, the regulon scatter plots contain points corresponding to each sample; sample names will appear when points are hovered over in that plot as well.
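The preferred ‘project__condition__#’ naming can be built and checked with a couple of small helpers. These are hypothetical illustrations, not PyModulon functions; they assume a double underscore separates the three parts and that the replicate number is a plain integer.

```python
def build_sample_name(project, condition, replicate):
    """Join the three parts into the preferred sample name."""
    return f"{project}__{condition}__{replicate}"


def parse_sample_name(sample):
    """Split a sample name into (project, condition, replicate).

    Returns None if the name does not follow 'project__condition__#'.
    """
    parts = sample.split("__")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    project, condition, replicate = parts
    return project, condition, int(replicate)


print(build_sample_name("carbon", "wt_ctrl", 1))
# → carbon__wt_ctrl__1
print(parse_sample_name("fur__delfur_fe2__2"))
# → ('fur', 'delfur_fe2', 2)
print(parse_sample_name("sample_17"))
# → None
```

Running the parser over the sample_table index is a quick way to spot names that deviate from the convention, although deviating names are still accepted as long as they are unique.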
Finally, two other columns are checked:
n_replicates: For each condition, the samples all must have a matching n_replicates value that is used by the code. You can ignore this output, however, since a missing n_replicates column will automatically be generated. If you would like to remove this issue from the table, you can run the command generate_n_replicates_column(ica_data); otherwise, it will be automatically called on export.
doi: If you click on the bars of the activity plot and the first sample of the corresponding condition has a link in the doi column, then you will go to that link. This could be useful to learn more about a given sample. DOIs are preferred, but any link is allowed. PyModulon does not test the validity of any links, so be sure this is a valid link. See the section on missing_dois for more details on this.
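To make the n_replicates idea concrete, the sketch below counts how many samples share each (project, condition) pair. It assumes that n_replicates is simply that count; this is an assumption about the column’s meaning and not a copy of what generate_n_replicates_column does internally. The helper name count_replicates is hypothetical.

```python
from collections import Counter


def count_replicates(pairs):
    """For each sample, return how many samples share its (project, condition)."""
    counts = Counter(pairs)
    return [counts[pair] for pair in pairs]


# One (project, condition) tuple per sample, in sample_table order.
pairs = [
    ("control", "wt_glc"),
    ("control", "wt_glc"),
    ("fur", "wt_dpd"),
]
print(count_replicates(pairs))
# → [2, 2, 1]
```

Every sample in a condition ends up with the same value, which matches the requirement stated above.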
Note that in addition to the columns that PyModulon searches for, all columns of the sample_table can be used to annotate or color the activity plots. The more informative the sample_table is, the more useful the activity plots will be. You can access all the columns in any iModulon or Gene Page by clicking the wrench symbol next to the word ‘Activity’. Checkboxes next to each column name indicate whether its value will be displayed on hovering over the sample, and the ink button allows you to recolor the activity plot by the values of that feature. For example, ‘pH’ is not a required sample_table column, but it could be useful to color or label each sample by that variable. Other useful but completely optional columns include:
Strain Description
Carbon Source (g/L)
Supplement
Temperature (C)
Time (min)
Growth Phase
[14]:
# here are all possible issues with the sample_table
all_table_issues.loc[all_table_issues.Table == 'Sample']
[14]:
 | Table | Missing Column | Solution |
---|---|---|---|
15 | Sample | project | This is a CRITICAL column defining the largest grouping of samples. Vertical bars in the activity plot will separate projects. |
16 | Sample | condition | This is an CRITICAL column defining the smallest grouping of samples. Biological replicates must have matching projects and conditions, and they will appear as single bars with averaged activities. |
17 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
18 | Sample | n_replicates | This column will be generated for you. |
19 | Sample | doi | Clicking on activity plot bars will not link to relevant papers for the samples. |
[15]:
# here are the issues relevant to ica_data.sample_table
table_issues.loc[table_issues.Table == 'Sample']
[15]:
 | Table | Missing Column | Solution |
---|---|---|---|
1 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
2 | Sample | n_replicates | This column will be generated for you. |
[16]:
# issue 1
# check to make sure that the sample_table.index contains good names
ica_data.sample_table.index
[16]:
Index(['control__wt_glc__1', 'control__wt_glc__2', 'fur__wt_dpd__1',
'fur__wt_dpd__2', 'fur__wt_fe__1', 'fur__wt_fe__2',
'fur__delfur_dpd__1', 'fur__delfur_dpd__2', 'fur__delfur_fe2__1',
'fur__delfur_fe2__2',
...
'efeU__menFentC_ale29__1', 'efeU__menFentC_ale29__2',
'efeU__menFentC_ale30__1', 'efeU__menFentC_ale30__2',
'efeU__menFentCubiC_ale36__1', 'efeU__menFentCubiC_ale36__2',
'efeU__menFentCubiC_ale37__1', 'efeU__menFentCubiC_ale37__2',
'efeU__menFentCubiC_ale38__1', 'efeU__menFentCubiC_ale38__2'],
dtype='object', length=278)
[17]:
# the sample names above are good; we can now ignore issue 1
# issue 2
# optionally add the n_replicates column
generate_n_replicates_column(ica_data)
10.2.1.4. The imodulon_table Variable
The imodulon_table describes each iModulon and shares its index with the columns of M and the index of A. Its data is featured on iModulonDB in the following places:
Dataset Page, main section
iModulon tables on the Gene Pages (regulator, function, and category columns)
Metadata box on the iModulon Pages
Search results
Some additional considerations for each column are described below:
name: Usually, this issue can be ignored because the imodulon_table.index always matches the imodulon_names. If the names are all integers, then the word “iModulon” will be added (“0” –> “iModulon 0”).
regulator: This column is very important. Use regulator names that match the TRN. Join regulators using either ‘+’ to represent the intersection of regulons or ‘/’ to represent the union of regulons. Regulator links will be added in the iModulon Page metadata boxes according to the tf_links variable. The content of this column affects the behavior of the gene table, gene histogram, regulon venn diagram, and regulon scatter plots. See the section on tf_issues.
function: This is meant to be a specific description of each iModulon’s function, such as “Histidine biosynthesis”.
category: This column provides larger groupings of iModulons, such as “Amino Acid Biosynthesis”.
n_genes: The number of genes in the iModulon. This can be ignored since it will be computed for you.
precision: The overlap between the iModulon and its regulon divided by the size of the iModulon. See the tutorial on gene enrichment analysis.
recall: The overlap between the iModulon and its regulon divided by the size of the regulon. See the tutorial on gene enrichment analysis.
exp_var: The explained variance when this iModulon alone is used to reconstruct the original X matrix. Use decimal values; they will be converted to percentages in iModulonDB. These can be computed using the explained_variance function; see the tutorial on additional functions.
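The precision and recall definitions above translate directly into set arithmetic. The following is a minimal sketch using made-up gene labels; the helper name precision_recall is hypothetical, and in a real analysis these values would come from the gene enrichment tools rather than from this function.

```python
def precision_recall(imodulon_genes, regulon_genes):
    """Compute (precision, recall) from two collections of gene identifiers."""
    imodulon = set(imodulon_genes)
    regulon = set(regulon_genes)
    overlap = len(imodulon & regulon)
    precision = overlap / len(imodulon) if imodulon else 0.0
    recall = overlap / len(regulon) if regulon else 0.0
    return precision, recall


# Illustrative gene sets: 2 shared genes, 4 in the iModulon, 8 in the regulon.
imodulon = ["g1", "g2", "g3", "g4"]
regulon = ["g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
print(precision_recall(imodulon, regulon))
# → (0.5, 0.25)
```

Returning 0.0 for an empty set is a choice made here to avoid division by zero.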
[18]:
# here are all possible issues with the imodulon_table
all_table_issues.loc[all_table_issues.Table == 'iModulon']
[18]:
 | Table | Missing Column | Solution |
---|---|---|---|
20 | iModulon | name | imodulon_table.index will be used. |
21 | iModulon | regulator | The regulator details will be left blank. |
22 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
23 | iModulon | category | The categories will be filled in as "Uncharacterized". |
24 | iModulon | n_genes | This column will be computed for you. |
25 | iModulon | precision | This column will be left blank. |
26 | iModulon | recall | This column will be left blank. |
27 | iModulon | exp_var | This column will be left blank. |
[19]:
# here are the issues relevant to ica_data.imodulon_table
table_issues.loc[table_issues.Table == 'iModulon']
[19]:
 | Table | Missing Column | Solution |
---|---|---|---|
3 | iModulon | name | imodulon_table.index will be used. |
4 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
5 | iModulon | exp_var | This column will be left blank. |
[20]:
# issue 3
# check that the imodulon_table.index contains good names
ica_data.imodulon_table.index
[20]:
Index(['AllR/AraC/FucR', 'ArcA-1', 'ArcA-2', 'ArgR', 'AtoC', 'BW25113',
'Cbl+CysB', 'CdaR', 'CecR', 'Copper', 'CpxR', 'Cra', 'Crp-1', 'Crp-2',
'CsqR', 'CysB', 'DhaR/Mlc', 'EvgA', 'ExuR/FucR', 'FadR', 'FecI',
'FlhDC', 'FliA', 'Fnr', 'Fur-1', 'Fur-2', 'GadEWX', 'GadWX', 'GcvA',
'GlcC', 'GlpR', 'GntR/TyrR', 'His-tRNA', 'Leu/Ile', 'Lrp', 'MalT',
'MetJ', 'Nac', 'NagC/TyrR', 'NarL', 'NikR', 'NtrC+RpoN', 'OxyR', 'PrpR',
'PurR-1', 'PurR-2', 'PuuR', 'Pyruvate', 'RbsR', 'RcsAB', 'RpoH', 'RpoS',
'SoxS', 'SrlR+GutM', 'Thiamine', 'Tryptophan', 'XylR', 'YgbI', 'YiaJ',
'YieP', 'YneJ', 'Zinc', 'crp-KO', 'curli', 'deletion-1', 'deletion-2',
'duplication-1', 'e14-deletion', 'efeU-repair', 'entC-menF-KO',
'fimbriae', 'flu-yeeRS', 'fur-KO', 'gadWX-KO', 'insertion',
'iron-related', 'lipopolysaccharide', 'membrane', 'nitrate-related',
'proVWX', 'purR-KO', 'sgrT', 'thrA-KO', 'translation',
'uncharacterized-1', 'uncharacterized-2', 'uncharacterized-3',
'uncharacterized-4', 'uncharacterized-5', 'uncharacterized-6',
'ydcI-KO', 'yheO-KO'],
dtype='object')
[21]:
# the names are good; we can ignore issue 3
# issue 4
# write a function column if desired
# (part of iModulon characterization)
# issue 5
# compute the explained variance for each iModulon
from pymodulon.util import explained_variance
for k in ica_data.imodulon_table.index:
    ica_data.imodulon_table.loc[k, 'exp_var'] = explained_variance(
        ica_data, imodulons=k)
10.2.2. The tf_issues Output
The next output is the tf_issues dataframe. Each row corresponds to a regulator that is used in the imodulon_table. Regulators that satisfy all requirements are omitted. False values in this table represent potentially missing information.
The columns refer to the following:
in_trn: whether the regulator is in the model.trn. Regulators not in the TRN will be ignored in the site’s histograms and gene tables. It is highly recommended that all regulators be in the TRN. Any False values in this column should be fixed by adding rows to the trn and/or ensuring that names all match up between imodulon_table.regulator and trn.regulator.
has_link: whether the regulator has a link in ica_data.tf_links. If it does, its name will be clickable in the iModulon Page metadata box. If not, the regulator name will not appear as a hyperlink. Any regulators without dedicated pages in other databases should be ignored.
has_gene: whether the regulator can be matched to a gene in the model. This is used to generate the regulon scatter plot which appears at the bottom of the iModulon Page, comparing iModulon activity to the expression of the regulators. If your regulator is a gene, try to ensure that its name matches the gene_table.gene_name. If your regulator doesn’t correspond to a gene (e.g. the small molecule regulator ppGpp), then a False value in this column is acceptable.
Note that in our minimal example, the tf_issues will be empty because there is no regulator column in the imodulon_table.
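Each row of tf_issues is an individual regulator, while an entry in imodulon_table.regulator may join several with ‘+’ or ‘/’. The sketch below shows one way such a string could be broken into its parts. It is an illustration only: the helper name split_regulators is hypothetical, the example strings are assumed, and PyModulon’s own parsing may differ.

```python
import re


def split_regulators(regulator_string):
    """Split a combined regulator entry on '+' (intersection) or '/' (union)."""
    return [name for name in re.split(r"[+/]", regulator_string) if name]


print(split_regulators("ntrC+rpoN"))
# → ['ntrC', 'rpoN']
print(split_regulators("gntR/tyrR"))
# → ['gntR', 'tyrR']
print(split_regulators("fur"))
# → ['fur']
```

Each resulting name is what would then need to be found in the trn, tf_links, and gene_table.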
[22]:
tf_issues
[22]:
 | in_trn | has_link | has_gene |
---|---|---|---|
allR | True | False | True |
fucR | True | False | True |
araC | True | False | True |
arcA | True | False | True |
argR | True | False | True |
... | ... | ... | ... |
trpR | True | False | True |
xylR | True | False | True |
yiaJ | True | False | True |
zur | True | False | True |
zntR | True | False | True |
70 rows × 3 columns
10.2.2.1. The in_trn Column
As described above, any False values in this column should definitely be fixed by ensuring name match-ups between imodulon_table.regulator and trn.regulator columns, or by adding rows to trn. True values and omitted regulators indicate matchable regulators.
[23]:
# here is a list of missing regulators from the TRN
tf_issues.index[~tf_issues.in_trn.astype(bool)]
[23]:
Index([], dtype='object')
[24]:
# it is empty; no issues
10.2.2.2. The has_link Column
This column encourages you to find links for each transcription factor and put them in the dictionary ica_data.tf_links. Find a database for your regulators (e.g. RegulonDB), and either make a file full of links or programmatically generate links by finding patterns in the database’s URLs.
In the example below, a file of regulator links has already been made.
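For the programmatic route, a URL pattern plus a table of database identifiers is usually enough. The sketch below follows the RegulonDB link pattern that appears in this tutorial’s example output; the helper name regulon_link is hypothetical, and you would need to supply the identifier for each regulator yourself.

```python
from urllib.parse import urlencode

BASE_URL = "http://regulondb.ccg.unam.mx/regulon"


def regulon_link(regulon_id):
    """Build a regulator page URL from a database identifier."""
    query = urlencode({
        "term": regulon_id,
        "organism": "ECK12",
        "format": "jsp",
        "type": "regulon",
    })
    return f"{BASE_URL}?{query}"


# Identifiers taken from the example links shown in this tutorial.
regulon_ids = {"glpR": "ECK120012730", "argR": "ECK120011670"}
tf_links = {tf: regulon_link(rid) for tf, rid in regulon_ids.items()}
print(tf_links["glpR"])
```

The resulting dictionary has the same shape as the one assigned to ica_data.tf_links. As noted above, PyModulon does not validate links, so spot-check a few in a browser.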
[25]:
# here is a list of missing tf_links
tf_issues.index[~tf_issues.has_link.astype(bool)]
[25]:
Index(['allR', 'fucR', 'araC', 'arcA', 'argR', 'atoC', 'cbl', 'cysB', 'cdaR',
'cecR', 'cusR', 'hprR', 'cueR', 'cpxR', 'cra', 'crp', 'csqR', 'dhaR',
'mlc', 'evgA', 'exuR', 'fadR', 'fecI', 'flhD;flhC', 'fliA', 'fnr',
'fur', 'gadW', 'gadE', 'gadX', 'gcvA', 'glcC', 'glpR', 'tyrR', 'gntR',
'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'ilvY', 'lrp', 'malT', 'metJ',
'nac', 'nagC', 'narL', 'nikR', 'ntrC', 'rpoN', 'oxyR', 'prpR', 'purR',
'puuR', 'btsR', 'ypdB', 'pdhR', 'rbsR', 'rcsA;rcsB', 'rpoH', 'rpoS',
'soxS', 'gutM', 'srlR', 'TPP', 'L-tryptophan', 'trp-tRNA', 'trpR',
'xylR', 'yiaJ', 'zur', 'zntR'],
dtype='object')
[26]:
# read in a file of links
import pandas as pd
file_location = '../../src/pymodulon/data/imodulondb/e_coli_tf_links.csv'
tf_links = pd.read_csv(file_location, header = None, index_col = 0)
# convert to dictionary
tf_links = tf_links.to_dict()[1]
# add to ica_data
ica_data.tf_links = tf_links
# display a few for demonstration purposes
{k:tf_links[k] for k in list(tf_links.keys())[0:5]}
[26]:
{'glpR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120012730&organism=ECK12&format=jsp&type=regulon',
'dhaR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120015690&organism=ECK12&format=jsp&type=regulon',
'mlc': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011240&organism=ECK12&format=jsp&type=regulon',
'argR': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011670&organism=ECK12&format=jsp&type=regulon',
'narL': 'http://regulondb.ccg.unam.mx/regulon?term=ECK120011502&organism=ECK12&format=jsp&type=regulon'}
[27]:
# let's re-check the tf_issues now
table_issues, tf_issues, missing_gene_links, missing_dois = imodulondb_compatibility(ica_data)
tf_issues.index[~tf_issues.has_link.astype(bool)]
[27]:
Index(['his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan', 'trp-tRNA'], dtype='object')
[28]:
# none of those regulators have pages on RegulonDB
# so they can be ignored
10.2.2.3. The has_gene Column
As described above, any False values in this column indicate an inability to match between imodulon_table.regulator and gene_table.gene_name. Some regulators will not have genes, so some False values are acceptable.
In some cases, the regulator name will be a complex of several genes. PyModulon supports an additional input, tfcomplex_to_gene, which is a dictionary mapping those regulators to their preferred gene. This variable will need to be passed to both imodulondb_compatibility and imodulondb_export.
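The role of tfcomplex_to_gene can be pictured as a lookup applied before the gene name match. The sketch below is a simplified illustration with a hypothetical helper, regulator_to_gene, and exact-string matching; it is not PyModulon’s matching logic.

```python
def regulator_to_gene(regulator, gene_names, tfcomplex_to_gene=None):
    """Return the gene name a regulator maps to, or None if there is no match.

    Complexes are first translated through tfcomplex_to_gene; any other
    regulator is looked up under its own name.
    """
    mapping = tfcomplex_to_gene or {}
    candidate = mapping.get(regulator, regulator)
    return candidate if candidate in gene_names else None


gene_names = {"flhD", "flhC", "rcsA", "rcsB", "fur"}
complexes = {"flhD;flhC": "flhD", "rcsA;rcsB": "rcsB"}

print(regulator_to_gene("flhD;flhC", gene_names, complexes))
# → flhD
print(regulator_to_gene("flhD;flhC", gene_names))
# → None
print(regulator_to_gene("TPP", gene_names, complexes))
# → None
```

The second call shows why a complex appears with has_gene set to False until the mapping is supplied.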
[29]:
tf_issues.index[~tf_issues.has_gene.astype(bool)]
[29]:
Index(['cecR', 'flhD;flhC', 'glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'btsR',
'rcsA;rcsB', 'gutM', 'TPP', 'L-tryptophan', 'trp-tRNA'],
dtype='object')
[30]:
# three of these are name mismatches
# find the old names and replace them
ica_data.gene_table = ica_data.gene_table.replace({
    'ybiH':'cecR',
    'yehT':'btsR',
    'srlM':'gutM'
})
# two are complexes
tfcomplex_to_gene = {
    'flhD;flhC':'flhD',
    'rcsA;rcsB':'rcsB'
}
[31]:
# let's re-check the tf_issues now
# use the tfcomplex_to_gene input
table_issues, tf_issues, missing_gene_links, missing_dois = \
    imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)
tf_issues.index[~tf_issues.has_gene.astype(bool)]
[31]:
Index(['glpR', 'his-tRNA', 'leu-tRNA', 'ile-tRNA', 'TPP', 'L-tryptophan',
'trp-tRNA'],
dtype='object')
[32]:
# the rest are pseudogene or non-gene regulators; ignore.
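The complex-handling step above can be partly automated. The following is a minimal sketch, not part of PyModulon: the helper propose_complex_mapping is hypothetical, and it assumes that complexes are written as semicolon-separated gene names (as in flhD;flhC). It simply proposes the first subunit that appears among the gene names, so the result should still be reviewed by hand; in the example above, rcsB rather than rcsA was chosen for rcsA;rcsB.

```python
def propose_complex_mapping(regulators, gene_names):
    """Suggest a tfcomplex_to_gene dict for semicolon-separated complexes.

    Hypothetical helper (not a PyModulon function). For each regulator that
    contains ';', pick the first subunit that appears in gene_names.
    """
    known = set(gene_names)
    proposal = {}
    for reg in regulators:
        if ';' not in reg:
            continue  # not a complex; leave it for manual inspection
        subunits = [s.strip() for s in reg.split(';')]
        matches = [s for s in subunits if s in known]
        if matches:
            proposal[reg] = matches[0]
    return proposal


# a few of the regulators flagged by tf_issues.has_gene above
flagged = ['cecR', 'flhD;flhC', 'glpR', 'his-tRNA', 'rcsA;rcsB', 'TPP']
# illustrative subset of gene_table.gene_name (assumed for this sketch)
gene_names = ['flhD', 'flhC', 'rcsA', 'rcsB', 'cecR']

proposal = propose_complex_mapping(flagged, gene_names)
print(proposal)
```

Any proposed entry can be overridden before the dictionary is passed to imodulondb_compatibility and imodulondb_export.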
10.2.3. The missing_gene_links Output
The next output is missing_gene_links. Similar to tf_issues.has_link, this output shows all the genes that are missing links in the gene_links dictionary. The gene_links dictionary is indexed by locus tag (gene_table.index).
Gene links are featured on the Gene Pages in the metadata box. The text of the hyperlink is determined by imodulondb_table.gene_link_db. Note that this does not have to be the same database as the one used for tf_links.
As with the tf_links, gene_links can be filled out programmatically or using an external file. In the example below, we use an external file.
[33]:
missing_gene_links
[33]:
0 b0002
1 b0003
2 b0004
3 b0005
4 b0006
...
3918 b4688
3919 b4693
3920 b4696_1
3921 b4696_2
3922 b4705
Name: missing_gene_links, Length: 3923, dtype: object
[34]:
# read in a file of links
import pandas as pd
file_location = '../../src/pymodulon/data/imodulondb/e_coli_gene_links.csv'
gene_links = pd.read_csv(file_location, header = None, index_col = 0)
# convert to dictionary
gene_links = gene_links.to_dict()[1]
# add to ica_data
ica_data.gene_links = gene_links
# display a few for demonstration purposes
{k:gene_links[k] for k in list(gene_links.keys())[0:5]}
[34]:
{'b0002': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10998',
'b0003': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10999',
'b0004': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG11000',
'b0005': 'https://ecocyc.org/gene?orgid=ECOLI&id=G6081',
'b0006': 'https://ecocyc.org/gene?orgid=ECOLI&id=EG10011'}
[35]:
# let's re-check missing_gene_links now
table_issues, tf_issues, missing_gene_links, missing_dois = \
imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)
missing_gene_links.values
[35]:
array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)
[36]:
# all the above are pseudogenes without available links
# they can be ignored
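When the list of missing links is long, it can help to confirm that nothing unexpected is hiding in it. The sketch below is an illustration rather than a PyModulon feature: split_missing_links is a hypothetical helper, and it assumes that split pseudogene fragments are the locus tags carrying a numeric suffix such as _1 or _2, which matches the entries shown above.

```python
import re

# Assumption: fragment-style tags look like "b0240_2" (locus tag + "_" + number)
FRAGMENT_TAG = re.compile(r'^b\d{4}_\d+$')


def split_missing_links(missing_tags):
    """Separate fragment-style locus tags from tags that deserve a second look."""
    fragments, needs_review = [], []
    for tag in missing_tags:
        if FRAGMENT_TAG.match(tag):
            fragments.append(tag)
        else:
            needs_review.append(tag)
    return fragments, needs_review


# a few entries from the output above, plus 'b0002' added only to show
# what an ordinary gene that is still missing a link would look like
missing = ['b0240_2', 'b0502_1', 'b4038_3', 'b4696_1', 'b4696_2', 'b0002']
fragments, needs_review = split_missing_links(missing)
print(len(fragments), needs_review)
```

An empty needs_review list suggests that only fragment-style entries remain, although the assumption about the suffix should be checked against your own annotation.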
10.2.4. The missing_dois Output
The final output is the missing_dois output. This output searches for completed entries in sample_table.doi, and lists the sample_table.index associated with any missing entries.
It may be helpful to think of the DOIs as “sample_links”. Clicking on the bar of an activity plot on the iModulon Pages or an expression plot on the Gene Pages will take users to the DOI of the corresponding sample if it exists. If this project is part of the modulome, in which samples from many publications are pooled, it is especially useful to fill in all of the DOIs.
If no doi column exists in the sample_table, all samples will be listed as missing their DOIs.
[37]:
missing_dois
[37]:
Index(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
'omics__wt_glu__1', 'omics__wt_glu__2', 'minspan__wt_glc__4',
'minspan__bw_delcra_glc__2', 'ica__wt_glc__1', 'ica__wt_glc__2',
'ica__wt_glc__3', 'ica__wt_glc__4', 'ica__arg_sbt__1',
'ica__arg_sbt__2', 'ica__cytd_rib__1', 'ica__cytd_rib__2',
'ica__gth__1', 'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
dtype='object', name='missing_DOIs')
[38]:
# a couple of the missing dois do exist
dois = {
'minspan__wt_glc__4': 'doi.org/10.15252/msb.20145243',
'minspan__bw_delcra_glc__2': 'doi.org/10.15252/msb.20145243',
'omics__wt_glu__1': 'doi.org/10.1038/ncomms13091',
'omics__wt_glu__2': 'doi.org/10.1038/ncomms13091'
}
# add them to the sample_table
for sample, doi in dois.items():
ica_data.sample_table.loc[sample, 'DOI'] = doi
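If every sample from a project comes from the same publication, the DOI dictionary can be built from the sample names instead of typed out by hand. This is a minimal sketch under that assumption: fill_dois_by_project is a hypothetical helper, and it relies on the "project__condition__#" naming syntax for sample_table.index.

```python
def fill_dois_by_project(sample_names, project_dois):
    """Map each sample to its project's DOI.

    Hypothetical helper. Assumes sample names follow the
    'project__condition__#' syntax and that one DOI covers a whole project.
    """
    sample_dois = {}
    for name in sample_names:
        project = name.split('__')[0]
        if project in project_dois:
            sample_dois[name] = project_dois[project]
    return sample_dois


# DOI taken from the example above; only the 'omics' project is covered here
project_dois = {'omics': 'doi.org/10.1038/ncomms13091'}
samples = ['omics__wt_glu__1', 'omics__wt_glu__2', 'ica__wt_glc__1']

sample_dois = fill_dois_by_project(samples, project_dois)
print(sample_dois)
```

The resulting dictionary has the same shape as dois above, so it can be written to the sample_table with the same loop. Samples from projects without a known DOI are left out and will remain in missing_dois.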
10.2.5. Double-check Compatibility and Save
Once all the outputs have been checked, we encourage you to re-run the imodulondb_compatibility check to make sure that you are comfortable ignoring all remaining missing information. If so, save your ica_data object so that you don’t need to repeat any of these steps!
[39]:
table_issues, tf_issues, missing_gene_links, missing_dois = \
imodulondb_compatibility(ica_data, tfcomplex_to_gene = tfcomplex_to_gene)
print('--Table Issues--')
display(table_issues)
print('--TF Issues--')
display(tf_issues)
print('--Missing Gene Links--')
display(missing_gene_links.values)
print('--Missing DOIs--')
display(missing_dois.values)
--Table Issues--
 | Table | Missing Column | Solution |
---|---|---|---|
0 | Gene | gene_product | Locus tags (gene_table.index) will be used. |
1 | Sample | sample | The sample_table.index will be used. Each entry must be unique. Note that the preferred syntax is "project__condition__#." |
2 | iModulon | name | imodulon_table.index will be used. |
3 | iModulon | function | The function will be blank in the dataset table and "Uncharacterized" in the iModulon dashboard |
--TF Issues--
 | in_trn | has_link | has_gene |
---|---|---|---|
glpR | True | True | False |
his-tRNA | True | False | False |
leu-tRNA | True | False | False |
ile-tRNA | True | False | False |
TPP | True | False | False |
L-tryptophan | True | False | False |
trp-tRNA | True | False | False |
--Missing Gene Links--
array(['b0240_2', 'b0502_1', 'b0553_1', 'b0562_2', 'b1459_1', 'b2092_1',
'b2092_2', 'b2139_2', 'b2190_2', 'b2641_2', 'b2681_1', 'b2681_2',
'b2891_2', 'b3046_2', 'b3268_2', 'b3423_1', 'b3423_2', 'b3643_1',
'b3777_2', 'b4038_3', 'b4308_2', 'b4488_1', 'b4488_2', 'b4490_2',
'b4491_1', 'b4492_1', 'b4492_3', 'b4493_1', 'b4493_2', 'b4495_1',
'b4495_2', 'b4496_1', 'b4496_2', 'b4498_1', 'b4498_2', 'b4499_1',
'b4571_1', 'b4571_2', 'b4575_2', 'b4580_1', 'b4580_2', 'b4581_2',
'b4582_1', 'b4582_2', 'b4587_1', 'b4587_2', 'b4600_2', 'b4623_1',
'b4640_1', 'b4646_2', 'b4658_1', 'b4658_2', 'b4659_1', 'b4659_2',
'b4660_1', 'b4696_1', 'b4696_2'], dtype=object)
--Missing DOIs--
array(['misc__wt_no_te__1', 'misc__wt_no_te__2', 'misc__bw_delcbl__1',
'misc__bw_delcbl__2', 'misc__bw_delfabr__1', 'misc__bw_delfabr__2',
'misc__bw_delfadr__1', 'misc__bw_delfadr__2', 'misc__nitr_031__1',
'ica__wt_glc__1', 'ica__wt_glc__2', 'ica__wt_glc__3',
'ica__wt_glc__4', 'ica__arg_sbt__1', 'ica__arg_sbt__2',
'ica__cytd_rib__1', 'ica__cytd_rib__2', 'ica__gth__1',
'ica__gth__2', 'ica__leu_glcr__1', 'ica__leu_glcr__2',
'ica__met_glc__1', 'ica__met_glc__2', 'ica__no3_anaero__1',
'ica__no3_anaero__2', 'ica__phe_acgam__1', 'ica__phe_acgam__2',
'ica__thm_gal__1', 'ica__thm_gal__2', 'ica__tyr_glcn__1',
'ica__tyr_glcn__2', 'ica__ura_pyr__1', 'ica__ura_pyr__2',
'ica__wt_glc__5', 'ica__wt_glc__6', 'ica__bw_delpurR_cytd__1',
'ica__bw_delpurR_cytd__2', 'ica__ade_glc__1', 'ica__ade_glc__2'],
dtype=object)
[40]:
# the above output is all acceptable
# save your results
from pymodulon.io import save_to_json
save_to_json(ica_data, 'e_coli_imdb.json')
10.3. The imodulondb_export Function
After iModulonDB compatibility has been assured, it is time to use the imodulondb_export function.
Arguments:
model: the ica_data object
path: the path to the main folder of iModulonDB. This is where you cloned the iModulonDB GitHub repository, and where you should host your local server from. This function will create new folders and files inside this repository. The default is the current working directory.
cat_order: a list of each of the imodulon_table.category values in order. When sorting by category in the Dataset Pages, the categories will appear in this order. Otherwise, they will appear in alphabetical order. Note that the default sorting is now by imodulon_table.exp_var (or by iModulon name if no explained variance is provided), so this input is not very important.
tfcomplex_to_gene: a dictionary relating regulatory complexes in imodulon_table.regulator to gene names in gene_table.gene_name. See the section on tf_issues.has_gene.
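Because cat_order is typed by hand, a misspelled or missing category is easy to overlook. The following sketch is a sanity check you could run before exporting. check_cat_order is a hypothetical helper, not part of PyModulon, and the source does not state how imodulondb_export treats categories that are absent from cat_order, so the sketch only reports mismatches.

```python
def check_cat_order(categories, cat_order):
    """Compare the categories in use with a hand-written cat_order.

    Hypothetical helper. Returns (missing, unused):
      missing - categories present in the data but absent from cat_order
      unused  - cat_order entries that no iModulon uses
    """
    present = set(categories)
    missing = sorted(present - set(cat_order))
    unused = [c for c in cat_order if c not in present]
    return missing, unused


# stand-in for list(ica_data.imodulon_table.category); note the lowercase typo
categories = ['Stress Response', 'Metal Homeostasis', 'Stress Response',
              'Uncharacterized', 'Energy metabolism']
cat_order = ['Energy Metabolism', 'Metal Homeostasis',
             'Stress Response', 'Uncharacterized']

missing, unused = check_cat_order(categories, cat_order)
print('missing from cat_order:', missing)
print('unused entries:', unused)
```

A matching pair of missing and unused entries, as here, usually points to a capitalization or spelling difference between the table and the list.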
[41]:
cat_order = ['Carbon Source Utilization',
'Amino Acid and Nucleotide Biosynthesis',
'Energy Metabolism',
'Miscellaneous Metabolism',
'Structural Components',
'Metal Homeostasis',
'Stress Response',
'Regulator Discovery',
'Biological Enrichment',
'Genomic Alterations',
'Uncharacterized']
imodulondb_export(ica_data,
'../iModulonDB',
cat_order = cat_order,
tfcomplex_to_gene = tfcomplex_to_gene)
Writing main site files...
Done writing main site files. Writing plot files...
Two progress bars will appear below. The second will take significantly longer than the first.
Writing iModulon page files (1/2)
Writing Gene page files (2/2)
Complete! (Organism = e_coli; Dataset = precise1)
Now that the files have been exported, you can access them in your browser using the instructions at the beginning of this tutorial.
10.3.1. Adding a Link to the Splash Page
If you would like to be able to access your data from the splash page of your locally hosted site, you will have to edit the HTML code in index.html, which is in the main folder.
Copy the contents of organisms/new_organism/new_dataset/html_for_splash.html.
Open index.html using a text editor or IDE.
Find “<!-- INSERT NEW DATASETS BELOW THIS LINE -->”, which should be near line 297.
Paste the contents according to the comments in the file.
If desired, increase the indentation of the pasted lines.
Now, if you reload localhost:8000, the new dataset will appear on the left-hand side of the splash page.
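The copy-and-paste steps above could also be scripted. The sketch below is one possible approach, not an official tool: insert_after_marker is a hypothetical helper that works on strings, it assumes the marker comment appears in index.html exactly as written in MARKER, and it places the snippet directly after the marker line with matching indentation. The comments in index.html remain the authority on where the snippet belongs.

```python
MARKER = '<!-- INSERT NEW DATASETS BELOW THIS LINE -->'


def insert_after_marker(index_html, snippet, marker=MARKER):
    """Return index_html with snippet inserted after the first marker line.

    Hypothetical helper. Each snippet line is indented to match the marker.
    """
    lines = index_html.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            indent = line[:len(line) - len(line.lstrip())]
            block = ''.join(indent + s + '\n' for s in snippet.splitlines())
            head = ''.join(lines[:i + 1])
            if not head.endswith('\n'):
                head += '\n'  # marker was the last line without a newline
            return head + block + ''.join(lines[i + 1:])
    raise ValueError('marker not found in index.html')


# tiny stand-ins for index.html and html_for_splash.html
page = '<ul>\n  <!-- INSERT NEW DATASETS BELOW THIS LINE -->\n</ul>\n'
snippet = '<li>new_dataset</li>'

updated = insert_after_marker(page, snippet)
print(updated)
```

To apply it to real files, read index.html and html_for_splash.html into strings, call the function, and write the result back after checking it in a text editor.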