5. Searching for motifs
5.1. Using MEME to find motifs
MEME must be installed to use this functionality. find_motifs
searches for motifs upstream of all genes in an iModulon. The gene_table
must contain the columns accession
and operon
for this function to work (see notebooks/gene_annotation.ipynb
).
find_motifs
supports many of the command-line options for MEME:
outdir
: Directory for output filespalindrome
: If True, limit search to palindromic motifs (default: False)nmotifs
: Number of motifs to search for (default: 5)upstream
: Number of basepairs upstream from first gene in operon to include in motif search (default: 500)downstream
: Number of basepairs upstream from first gene in operon to include in motif search (default: 100)verbose
: Show steps in verbose output (default: True)force
: Force execution of MEME even if output already exists (default: False)evt
: E-value threshold (default: 0.001)cores
Number of cores to use (default: 8)minw
: Minimum motif width in basepairs (default: 6)maxw
: Maximum motif width in basepairs (default: 40)minsites
: Minimum number of sites required for a motif. Default is the number of operons divided by 3.
[1]:
from pymodulon.motif import *
from pymodulon.example_data import load_ecoli_data, ecoli_fasta
ica_data = load_ecoli_data()
[4]:
motifs = find_motifs(ica_data, 'ArgR', ecoli_fasta, outdir='tmp')
Finding motifs for 11 sequences
Running command: meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3
Found 2 motifs across 15 sites
find_motifs
creates a MotifInfo
object
[5]:
motifs
[5]:
<MotifInfo with 2 motifs across 15 sites>
The MotifInfo
object contains the following properties:
motifs
: Information about the motifs (e.g. E-value, width, consensus sequence)
[6]:
motifs.motifs
[6]:
e_value | sites | width | consensus | motif_frac | |
---|---|---|---|---|---|
motif | |||||
MEME-1 | 1.7e-30 | 10 | 39 | TGAATWWAHATKCAVTWWWTWTGMATAAAWATWCAHTRR | 0.909091 |
MEME-2 | 0.00015 | 5 | 39 | GMAYASGCWSVTYGYGGVTGHCHGMDGCNACGCTGGCGY | 0.454545 |
sites
: Information about predicted binding sites (e.g. location, p-value)
[7]:
motifs.sites
[7]:
rel_position | pvalue | site_seq | genes | locus_tags | start_pos | strand | ||
---|---|---|---|---|---|---|---|---|
motif | operon | |||||||
MEME-1 | argA | 453 | 2.78e-12 | AGAATAAAAATACACTAATTTCGAATAATCATGCAAAGA | argA | b2818 | 2948741 | + |
argCBH | 375 | 1.89e-16 | TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + | |
argD | 132 | 6.68e-14 | TGAAATTATAACCACAAAATATGCATAAAAAATCACTAA | argD | b3359 | 3490080 | - | |
argE | 128 | 1.89e-16 | TCAATATTCATGCAGTATTTATGAATAAAAATACACTAA | argE | b3957 | 4154747 | - | |
argF | 129 | 6.79e-16 | TGAATTAAAATTCACTTTATATGTGTAATTATTCATTTG | argF | b0273 | 290205 | - | |
argG | 412 | 2.07e-13 | TAAATGAAAACTCATTTATTTTGCATAAAAATTCAGTGA | argG | b3172 | 3318136 | + | |
argI | 127 | 9.75e-16 | TGAATTAAAATTCAATTTATATGGATGATTATTCATTTG | argI | b4254 | 4478211 | - | |
artJ | 171 | 1.63e-12 | TGAATTTATATGCAATAAACATGATTAAATAATTTAAAT | artJ | b0860 | 900475 | - | |
carAB | 453 | 3.02e-14 | TGAATTAATATGCAAATAAAGTGAGTGAATATTCTCTGG | carA | b0032 | 29150 | + | |
dtpD | 192 | 1.11e-11 | TTATGCAATATGCTGTCTGATTGCATAAATATACATTAG | dtpD | b0709 | 742456 | - | |
xynR | NaN | NaN | NaN | yagI | b0272 | 289062 | - | |
MEME-2 | argA | 151 | 1.64e-13 | GCACCGCCACGACCTGGCTGACCGCAGCCAGACTGACCT | argA | b2818 | 2948741 | + |
argCBH | 505 | 4.24e-22 | GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC | argC,argB,argH | b3958,b3959,b3960 | 4154500 | + | |
argD | NaN | NaN | NaN | argD | b3359 | 3490080 | - | |
argE | 258 | 4.24e-22 | GAATACGCTGATTGTGGGTGCCAGCGGCTACGCTGGCGC | argE | b3957 | 4154747 | - | |
argF | NaN | NaN | NaN | argF | b0273 | 290205 | - | |
argG | NaN | NaN | NaN | argG | b3172 | 3318136 | + | |
argI | NaN | NaN | NaN | argI | b4254 | 4478211 | - | |
artJ | 305 | 1.00e-15 | GCACAGGCACCGTGCTGATGTCTGATGCGACGCTGGCGC | artJ | b0860 | 900475 | - | |
carAB | NaN | NaN | NaN | carA | b0032 | 29150 | + | |
dtpD | NaN | NaN | NaN | dtpD | b0709 | 742456 | - | |
xynR | 301 | 2.76e-13 | GAACACGATGCTCGCCGCCGACTCAAACACCTCGTCCGT | yagI | b0272 | 289062 | - |
cmd
: The original command used to run MEME
[8]:
motifs.cmd
[8]:
'meme tmp/ArgR.fasta -oc tmp/ArgR -dna -mod zoops -p 8 -nmotifs 5 -evt 0.001 -minw 6 -maxw 40 -allw -minsites 3'
file
: Path to MEME output. This is the relative path of the file from the notebook that ranfind_motifs
[9]:
motifs.file
[9]:
'tmp/ArgR/meme.txt'
This MotifInfo
object is automatically stored as a dictionary in the IcaData object. It will persist after saving and re-loading the IcaData object.
[10]:
ica_data.motif_info
[10]:
{'ArgR': <MotifInfo with 2 motifs across 15 sites>}
5.2. Using TOMTOM to compare motifs against external databases
Once you have a motif from MEME, you can use TOMTOM to compare your motif against external databases. The compare_motifs
function makes this process simple.
The MotifInfo
object generated in the find_motifs
function contains the MEME file location, which is the primary input for compare_motifs
.
[17]:
compare_motifs(ica_data.motif_info['ArgR'], verbose=False)
[17]:
motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MEME-1 | argR2 | argR2 | 2.648790e-28 | 1.565440e-25 | 3.020170e-25 | 37 | 2 | - | dpinteract |
1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 4.221120e-16 | 2.494680e-13 | 2.406470e-13 | 26 | 0 | - | SwissRegulon_e_coli |
2 | MEME-1 | argR | argR | 2.456950e-13 | 1.452060e-10 | 9.338110e-11 | 18 | -19 | - | dpinteract |
3 | MEME-1 | ArgR | MX000116 | 3.377890e-07 | 1.996330e-04 | 6.419150e-05 | 14 | -21 | + | prodoric |
4 | MEME-1 | AhrC | MX000043 | 4.871210e-07 | 2.878890e-04 | 6.942730e-05 | 16 | -20 | - | prodoric |
This information is saved in the matches
attribute of the ArgR MotifInfo
. Again, this information will be maintained if you save the file to json.
[13]:
ica_data.motif_info['ArgR'].matches
[13]:
motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MEME-1 | argR2 | argR2 | 2.648790e-28 | 1.565440e-25 | 3.020170e-25 | 37 | 2 | - | dpinteract |
1 | MEME-1 | ArgR_26-5 | ArgR_26-5 | 4.221120e-16 | 2.494680e-13 | 2.406470e-13 | 26 | 0 | - | SwissRegulon_e_coli |
2 | MEME-1 | argR | argR | 2.456950e-13 | 1.452060e-10 | 9.338110e-11 | 18 | -19 | - | dpinteract |
3 | MEME-1 | ArgR | MX000116 | 3.377890e-07 | 1.996330e-04 | 6.419150e-05 | 14 | -21 | + | prodoric |
4 | MEME-1 | AhrC | MX000043 | 4.871210e-07 | 2.878890e-04 | 6.942730e-05 | 16 | -20 | - | prodoric |
By default, compare_motifs
compares your motifs to 5 prokaryotic databases. You can also use the path to a local motif file or limit the search to a single database, namely:
'prodoric'
'collectf'
'dpinteract'
'regtransbase'
'SwissRegulon_e_coli'
To re-run compare_motifs
with different arguments, set force=True
[18]:
compare_motifs(ica_data.motif_info['ArgR'], motif_db = 'prodoric', verbose=False, force=True)
[18]:
motif | target | target_id | pvalue | Evalue | qvalue | overlap | optimal_offset | orientation | database | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MEME-1 | ArgR | MX000116 | 2.901120e-07 | 0.000058 | 0.000055 | 14 | -21 | + | prodoric |
1 | MEME-1 | AhrC | MX000043 | 4.308400e-07 | 0.000087 | 0.000055 | 16 | -20 | - | prodoric |
[ ]: