8. Creating the Gene Table

A copy of this notebook is available on the PyModulon GitHub page: Creating the Gene Table Notebook

First, download the FASTA and GFF files for your organism and its plasmids from NCBI. We will use the E. coli K-12 MG1655 strain as an example, which does not have plasmids.

[2]:

from pymodulon import example_data

8.1. Get information from GFF files

[3]:

from pymodulon.gene_util import *

Enter the location of all your GFF files here:

[4]:

gff_files = [example_data.ecoli_gff]

The following cell will convert all the GFF files into a single Pandas DataFrame for easy manipulation. Pseudogenes have multiple rows in a GFF file (one for each fragment), but only the first fragment will be kept.

[5]:

keep_cols = ['accession','start','end','strand','gene_name','old_locus_tag','gene_product','ncbi_protein']

DF_annot = gff2pandas(gff_files,index='locus_tag')
DF_annot = DF_annot[keep_cols]

DF_annot.head()

WARNING:root:Duplicate locus_tag detected. Dropping duplicates.

[5]:

	accession	start	end	strand	gene_name	old_locus_tag	gene_product	ncbi_protein
locus_tag
b0001	NC_000913.3	190	255	+	thrL	None	thr operon leader peptide	NP_414542.1
b0002	NC_000913.3	337	2799	+	thrA	None	fused aspartate kinase/homoserine dehydrogenase 1	NP_414543.1
b0003	NC_000913.3	2801	3733	+	thrB	None	homoserine kinase	NP_414544.1
b0004	NC_000913.3	3734	5020	+	thrC	None	threonine synthase	NP_414545.1
b0005	NC_000913.3	5234	5530	+	yaaX	None	DUF2502 domain-containing protein YaaX	NP_414546.1

To ensure that the gene index used is identical to the expression matrix, load in your data.

[6]:

DF_log_tpm = example_data.load_example_log_tpm()
DF_log_tpm.head()

[6]:

	ecoli_00001	ecoli_00002	ecoli_00009	ecoli_00010
Geneid
b0001	10.473721	10.271944	10.315476	10.808135
b0002	10.260569	10.368555	10.735874	10.726916
b0003	9.920277	10.044224	10.528432	10.503092
b0004	9.936694	10.010638	9.739519	9.722997
b0005	7.027515	7.237449	6.745798	6.497823

Check that the genes are the same in the expression dataset as in the annotation dataframe. Mismatched genes are listed below.

[7]:

test = DF_annot.sort_index().index == DF_log_tpm.sort_index().index
DF_annot[~test]

[7]:

	accession	start	end	strand	gene_name	old_locus_tag	gene_product	ncbi_protein
locus_tag

8.2. (Optional) KEGG and COGs

8.2.1. Generate nucleotide fasta files for CDS

Enter the location of all your fasta files here:

[8]:

fasta_files = [example_data.ecoli_fasta]

The following code generates CDS files using your FASTA and GFF3 files

[9]:

from Bio import SeqIO

cds_list = []
for fasta in fasta_files:
    seq = SeqIO.read(fasta,'fasta')

    # Get gene information for genes in this fasta file
    df_genes = DF_annot[DF_annot.accession == seq.id]

    for i,row in df_genes.iterrows():
        cds = seq[row.start-1:row.end]
        if row.strand == '-':
            cds = seq[row.start-1:row.end].reverse_complement()
        cds.id = row.name
        cds.description = row.gene_name if pd.notnull(row.gene_name) else row.name
        cds_list.append(cds)

[10]:

cds_list[:5]

[10]:

[SeqRecord(seq=Seq('ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAAC...TGA', SingleLetterAlphabet()), id='b0001', name='NC_000913.3', description='thrL', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTG...TGA', SingleLetterAlphabet()), id='b0002', name='NC_000913.3', description='thrA', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGAT...TAA', SingleLetterAlphabet()), id='b0003', name='NC_000913.3', description='thrB', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCC...TAA', SingleLetterAlphabet()), id='b0004', name='NC_000913.3', description='thrC', dbxrefs=[]),
 SeqRecord(seq=Seq('GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCC...TAA', SingleLetterAlphabet()), id='b0005', name='NC_000913.3', description='yaaX', dbxrefs=[])]

Uncomment the line below to write the CDS file

[11]:

# SeqIO.write(cds_list,'CDS.fna','fasta')

8.2.2. Run EggNOG Mapper

Go to http://eggnog-mapper.embl.de/.
Upload the CDS.fna file from your organism directory (within the sequence_files folder)
Make sure to limit the taxonomy to the correct level
After the job is submitted, you must follow the link in your email to run the job.
Once the job completes (after ~4 hrs), download the annotations file.
Save the annotation file

8.2.3. Get KEGG IDs

Once you have the EggNOG annotations, load the annotation file

[12]:

eggnog_file = example_data.ecoli_eggnog

[13]:

DF_eggnog = pd.read_csv(eggnog_file,sep='\t',skiprows=4,header=None)
eggnog_cols = ['query_name','seed eggNOG ortholog','seed ortholog evalue','seed ortholog score',
               'Predicted taxonomic group','Predicted protein name','Gene Ontology terms',
               'EC number','KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction',
               'KEGG_rclass','BRITE','KEGG_TC','CAZy','BiGG Reaction','tax_scope',
               'eggNOG OGs','bestOG_deprecated','COG','eggNOG free text description']

DF_eggnog.columns = eggnog_cols

# Strip last three rows as they are comments
DF_eggnog = DF_eggnog.iloc[:-3]

# Set locus tag as index
DF_eggnog = DF_eggnog.set_index('query_name')
DF_eggnog.index.name = 'locus_tag'

DF_eggnog.head()

[13]:

	seed eggNOG ortholog	seed ortholog evalue	seed ortholog score	Predicted taxonomic group	Predicted protein name	Gene Ontology terms	EC number	KEGG_orth	KEGG_pathway	KEGG_module	...	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG Reaction	tax_scope	eggNOG OGs	bestOG_deprecated	COG	eggNOG free text description
locus_tag
b0002	316407.85674276	0.000000e+00	1600.5	Escherichia	thrA	GO:0003674,GO:0003824,GO:0004072,GO:0004412,GO...	1.1.1.3,2.7.2.4	ko:K12524	ko00260,ko00261,ko00270,ko00300,ko01100,ko0111...	M00016,M00017,M00018,M00526,M00527	...	RC00002,RC00043,RC00087	ko00000,ko00001,ko00002,ko01000	NaN	NaN	iECDH1ME8569_1439.ECDH1ME8569_0002,iEcDH1_1363...	Escherichia	1MW3H@1224,1RN1G@1236,3XN2A@561,COG0460@1,COG0...	NA\|NA\|NA	E	homoserine dehydrogenase I
b0003	316407.85674277	3.000000e-178	630.9	Escherichia	thrB	GO:0000096,GO:0000097,GO:0003674,GO:0003824,GO...	2.7.1.39	ko:K00872	ko00260,ko01100,ko01110,ko01120,ko01230,map002...	M00018	...	RC00002,RC00017	ko00000,ko00001,ko00002,ko01000	NaN	NaN	iECSE_1348.ECSE_0003	Escherichia	1MW8I@1224,1RMYR@1236,3XN01@561,COG0083@1,COG0...	NA\|NA\|NA	F	Catalyzes the ATP-dependent phosphorylation of...
b0004	316407.21321894	1.200000e-246	858.6	Escherichia	thrC	GO:0003674,GO:0003824,GO:0004795,GO:0005575,GO...	4.2.3.1	ko:K01733	ko00260,ko00750,ko01100,ko01110,ko01120,ko0123...	M00018	...	RC00017,RC00526	ko00000,ko00001,ko00002,ko01000	NaN	NaN	iLF82_1304.LF82_2261,iNRG857_1313.NRG857_00025	Escherichia	1MUWQ@1224,1RQ0H@1236,3XP31@561,COG0498@1,COG0...	NA\|NA\|NA	E	Catalyzes the gamma-elimination of phosphate f...
b0005	316407.21321895	1.400000e-23	115.5	Escherichia	yaaX	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	Escherichia	1N4SV@1224,1S9KR@1236,2B58A@1,32XJS@2,3XRGB@561	NA\|NA\|NA	S	Protein of unknown function (DUF2502)
b0006	198214.SF0006	1.600000e-143	515.4	Gammaproteobacteria	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	Escherichia	1MUAF@1224,1RMTD@1236,COG3022@1,COG3022@2	NA\|NA\|NA	S	Belongs to the UPF0246 family

5 rows × 21 columns

Now we will pull the KEGG information from the eggNOG file, including orthology, pathway, module, and reactions for each gene.

[14]:

DF_kegg = DF_eggnog[['KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction']]

# Melt dataframe
DF_kegg = DF_kegg.reset_index().melt(id_vars='locus_tag')

# Remove null values
DF_kegg = DF_kegg[DF_kegg.value.notnull()]

# Split comma-separated values into their own rows
list2struct = []
for name,row in DF_kegg.iterrows():
    for val in row.value.split(','):
        list2struct.append([row.locus_tag,row.variable,val])

DF_kegg = pd.DataFrame(list2struct,columns=['gene_id','database','kegg_id'])

# Remove ko entries, as only map entries are searchable in KEGG pathway
DF_kegg = DF_kegg[~DF_kegg.kegg_id.str.startswith('ko')]

DF_kegg.head()

[14]:

	gene_id	database	kegg_id
2750	b0002	KEGG_pathway	map00260
2751	b0002	KEGG_pathway	map00261
2752	b0002	KEGG_pathway	map00270
2753	b0002	KEGG_pathway	map00300
2754	b0002	KEGG_pathway	map01100

8.2.4. Save KEGG information

Uncomment the line below to save the KEGG file

[15]:

# DF_kegg.to_csv('kegg_mapping.csv')

8.2.5. Save COGs to annotation dataframe

[16]:

DF_annot['COG'] = DF_eggnog.COG

# Make sure COG only has one entry per gene
DF_annot['COG'] = [item[0] if isinstance(item,str) else item for item in DF_annot['COG']]

8.3. Uniprot ID mapping

The uniprot_id_mapping function is a python wrapper for the Uniprot ID mapping tool. Use input_id=P_REFSEQ_AC if the FASTA/GFFf files are from RefSeq, and input_id=EMBL if the files are from Genbank.

[17]:

mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='P_REFSEQ_AC',output_id='ACC',
                             input_name='ncbi_protein',output_name='uniprot')
mapping_uniprot.head()

[17]:

	ncbi_protein	uniprot
2090	NP_416515.1	A0A0G3HGF9
4381	NP_418733.1	A0A0N9YI43
3651	NP_418014.1	A0A223DQZ5
26	NP_414563.1	A0A385XJ53
1978	NP_416408.1	A0A385XJ53

[18]:

# Merge with current annotation
DF_annot = pd.merge(DF_annot.reset_index(),mapping_uniprot,how='left',on='ncbi_protein')
DF_annot.set_index('locus_tag',inplace=True)
DF_annot.head()

[18]:

	accession	start	end	strand	gene_name	old_locus_tag	gene_product	ncbi_protein	COG	uniprot
locus_tag
b0001	NC_000913.3	190	255	+	thrL	None	thr operon leader peptide	NP_414542.1	NaN	P0AD86
b0002	NC_000913.3	337	2799	+	thrA	None	fused aspartate kinase/homoserine dehydrogenase 1	NP_414543.1	E	P00561
b0003	NC_000913.3	2801	3733	+	thrB	None	homoserine kinase	NP_414544.1	F	P00547
b0004	NC_000913.3	3734	5020	+	thrC	None	threonine synthase	NP_414545.1	E	P00934
b0005	NC_000913.3	5234	5530	+	yaaX	None	DUF2502 domain-containing protein YaaX	NP_414546.1	S	P75616

8.4. Add Biocyc Operon information

To obtain operon information from Biocyc, follow the steps below

Go to Biocyc.org (you may need to create an account and/or login)
Change the organism database to your organism/strain
Select SmartTables -> Special SmartTables
Select “All genes of <organism>”
Select the “Gene Name” column
Under “ADD TRANSFORM COLUMN” select “Genes in same transcription unit”
Select the “Genes in same transcription unit” column
Under “ADD PROPERTY COLUMN” select “Accession-1”
Under OPERATIONS, select “Export” -> “to Spreadsheet File…”
Select “common names” and click “Export smarttable”
Add file location below and run the code cell

[19]:

biocyc_file = example_data.ecoli_biocyc

DF_biocyc = pd.read_csv(biocyc_file,sep='\t')

# Remove genes with no accession
DF_biocyc = DF_biocyc[DF_biocyc['Accession-1'].notnull()]

# Set the accession (i.e. locus tag) as index
DF_biocyc = DF_biocyc.set_index('Accession-1').sort_values('Left-End-Position')

# Only keep genes in the final annotation file
DF_biocyc = DF_biocyc.reindex(DF_annot.index)

# Reformat transcription units
DF_biocyc['operon_list'] = DF_biocyc['Accession-1.1'].apply(reformat_biocyc_tu)

DF_biocyc.head()

[19]:

	Gene Name	Left-End-Position	Right-End-Position	Product	Genes in same transcription unit	Accession-1.1	operon_list
locus_tag
b0001	thrL	190.0	255.0	<i>thr</i> operon leader peptide	thrL // thrA // thrB // thrC	b0001 // b0002 // b0003 // b0004	b0001;b0002;b0003;b0004
b0002	thrA	337.0	2799.0	fused aspartate kinase/homoserine dehydrogenase 1	thrA // thrB // thrC // thrL	b0002 // b0003 // b0004 // b0001	b0001;b0002;b0003;b0004
b0003	thrB	2801.0	3733.0	homoserine kinase	thrA // thrB // thrC // thrL	b0002 // b0003 // b0004 // b0001	b0001;b0002;b0003;b0004
b0004	thrC	3734.0	5020.0	threonine synthase	thrA // thrB // thrC // thrL	b0002 // b0003 // b0004 // b0001	b0001;b0002;b0003;b0004
b0005	yaaX	5234.0	5530.0	DUF2502 domain-containing protein YaaX	yaaX	b0005	b0005

8.4.1. Assign unique IDs to operons

The following code assigns unique names to each operon

[20]:

# Get all operons
operons = DF_biocyc['operon_list'].unique()

# Map each operon to a unique string
operon_dict = {operon: "Op"+str(i) for i, operon in enumerate(operons)}

# Add names to dataframe
DF_biocyc['operon'] = [operon_dict[op] for op in DF_biocyc["operon_list"]]

DF_biocyc.head()

[20]:

	Gene Name	Left-End-Position	Right-End-Position	Product	Genes in same transcription unit	Accession-1.1	operon_list	operon
locus_tag
b0001	thrL	190.0	255.0	<i>thr</i> operon leader peptide	thrL // thrA // thrB // thrC	b0001 // b0002 // b0003 // b0004	b0001;b0002;b0003;b0004	Op0
b0002	thrA	337.0	2799.0	fused aspartate kinase/homoserine dehydrogenase 1	thrA // thrB // thrC // thrL	b0002 // b0003 // b0004 // b0001	b0001;b0002;b0003;b0004	Op0
b0003	thrB	2801.0	3733.0	homoserine kinase	thrA // thrB // thrC // thrL	b0002 // b0003 // b0004 // b0001	b0001;b0002;b0003;b0004	Op0
b0004	thrC	3734.0	5020.0	threonine synthase	thrA // thrB // thrC // thrL	b0002 // b0003 // b0004 // b0001	b0001;b0002;b0003;b0004	Op0
b0005	yaaX	5234.0	5530.0	DUF2502 domain-containing protein YaaX	yaaX	b0005	b0005	Op1

Finally, merge the Biocyc information with the main annotation DataFrame

[21]:

DF_annot['operon'] = DF_biocyc['operon']

8.5. Clean up and save annotation

First, we will re-order the annotation columns

[22]:

if 'old_locus_tag' in DF_annot.columns:
    order = ['gene_name','accession','old_locus_tag','start','end','strand','gene_product','COG','uniprot','operon']
else:
    order = ['gene_name','accession','start','end','strand','gene_product','COG','uniprot','operon']

DF_annot = DF_annot[order]

[23]:

DF_annot.head()

[23]:

	gene_name	accession	old_locus_tag	start	end	strand	gene_product	COG	uniprot	operon
locus_tag
b0001	thrL	NC_000913.3	None	190	255	+	thr operon leader peptide	NaN	P0AD86	Op0
b0002	thrA	NC_000913.3	None	337	2799	+	fused aspartate kinase/homoserine dehydrogenase 1	E	P00561	Op0
b0003	thrB	NC_000913.3	None	2801	3733	+	homoserine kinase	F	P00547	Op0
b0004	thrC	NC_000913.3	None	3734	5020	+	threonine synthase	E	P00934	Op0
b0005	yaaX	NC_000913.3	None	5234	5530	+	DUF2502 domain-containing protein YaaX	S	P75616	Op1

8.6. Final statistics

The following graphs show how much information is available for the organism.

[24]:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('ticks')

[25]:

fig,ax = plt.subplots()
DF_annot.count().plot(kind='bar',ax=ax)
ax.set_ylabel('# of Values',fontsize=18)
ax.tick_params(labelsize=16)

../_images/tutorials_creating_the_gene_table_54_0.png

8.7. Fill missing values

Some organisms are missing gene names, so these will be filled with locus tag gene names.

[26]:

# Fill in missing gene names with locus tag names
DF_annot['tmp_name'] = DF_annot.copy().index.tolist()
DF_annot.gene_name.fillna(DF_annot.tmp_name,inplace=True)
DF_annot.drop('tmp_name',axis=1,inplace=True)

COG letters will also be converted to the full name.

[27]:

# Fill missing COGs with X
DF_annot['COG'].fillna('X',inplace=True)

# Change single letter COG annotation to full description
DF_annot['COG'] = DF_annot.COG.apply(cog2str)

counts = DF_annot.COG.value_counts()
plt.pie(counts.values,labels=counts.index);

../_images/tutorials_creating_the_gene_table_59_0.png

Uncomment the following line to save the gene annotation dataset

[28]:

# DF_annot.to_csv('gene_info.csv')

8.8. GO Annotations

To start, download the GO Annotations for your organism from AmiGO 2

Go to AmiGO 2
Filter for your organism
Click CustomDL
Drag GO class (direct) to the end of your Selected Fields
Enter the location of your GO annotation file below and run the following code block

[29]:

go_file = example_data.ecoli_go_example

[30]:

DF_GO = pd.read_csv(go_file,sep='\t',header=None,usecols=[2,10,17])
DF_GO.columns = ['gene_name','gene_id','gene_ontology']
DF_GO.gene_id.fillna(DF_GO.gene_name,inplace=True)
DF_GO = DF_GO[['gene_id','gene_ontology']]
DF_GO.head()

[30]:

	gene_id	gene_ontology
0	b0312\|ECK0310	betaine-aldehyde dehydrogenase activity
1	b0312\|ECK0310	betaine-aldehyde dehydrogenase activity
2	b0312\|ECK0310	betaine-aldehyde dehydrogenase activity
3	b0312\|ECK0310	betaine-aldehyde dehydrogenase activity
4	b0312\|ECK0310	response to osmotic stress

Take a look at the gene_id column: 1. Make sure there are no null entries 2. Check which naming convention is used (locus tag, new locus tag, or gene name)

If it looks like it uses the old locus tag, set old_locus_tag to True.

In this example, the gene_id column needs to be altered slightly to match with the gene name. This may not be necessary for your organism.

[31]:

def format_go_gene_id(gene_id):
    locus_tags = [item for item in gene_id.split('|') if re.match('b\d{4}',item)]
    if len(locus_tags) > 0:
        return locus_tags[0]
    else:
        return None

DF_GO.gene_id = DF_GO.gene_id.apply(format_go_gene_id)

Now we remove null entries

[32]:

DF_GO = DF_GO[DF_GO.gene_id.notnull()]

[33]:

naming = 'locus_tag' # Can be "gene_name" or "old_locus_tag"

if naming != 'locus_tag':
    convert_tags = {value:key for key,value in DF_annot[naming].items()}
    DF_GO.gene_id = DF_GO.gene_id.apply(lambda x: convert_tags[x])

[34]:

DF_GO.head()

[34]:

	gene_id	gene_ontology
0	b0312	betaine-aldehyde dehydrogenase activity
1	b0312	betaine-aldehyde dehydrogenase activity
2	b0312	betaine-aldehyde dehydrogenase activity
3	b0312	betaine-aldehyde dehydrogenase activity
4	b0312	response to osmotic stress

Uncomment the line below to save the annotations

[35]:

# DF_GO[['gene_id','gene_ontology']].to_csv('GO_annotations.csv')