© 2019 Oxford University Press
The LIPID MAPS Proteome Database (LMPD) is an object-relational database of lipid-associated protein sequences and annotations. The initial release contains 2959 records, representing human and mouse proteins involved in lipid metabolism. UniProt IDs were obtained by keyword search of the KEGG and GO databases, and this LMPD protein list was then enhanced with annotations from UniProt, EntrezGene, ENZYME, GO, KEGG and other public resources. We also assigned associations with general lipid categories, based on GO and KEGG annotations. Users may search LMPD by database ID or keyword, and filter by species and/or lipid class associations; from the search results, one can then access a compilation of data relevant to each protein of interest, cross-linked to external databases. LMPD is publicly available from the LIPID MAPS Consortium website (http://www.lipidmaps.org/). The direct URL is http://www.lipidmaps.org/data/proteome/index.cgi.
Lipids play central roles in energy storage, cell membrane structure, cellular communication and regulation of biological processes such as inflammatory response, neuronal signal transmission and carbohydrate metabolism. They are furthermore known to be involved in many disease states, including Alzheimer’s, asthma, cancer, malaria and rheumatoid arthritis.
The LIPID Metabolites and Pathways Strategy (LIPID MAPS) Consortium represents a multi-institutional effort to develop a detailed understanding of lipid structure and function. As part of this effort, we will develop ‘parts lists’ of lipid metabolites and assemble these into metabolic networks. These networks will then provide an infrastructure for subsequent modeling using quantitative data from LIPID MAPS experiments. LMPD embodies the protein, gene and pathway parts lists for these networks.
Existing lipid databases include LIPIDAT (1) (http://www.lipidat.chemistry.ohio-state.edu/), LIPID BANK for Web (http://www.lipidbank.jp/) and LIPIDBASE (http://www.lipidbase.jp/category.html). They are primarily organized around the properties and structures of lipids, while LMPD is focused on proteins and genes associated with all lipids. The Arabidopsis Lipid Gene Database focuses on Arabidopsis thaliana and includes genes and proteins that are involved in acyl-lipid metabolism (2).
Identifying lipid-associated proteins
A new classification system has recently been developed for lipids (3). This classification system organizes lipids into eight main classes, with two levels of subclasses. The top-level lipid categories are fatty acyls, glycerolipids, glycerophospholipids, sphingolipids, sterol lipids, prenol lipids, saccharolipids and polyketides. A list of lipid-related GO (Gene Ontology, http://www.geneontology.org) (4) terms and KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.ad.jp/kegg) (5) pathways was compiled, using lipid-specific keywords, such as trivial names of classes, subclasses and individual lipid compounds. The UniProt (6) proteins annotated with those GO and KEGG terms were then collected. The GO terms identify lipid-related enzymatic activity and metabolic processes; KEGG terms identify lipid-related pathways. The proteins are associated with one or multiple lipid classes based on these GO/KEGG annotations. By this process we have identified ∼1600 human proteins and ∼1300 mouse proteins in UniProt. About 2500 out of the 2900 proteins are associated with at least one of the eight main lipid classes. An overview of the bioinformatics process is illustrated in Figure 1.
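The keyword-driven selection and category assignment described above can be illustrated with a minimal sketch. It is not the pipeline used to build LMPD: the keyword lists, the term IDs and the function names are hypothetical, and only three of the eight categories are shown. Proteins whose terms match no category keyword are kept with an empty set, mirroring the fact that some lipid-related proteins are not associated with any main lipid class.

```python
# Hypothetical keyword lists per lipid category (illustrative only;
# the actual LMPD keyword lists are not given in the text).
CATEGORY_KEYWORDS = {
    "fatty acyls": ["fatty acid", "eicosanoid"],
    "sphingolipids": ["sphingolipid", "ceramide"],
    "sterol lipids": ["sterol", "cholesterol"],
}


def categories_for_term(term_name):
    """Return the lipid categories whose keywords occur in a term name."""
    name = term_name.lower()
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in name for keyword in keywords)
    }


def assign_lipid_categories(term_names, protein_terms):
    """Map each protein to the union of the categories of its terms.

    term_names: dict of term ID -> term name (GO term or KEGG pathway)
    protein_terms: dict of protein accession -> iterable of term IDs
    """
    term_categories = {
        term_id: categories_for_term(name)
        for term_id, name in term_names.items()
    }
    assignments = {}
    for accession, term_ids in protein_terms.items():
        categories = set()
        for term_id in term_ids:
            categories |= term_categories.get(term_id, set())
        assignments[accession] = categories  # may be empty
    return assignments
```

Substring matching of this kind is deliberately permissive (for example, 'sterol' also matches 'cholesterol'), so a real keyword list would need curation to limit false positives.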
Protein and gene annotation
Protein annotations were primarily obtained from UniProt (7) and include UniProt accession, UniProt entry name, domain information and Swiss-Prot (8) comments such as function, catalytic activity, subcellular location and similarity. Accession numbers were used to link to various public databases and collect available information for each protein. From NCBI EntrezGene (8) we collected gene information, such as Gene ID, alternate names/synonyms and symbols, chromosomal mapping and cross-references to other databases.
The 2959 proteins comprising the LMPD protein list correspond to ∼2300 unique genes. We also gathered all the GO and KEGG annotations, not just the lipid-specific ones, using Gene IDs and UniProt accessions. Other protein records with sequences that are (i) identical to the ones in our list, (ii) splice variants of those or (iii) related (from the same gene/locus) were gathered using EntrezGene and an in-house non-redundant protein sequence database compiled from well-known public protein sequence databases such as Swiss-Prot, TrEMBL and GenBank (9).
EC numbers were obtained from KEGG ENZYME and UniProt ENZYME. A single protein may be associated with multiple EC numbers and multiple proteins may be associated with the same EC number. EC numbers were then used to obtain information such as enzyme name and synonyms, reaction, substrate(s) and product(s), from ExPASy ENZYME (10) (Enzyme nomenclature database) and KEGG ENZYME (11) database.
Proteins/genes that have associated KEGG annotations or EC numbers are hyperlinked/mapped to KEGG metabolic pathways. Future work will include manual mapping of those proteins to more detailed/specific lipid metabolic pathway maps such as SphinGOMAP (http://www.sphingomap.com) (12) and to signaling pathways.
DATABASE IMPLEMENTATION AND USER INTERFACES
LMPD is implemented as an object-relational database, using Oracle9i Enterprise Edition Release 188.8.131.52.0, running on a Sun Fire 880. Perl scripts and Oracle SQL*Loader were used to parse and load flat-file data into Oracle database tables. The LMPD graphical user interface (GUI) is based on Perl, and is served by the Apache 1.3.26 web server, running on a Sun Ultra-80. Both Sun machines are running Solaris 9.
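As an illustration of the parse-and-load step, the following minimal sketch reads tab-delimited records and inserts them into a relational table. It is not the LMPD loader: it uses Python and the standard-library sqlite3 module as a stand-in for Perl, SQL*Loader and Oracle, and the three-column file layout, the table name and the sample records are assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_proteins(flat_file, conn):
    """Parse tab-delimited protein records and insert them into a table.

    Assumed (hypothetical) layout: accession, entry name, species.
    Returns the number of rows loaded.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein ("
        "accession TEXT PRIMARY KEY, entry_name TEXT, species TEXT)"
    )
    reader = csv.reader(flat_file, delimiter="\t")
    # Keep only well-formed records; malformed lines are skipped.
    rows = [tuple(fields) for fields in reader if len(fields) == 3]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO protein VALUES (?, ?, ?)", rows
        )
    return len(rows)


# Made-up sample data standing in for a flat file on disk.
sample = io.StringIO(
    "ACC001\tEXAMPLE1_HUMAN\tHomo sapiens\n"
    "ACC002\tEXAMPLE2_MOUSE\tMus musculus\n"
    "bad line\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_proteins(sample, conn)
```

Loading all rows inside a single transaction means a failed load leaves the table unchanged, which is convenient when annotations are refreshed periodically.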
The ‘Advanced’ query form provides options for conducting a more focused search, including options to search by database ID or keyword and to filter by species and/or lipid class association. Database ID fields searched include UniProt accession, UniProt entry name, gene symbols, GenBank GI, EC number, GO ID and KEGG pathway ID. Keyword search fields include UniProt description and Swiss-Prot comments.
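The search logic of such a form can be sketched as follows, assuming an in-memory list of records rather than the actual LMPD schema. The record fields, the function name and the matching rules (exact match on ID fields, case-insensitive substring match on the description) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """Hypothetical, simplified stand-in for an LMPD record."""
    lmpd_id: str
    accession: str
    species: str
    description: str
    lipid_categories: set = field(default_factory=set)


def search(records, term, species=None, lipid_category=None):
    """Return records matching `term`, with optional filters.

    `term` matches if it equals an ID field or occurs in the
    description (both case-insensitive). Results are sorted by LMPD ID.
    """
    needle = term.lower()
    hits = []
    for rec in records:
        id_match = needle in (rec.lmpd_id.lower(), rec.accession.lower())
        keyword_match = needle in rec.description.lower()
        if not (id_match or keyword_match):
            continue
        if species is not None and rec.species != species:
            continue
        if lipid_category is not None and lipid_category not in rec.lipid_categories:
            continue
        hits.append(rec)
    return sorted(hits, key=lambda rec: rec.lmpd_id)
```

For example, `search(records, "sterol", species="Homo sapiens")` would return only the human records whose description mentions 'sterol'.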
The results summary page presents a sortable list of proteins matching the query criteria, along with selected summary information, including LMPD_ID, accession, protein name, protein symbol and associated lipid categories. From the summary page, the user may display complete LMPD annotations for each protein.
Annotations are organized by category: Record Overview, Gene/GO/KEGG Information, UniProt Annotations, and Related Proteins. The record overview contains LMPD_ID, species, description, gene symbols, lipid categories, EC number, molecular weight, sequence length and protein sequence. The Gene/GO/KEGG information includes Entrez Gene ID, chromosome, map location, primary name, primary symbol, and alternate names and symbols, together with Gene Ontology (GO) IDs and descriptions and KEGG pathway IDs and descriptions. UniProt annotations include primary accession number, entry name and comments such as catalytic activity, enzyme regulation, function and similarity. For related proteins and splice variants, we display source database, database ID, sequence length, and title.
The initial release of LMPD establishes a framework for creating a lipid-associated protein list, collecting relevant annotations, databasing this information and providing a user interface. This initial release includes data collected from mouse and human taxonomies.
We have chosen to gather lip >
We will also add other species, such as Escherichia coli and Saccharomyces cerevisiae, in subsequent releases. We intend to have two major releases a year, and we will update the annotation of existing records every couple of months.
Future work will also include integration of LMPD with the LIPID MAPS Lipid Structure database. This process will involve expert help from the LIPID MAPS Consortium and will be done mostly manually. For the LMPD proteins that have EC numbers, a semi-automatic step will use the enzyme annotation from the KEGG database to find the substrates and products that are lipids, and then map those lipids to the LIPID MAPS Lipid Structure database based on name and, if available, structure matching. The expert knowledge and manual work will also help with the lipid class and subclass assignment of proteins. This mapping of proteins to lipid classes and subclasses will permit users to search for proteins and then access data from corresponding lipids, and vice versa.
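The name-matching part of this planned mapping might look like the sketch below. It is an illustration under stated assumptions, not a description of implemented LMPD functionality: the input dictionaries, the identifiers and the normalization rule (case-folding and whitespace collapsing) are hypothetical, and structure matching is not attempted. Compounds without a name match are returned separately so that they can be passed on for manual review.

```python
def normalize(name):
    """Case-fold a compound name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def map_compounds_to_lipids(ec_compounds, lipid_names):
    """Match enzyme substrates/products to lipid records by name.

    ec_compounds: dict of EC number -> list of substrate/product names
    lipid_names: dict of lipid structure ID -> list of names/synonyms
    Returns (matched, unmatched):
      matched: dict of EC number -> set of lipid structure IDs
      unmatched: dict of EC number -> list of names with no match
    """
    # Index every known name or synonym by its normalized form.
    index = {}
    for lipid_id, names in lipid_names.items():
        for name in names:
            index.setdefault(normalize(name), lipid_id)

    matched, unmatched = {}, {}
    for ec_number, compounds in ec_compounds.items():
        for compound in compounds:
            lipid_id = index.get(normalize(compound))
            if lipid_id is not None:
                matched.setdefault(ec_number, set()).add(lipid_id)
            else:
                unmatched.setdefault(ec_number, []).append(compound)
    return matched, unmatched
```

Exact matching on normalized names is conservative; non-lipid compounds and lipids listed under unrecognized synonyms both end up in the unmatched set, which is why expert curation remains part of the process.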
We ultimately aim to develop lipid interaction networks that will integrate lipid metabolic pathways and signaling networks, and tools for exploring these networks. These networks and tools, coupled with LIPID MAPS experimental data, may provide insight into the biological processes underlying lipid-involved disease processes and lead to the identification of potential drug targets.