ProteinQuest user guide

1. Introduction
1.1 With ProteinQuest you can
1.2 ProteinQuest basic version
1.3 ProteinQuest extended version
2. ProteinQuest dictionaries
3. Directions for use
3.1 Simple query
3.2 Advanced query
3.3 Combine dictionary terms with Boolean operators AND, OR, NOT
3.4 Loading a list
3.5 Identify a list of terms
3.6 How to identify extra data
3.7 Highlight data in the documents
3.8 Wizard
3.9 Query limits
3.10 Results bar
3.10.1 Enlarge, Filter, Clipboard, Network
3.11 Results list
3.11.1 Papers, Patents and Clinical Trials
3.12 Results analysis
3.12.1 Excel download of a list of terms
3.12.2 Excel download of a list of PMIDs
3.12.3 Excel download of a list of PMIDs, Year, Title, Authors, Journal, Volume, Pages, Notes
3.13 Graphs download
3.13.1 Heat Map download
3.13.2 Network download
3.13.2.1 Excel
3.13.2.2 NodeXL
3.13.2.3 Cytoscape
4. Tools
4.1 Saved query/Load query
4.2 PMID list
4.3 Network
4.3.1 How to generate a Network
4.3.1.2 Automatic Network selection
4.3.1.3 Select where to collect data
4.3.1.4 Select Nodes
4.3.1.5 Generate Network
4.3.1.6 Black and colored edges: two types of information
4.3.1.7 Advanced Network configuration
4.3.1.8 Set the values of occurrence, co-occurrence and Ef
4.4 The Heat Map
4.4.1 How to generate a Heat Map
4.4.2 How to download a Heat Map
5. ProteinQuest Case Studies

1. Introduction

ProteinQuest is a new platform for biomedical literature retrieval and analysis. It integrates data from scientific literature, data repositories and biological images. ProteinQuest currently holds more than 15 million indexed abstracts, 9 million images, 1.8 million selected patents, 250,000 clinical trials and 10 billion binary relationships.

Literature information can be obtained with two query types: entering free keywords, or building a Boolean query step by step from curated ontologies. ProteinQuest finds relevant information in both article abstracts and image captions, producing more specific and comprehensive search results than other data mining platforms. Query results can be as specific as users require.
ProteinQuest performs an accurate search because it lets you refine the field of interest by selecting specific dictionaries/ontologies such as miRNA, drugs, biological processes, etc. Queries can also be saved and reloaded whenever needed. ProteinQuest can also be used to search patent abstracts and claims, and the resulting information can be analyzed with all the available dictionaries/ontologies.

In addition, ProteinQuest builds complex network models to extend the understanding of your research. Networks generated by ProteinQuest reveal binding relationships between several types of concepts and biological items, as well as between people, institutions, companies, etc.

1.1 With ProteinQuest you can:

- Easily understand and interpret literature information through an innovative graphical layout that highlights key relationships and connections between objects included in several different ontologies
- Mine for biological relationships between proteins/genes experimentally supported by one or more techniques of your choice
- Prioritize target genes for biomarker discovery, drug development and repositioning
- Create powerful, interactive networks connecting genes or proteins to diseases, identify relevant drugs and isolate sub-networks within biological fields
- Retrieve only clinically relevant information at any clinical stage of development
- Examine relevant experiments in the literature and compare your results to what people have already found
- Track down collaborations among people or institutions working on a topic of your choice, identifying the most relevant players in the field

ProteinQuest is available in two versions: basic and extended.

1.2 ProteinQuest basic version

The basic version is the right tool for searching and exploring PubMed papers when you need a quick answer to your query.
With the ProteinQuest basic version you can:

- Retrieve information from abstracts of the entire PubMed collection (more than 15,000,000 records) and captions of all free full-text papers (about 9,000,000 entries)
- Launch queries either to PubMed (simple search) or to our curated internal database (advanced search)
- Disambiguate entities using a semantic approach and a highly sophisticated proprietary technology, reducing the number of false positive results that common data mining tools are unable to discriminate
- Obtain higher accuracy, precision and recall values compared to other tools
- Auto-complete query fields to help ensure an accurate search
- Automatically expand a query that contains a reference term (e.g. a gene symbol) with all its known synonyms, and add disambiguation information for ambiguous terms, so that everything is covered by a single search
- Perform composite queries by inserting a list of terms, such as gene symbols, as search input
- Retrieve clinically relevant information at any clinical stage for drug development purposes
- Track down collaborations among people or institutions on a common topic
- Interrogate the scientific literature using free words or terms selected from 9 different dictionaries/ontologies

The table below highlights the main features of ProteinQuest's basic version.

1.3 ProteinQuest extended version

This is the full version of ProteinQuest.
With the ProteinQuest extended version you can:

- Retrieve relevant information from abstracts of the entire PubMed collection (more than 15,000,000 records), captions of all free full-text papers (about 9,000,000 entries), and both patents and clinical trials (1.8 million selected patents, 250,000 clinical trials)
- Launch queries either to PubMed (simple search) or to our curated internal database (advanced search)
- Disambiguate entities using a semantic approach and a highly sophisticated proprietary reasoner, avoiding the false positive results that common data mining tools are unable to discriminate
- Obtain higher accuracy, precision and recall values compared to other tools
- Auto-complete query fields to help ensure an accurate search
- Automatically expand a query that contains a reference term (e.g. a gene symbol) with all its known synonyms, and add disambiguation information for ambiguous terms, so that everything is covered by a single search
- Perform composite queries by inserting a list of terms, such as gene symbols, as search input
- Retrieve clinically relevant information at any clinical stage for drug development purposes
- Track down collaborations among people or institutions on a common topic
- Integrate PubMed information with patent and clinical trial data
- Interrogate the scientific literature using free words or terms selected from 9 different dictionaries/ontologies
- Process the results using Heat Maps and Networks
- Define pathways, including data on activation, inhibition and binding

The table below highlights the main features of ProteinQuest's extended version.

2. ProteinQuest dictionaries

Dictionary groups: Molecules, Functions, Anatomy, Lab, Source.

Dictionaries: Proteins, miRNA, Drugs, Substances, Protein families, Bio Processes, Disease, Pathways, Body parts, Tissues, Cells, Cell parts, Organisms, Methods.

Source fields:
- Papers: Organizations, Nationality, Study type, Authors, Journals, Year
- Patents: Organizations, Inventors, USP Class, Year
- Trials: Organizations, Nationality, Status, Phase, Year

3. Directions for use

3.1 Simple query

To set a simple query, enter your keywords in the search box. The selected terms are searched, without any text processing, in both the abstracts and the MeSH terms of PubMed papers. Organized information is nevertheless available for each dictionary.

3.2 Advanced query

The selected terms are searched in the abstracts, MeSH terms and captions of PubMed papers, in the abstracts and claims of US patents, and in the summaries of worldwide clinical trials.

3.3 Combine dictionary terms with Boolean operators AND, OR, NOT

After the query has been set, the Boolean operator can be changed from OR to AND, or from AND to NOT, simply by clicking on it.

3.4 Loading a list

You can insert terms one at a time or load a list. The file must be in .txt format.

3.5 Identify a list of terms

For each dictionary, it is possible to extract the specific terms matched in the papers. Different lists can be extracted. For example, you can choose to highlight the miRNAs that have been identified in the results.

3.6 How to identify extra data

To visualize all the miRNAs described together with those in the query, click Enlarge.

3.7 Highlight data in the documents

3.8 Wizard

The wizard button returns the results of an advanced query in a single click. Wizard buttons can also generate networks in a single click.

3.9 Query limits

How to set limits:

3.10 Results bar

For each dictionary it is possible to visualize and export the elements identified in the results. The protein list, for example, reports for each protein the number of documents (abstracts or captions) and the number of images (captions of free full-text papers) in which it was found. Ef is the enrichment factor, which depends on the frequency of the elements in the results.

In this example the first protein in the list is TNFAIP3, as it is the most cited in the results.
The list can also be ordered by number of images or by enrichment factor, simply by clicking on the top of the bar.

3.10.1 Enlarge, Filter, Clipboard, Network

- Enlarge: visualize all the concepts related to the query elements identified in the results.
- Filter: restrict the query with an additional group of elements (those checked in the box).
- Clipboard: select and save all documents and images of interest to the Clipboard. It is also possible to visualize all PMIDs and their corresponding titles; clicking on a title opens the document behind the clipboard, ready to be analyzed. Within the clipboard you can CLEAR (erase) or SAVE the subset of selected documents. The papers can be reloaded by selecting the saved clipboard.
- Network: visualize the biological relationships among the selected terms. Inside ProteinQuest a network of at least 240 nodes can be represented. The choice of nodes relies on the number of documents and the enrichment factor (Ef) of the most connected terms.

3.11 Results list

3.11.1 Papers, Patents and Clinical Trials

The result of a query is a subset of documents.

If you are interested in PubMed publications, select the Papers directory. The title, affiliation, abstract, MeSH terms and open-access figures of the papers will be analyzed.

If you are interested in patents, select the Patents directory. The title, affiliation and claims of the patents will be analyzed.

If you are interested in clinical trials, select the Trials directory. The title, affiliation, summary and eligibility criteria of the clinical trials will be analyzed.

3.12 Results analysis

3.12.1 Excel download of a list of terms

The list of terms of each dictionary identified in the results can be exported to Excel.

3.12.2 Excel download of a list of PMIDs

The list of PMIDs in the results can also be exported from ProteinQuest.
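
Once a PMID list has been exported, it can be turned into direct PubMed links outside ProteinQuest. The following Python sketch is only an illustration: the function name `pubmed_links` and the sample values are hypothetical, and it assumes the PMIDs have already been read from the exported file into a plain list of values.

```python
def pubmed_links(pmids):
    """Return (pmid, url) pairs for unique, well-formed PMIDs, keeping input order."""
    seen = set()
    links = []
    for raw in pmids:
        pmid = str(raw).strip()
        # PMIDs are plain positive integers; skip blanks, header cells and repeats.
        if not pmid.isdigit() or pmid in seen:
            continue
        seen.add(pmid)
        links.append((pmid, f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"))
    return links


# Hypothetical values standing in for an exported PMID column.
exported = ["PMID", "12345678", " 23456789 ", "12345678", ""]
for pmid, url in pubmed_links(exported):
    print(pmid, url)
```

The header cell, the empty cell and the repeated PMID are dropped, so each paper appears once.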

3.12.3 Excel download of a list of PMIDs, Year, Title, Authors, Journal, Volume, Pages, Notes

Note that each PMID is a link for downloading the corresponding paper. Besides the list of terms, you can also export a graph, a heat map or a network.

3.13 Graphs download

Here is a downloaded graph representing the number of documents and images of the most specific biological processes of a query:

3.13.1 Heat Map download

Here is a downloaded Excel file of a Heat Map that represents the methods used for the analysis of the genes identified in a specific query:

3.13.2 Network download

The Network can be downloaded in different formats:

3.13.2.1 Excel

The Excel file shows the main characteristics of the network generated in ProteinQuest: the concept selected (vertex), the occurrence (label), the Ef (tooltip) and the weight (co-occurrence). Here is an example of the Excel file of a network generated from a query:

Vertex  Color (R, G, B)  Shape   Size  Label / Tooltip    Type  Occurrences  Weight
IL6     254, 161, 0      circle  80    61 occ, Ef 167.9   Prot  61           167.86
LMNA    254, 161, 0      circle  80    61 occ, Ef 4160    Prot  61           4160.01
TNF     255, 135, 135    circle  60    31 occ, Ef 46.95   Prot  31           46.9531
IL1B    255, 152, 152    circle  50    16 occ, Ef 71.32   Prot  16           71.3205
NFKB1   255, 156, 156    circle  48    13 occ, Ef 97.91   Prot  13           97.9058
STAT3   255, 156, 156    circle  48    13 occ, Ef 236.7   Prot  13           236.75
IL8     255, 161, 161    circle  46    10 occ, Ef 78.47   Prot  10           78.4673
MAPK8   255, 163, 163    circle  45    9 occ, Ef 80.52    Prot  9            80.5248
RELA    255, 165, 165    circle  45    8 occ, Ef 266.3    Prot  8            266.275
TLR4    255, 165, 165    circle  45    8 occ, Ef 148.4    Prot  8            148.415
CASP3   255, 167, 167    circle  44    7 occ, Ef 35.22    Prot  7            35.2155
CCL2    255, 169, 169    circle  43    6 occ, Ef 70.26    Prot  6            70.2596
COX2    255, 169, 169    circle  43    6 occ, Ef 59.98    Prot  6            59.9779
PTGS2   255, 169, 169    circle  43    6 occ, Ef 60.86    Prot  6            60.8579
MAPK3   255, 171, 171    circle  43    5 occ, Ef 36.28    Prot  5            36.2841

3.13.2.2 NodeXL

Using NodeXL it is possible to edit the network obtained in ProteinQuest and prepare an image of it.

3.13.2.3 Cytoscape

Using Cytoscape it is possible to edit your ProteinQuest network and to analyze it further through its plugins (BiNGO, GeneMANIA, Reactome, NetworkAnalyzer, etc.).

4. Tools

The Tool bar contains several functions:

4.1 Saved query/Load query

You can save your query before you log out from ProteinQuest and reload it in a later session.

4.2 PMID list

You can export the list of PMIDs identified in the results.

4.3 Network

There are two network setting options: automatic and advanced. The automatic selection chooses the interactions by relying on the number of documents and the Ef among the most connected terms.

4.3.1 How to generate a Network

4.3.1.2 Automatic Network selection

For automatic network generation, do not select the advanced configuration option.

4.3.1.3 Select where to collect data

You must select where to collect data: papers, patents or clinical trials. For PubMed papers you must also select whether terms should be collected from abstracts, from images, or from both.

4.3.1.4 Select Nodes

The nodes can be the query terms alone, showing only the interactions among them (restrict to query elements), or can also include terms identified in the results that belong to the same dictionary or to other dictionaries.

Since the selected edges correspond to the documents where different nodes are described together, you must select which interactions to visualize by checking one or more of the proposed options.

4.3.1.5 Generate Network

Here is an example of a network automatically generated by ProteinQuest:

4.3.1.6 Black and colored edges: two types of information

- Black edges link adjacent nodes that are described together in specific papers.
- Colored edges correspond to experimental data describing interactions, inhibitions, expression regulation and enzymatic reactions.

It is possible to select which data to visualize first.
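
To make the meaning of the black edges concrete, the following Python sketch counts, for each pair of terms, the documents in which both appear. It is only an illustration under stated assumptions: the document identifiers, the `docs` mapping, the `cooccurrence_edges` function and its `min_cooccurrence` threshold are all hypothetical, and the procedure ProteinQuest uses internally is not described in this guide.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, min_cooccurrence=1):
    """Count, for every pair of terms, the documents that mention both."""
    counts = Counter()
    for terms in docs.values():
        # Sort so that (A, B) and (B, A) are counted as the same edge.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    # Keep only the pairs that reach the requested threshold.
    return {pair: n for pair, n in counts.items() if n >= min_cooccurrence}


# Hypothetical documents and the terms identified in each of them.
docs = {
    "doc1": ["IL6", "TNF", "STAT3"],
    "doc2": ["IL6", "TNF"],
    "doc3": ["TNF", "NFKB1"],
}
print(cooccurrence_edges(docs, min_cooccurrence=2))  # {('IL6', 'TNF'): 2}
```

With a threshold of 2, only the IL6/TNF pair survives, because it is the only pair found together in two documents.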

Other information related to the network is available in the bibliometric and protein pathway network analyses.

4.3.1.7 Advanced Network configuration

It is also possible to select the advanced configuration to generate the network.

4.3.1.8 Set the values of occurrence, co-occurrence and Ef

In the advanced configuration, set the values of occurrence, co-occurrence and Ef. The only size limits are those set by the user.

Here is an example of a network generated in ProteinQuest and visualized in Cytoscape:

4.4 The Heat Map

4.4.1 How to generate a Heat Map

Heat Maps can be generated by selecting the corresponding button under the "Tools" directory. The Heat Map is a useful tool for exploring biological relationships among specific terms identified in the query results. It shows where two terms are described together in the papers, patents or clinical trials.

Following are the steps necessary to generate a Heat Map.

The resulting Heat Map reports in each cell the number of co-occurrences of two terms in the list of documents retrieved in the results. The red intensity is proportional to the fraction of hits normalized to the total number of hits in each column.

Following is a Heat Map reporting in each box the number of documents where each gene or protein is described in a specific pathological context. The numbers are also linked to the corresponding documents.

4.4.2 How to download a Heat Map

The Heat Map can be exported to Excel for further statistical analysis, such as cluster analysis, Pearson's correlation, etc. These analyses are very useful for identifying, for example, biomarker signatures and other biological information.

5. ProteinQuest Case Studies

[1] S. Polidoro et al., "Effects of bisphosphonate treatment on DNA methylation in osteonecrosis of the jaw," Mutat. Res., vol. 757, no. 2, pp. 104–13, Oct. 2013.
[2] T. Alberio et al., "Parkinson's disease plasma biomarkers: an automated literature analysis followed by experimental validation," J. Proteomics, vol. 90, pp. 107–14, Sep. 2013.
[3] C. Zanini et al., "Medullospheres from DAOY, UW228 and ONS-76 cells: increased stem cell population and proteomic modifications," PLoS One, vol. 8, no. 5, p. e63748, Jan. 2013.
[4] A. Benso et al., "Reducing the complexity of complex gene coexpression networks by coupling multiweighted labeling with topological analysis," Biomed Res. Int., p. 676328, Jan. 2013.
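
As a supplement to sections 4.4.1 and 4.4.2, the Python sketch below illustrates two calculations that can be run on an exported Heat Map: the per-column normalization that the red intensity is described as following, and a Pearson's correlation between two rows. It is a minimal sketch under stated assumptions: the counts, the row and column names and both function names are hypothetical, and the exported file is assumed to have been read into a dictionary of rows beforehand.

```python
import math


def column_fractions(table):
    """Divide every cell by the total of its column (0.0 for an empty column)."""
    columns = list(next(iter(table.values())))
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        name: {c: (row[c] / totals[c] if totals[c] else 0.0) for c in columns}
        for name, row in table.items()
    }


def pearson(xs, ys):
    """Pearson's correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical co-occurrence counts: rows are genes, columns are diseases.
heat_map = {
    "GENE_A": {"disease_1": 6, "disease_2": 1, "disease_3": 2},
    "GENE_B": {"disease_1": 3, "disease_2": 1, "disease_3": 1},
    "GENE_C": {"disease_1": 1, "disease_2": 2, "disease_3": 5},
}
fractions = column_fractions(heat_map)
print(fractions["GENE_A"]["disease_1"])  # share of disease_1 hits due to GENE_A
row_a = list(heat_map["GENE_A"].values())
row_b = list(heat_map["GENE_B"].values())
print(round(pearson(row_a, row_b), 3))
```

Rows with a similar profile across the columns give a correlation close to 1, which is the kind of signal a cluster analysis would group together.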