Clustering Challenges In Biological Networks


Cluster analysis for gene expression data: a survey

Abstract: DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics.

However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups.

A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms aimed specifically at gene expression data have recently been proposed.

Bold entries correspond to the best values obtained for each measure on each network. Many of the remaining seven algorithms can cluster the smaller networks well, and in some cases may outperform the three approaches here; however, their runtime or memory requirements limit their applicability.

We build 10 synthetic test networks for each pairwise combination of 10 edge addition rates and 10 edge deletion rates. The averaged Separation and Accuracy measures (Brohee and van Helden) for each addition and deletion rate are shown in Figure 4 (see also Supplementary Table 2). For low interaction insertion and deletion rates, the methods perform comparably. Consistent with this, we find that SPICi is quite robust to perturbations of confidence values in both the STRING human and yeast networks, with a steady but relatively modest decrease in average Jaccard, PR and semantic density values as increasing amounts of noise are added (Supplementary Table 3).
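The perturbation experiment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `perturb_network`, the interpretation of the deletion rate as a per-edge drop probability, and the interpretation of the addition rate as a fraction of the original edge count are all hypothetical choices, not the procedure used in the study. The graph is assumed to be simple and undirected.

```python
import random


def perturb_network(edges, nodes, add_rate, del_rate, seed=0):
    """Return a noisy copy of an undirected edge set (hypothetical helper).

    Assumption: each original edge is dropped with probability `del_rate`,
    and round(add_rate * number_of_original_edges) random non-edges are added.
    """
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    # Sort before drawing random numbers so results are reproducible.
    kept = {e for e in sorted(original, key=sorted) if rng.random() >= del_rate}

    nodes = sorted(nodes)
    n_add = round(add_rate * len(original))
    # Never ask for more new edges than there are non-edges.
    max_new = len(nodes) * (len(nodes) - 1) // 2 - len(original)
    n_add = min(n_add, max_new)

    added = set()
    while len(added) < n_add:
        u, v = rng.sample(nodes, 2)
        e = frozenset((u, v))
        if e not in original:
            added.add(e)
    return kept | added


# Example: one noisy replicate of a small path graph.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
noisy = perturb_network(toy_edges, "abcde", add_rate=0.5, del_rate=0.25, seed=1)
```

A grid of replicates, such as several networks per combination of addition and deletion rates, would follow by calling the function in nested loops with a different `seed` for each replicate.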

Ten edge deletion and insertion rates are considered from 0. The x-axis gives the random edge deletion rate, and the y-axis gives the noisy edge addition rate. Each cell corresponds to a single insertion and deletion rate combination. Values greater than 0, shown in red, indicate that SPICi is better.

Similarly, values smaller than 0, shown in green, indicate that MCL is better. While the Bayesian human network from Huttenhower et al. Each of these networks corresponds to one of specific BPs. Here, we show the type of analysis SPICi enables by its fast clustering approach—analysis that would not be possible by the previous approaches. In particular, we utilize SPICi to uncover context-specific modules from these context-specific networks.

We select context-specific modules utilizing the following criteria. For each candidate cluster, we require that: no uncovered cluster from any other context-specific network overlaps more than half of its proteins. By applying these three criteria, we attempt to uncover modules that are unique to a certain context.

In total, clusters passed these criteria. As an example, we look at one such cluster, found in the response to inorganic substance network Fig.


There are 10 proteins in this cluster. This cluster has very limited overlap (at most two proteins) with clusters found in the other networks. It is annotated with GO terms such as GTP binding and transcription factor binding, but has no known annotations related to response to metal ion or transport. This uncovered cluster reveals DRG1's potential role in metal ion response and transport.

A context-specific module found in the response to inorganic substance network of Huttenhower et al. Gray colored nodes correspond to proteins with the GO annotation Transport. Double peripheral ellipse nodes correspond to proteins with GO annotation response to metal ion.


The DRG1 protein, shown in a box node, is discussed further in the text. We have developed a fast, memory-efficient clustering algorithm, SPICi. SPICi is significantly faster than previous clustering algorithms for biological networks, and importantly, enables us to cluster larger networks than previously possible.

Moreover, we have demonstrated via several analyses that the clusters uncovered by SPICi are of comparable quality to those found by other state-of-the-art algorithms. In our experience, SPICi is especially well-suited for dense networks, such as functional networks. Within sparser networks, we have found that SPICi also readily identifies dense regions, but for reasonable parameter settings will conservatively leave many proteins unclustered.

We have shown that SPICi can be effectively run on hundreds of large human context-specific networks in order to find context-specific modules. In the future, we foresee using SPICi to perform other types of comparative interactomics. For example, protein interaction networks for a single organism can be modified to incorporate information about each protein's tissue- or condition-specific expression, and comparing clusterings across these networks can help to identify modules that are either conserved across numerous conditions or specific to certain conditions.

Given the large number of expression datasets, this leads to the possibility of hundreds or even thousands of varying networks across a single organism. SPICi's runtime and memory efficiency enables these new types of analyses, and should be particularly useful as biological networks continue to grow in size and number. We thank Curtis Huttenhower and Olga Troyanskaya for providing their human context-specific networks. We thank members of the Singh group, especially Tao Yue and Jimin Song, for helpful discussions on our approach and for comments on our manuscript.

Mona Singh. Associate Editor: John Quackenbush.

Abstract Motivation: Clustering algorithms play an important role in the analysis of biological networks, and can be used to uncover functional modules and obtain hints about cellular organization.

Table 1. Given a weighted network, the goal of our algorithm is to output a set of disjoint dense subgraphs. Our approach utilizes several measures.


Semantic density: for each cluster, the average semantic similarity between each pair of annotated proteins within it is computed. In particular, for proteins p1 and p2 with annotations A_p1 and A_p2 respectively, the semantic similarity of their GO annotations is defined as:

Table 2. Table 3.

Development and implementation of an algorithm for detection of protein complexes in large interaction networks.
Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium.

An automated method for finding molecular complexes in large protein interaction networks.
Evaluation of clustering algorithms for protein-protein interaction networks.


Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network.
Detecting functional modules in the yeast protein-protein interaction network.
Fibonacci heaps and their uses in improved network optimization algorithms.
Enumeration of condition-dependent dense modules in protein interaction networks.
ArrayProspector: a web resource of functional associations inferred from microarray expression data.

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space.
Uncovering the overlapping community structure of complex networks in nature and society.

Many of the phosphorylations involved substrates that operate in a known pathway of the kinase; however, several validated substrates function in different cellular processes from those known for the kinase, thereby revealing new functions for the protein kinases.

The wealth of biochemical data generated in the past century, when combined with genome sequences, allows the construction of metabolic networks.

The metabolic network usually focuses on the mass flow in basic chemical pathways that generate essential components such as amino acids, sugars, and lipids, and the energy required by the biochemical reactions. As such, these networks typically present both protein and metabolite information. Literature curation and genome annotation have elucidated many complex biochemical pathways (Kanehisa and Goto; Overbeek et al.). Interactions in metabolic networks are closely related to gene functions, and therefore have great potential for immediate applications in the interpretation of gene roles.

Considerable attention has been focused on the network dynamics using constraint-based analyses such as flux balance analysis (FBA), which assumes that all metabolites are at steady state and that the organism optimizes its metabolite fluxes to maximize biomass production (Segre et al.). This approach has led to many successful predictions.
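The two assumptions of FBA stated above, steady state and biomass maximization, are commonly written as a linear program. The notation below is the conventional one and is an assumption here rather than something taken from the text: S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, c selects the biomass reaction, and the bounds encode reversibility and uptake limits.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\min}_{j} \le v_{j} \le v^{\max}_{j} \quad \text{for every reaction } j .
\end{aligned}
```

The constraint S v = 0 is the steady-state assumption: for every metabolite, total production equals total consumption. In this formulation a gene deletion can be simulated by forcing the bounds of the affected reactions to zero and solving the program again.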

In addition, a flux model of a yeast metabolic network was able to explain enzyme dispensability; that is, how loss-of-function mutations of many yeast enzymes result in viable strains (Papp et al.). This model suggested that the majority of nonessential enzymes are vital for cell growth under certain previously untested conditions, whereas only a small subset are compensated by isoenzymes or parallel pathways. Other successful constraint-based analyses in metabolic networks have also been performed.

These include (1) re-engineering micro-organisms with gene deletions for the purpose of manipulating their chemical products (Burgard et al.). Additional examples of constraint-based analysis can be found in a detailed review (Price et al.). Although many metabolic network studies were developed in micro-organisms and S. These studies may also shed light on other organisms since the fundamental network structures may be conserved in evolution. Topological analysis of metabolic networks in 43 organisms covering all three life domains revealed highly similar topological properties, although great diversity exists among individual pathways and components (Jeong et al.).

Combining mutations in two different genes can either synergistically reduce or enhance the growth or fitness of an organism, relative to organisms containing individual mutations. For many species, if not most, the majority of genes are not lethal when mutated individually; this is likely either because of genetic redundancy or because the affected genes normally enhance the fitness of the organism rather than being essential for its viability. When mutations are combined in the same strain to produce a phenotype stronger than that caused by an individual mutation, the mutated genes are often thought to reside in parallel redundant pathways, although other interpretations are possible.
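One common way to make "stronger than expected" quantitative is an interaction score measured against a multiplicative null model. This convention is an assumption introduced here for illustration and is not stated in the text: W_a and W_b denote the fitness of the single mutants relative to wild type, and W_ab the fitness of the double mutant.

```latex
\varepsilon_{ab} = W_{ab} - W_{a}\,W_{b}
```

Under this convention a negative score means the double mutant is less fit than expected from the single mutants (a synergistic interaction, with synthetic lethality as the extreme case where W_ab is close to 0), and a positive score means it is fitter than expected.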

Regardless of the reason, the ability to combine mutations to produce strong phenotypes makes it possible to carry out synthetic lethal analysis on a large scale, which provides a wealth of useful information. Large-scale synthetic lethal screens have been performed in S. Genetic interaction screens using either plate (SGA) or microarray (dSLAM) readouts with yeast strains containing mutations in nonessential genes have been used to systematically uncover synthetic lethal interactions (Tong et al.).

One recent study that combined genetic interactions from high-throughput methods and a literature curation of 53, publications in PubMed produced an S. For essential genes, strains containing conditional mutations such as those that confer a temperature-sensitive growth defect or with the gene under the control of a tetracycline titratable promoter can be analyzed under conditions that reduce, but do not eliminate, the activity of the gene product Davierwala et al.

Analysis of these interactions has also revealed functional relationships between genes and a high correlation with other properties, such as mutant phenotypes and cellular localization, thus helping to assign biological roles to unknown genes and infer novel functions for annotated genes. In addition to synthetic lethal screens, other types of genetic interactions can be measured. These include combining mutations that disrupt inhibitory interactions and thus enhance growth. In fact, interactions that either enhance or reduce growth when combined have been investigated to generate a detailed genetic interaction map, E-MAPs (for epistatic miniarray profiles), for genes involved in the yeast early secretory pathway (Schuldiner et al.).

Another type of genetic interaction is a synthetic dosage lethal screen in which overexpressed genes are introduced into a mutant strain background; synthetic dosage lethality can provide additional, and often nonoverlapping interaction data to those found by combining inactivating mutations Measday et al. For example, overexpression of genes that inhibit growth in a mutant strain background has been used to screen for genes that would negatively regulate protein kinase substrates Sopko et al.

Finally, a conceptually similar approach to synthetic lethality is to screen for mutant strains that are hypersensitive to inhibitory small molecules. Thus far, screens have been performed between inhibitory chemical compounds and deletion mutants of all yeast nonessential genes or strains heterozygous for mutations in essential genes (Giaever et al.). Such chemical genetic interactions, when integrated with genetic interactions, often suggest pathways targeted by the drugs as well as potential direct drug targets. Thus, this approach offers a powerful tool for deciphering the mechanisms of action of drugs as well as defining suitable biological pathways that can be targeted for inhibition.

A coexpression network, in which genes are connected if their transcripts are coregulated, was assembled in S. Proteins that share other properties, such as biological processes Tari et al. The coexpression and homolog networks differ from the other networks described above in that the interactions are based on similarities not related to gene function.

Nonetheless, they can still be investigated with similar approaches and often exhibit comparable network topology. Therefore studies on these networks may also discover novel protein roles and help to decipher the complex cellular networks, especially when integrated with other biological networks.


Interactions are often assembled into network maps composed of proteins or genes (termed vertices or nodes) and connections between them (defined as edges in undirected networks or arcs in directed networks). The directionality of a network depends on the characteristics of the biological data. Protein-protein and genetic interactions are usually represented with an undirected network, whereas transcription factor binding, phosphorylation, and metabolic networks have directionality built into their interactions.

One feature of nearly all of the interaction studies is that the strength of interactions can vary considerably. Such quantitative information, however, is rarely used in most network analyses, and interactions are usually reported as binary measurements. Future studies are likely to overcome these limitations as more accurate measurements are obtained, and weighted values can be assigned to network connections as indicators of the interaction strength. Network topology plays a vital role in understanding network architecture and performance.

Several of the most important and commonly used topological features include degree, clustering coefficient, shortest path length, and betweenness (Fig.). Detailed descriptions of each of these statistics are as follows: (1) Degree: The number of links connected to one vertex is defined as its degree. In an interaction network, the maximum distance between any two nodes is termed the graph diameter. The average distance and diameter of a network measure the approximate distance between vertices in a network.
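The measures named above are straightforward to compute on a small undirected graph. The following is a minimal sketch, assuming the network is stored as a dictionary that maps each node to the set of its neighbors; the function names and the toy graph are illustrative only.

```python
from collections import deque


def degree(adj, v):
    """Number of links attached to node v."""
    return len(adj[v])


def clustering_coefficient(adj, v):
    """Fraction of possible links among the neighbors of v that exist."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def distances_from(adj, source):
    """Shortest path length from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter(adj):
    """Largest shortest-path distance over all reachable node pairs."""
    return max(max(distances_from(adj, v).values()) for v in adj)


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
```

On the toy graph, node C has degree 3 and only one of the three possible links among its neighbors exists (A-B), so its clustering coefficient is 1/3; the longest shortest path runs from D to A or B, giving a diameter of 2.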

Many real-world networks such as metabolic networks have a small world architecture (Watts and Strogatz), which may serve to minimize transition times between metabolic states (Wagner and Fell). A high clustering coefficient for a network is another indicator of a small world. Betweenness estimates the traffic load through one node or link, assuming that information flows over a network primarily along the shortest available paths. Topological parameters. Five commonly used topological parameters are illustrated in both graphs and formulae.

(A) Degree measures the number of connections one node has. (B) Distance is the length of the shortest path between two nodes. (C) Diameter is the maximum distance between any two nodes in a network. (D) Clustering coefficient measures the percentage of existing links among the neighborhood of one node. (E) Betweenness is the fraction of those shortest paths between all pairs of vertices that pass through one vertex or link. All graphs and formulae are based on an undirected network.

This organization was originally discovered in World Wide Web interactions and later found to exist in four of the types of networks described above: protein-protein interactions, transcription factor binding, metabolic, and genetic data sets (Barabasi and Albert; Jeong et al.).

Below we demonstrate that this is also the case for the phosphorylation network. Topological comparison between a random network and a scale-free network. Degree distribution in random networks is bell-shaped. The scale-free network has more high-degree nodes and a power-law degree distribution, which leads to a straight line when plotting the total number of nodes with a particular degree versus that degree on log-log scales.
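The log-log relationship described above can be checked numerically. The sketch below tabulates how many nodes have each degree and fits a straight line to the log-transformed counts by ordinary least squares. The function names are illustrative, and the fit is only a rough diagnostic under the assumption that a straight line on log-log axes is what one wants to measure; it is not a rigorous test of whether a network is scale-free.

```python
import math
from collections import Counter


def degree_histogram(adj):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(nbrs) for nbrs in adj.values())


def loglog_slope(hist):
    """Least-squares slope of log10(count) against log10(degree).

    For a power law count ~ k**(-gamma) the slope is close to -gamma.
    Nodes of degree 0 are skipped because log10(0) is undefined.
    """
    pts = [
        (math.log10(k), math.log10(n))
        for k, n in hist.items()
        if k > 0 and n > 0
    ]
    if len(pts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    if sxx == 0:
        raise ValueError("all degrees are identical")
    return sxy / sxx
```

For a random network with a bell-shaped degree distribution, the points bend away from a straight line on log-log axes, so a single slope describes them poorly.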

Hub components in a scale-free network are extremely important and therefore usually play essential roles in biological systems. In the yeast protein—protein interaction networks, hubs are more likely to be essential and conserved relative to nonhub proteins Jeong et al.

Presumably much of the regulation in a network occurs and is mediated through such proteins. Likewise, key components whose activation is sufficient to induce a cellular process (master regulator genes) have been shown to be regulated by many other components and are thus target hubs; these often lie downstream in the process (Weintraub et al.). Not all components within a regulatory pathway serve as master regulators, probably because noise introduced into the system may inappropriately activate the process at undesired times.

Presumably, components that lie within a network are buffered through both positive and negative regulatory contacts that prevent them from directly activating a biological process. The location of master regulators at the bottom of a highly connected network would allow maximum information input to be interpreted through upstream components and relayed into a final decision output; thus master regulators often represent important regulatory nodes in biological networks.

For example, Twist, a master regulator controlling gene expression in embryonic morphogenesis, is responsible for tumor invasion and metastasis Yang et al. Further analysis of the transcription factor network has also revealed an additional novel aspect of regulatory network hierarchy. When the binding targets of E. Similar to the middle managers in social networks such as governmental hierarchies, transcription factors in the middle layers often regulate more targets and have higher betweenness, indicating that they may function as bottlenecks in the hierarchy.

With more interaction data gathered in the future, such hierarchical structures can also be investigated in other directed networks such as metabolic networks and phosphorylation networks. Transcriptional control and post-translational regulation with kinase phosphorylation are two major methods eukaryotes use for gene regulation; each controls a large number of targets. As shown in Figure 4 , we have performed a detailed comparison of the network topologies of the yeast transcription factor-binding network and phosphorylation network under rich-nutrient conditions.

These networks contain a remarkable number of similarities. First, the two networks share similar degree distributions: exponential in-degree distributions Fig. Second, many topological parameters are comparable between the two networks; however, the phosphorylation network is denser than the transcription factor-binding network and contains more nodes with large in- and out-degrees. Finally, the current phosphorylation network is smaller than the transcription factor-binding network. Both networks are built on incomplete data sets and may contain errors.

The yeast phosphorylation data, in particular, are primarily collected from one large-scale study covering only two-thirds of all the yeast kinases. The transcription factor-binding network has more experimental sources and therefore a larger coverage. Since diameter is positively correlated with network size, and limited sampling of a network often lowers the average clustering coefficient (Friedel and Zimmer), the difference in network size may explain why the transcription factor-binding network has a larger diameter and a higher clustering coefficient.

The yeast phosphorylation network resembles the transcription factor-binding network in its topological structure. (A) The in-degree and out-degree distributions were plotted after the nodes were binned into several degree intervals. Both networks have exponential in-degree distributions and power-law out-degree distributions.

(B) Many topological parameters are comparable between the two networks, except that the transcriptional network is larger and the phosphorylation network is denser.

Although initial studies have characterized the global topological structure of biological networks, recently much attention has been paid to the local units of the networks. Large subgraph units, assembled from groups of densely associated proteins and connected to each other by loose links, are defined as network modules (Girvan and Newman; Rives and Galitski; Newman). Such community-like network modules have been uncovered in many types of social networks as well as biological networks, in which they often function as essential components of the network.

For example, one study of protein interactions in a transcriptional network indicates that different types of transcriptional regulators such as transcription factors, nuclear transporters, and nucleosome remodeling proteins prefer to form modules within each class, and the modules are joined by sparse connections (Tsankov et al.). The modules often contain proteins of unknown function, and therefore may shed light on protein function predictions.

Furthermore, two classes of proteins are revealed by studies of modular structures. Many methods have been developed to identify possible network modules. A traditional method, hierarchical clustering, assigns a weight value to the distance between any two nodes in a network, and then gathers nodes with similar weight vectors together into strongly connected cores (Rives and Galitski). Instead of detecting cores of modules as in hierarchical clustering, the Girvan-Newman algorithm focuses on defining the boundaries of modules by searching for edges with high betweenness, which are more likely to link different modules (Girvan and Newman). Other algorithms have been introduced recently and may demonstrate improvement in module identification (Guimera and Nunes Amaral; Adamcsek et al.).
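The boundary-finding idea behind the Girvan-Newman algorithm can be sketched as follows: compute the betweenness of every edge, remove the edge with the highest value, and repeat until the network falls apart into the desired number of pieces. This is a simplified illustration for small unweighted, undirected graphs. The stopping rule (a target number of modules) and all function names are assumptions made for the example, and edge betweenness is computed with a Brandes-style accumulation.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths crossing each edge (split evenly on ties)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate from the farthest nodes inward
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((u, w))] += share
                delta[u] += share
    # Every node pair was visited from both ends, so halve the totals.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def girvan_newman(adj, n_modules):
    """Remove highest-betweenness edges until n_modules components exist."""
    g = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while len(components(g)) < n_modules and any(g.values()):
        bet = edge_betweenness(g)  # recomputed after every removal
        u, v = max(bet, key=bet.get)
        g[u].discard(v)
        g[v].discard(u)
    return components(g)
```

For two triangles joined by a single bridge, every shortest path between the triangles crosses the bridge, so it has the highest betweenness and is removed first, leaving the two triangles as modules. Recomputing betweenness after each removal is what makes the method expensive on large networks.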

One concern, however, is that network modules are often dependent on the methods and parameters used in the initial data partitioning, and in general it is difficult to tell which method is better (Barabasi and Oltvai). Furthermore, inaccurate and incomplete data in the interaction networks may also lead to biased module predictions. Nonetheless, network modules are still ubiquitous structures in most biological networks and may help one to better understand the interplay between network structure and function.

The availability of large interaction data sets allows the identification of much smaller common patterns or motifs within large networks that are used with significantly higher frequencies relative to randomized networks. Analysis of transcription factor-binding data in E. It is possible that many, and perhaps all, single input motifs in eukaryotes are the result of incomplete data and that most genes probably contain multiple inputs.

We applied a tool, mfinder Milo et al. Both data sets were generated in yeast cells grown in rich media conditions.


Among all possible three-element motifs, the feed-forward loop (FFL) was found to be well overrepresented in transcriptional networks (Fig.). Coherent FFLs, in which both transcription factors have the same regulatory effect (induction or repression) on the target, may suggest a functional design for gene transcription regulation. Studies have shown that coherent FFLs can control downstream processes in a fashion that is resistant to transient noise, since targets in an FFL can only be effectively regulated through persistent signals (Shen-Orr et al.).
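Overrepresentation of a motif is judged by comparing its count in the real network with counts in randomized networks. The sketch below counts FFLs in a directed edge list and computes a z-score against a null model built by degree-preserving edge swaps. It is an illustration only and does not reproduce the mfinder tool; the function names, the number of swaps, and the number of randomized networks are arbitrary assumptions.

```python
import random
from statistics import mean, pstdev


def count_ffl(edges):
    """Count feed-forward loops: X->Y, Y->Z and X->Z, all nodes distinct."""
    succ = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            succ.setdefault(u, set()).add(v)
    total = 0
    for x, targets in succ.items():
        for y in targets:
            for z in succ.get(y, ()):
                if z != x and z in targets:
                    total += 1
    return total


def randomize(edges, n_swaps, rng):
    """Shuffle a directed network while keeping every in- and out-degree.

    Two edges (a, b) and (c, d) are rewired to (a, d) and (c, b) unless
    that would create a self-loop or a duplicate edge.
    """
    edge_list = list(edges)
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def ffl_zscore(edges, n_random=100, seed=0):
    """Observed FFL count and its z-score against randomized networks."""
    rng = random.Random(seed)
    edges = sorted(set(edges))  # drop duplicates, fix the order
    observed = count_ffl(edges)
    null = [
        count_ffl(randomize(edges, 10 * len(edges), rng))
        for _ in range(n_random)
    ]
    spread = pstdev(null)
    z = (observed - mean(null)) / spread if spread > 0 else float("nan")
    return observed, z
```

A large positive z-score indicates that the motif occurs more often than expected for a network with the same degree sequence. The edge list needs at least two distinct edges for the swap step to work.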


Bi-FFL motifs are also significantly enriched in yeast transcription factor-binding networks.

All three-unit and four-unit motifs enriched in the yeast transcription factor-binding (TF) network and phosphorylation (PHO) network. The units are colored red as regulators and green as targets. The significance of enrichment is calculated by comparing motif numbers in the transcription factor or phosphorylation networks (solid bars) with the numbers from randomized networks (hollow bars) and is indicated by the z-scores. (A) Bi-fan motifs, in which two regulators bind common targets, are enriched in both the transcription factor network and phosphorylation network. (B) Bi-parallel motifs, in which one regulator controls two other regulators that further regulate one target gene, are enriched in both the transcription factor network and phosphorylation network. (C) FFLs, in which one regulator controls another regulator and both of them bind a common target, are enriched in the transcription factor network only. (D) Bi-FFL motifs, in which one regulator controls another regulator and both of them bind two common targets, are enriched in the transcription factor network only. (E) Single input motifs, in which one regulator binds to multiple targets, are enriched in the phosphorylation network only.

Thus far, FFLs are not enriched in the current yeast phosphorylation network. This may be due to the approach used to prepare the network that tends to underestimate the phosphorylation events between kinases, and additional data may be required to properly evaluate this network.

However, it is possible that the lack of FFL in phosphorylation networks relative to transcriptional networks also reflects the biology of these networks.

Phosphorylation networks are often activated by transient signals that lead to extremely rapid responses on the order of a few minutes. In contrast, transcriptional networks are slower and take longer to reach steady state. Two four-element motifs were enriched in both the yeast transcriptional network and the phosphorylation network Fig. Moreover, the cooperation of transcription factors to regulate targets can also compensate for the degeneracy and low affinity of single transcription factor-binding sites Pilpel et al.


Bi-parallel motifs are found in both transcriptional and phosphorylation networks and indicate redundancy. In addition to the two four-element motifs shared by both networks, the single input motif (SIM) was found to be overrepresented only in the yeast phosphorylation network. This likely reflects the limited phosphorylation data currently available.

Integration of different experimental resources is used in several different ways: (1) to improve the accuracy of interactions, (2) to identify composite motifs, and (3) to make functional predictions. Integration of similar data sets generated with different methods provides a crucial way to improve data quality and recover missing data. Studies on C. Further functional analysis demonstrated that gene pairs connected by interactions from multiple sources are more likely to belong to the same GO functional categories, indicating improved accuracy through data integration.
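The first use of integration, keeping interactions that are supported by more than one source, can be sketched as a simple vote count over undirected pairs. The source names and gene labels below are hypothetical placeholders, and the threshold of two sources is an assumption chosen for the example rather than a value taken from the studies discussed.

```python
from collections import Counter


def edge_support(sources):
    """Count how many sources report each undirected interaction.

    `sources` maps a source name to a list of (gene, gene) pairs.
    A pair reported in either orientation counts once per source.
    """
    support = Counter()
    for pairs in sources.values():
        for edge in {frozenset(p) for p in pairs}:
            support[edge] += 1
    return support


def high_confidence(sources, min_sources=2):
    """Interactions reported by at least `min_sources` sources."""
    return {e for e, n in edge_support(sources).items() if n >= min_sources}


# Hypothetical example with three evidence sources.
example_sources = {
    "two_hybrid": [("A", "B"), ("B", "C")],
    "affinity_purification": [("B", "A"), ("C", "D")],
    "literature": [("A", "B"), ("A", "B")],
}
confident = high_confidence(example_sources)  # only the A-B pair remains
```

Storing each pair as a `frozenset` makes (A, B) and (B, A) the same interaction, which suits undirected data such as protein-protein interactions; directed data such as transcription factor binding would need ordered tuples instead.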

In the transcriptional network, integration with gene expression data sets has also proven useful for improving data quality and revealing novel cis-regulatory modules (Bar-Joseph et al.). Recent bioinformatics software platforms enable users to query and integrate very different types of interaction data to learn new information (Breitkreutz et al.).

Instead of searching for overlapping interactions, integration of very different types of interaction data can also be performed to reveal composite motifs that contain multiple types of interactions and elements as basic units. Investigations in this mega-network revealed seven three-element kinase-centered composite motifs Fig. Thus, network integration combines various data sources together and therefore can assist in uncovering proteins that are important in multiple types of interactions and provide a more comprehensive view on their cellular functions.

Moreover, this network can be combined with other networks such as biochemical and gene interaction data to reveal a more comprehensive view of regulation in yeast. Network integration: mega-network and composite motifs. (A) Three types of interactions, phosphorylation (blue), transcription factor binding (yellow), and protein-protein (magenta), are combined into a mega-network. (B) Seven three-element kinase-centered composite motifs are listed. Motifs 1 to 5 were found to be enriched in the yeast integrated network. In addition to mapping gene roles in a multirelationship network, integration of a variety of relevant genomic data can directly help to predict gene functions and functional relationships such as protein-protein interactions (Jansen et al.).