Data Structures & Algorithms for Exact Inference in Hierarchical Clustering

Feb 26, 2020
27 pages
Abstract (arXiv):
Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. Typically, approximate algorithms are used for inference due to the combinatorial number of possible hierarchical clusterings. In contrast to existing methods, we present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure, and we prove that we can exactly compute the partition function, maximum likelihood hierarchy, and marginal probabilities of sub-hierarchies and clusters. Our algorithms scale in time and space proportional to the powerset of N elements, which is super-exponentially more efficient than explicitly considering each of the (2N-3)!! possible hierarchies. Also, for larger datasets where our exact algorithms become infeasible, we introduce an approximate algorithm based on a sparse trellis that compares well to other benchmarks. Exact methods are relevant to data analyses in particle physics and for finding correlations among gene expression in cancer genomics, and we give examples in both areas, where our algorithms outperform greedy and beam search baselines. In addition, we consider Dasgupta's cost with synthetic data.
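To make the scaling claim concrete, here is a toy sketch (our own illustration, not the paper's implementation) of a subset-memoized recursion of the kind the trellis enables. It assumes a hypothetical `split_potential(left, right)` energy that factorizes over the splits of a binary hierarchy; with every potential set to 1, the partition function simply counts the (2N-3)!! hierarchies, which lets us check the two quantities against each other:

```python
from functools import lru_cache

def num_hierarchies(n):
    """(2n-3)!!: the number of binary hierarchies over n labeled elements."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

def partition_function(n, split_potential):
    """Sum, over every binary hierarchy of n elements, of the product of
    per-split potentials. Memoizing over subsets (bitmasks) yields the
    powerset-sized computation the abstract describes -- roughly O(3^n) time
    and O(2^n) space -- instead of enumerating all (2n-3)!! hierarchies."""
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def Z(mask):
        if mask & (mask - 1) == 0:       # single element: one trivial hierarchy
            return 1.0
        lowest = mask & -mask            # pin the lowest element to the left child
        rest = mask ^ lowest             # ...so each unordered split counts once
        total, s = 0.0, rest
        while True:
            s = (s - 1) & rest           # next subset of `rest` (descending)
            left = lowest | s
            right = mask ^ left
            total += split_potential(left, right) * Z(left) * Z(right)
            if s == 0:
                break
        return total

    return Z(full)
```

For example, `partition_function(4, lambda l, r: 1.0)` returns 15.0, matching `num_hierarchies(4)` = 5!! = 15. The real algorithms in the paper support richer energy models and MAP/marginal queries; this sketch only illustrates the subset-trellis idea.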
Note:
  • 27 pages, 12 figures
  • [1] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Symposium on Theory of Computing (STOC).
  • [2] Alexander Kraskov, Harald Stögbauer, Ralph Andrzejak, and Peter Grassberger. Hierarchical clustering using mutual information. EPL 70 (2005) 278.
  • [3] Philipp Cimiano and Steffen Staab. Learning concept hierarchies from text with a guided agglomerative clustering algorithm. In Proceedings of the ICML Workshop on Learning and Extending Lexical Ontologies with Machine Learning Methods.
  • [4] Therese Sørlie, Charles Perou, Robert Tibshirani, Turid Aas, Stephanie Geisler, Hilde Johnsen, Trevor Hastie, Michael B Eisen, Matt Van De Rijn, Stefanie S Jeffrey, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Nat. Acad. Sci. 98 (2001) 10869-10874.
  • [6] C. Blundell, YW Teh, and KA Heller. Bayesian rose trees. In Uncertainty in Artificial Intelligence (UAI).
  • [7] Yee Whye Teh, Hal Daume III, and Daniel M Roy. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems (NeurIPS).
  • [8] Levi Boyles and Max Welling. The time-marginalized coalescent prior for hierarchical clustering. In Advances in Neural Information Processing Systems, pages 2969-2977.
  • [9] Yuening Hu, Jordan Boyd-Graber, Hal Daume III, and Z Irene Ying. Binary to bushy: Bayesian hierarchical clustering with the beta coalescent. In Advances in Neural Information Processing Systems (NeurIPS).
  • [10] Radford Neal. Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics.
  • [11] David Knowles and Zoubin Ghahramani. Pitman-Yor diffusion trees. In Conference on Uncertainty in Artificial Intelligence (UAI).
  • [12] Marc Suchard, Philippe Lemey, Guy Baele, Daniel L Ayres, Alexei J Drummond, and Andrew Rambaut. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution.
  • [13] Liangliang Wang, Alexandre Bouchard-Côté, and Arnaud Doucet. Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. Journal of the American Statistical Association, 110(512):1362-1374.
  • [14] David Callan. A combinatorial survey of identities for the double factorial.
  • [15] E. Dale and J. Moon. The permuted analogues of three Catalan sets.
  • [16] Craig Greenberg, Nicholas Monath, Ari Kobren, Patrick Flaherty, Andrew McGregor, and Andrew McCallum. Compact representation of uncertainty in clustering. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, et al., editors, Advances in Neural Information Processing Systems 31, pages 8630-8640. Curran Associates, Inc.
  • [17] Vincent Cohen-Addad, Varun Kanade, and Frederik Mallmann-Trenn. Hierarchical clustering beyond the worst-case. In Advances in Neural Information Processing Systems (NeurIPS).
  • [18] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. Journal of the ACM (JACM), 66(4):1-42.
  • [19] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841-854.
  • [20] Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2291-2304.
  • [21] Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems.
  • [22] Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. J. Machine Learning Res. 18 (2017) 3077-3111.
  • [23] Kyle Cranmer, Sebastian Macaluso, and Duccio Pappadopulo. Toy Generative Model for Jets.