Google s2 lib研究之空间覆盖

2019-03-30

本文适合于对Google s2有一定了解或对空间索引有一定了解的读者。

空间索引常用的方法是空间填充曲线，比如Hilbert, Geohash等，被用于许多查询场景中。Google也推出了自己的空间索引——Google s2，这个项目是一个开源项目，有兴趣的可以直接看Google s2 java版源码和官方文档。对Google s2的介绍可以看Google’s S2, geometry on the sphere, cells and Hilbert curve，中文的可以看高效的多维空间点索引算法 — Geohash 和 Google S2。关于Google s2的基本原理在此就不再赘述，本文主要讲s2中一个重要的算法——空间覆盖。

可能这个算法名称乍一看有点一头雾水，在这里截取一张官方文档中的图：

regionCover

意思就是，使用若干个s2定义的空间单元去逼近一个任意形状的区域，这些空间单元可能大小不一，并且可以设置最大的空间单元、最小的空间单元和空间单元数量。

官方文档没有给出其原理，在看了Google s2的源码之后，终于对这个算法有了一个初步的了解。

在讲算法之前，先说几点重要的概念：

每个非最底层的空间单元（简称cell，下同）都有四个子cell，即在Google s2中空间索引是四叉树的形式；
每个cell都有其特定的id，id的命名方法请参考官方文档。

接下来开始说算法：

算法输入是一个空间区域region，cell的最小层级minLevel和最大层级maxLevel，返回cell的数量maxCells，返回结果是cell id的集合result。源代码中有一个十分重要的函数addCandidate，过程有点复杂，在之后讲。在此只描述maxCells大于等于4且层级粒度为1时（特定条件）的主体算法。

定义一个优先队列candidateQueue，用于存储cell。其中排序的依据按照cell的权重由大到小排列，cell的权重计算方法后面讲。
初始化candidateQueue：用四个cell完全覆盖region的最小外包圆，这四个cell是同一层级的刚好能覆盖region的四个cell，即如果cell的层级再高一层，就不能完全覆盖region。对这四个cell执行addCandidate函数。
弹出candidateQueue的队首元素candidate，如果candidate的层级小于最小层级 || candidate的与region相交的子cell的数量只有1个 || resul.size()+candidateQueue.size()+candidate的与region相交的孩子的数量 <= maxCells，则对candidate的与region相交的所有子cell执行addCandidate函数；否则将该candidate的id插入到result中。
如果candidateQueue不为空且result中元素的数量小于maxCells，则返回步骤3；否则结束算法。

接下来讲一下重点：addCandidate函数

addCandidate函数的源码如下（在本文中，根据特定条件作了一点修改，以方便读者理解）：

/**
   * Process a candidate by either adding it to the result list or expanding its
   * children and inserting it into the priority queue. Passing an argument of
   * NULL does nothing.
   */
  private void addCandidate(Candidate candidate) {
    if (candidate == null) {
      return;
    }

    if (candidate.isTerminal) {
      result.add(candidate.cell.id());
      return;
    }

    // Expand one level at a time until we hit min_level_ to ensure that
    // we don't skip over it.
    int numLevels = (candidate.cell.level() < minLevel) ? 1 : levelMod;
    int numTerminals = expandChildren(candidate, candidate.cell, numLevels);

    if (candidate.numChildren == 0) {
      // Do nothing
    } else if (numTerminals == 1 << 2
        && candidate.cell.level() >= minLevel) {
      // Optimization: add the parent cell rather than all of its children.
      candidate.isTerminal = true;
      addCandidate(candidate);

    } else {
      // We negate the priority so that smaller absolute priorities are returned
      // first. The heuristic is designed to refine the largest cells first,
      // since those are where we have the largest potential gain. Among cells
      // at the same level, we prefer the cells with the smallest number of
      // intersecting children. Finally, we prefer cells that have the smallest
      // number of children that cannot be refined any further.
      int priority = -((((candidate.cell.level() << 2 + candidate.numChildren)
          << 2) + numTerminals);
      candidateQueue.add(new QueueEntry(priority, candidate));
    }
  }

从代码中可以看到，如果candidate被标记了isTerminal，就可以直接加入到result中，cell被标记isTerminal的条件是：

cell被region完全包含
cell再向下剖分后层级会超过maxLevel

其中expandChildren函数是获取一个cell有会被标记为isTerminal的子cell数量，这个数量被记为numTerminals。如果一个cell的所有子cell全被标记为isTerminal，则直接将这个cell加入到result中（这个是一个优化项（Optimization），再向下划分毫无意义，这样做减小了算法复杂度）；否则将这个cell加入到candidateQueue中，其中优先级的计算方法我至今没搞懂，有弄懂的欢迎评论，优先级的计算转换为数学公式如下：
$$
priority = -（(level \times 4 + numChildren)\times 4 + numTerminals）
$$
其中level表示这个cell的层级，numChildren表示这个cell的和region相交的子cell数量。candidateQueue按照这个优先级由高到低排列。

做一个总结：这个算法相当于先将初始的cell（保证能完全覆盖region的cell，一般为4个）压入candidateQueue中，然后不断地从candidateQueue中获取需要处理的cell。如果cell有需要处理的子cell则将其子cell加入队列，如果cell已不需要处理则直接插入到结果列表中。直到candidateQueue为空或者result中元素的数量达到预设的

以上就是我从Google s2的java版源码中看到的空间覆盖算法，在给出了的测试有如下效果：

Google s2空间覆盖示例

后来翻到了有人给出了更详细的s2空间覆盖源码剖析，是Go语言的：Google S2 是如何解决空间覆盖最优解问题的?

Stay hungry. Stay foolish.

技术

Google s2 lib研究之空间覆盖