E-PVM:一种应用于异构资源负载平衡的机会成本方法

本文主要讲述了Borg论文中引用的基于机会成本的E-PVM算法

这个算法的缺点(大型任务难以调度)是在我的场景下是可接受的

对Borg来说,任务实际是一个任务组,亲和需求的任务组需要特别多的资源

而在我的调度问题中,大型任务是很少的,因为我的调度器不提供任务组的概念,由业务来解耦任务组

我尝试理解该算法后做一些优化,然后通过模拟器模拟和分集群测试来查看是否可以提升资源利用率

背景介绍

Borg论文提到了两种调度算法

  • worst fit

    Borg最早使用E-PVM的变种进行评分,它在异构资源中生成单一成本值,并在放置任务时,使资源最小化

    E-PVM会将负载分布到所有机器上,为负载峰值留出余地

    优点是减少了搁浅效应:某机器一项资源完全用完,另外一项完全没有用也无法被利用上

    缺点是增加了碎片化问题:导致大型任务无法被调度

  • best fit

    尽可能紧密地填满机器。这使得一些机器为空闲(退机或者用来放大型任务)

    优点是解决了碎片化问题

    缺点是搁浅效应,并且平时对资源需求很低的任务,尝试使用额外的资源会无法使用

Borg为了减少搁浅效应最终采用了一种混合模型,它的效率比best fit高3-5%,不过论文中没有具体解释算法是如何实现的

E-PVM

E-PVM来自论文《An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster》

这一篇论文简单讨论了负载均衡的调度问题

负载均衡的调度问题可以分成三种类型:

  • 相同机器的情况:

    这是负载问题中最简单的情况。在这种情况下,所有的机器(比如服务器)都具有相同的处理能力,每个任务的负载量也是相同的。举例来说,如果有10个相同的机器处理100个相同的任务,那么理想的情况下,每个机器应该分配到10个任务。

  • 相关机器的情况:

    在这种情况下,每台机器的处理能力可能会不同,任务的负载量也可能会不同。这时,负载平衡变得更复杂,必须考虑将哪些任务分配给哪些机器。举例来说,如果有一台机器处理能力比其他机器强,那么可能需要将更多或更复杂的任务分配给这台机器。

  • 无关机器的情况:

    这是最复杂的情况。在这种情况下,不同机器的处理能力无法直接比较,任务的负载也可能在不同机器上有不同的影响。举例来说,如果每台机器都处理特定类型的任务,那么需要根据任务类型和机器特性来决定如何分配任务。

在单一负载的情况下,负载均衡的调度问题普遍是相关机器的情况

但是在多维度负载的情况下,使用机会成本方法来计算对机器的影响情况时,负载均衡的调度问题就变成了无关机器的情况

随后

ASSIGN-U

《On-Line Routing of Virtual Circuits with Applications to Load Balancing and Machine Scheduling》

1997年论文

3 On-Line Machine Load-Balancing

In this section, we present several algorithms for online machine load-balancing. Jobs arrive online, and each has to be immediately assigned to one of the machines. The goal is to minimize maximum load.

在这一部分中,我们将展示几种在线机器负载均衡的算法。任务在在线模式下到达,每个任务必须立即分配给一台机器。目标是最小化最大负载。

​ Formally, each job \(j\) is represented by its “load vector” \(\vec{p}(j)~=~(p_1(j)\),\(p_2(j),\ldots,p_n(j))\), where\(p_i(j)\geq0\) Assigning job \(j\) to machine \(i\) increases the load on this machine by \(p_i(j).\)Let \(\ell_i(j)\) denote the load on machine \(i\) after we have already assigned jobs 1 through \(j\) :

​ 形式化地描述,每个任务 \(j\) 由其“负载向量” \(\vec{p}(j)~=~(p_1(j)\),\(p_2(j),\ldots,p_n(j))\) 表示,其中 \(p_i(j)\geq0\) 。将任务 \(j\) 分配给机器\(i\)会将此机器的负载增加 \(p_i(j)\) 。令 \(\ell_i(j)\) 表示在已经分配了任务1至 \(j\) 之后机器 \(i\) 的负载:

\[ \left.\ell_{k}(j)=\left\{\begin{array}{ll}{\ell_{k}(j-1)+p_{k}(j)} &{\mathrm{~if}\quad k=i}\\ {\ell_{k}(j-1)} &{\mathrm{~otherwise.}}\end{array}\right.\right. \]

​ Consider a sequence of jobs defined by \(\sigma=(\vec{p}(1),\vec{p}(2),\ldots,\vec{p}(k))\) . Denote by \(\ell_{i}^{*}(j)\) the load on machine \(i\) achieved by the off-line algorithm \(\mathscr{A}^{*}\) after assigning jobs 1 through \(j\) in \(\sigma\) . The goal of both the off-line and the on-line algorithms is to minimize \(L^*(k) = \max_{i}\ell_{i}^{*}(k)\) and \(L(k) = \max_{i}\ell_{i}(k)\) respectively. More precisely, we measure the performance of the on-line algorithm by the supremum over all possible sequences of \(L(k)/L^*(k)\)(of arbitrary length\(k).\)

​ 考虑由 \(\sigma=(\vec{p}(1),\vec{p}(2),\ldots,\vec{p}(k))\) 定义的任务序列。用 \(\ell_{i}^{*}(j)\) 表示在 \(\sigma\) 中分配任务1至\(j\)之后离线算法 \(\mathscr{A}^{*}\) 实现的机器 \(i\) 的负载。线上算法和线下算法的目标都是要最小化 \(L^*(k) = \max_{i}\ell_{i}^{*}(k)\)\(L(k) = \max_{i}\ell_{i}(k)\) 。更准确地说,我们通过所有可能序列 \(L(k)/L^*(k)\)(长度可变)的上确界来衡量在线算法的性能。

​ As we have mentioned in the Introduction, the load-balancing problems are usually categorized into three classes, based on the properties of the load vectors. For \(identical\) machines, \(i,i^{\prime},j:p_i(j)=p_{i^{\prime}}(j)\) For \(related\) machines, \(i,i^{\prime},j\) , \(j^{\prime}: p_{i}( j) /p_{i^{\prime}}( j) = p_{i}( j^{\prime}) /p_{i^{\prime}}( j^{\prime}) = v_{i^{\prime}}/v_{i},\) where \(v_{i}\) denotes the speed of machine \(i\) . All other cases are referred to as \(unrelated\) machines.

​ 如引言中提到的,根据负载向量的属性,负载均衡问题通常被分类为三类。对于 \(identical\)机器,对于所有的 \(i,i^{\prime},j\)\(p_i(j)=p_{i^{\prime}}(j)\) 。对于 \(related\) 机器,对于所有的 \(i,i^{\prime},j,j^{\prime}\) , 有 \(p_{i}( j) /p_{i^{\prime}}( j) = p_{i}( j^{\prime}) /p_{i^{\prime}}( j^{\prime}) = v_{i^{\prime}}/v_{i}\) ,其中 \(v_{i}\) 表示机器 \(i\) 的速度。所有其他情况被称为 \(unrelated\) 机器。

​ Note that instead of “load” one can talk about “execution time”. Restating the problem in these terms, our goal is to decrease maximum execution time under the requirement that the arriving jobs are scheduled immediately.

​ 注意,除了“负载”,我们还可以讨论“执行时间”。用这些术语重新陈述这个问题,我们的目标是在需要立即调度到达的任务的要求下,减少最大执行时间。

3.1. UNRELATED MACHINES.

In this section, we consider on-line load-balancing on \(unrelated\) machines. As we will show in Section 4, the natural greedy approach is far from optimal for this case, achieving a competitive ratio of \(\Theta(n)\) .

在这一节中,我们考虑在线的 \(unrelated\) 机器负载均衡。如我们将在第4节中展示的,自然的贪婪方法在这种情况下远非最优,其竞争比达到了 \(\Theta(n)\)

​ An \(O(\log n)\)-competitive algorithm for the unrelated machines load-balancing can be constructed as a reduction to the routing problem considered in the previous section. Unfortunately, such reduction results in a confusing and nonintuitive algorithm. Instead, we present a simpler algorithm, specifically designed for the machine load-balancing problem.

​ 对于不相关机器的负载均衡,可以构建一个 \(O(\log n)\)-竞争比的算法,作为在前一节中考虑的路由问题的简化。不幸的是,这种简化使得算法变得混乱且不直观。相反,我们提出了一个更简单的算法,专门用于机器负载均衡问题。

​ For simplicity, we first consider the case where we are given a parameter \(\Lambda\), such that \(\Lambda\geq L^*\) As before, an appropriate value of \(\Lambda\) can be “guessed” using a simple doubling approach, increasing the competitive ratio by at most a factor of 4. We use tilde to denote normalization by \(\Lambda\) , that is, \(\tilde{x}=x/\Lambda.\)

​ 为了简便,我们首先考虑给定参数 \(\Lambda\) 的情况,其中 \(\Lambda\geq L^*\) 。和之前一样,合适的 \(\Lambda\) 值可以通过一个简单的加倍方法“猜测”出来,最多将竞争比提高4倍。我们使用波浪线来表示按照 \(\Lambda\) 进行归一化,也就是说,\(\tilde{x}=x/\Lambda\)

\[ \begin{align} &\textbf{procedure }ASSIGN-U(\vec{p},\vec{\ell},\Lambda,\beta);\\ &\qquad/^*\Lambda \ - \ \textbf{current estimate of }L^*.\\ &\qquad/^*\beta \ - \ \textbf{designed performance guarantee of the algorithm}.\\ &\qquad\textbf{Let s be the index minimizing }\Delta_i=a^{(\tilde\ell_i+\tilde{p}_i)}-a^{\tilde\ell_i};\\ &\qquad\textbf{if } \ell_s+ p_s> \beta\Lambda\\ &\qquad\qquad\textbf{then }b: = fail\\ &\qquad\qquad\textbf{else begin}\\ &\qquad\qquad\qquad\ell_{s}:=\ell_{s}+p_{s};\\ &\qquad\qquad\qquad b: = success\\ &\qquad\qquad\textbf{end;}\\ &\qquad return( \vec{\ell} , b) .\\ &\textbf{end.} \end{align} \]

​ Algorithm ASSIGN-U is shown in Figure 2. The basic step is to assign job \(j\) to make \(_{i=1}^na^{\tilde\ell_i(j)}\) as small as possible. In the description of the algorithm, we have omitted the job index \(j\), since a single invocation of the algorithm deals only with a single job. We will use the notion of \(designed\) performance guarantee similarly to its use in the on-line routing case: the algorithm accepts a parameter \(\Lambda\)and never creates load that exceeds \(\beta\Lambda\). The algorithm is allowed to return “fail" and to refuse to schedule a job if \(\Lambda<L^*\); otherwise, it has to schedule all of thc arriving jobs.

​ 算法ASSIGN-U如图2所示。基本步骤是将任务 \(j\)分配给使 \(_{i=1}^na^{\tilde\ell_i(j)}\) 尽可能小的机器。在算法描述中,我们省略了任务索引 \(j\),因为一个算法调用仅处理一个任务。我们将使用设计性能保障的概念,就像在在线路由案例中的使用一样:算法接受一个参数 \(\Lambda\) 并且不会产生超过 \(\beta\Lambda\) 的负载。如果 \(\Lambda<L^*\),算法可以返回“失败”并拒绝调度任务;否则,它必须调度所有到达的任务。

​ LEMMA 3.1.1. if \(\lambda^*\leq\Lambda\), then there exists \(\beta=O(\log n)\) such that algorithm ASSIGN-U never fails. Thus, the load on a machine never exceeds \(\beta \Lambda\).

​ 引理 3.1.1:当 \(\lambda^*\leq\Lambda\),存在一个 \(\beta=O(\log n)\) 使得算法ASSIGN-U不会失败。因此,机器的负载永远不会超过 \(\beta \Lambda\)

​ PRooF. Consider the state of the system after scheduling \(j-1\) jobs, and let \(a,\gamma>1\) be constants. (Later, we show that a good choice is \(a\approx2,\gamma\approx1\).) Recall the assumption that \(L^*(j)\leq L^*\leq\Lambda\), and define the potential function

​ 证明:考虑在调度 \(j-1\)个任务后的系统状态,令 \(a,\gamma>1\)为常数(稍后,我们将表明好的选择是 \(a\approx2,\gamma\approx1\)。)回顾一下我们的假设,\(L^*(j)\leq L^*\leq\Lambda\),并定义潜在函数

\[ \Phi(j)=\sum_{i=1}^{n}\:a^{\tilde{\ell}_{i}(j)}(\gamma-\tilde{\ell}\:_{i}^{*}(j)). \]

​ Assume that job \(j\) was assigned to machine \(i'\) by the on-line algorithm and to machine \(i\) by the off-line algorithm. Analogously to the proof of Eq. (3), we now have

​ 假设任务 \(j\) 被在线算法分配给了机器 \(i'\) ,并且被离线算法分配给了机器 \(i\)。类似于等式(3)的证明,我们现在有

\[ \begin{aligned} \Phi(j)-\Phi(j-1)& =(\gamma-\tilde{\ell}_{i'}^*(j-1))(a^{\tilde\ell_{i'}(j)}-a^{\tilde\ell_{i'}(j-1)})-a^{\tilde\ell_i(j)}\tilde{p}_i(j) \\ &\leq\gamma(a^{\tilde{\ell}_i(j-1)+\tilde{p}_{i^{\prime}}(j)}-a^{\tilde{\ell}_{i^{\prime}}(j-1)})-a^{\tilde{\ell}_i(j-1)}\tilde{p}_i(j) \\ &\leq\gamma(a^{\tilde{\ell}_i(j-1)+\tilde{p}_i(j)}-a^{\tilde{\ell}_i(j-1)})-a^{\tilde{\ell}_i(j-1)}\tilde{p}_i(j) \\ &=a^{-\tilde\ell_i(j-1)}(\gamma(a^{\tilde{p}_i(j)}-1)-\tilde{p}_i(j)). \end{aligned} \]

​ Note that since the off-line algorithm has assigned job \(j\) to machine \(i\), we have \(0\leq\bar{p}_{i}(j)\leq L^{*}/\Lambda\leq1\). Therefore, in order to show that the potential function does not increase, it is sufficient to show that \(x\in[0,1]:\dot{\gamma(a^x-1)}\leq x.\) This is true for \(a=1+1/\gamma\).

​ 注意,由于离线算法已经将任务 \(j\)分配给了机器\(i\),我们有 \(0\leq\bar{p}_{i}(j)\leq L^{*}/\Lambda\leq1\)。因此,为了显示潜在函数不会增加,我们只需要显示对于 \(x\in[0,1]:\dot{\gamma(a^x-1)}\leq x.\)。对于 \(a=1+1/\gamma\),这是成立的。

​ Since initially \(\Phi(0)=\gamma n\), at any point in the. assignment proccss

​ 由于最初 \(\Phi(0)=\gamma n\),在任务分配过程中的任何点,我们都有

\[ \sum_{i=1}^{n}a^{\tilde{\ell}_{i}(j)}(\gamma-1)\leq\gamma n\:, \]

and hence

因此

\[ L=\operatorname*{max}_{i}\ell_{i}(k)=\Lambda\operatorname*{max}\bar{\ell}_{i}(k)\\ \leq\Lambda\cdot\log_a\left(\frac\gamma{\gamma-1}n\right)=O(\Lambda\log n).\quad\quad\Box\quad(6) \]

​ Notice that the constants in the big \(O\) of (6) are small; for example, for \(\gamma=1.1\), we get \(L/\Lambda\leq1.07\log n + 3.7\). By changing the value of \(\gamma\), one can trade-off the multiplicative factor against the additive term.

​ 注意,公式(6)中的大\(O\)是小的,例如,对于 \(\gamma=1.1\),我们得到\(L/\Lambda\leq1.07\log n + 3.7\)。通过改变 \(\gamma\) 的值,可以权衡乘法因子和加法项。

​ Observe that \(\log n\) is a lower bound even for the restricted case considered in Azar et al. [1992]. Moreover, it is interesting to note that for the case where the coordinates of the load vector \(p(j)\) are either\(\infty\)or equal to some constant \(p_j\) thar depends only on the job \(j\) , our algorithm behaves exactly like the greedy algorithm considered in Azar et al.[1992].

​ 请注意,即使在Azar等人[1992]考虑的限制情况中,\(\log n\)也是一个下限。此外,值得一提的是,对于负载向量 \(p(j)\) 的坐标是 \(\infty\) ,或等于只依赖于任务 \(j\) 的某个常数 \(p(j)\) 的情况是有趣的,我们的算法的行为完全像Azar等人[1992]考虑的贪婪算法。