A post caught my interesting, which is about the huge boost on x264 encoding. We will all see the change when the kernel 2.6.32 comes, it uses BFS1, a new process scheduler. The current kernel uses CFS (Completely Fair Scheduler). Phoronix has a benchmark report for BFS.

I decided to do a little test. Because I have no background, so the test is just a rough numbers. Here is my system:

  • Core 2 Duo T5600 1.83GHz
  • Gentoo amd64
  • Kernel 2.6.31-gentoo-r4 (It’s kernel 2.6.31 with genpatch)
  • Kernel 2.6.32-rc5 (Just for reference)
  • kernel config file from my 2.6.30 configuration
  • BFS v3042

1   Simple concurrent process test

First, I have no idea how should I test. I just want to get some number to read. My test program is written in Python (So you knew I really have no idea), because Python’s threading could only use one core, therefore I used processing to utilize my two cores. The test code shows as follows:

#!/usr/bin/env python

from time import sleep
from time import time
import os

from processing import activeChildren
from processing import Process


MAX_COUNT = 1000000
MAX_PROCESS = 32
PASSES = 5


def counter():

  c = 0
  while c < MAX_COUNT:
    c += 1


def do_test(PROCESS_COUNT):

  processes = []
  for i in range(PROCESS_COUNT):
    p = Process(target=counter)
    p.setDaemon(True)
    processes.append(p)

  t_start = time()
  for p in processes:
    p.start()
  while activeChildren():
    sleep(0.001)
  t_end = time()
  return t_end - t_start


def main():

  avg_times = []
  for concurr in range(1, MAX_PROCESS + 1):
    print 'Running concurrent %3d processes...' % concurr,
    acc_time = 0.0
    for i in range(PASSES):
      acc_time += do_test(concurr)
    avg_time = acc_time / PASSES
    avg_times.append(avg_time)
    print '%9f seconds' % avg_time

  fname = 'test-c%d-p%d-%s' % (MAX_COUNT, MAX_PROCESS, os.uname()[2])
  f = open(fname, 'w')
  data = (os.uname()[2], avg_times)
  f.write(repr(data))
  f.close()
  print
  print 'Data written to %s.' % fname


if __name__ == '__main__':
  main()

As you can see, the created process doesn’t do actual things. It just keeps counting up. Each test would run five times and get an average of elapsed time.

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-KYR4pBMB2USJuQvH0bl6JjJOZpbUibkWsRSbnTN-uAVwfUhxzKG53ke-7K6PM8pykEHn64Ss3sYuwwKK29-s-pU4emt4yerEdW-XHhY5HlQZoMdD25G2OQuRA5-y042IW2PBb6e7R2w/s1600/p.png
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqEjIowwfj7JoEJpwdYzgrlY-LyKRwEulMqz7dbOcNnB2E0BSVL5tdEfeGgUMsg3qzp25X06-j39AztQ14VMN5Vovrqv4Pa6pneGcFDzATkA-NRl-2KKAZSGb1OvHtBeg3NDUs-L_MJG0/s1600/cost.png
#Proc  2.6.31   (Cost)  2.6.31-bfs   (Cost)  2.6.32-rc5   (Cost)     Impr%
  1  0.156739( 0.156739)  0.151051( 0.151051)  0.156921( 0.156921)   3.63%
  2  0.220114( 0.110057)  0.152029( 0.076014)  0.155770( 0.077885)  30.93%
  3  0.311475( 0.103825)  0.230545( 0.076848)  0.238812( 0.079604)  25.98%
  4  0.318061( 0.079515)  0.316369( 0.079092)  0.307760( 0.076940)   0.53%
  5  0.451945( 0.090389)  0.379189( 0.075838)  0.390504( 0.078101)  16.10%
  6  0.493751( 0.082292)  0.462939( 0.077157)  0.466906( 0.077818)   6.24%
  7  0.617738( 0.088248)  0.534748( 0.076393)  0.542779( 0.077540)  13.43%
  8  0.639118( 0.079890)  0.614359( 0.076795)  0.619499( 0.077437)   3.87%
  9  0.759511( 0.084390)  0.685477( 0.076164)  0.698372( 0.077597)   9.75%
 10  0.782722( 0.078272)  0.766383( 0.076638)  0.775433( 0.077543)   2.09%
 11  0.883538( 0.080322)  0.834171( 0.075834)  0.852352( 0.077487)   5.59%
 12  0.942532( 0.078544)  0.913356( 0.076113)  0.930670( 0.077556)   3.10%
 13  1.056529( 0.081271)  0.998580( 0.076814)  1.014098( 0.078008)   5.48%
 14  1.096089( 0.078292)  1.062566( 0.075898)  1.092146( 0.078010)   3.06%
 15  1.207977( 0.080532)  1.136702( 0.075780)  1.172611( 0.078174)   5.90%
 16  1.276357( 0.079772)  1.215091( 0.075943)  1.253329( 0.078333)   4.80%
 17  1.360708( 0.080042)  1.289036( 0.075826)  1.323230( 0.077837)   5.27%
 18  1.415339( 0.078630)  1.372739( 0.076263)  1.408926( 0.078274)   3.01%
 19  1.507686( 0.079352)  1.442193( 0.075905)  1.486329( 0.078228)   4.34%
 20  1.617503( 0.080875)  1.510492( 0.075525)  1.562270( 0.078114)   6.62%
 21  1.652496( 0.078690)  1.592773( 0.075846)  1.644975( 0.078332)   3.61%
 22  1.737832( 0.078992)  1.680954( 0.076407)  1.735154( 0.078871)   3.27%
 23  1.797144( 0.078137)  1.738987( 0.075608)  1.833433( 0.079714)   3.24%
 24  1.873404( 0.078058)  1.814224( 0.075593)  1.880322( 0.078347)   3.16%
 25  1.959188( 0.078368)  1.886172( 0.075447)  1.958433( 0.078337)   3.73%
 26  2.044073( 0.078618)  1.967344( 0.075667)  2.041252( 0.078510)   3.75%
 27  2.094033( 0.077557)  2.032489( 0.075277)  2.110076( 0.078151)   2.94%
 28  2.186407( 0.078086)  2.112829( 0.075458)  2.197447( 0.078480)   3.37%
 29  2.256274( 0.077803)  2.187584( 0.075434)  2.267832( 0.078201)   3.04%
 30  2.341825( 0.078061)  2.263803( 0.075460)  2.344355( 0.078145)   3.33%
 31  2.413136( 0.077843)  2.338420( 0.075433)  2.446170( 0.078909)   3.10%
 32  2.500092( 0.078128)  2.429601( 0.075925)  2.522627( 0.078832)   2.82%

Note that Improvement% is 2.6.31-bfs improvement over 2.6.31, cost means the time to run 1000000 count per process,

2   Compilation Test

For a real use test, I think compilation is a good one to do. So I chose e2fsprog 1.41.9 to test, here are the result:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFmn9F8SGqwa7_6W2M8uANNlPRPIlIESsVFcnRIeyIr7_tFhszKgUWiibLuz6iAe8xPl0EGlsm0VAXJrxhJPOYHceTAvI_5NgsG9wsBs-tU5cgHyTQFu1XKSUERbGfb6oX0Y90WSl7iGs/s1600/compilation.png
#Job 2.6.31  2.6.31-bfs 2.6.32-rc5  Impr%
  1 36.232000 35.235000 35.746000   2.75%
  2 23.511000 20.168000 20.327000  14.22%
  3 21.740000 20.885000 20.902000   3.93%
  4 21.700000 21.308000 21.397000   1.81%
  5 21.954000 21.433000 21.741000   2.37%

Note that Improvement% is 2.6.31-bfs improvement over 2.6.31.

3   Conclusion

In both tests, all got improved. With the number of concurrent running processes and the number of jobs are the same as the number of cores, the improvements have reached peaks.

One thing more interesting is, with CFS, the performance is better when you use 3 to 4 jobs, especially 4 jobs, it has best time. And the cost comes to minimum.

With BFS, the cost time reaches minimum just 2 jobs. It begins slightly reducing improvement after 2 jobs, the number of cores.

Also BFS has resulted smaller kernel build:

Size    Kernel
1991232 kernel-x86_64-2.6.31-gentoo-r4
1974720 kernel-x86_64-2.6.31-gentoo-r4-bfs

It seems that CFS reaches the best performance at 2*N processes and BFS reaches the best performance at N processes, where N is the number of cores. And BFS always got the better performance in these tests.

CFS doesn’t stay steady after 2*N processes, the curve finally becomes steady after 24 processes. But BFS is already steady at N processes.

BFS does look good.

(results plotting python codes: 1, 2, 3)

[1]http://ck.kolivas.org/patches/bfs/sched-BFS.txt is gone.
[2]http://ck.kolivas.org/patches/bfs/old/2.6.31-sched-bfs-304.patch is gone