
Latency benchmarking cohttp shows queueing #328

@jlouis

Hi,

So I'm in the process of running latency benchmarks for a couple of different web server frameworks. One of the frameworks is cohttp. I have the following code for the test:

(* This file is in the public domain *)

open Core.Std
open Async.Std
open Cohttp_async

(* given filename: hello_world.ml compile with:
   $ corebuild hello_world.native -pkg cohttp.async
*)

let handler ~body:_ _sock req =
  let uri = Cohttp.Request.uri req in
  match Uri.path uri with
  | "/" -> Server.respond_with_string "CHAPTER I. Down the Rabbit-Hole  Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, <and what is the use of a book,> thought Alice <without pictures or conversations?> So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, <Oh dear! Oh dear! I shall be late!> (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. In another moment down went Alice after it, never once considering how in the world she was to get out again. The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well. Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. 
First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards......"
  | _ ->
    Server.respond_with_string ~code:`Not_found "Route not found"

let start_server port () =
  eprintf "Listening for HTTP on port %d\n" port;
  eprintf "Try 'curl http://localhost:%d/'\n%!" port;
  Cohttp_async.Server.create ~on_handler_error:`Raise
    (Tcp.on_port port) handler
  >>= fun _ -> Deferred.never ()

let () =
  Command.async_basic
    ~summary:"Start a hello world Async server"
    Command.Spec.(empty +>
      flag "-p" (optional_with_default 8080 int)
        ~doc:"int Source port to listen on"
    ) start_server

  |> Command.run

and it is built with:

#!/bin/sh

corebuild main.native -pkg cohttp.async

echo "Starting server"
ASYNC_CONFIG='((epoll_max_ready_events 10240) (max_num_open_file_descrs 32000) (max_num_threads 50) (file_descr_watcher Epoll) (max_num_jobs_per_priority_per_cycle 500))' ./main.native

There are 2 servers, one load-generating and one running the cohttp server, connected over a 1 gigabit line (iperf shows 953 mbit throughput) with negligible latency (there is a switch in between, but that is all).

The servers are linux:
Linux lady-of-pain 3.19.3-3-ARCH #1 SMP PREEMPT Wed Apr 8 14:10:00 CEST 2015 x86_64 GNU/Linux

Both are fairly recent Core i7 machines, with no virtualization since that only slows things down.

Some sysctl.conf tuning was necessary:

fs.file-max = 65535
fs.nr_open = 65535
net.core.netdev_max_backlog = 4096
net.core.rmem_max = 16777216
net.core.somaxconn = 65535
net.core.wmem_max = 16777216
net.ipv4.ip_forward = 0
net.ipv4.ip_local_port_range = 1025       65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_max_syn_backlog = 20480
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
vm.min_free_kbytes = 65536
vm.overcommit_memory = 1

Async is the most recent point release, and I'm forcing epoll support because the test simply fails when it runs in a select(2) loop.

The load generator is wrk2 (github.com/giltene/wrk2), which avoids coordinated omission. Most load generators coordinate with the system-under-test (SUT): if a connection stalls, no further requests are issued on that connection until the outstanding one completes, so the stall contributes only a single "bad" latency number. wrk2 instead keeps the rate stable: new requests are still planned on the stalling connection at the configured interval, and latency is charged from each request's intended send time, giving more realistic latencies from stalled connections.
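To make the difference concrete, here is a small self-contained sketch with synthetic numbers (not taken from the benchmark below, and not wrk2's actual code): one connection is scheduled to send a request every 100 ms, and one response stalls for about a second. The naive measurement only sees one bad sample; the corrected measurement, taken from each request's intended send time, charges every queued request its full wait.

```ocaml
(* Synthetic illustration of coordinated omission. A connection is meant
   to issue one request every 100 ms; request 2 stalls until t=1300 ms,
   delaying everything behind it. *)

let interval_ms = 100.

(* actual completion times (ms) for 10 requests; request 2 stalls *)
let completions =
  [| 105.; 210.; 1300.; 1310.; 1320.; 1330.; 1340.; 1350.; 1360.; 1370. |]

(* naive: a coordinating generator sends the next request only after the
   previous response arrives, so latency is measured from that point *)
let naive_latencies =
  Array.mapi
    (fun i c ->
      let sent = if i = 0 then 0. else completions.(i - 1) in
      c -. sent)
    completions

(* corrected: latency measured from the intended send time i * interval,
   as a constant-rate generator like wrk2 does *)
let corrected_latencies =
  Array.mapi (fun i c -> c -. (float_of_int i *. interval_ms)) completions

let () =
  Array.iteri
    (fun i _ ->
      Printf.printf "req %d: naive %6.0f ms  corrected %6.0f ms\n" i
        naive_latencies.(i) corrected_latencies.(i))
    completions
```

Note how request 3 looks like a healthy 10 ms under the naive scheme, while the corrected scheme reports the roughly 1 s it actually spent queued behind the stall.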

We run the following:

#!/bin/sh

exec wrk2 \
    -t8 -c10000 -d60S \
    --timeout 2000 \
    -R 30000 --latency \
    -H 'Connection: keep-alive' \
    http://172.16.0.1:$1/

That is, 10k connections at a rate of 30k req/s, which works out to 3 req/s per connection. Such a run yields the following latency histogram:

Running 1m test @ http://172.16.0.1:8080/
  8 threads and 10000 connections
  Thread calibration: mean lat.: 1359.711ms, rate sampling interval: 4898ms
  Thread calibration: mean lat.: 1310.689ms, rate sampling interval: 4870ms
  Thread calibration: mean lat.: 1312.983ms, rate sampling interval: 4866ms
  Thread calibration: mean lat.: 1378.042ms, rate sampling interval: 4874ms
  Thread calibration: mean lat.: 1371.283ms, rate sampling interval: 4898ms
  Thread calibration: mean lat.: 1286.566ms, rate sampling interval: 4861ms
  Thread calibration: mean lat.: 1276.854ms, rate sampling interval: 4861ms
  Thread calibration: mean lat.: 1302.831ms, rate sampling interval: 4870ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.88s     5.44s    0.95m    84.10%
    Req/Sec     1.59k   151.82     1.84k    71.88%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    8.20s 
 75.000%   10.35s 
 90.000%   11.79s 
 99.000%   31.88s 
 99.900%    0.90m 
 99.990%    0.94m 
 99.999%    0.95m 
100.000%    0.95m 

  Detailed Percentile spectrum:
       Value   Percentile   TotalCount 1/(1-Percentile)

    2670.591     0.000000            3         1.00
    3868.671     0.100000        55526         1.11
    5189.631     0.200000       111201         1.25
    6111.231     0.300000       166684         1.43
    7385.087     0.400000       222341         1.67
    8204.287     0.500000       277984         2.00
    8716.287     0.550000       305353         2.22
    9109.503     0.600000       333100         2.50
    9510.911     0.650000       361000         2.86
    9953.279     0.700000       389927         3.33
   10346.495     0.750000       416615         4.00
   10551.295     0.775000       430281         4.44
   10764.287     0.800000       444321         5.00
   10977.279     0.825000       458356         5.71
   11206.655     0.850000       472177         6.67
   11517.951     0.875000       485909         8.00
   11649.023     0.887500       492901         8.89
   11788.287     0.900000       500109        10.00
   12017.663     0.912500       506582        11.43
   14565.375     0.925000       513515        13.33
   16875.519     0.937500       520495        16.00
   18087.935     0.943750       523937        17.78
   19480.575     0.950000       527398        20.00
   20889.599     0.956250       530869        22.86
   22167.551     0.962500       534331        26.67
   23658.495     0.968750       537824        32.00
   24870.911     0.971875       539535        35.56
   26443.775     0.975000       541272        40.00
   27672.575     0.978125       543007        45.71
   28803.071     0.981250       544740        53.33
   30031.871     0.984375       546473        64.00
   30605.311     0.985938       547358        71.11
   31113.215     0.987500       548217        80.00
   31604.735     0.989062       549099        91.43
   32063.487     0.990625       549946       106.67
   32620.543     0.992188       550832       128.00
   32980.991     0.992969       551254       142.22
   33439.743     0.993750       551684       160.00
   34078.719     0.994531       552126       182.86
   34668.543     0.995313       552554       213.33
   35422.207     0.996094       552984       256.00
   35979.263     0.996484       553196       284.44
   36863.999     0.996875       553416       320.00
   37552.127     0.997266       553636       365.71
   38338.559     0.997656       553851       426.67
   39518.207     0.998047       554065       512.00
   52789.247     0.998242       554171       568.89
   53313.535     0.998437       554300       640.00
   53477.375     0.998633       554390       731.43
   53706.751     0.998828       554505       853.33
   53936.127     0.999023       554604      1024.00
   54099.967     0.999121       554666      1137.78
   54231.039     0.999219       554719      1280.00
   54394.879     0.999316       554779      1462.86
   54525.951     0.999414       554829      1706.67
   54657.023     0.999512       554878      2048.00
   54755.327     0.999561       554912      2275.56
   54820.863     0.999609       554934      2560.00
   54919.167     0.999658       554961      2925.71
   55017.471     0.999707       554986      3413.33
   55148.543     0.999756       555013      4096.00
   55214.079     0.999780       555025      4551.11
   55312.383     0.999805       555041      5120.00
   55410.687     0.999829       555052      5851.43
   55574.527     0.999854       555066      6826.67
   56131.583     0.999878       555080      8192.00
   56229.887     0.999890       555087      9102.22
   56295.423     0.999902       555092     10240.00
   56393.727     0.999915       555102     11702.86
   56459.263     0.999927       555108     13653.33
   56557.567     0.999939       555117     16384.00
   56557.567     0.999945       555117     18204.44
   56590.335     0.999951       555121     20480.00
   56623.103     0.999957       555124     23405.71
   56688.639     0.999963       555128     27306.67
   56721.407     0.999969       555132     32768.00
   56721.407     0.999973       555132     36408.89
   56754.175     0.999976       555134     40960.00
   56786.943     0.999979       555137     46811.43
   56786.943     0.999982       555137     54613.33
   56819.711     0.999985       555138     65536.00
   56852.479     0.999986       555141     72817.78
   56852.479     0.999988       555141     81920.00
   56852.479     0.999989       555141     93622.86
   56852.479     0.999991       555141    109226.67
   56885.247     0.999992       555143    131072.00
   56885.247     0.999993       555143    145635.56
   56885.247     0.999994       555143    163840.00
   56918.015     0.999995       555144    187245.71
   56918.015     0.999995       555144    218453.33
   56918.015     0.999996       555144    262144.00
   56950.783     0.999997       555145    291271.11
   56950.783     0.999997       555145    327680.00
   56950.783     0.999997       555145    374491.43
   56950.783     0.999998       555145    436906.67
   56950.783     0.999998       555145    524288.00
   56983.551     0.999998       555146    582542.22
   56983.551     1.000000       555146          inf
#[Mean    =     8879.930, StdDeviation   =     5441.720]
#[Max     =    56950.784, Total count    =       555146]
#[Buckets =           27, SubBuckets     =         2048]
----------------------------------------------------------
  730491 requests in 1.00m, 1.44GB read
  Socket errors: connect 0, read 434, write 92, timeout 12
Requests/sec:  12175.28
Transfer/sec:     24.59MB

There are a few timeout errors, but note that we can't really sustain the request rate we asked for: it sits at 12k req/s. The single core on the SUT maxes out, and queueing buildup happens. I've attached the histogram plot (note the x axis, which is compressed quite a lot so you can see the high latencies). It also includes Nim, another single-threaded solution, which makes for a fair comparison. The Go solution uses all 8 cores, so it has considerably less work to do per core, i.e., that comparison isn't really fair. The wai solution is a Haskell framework, but it also utilizes all 8 cores.

[attached: latency histogram plot comparing cohttp, Nim, Go, and wai]

I have yet to try an Lwt solution with cohttp. It may perform totally differently, but I'm not sure where to start from. I thought about starting from Hannes's file server examples, but I'm not really sure that is the way to go. Also, tuning of the OCaml system is something I'm very interested in.
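For reference, an Lwt counterpart of the Async handler above could be cut roughly as follows. This is a minimal sketch assuming the `Cohttp_lwt_unix.Server` API (`Server.make`, `Server.create`, `Server.respond_string`); exact signatures and package names (`cohttp.lwt`) may differ across cohttp versions, and the response body is elided here.

```ocaml
(* Hypothetical Lwt port of the Async server above; build flags assumed:
   corebuild main_lwt.native -pkg cohttp.lwt *)

open Cohttp_lwt_unix

let callback _conn req _body =
  match Uri.path (Cohttp.Request.uri req) with
  | "/" ->
    (* same Alice paragraph as the Async version, elided here *)
    Server.respond_string ~status:`OK ~body:"CHAPTER I. ..." ()
  | _ ->
    Server.respond_string ~status:`Not_found ~body:"Route not found" ()

let () =
  Lwt_main.run
    (Server.create ~mode:(`TCP (`Port 8080)) (Server.make ~callback ()))
```

Since both versions would be single-threaded, this should make for the same kind of fair comparison as the Async run, just swapping the scheduler.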
