Julia　並列化 - 留学・音ゲー・研究備忘録

マルチスレッドとかなんもわかってないけど。コアが物理でスレッドが仮想（コア内に2こくらいある）でおｋ？とにかくとりあえずメモ。今後追加していく。現状、Julia v1.3である。なお以下の内容は

Announcing composable multi-threaded parallelism in Julia

をかいつまんだもの。

スレッドの数を設定

Julia起動前にターミナルで

$ export JULIA_NUM_THREADS=n

を実行。ただしnは任意の整数。もちろんコンピューターが持ってるスレッド以上の数を指定しても以下では無視されるけど。

正しく設定できてるか確認

上記のセットが終わればJuliaを起動して、

Threads.nthreads() # いくつスレッドが使われてるか
Threads.threadid()  # 現在のスレッドのID

# またはタスク振り分けを画面に表示させてみる
Threads.@threads for i = 1:10
           println("i = $i on thread $(Threads.threadid())")
       end

テスト

並列化の恩恵について以下の関数で見てみる。関数自体は1次元配列vを受け取ったときに中身をソートするものだけど、vがある一定サイズの場合、並列化を行う。最初に入れたvを半分に分割し、1~size(n)/2までの方を並列化、一定サイズになるまではそこに再帰的に関数を適用している。

import Base.Threads.@spawn

# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
    if lo >= hi                       # 1 or 0 elements; nothing to do
        return v
    end
    if hi - lo < 100000               # below some cutoff, run in serial
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end

    mid = (lo+hi)>>>1                 # find the midpoint

    half = @spawn psort!(v, lo, mid)  # task to sort the lower half; will run
    psort!(v, mid+1, hi)              # in parallel with the current call sorting
                                      # the upper half
    wait(half)                        # wait for the lower half to finish

    temp = v[lo:mid]                  # workspace for merging

    i, k, j = 1, lo, mid+1            # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j
        v[k] = temp[i]
        k += 1
        i += 1
    end

    return v
end

重要なのは

half = @spawn psort!(v, lo, mid)  # task to sort the lower half; will run
    psort!(v, mid+1, hi)              # in parallel with the current call sorting
                                      # the upper half
    wait(half)                        # wait for the lower half to finish

    temp = v[lo:mid]                  # workspace for merging

で再帰的に関数を適用してるから、結果的には最初の2分割で上下ともに並列化されてるんだと思われる。

MacBool Pro 13 inch 2017、2.3 GHz デュアルコア Intel Core i5での結果は

julia> a = rand(20000000);

 b = copy(a); @time sort!(b, alg = MergeSort);   # single-threaded
  2.120218 seconds (111.87 k allocations: 82.144 MiB, 31.72% gc time)

julia> b = copy(a); @time sort!(b, alg = MergeSort);
  1.985665 seconds (11 allocations: 76.294 MiB)

julia> b = copy(a); @time psort!(b);    # four threads
  1.760655 seconds (332.00 k allocations: 702.748 MiB, 4.95% gc time)

julia> b = copy(a); @time psort!(b);
  1.101960 seconds (3.77 k allocations: 686.923 MiB, 4.72% gc time)

と早くなってるのがわかる。

とまあ、参照したサイトのはわかるけど、これをどうやって数値計算に落とし込むかが〜〜〜。