C# SIMD sums: processing speed of summing a byte array with Add from Intrinsics and Numerics

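This is the app I used to sum a byte array with Add from Intrinsics and Numerics: f:id:gogowaten:20200228143015p:plain
It might not run on a CPU that does not support AVX2?
On Intel CPUs, support starts with the Haswell core, so 4th generation, 2013 onward
On AMD CPUs, support starts with the Zen core, so every Ryzen, and Athlon parts built on Zen, 2017 onward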

But then I read this
pc.watch.impress.co.jp

In the Bulldozer family, 256-bit-wide AVX2 instructions were converted by the front end into two MacroOPs (Fast-Path Double).

so maybe it does run on those after all; it seems like it would
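Rather than guessing from the CPU generation, the app could ask the runtime at startup. Here is a minimal sketch of that check; the class name `SimdSupport` and the returned messages are made up for illustration and are not part of the downloadable app. `Avx2.IsSupported` and `Vector.IsHardwareAccelerated` are the real .NET properties; calling an AVX2 intrinsic when `Avx2.IsSupported` is false throws `PlatformNotSupportedException`.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, not part of the original app.
public static class SimdSupport
{
    // Describes which of the code paths in this article could run on this machine.
    public static string Describe()
    {
        if (Avx2.IsSupported)
        {
            return "AVX2 available: the Intrinsics tests can run";
        }
        if (Vector.IsHardwareAccelerated)
        {
            return "No AVX2, but Vector<T> is accelerated: the Numerics tests can run";
        }
        return "No SIMD acceleration: only the plain loops are useful";
    }
}
```

Showing `SimdSupport.Describe()` in one of the TextBlocks would make it obvious why a button fails on an older CPU.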

The download is on GitHub
The file name is 20200227_IntrinsicsAdd.zip
github.com

Build and measurement environment

CPU AMD Ryzen 5 2400G (4 cores, 8 threads)
MEM DDR4-2666
Windows 10 Home 64bit
Visual Studio 2019 Community .NET Core 3.1 WPF C#

Adding references is a hassle with .NET Framework, so I used .NET Core 3.1

f:id:gogowaten:20200228143144p:plain
Running it
The byte array has 10 million elements, all set to 255; below are the processing times for computing its sum 1000 times

Test	SIMD	Accumulator type	Multithreaded
Test1	None	long	No
Test2	None	long	Yes
Test3	None	long	No
Test4	Intrinsics	int	No
Test5	Intrinsics	int	Yes
Test6	Intrinsics	int	Yes
Test7	Intrinsics	long	Yes
Test8	Numerics	uint	No
Test9	Numerics	uint	Yes
Test10	Numerics	long	Yes

Sum test (times in seconds)	Batch run 1	Batch run 2	Batch run 3	Individual run	Mean	Variance	Mean 2
Test1_Normal	3.642	5.338	5.391	5.459	4.958	0.579	5.396
Test2_Normal_MT	1.387	1.364	1.427	1.792	1.493	0.030	1.493
Test3_Normal4	3.395	3.414	3.391	3.426	3.407	0.000	3.407
Test4_Intrinsics_int	0.932	0.953	0.945	0.943	0.943	0.000	0.943
Test5_Intrinsics_int_MT	0.388	0.402	0.392	0.365	0.387	0.000	0.387
Test6_Intrinsics_int_MT2	0.417	0.444	0.400	0.395	0.414	0.000	0.414
Test7_Intrinsics_long_MT	0.514	0.531	0.590	0.547	0.546	0.001	0.546
Test8_Numerics_uint	1.406	1.356	1.349	1.376	1.372	0.000	1.372
Test9_Nunerics_uint_MT	0.556	0.502	0.505	0.453	0.504	0.001	0.504
Test10_Numerics_long_MT	0.958	0.964	0.899	0.892	0.928	0.001	0.928

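Only the first run of Test1 gave an odd value, so I compare using Mean 2, which ignores that run (for every other test, Mean 2 is the same as Mean)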
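To make the Mean, Variance, and Mean 2 columns reproducible, here is a small sketch of the arithmetic for the Test1 row. The class and method names are hypothetical; the assumption that Variance is the population variance (dividing by the number of runs) is mine, chosen because it reproduces the 0.579 in the table.

```csharp
using System;
using System.Linq;

// Hypothetical helper for checking the table columns.
public static class TimingStats
{
    // Plain mean of all runs (the Mean column).
    public static double Mean(double[] seconds) => seconds.Average();

    // Population variance: squared deviations divided by the run count (the Variance column).
    public static double PopulationVariance(double[] seconds)
    {
        double mean = seconds.Average();
        return seconds.Sum(s => (s - mean) * (s - mean)) / seconds.Length;
    }

    // Mean with the first run dropped (the Mean 2 column for Test1).
    public static double MeanSkippingFirst(double[] seconds) => seconds.Skip(1).Average();
}
```

For Test1, `new[] { 3.642, 5.338, 5.391, 5.459 }` gives a mean of about 4.958, a variance of about 0.579, and a mean without the first run of about 5.396.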

As graphs
f:id:gogowaten:20200228142606p:plain
f:id:gogowaten:20200228144206p:plain
Light blue is Intrinsics, orange is Numerics
This time Intrinsics came out faster than Numerics

Added using directives
f:id:gogowaten:20200228150420p:plain

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Numerics;
using System.Collections.Concurrent;
using System.Diagnostics;

f:id:gogowaten:20200228150540p:plain
As usual, the byte array used for the calculation is kept in a field (line 19)
and filled with values when the app starts (line 26)

f:id:gogowaten:20200228150552p:plain
This time it is filled with 255, the maximum byte value, and this array is passed to each method.
Timing is measured with
f:id:gogowaten:20200228151051p:plain
the usual method

Plain addition without SIMD f:id:gogowaten:20200228151415p:plain
Slow. This is the baseline for everything that follows
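The "x times faster" figures in the rest of this post are the Test1 baseline (Mean 2 = 5.396 s) divided by the Mean 2 of the test in question. A tiny sketch of that calculation, with a hypothetical class name:

```csharp
using System;

// Hypothetical helper showing how the speedup figures are derived.
public static class Speedup
{
    // Baseline time divided by the test's time, rounded to one decimal place.
    public static double Factor(double baselineSeconds, double testSeconds)
        => Math.Round(baselineSeconds / testSeconds, 1);
}
```

For example, `Speedup.Factor(5.396, 1.493)` is the figure for Test2 and `Speedup.Factor(5.396, 0.387)` is the one for Test5.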

Multithreaded with Parallel.ForEach
f:id:gogowaten:20200228151626p:plain
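With Partitioner the array is split into as many ranges as the CPU has threads (range size = array length / thread count). When it does not divide evenly, one extra range is created and the leftover elements go into it. (All the multithreaded tests from here on split the array the same way.)
Multithreading made it 3.6 times faster, roughly the speedup you would expect from the CPU's 4 cores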
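To see exactly which ranges come out of `Partitioner.Create(0, length, length / threadCount)`, the ranges can be listed without running any work. This is a minimal sketch; `PartitionDemo` and `Ranges` are hypothetical names, and it assumes `length >= threadCount`, because a range size of 0 makes `Partitioner.Create` throw.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original app.
public static class PartitionDemo
{
    // Lists the (From, To) index ranges produced when the range size is
    // length / threadCount, the same split the multithreaded tests use.
    public static List<(int From, int To)> Ranges(int length, int threadCount)
    {
        int rangeSize = length / threadCount; // assumes length >= threadCount
        return Partitioner.Create(0, length, rangeSize)
            .GetDynamicPartitions()
            .Select(t => (From: t.Item1, To: t.Item2))
            .OrderBy(r => r.From)
            .ToList();
    }
}
```

`PartitionDemo.Ranges(10, 3)` yields (0,3), (3,6), (6,9) and an extra (9,10) holding the leftover element, while 10,000,000 elements on 8 threads divide evenly into exactly 8 ranges.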

Single-threaded, but processing 4 elements per iteration of the for loop
f:id:gogowaten:20200228152246p:plain
This was faster when I tried it last time, so I tried it again, and it came out 1.6 times faster.

From here on, Intrinsics, which uses SIMD

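Single-threaded, accumulating in an int Vector256 f:id:gogowaten:20200228152556p:plain
The method is the same as last time; I only changed Avx2.Min, which found the minimum, to Avx2.Add for addition (line 118)
This is 5.7 times faster
A byte Vector would overflow, so to add as int I use Avx2.ConvertToVector256Int32 to create an int Vector256 from a pointer into the byte array (line 118)
I think this is why it is faster than Numerics: it goes from byte directly to int
Intrinsics: byte→int
Numerics: byte→ushort→uint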

Multithreaded
f:id:gogowaten:20200228153946p:plain
14 times faster, the fastest result this time
Test6 is a slightly modified version of this
f:id:gogowaten:20200228154233p:plain
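Slightly slower. Each thread keeps its subtotal as a Vector (line 192),
and everything is totaled after leaving the multithreaded section (from line 212)
I wondered whether it would help, but it turned out slower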

Test5 changed to add as long
f:id:gogowaten:20200228154942p:plain
The number of elements handled per operation drops from 8 for int to 4, so it gets slower, but the sums it can hold get larger
Speed is 9.9x; int gave 14x, so the drop is 9.9/14=0.70714286, which is less than I expected

From here on, Numerics

Single-threaded
f:id:gogowaten:20200228155545p:plain
Speed is 3.9x
With Numerics, getting from byte to a 32-bit integer (uint here) takes two conversions with the Widen method, and that is probably what makes it slower than Intrinsics
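The two-step widening is easier to see on a single block. The sketch below sums one `Vector<byte>`-sized block through byte→ushort→uint; `WidenDemo` and `SumOneBlock` are hypothetical names, and the caller is assumed to pass an offset with at least `Vector<byte>.Count` bytes after it. (In a WPF file, `Vector` clashes with `System.Windows.Vector`, which is why the app's code spells out `System.Numerics.Vector`.)

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of the original app.
public static class WidenDemo
{
    // Sums one Vector<byte>-sized block via the byte -> ushort -> uint chain.
    public static long SumOneBlock(byte[] source, int offset)
    {
        var bytes = new Vector<byte>(source, offset);

        // First widening: one byte vector becomes two ushort vectors.
        Vector.Widen(bytes, out Vector<ushort> low16, out Vector<ushort> high16);

        // Second widening: each ushort vector becomes two uint vectors.
        Vector.Widen(low16, out Vector<uint> a, out Vector<uint> b);
        Vector.Widen(high16, out Vector<uint> c, out Vector<uint> d);

        // Four uint vectors are added lane by lane, then the lanes are totaled.
        Vector<uint> sum = a + b + c + d;
        long total = 0;
        for (int i = 0; i < Vector<uint>.Count; i++)
        {
            total += sum[i];
        }
        return total;
    }
}
```

One byte vector turns into four uint vectors, so each loop iteration costs three Widen calls and four additions, compared with one conversion and one addition per 8 bytes in the Intrinsics version.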

Multithreaded
f:id:gogowaten:20200228160013p:plain
Speed is 10.7x; compared with single-threaded that is 10.7/3.9=2.7435897, a somewhat better gain than Intrinsics got

Switched to long
f:id:gogowaten:20200228160518p:plain
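Speed is 5.8x; the drop from uint is 5.8/10.7=0.54205607, so roughly half. It is a bigger drop than with Intrinsics because the number of elements per operation halves and there is one more conversion: byte→ushort→uint→ulong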

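Limits of what can be calculated
When summing a byte array single-threaded with Vector256<int>, the vector part can take at most 67,372,032 array elements, about 67.37 million.
The maximum int value, int.MaxValue = 2147483647, is about 2.1 billion
The element count of an int Vector is Vector256<int>.Count = 8
The maximum byte value is 255
The largest state that can be calculated is each of the Vector's 8 elements reaching int.MaxValue, but there are no fractions, so the largest multiple of 255 that does not exceed int.MaxValue is the real maximum (maximum count)
That is
int.MaxValue / 255 = 8421504.498
and dropping the fraction gives
8421504. This is the maximum count per Vector element; multiplying it by Vector256<int>.Count, which is 8, gives
8421504 * 8 = 67372032
This is the maximum number of byte array elements that can be summed in a Vector256<int>
To be exact, the leftover elements that do not divide evenly by 8 are totaled separately as long, so adding the maximum leftover count of 7 gives
67372032 + 7 = 67372039
That is about 67.37 million, roughly twice the pixel count of 8K resolution
8K has 7680*4320=33177600 pixels, about 33.17 million
And cases where every value is 255 are rare, so I think this is plenty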
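The arithmetic above fits in a few lines of code. This sketch uses hypothetical names (`OverflowLimit`, `MaxElements`) and assumes, as in Test4, that the loop step equals the number of accumulator lanes, so at most `laneCount - 1` elements are left over for the scalar loop.

```csharp
using System;

// Hypothetical helper reproducing the limit calculation.
public static class OverflowLimit
{
    // Largest element count that cannot overflow when every byte is 255.
    // laneMax:   largest value one accumulator lane can hold (e.g. int.MaxValue)
    // laneCount: number of lanes in the accumulator (8 for Vector256<int>)
    public static long MaxElements(long laneMax, int laneCount)
    {
        long perLane = laneMax / 255;              // integer division drops the fraction
        long vectorElements = perLane * laneCount; // elements the vector loop can take
        return vectorElements + (laneCount - 1);   // leftovers are summed separately as long
    }
}
```

`OverflowLimit.MaxElements(int.MaxValue, 8)` gives 67372039, matching the figure worked out by hand.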
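If that is not enough, the array can be split up and processed in pieces, and multithreading splits it anyway, so that kills two birds with one stone. Split into 8 ranges as in this multithreaded setup, each thread can take 67.37 million, so multiplying by 8 gives
67372039 * 8 = 538976312, which is over 500 million (up to 7 leftover elements go into an extra range of their own, so with 8 threads the actual limit is 538976319)
If even that is not enough, just split it more finely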
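Splitting does not have to mean more threads. Here is a minimal single-threaded sketch of the "split it more finely" idea, written with Numerics so it does not need unsafe code: the uint accumulator is flushed into a long before any lane can overflow. Everything here (`ChunkedSum`, `Sum`, `MaxStepsPerFlush`, the `stepsPerFlush` parameter) is hypothetical and not part of the downloadable app.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of the original app.
public static class ChunkedSum
{
    // Each uint lane receives 4 bytes per step (Vector<byte>.Count / Vector<uint>.Count),
    // so it grows by at most 4 * 255 per step; this many steps cannot overflow a uint.
    public const int MaxStepsPerFlush = (int)(uint.MaxValue / (4 * 255));

    public static long Sum(byte[] source, int stepsPerFlush = MaxStepsPerFlush)
    {
        if (stepsPerFlush < 1 || stepsPerFlush > MaxStepsPerFlush)
        {
            throw new ArgumentOutOfRangeException(nameof(stepsPerFlush));
        }
        int step = Vector<byte>.Count;
        int lastIndex = source.Length - (source.Length % step);
        long total = 0;
        int i = 0;
        while (i < lastIndex)
        {
            // Start a fresh accumulator for this chunk.
            var acc = Vector<uint>.Zero;
            int chunkEnd = (int)Math.Min((long)lastIndex, i + (long)stepsPerFlush * step);
            for (; i < chunkEnd; i += step)
            {
                Vector.Widen(new Vector<byte>(source, i), out Vector<ushort> lo, out Vector<ushort> hi);
                Vector.Widen(lo, out Vector<uint> a, out Vector<uint> b);
                Vector.Widen(hi, out Vector<uint> c, out Vector<uint> d);
                acc += a + b + c + d;
            }
            // Flush the lanes into the long total before they can overflow.
            for (int lane = 0; lane < Vector<uint>.Count; lane++)
            {
                total += acc[lane];
            }
        }
        // Leftover elements that do not fill a whole vector.
        for (; i < source.Length; i++)
        {
            total += source[i];
        }
        return total;
    }
}
```

The optional `stepsPerFlush` argument exists only so the flushing path can be exercised with small arrays, for example `ChunkedSum.Sum(data, 3)`. I have not measured how this compares with the timings in the table.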

Element count 67372039 f:id:gogowaten:20200228194451p:plain
This is the limit for Test4
67372039 * 255 = 17179869945
Adding one more element gives
f:id:gogowaten:20200228194734p:plain
Test4 overflows

Element count 538976319
f:id:gogowaten:20200228195043p:plain
This is the limit for Test5 and Test6, but Test8, which calculates in uint, had already overflowed before that (lol)
Adding one more element gives f:id:gogowaten:20200228195311p:plain
Test5 and Test6 overflow too
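Why exactly one more element tips Test5 over can be checked without allocating half a gigabyte. The sketch below, with hypothetical names, assumes the same split as Test5 (range size = length / threadCount, leftovers in an extra, smaller range) and every byte equal to 255, and computes the largest sum any single int lane would reach.

```csharp
using System;

// Hypothetical helper, not part of the original app.
public static class PartitionOverflowCheck
{
    // Largest per-lane sum in any range when every byte is 255.
    // Assumes length >= threadCount so that the range size is at least 1.
    public static long MaxLaneSum(int length, int threadCount, int laneCount)
    {
        int rangeSize = length / threadCount;                      // size of every full range
        int vectorElements = rangeSize - (rangeSize % laneCount);  // part summed with SIMD
        long addsPerLane = vectorElements / laneCount;             // additions into each lane
        return addsPerLane * 255;
    }

    // True when an int lane would overflow.
    public static bool Overflows(int length, int threadCount, int laneCount)
        => MaxLaneSum(length, threadCount, laneCount) > int.MaxValue;
}
```

With 8 threads and 8 lanes, 538976319 elements give ranges of 67372039 and a lane sum of 2147483520, which still fits in an int; 538976320 elements give ranges of 67372040 and a lane sum of 2147483775, which does not.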
The app's memory usage at this point
f:id:gogowaten:20200228195731p:plain
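Over 500MB
538976320 bytes is
538976.320KB, which is
538.976320MB. I see!
More to the point, memory usage for the whole PC is above 70%. I would like to try even larger element counts, but that is tough. Try anything slightly unusual and 16GB of memory is not enough; I want 32GB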

I wrote this post in Markdown mode. Compared with the usual WYSIWYG mode, not being able to paste from Excel is a hassle, but being able to paste code as-is is great!
f:id:gogowaten:20200228204317p:plain
With the real-time preview open I want more width; a multi-monitor setup or a higher-resolution monitor would make it comfortable

MainWindow.xaml

<Window x:Class="_20200227_IntrinsicsAdd.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:_20200227_IntrinsicsAdd"
        mc:Ignorable="d"
       Title="MainWindow" Height="400" Width="614">
  <Grid>
    <StackPanel>
      <StackPanel.Resources>
        <Style TargetType="StackPanel">
          <Setter Property="Margin" Value="2"/>
        </Style>
        <Style TargetType="Button">
          <Setter Property="Width" Value="60"/>
        </Style>
        <Style TargetType="TextBlock">
          <Setter Property="Margin" Value="2,0"/>
        </Style>
      </StackPanel.Resources>
      <TextBlock x:Name="MyTextBlock" Text="text" HorizontalAlignment="Center" FontSize="20"/>
      <TextBlock x:Name="MyTextBlockVectorCount" Text="vectorCount" HorizontalAlignment="Center"/>
      <TextBlock x:Name="MyTextBlockCpuThreadCount" Text="threadCount" HorizontalAlignment="Center"/>
      <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
        <Button x:Name="ButtonAll" Content="Run all tests" Margin="20,0" Width="120"/>
        <TextBlock x:Name="TbAll" Text="time"/>
        <!--<Button x:Name="ButtonReset" Content="reset" Margin="20,0"/>-->

      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button1" Content="test1"/>
        <TextBlock x:Name="Tb1" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button2" Content="test2"/>
        <TextBlock x:Name="Tb2" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button3" Content="test3"/>
        <TextBlock x:Name="Tb3" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button4" Content="test4"/>
        <TextBlock x:Name="Tb4" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button5" Content="test5"/>
        <TextBlock x:Name="Tb5" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button6" Content="test6"/>
        <TextBlock x:Name="Tb6" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button7" Content="test7"/>
        <TextBlock x:Name="Tb7" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button8" Content="test8"/>
        <TextBlock x:Name="Tb8" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button9" Content="test9"/>
        <TextBlock x:Name="Tb9" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button10" Content="test10"/>
        <TextBlock x:Name="Tb10" Text="time"/>
      </StackPanel>

    </StackPanel>
  </Grid>
</Window>

MainWindow.xaml.cs

using System;
using System.Threading.Tasks;
using System.Windows;
using System.Windows.Controls;

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Numerics;
using System.Collections.Concurrent;
using System.Diagnostics;

namespace _20200227_IntrinsicsAdd
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        private byte[] MyArray;
        private const int LOOP_COUNT = 1000;
        private const int ELEMENT_COUNT = 10_000_000;// 538_976_319;// 67_372_039;//element count

        public MainWindow()
        {
            InitializeComponent();
            MyInitialize();
            this.Title = this.ToString();

            //var neko = int.MaxValue;
            //var neko = uint.MaxValue;


            MyTextBlock.Text = $"Sum a byte array of {ELEMENT_COUNT.ToString("N0")} elements {LOOP_COUNT} times";
            MyTextBlockVectorCount.Text = $"Vector256<byte>.Count={Vector256<byte>.Count}  Vector<byte>.Count={Vector<byte>.Count}";
            MyTextBlockCpuThreadCount.Text = $"CPU thread count: {Environment.ProcessorCount}";

            ButtonAll.Click += (s, e) => MyExeAll();
            Button1.Click += (s, e) => MyExe(Test1_Normal, Tb1, MyArray);
            Button2.Click += (s, e) => MyExe(Test2_Normal_MT, Tb2, MyArray);
            Button3.Click += (s, e) => MyExe(Test3_Normal4, Tb3, MyArray);
            Button4.Click += (s, e) => MyExe(Test4_Intrinsics_int, Tb4, MyArray);
            Button5.Click += (s, e) => MyExe(Test5_Intrinsics_int_MT, Tb5, MyArray);
            Button6.Click += (s, e) => MyExe(Test6_Intrinsics_int_MT2, Tb6, MyArray);
            Button7.Click += (s, e) => MyExe(Test7_Intrinsics_long_MT, Tb7, MyArray);
            Button8.Click += (s, e) => MyExe(Test8_Numerics_uint, Tb8, MyArray);
            Button9.Click += (s, e) => MyExe(Test9_Nunerics_uint_MT, Tb9, MyArray);
            Button10.Click += (s, e) => MyExe(Test10_Numerics_long_MT, Tb10, MyArray);
        }

        //Plain addition
        private long Test1_Normal(byte[] vs)
        {
            long total = 0;
            for (int i = 0; i < vs.Length; i++)
            {
                total += vs[i];
            }
            return total;
        }

        //Plain addition, multithreaded
        private long Test2_Normal_MT(byte[] vs)
        {
            long total = 0;
            Parallel.ForEach(
                Partitioner.Create(0, vs.Length, vs.Length / Environment.ProcessorCount),
                (range) =>
                {
                    long subtotal = 0;
                    for (int i = range.Item1; i < range.Item2; i++)
                    {
                        subtotal += vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }


        //Plain addition with 4 additions per for iteration
        private long Test3_Normal4(byte[] vs)
        {
            long total = 0;
            int lastIndex = vs.Length - (vs.Length % 4);
            for (int i = 0; i < lastIndex; i += 4)
            {
                total += vs[i];
                total += vs[i + 1];
                total += vs[i + 2];
                total += vs[i + 3];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i];
            }
            return total;
        }

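        //Intrinsics + single-threaded, int
        //Safe up to 67_372_039 elements (about 67.37 million); beyond that it may overflow
        //Vector256<int>.Count is 8 and each lane holds up to the int maximum 2147483647, so in total 2147483647*8=1.7179869e+10 (about 17.1 billion)
        //If every element is the byte maximum 255, the count that fits in 17.1 billion is 17179869176/255=67372035.98; drop the fraction
        //They are processed 8 at a time, so dividing by 8 gives 67372035/8=8421504.4; drop the fraction
        //8421504*8=67372032 is the number of elements the Vector can sum without overflow
        //The leftovers are summed as long, so add the maximum leftover count of 7:
        //67372032+7=67372039 is the maximum number of elements that can be summed without overflow
        //8K has 7680*4320=33177600 pixels, about 33.17 million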
        private unsafe long Test4_Intrinsics_int(byte[] vs)
        {
            var vTotal = Vector256<int>.Zero;
            int simdLength = Vector256<int>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);

            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    vTotal = Avx2.Add(vTotal, Avx2.ConvertToVector256Int32(p + i));
                }
            }
            long total = 0;
            int* ip = stackalloc int[simdLength];
            Avx.Store(ip, vTotal);
            for (int j = 0; j < simdLength; j++)
            {
                total += ip[j];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i];
            }
            return total;
        }

        //Intrinsics + multithreaded, int
        //The maximum element count grows by the number of partitions, to about 500 million
        private unsafe long Test5_Intrinsics_int_MT(byte[] vs)
        {
            int simdLength = Vector256<int>.Count;
            long total = 0;
            Parallel.ForEach(
                Partitioner.Create(0, vs.Length, vs.Length / Environment.ProcessorCount),
                (range) =>
                {
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    var vTotal = Vector256<int>.Zero;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            vTotal = Avx2.Add(vTotal, Avx2.ConvertToVector256Int32(p + i));
                        }
                    }
                    int* pp = stackalloc int[simdLength];
                    Avx.Store(pp, vTotal);
                    long subtotal = 0;
                    for (int i = 0; i < simdLength; i++)
                    {
                        subtotal += pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }

        //Variation of the above: split into the range evenly divisible by the CPU thread count and the remainder
        //Intrinsics + multithreaded 2, int
        //
        private unsafe long Test6_Intrinsics_int_MT2(byte[] vs)
        {
            int simdLength = Vector256<int>.Count;
            var bag = new ConcurrentBag<Vector256<int>>();
            //evenly divisible range
            int block = vs.Length - (vs.Length % (simdLength * Environment.ProcessorCount));

            Parallel.ForEach(
                Partitioner.Create(0, block, block / Environment.ProcessorCount),
                (range) =>
                {
                    var vTotal = Vector256<int>.Zero;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < range.Item2; i += simdLength)
                        {
                            vTotal = Avx2.Add(vTotal, Avx2.ConvertToVector256Int32(p + i));
                        }
                    }
                    bag.Add(vTotal);

                });
            #region bag totaling 1: slow and overflows sooner, at 67_372_096 elements (about 67.37 million)
            //Vector256<int> vv = Vector256<int>.Zero;
            //foreach (var item in bag)
            //{
            //    vv = Avx2.Add(vv, item);
            //}

            //int* ptr = stackalloc int[simdLength];
            //Avx.Store(ptr, vv);
            //long total = 0;
            //for (int i = 0; i < simdLength; i++)
            //{
            //    total += ptr[i];
            //}
            #endregion

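            #region bag totaling 2: this one is better, and the maximum element count grows by the number of partitions to about 500 million
            long total = 0;
            //allocate the scratch buffer once, outside the loop, so the stack does not grow per item
            int* pp = stackalloc int[simdLength];
            foreach (var item in bag)
            {
                Avx.Store(pp, item);
                for (int i = 0; i < simdLength; i++)
                {
                    total += pp[i];
                }
            }
            #endregion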

            for (int i = block; i < vs.Length; i++)
            {
                total += vs[i];
            }
            return total;
        }

        //Intrinsics + multithreaded 3, long
        //Calculates in long: slower, but it can hold larger sums
        private unsafe long Test7_Intrinsics_long_MT(byte[] vs)
        {
            int simdLength = Vector256<long>.Count;
            long total = 0;
            Parallel.ForEach(
                Partitioner.Create(0, vs.Length, vs.Length / Environment.ProcessorCount),
                (range) =>
                {
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    var vTotal = Vector256<long>.Zero;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            vTotal = Avx2.Add(vTotal, Avx2.ConvertToVector256Int64(p + i));
                        }
                    }
                    long* pp = stackalloc long[simdLength];
                    Avx.Store(pp, vTotal);
                    long subtotal = 0;
                    for (int i = 0; i < simdLength; i++)
                    {
                        subtotal += pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }


        //Numerics, single-threaded, uint
        private unsafe long Test8_Numerics_uint(byte[] vs)
        {

            int simdLength = Vector<byte>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            Vector<uint> v = new Vector<uint>();
            for (int i = 0; i < lastIndex; i += simdLength)
            {
                System.Numerics.Vector.Widen(new Vector<byte>(vs, i), out Vector<ushort> vv1, out Vector<ushort> vv2);
                System.Numerics.Vector.Widen(vv1, out Vector<uint> ui1, out Vector<uint> ui2);
                System.Numerics.Vector.Widen(vv2, out Vector<uint> ui3, out Vector<uint> ui4);
                v = System.Numerics.Vector.Add(v, ui1);
                v = System.Numerics.Vector.Add(v, ui2);
                v = System.Numerics.Vector.Add(v, ui3);
                v = System.Numerics.Vector.Add(v, ui4);
            }
            long total = 0;
            for (int j = 0; j < Vector<uint>.Count; j++)
            {
                total += v[j];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i];
            }
            return total;
        }

        //Numerics, multithreaded, uint
        private long Test9_Nunerics_uint_MT(byte[] ary)
        {
            long total = 0;
            int simdLength = Vector<byte>.Count;
            Parallel.ForEach(
                Partitioner.Create(0, ary.Length, ary.Length / Environment.ProcessorCount),
                (range) =>
                {
                    int lastIndex = range.Item2 - ((range.Item2 - range.Item1) % simdLength);
                    var v = new Vector<uint>();
                    for (int i = range.Item1; i < lastIndex; i += simdLength)
                    {
                        System.Numerics.Vector.Widen(new Vector<byte>(ary, i), out Vector<ushort> vv1, out Vector<ushort> vv2);
                        System.Numerics.Vector.Widen(vv1, out Vector<uint> ui1, out Vector<uint> ui2);
                        System.Numerics.Vector.Widen(vv2, out Vector<uint> ui3, out Vector<uint> ui4);
                        v = System.Numerics.Vector.Add(v, ui1);
                        v = System.Numerics.Vector.Add(v, ui2);
                        v = System.Numerics.Vector.Add(v, ui3);
                        v = System.Numerics.Vector.Add(v, ui4);
                    }
                    long subtotal = 0;
                    for (int i = 0; i < Vector<uint>.Count; i++)
                    {
                        subtotal += v[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += ary[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }

        //Numerics, multithreaded, long (accumulates in Vector<ulong>)
        private long Test10_Numerics_long_MT(byte[] ary)
        {
            long total = 0;
            int simdLength = Vector<byte>.Count;
            Parallel.ForEach(
                Partitioner.Create(0, ary.Length, ary.Length / Environment.ProcessorCount),
                (range) =>
                {
                    int lastIndex = range.Item2 - ((range.Item2 - range.Item1) % simdLength);
                    var v = new Vector<ulong>();
                    for (int i = range.Item1; i < lastIndex; i += simdLength)
                    {
                        System.Numerics.Vector.Widen(new Vector<byte>(ary, i), out Vector<ushort> vv1, out Vector<ushort> vv2);
                        System.Numerics.Vector.Widen(vv1, out Vector<uint> ui1, out Vector<uint> ui2);
                        System.Numerics.Vector.Widen(vv2, out Vector<uint> ui3, out Vector<uint> ui4);
                        System.Numerics.Vector.Widen(ui1, out Vector<ulong> ul1, out Vector<ulong> ul2);
                        System.Numerics.Vector.Widen(ui2, out Vector<ulong> ul3, out Vector<ulong> ul4);
                        System.Numerics.Vector.Widen(ui3, out Vector<ulong> ul5, out Vector<ulong> ul6);
                        System.Numerics.Vector.Widen(ui4, out Vector<ulong> ul7, out Vector<ulong> ul8);

                        v = System.Numerics.Vector.Add(v, ul1);
                        v = System.Numerics.Vector.Add(v, ul2);
                        v = System.Numerics.Vector.Add(v, ul3);
                        v = System.Numerics.Vector.Add(v, ul4);
                        v = System.Numerics.Vector.Add(v, ul5);
                        v = System.Numerics.Vector.Add(v, ul6);
                        v = System.Numerics.Vector.Add(v, ul7);
                        v = System.Numerics.Vector.Add(v, ul8);

                    }
                    ulong subtotal = 0;
                    for (int i = 0; i < Vector<ulong>.Count; i++)
                    {
                        subtotal += v[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += ary[i];
                    }
                    System.Threading.Interlocked.Add(ref total, (long)subtotal);
                });
            return total;
        }


        #region Unused
        //Numerics, single-threaded, uint, Span
        private unsafe long Test81_Numerics_uint(Span<byte> span)
        {

            int simdLength = Vector<byte>.Count;
            int lastIndex = span.Length - (span.Length % simdLength);
            Vector<uint> v = new Vector<uint>();
            for (int i = 0; i < lastIndex; i += simdLength)
            {
                System.Numerics.Vector.Widen(new Vector<byte>(span.Slice(i)), out Vector<ushort> vv1, out Vector<ushort> vv2);
                System.Numerics.Vector.Widen(vv1, out Vector<uint> ui1, out Vector<uint> ui2);
                System.Numerics.Vector.Widen(vv2, out Vector<uint> ui3, out Vector<uint> ui4);
                v = System.Numerics.Vector.Add(v, ui1);
                v = System.Numerics.Vector.Add(v, ui2);
                v = System.Numerics.Vector.Add(v, ui3);
                v = System.Numerics.Vector.Add(v, ui4);
            }
            long total = 0;
            for (int j = 0; j < Vector<uint>.Count; j++)
            {
                total += v[j];
            }
            for (int i = lastIndex; i < span.Length; i++)
            {
                total += span[i];
            }
            return total;
        }
        //Used together with the method above
        //Numerics, multithreaded, uint: each range is passed as a Span to the single-threaded method
        private long Test91_Nunerics_uint_MT(byte[] ary)
        {
            long total = 0;
            int simdLength = Vector<byte>.Count;
            //Span<byte> span = new Span<byte>(ary);
            Parallel.ForEach(
                Partitioner.Create(0, ary.Length, ary.Length / Environment.ProcessorCount),
                (range) =>
                {
                    var s = new Span<byte>(ary);
                    long subtotal = Test81_Numerics_uint(s.Slice(range.Item1, range.Item2-range.Item1));
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }
        #endregion



        private void MyInitialize()
        {
            MyArray = new byte[ELEMENT_COUNT];

            //fill with the specified value
            var span = new Span<byte>(MyArray);
            span.Fill(255);

            //last element
            //MyArray[ELEMENT_COUNT - 1] = 100;

            //random values
            //var r = new Random();
            //r.NextBytes(MyArray);

            ////repeat the sequence 0 to 255
            //for (int i = 0; i < ELEMENT_COUNT; i++)
            //{
            //    MyArray[i] = (byte)i;
            //}


        }

        #region Timing
        private void MyExe(Func<byte[], long> func, TextBlock tb, byte[] vs)
        {
            long total = 0;
            var sw = new Stopwatch();
            sw.Start();
            for (int i = 0; i < LOOP_COUNT; i++)
            {
                total = func(vs);
            }
            sw.Stop();
            this.Dispatcher.Invoke(() => tb.Text = $"Elapsed: {sw.Elapsed.TotalSeconds.ToString("000.000")} s {total.ToString("N0")} | {func.Method.Name}");
        }


        //for running all tests in one go
        private async void MyExeAll()
        {
            var sw = new Stopwatch();
            sw.Start();
            this.IsEnabled = false;
            await Task.Run(() => MyExe(Test1_Normal, Tb1, MyArray));
            await Task.Run(() => MyExe(Test2_Normal_MT, Tb2, MyArray));
            await Task.Run(() => MyExe(Test3_Normal4, Tb3, MyArray));
            await Task.Run(() => MyExe(Test4_Intrinsics_int, Tb4, MyArray));
            await Task.Run(() => MyExe(Test5_Intrinsics_int_MT, Tb5, MyArray));
            await Task.Run(() => MyExe(Test6_Intrinsics_int_MT2, Tb6, MyArray));
            await Task.Run(() => MyExe(Test7_Intrinsics_long_MT, Tb7, MyArray));
            await Task.Run(() => MyExe(Test8_Numerics_uint, Tb8, MyArray));
            await Task.Run(() => MyExe(Test9_Nunerics_uint_MT, Tb9, MyArray));
            await Task.Run(() => MyExe(Test10_Numerics_long_MT, Tb10, MyArray));

            this.IsEnabled = true;
            sw.Stop();
            TbAll.Text = $"Elapsed: {sw.Elapsed.TotalSeconds.ToString("000.000")} s";
        }
        #endregion Timing

    }
}