C# / .NET Core: dot products with SIMD. For byte arrays, MultiplyAddAdjacent was the fastest

I took a byte array of 10 million elements, all set to 255, and tried several ways of computing its dot product with itself

Plain multiply and add, without SIMD
SIMD via System.Numerics
- Dot
SIMD via System.Runtime.Intrinsics
- AVX Multiply + Add
- SSE2 MultiplyAddAdjacent
- FMA MultiplyAdd
- SSE41 DotProduct

An app that shows the time taken to compute the dot product 1000 times
It can be downloaded from GitHub
File name: 20200301_SIMDでドット積.zip

f:id:gogowaten:20200302140401p:plain
{255, 255, 255, …}
{255, 255, 255, …}
Multiplying these element by element gives 255*255=65025
{65025, 65025, 65025, …}
and all of these are then added up, so
255 * 255 * 10 million = 650,250,000,000 is the correct result
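Written out as a formula, the expected value looks like this. The comparison with the int maximum is my own addition; it is the reason every test in the listing accumulates into a long.

```latex
\sum_{i=1}^{10^{7}} 255 \times 255
  = 65025 \times 10^{7}
  = 650{,}250{,}000{,}000
  \;>\; 2^{31}-1 = 2{,}147{,}483{,}647
```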
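The app shows the elapsed time, the computed result to its right, and then the name of the method that was run
The method names mean the following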

	Vector	SIMD	Method used	Type used in calculation	Multithreaded	Notes
Test1	not used	not used		long
Test2	Numerics	?	Dot	long
Test3	Intrinsics	FMA	MultiplyAdd	float
Test4	Intrinsics	FMA	MultiplyAdd	double
Test5	Intrinsics	AVX	Multiply, Add	long
Test6	Intrinsics	SSE2	MultiplyAddAdjacent	int
Test7	Intrinsics	SSE41	DotProduct	float
Test8	Intrinsics	SSE41	DotProduct	float
Test11	not used	not used		long	yes	MT version of 1
Test12	Numerics	?	Dot	long	yes	MT version of 2
Test13	Intrinsics	FMA	MultiplyAdd	float	yes	MT version of 3
Test14	Intrinsics	FMA	MultiplyAdd	double	yes	MT version of 4
Test15	Intrinsics	AVX	Multiply, Add	long	yes	MT version of 5
Test16	Intrinsics	SSE2	MultiplyAddAdjacent	int	yes	MT version of 6
Test17	Intrinsics	SSE41	DotProduct	float	yes	MT version of 7
Test18	Intrinsics	SSE41	DotProduct	float	yes	MT version of 8
Test23	Intrinsics	FMA	MultiplyAdd	float	yes	13 with the rounding error fixed
Test26	Intrinsics	SSE2	MultiplyAddAdjacent	int	yes	16 with the overflow fixed
Test28	Intrinsics	SSE41	DotProduct	float	yes	18 with the rounding error fixed

About the Test numbers
Test11 to 18 are multithreaded versions of the single-threaded Test1 to 8. The last three, Test23, 26 and 28, fix the incorrect results produced by Test13, 16 and 18

Results of three runs and their average (time for 1000 iterations)

	Run 1	Run 2	Run 3	Average
Test1_Normal	6.164	5.552	5.911	5.876
Test2_Numerics_Dot_long	10.556	10.667	10.636	10.620
Test3_Intrinsics_FMA_MultiplyAdd_float	2.669	2.688	2.702	2.686
Test4_Intrinsics_FMA_MultiplyAdd_double	5.400	5.393	5.483	5.425
Test5_Intrinsics_AVX_Multiply_Add_long	1.922	1.912	1.937	1.924
Test6_Intrinsics_SSE2_MultiplyAddAdjacent_int	0.828	0.709	0.840	0.792
Test7_Intrinsics_SSE41_DotProduct_float	3.384	3.420	3.466	3.423
Test8_Intrinsics_SSE41_DotProduct_float	2.920	2.914	3.002	2.945
Test11_Normal_MT	2.142	2.019	2.043	2.068
Test12_Numerics_Dot_long_MT	3.809	3.306	3.237	3.451
Test13_Intrinsics_FMA_MultiplyAdd_float_MT	0.730	0.577	0.594	0.634
Test14_Intrinsics_FMA_MultiplyAdd_double_MT	1.300	1.062	1.136	1.166
Test15_Intrinsics_AVX_Multiply_Add_long_MT	0.730	0.629	0.611	0.657
Test16_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT	0.374	0.390	0.394	0.386
Test17_Intrinsics_SSE41_DotProduct_float_MT	1.185	1.215	1.200	1.200
Test18_Intrinsics_SSE41_DotProduct_float_MT	0.982	0.954	0.986	0.974
Test23_Intrinsics_FMA_MultiplyAdd_float_MT_Kai	1.118	1.142	1.097	1.119
Test26_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT_Kai	0.353	0.351	0.352	0.352
Test28_Intrinsics_SSE41_DotProduct_float_MT_Kai	2.010	2.025	2.010	2.015

Build and measurement environment

CPU AMD Ryzen 5 2400G (4 cores, 8 threads)

MEM DDR4-2666
Windows 10 Home 64bit
Visual Studio 2019 Community .NET Core 3.1 WPF C#

With .NET Framework, adding the references is a hassle, so I used .NET Core 3.1

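The results as a graph
f:id:gogowaten:20200302143432p:plain
Taking the speed of Test1 as 1
f:id:gogowaten:20200302143525p:plain
The fastest was Test26: more than 16 times faster than the plain Test1, and even among the multithreaded versions it is about 6 times faster than Test11 (16.7 / 2.8 = 5.96). The Sse2 MultiplyAddAdjacent method used in Test26 only takes short vectors. That was no problem at all in this test, because the source array is byte and everything is an integer, but if fractional values had to be handled, Test15 (AVX Multiply and Add) or Test23 (FMA MultiplyAdd) would probably be the better choice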

	Single-threaded	Multithreaded	Error fixed
Test1	1.0	2.8
Test2	0.6	1.7
Test3	2.2	9.3	5.3
Test4	1.1	5.0
Test5	3.1	8.9
Test6	7.4	15.2	16.7
Test7	1.7	4.9
Test8	2.0	6.0	2.9

The multithreaded versions are the Test1x series and the error-fixed versions are the Test2x series
As a graph
f:id:gogowaten:20200302145545p:plain

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Threading.Tasks;
using System.Windows;
using System.Windows.Controls;

f:id:gogowaten:20200302150911p:plain
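Dot product with MultiplyAddAdjacent
Adjacent means neighboring, so the name reads literally as multiply, add, adjacent
Passing two Vector128<short> to MultiplyAddAdjacent returns one Vector128<int>
{0,1,2,3,4,5,6,7}
{0,1,2,3,4,5,6,7}: when these two are passed in,
each pair is multiplied,
{0,1,4,9,16,25,36,49}, which gives this,
and then neighbors are added together, so
{1, 13, 41, 85} is what comes back (line 446 in the screenshot)
The next line, 447, adds this to the running total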
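The example above can be reproduced in a few lines. This is only a minimal sketch: the class and method names (MultiplyAddAdjacentDemo, PairSums) are made up for illustration, and the vectors are built with Vector128.Create instead of the pointer loads used in the tests so that no unsafe code is needed. A scalar fallback is included as an assumption about how to behave on a CPU without SSE2.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class MultiplyAddAdjacentDemo
{
    // Hypothetical helper: multiplies a[i] * b[i] for 8 shorts and
    // returns the 4 sums of neighboring products.
    public static int[] PairSums(short[] a, short[] b)
    {
        if (a.Length != 8 || b.Length != 8)
            throw new ArgumentException("both inputs must have exactly 8 elements");

        var result = new int[4];
        if (Sse2.IsSupported)
        {
            Vector128<short> va = Vector128.Create(a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]);
            Vector128<short> vb = Vector128.Create(b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
            // 8 products are formed, then each adjacent pair is added: 8 shorts in, 4 ints out.
            Vector128<int> sums = Sse2.MultiplyAddAdjacent(va, vb);
            for (int i = 0; i < 4; i++)
                result[i] = sums.GetElement(i);
        }
        else
        {
            // Scalar fallback with the same meaning, for CPUs without SSE2.
            for (int i = 0; i < 4; i++)
                result[i] = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];
        }
        return result;
    }
}
```

Calling PairSums with {0,1,2,3,4,5,6,7} for both arguments gives {1, 13, 41, 85}, the same values as in the walkthrough.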
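Overflow
MultiplyAddAdjacent returns int, so processing an array with many elements in one go exceeds the int maximum and the result goes wrong. The number of elements that can be processed at once follows from
int.MaxValue = 2147483647
byte.MaxValue = 255
Vector128<int>.Count = 4
as follows:
a Vector128<int> holds four int values, so
a total of up to 2147483647 * 4 is fine
The largest value added per element is
255 * 255 = 65025, so dividing by this gives
2147483647 * 4 / (255 * 255) = 132102.03
So when the dot product of a byte array is computed with MultiplyAddAdjacent, the array can have at most 132102 elements
Anything larger has to be split into ranges and processed piece by piece. That happens to be convenient for multithreading as well, and doing so produced
Test26, the fastest this time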

f:id:gogowaten:20200302160252p:plain

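rangeSize on line 515 is the number of elements per range. It is passed to Partitioner.Create, which splits the array into ranges (line 518), and this keeps the accumulation on line 531 from overflowing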
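The idea of sizing each range so that the int accumulator cannot overflow can be sketched as below. This is not the Test26 code itself: the names (ChunkedDotSketch, MaxRangeSize, SumOfSquares) are hypothetical, the vectors are filled element by element with Vector128.Create to stay in safe code (the tests in this article use pointers, which is faster), and the scalar path for CPUs without SSE2 is my own assumption.

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkedDotSketch
{
    // int.MaxValue * 4 lanes / (255 * 255) = 132102 elements per range.
    public static int MaxRangeSize()
    {
        long limit = (long)int.MaxValue * Vector128<int>.Count
                     / (byte.MaxValue * byte.MaxValue);
        return (int)limit;
    }

    // Dot product of a byte array with itself, one partition per range.
    public static long SumOfSquares(byte[] vs)
    {
        if (vs.Length == 0) return 0; // Partitioner.Create rejects an empty range

        long total = 0;
        Parallel.ForEach(Partitioner.Create(0, vs.Length, MaxRangeSize()), range =>
        {
            int simdLength = Vector128<short>.Count; // 8 shorts per step
            int i = range.Item1;
            long subtotal = 0;
            if (Sse2.IsSupported)
            {
                Vector128<int> acc = Vector128<int>.Zero; // per-range int accumulator
                int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                for (; i < lastIndex; i += simdLength)
                {
                    Vector128<short> v = Vector128.Create(
                        (short)vs[i], (short)vs[i + 1], (short)vs[i + 2], (short)vs[i + 3],
                        (short)vs[i + 4], (short)vs[i + 5], (short)vs[i + 6], (short)vs[i + 7]);
                    acc = Sse2.Add(acc, Sse2.MultiplyAddAdjacent(v, v));
                }
                // Widen to long only once per range, after the SIMD loop.
                for (int k = 0; k < Vector128<int>.Count; k++)
                    subtotal += acc.GetElement(k);
            }
            // Leftover elements (or the whole range when SSE2 is unavailable).
            for (; i < range.Item2; i++)
                subtotal += vs[i] * vs[i];

            Interlocked.Add(ref total, subtotal);
        });
        return total;
    }
}
```

With a range of 132102 elements the SIMD loop runs 16512 times and each int lane ends at most at 16512 * 2 * 65025 = 2,147,385,600, which is still below int.MaxValue.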

Test2, Test12
The Dot method of the System.Numerics.Vector class

        private long Test2_Numerics_Dot_long(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector<byte>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            for (int i = 0; i < lastIndex; i += simdLength)
            {
                System.Numerics.Vector.Widen(new Vector<byte>(vs, i),
                    out Vector<ushort> v1, out Vector<ushort> v2);
                System.Numerics.Vector.Widen(v1,
                    out Vector<uint> vv1, out Vector<uint> vv2);
                System.Numerics.Vector.Widen(v2,
                    out Vector<uint> vv3, out Vector<uint> vv4);
                total += System.Numerics.Vector.Dot(vv1, vv1);
                total += System.Numerics.Vector.Dot(vv2, vv2);
                total += System.Numerics.Vector.Dot(vv3, vv3);
                total += System.Numerics.Vector.Dot(vv4, vv4);
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

This was the slowest this time. I already knew that from an earlier test, but maybe I am using it the wrong way

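Test3, 13, 23
f:id:gogowaten:20200303134235p:plain
The MultiplyAdd method, which uses the SIMD extension called FMA
As the name says, it multiplies and adds and returns the result as a Vector
f:id:gogowaten:20200303123450p:plain
It takes three Vectors of float or double and returns 1st * 2nd + 3rd
By the way, unlike Numerics, Intrinsics offers a variety of methods for converting the element type of a Vector, which is handy
MultiplyAdd takes float or double and the source array is byte, so
byte→int is converted with ConvertToVector256Int32
and int→float with ConvertToVector256Single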
A look at what MultiplyAdd computes
f:id:gogowaten:20200303124234p:plain
1st {0 1 2 3 4 5 6 7}
2nd {0 1 2 3 4 5 6 7}
3rd {0 0 0 0 0 0 0 0}
↓
{0 1 4 9 16 25 36 49}
is what comes back. Passing that as the 3rd argument in the next loop iteration gives
f:id:gogowaten:20200303124658p:plain
1st {8 9 10 11 12 13 14 15}
2nd {8 9 10 11 12 13 14 15}
3rd {0 1 4 9 16 25 36 49}
↓
{64 82 104 130 160 194 232 274}
Impressive
It multiplies and adds 8-element Vectors in a single operation, so I expected this to be the fastest, but it came in second
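The two steps above can be tried with a small sketch like the following. The names (FmaAccumulateDemo, SquareAccumulate) are invented for this example, the input is already float to keep it short (the tests convert from byte first), and the scalar branch is an assumption for CPUs without FMA.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class FmaAccumulateDemo
{
    // Hypothetical helper: walks the input 8 values at a time and keeps
    // 8 running totals of value * value, like the loop in Test3.
    public static float[] SquareAccumulate(float[] values)
    {
        if (values.Length % 8 != 0)
            throw new ArgumentException("length must be a multiple of 8");

        var totals = new float[8];
        if (Fma.IsSupported)
        {
            Vector256<float> acc = Vector256<float>.Zero;
            for (int i = 0; i < values.Length; i += 8)
            {
                Vector256<float> v = Vector256.Create(
                    values[i], values[i + 1], values[i + 2], values[i + 3],
                    values[i + 4], values[i + 5], values[i + 6], values[i + 7]);
                // 1st * 2nd + 3rd; the previous result is fed back in as the 3rd argument.
                acc = Fma.MultiplyAdd(v, v, acc);
            }
            for (int k = 0; k < 8; k++)
                totals[k] = acc.GetElement(k);
        }
        else
        {
            // Scalar fallback: the same 8 running totals without SIMD.
            for (int i = 0; i < values.Length; i++)
                totals[i % 8] += values[i] * values[i];
        }
        return totals;
    }
}
```

Feeding it the values 0 to 15 returns {64, 82, 104, 130, 160, 194, 232, 274}, matching the second step above.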
As it stands, the result is exact only while the accumulated total stays within 16777215

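www.cc.kyoto-su.ac.jp After reading this page,
I assumed the largest integer a float can hold exactly would be whatever fits in the 23 bits called the mantissa, but when I tried it, 24 bits worked. (The stored mantissa is 23 bits, but normalized values carry an implicit leading 1, so the effective precision is 24 bits.) In the Windows calculator
f:id:gogowaten:20200303130559p:plain
in decimal,
23 bits is 8388607
24 bits is 16777215
Each element of the Vector256<float> returned by MultiplyAdd can go up to 16777215, so the number of elements that can be processed at once works out as follows
Vector256<float> has 8 elements, so the upper limit is 16777215 * 8
The byte maximum is 255 and it is multiplied by itself, 255 * 255 = 65025; dividing the limit above by this gives
16777215 * 8 / (255 * 255) = 2064.0941, so 2064 elements is the maximum, which is a lot smaller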
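Both the 24-bit limit and the resulting range size are easy to check in code. A minimal sketch, with hypothetical names (FloatExactness, CanStillCountUp, MaxRangeSizeForFloat) and the lane count of Vector256<float> written as a plain constant:

```csharp
using System;

public static class FloatExactness
{
    // True while adding 1 to the float still changes its value,
    // i.e. while integers are still represented exactly.
    public static bool CanStillCountUp(float value)
    {
        // The explicit cast forces the sum back to float precision.
        return (float)(value + 1f) != value;
    }

    // 16777215 * 8 / (255 * 255), the range size derived in the text.
    public static int MaxRangeSizeForFloat()
    {
        const int maxExact = (1 << 24) - 1; // 16777215
        const int lanes = 8;                // Vector256<float>.Count
        return maxExact * lanes / (byte.MaxValue * byte.MaxValue);
    }
}
```

CanStillCountUp(16777215f) is true, CanStillCountUp(16777216f) is false because 16777217 cannot be represented as a float, and MaxRangeSizeForFloat() returns 2064.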

Test23 is Test3 multithreaded with Parallel.ForEach, with the array split so that no rounding error occurs
f:id:gogowaten:20200303135516p:plain
Line 248 uses a bit shift simply because I think that looks cooler than writing 16777215

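Test4 and 14 are just 3 and 13 switched to double
f:id:gogowaten:20200303135856p:plain
Vector256<double>.Count is 4, half as many, so it is slower, but the limit before rounding errors appear goes up. The mantissa of a double is said to be 52 bits (53 bits of effective precision once the implicit leading 1 is counted), and 52 bits in decimal is
f:id:gogowaten:20200303140340p:plain
an enormous number, about 4503 trillion, so the limit does not seem worth worrying about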
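To put that number next to the float case, the same division can be done for double. This is my own back-of-the-envelope arithmetic using the 52-bit figure from the text, not a measured result.

```latex
2^{52} = 4{,}503{,}599{,}627{,}370{,}496
\qquad
\frac{2^{52}}{255 \times 255} \approx 6.9 \times 10^{10} \ \text{elements per Vector element}
\qquad
4 \times 6.9 \times 10^{10} \approx 2.8 \times 10^{11} \gg 10^{7}
```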

The Test5 series uses two methods, Multiply and Add
f:id:gogowaten:20200303140839p:plain

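How the Multiply method (the int overload in the Avx2 class), which does the multiplication, behaves
f:id:gogowaten:20200303141804p:plain
The description says that passing two int Vectors returns the products as a long Vector, so the element type changes
f:id:gogowaten:20200303141930p:plain
1st {0 1 2 3 4 5 6 7}
2nd {0 1 2 3 4 5 6 7}: passing these two gives
Result {0 4 16 36}
Something is missing. Multiply seems to compute only every other element: counting the first element as index 0, the elements at indices 1, 3, 5 and 7 are ignored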
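That is where the Unpack methods used on lines 362 and 363 come in
f:id:gogowaten:20200303143135p:plain
UnpackHigh with
1st {0 1 2 3 4 5 6 7}
2nd {0 1 2 3 4 5 6 7}: passing these two gives
{2 2 3 3 6 6 7 7}

UnpackLow with
1st {0 1 2 3 4 5 6 7}
2nd {0 1 2 3 4 5 6 7}: passing these two gives
{0 0 1 1 4 4 5 5}
I did not know why it comes out this way (the unpack instructions interleave the elements of the two arguments separately within each 128-bit half of the vector), but it is very convenient: now it does not matter that Multiply only computes every other element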
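Putting Multiply and the two Unpack methods together for a single Vector looks roughly like this. It is a sketch under assumptions: the names (Avx2MultiplyDemo, EvenSquaresOnly, SumOfSquares8) are hypothetical, the Vector is built with Vector256.Create rather than loaded through a pointer, and the scalar branches stand in for CPUs without AVX2.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Avx2MultiplyDemo
{
    // What Multiply alone returns: squares of the elements at indices 0, 2, 4, 6.
    public static long[] EvenSquaresOnly(int[] x)
    {
        if (x.Length != 8) throw new ArgumentException("need exactly 8 elements");
        var result = new long[4];
        if (Avx2.IsSupported)
        {
            Vector256<int> v = Vector256.Create(x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7]);
            Vector256<long> p = Avx2.Multiply(v, v); // odd-indexed ints are ignored
            for (int k = 0; k < 4; k++) result[k] = p.GetElement(k);
        }
        else
        {
            for (int k = 0; k < 4; k++) result[k] = (long)x[2 * k] * x[2 * k];
        }
        return result;
    }

    // Sum of squares of all 8 elements, using Unpack so nothing is skipped.
    public static long SumOfSquares8(int[] x)
    {
        if (x.Length != 8) throw new ArgumentException("need exactly 8 elements");
        long total = 0;
        if (Avx2.IsSupported)
        {
            Vector256<int> v = Vector256.Create(x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7]);
            Vector256<int> hi = Avx2.UnpackHigh(v, v); // {x2 x2 x3 x3 x6 x6 x7 x7}
            Vector256<int> lo = Avx2.UnpackLow(v, v);  // {x0 x0 x1 x1 x4 x4 x5 x5}
            // Each Multiply reads indices 0, 2, 4, 6 of its inputs, so together
            // the two calls cover all 8 original elements.
            Vector256<long> sum = Avx2.Add(Avx2.Multiply(hi, hi), Avx2.Multiply(lo, lo));
            for (int k = 0; k < 4; k++) total += sum.GetElement(k);
        }
        else
        {
            foreach (int e in x) total += (long)e * e;
        }
        return total;
    }
}
```

For {0 1 2 3 4 5 6 7}, EvenSquaresOnly returns {0, 4, 16, 36} and SumOfSquares8 returns 140.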
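It takes a little more work, but it is the fastest of the accurate single-threaded versions this time and second among the multithreaded ones, and it supports plenty of types (double, float, int, uint), so this may be the easiest one to use. I also like that the class is called Avx2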

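Test7 series: the DotProduct method of the Sse41 class
f:id:gogowaten:20200303145054p:plain
Doing the multiply and the add in one go resembles the FMA MultiplyAdd method, but at first I could not work out how to use it, until I found

www.officedaytime.com The explanation of DPPS under the dot product entry in the arithmetic section of this site, which was a great help
f:id:gogowaten:20200303145922p:plain
It was the third argument, passed as a byte, that I did not understand. According to the explanation, each bit, 0 or 1, specifies part of the behavior (bits are counted from the least significant bit):
Bits 1 to 4 select which elements receive the result: 1 stores it, 0 does not
Bits 5 to 8 select which elements take part in the calculation: 1 includes the element, 0 leaves it out
With 0b11110001,
only bit 1 is 1, so the result goes only into the 1st element (the 0th if elements are counted from 0)
bits 5 to 8 are all 1, so all four element pairs are used
f:id:gogowaten:20200303151214p:plain
1st {0 1 2 3}
2nd {0 1 2 3}: passing these two gives
{14 0 0 0}
A Vector comes back with the dot product in element 0
Line 568 takes out element 0 with the GetElement method of Vector128 and adds it to the total
Speed-wise it was so-so, but being able to specify the behavior is interesting
What bothered me was that there are four slots for the result and only one is used, which seemed a waste, so I made the Test8 series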
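The effect of the control byte can be seen with one call. A minimal sketch: the names (DotProductMaskDemo, Dot4) are made up, and the scalar branch, which imitates the {dot, 0, 0, 0} layout, is an assumption for CPUs without SSE4.1.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class DotProductMaskDemo
{
    // Hypothetical helper: returns the 4 elements of the DotProduct result
    // for control byte 0b11110001.
    public static float[] Dot4(float[] a, float[] b)
    {
        if (a.Length != 4 || b.Length != 4)
            throw new ArgumentException("both inputs must have exactly 4 elements");

        var result = new float[4];
        if (Sse41.IsSupported)
        {
            Vector128<float> va = Vector128.Create(a[0], a[1], a[2], a[3]);
            Vector128<float> vb = Vector128.Create(b[0], b[1], b[2], b[3]);
            // Upper 4 bits 1111: multiply all four pairs.
            // Lower 4 bits 0001: write the sum to element 0 only, zero elsewhere.
            Vector128<float> r = Sse41.DotProduct(va, vb, 0b1111_0001);
            for (int i = 0; i < 4; i++) result[i] = r.GetElement(i);
        }
        else
        {
            // Scalar fallback producing the same layout.
            result[0] = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
        }
        return result;
    }
}
```

Dot4 with {0 1 2 3} for both arguments returns {14, 0, 0, 0}.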

Test8
f:id:gogowaten:20200303152535p:plain
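The outcome was underwhelming. Test8 and its multithreaded version Test18 did get quite a bit faster, but because the DotProduct results are now accumulated in a Vector (line 641), rounding error became a concern, and fixing that made it slower; that is Test28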

f:id:gogowaten:20200303154413p:plain
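If integer arithmetic is enough: MultiplyAddAdjacent as in Test26
If fractional values are involved: Multiply + Add as in Test15, or MultiplyAdd as in Test23
This time too the input was a byte array and everything could be done in integers, so MultiplyAddAdjacent came out as the best fit. SSE2 is old, though, it has been around for well over 10 years, and it is only Vector128, so I wondered whether a Vector256 version of this in AVX2 would be even faster. (The Avx2 class does provide a MultiplyAddAdjacent overload for Vector256<short>; it is not measured here.)
In the previous test, computing the dot product with the Numerics.Vector class was disappointing because the plain non-SIMD calculation was faster, so it was good to find that SIMD through Intrinsics speeds things up considerably. What I actually want to compute is the variance, but most of the variance calculation is a dot product, so if the dot product is fast the variance should be fast too.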
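On the Vector256 question: System.Runtime.Intrinsics.X86.Avx2 has a MultiplyAddAdjacent overload that takes two Vector256<short> and returns a Vector256<int>. The sketch below only shows that it computes the same pair sums over 16 elements; whether it would beat Test26 on this CPU is not something this article measured. The names (Avx2MaddDemo, PairSums16) are hypothetical and the scalar branch is an assumption for CPUs without AVX2.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class Avx2MaddDemo
{
    // Hypothetical helper: 16 shorts in, 8 sums of neighboring products out.
    public static int[] PairSums16(short[] a, short[] b)
    {
        if (a.Length != 16 || b.Length != 16)
            throw new ArgumentException("both inputs must have exactly 16 elements");

        var result = new int[8];
        if (Avx2.IsSupported)
        {
            Vector256<short> va = Vector256.Create(
                a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7],
                a[8], a[9], a[10], a[11], a[12], a[13], a[14], a[15]);
            Vector256<short> vb = Vector256.Create(
                b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
                b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
            // Same operation as the Sse2 version, on twice as many elements.
            Vector256<int> sums = Avx2.MultiplyAddAdjacent(va, vb);
            for (int i = 0; i < 8; i++) result[i] = sums.GetElement(i);
        }
        else
        {
            // Scalar fallback with the same meaning.
            for (int i = 0; i < 8; i++)
                result[i] = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];
        }
        return result;
    }
}
```

For the values 0 to 15 in both arguments this returns {1, 13, 41, 85, 145, 221, 313, 421}. The int accumulator would still overflow in the same way, so the array would need to be split into ranges just as in Test26.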

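CPU usage
f:id:gogowaten:20200304130313p:plain
Even the multithreaded tests are not using 100%? Task Manager refreshes once per second, so for tests that finish in under a second the reading is probably not accurate. So I ran Test15, 23, 26 and 28 again with 10 times the loop count
f:id:gogowaten:20200304131749p:plain
Even 23 and 26 do not reach 100%, which is a pity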

MainWindow.xaml

<Window x:Class="_20200301_SIMDでドット積.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:_20200301_SIMDでドット積"
        mc:Ignorable="d"
         Title="MainWindow" Height="800" Width="614">
  <Grid>
    <StackPanel>
      <StackPanel.Resources>
        <Style TargetType="StackPanel">
          <Setter Property="Margin" Value="2"/>
        </Style>
        <Style TargetType="Button">
          <Setter Property="Width" Value="60"/>
        </Style>
        <Style TargetType="TextBlock">
          <Setter Property="Margin" Value="2,0"/>
        </Style>
      </StackPanel.Resources>
      <TextBlock x:Name="MyTextBlock" Text="text" HorizontalAlignment="Center" FontSize="20"/>
      <TextBlock x:Name="MyTextBlockVectorCount" Text="vectorCount" HorizontalAlignment="Center"/>
      <TextBlock x:Name="MyTextBlockCpuThreadCount" Text="threadCount" HorizontalAlignment="Center"/>
      <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
        <Button x:Name="ButtonAll" Content="一斉テスト" Margin="20,0" Width="120"/>
        <TextBlock x:Name="TbAll" Text="time"/>
        <!--<Button x:Name="ButtonReset" Content="reset" Margin="20,0"/>-->

      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button1" Content="1"/>
        <TextBlock x:Name="Tb1" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button2" Content="2"/>
        <TextBlock x:Name="Tb2" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button3" Content="3"/>
        <TextBlock x:Name="Tb3" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button4" Content="4"/>
        <TextBlock x:Name="Tb4" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button5" Content="5"/>
        <TextBlock x:Name="Tb5" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button6" Content="6"/>
        <TextBlock x:Name="Tb6" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button7" Content="7"/>
        <TextBlock x:Name="Tb7" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button8" Content="8"/>
        <TextBlock x:Name="Tb8" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button9" Content="9"/>
        <TextBlock x:Name="Tb9" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button10" Content="10"/>
        <TextBlock x:Name="Tb10" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button11" Content="11"/>
        <TextBlock x:Name="Tb11" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button12" Content="12"/>
        <TextBlock x:Name="Tb12" Text="time"/>
      </StackPanel>
      <!--<Border Height="1" Background="Orange" UseLayoutRounding="True"/>-->
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button13" Content="13"/>
        <TextBlock x:Name="Tb13" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button14" Content="14"/>
        <TextBlock x:Name="Tb14" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button15" Content="15"/>
        <TextBlock x:Name="Tb15" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button16" Content="16"/>
        <TextBlock x:Name="Tb16" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button17" Content="17"/>
        <TextBlock x:Name="Tb17" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button18" Content="18"/>
        <TextBlock x:Name="Tb18" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button19" Content="19"/>
        <TextBlock x:Name="Tb19" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button20" Content="20"/>
        <TextBlock x:Name="Tb20" Text="time"/>
      </StackPanel>
      <Border Height="1" Background="Orange" UseLayoutRounding="True"/>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button21" Content="21"/>
        <TextBlock x:Name="Tb21" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button22" Content="22"/>
        <TextBlock x:Name="Tb22" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button23" Content="23"/>
        <TextBlock x:Name="Tb23" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button24" Content="24"/>
        <TextBlock x:Name="Tb24" Text="time"/>
      </StackPanel>
      <StackPanel Orientation="Horizontal">
        <Button x:Name="Button25" Content="25"/>
        <TextBlock x:Name="Tb25" Text="time"/>
      </StackPanel>



    </StackPanel>
  </Grid>
</Window>

MainWindow.xaml.cs

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Threading.Tasks;
using System.Windows;
using System.Windows.Controls;

namespace _20200301_SIMDでドット積
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        private byte[] MyArray;
        private const int LOOP_COUNT = 1000;
        private const int ELEMENT_COUNT = 10_000_000;// 1_056_831;// 132_103;// 2071;//number of elements

        public MainWindow()
        {
            InitializeComponent();
            MyInitialize();
            this.Title = this.ToString();



            MyTextBlock.Text = $"byte型配列要素数{ELEMENT_COUNT.ToString("N0")}のドット積を {LOOP_COUNT}回求める";
            MyTextBlockVectorCount.Text = $"Vector256<byte>.Count={Vector256<byte>.Count}  Vector<byte>.Count={Vector<byte>.Count}";
            MyTextBlockCpuThreadCount.Text = $"CPUスレッド数：{Environment.ProcessorCount}";



            ButtonAll.Click += (s, e) => MyExeAll();
            Button1.Click += (s, e) => MyExe(Test1_Normal, Tb1, MyArray);
            Button2.Click += (s, e) => MyExe(Test2_Numerics_Dot_long, Tb2, MyArray);
            Button3.Click += (s, e) => MyExe(Test3_Intrinsics_FMA_MultiplyAdd_float, Tb3, MyArray);
            Button4.Click += (s, e) => MyExe(Test4_Intrinsics_FMA_MultiplyAdd_double, Tb4, MyArray);
            Button5.Click += (s, e) => MyExe(Test5_Intrinsics_AVX_Multiply_Add_long, Tb5, MyArray);
            Button6.Click += (s, e) => MyExe(Test6_Intrinsics_SSE2_MultiplyAddAdjacent_int, Tb6, MyArray);
            Button7.Click += (s, e) => MyExe(Test7_Intrinsics_SSE41_DotProduct_float, Tb7, MyArray);
            Button8.Click += (s, e) => MyExe(Test8_Intrinsics_SSE41_DotProduct_float, Tb8, MyArray);
            //Button9.Click += (s, e) => MyExe(Test9_Intrinsics_SSE41_DotProduct_float, Tb9, MyArray);
            //Button10.Click += (s, e) => MyExe(Test11_Normal_MT, Tb10, MyArray);
            Button11.Click += (s, e) => MyExe(Test11_Normal_MT, Tb11, MyArray);
            Button12.Click += (s, e) => MyExe(Test12_Numerics_Dot_long_MT, Tb12, MyArray);
            Button13.Click += (s, e) => MyExe(Test13_Intrinsics_FMA_MultiplyAdd_float_MT, Tb13, MyArray);
            Button14.Click += (s, e) => MyExe(Test14_Intrinsics_FMA_MultiplyAdd_double_MT, Tb14, MyArray);
            Button15.Click += (s, e) => MyExe(Test15_Intrinsics_AVX_Multiply_Add_long_MT, Tb15, MyArray);
            Button16.Click += (s, e) => MyExe(Test16_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT, Tb16, MyArray);
            Button17.Click += (s, e) => MyExe(Test17_Intrinsics_SSE41_DotProduct_float_MT, Tb17, MyArray);
            Button18.Click += (s, e) => MyExe(Test18_Intrinsics_SSE41_DotProduct_float_MT, Tb18, MyArray);
            //Button19.Click += (s, e) => MyExe(Test19_Intrinsics_SSE41_DotProduct_float_MT, Tb19, MyArray);
            //Button20.Click += (s, e) => MyExe(Test20, Tb20, MyArray);
            Button21.Click += (s, e) => MyExe(Test23_Intrinsics_FMA_MultiplyAdd_float_MT_Kai, Tb21, MyArray);
            Button22.Click += (s, e) => MyExe(Test26_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT_Kai, Tb22, MyArray);
            Button23.Click += (s, e) => MyExe(Test28_Intrinsics_SSE41_DotProduct_float_MT_Kai, Tb23, MyArray);
            //Button24.Click += (s, e) => MyExe(Test28_Intrinsics_SSE41_DotProduct_float_MT_Kai, Tb24, MyArray);
        }

        //Plain multiply and add, no SIMD
        private long Test1_Normal(byte[] vs)
        {
            long total = 0;
            for (int i = 0; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of the method above
        private long Test11_Normal_MT(byte[] vs)
        {
            long total = 0;
            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    for (int i = range.Item1; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }




        //Numerics Dot
        private long Test2_Numerics_Dot_long(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector<byte>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            for (int i = 0; i < lastIndex; i += simdLength)
            {
                System.Numerics.Vector.Widen(new Vector<byte>(vs, i),
                    out Vector<ushort> v1, out Vector<ushort> v2);
                System.Numerics.Vector.Widen(v1,
                    out Vector<uint> vv1, out Vector<uint> vv2);
                System.Numerics.Vector.Widen(v2,
                    out Vector<uint> vv3, out Vector<uint> vv4);
                total += System.Numerics.Vector.Dot(vv1, vv1);
                total += System.Numerics.Vector.Dot(vv2, vv2);
                total += System.Numerics.Vector.Dot(vv3, vv3);
                total += System.Numerics.Vector.Dot(vv4, vv4);
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of the method above
        private long Test12_Numerics_Dot_long_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector<byte>.Count;
            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    for (int i = range.Item1; i < lastIndex; i += simdLength)
                    {
                        System.Numerics.Vector.Widen(new Vector<byte>(vs, i),
                            out Vector<ushort> v1, out Vector<ushort> v2);
                        System.Numerics.Vector.Widen(v1,
                            out Vector<uint> vv1, out Vector<uint> vv2);
                        System.Numerics.Vector.Widen(v2,
                            out Vector<uint> vv3, out Vector<uint> vv4);
                        subtotal += System.Numerics.Vector.Dot(vv1, vv1);
                        subtotal += System.Numerics.Vector.Dot(vv2, vv2);
                        subtotal += System.Numerics.Vector.Dot(vv3, vv3);
                        subtotal += System.Numerics.Vector.Dot(vv4, vv4);
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });

            return total;
        }



        #region Intrinsics from here on

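        //The largest element count that can be computed without rounding error is 2064.
        //This is for computing a byte array with Vector256<float>:
        //the largest integer float holds exactly is 16777215 (24 bits) and the largest byte value is 255, so
        //16777215/255/255=258.01176
        //Dropping the fraction gives 258; multiplying by the VectorCount of 8,
        //258*8=2064, which is the limit.
        //As a bonus, add 7, the largest possible remainder when the length is not divisible by VectorCount:
        //2064+7=2071
        //FMA MultiplyAdd can also compute with Vector256<double>.
        //The maximum element count goes up, but VectorCount is halved and it gets slower,
        //so splitting the array and computing in float looks more efficient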
        //Intrinsics FMA MultiplyAdd float
        private unsafe long Test3_Intrinsics_FMA_MultiplyAdd_float(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector256<int>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            Vector256<float> ff = Vector256.Create(0f);
            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    Vector256<int> v = Avx2.ConvertToVector256Int32(p + i);
                    Vector256<float> f = Avx.ConvertToVector256Single(v);
                    ff = Fma.MultiplyAdd(f, f, ff);//float
                }
            }

            float* pp = stackalloc float[Vector256<float>.Count];
            Avx.Store(pp, ff);
            for (int i = 0; i < Vector256<float>.Count; i++)
            {
                total += (long)pp[i];
            }
            //For the leftover elements that did not fill a whole Vector
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of the method above
        //Intrinsics FMA MultiplyAdd float
        private unsafe long Test13_Intrinsics_FMA_MultiplyAdd_float_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector256<int>.Count;
            int rangeSize = vs.Length / Environment.ProcessorCount;//size of one range
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    Vector256<float> ff = Vector256.Create(0f);
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector256<int> v = Avx2.ConvertToVector256Int32(p + i);
                            Vector256<float> f = Avx.ConvertToVector256Single(v);
                            ff = Fma.MultiplyAdd(f, f, ff);//float
                        }
                    }
                    float* pp = stackalloc float[Vector256<float>.Count];
                    Avx.Store(pp, ff);
                    for (int i = 0; i < Vector256<float>.Count; i++)
                    {
                        subtotal += (long)pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }

        //Modified version of the method above
        //Splits the array so that the Vector256<float> accumulator never picks up a rounding error
        //Intrinsics FMA MultiplyAdd float
        private unsafe long Test23_Intrinsics_FMA_MultiplyAdd_float_MT_Kai(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector256<int>.Count;
            //Largest element count the Vector256<float> accumulator can handle = 2064
            //Use this as the number of elements per range (the split size)
            //float significand 24 bits (16777215) * 8 / (255 * 255) = 2064.0941
            int rangeSize = ((1 << 24) - 1)
                            * Vector256<float>.Count
                            / (byte.MaxValue * byte.MaxValue);
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    Vector256<float> vTotal = Vector256.Create(0f);//accumulator
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector256<int> v = Avx2.ConvertToVector256Int32(p + i);
                            Vector256<float> f = Avx.ConvertToVector256Single(v);
                            vTotal = Fma.MultiplyAdd(f, f, vTotal);//float
                        }
                    }
                    float* pp = stackalloc float[Vector256<float>.Count];
                    Avx.Store(pp, vTotal);
                    for (int i = 0; i < Vector256<float>.Count; i++)
                    {
                        subtotal += (long)pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }


        //Intrinsics FMA MultiplyAdd double
        private unsafe long Test4_Intrinsics_FMA_MultiplyAdd_double(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            Vector256<double> vTotal = Vector256.Create(0d);
            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                    Vector256<double> f = Avx.ConvertToVector256Double(v);
                    vTotal = Fma.MultiplyAdd(f, f, vTotal);//double
                }
            }

            double* pp = stackalloc double[Vector256<double>.Count];
            Avx.Store(pp, vTotal);
            for (int i = 0; i < Vector256<double>.Count; i++)
            {
                total += (long)pp[i];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of the method above
        //Intrinsics FMA MultiplyAdd double
        private unsafe long Test14_Intrinsics_FMA_MultiplyAdd_double_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count;
            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    Vector256<double> vTotal = Vector256.Create(0d);
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                            Vector256<double> f = Avx.ConvertToVector256Double(v);
                            vTotal = Fma.MultiplyAdd(f, f, vTotal);//double
                        }
                    }
                    double* pp = stackalloc double[Vector256<double>.Count];
                    Avx.Store(pp, vTotal);
                    for (int i = 0; i < Vector256<double>.Count; i++)
                    {
                        subtotal += (long)pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }


        //Intrinsics AVX Multiply + Add
        private unsafe long Test5_Intrinsics_AVX_Multiply_Add_long(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector256<int>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            Vector256<long> ff = Vector256<long>.Zero;
            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    Vector256<int> vv = Avx2.ConvertToVector256Int32(p + i);
                    Vector256<int> v1 = Avx2.UnpackHigh(vv, vv);
                    Vector256<int> v2 = Avx2.UnpackLow(vv, vv);
                    Vector256<long> t1 = Avx2.Multiply(v1, v1);//double,float,int,uint
                    Vector256<long> t2 = Avx2.Multiply(v2, v2);
                    ff = Avx2.Add(ff, t1);
                    ff = Avx2.Add(ff, t2);
                }
            }
            simdLength = Vector256<long>.Count;
            long* pp = stackalloc long[simdLength];
            Avx.Store(pp, ff);
            for (int i = 0; i < simdLength; i++)
            {
                total += pp[i];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of the method above
        //Intrinsics AVX Multiply + Add
        private unsafe long Test15_Intrinsics_AVX_Multiply_Add_long_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector256<int>.Count;
            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex =
                    range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    Vector256<long> vTotal = Vector256<long>.Zero;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector256<int> vv = Avx2.ConvertToVector256Int32(p + i);
                            Vector256<int> v1 = Avx2.UnpackHigh(vv, vv);
                            Vector256<int> v2 = Avx2.UnpackLow(vv, vv);
                            Vector256<long> t1 = Avx2.Multiply(v1, v1);//int * int -> long, reads only the even-indexed lanes
                            Vector256<long> t2 = Avx2.Multiply(v2, v2);
                            vTotal = Avx2.Add(vTotal, t1);
                            vTotal = Avx2.Add(vTotal, t2);
                        }
                    }
                    long* pp = stackalloc long[Vector256<long>.Count];
                    Avx.Store(pp, vTotal);
                    for (int i = 0; i < Vector256<long>.Count; i++)
                    {
                        subtotal += pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }


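        //With every byte at 255, the result is correct only up to 132103 elements.
        //Each of the 4 int lanes of the accumulator can hold at most int.MaxValue (2147483647),
        //and each loop iteration (8 bytes) adds 2 * 255 * 255 = 130050 to every lane:
        //  2147483647 / 130050 = 16512.75, so 16512 iterations = 16512 * 8 = 132096 elements.
        //The scalar tail loop handles up to 7 more elements in a long: 132096 + 7 = 132103.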
        //Intrinsics SSE2 MultiplyAddAdjacent
        private unsafe long Test6_Intrinsics_SSE2_MultiplyAddAdjacent_int(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<short>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);

            Vector128<int> vTotal = Vector128<int>.Zero;
            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    Vector128<short> v = Sse41.ConvertToVector128Int16(p + i);
                    Vector128<int> vv = Sse2.MultiplyAddAdjacent(v, v);//each int lane = sum of two adjacent short products
                    vTotal = Sse2.Add(vTotal, vv);
                }
            }
            simdLength = Vector128<int>.Count;
            int* pp = stackalloc int[simdLength];
            Sse2.Store(pp, vTotal);
            for (int i = 0; i < simdLength; i++)
            {
                total += pp[i];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

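        //Multithreaded version of Test6_Intrinsics_SSE2_MultiplyAddAdjacent_int
        //With every byte at 255, correct only up to 1_056_831 elements on a CPU with 8 logical processors
        //(8 ranges of 132103 elements plus a final range of 7)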
        //Intrinsics SSE2 MultiplyAddAdjacent
        private unsafe long Test16_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<short>.Count;//8
            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex =
                    range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    Vector128<int> vTotal = Vector128<int>.Zero;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector128<short> v = Sse41.ConvertToVector128Int16(p + i);
                            Vector128<int> vv = Sse2.MultiplyAddAdjacent(v, v);//each int lane = sum of two adjacent short products
                            vTotal = Sse2.Add(vTotal, vv);
                        }
                    }

                    int* pp = stackalloc int[Vector128<int>.Count];
                    Sse2.Store(pp, vTotal);
                    for (int i = 0; i < Vector128<int>.Count; i++)
                    {
                        subtotal += pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }

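        //Revised version of Test16_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT
        //Splits the array into ranges small enough that the Vector128<int> accumulator cannot overflow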
        //Intrinsics SSE2 MultiplyAddAdjacent
        private unsafe long Test26_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT_Kai(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<short>.Count;//8
            //Range size that the Vector128<int> accumulator can sum without overflowing = 132100
            //int.MaxValue / (byte.MaxValue * byte.MaxValue) * Vector128<int>.Count
            //Integer division: 2147483647 / (255 * 255) = 33025, and 33025 * 4 = 132100
            int rangeSize =
                int.MaxValue / (byte.MaxValue * byte.MaxValue) * Vector128<int>.Count;

            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex =
                    range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    Vector128<int> vTotal = Vector128<int>.Zero;//accumulator
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector128<short> v = Sse41.ConvertToVector128Int16(p + i);
                            Vector128<int> vv = Sse2.MultiplyAddAdjacent(v, v);//each int lane = sum of two adjacent short products
                            vTotal = Sse2.Add(vTotal, vv);
                        }
                    }

                    int* pp = stackalloc int[Vector128<int>.Count];
                    Sse2.Store(pp, vTotal);
                    for (int i = 0; i < Vector128<int>.Count; i++)
                    {
                        subtotal += pp[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }
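
        //Illustrative sketch, not part of the original benchmark (the method name is hypothetical).
        //Recomputes the 132103-element limit noted above Test6, assuming every byte is 255.
        //Each loop iteration reads 8 bytes and adds 2 * 255 * 255 to every int lane of the accumulator.
        private static int MaxSafeLengthForInt32Accumulator()
        {
            const int bytesPerIteration = 8;                              //Vector128<short>.Count
            const int addedPerLane = 2 * byte.MaxValue * byte.MaxValue;   //130050
            int iterations = int.MaxValue / addedPerLane;                 //16512
            int simdElements = iterations * bytesPerIteration;            //132096
            return simdElements + (bytesPerIteration - 1);                //plus a scalar tail of up to 7 elements
        }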


        //x86/x64 SIMD instruction list (SSE to AVX2)
        //https://www.officedaytime.com/tips/simd.html
        //Arithmetic: dot product, DPPS
        //Intrinsics SSE41 DotProduct
        private unsafe long Test7_Intrinsics_SSE41_DotProduct_float(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                    var vv = Sse2.ConvertToVector128Single(v);
                    //Control byte: upper 4 bits set = multiply all 4 elements; lowest bit set = store the sum in element 0
                    Vector128<float> dp = Sse41.DotProduct(vv, vv, 0b11110001);
                    total += (long)dp.GetElement(0);
                }
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of Test7_Intrinsics_SSE41_DotProduct_float
        private unsafe long Test17_Intrinsics_SSE41_DotProduct_float_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count;
            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    long subtotal = 0;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                            var vv = Sse2.ConvertToVector128Single(v);
                            //Control byte: upper 4 bits set = multiply all 4 elements; lowest bit set = store the sum in element 0
                            Vector128<float> dp = Sse41.DotProduct(vv, vv, 0b11110001);
                            subtotal += (long)dp.GetElement(0);
                        }
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }
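
        //Illustrative sketch, not part of the original benchmark (the method name is hypothetical).
        //Shows how the control byte passed to Sse41.DotProduct in this file is laid out:
        //  upper 4 bits: which of the 4 float elements take part in the multiply-and-sum
        //  lower 4 bits: which result elements receive the sum (the others are set to 0)
        //The DPPS instruction encodes this byte as an immediate, so the loops pass literals instead.
        private static byte MakeDotProductControl(int multiplyElements, int resultElements)
        {
            return (byte)(((multiplyElements & 0b1111) << 4) | (resultElements & 0b1111));
        }
        //Example: MakeDotProductControl(0b1111, 0b0001) gives 0b11110001, the value used in Test7.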


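        //Intrinsics SSE41 DotProduct, four dot products (16 bytes) per loop iteration
        //The sums are kept in a Vector128<float>, which is exact only while each lane stays at or below 2^24 (16777216)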
        private unsafe long Test8_Intrinsics_SSE41_DotProduct_float(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count * 4;
            int lastIndex = vs.Length - (vs.Length % simdLength);
            var vTotal = Vector128<float>.Zero;
            fixed (byte* p = vs)
            {
                for (int i = 0; i < lastIndex; i += simdLength)
                {
                    Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                    var vv = Sse2.ConvertToVector128Single(v);
                    //Control byte: upper 4 bits set = multiply all 4 elements; lowest bit set = store the sum in element 0
                    Vector128<float> dp = Sse41.DotProduct(vv, vv, 0b11110001);
                    vTotal = Sse.Add(vTotal, dp);

                    v = Sse41.ConvertToVector128Int32(p + i + 4);
                    vv = Sse2.ConvertToVector128Single(v);
                    dp = Sse41.DotProduct(vv, vv, 0b11110010);//store the sum in element 1
                    vTotal = Sse.Add(vTotal, dp);

                    v = Sse41.ConvertToVector128Int32(p + i + 8);
                    vv = Sse2.ConvertToVector128Single(v);
                    dp = Sse41.DotProduct(vv, vv, 0b11110100);//store the sum in element 2
                    vTotal = Sse.Add(vTotal, dp);

                    v = Sse41.ConvertToVector128Int32(p + i + 12);
                    vv = Sse2.ConvertToVector128Single(v);
                    dp = Sse41.DotProduct(vv, vv, 0b11111000);//store the sum in element 3
                    vTotal = Sse.Add(vTotal, dp);

                }
            }
            float* f = stackalloc float[Vector128<float>.Count];
            Sse.Store(f, vTotal);
            for (int i = 0; i < Vector128<float>.Count; i++)
            {
                total += (long)f[i];
            }
            for (int i = lastIndex; i < vs.Length; i++)
            {
                total += vs[i] * vs[i];
            }
            return total;
        }

        //Multithreaded version of Test8_Intrinsics_SSE41_DotProduct_float
        private unsafe long Test18_Intrinsics_SSE41_DotProduct_float_MT(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count * 4;

            int rangeSize = vs.Length / Environment.ProcessorCount;
            Parallel.ForEach(Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    var vTotal = Vector128<float>.Zero;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                            var vv = Sse2.ConvertToVector128Single(v);
                            //Control byte: upper 4 bits set = multiply all 4 elements; lowest bit set = store the sum in element 0
                            Vector128<float> dp = Sse41.DotProduct(vv, vv, 0b11110001);
                            vTotal = Sse.Add(vTotal, dp);

                            v = Sse41.ConvertToVector128Int32(p + i + 4);
                            vv = Sse2.ConvertToVector128Single(v);
                            dp = Sse41.DotProduct(vv, vv, 0b11110010);//store the sum in element 1
                            vTotal = Sse.Add(vTotal, dp);

                            v = Sse41.ConvertToVector128Int32(p + i + 8);
                            vv = Sse2.ConvertToVector128Single(v);
                            dp = Sse41.DotProduct(vv, vv, 0b11110100);//store the sum in element 2
                            vTotal = Sse.Add(vTotal, dp);

                            v = Sse41.ConvertToVector128Int32(p + i + 12);
                            vv = Sse2.ConvertToVector128Single(v);
                            dp = Sse41.DotProduct(vv, vv, 0b11111000);//store the sum in element 3
                            vTotal = Sse.Add(vTotal, dp);
                        }
                    }
                    long subtotal = 0;
                    float* f = stackalloc float[Vector128<float>.Count];
                    Sse.Store(f, vTotal);
                    for (int i = 0; i < Vector128<float>.Count; i++)
                    {
                        subtotal += (long)f[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }
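        //Revised version of Test18_Intrinsics_SSE41_DotProduct_float_MT
        //Splits the array into ranges small enough that the float accumulator stays exact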
        private unsafe long Test28_Intrinsics_SSE41_DotProduct_float_MT_Kai(byte[] vs)
        {
            long total = 0;
            int simdLength = Vector128<int>.Count * 4;

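            //Range size that the Vector128<float> accumulator vTotal can sum exactly = 1032
            //A float has a 24-bit significand, so every integer up to 16777215 is represented exactly.
            //16777215 / (byte.MaxValue * byte.MaxValue) * Vector128<float>.Count
            //Integer division: 16777215 / (255 * 255) = 258, and 258 * 4 = 1032
            //This is the number of elements per range (the partition size).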
            int rangeSize =
                ((1 << 24) - 1) / (byte.MaxValue * byte.MaxValue) * Vector128<float>.Count;//1032

            Parallel.ForEach(
                Partitioner.Create(0, vs.Length, rangeSize),
                (range) =>
                {
                    var vTotal = Vector128<float>.Zero;
                    int lastIndex = range.Item2 - (range.Item2 - range.Item1) % simdLength;
                    fixed (byte* p = vs)
                    {
                        for (int i = range.Item1; i < lastIndex; i += simdLength)
                        {
                            Vector128<int> v = Sse41.ConvertToVector128Int32(p + i);
                            var vv = Sse2.ConvertToVector128Single(v);
                            //Control byte: upper 4 bits set = multiply all 4 elements; lowest bit set = store the sum in element 0
                            Vector128<float> dp = Sse41.DotProduct(vv, vv, 0b11110001);
                            vTotal = Sse.Add(vTotal, dp);

                            v = Sse41.ConvertToVector128Int32(p + i + 4);
                            vv = Sse2.ConvertToVector128Single(v);
                            dp = Sse41.DotProduct(vv, vv, 0b11110010);//store the sum in element 1
                            vTotal = Sse.Add(vTotal, dp);

                            v = Sse41.ConvertToVector128Int32(p + i + 8);
                            vv = Sse2.ConvertToVector128Single(v);
                            dp = Sse41.DotProduct(vv, vv, 0b11110100);//store the sum in element 2
                            vTotal = Sse.Add(vTotal, dp);

                            v = Sse41.ConvertToVector128Int32(p + i + 12);
                            vv = Sse2.ConvertToVector128Single(v);
                            dp = Sse41.DotProduct(vv, vv, 0b11111000);//store the sum in element 3
                            vTotal = Sse.Add(vTotal, dp);
                        }
                    }
                    long subtotal = 0;
                    float* f = stackalloc float[Vector128<float>.Count];
                    Sse.Store(f, vTotal);
                    for (int i = 0; i < Vector128<float>.Count; i++)
                    {
                        subtotal += (long)f[i];
                    }
                    for (int i = lastIndex; i < range.Item2; i++)
                    {
                        subtotal += vs[i] * vs[i];
                    }
                    System.Threading.Interlocked.Add(ref total, subtotal);
                });
            return total;
        }
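
        //Illustrative sketch, not part of the original benchmark (the method name is hypothetical).
        //Recomputes how many 255 * 255 products one float lane can sum exactly,
        //which is where the range size of 1032 in Test28 comes from (this count times 4 lanes).
        //Assumes the worst case of every byte being 255; a float represents every integer up to 2^24 exactly.
        private static int MaxExactProductsPerFloatLane()
        {
            const int maxExactInteger = 1 << 24;                      //16777216
            const int maxProduct = byte.MaxValue * byte.MaxValue;     //65025
            return maxExactInteger / maxProduct;                      //258, and 258 * 4 = 1032
        }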






        #endregion










        private void MyInitialize()
        {
            MyArray = new byte[ELEMENT_COUNT];

            //Fill with the specified value
            var span = new Span<byte>(MyArray);
            span.Fill(255);

            //Last element
            //MyArray[ELEMENT_COUNT - 1] = 100;

            //Random values
            //var r = new Random();
            //r.NextBytes(MyArray);

            //Repeating sequence of 0 to 255
            //for (int i = 0; i < ELEMENT_COUNT; i++)
            //{
            //    MyArray[i] = (byte)i;
            //}


        }

        #region Timing
        private void MyExe(Func<byte[], long> func, TextBlock tb, byte[] vs)
        {
            long total = 0;
            var sw = new Stopwatch();
            sw.Start();
            for (int i = 0; i < LOOP_COUNT; i++)
            {
                total = func(vs);
            }
            sw.Stop();
            this.Dispatcher.Invoke(() => tb.Text = $"Elapsed: {sw.Elapsed.TotalSeconds:000.000} s {total:N0}  {func.Method.Name}");
        }


        //Runs every test in sequence
        private async void MyExeAll()
        {
            var sw = new Stopwatch();
            sw.Start();
            this.IsEnabled = false;
            await Task.Run(() => MyExe(Test1_Normal, Tb1, MyArray));
            await Task.Run(() => MyExe(Test2_Numerics_Dot_long, Tb2, MyArray));
            await Task.Run(() => MyExe(Test3_Intrinsics_FMA_MultiplyAdd_float, Tb3, MyArray));
            await Task.Run(() => MyExe(Test4_Intrinsics_FMA_MultiplyAdd_double, Tb4, MyArray));
            await Task.Run(() => MyExe(Test5_Intrinsics_AVX_Multiply_Add_long, Tb5, MyArray));
            await Task.Run(() => MyExe(Test6_Intrinsics_SSE2_MultiplyAddAdjacent_int, Tb6, MyArray));
            await Task.Run(() => MyExe(Test7_Intrinsics_SSE41_DotProduct_float, Tb7, MyArray));
            await Task.Run(() => MyExe(Test8_Intrinsics_SSE41_DotProduct_float, Tb8, MyArray));
            //await Task.Run(() => MyExe(Test9_Nunerics_uint_MT, Tb9, MyArray));
            //await Task.Run(() => MyExe(Test11_Normal_MT, Tb10, MyArray));
            await Task.Run(() => MyExe(Test11_Normal_MT, Tb11, MyArray));
            await Task.Run(() => MyExe(Test12_Numerics_Dot_long_MT, Tb12, MyArray));
            await Task.Run(() => MyExe(Test13_Intrinsics_FMA_MultiplyAdd_float_MT, Tb13, MyArray));
            await Task.Run(() => MyExe(Test14_Intrinsics_FMA_MultiplyAdd_double_MT, Tb14, MyArray));
            await Task.Run(() => MyExe(Test15_Intrinsics_AVX_Multiply_Add_long_MT, Tb15, MyArray));
            await Task.Run(() => MyExe(Test16_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT, Tb16, MyArray));
            await Task.Run(() => MyExe(Test17_Intrinsics_SSE41_DotProduct_float_MT, Tb17, MyArray));
            await Task.Run(() => MyExe(Test18_Intrinsics_SSE41_DotProduct_float_MT, Tb18, MyArray));


            await Task.Run(() => MyExe(Test23_Intrinsics_FMA_MultiplyAdd_float_MT_Kai, Tb21, MyArray));
            await Task.Run(() => MyExe(Test26_Intrinsics_SSE2_MultiplyAddAdjacent_int_MT_Kai, Tb22, MyArray));
            await Task.Run(() => MyExe(Test28_Intrinsics_SSE41_DotProduct_float_MT_Kai, Tb23, MyArray));


            this.IsEnabled = true;
            sw.Stop();
            TbAll.Text = $"Elapsed: {sw.Elapsed.TotalSeconds:000.000} s";
        }
        #endregion Timing
    }
}