K最近傍法の実装 - パイソン

前提条件: Kに最も近い隣人

導入

それぞれが数値的な特徴 (身長、体重、年齢など) を持つ項目のデータセットが与えられたとします。特徴の数が n 項目を点として表すことができます。 n -次元グリッド。新しいアイテムを指定すると、そのアイテムからセット内の他のすべてのアイテムまでの距離を計算できます。私たちが選ぶのは、 k 最近傍のほとんどがどこに分類されているかがわかります。そこで新しいアイテムを分類します。
そこで問題はこうなります アイテム間の距離を計算する方法。 これに対する解決策はデータセットによって異なります。値が実数の場合、通常はユークリッド距離を使用します。値がカテゴリカルまたはバイナリの場合、通常はハミング距離を使用します。
アルゴリズム：

Given a new item: 1. Find distances between new item and all other items 2. Pick k shorter distances 3. Pick the most common class in these k distances 4. That class is where we will classify the new item

データの読み取り

avl ツリーのローテーション

入力ファイルを次の形式にします。

Height Weight Age Class 1.70 65 20 Programmer 1.90 85 33 Builder 1.78 76 31 Builder 1.73 74 24 Programmer 1.81 75 35 Builder 1.73 70 75 Scientist 1.80 71 63 Scientist 1.75 69 25 Programmer

各項目は行であり、「クラス」の下に項目が分類される場所が表示されます。機能名の下の値 (「高さ」など) は、その機能に対して項目が持つ値です。すべての値と機能はカンマで区切られます。
これらのデータファイルを作業ディレクトリに配置します。データ2 そしてデータ。いずれかを選択し、内容をそのままという名前のテキストファイルに貼り付けます。 データ 。
ファイル (「data.txt」という名前) から読み取り、入力を行ごとに分割します。

Javaの配列リストメソッド

f = open('data.txt' 'r'); lines = f.read().splitlines(); f.close();

ファイルの最初の行には、最後にキーワード「Class」が付いた機能名が保持されます。機能名をリストに保存したいとします。

# Split the first line by commas # remove the first element and # save the rest into a list. The # list now holds the feature # names of the data set. features = lines[0].split(' ')[:-1];

次に、データセット自体に進みます。項目をという名前のリストに保存します。 アイテム その要素は辞書です (項目ごとに 1 つ)。これらの項目辞書のキーは、機能名と項目クラスを保持する「クラス」です。最後に、リスト内の項目をシャッフルしたいと思います (これは、項目がおかしな順序になっている場合の安全対策です)。

Python3

items = []; for i in range(1 len(lines)): line = lines[i].split(' '); itemFeatures = {'Class' : line[-1]}; # Iterate through the features for j in range(len(features)): # Get the feature at index j f = features[j]; # The first item in the line # is the class skip it v = float(line[j]); # Add feature to dict itemFeatures[f] = v; # Append temp dict to items items.append(itemFeatures); shuffle(items);

データの分類

に保存されたデータを使用して、 アイテム ここで分類器の構築を開始します。分類子として新しい関数を作成します。 分類する 。項目リストを分類したい項目を入力として受け取ります。 k 最近傍の数。
もし k はデータセットの長さよりも大きいため、データセット内のアイテムの総量を超える最近傍アイテムを増やすことはできないため、分類は続行しません。 (あるいは、k を アイテム エラーメッセージを返す代わりに長さを返します)

if(k > len(Items)): # k is larger than list # length abort return 'k larger than list length';

最終的には、分類されるアイテムとトレーニングセット内のすべてのアイテムの間の距離を計算したいと考えています。 k 最短距離。現在の最も近いものを維持するために、というリストを使用します。隣人。最小の各要素は、分類される項目からの距離と、隣接する項目が属するクラスの 2 つの値を保持します。一般化されたユークリッドの公式を介して距離を計算します ( n 寸法）。次に、最も頻繁に表示されるクラスを選択します。隣人それが私たちの選択になります。コード内:

カイリー・ジェンナー兄弟

Python3

def Classify(nItem k Items): if(k > len(Items)): # k is larger than list # length abort return 'k larger than list length'; # Hold nearest neighbors. # First item is distance  # second class neighbors = []; for item in Items: # Find Euclidean Distance distance = EuclideanDistance(nItem item); # Update neighbors either adding # the current item in neighbors  # or not. neighbors = UpdateNeighbors(neighbors item distance k); # Count the number of each # class in neighbors count = CalculateNeighborsClass(neighbors k); # Find the max in count aka the # class with the most appearances. return FindMax(count);

実装する必要がある外部関数は次のとおりです。 ユークリッド距離 隣人の更新 近隣クラスの計算 そして ファインドマックス 。

ユークリッド距離を求める

2 つのベクトル x と y の一般化されたユークリッド公式は次のとおりです。

distance = sqrt{(x_{1}-y_{1})^2 + (x_{2}-y_{2})^2 + ... + (x_{n}-y_{n})^2}

コード内:

Python3

def EuclideanDistance(x y): # The sum of the squared  # differences of the elements S = 0; for key in x.keys(): S += math.pow(x[key]-y[key] 2); # The square root of the sum return math.sqrt(S);

ネイバーの更新

近隣リストがあります (長さは最大でも k ) そして、指定された距離でリストに項目を追加したいとします。まず、次のことを確認します。隣人長さがある k 。それより少ない場合は、距離に関係なくアイテムを追加します (リストを最大まで埋める必要があるため) k アイテムの拒否を開始する前に）。そうでない場合は、リスト内の最大距離を持つアイテムよりもアイテムの距離が短いかどうかを確認します。それが本当の場合、最大距離のアイテムを新しいアイテムと交換します。
最大距離項目をより迅速に見つけるために、リストを昇順にソートしておきます。したがって、リストの最後の項目が最大距離になります。新しい商品と交換させていただき、再度選別させていただきます。
このプロセスを高速化するために、リスト全体を並べ替えずにリストに新しい項目を挿入する挿入並べ替えを実装できます。ただし、このコードはかなり長く、単純ではありますが、チュートリアルが行き詰まってしまいます。

Python3

def UpdateNeighbors(neighbors item distance k): if(len(neighbors) > distance): # If yes replace the last # element with new item neighbors[-1] = [distance item['Class']]; neighbors = sorted(neighbors); return neighbors;

近隣クラスの計算

ここでは、最も頻繁に出現するクラスを計算します。隣人。そのために、という別の辞書を使用します。 カウント ここで、キーは、に表示されるクラス名です。隣人。キーが存在しない場合は追加します。存在しない場合は、その値をインクリメントします。

キャットティンプ

Python3

def CalculateNeighborsClass(neighbors k): count = {}; for i in range(k): if(neighbors[i][1] not in count): # The class at the ith index # is not in the count dict. # Initialize it to 1. count[neighbors[i][1]] = 1; else: # Found another item of class  # c[i]. Increment its counter. count[neighbors[i][1]] += 1; return count;

ファインドマックス

この関数に辞書を入力します。 カウント 私たちは組み込みます 近隣クラスの計算 そしてその最大値を返します。

Python3

def FindMax(countList): # Hold the max maximum = -1; # Hold the classification classification = ''; for key in countList.keys(): if(countList[key] > maximum): maximum = countList[key]; classification = key; return classification maximum;

結論

これで、この kNN チュートリアルは終了です。
新しいアイテムの設定を分類できるようになりました k あなたが適切だと思うように。通常、k には奇数が使用されますが、それは必須ではありません。新しいアイテムを分類するには、そのアイテムを特徴づける機能名と値をキーとする辞書を作成する必要があります。分類の例:

文字列を文字Javaに変換します

newItem = {'Height' : 1.74 'Weight' : 67 'Age' : 22}; print Classify(newItem 3 items);

上記のアプローチの完全なコードを以下に示します。

Python3

# Python Program to illustrate # KNN algorithm # For pow and sqrt import math from random import shuffle ###_Reading_### def ReadData(fileName): # Read the file splitting by lines f = open(fileName 'r') lines = f.read().splitlines() f.close() # Split the first line by commas  # remove the first element and save # the rest into a list. The list  # holds the feature names of the  # data set. features = lines[0].split(' ')[:-1] items = [] for i in range(1 len(lines)): line = lines[i].split(' ') itemFeatures = {'Class': line[-1]} for j in range(len(features)): # Get the feature at index j f = features[j] # Convert feature value to float v = float(line[j]) # Add feature value to dict itemFeatures[f] = v items.append(itemFeatures) shuffle(items) return items ###_Auxiliary Function_### def EuclideanDistance(x y): # The sum of the squared differences # of the elements S = 0 for key in x.keys(): S += math.pow(x[key] - y[key] 2) # The square root of the sum return math.sqrt(S) def CalculateNeighborsClass(neighbors k): count = {} for i in range(k): if neighbors[i][1] not in count: # The class at the ith index is # not in the count dict.  # Initialize it to 1. count[neighbors[i][1]] = 1 else: # Found another item of class  # c[i]. Increment its counter. count[neighbors[i][1]] += 1 return count def FindMax(Dict): # Find max in dictionary return  # max value and max index maximum = -1 classification = '' for key in Dict.keys(): if Dict[key] > maximum: maximum = Dict[key] classification = key return (classification maximum) ###_Core Functions_### def Classify(nItem k Items): # Hold nearest neighbours. First item # is distance second class neighbors = [] for item in Items: # Find Euclidean Distance distance = EuclideanDistance(nItem item) # Update neighbors either adding the # current item in neighbors or not. neighbors = UpdateNeighbors(neighbors item distance k) # Count the number of each class  # in neighbors count = CalculateNeighborsClass(neighbors k) # Find the max in count aka the # class with the most appearances return FindMax(count) def UpdateNeighbors(neighbors item distance k ): if len(neighbors) < k: # List is not full add  # new item and sort neighbors.append([distance item['Class']]) neighbors = sorted(neighbors) else: # List is full Check if new  # item should be entered if neighbors[-1][0] > distance: # If yes replace the  # last element with new item neighbors[-1] = [distance item['Class']] neighbors = sorted(neighbors) return neighbors ###_Evaluation Functions_### def K_FoldValidation(K k Items): if K > len(Items): return -1 # The number of correct classifications correct = 0 # The total number of classifications total = len(Items) * (K - 1) # The length of a fold l = int(len(Items) / K) for i in range(K): # Split data into training set # and test set trainingSet = Items[i * l:(i + 1) * l] testSet = Items[:i * l] + Items[(i + 1) * l:] for item in testSet: itemClass = item['Class'] itemFeatures = {} # Get feature values for key in item: if key != 'Class': # If key isn't 'Class' add  # it to itemFeatures itemFeatures[key] = item[key] # Categorize item based on # its feature values guess = Classify(itemFeatures k trainingSet)[0] if guess == itemClass: # Guessed correctly correct += 1 accuracy = correct / float(total) return accuracy def Evaluate(K k items iterations): # Run algorithm the number of # iterations pick average accuracy = 0 for i in range(iterations): shuffle(items) accuracy += K_FoldValidation(K k items) print accuracy / float(iterations) ###_Main_### def main(): items = ReadData('data.txt') Evaluate(5 5 items 100) if __name__ == '__main__': main()

出力：

0.9375

出力はマシンごとに異なる場合があります。コードには Fold Validation 関数が含まれていますが、アルゴリズムの精度を計算するために存在するアルゴリズムとは無関係です。

クイズの作成