[Deep Learning] MNIST

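MNIST is a dataset of handwritten digit images, consisting of 60,000 training examples and 10,000 test examples. Each image is a 28x28-pixel grayscale image (28x28x1) labeled with an integer from 0 to 9. It is available through TensorFlow's Keras datasets module, so it can be loaded right away (the data is downloaded on first use).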

from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

Running the code below lets you look at the actual data.

import matplotlib.pyplot as plt

plt.figure(figsize = (20, 4))
for idx in range(5):
    label = y_train[idx]
    plt.subplot(1, 5, idx + 1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(X_train[idx], cmap = 'gray')
    plt.title('Index - %d, Label - %d' % (idx, label), fontsize = 20)
plt.show()

Printing the leftmost sample in the figure, the one with label 5, as a NumPy array shows its raw 28x28 grid of pixel values.

Normalization

Each pixel in the image data is an integer from 0 to 255. We normalize these values to the range 0 to 1.

X_train = X_train.astype(float) / 255
X_test = X_test.astype(float) / 255

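Scaling the inputs to values between 0 and 1 tends to help the model converge faster and makes training more stable, since gradient-based optimizers generally behave better when the input features have a small, consistent range.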
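To see what the scaling does without downloading anything, here is a minimal sketch on a stand-in array. The name fake_images and the random pixel values are assumptions for illustration only; the real X_train has the same dtype (uint8) and value range.

```python
import numpy as np

# Stand-in for a small batch of MNIST images (assumption: random pixels
# with the same dtype and value range as the real data).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Force both extremes to appear so the printed range is predictable.
fake_images[0, 0, 0] = 0
fake_images[0, 0, 1] = 255

# Same operation as in the post: cast to float, then divide by 255.
scaled = fake_images.astype(float) / 255

print(fake_images.dtype, fake_images.min(), fake_images.max())
print(scaled.dtype, scaled.min(), scaled.max())
```

The cast to float matters: the division result needs a floating-point dtype to hold values between 0 and 1.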

Reshape

X_train = X_train.reshape((60000, 28 * 28))
X_test = X_test.reshape((10000, 28 * 28))

Because this post uses a model built from Dense layers, which expect one flat feature vector per sample, we reshape the data from (60000, 28, 28) to (60000, 28 * 28).
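The reshape does not change any pixel values; it only lays each image out row by row. A small sketch, assuming a hypothetical 3-image batch filled with a running counter, shows where a given pixel ends up:

```python
import numpy as np

# Hypothetical mini-batch: 3 "images" of 28x28 filled with a running counter.
batch = np.arange(3 * 28 * 28).reshape((3, 28, 28))

# Flatten each image into a vector of 784 values, as done in the post.
flat = batch.reshape((3, 28 * 28))
print(batch.shape, "->", flat.shape)

# With NumPy's default row-major order, pixel (row r, col c) of image i
# lands at column r * 28 + c of the flattened row i.
i, r, c = 1, 5, 7
same_pixel = bool(batch[i, r, c] == flat[i, r * 28 + c])
print(same_pixel)
```

Note that the flattening discards the 2D neighborhood structure, which is acceptable for Dense layers but is the reason convolutional models keep the (28, 28) shape.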

one-hot encoding

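The labels are stored as integers such as 5, 0, 4, 1, 9. The model, however, outputs a vector of 10 class probabilities, which is awkward to compare directly against a single integer, so it is common to convert the labels to one-hot vectors. The to_categorical function makes this conversion easy.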

from tensorflow.keras.utils import to_categorical

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

print(y_train[:5])

The output is shown below.

[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
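For intuition, the same pattern of 0s and 1s can be reproduced with plain NumPy. This is only a sketch of the idea, not how to_categorical is implemented; the variable names are illustrative.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # the first five labels shown in the post
num_classes = 10

# Row k of the 10x10 identity matrix is the one-hot vector for class k,
# so indexing the identity matrix with the labels encodes them all at once.
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# Taking the argmax of each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The argmax step is the same trick used later to turn the model's softmax output back into a predicted digit.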

train/validation split

We use train_test_split to hold out a validation set from the training set. The validation set is kept separate from the test set and is used to monitor how well the model is learning during training.

from sklearn.model_selection import train_test_split

# Hold out 20% of the training data as a validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train,
    test_size = 0.2,
    random_state = 12345
)

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape # ((48000, 784), (48000, 10), (12000, 784), (12000, 10))
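Conceptually, the split shuffles the sample indices and cuts them into two disjoint groups. The sketch below does this by hand with NumPy to show where the 48000 and 12000 come from. It is an assumption-level illustration: it does not reproduce scikit-learn's internals or pick the same samples as random_state = 12345 does there.

```python
import numpy as np

n_samples = 60000
valid_fraction = 0.2

# Seeded generator so the hand-rolled split is reproducible.
rng = np.random.default_rng(12345)
indices = rng.permutation(n_samples)

# The first 20% of the shuffled indices form the validation set,
# the remaining 80% stay in the training set.
n_valid = round(n_samples * valid_fraction)
valid_idx = indices[:n_valid]
train_idx = indices[n_valid:]

print(len(train_idx), len(valid_idx))
```

Indexing both X_train and y_train with the same index array keeps each image paired with its label.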

Model

Defining the model

We define the model as follows.

from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(512, activation = 'relu', input_shape = (28 * 28,)))
model.add(layers.Dense(256, activation = 'relu'))
model.add(layers.Dense(10, activation = 'softmax'))

model.summary()

The output shows the structure of the model.

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 512)               401920    

 dense_1 (Dense)             (None, 256)               131328    

 dense_2 (Dense)             (None, 10)                2570      

=================================================================
Total params: 535,818
Trainable params: 535,818
Non-trainable params: 0
_________________________________________________________________
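The parameter counts in the summary can be checked by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. A short sketch of that arithmetic (the variable names are illustrative):

```python
# Input size followed by the number of units in each Dense layer.
layer_sizes = [28 * 28, 512, 256, 10]

# Each Dense layer has n_in * n_out weights plus n_out bias terms.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [401920, 131328, 2570]
print(sum(params_per_layer))  # 535818
```

Most of the parameters sit in the first layer, because it connects all 784 input pixels to 512 units.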

Compiling the model

Next we configure how the model is trained: categorical_crossentropy as the loss function, rmsprop as the optimizer, and accuracy as the evaluation metric.

model.compile(
    loss='categorical_crossentropy',
    optimizer='rmsprop',
    metrics=['accuracy']
)
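To make the loss and metric concrete, here is a minimal NumPy sketch for a single sample. The prediction vector is a made-up softmax output, and the formula omits details of the Keras implementation such as clipping predictions away from 0 and averaging over the batch.

```python
import numpy as np

# One-hot target for the digit 5.
y_true = np.array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])

# Hypothetical softmax output of the model (values sum to 1).
y_pred = np.array([0.01, 0.01, 0.01, 0.02, 0.01, 0.85, 0.03, 0.02, 0.02, 0.02])

# Categorical cross-entropy: -sum(y_true * log(y_pred)).
# With a one-hot target this reduces to -log(probability of the true class).
loss = -np.sum(y_true * np.log(y_pred))
print(loss)

# Accuracy counts a sample as correct when the most probable class
# matches the true class.
predicted_class = int(y_pred.argmax())
is_correct = predicted_class == int(y_true.argmax())
print(predicted_class, is_correct)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, which is what the optimizer pushes the model toward.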

Training the model

Calling the fit function starts training the model.

history = model.fit(
    X_train, 
    y_train,
    epochs=100,
    batch_size=128,
    validation_data=(X_valid, y_valid)
)
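As a quick sanity check on these settings, the number of weight updates can be worked out from the training set size after the split (48000 samples) and the batch size. The snippet below is just that arithmetic:

```python
import math

n_train = 48000   # training samples left after the validation split
batch_size = 128
epochs = 100

# One epoch processes every training sample once, one batch per step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 375 37500
```

The steps-per-epoch value is the number shown in the Keras progress bar for each epoch.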

The training results are shown below.