Sparta Machine Learning Course (Logistic Regression and Multiple Logistic Regression Practice)


골드인생 2024. 8. 12. 20:47

# Frequently used functions

sklearn.linear_model.LogisticRegression

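● Attributes

- classes_ : the class labels of the target (Y)

- n_features_in_ : the number of independent variables (X) seen during fit

- feature_names_in_ : the names of the independent variables (X) seen during fit (set only when X has string column names, e.g. a DataFrame)

- coef_ : the weights

- intercept_ : the bias

● Methods

- fit : train the model on the data

- predict : predict the class label for each row

- predict_proba : predict the probability of each class, one column per class in the order of classes_; for a 0/1 target, column 1 is the probability that Y = 1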

sklearn.metrics.accuracy_score : accuracy
sklearn.metrics.f1_score : F1-score
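To make the attributes and methods listed above concrete, here is a minimal sketch on made-up toy data (the names `X_toy`, `y_toy`, and `toy_model` are illustrative and not part of the lecture). Because the input is a NumPy array rather than a DataFrame, `feature_names_in_` is not set here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration: one feature, binary target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

print(toy_model.classes_)          # class labels found in y_toy
print(toy_model.n_features_in_)    # number of X columns seen during fit
print(toy_model.coef_.shape)       # (1, 1): one row of weights for a binary target
print(toy_model.intercept_.shape)  # (1,)

toy_pred = toy_model.predict(X_toy)         # predicted class label per row
toy_proba = toy_model.predict_proba(X_toy)  # one probability column per class
print(toy_proba.shape)                      # (6, 2)
```

Each row of `toy_proba` sums to 1, and its columns follow the order of `toy_model.classes_`.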

# Simple logistic regression practice

titanic_df.head(3) # inspect the data
- Use one X variable and the Y variable (Survived)

titanic_df.info() # check for missing values and the total number of rows

# Choose the data to analyze
# X variable: Fare, Y variable: Survived
X_1 = titanic_df[['Fare']]
y_true = titanic_df['Survived'] # 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning

# Train the model
from sklearn.linear_model import LogisticRegression
model_lor = LogisticRegression()
model_lor.fit(X_1, y_true)

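import seaborn as sns
sns.scatterplot(data = titanic_df, x = 'Fare', y = 'Survived') # visualize Survived against Fare in the actual data
sns.histplot(data = titanic_df, x = 'Fare') # visualize the distribution of Fare in the actual data

# How to view descriptive statistics for numeric columns: describe()
titanic_df.describe() # check the summary values of the Fare column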

# Define a helper function
def get_att(x):
    # x is a fitted model
    print('Classes', x.classes_)
    print('Number of independent variables', x.n_features_in_)
    print('Names of the independent variables (X)', x.feature_names_in_)
    print('Weights', x.coef_)
    print('Bias', x.intercept_)

get_att(model_lor) # check each value

# Evaluation
from sklearn.metrics import accuracy_score, f1_score # import the metric functions
def get_metrics(true, pred):
    print('Accuracy', accuracy_score(true, pred))
    print('F1-score', f1_score(true, pred))

y_pred_1 = model_lor.predict(X_1) # inspect with y_pred_1[:10]
get_metrics(y_true, y_pred_1) # check each value
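As a worked example of what `accuracy_score` and `f1_score` compute, the sketch below counts the confusion-matrix cells by hand on made-up labels (`demo_true` and `demo_pred` are illustrative, not Titanic results) and compares the result with scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration only.
demo_true = [1, 1, 1, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Confusion-matrix cells, treating label 1 as the positive class.
pairs = list(zip(demo_true, demo_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives: 2
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives: 1
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives: 4

manual_accuracy = (tp + tn) / len(demo_true)   # 6 / 8 = 0.75
precision = tp / (tp + fp)                     # 2 / 3
recall = tp / (tp + fn)                        # 2 / 3
manual_f1 = 2 * precision * recall / (precision + recall)  # 2 / 3

print(manual_accuracy, accuracy_score(demo_true, demo_pred))
print(manual_f1, f1_score(demo_true, demo_pred))
```

Accuracy uses all four cells, while the F1-score is the harmonic mean of precision and recall and ignores the true negatives.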

# Multiple logistic regression

titanic_df.info()

 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object


# Classifying the usable data
- Numeric
- Age, SibSp, Parch, Fare

- Categorical
- Pclass, Sex, Cabin, Embarked

# Data to use
# Y (Survived): survival (0 = died, 1 = survived)
# X (numeric): Fare
# X (categorical): Pclass (ticket class), Sex

def get_sex(x):
    if x == 'female':
        return 0
    else:
        return 1

titanic_df['Sex_en'] = titanic_df['Sex'].apply(get_sex) # inspect the result with head()
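The same encoding can also be written with `Series.map` and a dictionary. The sketch below uses a small made-up frame (`demo_df`) in place of `titanic_df` and shows that both approaches give the same codes for these values.

```python
import pandas as pd

# Small made-up frame standing in for titanic_df.
demo_df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

def get_sex(x):
    # Same rule as above: 'female' -> 0, anything else -> 1.
    if x == 'female':
        return 0
    else:
        return 1

encoded_apply = demo_df['Sex'].apply(get_sex)
encoded_map = demo_df['Sex'].map({'female': 0, 'male': 1})

print(encoded_apply.tolist())  # [1, 0, 0, 1]
print(encoded_map.tolist())    # [1, 0, 0, 1]
```

One difference to keep in mind: `map` returns NaN for any value missing from the dictionary, whereas `get_sex` sends every non-'female' value to 1. The info() output shows 891 non-null Sex values, so no missing values are involved in this column.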

# Train the model
X_2 = titanic_df[['Pclass', 'Sex_en', 'Fare']]
y_true = titanic_df['Survived'] # 1-D Series, as expected by fit()
model_lor_2 = LogisticRegression()
model_lor_2.fit(X_2, y_true)

get_att(model_lor_2) # inspect the fitted model with the helper function

y_pred_2 = model_lor_2.predict(X_2) # predicted values

y_pred_1[:10]
y_pred_2[:10] # compare with the previous output


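# Compare the accuracy / F1-score values
# X variable: Fare
get_metrics(y_true, y_pred_1)
# X variables: Fare, Pclass, Sex
get_metrics(y_true, y_pred_2)

# Extract the probability that Y = 1 for each row (the probability of survival)
# Use the predict_proba method; it returns one column per class, and column 1 is P(Y = 1)
model_lor_2.predict_proba(X_2) # compare with the output of y_pred_2[:10]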
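For a binary target, the probability returned by `predict_proba` is the sigmoid of the linear score built from `coef_` and `intercept_`. The sketch below checks this on made-up random data (`X_demo`, `y_demo`, and `demo_model` are illustrative names, not the Titanic model), assuming the default `LogisticRegression()` settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: two features and a binary target derived from them.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = X · w + b, followed by the sigmoid 1 / (1 + exp(-z)).
z = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-z))

proba = demo_model.predict_proba(X_demo)  # columns follow demo_model.classes_
p1 = proba[:, 1]                          # probability that Y = 1

print(np.allclose(manual_p1, p1))
# predict() picks class 1 exactly when the linear score is positive,
# which corresponds to a probability above 0.5.
print(np.array_equal(demo_model.predict(X_demo), (z > 0).astype(int)))
```

Applied to the lecture's model, the survival probability alone would be `model_lor_2.predict_proba(X_2)[:, 1]`.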

	Linear regression (regression)	Logistic regression (classification)
Y (dependent variable)	Numeric	Categorical
Evaluation metrics	Mean Squared Error, R-squared (linear regression only)	Accuracy, F1-score
sklearn model class	sklearn.linear_model.LinearRegression	sklearn.linear_model.LogisticRegression
sklearn evaluation functions	sklearn.metrics.mean_squared_error, sklearn.metrics.r2_score	sklearn.metrics.accuracy_score, sklearn.metrics.f1_score
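For the regression column of the table, the two metrics can be reproduced by hand. This is a small sketch on made-up numbers (`reg_true` and `reg_pred` are illustrative), comparing the manual formulas with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values for illustration.
reg_true = np.array([1.0, 2.0, 3.0, 4.0])
reg_pred = np.array([1.5, 2.0, 2.5, 4.0])

# MSE: mean of the squared errors -> (0.25 + 0 + 0.25 + 0) / 4 = 0.125
manual_mse = np.mean((reg_true - reg_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((reg_true - reg_pred) ** 2)         # 0.5
ss_tot = np.sum((reg_true - reg_true.mean()) ** 2)  # 5.0
manual_r2 = 1 - ss_res / ss_tot                     # 0.9

print(manual_mse, mean_squared_error(reg_true, reg_pred))
print(manual_r2, r2_score(reg_true, reg_pred))
```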

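MSE: the closer to 0, the better the model.

R2: the closer to 1, the better the model.

Accuracy: the closer to 1, the better the model. When the data is imbalanced (for example, when the positive class is very rare or very common), accuracy alone cannot adequately evaluate the model's performance.

F1-score: the closer to 1, the better the model. When the data is imbalanced, the F1-score reflects the model's performance better than accuracy does.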
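The point about imbalanced data can be seen in a small made-up case: a "model" that always predicts the majority class scores high on accuracy but gets an F1-score of 0. The labels below are invented for illustration; `zero_division=0` is passed so that the undefined precision (no predicted positives) is reported as 0 without a warning.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: 95 negatives and 5 positives.
imb_true = [0] * 95 + [1] * 5
# A degenerate predictor that always outputs the majority class.
imb_pred = [0] * 100

imb_accuracy = accuracy_score(imb_true, imb_pred)       # 95 / 100 = 0.95
imb_f1 = f1_score(imb_true, imb_pred, zero_division=0)  # no true positives -> 0.0

print('Accuracy', imb_accuracy)
print('F1-score', imb_f1)
```

Accuracy rewards the 95 correct negatives, while the F1-score exposes that none of the 5 positives were found.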