Example¶
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Load the CSV file¶
In [2]:
dataset = pd.read_csv(r"loan.csv")
# r => raw string -> treats the text (D:\file\path\loan.csv) literally, so backslashes are not escape characters
dataset.head(3)
Out[2]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
Total no. of Rows & Columns¶
In [3]:
dataset.shape
Out[3]:
(614, 13)
No. of Rows => shape[0], Columns => shape[1]¶
In [4]:
dataset.shape[0]
Out[4]:
614
In [5]:
dataset.isnull() # NaN => True,
Out[5]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | True | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
609 | False | False | False | False | False | False | False | False | False | False | False | False | False |
610 | False | False | False | False | False | False | False | False | False | False | False | False | False |
611 | False | False | False | False | False | False | False | False | False | False | False | False | False |
612 | False | False | False | False | False | False | False | False | False | False | False | False | False |
613 | False | False | False | False | False | False | False | False | False | False | False | False | False |
614 rows × 13 columns
Count all True (NaN) values for every Column¶
In [6]:
dataset.isnull().sum()
Out[6]:
Loan_ID 0 Gender 13 Married 3 Dependents 15 Education 0 Self_Employed 32 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Credit_History 50 Property_Area 0 Loan_Status 0 dtype: int64
In [7]:
(dataset.isnull().sum()/dataset.shape[0])*100
Out[7]:
Loan_ID 0.000000 Gender 2.117264 Married 0.488599 Dependents 2.442997 Education 0.000000 Self_Employed 5.211726 ApplicantIncome 0.000000 CoapplicantIncome 0.000000 LoanAmount 3.583062 Loan_Amount_Term 2.280130 Credit_History 8.143322 Property_Area 0.000000 Loan_Status 0.000000 dtype: float64
In [8]:
dataset.isnull().sum().sum()
Out[8]:
np.int64(149)
In [9]:
(dataset.isnull().sum().sum()/(dataset.shape[0]*dataset.shape[1]))*100
# ( Total NaN / area ) * 100
Out[9]:
np.float64(1.8667000751691305)
✅ Find the non-NaN values¶
In [10]:
dataset.notnull().sum()
Out[10]:
Loan_ID 614 Gender 601 Married 611 Dependents 599 Education 614 Self_Employed 582 ApplicantIncome 614 CoapplicantIncome 614 LoanAmount 592 Loan_Amount_Term 600 Credit_History 564 Property_Area 614 Loan_Status 614 dtype: int64
In [11]:
# Total
dataset.notnull().sum().sum()
Out[11]:
np.int64(7833)
✅ Graph Data¶
In [12]:
sns.heatmap(dataset.isnull())
plt.show()
✅ HANDLING MISSING VALUES (DROPPING)¶
dataset.isnull().sum()
- Loan_ID --------------- 0
- Gender --------------- 13
- Married --------------- 3
- Dependents ----------- 15
- Education ------------- 0
- Self_Employed -------- 32
- ApplicantIncome ------- 0
- CoapplicantIncome ----- 0
- LoanAmount ----------- 22
- Loan_Amount_Term ----- 14
- `Credit_History ------ 50` <= highest NaN count (~8% of rows) -> So, as an example, we `Remove` this Column
- Property_Area --------- 0
- Loan_Status ----------- 0
☁️ Remove¶
=> dataset.drop(columns=["Credit_History"])
-----------> But it's not permanent => it only changes the view
=> dataset.drop(columns=["Credit_History"], inplace=True)
--> This permanently changes the variable dataset
☁️ So, you can use a new dataSet
¶
=> [new-Name] = dataset.drop(columns=["Credit_History"])
In [13]:
dataSet = dataset.drop(columns=["Credit_History"])
dataSet.isnull().sum()
Out[13]:
Loan_ID 0 Gender 13 Married 3 Dependents 15 Education 0 Self_Employed 32 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Property_Area 0 Loan_Status 0 dtype: int64
☁️ Remove all rows containing NaN => .dropna()
¶
In [14]:
dataSet.head(3)
Out[14]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | Urban | Y |
In [15]:
dataSet.dropna(inplace=True)
dataSet.head(3)
Out[15]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | Urban | Y |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | Urban | Y |
In [16]:
sns.heatmap(dataSet.isnull())
plt.show()
In [17]:
dataSet.shape
Out[17]:
(523, 12)
In [18]:
dataSet.isnull().sum()
Out[18]:
Loan_ID 0 Gender 0 Married 0 Dependents 0 Education 0 Self_Employed 0 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 0 Loan_Amount_Term 0 Property_Area 0 Loan_Status 0 dtype: int64
☁️ No. of removed Rows¶
In [19]:
a = 614 - 523                  # rows removed by dropna() (614 -> 523)
b = round((a / 614) * 100, 2)  # in %
print('Removed_Rows ---------------> ' + str(a))
print('Removed_Rows_percentage % --> ' + str(b))
# NB: a float can't be concatenated to a string with +, so str() converts it first.
Removed_Rows ---------------> 91 Removed_Rows_percentage % --> 14.82
In [20]:
dataset.head(3)
# it's Old dataset
Out[20]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
In [21]:
dataset.isnull().sum()
Out[21]:
Loan_ID 0 Gender 13 Married 3 Dependents 15 Education 0 Self_Employed 32 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Credit_History 50 Property_Area 0 Loan_Status 0 dtype: int64
🐼 Fill all missing values with 10
¶
In [22]:
dataset.fillna(10).head(3)
Out[22]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 10.0 | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
But this ⬆️ is not the right way to fill the data, because it puts the int-type value (10) into String-type columns.¶
=> .head(30) ⬆️¶
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 10.0 ✅ | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
5 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
6 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
7 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
8 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
9 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |
10 | LP001024 | Male | Yes | 2 | Graduate | No | 3200 | 700.0 | 70.0 | 360.0 | 1.0 | Urban | Y |
11 | LP001027 | Male | Yes | 2 | Graduate | 10 ❌ | 2500 | 1840.0 | 109.0 | 360.0 | 1.0 | Urban | Y |
12 | LP001028 | Male | Yes | 2 | Graduate | No | 3073 | 8106.0 | 200.0 | 360.0 | 1.0 | Urban | Y |
13 | LP001029 | Male | No | 0 | Graduate | No | 1853 | 2840.0 | 114.0 | 360.0 | 1.0 | Rural | N |
14 | LP001030 | Male | Yes | 2 | Graduate | No | 1299 | 1086.0 | 17.0 | 120.0 | 1.0 | Urban | Y |
15 | LP001032 | Male | No | 0 | Graduate | No | 4950 | 0.0 | 125.0 | 360.0 | 1.0 | Urban | Y |
16 | LP001034 | Male | No | 1 | Not Graduate | No | 3596 | 0.0 | 100.0 | 240.0 | 10.0 | Urban | Y |
17 | LP001036 | Female | No | 0 | Graduate | No | 3510 | 0.0 | 76.0 | 360.0 | 0.0 | Urban | N |
18 | LP001038 | Male | Yes | 0 | Not Graduate | No | 4887 | 0.0 | 133.0 | 360.0 | 1.0 | Rural | N |
19 | LP001041 | Male | Yes | 0 | Graduate | 10 ❌ | 2600 | 3500.0 | 115.0 | 10.0 ✅ | 1.0 | Urban | Y |
20 | LP001043 | Male | Yes | 0 | Not Graduate | No | 7660 | 0.0 | 104.0 | 360.0 | 0.0 | Urban | N |
21 | LP001046 | Male | Yes | 1 | Graduate | No | 5955 | 5625.0 | 315.0 | 360.0 | 1.0 | Urban | Y |
22 | LP001047 | Male | Yes | 0 | Not Graduate | No | 2600 | 1911.0 | 116.0 | 360.0 | 0.0 | Semiurban | N |
23 | LP001050 | 10 ❌ | Yes | 2 | Not Graduate | No | 3365 | 1917.0 | 112.0 | 360.0 | 0.0 | Rural | N |
24 | LP001052 | Male | Yes | 1 | Graduate | 10 ❌ | 3717 | 2925.0 | 151.0 | 360.0 | 10.0 | Semiurban | N |
25 | LP001066 | Male | Yes | 0 | Graduate | Yes | 9560 | 0.0 | 191.0 | 360.0 | 1.0 | Semiurban | Y |
26 | LP001068 | Male | Yes | 0 | Graduate | No | 2799 | 2253.0 | 122.0 | 360.0 | 1.0 | Semiurban | Y |
27 | LP001073 | Male | Yes | 2 | Not Graduate | No | 4226 | 1040.0 | 110.0 | 360.0 | 1.0 | Urban | Y |
28 | LP001086 | Male | No | 0 | Not Graduate | No | 1442 | 0.0 | 35.0 | 360.0 | 1.0 | Urban | N |
29 | LP001087 | Female | No | 2 | Graduate | 10 ❌ | 3750 | 2083.0 | 120.0 | 360.0 | 1.0 | Semiurban | Y |
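Instead of one scalar for the whole frame, `fillna` also accepts a dict mapping column name → fill value, so each column can get a type-appropriate fill (mode for text, mean for numbers). A minimal sketch on a small made-up frame (not the loan dataset):

```python
import numpy as np
import pandas as pd

# Made-up example frame with one NaN per column
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 100.0],
})

# Dict form of fillna: each column gets its own fill value
filled = df.fillna({
    "Gender": df["Gender"].mode()[0],       # most frequent value -> "Male"
    "LoanAmount": df["LoanAmount"].mean(),  # column mean -> 98.0
})
print(filled)
```

This avoids the problem shown above, where a single value of the wrong type leaks into every column.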
In [23]:
dataset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 614 entries, 0 to 613 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Loan_ID 614 non-null object 1 Gender 601 non-null object 2 Married 611 non-null object 3 Dependents 599 non-null object 4 Education 614 non-null object 5 Self_Employed 582 non-null object 6 ApplicantIncome 614 non-null int64 7 CoapplicantIncome 614 non-null float64 8 LoanAmount 592 non-null float64 9 Loan_Amount_Term 600 non-null float64 10 Credit_History 564 non-null float64 11 Property_Area 614 non-null object 12 Loan_Status 614 non-null object dtypes: float64(4), int64(1), object(8) memory usage: 62.5+ KB
⬆️⬇️ Up-Down filling¶
dataset.fillna(method="bfill").head(3)
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 128.0 ✅ | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 ⬆️ | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
dataset.fillna(method="ffill").head(30)
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18 | LP001038 | Male | Yes | 0 | Not Graduate | No⬇️ | 4887 | 0.0 | 133.0 | 360.0 | 1.0 | Rural | N |
19 | LP001041 | Male | Yes | 0 | Graduate | No✅ | 2600 | 3500.0 | 115.0 | 360.0 | 1.0 | Urban | Y |
20 | LP001043 | Male | Yes | 0 | Not Graduate | No | 7660 | 0.0 | 104.0 | 360.0 | 0.0 | Urban | N |
21 | LP001046 | Male | Yes | 1 | Graduate | No | 5955 | 5625.0 | 315.0 | 360.0 | 1.0 | Urban | Y |
22 | LP001047 | Male⬇️ | Yes | 0 | Not Graduate | No | 2600 | 1911.0 | 116.0 | 360.0⬇️ | 0.0 | Semiurban | N |
23 | LP001050 | Male✅ | Yes | 2 | Not Graduate | No | 3365 | 1917.0 | 112.0 | 360.0✅ | 0.0 | Rural | N |
➡️⬅️ side-by-side filling¶
dataset.head(1)
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN ❌ | 360.0 | 1.0 | Urban | Y |
dataset.fillna(method="bfill",axis=1).head(1)
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 360.0✅ | 360.0⬅️ | 1.0 | Urban | Y |
dataset.fillna(method="ffill",axis=1).head(1)
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0➡️ | 0.0✅ | 360.0 | 1.0 | Urban | Y |
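Note that `fillna(method="bfill"/"ffill")` is deprecated in recent pandas; the same up/down filling is now spelled `.bfill()` / `.ffill()`. A quick sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up column with gaps
s = pd.Series([np.nan, 128.0, np.nan, 66.0], name="LoanAmount")

# bfill pulls the next valid value up; ffill pushes the previous one down
print(s.bfill().tolist())  # [128.0, 128.0, 66.0, 66.0]
print(s.ffill().tolist())  # first value stays NaN (nothing before it to copy)
```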
In [24]:
# dataset["Gender"].mode()
dataset["Gender"].mode()[0]
Out[24]:
'Male'
In [25]:
# dataset["Gender"].fillna(dataset["Gender"].mode()[0], inplace=True) => permanently changes the dataset
# Or create a new data_set:
data_set = dataset["Gender"].fillna(dataset["Gender"].mode()[0])
data_set
Out[25]:
0 Male 1 Male 2 Male 3 Male 4 Male ... 609 Female 610 Male 611 Male 612 Male 613 Female Name: Gender, Length: 614, dtype: object
Loan_ID | Gender | |
---|---|---|
22 | LP001047 | Male |
23 | LP001050 | NaN ❌ |
Loan_ID | Gender | |
---|---|---|
22 | LP001047 | Male |
23 | LP001050 | Male ✅ |
In [26]:
# Create a copy of the original dataset
data_set_01 = dataset.copy()
In [27]:
# 1st collect all object-type columns from the dataset
# data_set_01.select_dtypes(include="object") ------------------> sub-dataset containing only the object-type columns
# data_set_01.select_dtypes(include="object").isnull() ---------> NaN => True
data_set_01.select_dtypes(include="object").isnull().sum() # ---> count of NaN per object-type column
Out[27]:
Loan_ID 0 Gender 13 Married 3 Dependents 15 Education 0 Self_Employed 32 Property_Area 0 Loan_Status 0 dtype: int64
In [28]:
# find only Column names, as a list
data_set_01.select_dtypes(include="object").columns
Out[28]:
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status'], dtype='object')
In [29]:
# find only Column names
for i in data_set_01.select_dtypes(include="object").columns:
    print(i)
Loan_ID Gender Married Dependents Education Self_Employed Property_Area Loan_Status
In [30]:
# fill the mode value in every object-type column
for i in data_set_01.select_dtypes(include="object").columns:
    data_set_01[i].fillna(data_set_01[i].mode()[0],inplace=True)
C:\Users\akash\AppData\Local\Temp\ipykernel_4000\3464004097.py:4: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. data_set_01[i].fillna(data_set_01[i].mode()[0],inplace=True)
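The FutureWarning above recommends plain assignment instead of the chained `inplace=True` call. The same mode-fill loop, warning-free, sketched on a small made-up frame:

```python
import pandas as pd

# Made-up frame with one NaN per object column
df = pd.DataFrame({
    "Gender": ["Male", None, "Male"],
    "Married": ["Yes", None, "Yes"],
})

# Assign the filled Series back to the column instead of using inplace=True
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isnull().sum().sum())  # 0
```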
In [31]:
data_set_01.select_dtypes(include="object").isnull().sum()
Out[31]:
Loan_ID 0 Gender 0 Married 0 Dependents 0 Education 0 Self_Employed 0 Property_Area 0 Loan_Status 0 dtype: int64
All object Type data fill with Mode (in Every Column)¶
In [32]:
data_set_01.isnull().sum() # for All data Types
Out[32]:
Loan_ID 0 Gender 0 Married 0 Dependents 0 Education 0 Self_Employed 0 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Credit_History 50 Property_Area 0 Loan_Status 0 dtype: int64
Column | NaN |
---|---|
LoanAmount | 22 |
Loan_Amount_Term | 14 |
Credit_History | 50 |
These are numeric data types => so they weren't filled by the object-type (mode) loop above¶
🐼 1st Organize the dataset & 🐼 filter columns¶
In [33]:
# Create a copy of the original dataset
data_set_02 = dataset.copy()
In [34]:
data_set_02.head(3) # main dataset
Out[34]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
In [35]:
data_set_02.isnull().sum() # All Missing value
Out[35]:
Loan_ID 0 Gender 13 Married 3 Dependents 15 Education 0 Self_Employed 32 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Credit_History 50 Property_Area 0 Loan_Status 0 dtype: int64
In [36]:
data_set_02.info() # All data types
<class 'pandas.core.frame.DataFrame'> RangeIndex: 614 entries, 0 to 613 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Loan_ID 614 non-null object 1 Gender 601 non-null object 2 Married 611 non-null object 3 Dependents 599 non-null object 4 Education 614 non-null object 5 Self_Employed 582 non-null object 6 ApplicantIncome 614 non-null int64 7 CoapplicantIncome 614 non-null float64 8 LoanAmount 592 non-null float64 9 Loan_Amount_Term 600 non-null float64 10 Credit_History 564 non-null float64 11 Property_Area 614 non-null object 12 Loan_Status 614 non-null object dtypes: float64(4), int64(1), object(8) memory usage: 62.5+ KB
In [37]:
# data_set_02.select_dtypes(include="float64") # shows only the float-type columns
# But there are both float and int columns, so include both:
numerical_columns = data_set_02.select_dtypes(include=['int64', 'float64'])
numerical_columns
Out[37]:
ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | |
---|---|---|---|---|---|
0 | 5849 | 0.0 | NaN | 360.0 | 1.0 |
1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 |
2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 |
3 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 |
4 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 |
... | ... | ... | ... | ... | ... |
609 | 2900 | 0.0 | 71.0 | 360.0 | 1.0 |
610 | 4106 | 0.0 | 40.0 | 180.0 | 1.0 |
611 | 8072 | 240.0 | 253.0 | 360.0 | 1.0 |
612 | 7583 | 0.0 | 187.0 | 360.0 | 1.0 |
613 | 4583 | 0.0 | 133.0 | 360.0 | 0.0 |
614 rows × 5 columns
In [38]:
numerical_columns.select_dtypes(include=['int64', 'float64']).columns # See only Column Names
Out[38]:
Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History'], dtype='object')
🐼 fill Missing value using Sklearn¶
In [39]:
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy="mean")
si.fit_transform(data_set_02[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History']])
output:¶
array([[5.84900000e+03, 0.00000000e+00, 1.46412162e+02, 3.60000000e+02,
1.00000000e+00],
[4.58300000e+03, 1.50800000e+03, 1.28000000e+02, 3.60000000e+02,
1.00000000e+00],
[3.00000000e+03, 0.00000000e+00, 6.60000000e+01, 3.60000000e+02,
1.00000000e+00],
...,
[8.07200000e+03, 2.40000000e+02, 2.53000000e+02, 3.60000000e+02,
1.00000000e+00],
[7.58300000e+03, 0.00000000e+00, 1.87000000e+02, 3.60000000e+02,
1.00000000e+00],
[4.58300000e+03, 0.00000000e+00, 1.33000000e+02, 3.60000000e+02,
0.00000000e+00]])
All NaN values are filled, but returned as an array => convert it into a DataFrame¶
In [40]:
si = SimpleImputer(strategy="mean")
ar = si.fit_transform(numerical_columns[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History']])
Old ❌ NaN¶
In [41]:
# data_set_02.select_dtypes(include="float64").head(3)
numerical_columns.select_dtypes(include=['int64', 'float64']).head(3)
Out[41]:
ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | |
---|---|---|---|---|---|
0 | 5849 | 0.0 | NaN | 360.0 | 1.0 |
1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 |
2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 |
New ✅ 146.41... Avg.¶
In [42]:
# pd.DataFrame(ar,columns=[ColumnsName])
data_set_03 = pd.DataFrame(ar,columns=numerical_columns.select_dtypes(include=['int64', 'float64']).columns)
data_set_03.head(3)
Out[42]:
ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | |
---|---|---|---|---|---|
0 | 5849.0 | 0.0 | 146.412162 | 360.0 | 1.0 |
1 | 4583.0 | 1508.0 | 128.000000 | 360.0 | 1.0 |
2 | 3000.0 | 0.0 | 66.000000 | 360.0 | 1.0 |
In [43]:
data_set_03["LoanAmount"].mean() # check mean for one Column (LoanAmount)
Out[43]:
np.float64(146.41216216216216)
🦖 All NaN -> filled with the Mean 146.41...⬆️ -> using SCIKIT-LEARN¶
In [44]:
data_set_03.isnull().sum() # All blank numeric daType (NaN) => filled
Out[44]:
ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 0 Loan_Amount_Term 0 Credit_History 0 dtype: int64
🌐🦖 All these steps are used in a Pipeline, to automate filtering & deployment ⬆️⬆️¶
☁️ 1st See the Original dataset & missing value NaN¶
In [45]:
#import pandas as pd
#dataset = pd.read_csv("loan.csv")
dataset.head(3)
Out[45]:
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
✈️ find missing (NaN) values (Categorical data)¶
dataset.isnull().sum()
# Or use, => data_set_04.isnull().sum() ---> After using => data_set_04 = dataset.copy()
# give same result
✈️ output:¶
Loan_ID 0
Gender 13 ===> text (categorical) -> encode to 0,1
Married 3 ===> text (categorical) -> encode to 0,1
Dependents 15
Education 0 ===> text (categorical)
Self_Employed 32 ===> text (categorical)
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50 ---> already in 0, 1
Property_Area 0 ---> ❓ (multi-category)
Loan_Status 0 ---> Y/N ❓
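For a two-value column like Loan_Status (Y/N), a quick way to get 1/0 is `Series.map`; any value not in the dict becomes NaN. A sketch on made-up values:

```python
import pandas as pd

# Made-up Y/N column
loan_status = pd.Series(["Y", "N", "Y", "Y"])

# Map each label to a number; unmapped labels would become NaN
print(loan_status.map({"Y": 1, "N": 0}).tolist())  # [1, 0, 1, 1]
```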
☁️ fill All Categorical data (Yes - No)¶
In [46]:
# Create a copy of the original dataset
data_set_04 = dataset.copy()
data_set_04.isnull().sum() # all missing columns
Out[46]:
Loan_ID 0 Gender 13 Married 3 Dependents 15 Education 0 Self_Employed 32 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Credit_History 50 Property_Area 0 Loan_Status 0 dtype: int64
In [47]:
data_set_04["Gender"].fillna(data_set_04["Gender"].mode()[0],inplace=True) #----> fill mode in Gender column
data_set_04["Married"].fillna(data_set_04["Married"].mode()[0],inplace=True) #---> fill mode in Married column
C:\Users\akash\AppData\Local\Temp\ipykernel_4000\3812312211.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. data_set_04["Gender"].fillna(data_set_04["Gender"].mode()[0],inplace=True) #----> fill mode in Gender column C:\Users\akash\AppData\Local\Temp\ipykernel_4000\3812312211.py:2: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. data_set_04["Married"].fillna(data_set_04["Married"].mode()[0],inplace=True) #---> fill mode in Married column
In [48]:
data_set_04.isnull().sum()
Out[48]:
Loan_ID 0 Gender 0 Married 0 Dependents 15 Education 0 Self_Employed 32 ApplicantIncome 0 CoapplicantIncome 0 LoanAmount 22 Loan_Amount_Term 14 Credit_History 50 Property_Area 0 Loan_Status 0 dtype: int64
Loan_ID 0
Gender 0 ==> fill with Mode ✅
Married 0 ==> fill with Mode ✅
Dependents 15
Education 0
Self_Employed 32
...
☁️ ENCODING¶
🦖 using .get_dummies()¶
In [49]:
# Filter out Gender + Married into a new variable
data_set_05 = data_set_04[["Gender","Married"]]
data_set_05
Out[49]:
Gender | Married | |
---|---|---|
0 | Male | No |
1 | Male | Yes |
2 | Male | Yes |
3 | Male | Yes |
4 | Male | No |
... | ... | ... |
609 | Female | No |
610 | Male | Yes |
611 | Male | Yes |
612 | Male | Yes |
613 | Female | No |
614 rows × 2 columns
In [50]:
pd.get_dummies(data_set_05) # Encoding ---> True False ⬇️
# pd.get_dummies(data_set_06).info() ----> 0, 1 => boolean ⬇️
Out[50]:
Gender_Female | Gender_Male | Married_No | Married_Yes | |
---|---|---|---|---|
0 | False | True | True | False |
1 | False | True | False | True |
2 | False | True | False | True |
3 | False | True | False | True |
4 | False | True | True | False |
... | ... | ... | ... | ... |
609 | True | False | True | False |
610 | False | True | False | True |
611 | False | True | False | True |
612 | False | True | False | True |
613 | True | False | True | False |
614 rows × 4 columns
In [51]:
# Encode data_set_05 into dummy columns
encoded_dataset = pd.get_dummies(data_set_05)
# Display the first few rows to show 1s and 0s
print("First few rows with binary indicators:")
print(encoded_dataset.head())
print()
# Display the summary information of the DataFrame
# pd.get_dummies(data_set_05).info()
print("Summary information:")
print(encoded_dataset.info())
First few rows with binary indicators: Gender_Female Gender_Male Married_No Married_Yes 0 False True True False 1 False True False True 2 False True False True 3 False True False True 4 False True True False Summary information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 614 entries, 0 to 613 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Gender_Female 614 non-null bool 1 Gender_Male 614 non-null bool 2 Married_No 614 non-null bool 3 Married_Yes 614 non-null bool dtypes: bool(4) memory usage: 2.5 KB None
In [52]:
from sklearn.preprocessing import OneHotEncoder
In [53]:
obj = OneHotEncoder()
obj.fit_transform(data_set_05) # --> encoded_dataset take it from Top ⬆️
# it creates a sparse matrix (commonly used in deep learning)
Out[53]:
<Compressed Sparse Row sparse matrix of dtype 'float64' with 1228 stored elements and shape (614, 4)>
In [54]:
# Convert into an array
# array_01 is the transformed array (4 columns)
array_01 = obj.fit_transform(data_set_05).toarray()
array_01
Out[54]:
array([[0., 1., 1., 0.], [0., 1., 0., 1.], [0., 1., 0., 1.], ..., [0., 1., 0., 1.], [0., 1., 0., 1.], [1., 0., 1., 0.]])
In [55]:
# Specify the column names (array_01 has 4 columns)
column_names = ["Gender_Female", "Gender_Male", "Married_No", "Married_Yes"]
# Create DataFrame with correct column names
df = pd.DataFrame(array_01, columns=column_names)
# Display the DataFrame
print(df.head()) # Displaying first few rows for example
# ✅✅✅ Or you can write it directly => pd.DataFrame(array_01, columns=["Gender_Female", "Gender_Male", "Married_No", "Married_Yes"])
Gender_Female Gender_Male Married_No Married_Yes 0 0.0 1.0 1.0 0.0 1 0.0 1.0 0.0 1.0 2 0.0 1.0 0.0 1.0 3 0.0 1.0 0.0 1.0 4 0.0 1.0 1.0 0.0
⬆️ These dummy columns are redundant (one per feature is implied by the others) => remove some columns¶
In [56]:
obj_01 = OneHotEncoder(drop="first")
c = obj_01.fit_transform(data_set_05).toarray()
c
Out[56]:
array([[1., 0.], [1., 1.], [1., 1.], ..., [1., 1.], [1., 1.], [0., 0.]])
In [57]:
pd.DataFrame(c, columns= ["Gender_Male", "Married_Yes"])
Out[57]:
Gender_Male | Married_Yes | |
---|---|---|
0 | 1.0 | 0.0 |
1 | 1.0 | 1.0 |
2 | 1.0 | 1.0 |
3 | 1.0 | 1.0 |
4 | 1.0 | 0.0 |
... | ... | ... |
609 | 0.0 | 0.0 |
610 | 1.0 | 1.0 |
611 | 1.0 | 1.0 |
612 | 1.0 | 1.0 |
613 | 0.0 | 0.0 |
614 rows × 2 columns
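`pd.get_dummies` has the same "drop one redundant column per feature" option as `OneHotEncoder(drop="first")`: `drop_first=True`. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up two-category columns
df = pd.DataFrame({"Gender": ["Male", "Female"], "Married": ["No", "Yes"]})

# drop_first=True drops the alphabetically-first dummy of each feature
dummies = pd.get_dummies(df, drop_first=True)
print(list(dummies.columns))  # ['Gender_Male', 'Married_Yes']
```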
✅ LABEL ENCODING¶
MACHINE LEARNING
|
-------------------------------------------------
| |
SUPERVISED LEARNING UN-SUPERVISED LEARNING
| |
--------------------- ---------------------
| | | |
`CLASSIFICATION` `REGRESSION` `CLUSTERING` `ASSOCIATION`
|------- `NOMINAL` ----> Cow, Dog, ox, parrot
|
CATEGORICAL DATA -------|
|
|------- `ORDINAL` -----> XL, XXL, XXXL
Used for nominal (unordered) data ⬇️ it gives every name a unique number
In [58]:
#import pandas as pd
dF = pd.DataFrame({"name":["Dog", "Cat", "Cow", "Lion", "Ox"]})
dF
Out[58]:
name | |
---|---|
0 | Dog |
1 | Cat |
2 | Cow |
3 | Lion |
4 | Ox |
In [59]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(dF["name"]) # fit => learn the categories (train)
# transform => convert them to numbers
output:¶
array([2, 0, 1, 3, 4])
In [63]:
le = LabelEncoder()
dF["en_name"] = le.fit_transform(dF["name"])
dF
Out[63]:
name | en_name | |
---|---|---|
0 | Dog | 2 |
1 | Cat | 0 |
2 | Cow | 1 |
3 | Lion | 3 |
4 | Ox | 4 |
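`LabelEncoder` can also go backwards: `inverse_transform` maps the codes back to the original labels. Using the same animal names as above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Codes follow alphabetical order: Cat=0, Cow=1, Dog=2, Lion=3, Ox=4
codes = le.fit_transform(["Dog", "Cat", "Cow", "Lion", "Ox"])
print(list(codes))                        # [2, 0, 1, 3, 4]
print(list(le.inverse_transform(codes)))  # ['Dog', 'Cat', 'Cow', 'Lion', 'Ox']
```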
In [ ]: