Use Polynomial Regression (Machine Learning) to predict Total Fresh Water Consumption
Based on statistical data (tourism and hotel room occupancy statistics) that can be downloaded from the relevant Hong Kong websites, this post uses Polynomial Regression (a Machine Learning algorithm) to predict the monthly total metered consumption of fresh water in Hong Kong.
1. DataSet
Below is part of the dataset used for the analysis.
Columns are labelled as follows:
- A1 to A11 (raw values)
- X1 to X11 (normalized values)
- Y2 (normalized value of total water consumption)
2. Result of correlation
A Python API was used to check the correlation among the columns. The results for the different columns (An, Xn, Y2) are shown below.
Figure 1 : Result of correlation (“Ax” w.r.t. Y1, Y2)
Figure 2 : Result of correlation (“Xx” w.r.t. Y1, Y2)
Based on the results above, X3, X8, X11 (or A3, A8, A11) should be used as the “feature columns”.
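The selection step described above can be sketched as follows. This is only an illustration: the post does not name the API that was used, so pandas `DataFrame.corr` is an assumption, and the small data frame (columns X1 to X3 and Y2) holds made-up numbers, not the Hong Kong dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the real dataset from this post.
df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [2, 1, 4, 3, 6, 5],
    "X3": [5, 3, 6, 1, 2, 4],
    "Y2": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
})

# Pearson correlation of every candidate column with the target Y2.
corr_with_target = df.corr()["Y2"].drop("Y2")

# Rank by absolute value: a strong negative correlation is as useful as a positive one.
ranked = corr_with_target.abs().sort_values(ascending=False)

# Keep the strongest columns as feature columns (2 here; the post keeps 3).
top = ranked.head(2).index.tolist()
print(ranked)
print("Selected feature columns:", top)
```

Ranking by absolute correlation is one simple heuristic; it only measures the linear relationship between each single column and Y2.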
3. Usage of Regression Algorithm for prediction
Because using “A3, A8, A11” and “X3, X8, X11” gives similar prediction accuracy (Xn is slightly better), “X3, X8, X11” is used as the example below:
- X3: Volume index of restaurant receipts
- X8: Hong Kong Residents
- X11: Total (Cumulative Net passenger traffic)
- Y2: Total metered consumption (target)
Training Set: May 2021 – June 2023 (26 records)
Test Set: Feb – April 2023 (3 records)
Unknown: July – Nov 2023 (5 records)
Below is the Jupyter notebook of the Polynomial Regression analysis (Degree = 2)
Figure 3: Result of Training Set (Degree = 2)
Below is the Jupyter notebook of the Polynomial Regression analysis (Degree = 3)
Figure 4: Result of Training Set (Degree = 3)
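To make the difference between the two degrees concrete, the sketch below lists the polynomial terms built from three features. The helper `polynomial_terms` is a hypothetical illustration written for this post; it enumerates the same kind of terms that `PolynomialFeatures(include_bias=False)` generates, but it is not part of scikit-learn.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List every monomial of total degree 1..degree (no bias term)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

features = ["X3", "X8", "X11"]
terms_deg2 = polynomial_terms(features, 2)
terms_deg3 = polynomial_terms(features, 3)

print(len(terms_deg2), terms_deg2)  # 9 terms for degree 2
print(len(terms_deg3))              # 19 terms for degree 3
```

With 26 training records, the degree 3 model has to fit 19 coefficients plus an intercept, which leaves very few records per parameter.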
4. Comparison of Accuracy
Table 1 : Result of analysis
Conclusion:
1. A negative (“-ve”) accuracy value means the “degree = 3” model is overfitted, so “degree = 2” should be used for the analysis.
2. Using “degree = 2”, the predicted Y2 values for the “test set” and the “unknown records” are shown below.
[Result of “Test Set”]
[Result of “unknown records” (based on values of X3, X8, X11)]
3. A trend is observed; however, the low accuracy (Table 1: 0.8099, 0.4755) is probably due to the low correlation (Figure 2: 0.48, 0.42, 0.41).
4. More records should be added to the analysis if they become available.
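The accuracy score in the source code is the R² value returned by `r2_score`. As a rough illustration of how that score can become negative, the sketch below computes R² by hand on made-up numbers (the helper `r2` and the toy values are assumptions for illustration, not results from this analysis). R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, so it drops below 0 whenever the predictions are worse than always predicting the mean.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]                   # toy values, not from the dataset
good = r2(y_true, [1.1, 1.9, 3.2])         # close predictions, score near 1
mean_only = r2(y_true, [2.0, 2.0, 2.0])    # predicting the mean, score 0
bad = r2(y_true, [3.0, 2.0, 1.0])          # worse than the mean, negative score

print(good, mean_only, bad)
```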
[Source Code]
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
x = [
[1.039847204,0.040847404,0.054116502],
[1.022893174,-0.00394017,-0.013269893],
[1.04926611,-0.183972591,-0.283558137],
[1.063394469,-0.038254929,-0.049247137],
[1.047382329,-0.061533294,-0.065641831],
[1.052091782,-0.084674439,-0.085602721],
[1.024776955,-0.099224642,-0.09731722],
[1.127443029,-0.184462662,-0.215576209],
[0.869365009,-0.256037507,-0.322435981],
[0.555715444,-0.576029258,-0.822362036],
[0.501085791,-0.901112845,-1.313411136],
[0.797781324,-0.900402242,-1.31888304],
[0.993694566,-0.825450809,-1.157192126],
[0.979566207,-0.900485554,-1.202473709],
[1.006881034,-1.012447137,-1.359393347],
[1.021009393,-0.805803869,-1.048818997],
[0.969205411,-0.824245235,-1.082344044],
[1.045498548,-0.860270342,-1.086400679],
[0.997462128,-0.820241356,-0.964932815],
[1.104837655,-1.368228761,-1.649488343],
[1.170769996,-1.371801377,-1.656109277],
[1.047382329,-1.212449945,-1.059762805],
[1.134978153,-2.060978189,-2.204042358],
[1.096360639,-2.365831652,-1.709062736],
[1.120849795,-1.632323032,-1.577246597],
[1.093534968,-2.596870646,-2.699071024]
]
y=[1.046619367,
1.041033801,
1.04573485,
1.044801192,
1.04970972,
1.027471194,
1.025461919,
1.014585626,
0.978096536,
0.921782234,
0.933663165,
0.971042233,
0.987345753,
1.012215992,
1.051413236,
1.035677554,
1.040171123,
1.020329532,
0.988912769,
0.977070058,
0.960591819,
0.969027498,
0.970807454,
0.953526596,
0.986504915,
0.9926747
]
x = np.array(x)
y = np.array(y)
# Expand the 3 feature columns into all polynomial terms up to degree 2 (no bias column)
x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
#print(x_)
model = LinearRegression().fit(x_, y)
print('r_sq = ', model.score(x_, y))
print('b_0 =', model.intercept_)
print('Model coefficient = ', model.coef_)
y_est = model.predict(x_)
print(f' y_train = {y}')
print(f' y_est_train = {y_est}')
score_train = r2_score(y, y_est)
print(f' Score [train] = {score_train}')
# analysis of test set
x_test = [
[0.831689385,0.015334316,0.014145678],
[0.964495958,0.073716455,0.103692934],
[0.964495958,0.077289071,0.109816422]
]
x_test_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_test)
# print(x_test_)
y_test = [0.969775516,0.993253459,1.005418309]
y_est_test = model.predict(x_test_)
score_test = r2_score(y_test, y_est_test)
print('-----------------------------------------')
print(f' Score [test] = {score_test}')
print(f' y_test = {y_test}')
print(f' y_est_test = {y_est_test}')
# prediction for the unknown records (July - Nov 2023)
x_anal = [
[1.112372779,-2.959924921,-2.960503331],
[1.108605217,-1.527046016,-1.418708508],
[1.028544517,-3.661858277,-3.568949647],
[1.079406609,-1.31328692,-0.80107696],
[1.06433636,-1.350061835,-0.750519648]
]
x_anal_= PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_anal)
y_anal_test = model.predict(x_anal_)
print('-----------------------------------------')
print(f' y_est_anal [July 2023 to Nov 2023] = {y_anal_test}')
dates = pd.date_range(start='2021-05-01', end='2023-07-01', freq='M')
print(f'date={dates}')
# Create a DataFrame from the data
data = pd.DataFrame({'Date': dates, 'y-actual': y, 'e-estimate': y_est})
# Set the 'Date' column as the index
data.set_index('Date', inplace=True)
# Create a figure and axis
fig, ax = plt.subplots()
# Plot the time series data
ax.plot(data.index, data['y-actual'], label='y-actual')
ax.plot(data.index, data['e-estimate'], label='e-estimate')
# Customize the plot
ax.set_xlabel('Month')
ax.set_ylabel('Value')
ax.set_title('Time Series Plot (Train data)')
# Set the x-axis tick locator and formatter for each month
month_locator = mdates.MonthLocator()
month_formatter = mdates.DateFormatter('%b %Y')
ax.xaxis.set_major_locator(month_locator)
ax.xaxis.set_major_formatter(month_formatter)
# Add a legend
ax.legend()
# Rotate x-axis labels for better readability (optional)
plt.xticks(rotation=90)
# Display the plot
plt.tight_layout()
plt.show()
# plot with predicted values (July - Nov 2023)
dates = pd.date_range(start='2021-05-01', end='2023-12-01', freq='M')
print(f'date={dates}')
# Create a DataFrame from the data
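# Actual values for July - Nov 2023 are unknown; pad with the last known value (June 2023) as a placeholder
new_y = np.append(y, [0.9926747,0.9926747,0.9926747,0.9926747,0.9926747])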
new_y_est = np.append(y_est, y_anal_test)
data = pd.DataFrame({'Date': dates, 'y-actual': new_y, 'e-estimate': new_y_est})
# Set the 'Date' column as the index
data.set_index('Date', inplace=True)
# Create a figure and axis
fig, ax = plt.subplots()
# Plot the time series data
ax.plot(data.index, data['y-actual'], label='y-actual')
ax.plot(data.index, data['e-estimate'], label='e-estimate')
# Customize the plot
ax.set_xlabel('Month')
ax.set_ylabel('Value')
ax.set_title('Time Series Plot (With Predicted data)')
# Set the x-axis tick locator and formatter for each month
month_locator = mdates.MonthLocator()
month_formatter = mdates.DateFormatter('%b %Y')
ax.xaxis.set_major_locator(month_locator)
ax.xaxis.set_major_formatter(month_formatter)
# Add a legend
ax.legend()
# Rotate x-axis labels for better readability (optional)
plt.xticks(rotation=90)
# Display the plot
plt.tight_layout()
plt.show()
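As a possible simplification of the code above, the feature expansion and the regression can be combined into one scikit-learn pipeline, so the same transformation is applied automatically to the training, test, and unknown records. This is only a sketch: the arrays below are synthetic values generated for illustration (the names `X_train`, `y_train`, `X_new` are hypothetical), not the data used in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: 26 records with 3 features and an exactly quadratic target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(26, 3))
y_train = 1.0 + 0.5 * X_train[:, 0] - 0.2 * X_train[:, 1] * X_train[:, 2]
X_new = rng.normal(size=(3, 3))

# One object that expands the features and then fits the linear model.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # R² on the training data
y_new = model.predict(X_new)                 # raw features go in; expansion happens inside

print("Score [train] =", train_score)
print("y_est_new =", y_new)
```

Because the pipeline stores the transformer, `model.predict` accepts the raw three-column input directly and there is no separate `fit_transform` call per data set.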