Using Polynomial Regression (Machine Learning) to Predict Total Fresh Water Consumption

Based on statistical data (tourist and hotel room occupancy statistics) that can be downloaded from the relevant Hong Kong websites, this analysis uses polynomial regression (a machine learning algorithm) to predict the monthly total metered consumption of fresh water in Hong Kong.

1. Dataset

Below is part of the dataset used for the analysis.

The columns are labeled as follows:

• A1 to A11 (raw values)

• X1 to X11 (normalized values)

• Y2 (normalized value of total water consumption)

2. Result of correlation

The correlation among the columns (An, Xn, Y2) was checked with a Python API. The results are shown below.

Figure 1 : Result of correlation (“Ax” w.r.t. Y1, Y2)

Figure 2 : Result of correlation (“Xx” w.r.t. Y1, Y2)

Based on the results above, X3, X8 and X11 (or A3, A8 and A11) should be used as the “feature columns”.
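The source does not say which Python API was used for the correlation check. One common option is `DataFrame.corr()` in pandas, sketched below. The data here is synthetic and only borrows the article's column names (X3, X8, X11, Y2); the extra column X1, the coefficients and the noise level are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the column names mirror the article,
# but every value below is made up for illustration only.
rng = np.random.default_rng(0)
n = 26
df = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X3": rng.normal(size=n),
    "X8": rng.normal(size=n),
    "X11": rng.normal(size=n),
})
# Hypothetical target that depends on X3, X8 and X11 plus noise.
df["Y2"] = (0.5 * df["X3"] + 0.4 * df["X8"] + 0.4 * df["X11"]
            + rng.normal(scale=0.5, size=n))

# Pearson correlation of every feature column with the target.
corr_with_target = df.corr()["Y2"].drop("Y2")

# Rank by absolute correlation and keep the three strongest candidates.
ranked = corr_with_target.abs().sort_values(ascending=False)
top_features = ranked.head(3).index.tolist()

print(ranked)
print("Candidate feature columns:", top_features)
```

Ranking by absolute value treats strong negative correlations as equally useful, which is usually what is wanted when choosing regression features.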

3. Usage of Regression Algorithm for prediction

Because “A3, A8, A11” and “X3, X8, X11” give similar prediction accuracy (Xn is slightly better), “X3, X8, X11” is used as the example below.

• X3: Volume index of restaurant receipts

• X8: Hong Kong Residents

• X11: Total (Cumulative Net passenger traffic)

• Y2: Total metered consumption

Training Set: May 2021 – June 2023 (26 records)

Test Set: Feb – April 2023 (3 records)

Unknown: July – Nov 2023 (5 records)

Below is the Jupyter notebook for the polynomial regression analysis (degree = 2).

Figure 3: Result of Training Set (Degree = 2)

Attached is the Jupyter notebook for the polynomial regression analysis (degree = 3).

Figure 4: Result of Training Set (Degree = 3)

Figure 4: Result of Training Set (Degree = 3)
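To put the two degrees in perspective, it helps to count how many polynomial terms each one generates from three input features. The sketch below uses `PolynomialFeatures` with `include_bias=False`, as in the source code, and compares the term count with the 26 training records. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 3   # X3, X8, X11
n_train = 26   # training records stated in the article

# A single dummy row is enough for PolynomialFeatures to count its output columns.
dummy = np.zeros((1, n_inputs))

n_terms = {}
for degree in (2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(dummy)
    n_terms[degree] = poly.n_output_features_
    # LinearRegression fits one coefficient per term plus an intercept.
    print(f"degree={degree}: {n_terms[degree]} terms + intercept "
          f"= {n_terms[degree] + 1} parameters for {n_train} records")
```

With degree = 3 the model has 20 free parameters for 26 records, against 10 for degree = 2, so a cubic fit has much more room to follow noise in the training data.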

4. Comparison of Accuracy

Table 1 : Result of analysis
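A comparison of this kind can be sketched as a loop over the two degrees that records the training and test scores for each. The example below is a minimal sketch on synthetic data with the same shape as the article's sets (26 training rows, 3 test rows, 3 features). The `target` function, the random seed and the use of `make_pipeline` are assumptions for illustration, so the printed scores will not match Table 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data shaped like the article's sets; the values are made up.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(26, 3))
X_test = rng.normal(size=(3, 3))


def target(X, noise):
    # Hypothetical ground truth: mildly quadratic in the features.
    return 1.0 + 0.05 * X[:, 0] - 0.02 * X[:, 1] ** 2 + noise


y_train = target(X_train, rng.normal(scale=0.01, size=26))
y_test = target(X_test, rng.normal(scale=0.01, size=3))

scores = {}
for degree in (2, 3):
    # The pipeline expands the features and fits least squares in one step,
    # so the same expansion is applied to training and test data.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    scores[degree] = (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )
    print(f"degree={degree}: train R2={scores[degree][0]:.4f}, "
          f"test R2={scores[degree][1]:.4f}")
```

Because the degree-3 terms include all degree-2 terms, the training score cannot get worse when the degree is raised. Only the test score shows whether the extra terms generalize.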

Conclusion:

1. A negative (“-ve”) accuracy value means that “degree = 3” overfits, so the algorithm should use “degree = 2” for the analysis.

2. With “degree = 2”, the predicted Y2 values for the “test set” and the “unknown records” are shown below.

[Result of “Testing Set”]

[Result of “unknown records” (based on values of X3, X8, X11)]

3. A trend is observed; however, the low accuracy (Table 1: 0.8099, 0.4755) is probably due to the low correlation (Figure 2: 0.48, 0.42, 0.41).

4. More records should be used for the analysis if they are available.
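The accuracy figure in the source code is the R² score from `r2_score`, which can drop below zero. The small worked example below uses made-up numbers, not the article's data, to show the three reference cases: a perfect prediction, a prediction that always returns the mean, and a prediction that is worse than the mean.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]

# R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
perfect = r2_score(y_true, [1.0, 2.0, 3.0])     # SS_res = 0   -> 1.0
mean_only = r2_score(y_true, [2.0, 2.0, 2.0])   # SS_res = 2   -> 0.0
worse = r2_score(y_true, [3.0, 2.0, 1.0])       # SS_res = 8   -> 1 - 8/2 = -3.0

print(perfect, mean_only, worse)
```

A negative score therefore means the model predicts the held-out records worse than simply guessing their average, which is consistent with reading it as a sign of overfitting.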

[Source Code]

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Training features (X3, X8, X11), one row per month (26 records)
x = [
    [1.039847204,0.040847404,0.054116502],
    [1.022893174,-0.00394017,-0.013269893],
    [1.04926611,-0.183972591,-0.283558137],
    [1.063394469,-0.038254929,-0.049247137],
    [1.047382329,-0.061533294,-0.065641831],
    [1.052091782,-0.084674439,-0.085602721],
    [1.024776955,-0.099224642,-0.09731722],
    [1.127443029,-0.184462662,-0.215576209],
    [0.869365009,-0.256037507,-0.322435981],
    [0.555715444,-0.576029258,-0.822362036],
    [0.501085791,-0.901112845,-1.313411136],
    [0.797781324,-0.900402242,-1.31888304],
    [0.993694566,-0.825450809,-1.157192126],
    [0.979566207,-0.900485554,-1.202473709],
    [1.006881034,-1.012447137,-1.359393347],
    [1.021009393,-0.805803869,-1.048818997],
    [0.969205411,-0.824245235,-1.082344044],
    [1.045498548,-0.860270342,-1.086400679],
    [0.997462128,-0.820241356,-0.964932815],
    [1.104837655,-1.368228761,-1.649488343],
    [1.170769996,-1.371801377,-1.656109277],
    [1.047382329,-1.212449945,-1.059762805],
    [1.134978153,-2.060978189,-2.204042358],
    [1.096360639,-2.365831652,-1.709062736],
    [1.120849795,-1.632323032,-1.577246597],
    [1.093534968,-2.596870646,-2.699071024]
]

# Training target (Y2), one value per month (26 records)
y = [
    1.046619367, 1.041033801, 1.04573485, 1.044801192,
    1.04970972, 1.027471194, 1.025461919, 1.014585626,
    0.978096536, 0.921782234, 0.933663165, 0.971042233,
    0.987345753, 1.012215992, 1.051413236, 1.035677554,
    1.040171123, 1.020329532, 0.988912769, 0.977070058,
    0.960591819, 0.969027498, 0.970807454, 0.953526596,
    0.986504915, 0.9926747
]

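# Convert the training data to NumPy arrays
x = np.array(x)
y = np.array(y)

# Expand the three features into all polynomial terms up to degree 2
x_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
# print(x_)

# Fit an ordinary least-squares model on the expanded features
model = LinearRegression().fit(x_, y)

print('r_sq = ', model.score(x_, y))
print('b_0 =', model.intercept_)
print('Model coefficient = ', model.coef_)

# Estimate the training targets and compute the R² score on the training set
y_est = model.predict(x_)
print(f' y_train = {y}')
print(f' y_est_train = {y_est}')
score_train = r2_score(y, y_est)
print(f' Score [train] = {score_train}')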

# Analysis of the test set
x_test = [
    [0.831689385,0.015334316,0.014145678],
    [0.964495958,0.073716455,0.103692934],
    [0.964495958,0.077289071,0.109816422]
]

# Apply the same degree-2 polynomial expansion to the test features
x_test_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_test)
# print(x_test_)

y_test = [0.969775516,0.993253459,1.005418309]

# Predict the test targets and compute the R² score on the test set
y_est_test = model.predict(x_test_)
score_test = r2_score(y_test, y_est_test)

print('-----------------------------------------')
print(f' Score [test] = {score_test}')
print(f' y_test = {y_test}')
print(f' y_est_test = {y_est_test}')

# Prediction for the unknown records (July - Nov 2023)
x_anal = [
    [1.112372779,-2.959924921,-2.960503331],
    [1.108605217,-1.527046016,-1.418708508],
    [1.028544517,-3.661858277,-3.568949647],
    [1.079406609,-1.31328692,-0.80107696],
    [1.06433636,-1.350061835,-0.750519648]
]

# Apply the same degree-2 polynomial expansion, then predict Y2
x_anal_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_anal)
y_anal_test = model.predict(x_anal_)

print('-----------------------------------------')
print(f' y_est_anal [July 2023 to Nov 2023] = {y_anal_test}')

# Month-end dates for the 26 training records (May 2021 - June 2023).
# Note: pandas 2.2 and later deprecate freq='M' in favor of freq='ME'.
dates = pd.date_range(start='2021-05-01', end='2023-07-01', freq='M')
print(f'date={dates}')

# Create a DataFrame from the data
data = pd.DataFrame({'Date': dates, 'y-actual': y, 'e-estimate': y_est})

# Set the 'Date' column as the index
data.set_index('Date', inplace=True)

# Create a figure and axis
fig, ax = plt.subplots()

# Plot the time series data
ax.plot(data.index, data['y-actual'], label='y-actual')
ax.plot(data.index, data['e-estimate'], label='e-estimate')

# Customize the plot
ax.set_xlabel('Month')
ax.set_ylabel('Value')
ax.set_title('Time Series Plot (Train data)')

# Set the x-axis tick locator and formatter for each month
month_locator = mdates.MonthLocator()
month_formatter = mdates.DateFormatter('%b %Y')
ax.xaxis.set_major_locator(month_locator)
ax.xaxis.set_major_formatter(month_formatter)

# Add a legend
ax.legend()

# Rotate x-axis labels for better readability (optional)
plt.xticks(rotation=90)

# Display the plot
plt.tight_layout()
plt.show()

# Plot with predicted values (July - Nov 2023)
dates = pd.date_range(start='2021-05-01', end='2023-12-01', freq='M')
print(f'date={dates}')

# The actual values for July - Nov 2023 are unknown, so the last known value
# (0.9926747) is repeated as a placeholder to keep both series the same length
new_y = np.append(y, [0.9926747,0.9926747,0.9926747,0.9926747,0.9926747])
new_y_est = np.append(y_est, y_anal_test)

# Create a DataFrame from the data
data = pd.DataFrame({'Date': dates, 'y-actual': new_y, 'e-estimate': new_y_est})