Machine Learning for Capacity Management

Dave Page, October 14, 2020

Machine Learning has always been a fascinating topic for me, but until recently I've had limited time to really explore it. In my previous role at EDB, I was responsible for our monitoring and management tools, and for a long time I've wanted to look at how we might use machine learning techniques to automate monitoring and management of PostgreSQL deployments. My new role in our CTO office gives me the opportunity to research and learn about technologies such as these, so I took the time to start digging in.

I plan on writing a number of blog posts over the coming months as I learn more and consider how we might use these technologies with PostgreSQL, both as part of our applications and to help manage our deployments. I'm starting here with time series prediction, which is important when running systems because it allows us to use historical usage data to predict future system requirements.

Much of the work I've done to date—including the code I've been hacking—has been based on learning from Udacity's Intro to TensorFlow for Deep Learning online course, which I highly recommend.


Capacity Management

It's quite common to find capacity management tools in database management applications. These often utilise linear trend analysis algorithms to analyse metrics that have been collected over time and predict general trends. This can be very useful to answer questions such as "based on historical data, when do I expect my storage to reach 90% of its capacity?" 
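As a sketch of that idea (the usage numbers and threshold below are made up purely for illustration), a simple linear fit over historical samples can answer exactly that question:

```python
import numpy as np

# Illustrative only: 30 days of storage usage (% of capacity), rising
# roughly half a percent per day with some noise
rng = np.random.default_rng(0)
days = np.arange(30)
usage = 50 + 0.5 * days + rng.normal(0, 1, 30)

# Fit a straight line (degree-1 polynomial) to the history
slope, intercept = np.polyfit(days, usage, 1)

# Solve slope * day + intercept = 90 for the day we expect 90% capacity
day_at_90 = (90 - intercept) / slope
print(f"Expected to reach 90% capacity around day {day_at_90:.0f}")
```

This is essentially what linear trend analysis tools do; as the next section discusses, it breaks down when the interesting behaviour is seasonal rather than a steady trend.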

Often capacity management is not just about ongoing upward or downward trends; it's also about being able to predict seasonal trends and patterns. An obvious example might be a web store that is generally busier on weekdays than at the weekend, sees a large, narrow spike in usage on Cyber Monday, and has smaller, wider peaks in usage around certain days such as New Year and Mother's Day. Other patterns and factors can be less obvious. In the modern world of cloud computing, knowing in advance when these peaks and troughs of usage may occur allows us to scale infrastructure up or down, responding ahead of time to likely peaks in load and saving money at expected quiet times.


Neural Networks

Whilst we can predict time series using non-machine learning techniques, it can be complex to do so, especially when there are multiple seasonalities in the data: general load varying throughout the week, additional monthly load caused by payroll processing at month end, and additional annual load caused by holidays, financial year end processing and so on, all coupled with noise and general upward or downward trends. Using a neural network to analyse the data makes this problem much easier to solve.

I experimented with artificial data generated in Python, containing seasonality, an upward trend, and noise, as well as with a simulated workload on a PostgreSQL server, on which I'm logging the number of active connections every 5 minutes.
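Generating that kind of artificial series is straightforward; a minimal sketch (the specific amplitudes and periods here are arbitrary choices of mine, not the exact values used for the results in this post):

```python
import numpy as np

# Arbitrary illustrative values: four years of daily data combining
# an upward trend, an annual seasonal cycle, and random noise
time = np.arange(4 * 365)
trend = 0.05 * time
seasonality = 10 * np.sin(2 * np.pi * time / 365)
rng = np.random.default_rng(42)
noise = rng.normal(0, 2, time.shape)

series = 50 + trend + seasonality + noise
```

Synthetic data like this is useful because we know exactly which patterns the model should recover.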

Model

After experimenting with various options, I created a neural network model using TensorFlow, based on the WaveNet architecture. Whilst WaveNet was originally intended for use with audio, it works well with time series data in general. Describing how it works in full would make this article far too long, but the key point is that it utilises multiple one-dimensional convolutional layers of neurons (in which time steps are examined in sequential groups), with an increasing dilation rate; in other words, each layer examines the data over a longer time scale than the one before, which helps the network identify different seasonal patterns, with the lower layers learning short-term patterns and the higher layers learning long-term patterns. The network is created with the code below:

# Create a sequential model
model = keras.models.Sequential()

# We're using the WaveNet architecture, so...
# Input layer
model.add(keras.layers.InputLayer(input_shape=[None, 1]))

# Add multiple 1D convolutional layers with increasing
# dilation rates to allow each layer to detect patterns over
# longer time frequencies
for dilation_rate in (1, 2, 4, 8, 16, 32):
    model.add(
        keras.layers.Conv1D(filters=32,
                            kernel_size=2,
                            strides=1,
                            dilation_rate=dilation_rate,
                            padding="causal",
                            activation="relu")
    )

# Add one output layer with one filter to give us one output
# per time step
model.add(keras.layers.Conv1D(filters=1, kernel_size=1))

There are a few other parameters in the code above that are perhaps too detailed for the broader subject of this article, but may be of interest nonetheless:

  • filters: This is essentially the number of neurons in the layer, which corresponds to the required output dimension.
  • kernel_size: The length of the 1D convolutional window; i.e. the number of time steps used to compute each output value.
  • strides: This is the number of time steps the filter will move as it processes the input.
  • padding: This determines how the input is padded so that time steps at the edges of the series can be processed (with a kernel size of 2, each output needs one value from before the current time step). Causal padding adds values only at the beginning of the time series, so that we don't end up predicting future values based on padding added after the end of the series.
  • activation: The activation function helps the network handle non-linear and interaction effects (where different variables in the network can be affected by one another). Rectified Linear Unit or ReLU is a simple algorithm that works well here in many cases - understanding how is left as an exercise for the reader!
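A useful consequence of these choices: with a kernel size of 2, each causal convolutional layer extends the model's receptive field by its dilation rate, so the six layers above let each output see 64 consecutive time steps. A quick check of the arithmetic:

```python
# Receptive field of a stack of dilated causal Conv1D layers:
# each layer adds (kernel_size - 1) * dilation_rate time steps
kernel_size = 2
dilation_rates = (1, 2, 4, 8, 16, 32)

receptive_field = 1 + sum((kernel_size - 1) * d for d in dilation_rates)
print(receptive_field)   # 64
```

The final kernel-size-1 output layer adds nothing to the receptive field; it only maps the 32 filters down to a single predicted value per time step.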

The resulting model looks like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d (Conv1D)              (None, None, 32)          96
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 32)          2080
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 32)          2080
_________________________________________________________________
conv1d_3 (Conv1D)            (None, None, 32)          2080
_________________________________________________________________
conv1d_4 (Conv1D)            (None, None, 32)          2080
_________________________________________________________________
conv1d_5 (Conv1D)            (None, None, 32)          2080
_________________________________________________________________
conv1d_6 (Conv1D)            (None, None, 1)           33
=================================================================
Total params: 10,529
Trainable params: 10,529
Non-trainable params: 0
_________________________________________________________________
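One detail worth spelling out is how a series is turned into the training and validation sets the model consumes. Each training window pairs every value but the last as input with every value but the first as target, giving one target per time step to match the one-output-per-step final layer. A plain-NumPy sketch of that windowing (the helper name and window size are my own; a real pipeline would typically build this with tf.data):

```python
import numpy as np

def windowed_dataset(series, window_size):
    """Illustrative helper (not from the original code): slice a 1D
    series into overlapping windows of window_size + 1 values, then
    split each window into inputs (all but the last value) and
    targets (all but the first), shaped (batch, time steps, 1)."""
    n = len(series) - window_size
    windows = np.stack([series[i:i + window_size + 1] for i in range(n)])
    x = windows[:, :-1, np.newaxis]   # inputs:  steps t .. t+w-1
    y = windows[:, 1:, np.newaxis]    # targets: steps t+1 .. t+w
    return x, y

x, y = windowed_dataset(np.arange(100.0), window_size=64)
print(x.shape, y.shape)   # (36, 64, 1) (36, 64, 1)
```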


Training

Once the neural network has been created, it needs to be trained. To do this, I fed it both the training data and the validation data and started the learning process; this type of training is known as supervised learning. Training works in a loop, each iteration of which is known as an epoch, continually refining the model's weights and testing the result against the validation data. One potential issue to be aware of is over-fitting, which essentially occurs when the network learns the training data itself rather than the patterns within it, typically because it has been trained too many times on the same data. To avoid this, we use a technique called early stopping, which halts training once no significant further improvement has been seen. We also save the model each time we get the best result so far, so that we can load and use it later, as the final training epochs might not yield the best performance:

# Save checkpoints when we get the best model
model_checkpoint = keras.callbacks.ModelCheckpoint(
    "checkpoint.h5", save_best_only=True)

# Use early stopping to prevent over fitting
early_stopping = keras.callbacks.EarlyStopping(patience=50)

history = model.fit(train_set, epochs=500,
                    validation_data=valid_set,
                    callbacks=[early_stopping, model_checkpoint])

# Training is done, so load the best model from the last checkpoint
model = keras.models.load_model("checkpoint.h5")
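One step the listing above omits is compiling the model, which fit() requires. A minimal sketch of what that might look like; the Huber loss and Adam optimizer are common choices for noisy time series, but they are my assumptions rather than the exact settings used for the results in this post:

```python
from tensorflow import keras

# Tiny stand-in model so this snippet is self-contained; in practice
# this would be the WaveNet-style model built earlier
inputs = keras.Input(shape=(None, 1))
outputs = keras.layers.Conv1D(filters=1, kernel_size=1)(inputs)
model = keras.Model(inputs, outputs)

# Compile with a loss that is robust to noisy outliers, plus MAE as a
# human-readable metric (assumed choices, not taken from the post)
model.compile(loss=keras.losses.Huber(),
              optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["mae"])
```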


Prediction - Simulated Data

Once training is complete, the network can be used to make predictions, for which TensorFlow provides various methods; I'm using Model.predict(). The graph below shows my generated training and validation data, along with the predicted data overlaid on the validation data so we can see how well it did. Accuracy can also be measured in terms of the Mean Absolute Error (MAE), which for this test was calculated to be 4.529:

It's easy to see that the prediction captured both the seasonality and upward trend of the data, whilst largely ignoring the noise.
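MAE is simply the average absolute difference between predicted and actual values, so it's expressed in the same units as the metric being forecast. The numbers below are made up purely to show the calculation:

```python
import numpy as np

# Made-up values purely to illustrate the metric
actual = np.array([10.0, 12.0, 11.0, 13.0])
predicted = np.array([9.5, 12.5, 10.0, 13.5])

mae = np.mean(np.abs(actual - predicted))
print(f"MAE: {mae:.3f}")   # MAE: 0.625
```

An MAE of 4.529 therefore means the predictions were, on average, about 4.5 units away from the true values.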


Prediction - Real Data

Unfortunately, the 'real' data that I'm collecting (albeit from a simulated workload on a PostgreSQL server) is being collected in real time, so at the time of writing I have just under a week's worth. The workload attempts to simulate a weekly pattern of usage, with the number of users rising and falling throughout the day, and with a similar pattern but a lower peak number of users at the weekend. Feeding the data collected into the test program, we see an MAE of 4.369 for the predicted data, with a graph as shown below:


Conclusion

Whilst it is relatively easy to predict general trends from historical monitoring data to aid capacity planning for our PostgreSQL deployments, being able to detect seasonal patterns gives us much greater insight into variations in capacity requirements, allowing us to scale up capacity in anticipation of increased demand, and to scale down when appropriate to reduce cost.


Next Steps

Aside from further refinement of this work, including more in-depth testing with more complex seasonalities, I intend to start looking at how similar techniques might be used in monitoring systems to detect alert conditions automatically based on past behaviour of a database server. Look out for future blog posts as that work progresses!

Dave Page, Vice President & Chief Architect, Database Infrastructure

Dave Page is Vice President and Chief Architect, Database Infrastructure, currently working in the CTO team on research and development, best practices with Postgres, and providing high-level guidance and support for key customers. Dave has been working with PostgreSQL since 1998 and is one of five members of the open source project's Core Team, as well as serving as Secretary of the Board of PostgreSQL Europe and Chairman of the PostgreSQL Community Association of Canada. He joined EDB in 2007 and has been influential in the company’s direction and development of critical database management tools and product packaging and deployment. Prior to EDB, Dave spent more than a decade with The Vale Housing Association as Head of IT. He joined the organization after spending four years as an electronics technician with the Department of Particle and Nuclear Physics at the University of Oxford. Dave holds a Higher National Certificate in electronic engineering from Oxford Brookes University and a Master’s degree in information technology from the University of Liverpool.