Forecasting Web Traffic with Scout and Prophet

Forecasting traffic to your web app is important for capacity planning, but generating a seasonally accurate model of your traffic is pretty daunting. 

If you under-forecast: you risk overloaded servers and slow responses when traffic outpaces capacity.

If you over-forecast: you could spend a lot more 💰 on infrastructure you don't need.

Looking to get out ahead of your growth the easy way? You're in luck. I've created a shared Google Colab notebook that creates a seasonal forecast of your traffic in less than a minute. Just save a copy of the notebook, enter your Scout API token, and boom: a seasonal forecast of your traffic:

[Image: seasonal traffic forecast generated by the notebook]

Under the hood, the notebook uses standard Python data science libraries (like Pandas) and Facebook Prophet (a forecasting procedure) to train a model on data collected by Scout (exposed via the Scout API).

Read on for more on how this seasonal traffic forecast works.

Dataset

We need some traffic data! Ideally, this data should be restricted to just the requests hitting our app servers (and not the cache) as these trigger most of the load. The data available via an Application Performance Management (APM) service like Scout is perfect for this. Scout can monitor Ruby, Python, and Elixir apps (with more language support to come).

Via the Scout API, we can easily gather metrics like throughput, response time, and error rates.  

Assuming you've already signed up for Scout, create an API key within the settings area of the Scout UI:

[Image: creating an API key in the settings area of the Scout UI]

Tools

I'll be using Google Colab and Python in this tutorial. All of the libraries mentioned are free and open-source.

Let's get started!

Exporting data from Scout

The code below exports four weeks of throughput data from Scout. The time series data is accumulated in the raw_ts list.

import requests
from datetime import datetime, timedelta

# `HEADERS` must include your Scout API key (set earlier in the notebook).
url = "https://scoutapm.com/api/v0/apps/APP_ID/metrics/throughput" # Replace `APP_ID` with your app id.

today = datetime.now()

# The API has a max timeframe of 2 weeks, but 30 days of data is stored. We'll
# fetch 4 weeks of data by making two requests.
end_times = [today - timedelta(days=14), today]  # avoid shadowing the `range` builtin

raw_ts = []

for end_time in end_times:

    params = {
        "from": (end_time - timedelta(days=14)).isoformat(),
        "to": end_time.isoformat()
    }

    r = requests.get(url, params=params, headers=HEADERS)
    res = r.json()
    raw_ts += res["results"]["series"]["throughput"]

Example raw_ts data:

[['2019-02-02T10:00:00Z', 64.79166666666667],
 ['2019-02-02T12:00:00Z', 222.11666666666667],
 ['2019-02-02T14:00:00Z', 232.7],
 ['2019-02-02T16:00:00Z', 223.35],
 ['2019-02-02T18:00:00Z', 224.99166666666667],
  ...
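Throughput isn't the only metric we can pull this way. The same endpoint pattern should serve the other metric types mentioned earlier (like response time), with the metric name as the last path segment. This is a hypothetical sketch — the exact metric names accepted by the endpoint are an assumption here, so check the Scout API docs:

```python
# Assumed endpoint pattern, generalized from the throughput URL above.
# Metric names like "response_time" are illustrative -- verify against
# the Scout API documentation.
BASE = "https://scoutapm.com/api/v0/apps/{app_id}/metrics/{metric}"

def metric_url(app_id, metric):
    """Build a Scout API metrics URL for the given app and metric type."""
    return BASE.format(app_id=app_id, metric=metric)

print(metric_url(1234, "response_time"))
# https://scoutapm.com/api/v0/apps/1234/metrics/response_time
```

Swapping the metric name into the request used above would let you forecast response times with the exact same Prophet pipeline.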

Create a DataFrame

Let's create a Pandas DataFrame with the raw data. This lets us do all sorts of manipulations and visualizations of the data.

import pandas as pd

df = pd.DataFrame(raw_ts, columns=["time", "throughput"])
df.time = pd.to_datetime(df.time)
df = df.set_index("time")
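With the timestamp index in place, Pandas makes aggregations trivial. This isn't part of the notebook, but as a quick sanity check you could roll the 2-hour samples up into daily averages (toy data stands in for the Scout series here):

```python
import pandas as pd

# Toy stand-in for the throughput DataFrame: 2-hour samples over two days.
idx = pd.date_range("2019-02-02", periods=24, freq="2H")
df = pd.DataFrame({"throughput": [100.0] * 12 + [200.0] * 12}, index=idx)

# Average throughput per day -- a quick way to eyeball the data's shape.
daily = df.resample("D").mean()
print(daily["throughput"].tolist())  # [100.0, 200.0]
```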

Plot the throughput

Let's get a feel for the data by plotting it within our notebook:

import matplotlib.pyplot as plt

plt.figure(figsize=(14, 8))
plt.plot(df.index.values, df[df.columns[0]].values)
plt.xlabel('Time', fontsize=12)
plt.ylabel('Throughput', fontsize=12)

...which generates:

[Image: plot of raw throughput over the four-week window]

We can see five distinct traffic spikes followed by two smaller spikes. This smells like the weekly traffic pattern of a business app!

We can also see two significant drops in traffic. These might be from deploys and shouldn't be used in our forecast. We'll remove these - Prophet does just fine with gaps in data:

df = df[df.throughput > 100]
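The hard-coded threshold of 100 works for this particular dataset. For an app with a different traffic scale, a threshold relative to the median adapts automatically. This is an illustrative helper, not part of the notebook:

```python
import pandas as pd

def drop_low_outliers(s, frac=0.25):
    """Keep points above frac * median. Unlike a hard-coded cutoff,
    this threshold scales with the data. (Illustrative sketch.)"""
    return s[s > frac * s.median()]

# Toy throughput series with two deploy-like dips.
throughput = pd.Series([5.0, 210.0, 220.0, 230.0, 215.0, 8.0, 225.0, 218.0])
clean = drop_low_outliers(throughput)
print(sorted(clean.tolist()))  # [210.0, 215.0, 218.0, 220.0, 225.0, 230.0]
```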

Time Series Modeling with Prophet

Now that we have our outlier-free data it's time to model the traffic with Prophet. Straight from the project homepage:

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

That sounds a lot like web traffic patterns, doesn't it! We'll use Prophet to model our web traffic and forecast it into the future.

To start, I'll initialize a Prophet model and fit it with our historical data. Note how this pattern is very similar to ML algorithms in Scikit-Learn. Prophet requires the DataFrame to have a `ds` (datestamp) column and a `y` (value) column, so I handle the renaming too:

from fbprophet import Prophet  # in newer releases: `from prophet import Prophet`

model = Prophet()
df["time_for_prophet"] = df.index  # Prophet needs the timestamp as a column (`ds`), not as the index
model.fit(df.rename(columns={"time_for_prophet": "ds", "throughput": "y"}))

Making a forecast

Now I'll forecast two weeks of traffic:

df_forecast = model.make_future_dataframe(periods=14, freq="D")
forecast = model.predict(df_forecast)

[Image: plot of the two-week traffic forecast]

Prophet generates the forecast! We can see it isn't confident enough to show the daily spikes in traffic from our historical data set, but it still shows the distinct weekday and weekend traffic pattern. We can also see a slight increase in traffic in the second week.
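Beyond the plot, the `forecast` DataFrame itself is worth inspecting: Prophet returns a `yhat` column (the prediction) along with `yhat_lower`/`yhat_upper` (the uncertainty interval). A minimal sketch with synthetic values standing in for real model output, showing how you'd pull out the predicted peak day:

```python
import pandas as pd

# Synthetic stand-in for Prophet's output columns.
forecast = pd.DataFrame({
    "ds": pd.to_datetime(["2019-03-01", "2019-03-02", "2019-03-03"]),
    "yhat": [220.0, 80.0, 75.0],
    "yhat_lower": [180.0, 60.0, 55.0],
    "yhat_upper": [260.0, 100.0, 95.0],
})

# The day with the highest predicted throughput:
peak_day = forecast.loc[forecast["yhat"].idxmax(), "ds"]
print(peak_day.date())  # 2019-03-01
```

The same lookup on the real `forecast` tells you which upcoming day to provision for.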

Trend and Seasonality

Prophet exposes the model's forecast components so we can look at the overall trend and seasonality:

model.plot_components(forecast);

[Image: trend, weekly, and daily seasonality components of the forecast]

What have we learned from this forecast?

During the day, our traffic volume is substantially lower from 00:00 to 08:00 UTC. Across the week, traffic is substantially lower on weekends. We could scale down the number of application servers for a third of every day and for the entire weekend. That's a substantial portion of time where we could be saving money on infrastructure.
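A quick back-of-the-envelope calculation shows how big that window is, assuming we scale down during weekday nights (00:00-08:00 UTC) and the full weekend:

```python
# Rough scale-down window per week (assumed from the seasonality plots):
weekday_night_hours = 5 * 8   # 8 quiet hours on each of 5 weekdays
weekend_hours = 2 * 24        # all of Saturday and Sunday
scalable = weekday_night_hours + weekend_hours
fraction = scalable / (7 * 24)
print(f"{scalable} of 168 hours ({fraction:.0%})")  # 88 of 168 hours (52%)
```

Roughly half the week is a candidate for reduced capacity.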

Your turn!

Python is an excellent choice for analyzing web performance patterns thanks to its wonderful ecosystem of data science tools. With Prophet and the Scout API we can quickly forecast traffic, even accounting for all of the seasonality patterns in your app. Detecting these patterns early can prevent problems during recurring peak periods.

Want to skip all of the data science and just forecast your traffic? Save a copy of this shared Google Colab notebook, insert your Scout API token, and run the notebook! That's it!