In this notebook, I compare the performance of two linear models for $Q_{1c}$ and $Q_2$ which have units K/day. In both cases the inputs are the concatenated profiles of $s_l$ (g/kg) and $q_T$(K). This produces a matrix $X$ of iunputs, where each row is a 68 dimensional vector for a given horizontal location, for a given time point. Concatenating $Q_{1c}$ and $Q_2$ produces a matrix of outputs, $Y$, with 68 columns. For the tropics region of NGAqua, both $X$ and $Y$ have 821248 rows.

We will call the first linear regression, LR. It consists of the following steps

  1. Discard any column which has a variance of less then .001. This removes the columns corresponding to the upper atmospheric moisture field, which is essentially 0. Removing these columns is necessary, to avoid an ill-conditioned linear regression.
  2. Perform linear least squares regression (w/ constant profile added).

The next linear model we fit which we call MCR(n), prefilters the inputs $X$ by projecting onto the first $n$ MCA modes. It consists of these steps:

  1. Normalize each column by removing its mean, and dividing by the mass-weighted vertical average of the standard deviation of the corresponding physical variable $q_T$, $s_L$.
  2. Weight each column by the square root of the layer mass.
  3. Peform MCA analysis with $n$ modes, and compute the matrix $P_n$ which projects $X$ onto this modes $n$ modes.
  4. Project the inputs onto the MCA mode.
  5. Perform linear least squares regression (w/ constant profile) with the outputs of the previous step (a $821248\times n$ matrix) onto the full outputs $Y$.
In [1]:
import xarray as xr
import numpy as np
import pandas as pd

from sklearn.externals import joblib
from lib.models import get_linear_model, get_mca_mod
from lib.util import weighted_r2_score
import holoviews as hv