
Project Verification Requirements and Process

This section describes how we review projects to ensure they meet CalTRACK standards.

CalTRACK Methods Compliance

The CalTRACK Methods for computing avoided energy use define a broad set of requirements, including data quality and data sufficiency standards, procedures for matching sites to weather stations, the particular form of the regression model that generates the counterfactual, and methods for aggregating uncertainty within a portfolio. We demonstrate full compliance by actively verifying the CalTRACK requirements for each site under analysis, using the steps described below. Compliance is verified using outputs from unit tests and the open-source EEmeter and EEweather libraries.

Verification

CalTRACK compliance essentially takes the form of a checklist, each item of which is verified for each meter, at each site, at each step of the analysis, from ETL to savings estimation. Meters that fail one or more checks are placed in a queue for review so that compliance issues can be addressed appropriately. Failed data checks can include duplicate values, inadequate baseline data, null fields, etc. Please see Appendix A for a complete list of requirements for compliance and references to the official standards recorded in the CalTRACK Technical Documentation.

Details of the Compliance items are outlined in the following section:

Consumption data frequency must be specified (2.1.1.1)

During ETL, each meter trace is associated with a frequency indicator called an ‘interval’. This can take one of six values: 15min, 30min, hourly, daily, billing_monthly, and billing_bimonthly. The interval field is used to check whether or not the correct CalTRACK method is being applied. For instance, if the interval field indicates that hourly data is being used, the platform will throw an error if the CalTRACK daily method is applied.
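
As a sketch, this guard can be expressed as a simple lookup. The interval names come from the text above; the function name and mapping are illustrative, not the platform's actual API:

```python
# Illustrative interval/method compatibility guard. The interval names are
# the six values described in the text; the mapping itself is an assumption.
ALLOWED_INTERVALS = {
    "hourly": {"15min", "30min", "hourly"},            # hourly methods
    "daily": {"daily"},                                # daily methods
    "billing": {"billing_monthly", "billing_bimonthly"},
}

def check_method_compatibility(interval: str, method: str) -> None:
    """Raise if a trace's interval cannot be analyzed with the given method."""
    if interval not in ALLOWED_INTERVALS[method]:
        raise ValueError(
            f"interval {interval!r} cannot be used with the {method} method"
        )
```

For example, a trace with interval `hourly` passed to the daily method would raise a `ValueError` rather than silently producing a non-compliant model.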

If data from multiple meters are combined, must be noted (2.1.1.2)

During ETL, available versions of a meter trace are combined by dropping duplicate records and keeping the most complete record. If a record conflicts with another version, the project is flagged for potential multiple meters. If multiple meters are confirmed, the usage from those meters may be aggregated.

All consumption data must be converted to units of energy consumption (2.1.1.3)

During ETL, numerical values are converted into kWh or Therms depending on the type of project being measured (gas or electric).
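
A minimal sketch of this normalization step, assuming the source data labels its units; the unit names and conversion factors shown are examples, not the platform's full conversion table:

```python
# Illustrative unit normalization to kWh (electric) or therms (gas).
# The unit vocabulary and table below are assumptions for the sketch.
KWH_PER_UNIT = {"wh": 0.001, "kwh": 1.0, "mwh": 1000.0}
THERMS_PER_UNIT = {"btu": 1e-5, "therm": 1.0, "ccf": 1.037}  # 1 CCF ≈ 1.037 therms

def to_energy_units(value: float, unit: str, fuel: str) -> float:
    """Convert a raw meter value to kWh for electric or therms for gas."""
    table = KWH_PER_UNIT if fuel == "electric" else THERMS_PER_UNIT
    return value * table[unit.lower()]
```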

Presence of net metering must be flagged (2.1.1.5)

Net metering is a billing mechanism that credits solar energy system owners for the electricity they add to the grid. Net metering must be noted, and negative meter values will be flagged as has_net and saved along with the suspect data for review, as they indicate the potential of unreported net metering.
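
The negative-value screen can be sketched as follows, assuming a trace is represented as (timestamp, value) pairs; the function name and shape are illustrative:

```python
# Illustrative has_net screen: any negative reading marks the trace as
# suspected unreported net metering and returns the suspect rows for review.
def flag_net_metering(trace):
    """Return (has_net, suspect_rows) for a list of (timestamp, value) pairs."""
    suspect = [(ts, v) for ts, v in trace if v is not None and v < 0]
    return bool(suspect), suspect
```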

Selected weather station must include latitude and longitude coordinates (2.1.2.1)

As part of the ETL process, each project's site data is geocoded to lat/long coordinates, which are then used for weather station matching. This ensures that the weather station data being used is the best possible fit for fitting temperature-based energy usage models.

Weather station information must include IECC Climate Zone (2.1.2.3)

Weather station information must include IECC Moisture Regime (2.1.2.4)

Weather station information must include the Building America Climate Zone (2.1.2.5)

If the site is in the state of California, weather station information must include the California Building Climate Zone Area (2.1.2.6)

The IECC Code, which sets specific minimum performance levels for each of the components of the building envelope, has different requirements depending on the region. During ETL, the site is matched to its appropriate climate zone using lat/long coordinates, the same coordinates used to match the site to the appropriate weather station.

Data will only be mapped to weather stations that include the IECC Moisture Regime, as this is a requirement for data sufficiency.  

Building America Climate Zone is a system of best practices for improving the energy performance and quality of new and existing homes in five major climate regions. The site will be matched to the appropriate weather station that includes BACZ information, as this is a requirement for data sufficiency.

The California Building Climate Zone Area applies to California climate zones as defined by the California Energy Commission Title 24 as part of the Energy Code. During ETL, the site will be matched to the appropriate weather station that includes CBZA information.

Weather station information must include observed dry-bulb temperature data (2.1.2.7)

We use NOAA weather station data, which includes observed dry-bulb temperature.

Project data must include a project start date (2.1.3.1.1)

Project data must include an Intervention completion date (2.1.3.1.2)

Project data must include an intervention active date (2.1.3.1.3)

Project data must include a baseline period end (2.1.3.1.4)

The project start date is necessary to define the start of the blackout period that will be excluded from the analysis.  The project dates should be included in the data set provided.

The intervention completion date is necessary to define the end of the blackout period that will be excluded from the analysis.

The intervention active date, if not explicitly provided, is inferred from the project dates.

The baseline period end date must be defined within the data set. If not explicitly provided, this can be inferred from the intervention dates, as long as these dates are non-null.

We ask clients to indicate the dates in their database during which the project occurred. The last date before the project started and the first date after the project ended are used to define the blackout start date and intervention active date.

Building site data must include latitude and longitude coordinates of at least four decimal places (2.1.4.1)

During ETL, the lat/long coordinates are geocoded using the geocod.io service to each site using the provided address and/or zip code.

If building site data does not include requisite latitude and longitude coordinates, the latitude and longitude coordinates of the centroid of the ZIP Code Tabulation Area (ZCTA) may be used instead. (2.1.4.1.1)

If lat/long coordinates are unable to be mapped, the site’s ZCTA will be used to sufficiently match weather station data. The EEweather Python package has a mapping of ZCTA to lat/long coordinates, which is used by the platform when addresses are not available.

Building site data must include a time zone (2.1.6)

The time zone is matched during ETL based on meter location, including daylight savings participation. This is used to convert all timestamps to UTC.

Unless fitting baseline models using hourly methods, the number of days of consumption and temperature data missing should not exceed 37 (10%). (2.2.1.2)

This is the maximum number of allowable missing days within a baseline period, and baseline data sufficiency is a required component in creating tracking and analytics items.  During metering, items that have more than the allowable number of missing days are flagged.  Inclusion of flagged meters in the data set is customizable based on the client’s portfolio settings.    
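
A sketch of this sufficiency screen, assuming a 365-day baseline trace keyed by day with None marking missing values; the representation and function name are illustrative:

```python
# Illustrative baseline sufficiency check: no more than 37 missing days
# (10% of 365) are allowed for daily and billing methods.
MAX_MISSING_DAYS = 37

def baseline_sufficiency(daily_values: dict) -> bool:
    """True if a 365-day baseline has at most 37 missing days."""
    missing = sum(1 for v in daily_values.values() if v is None)
    missing += 365 - len(daily_values)  # days absent from the trace entirely
    return missing <= MAX_MISSING_DAYS
```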

For hourly methods, baseline consumption data must be available for over 90% of hours in the same calendar month as well as in each of the previous and following calendar months in the previous year. (2.2.1.2)

This is the maximum number of allowable missing hours within an hourly baseline period.  Baseline data sufficiency is a required component in creating tracking and analytics items. During metering, items that have more than the allowable number of missing hours are flagged.  Inclusion of flagged meters in the data set is customizable based on the client’s portfolio settings.    

If data is marked as NULL, NaN, or similar, it is considered missing (2.2.1.3)

Null or N/A data cannot be used for analysis and is excluded from the data set for the purposes of model building and savings reporting. Meters excluded due to insufficient data are listed in the Portfolio Admin section of the Platform.

Values of zero (0) are considered missing for electricity data, but not gas data (2.2.1.4)

Zero values are excluded from the data set for electricity data and are marked as null. A value of zero for gas data is treated as a valid reading, not a null value.
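
The fuel-dependent zero rule can be sketched in a few lines; the function name is illustrative:

```python
# Illustrative zero-handling rule: zeros become missing (None) for electric
# data but remain valid readings for gas data.
def normalize_zeros(values, fuel: str):
    """Apply the CalTRACK zero rule to a list of readings."""
    if fuel == "electric":
        return [None if v == 0 else v for v in values]
    return list(values)
```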

Less than 50% of high-frequency data can be missing. Missing data must be imputed as average for the time period. (2.2.2.1)

The higher frequency data will be used for analysis only if more than 50% of the values for the higher frequency data set are available.  If data is unavailable, the entire period is marked as missing data.
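
A sketch of the coverage-and-imputation rule for one aggregation period; the representation (a list of sub-period readings with None for missing) is assumed:

```python
# Illustrative high-frequency rule: with at least 50% of readings present,
# impute each missing reading as the period average; otherwise the whole
# period is marked missing (None).
def impute_period(readings):
    """Return imputed readings for a period, or None if coverage < 50%."""
    present = [v for v in readings if v is not None]
    if len(present) * 2 < len(readings):  # less than 50% coverage
        return None
    avg = sum(present) / len(present)
    return [avg if v is None else v for v in readings]
```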

If periods are estimated, they should be combined with subsequent periods (2.2.2.2)

If an estimated period is indicated in the source data, the data will be combined during ETL.

Daily temperature data has been checked for 50% coverage if higher frequency temperature data is being used to calculate daily temperature (2.2.2.3/2.2.3.3)

Temperature data is checked during ETL to ensure that if the high-frequency data is being used to compute daily averages, no more than 50% of high-frequency temperature data can be missing.  If data is unavailable, the entire period is marked as missing data.

When using billing data, estimated periods should be combined with the next period up to a 70-day limit. Estimated periods are counted as missing data for the purpose of determining data sufficiency. (2.2.3.1)

Length of billing periods is captured in the platform. If there are estimated billing periods in the data provided, they are combined during ETL. If any billing period length exceeds the 70-day limit, a warning is saved along with the suspect data for review and the data is excluded.

Temperature data has been checked for coverage across billing periods so that high-frequency temperature data covers 90% of each averaged billing period (2.2.3.2)

If average temperatures for billing periods are calculated by averaging higher frequency temperature data, the high-frequency temperature data must cover 90% of each averaged billing period. If data is unavailable, the entire period is marked as missing data.

Excessively long Billing Periods have been removed (>35 or >70 days) (2.2.3.4)

Combined estimated billing periods must not exceed 70 days.  Normal billing periods must not exceed 35 days.  If a billing period exceeds the respective limits, the project will be removed from the data set during ETL, and a warning is saved along with the suspect data for review.

Excessively short Billing Periods have been removed (< 25 days) (2.2.3.4)

Billing periods must not be fewer than 25 days. These readings typically occur due to meter reading problems or changes in occupancy and are removed from the analysis.   
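The two billing-period length screens above can be sketched together. Whether the 25-day minimum also applies to combined estimated periods is an assumption here, as is the function name:

```python
# Illustrative billing-period length screen: normal periods must be 25-35
# days; combined estimated periods may run up to 70 days (the 25-day floor
# for combined periods is an assumption of this sketch).
def billing_period_ok(days: int, combined_estimated: bool = False) -> bool:
    """True if a billing period's length passes the CalTRACK screens."""
    upper = 70 if combined_estimated else 35
    return 25 <= days <= upper
```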

When using hourly temperature data, data may not be missing for more than six (6) consecutive hours. Missing temperature data may be linearly interpolated for up to 6 consecutive missing hours. (2.2.4.1)

During ETL, temperature data is reviewed to ensure that no more than 6 consecutive hours are missing. If more temperature data is missing, a different source is selected. If 6 or fewer hours are missing, the temperatures are not interpolated in the platform; this step is optional, and we generally find that temperatures are missing for large blocks of time (days, weeks), where interpolation is not allowed by CalTRACK rules.
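
The consecutive-gap check can be sketched as a scan over an hourly temperature list with None for missing readings; the helper names are illustrative:

```python
# Illustrative gap check on hourly temperatures: find the longest run of
# missing (None) readings and compare it to the 6-hour CalTRACK limit.
def max_consecutive_missing(temps) -> int:
    """Length of the longest run of None values in an hourly series."""
    longest = run = 0
    for v in temps:
        run = run + 1 if v is None else 0
        longest = max(longest, run)
    return longest

def hourly_gap_ok(temps) -> bool:
    """True if no more than 6 consecutive hours of temperature are missing."""
    return max_consecutive_missing(temps) <= 6
```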

Excess consumption data should be trimmed prior to analysis (2.2.5)

The platform uses the EEmeter get_baseline_data function and get_reporting_data function for CalTRACK compliance at this step.

Projects should be excluded from analysis if the net metering status changes during the baseline period (2.2.6)

A change in net metering status triggers the creation of a new meter trace. Therefore, if the net metering status changes during the baseline period, data no longer meets data sufficiency requirements.

Projects should be flagged if electric vehicle charging is installed during the baseline or reporting period (2.2.7)

The installation of electric vehicle charging is considered a non-routine event. Depending upon the timing of the installation, a non-routine adjustment may be applied, or the project will be excluded.

If using billing data and the date provided is impossible (e.g., January 32nd), use the first of the month (2.3.1.1)

The ETL process will flag impossible dates and make the necessary adjustments so that the billing data is valid.  

If using billing data and the month or year is impossible, flag the date and remove it from the dataset (2.3.1.2)

During the ETL process, if the month or year is not possible, the dates will be removed from the data set, and a warning is saved along with the suspect data for review.

Where two time series overlap, combine them into a single time series by dropping duplicate records, using the most complete version possible. If timestamps conflict, flag for review. If multiple meters are present, they may be aggregated. (2.3.2.1)

During ETL, duplicate meter trace records are always combined into the most complete meter trace possible, even if those records were split across multiple data sources.

Ensure time zone and daylight savings consistency across the meter and temperature data (2.3.3)

The time zone is matched during ETL based on meter location, including daylight savings participation. All timestamps are converted to UTC internally.

Weather data should be converted to hourly intervals using interpolation and downsampling (2.3.4)

NOAA weather is sampled roughly hourly with minute-level timestamps. It is converted to hourly data by first computing a minute-resolution time series using nearest-value interpolation with a limit of 60 minutes, then downsampling to hourly temperatures by taking the mean of the interpolated minute-level readings. This is done using the EEweather Python library.

Presence of negative meter data should be flagged as possible unreported net metering (2.3.5)

Net metering must be noted. If a negative meter value is found, this may indicate unreported net metering, and a warning is saved along with the suspect data for review.

Extreme values (more than 3 interquartile ranges larger than the median) should be flagged as outliers for manual review. (2.3.6)

If extreme values are found, a warning is saved along with the suspect data for review. The projects removed for this reason can be reviewed in the Portfolio Admin section of the platform.
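
The extreme-value screen can be sketched with Python's statistics module; the function name is illustrative, and the quartile method (exclusive) is an assumption of this sketch:

```python
import statistics

# Illustrative outlier screen: flag readings more than three interquartile
# ranges above the median, per the rule quoted above.
def flag_outliers(values):
    """Return the readings exceeding median + 3 * IQR."""
    q1, med, q3 = statistics.quantiles(values, n=4)  # exclusive method
    threshold = med + 3 * (q3 - q1)
    return [v for v in values if v > threshold]
```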

The site should be matched to nearest weather station within the climate zones that meet data sufficiency requirements (2.4.1)

Each site is assigned lat/long coordinates using geocod.io, which are then used to assign the nearest weather station that complies with CalTRACK weather data sufficiency requirements.

If no weather stations within the climate zone meet data sufficiency requirements, fallback to the closest weather station that has complete data (2.4.1.1)

The Platform uses EEweather for weather station matching compliance.  The Platform uses station ranking that allows for naive distance fallbacks.   

Weather station matches further than 200 km from the site should be flagged (2.4.2)

Sufficiency requirements state that weather stations must be within 200 km of the site’s lat/long coordinates. If the matched station is more than 200 km from the project location, a warning is saved along with the suspect data for review.
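
The 200 km screen can be sketched with the haversine great-circle distance; the helper names and the (lat, lon) tuple convention are illustrative:

```python
import math

# Illustrative distance screen using the haversine formula; 200 km is the
# CalTRACK limit for site-to-station matches.
EARTH_RADIUS_KM = 6371.0

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def station_match_ok(site, station):
    """True if a (lat, lon) station is within 200 km of a (lat, lon) site."""
    return distance_km(*site, *station) <= 200.0
```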

The baseline must be 365 days immediately prior to the blackout start date (3.1.3)

The Platform uses the EEmeter get_baseline_data function with the following settings:

start=None, end=blackout_start_date, max_days=365, allow_billing_period_overshoot=True, ignore_billing_period_gap_for_day_count=True

*For billing data, it is important to note that this is not CalTRACK 2.0 compliant, because the CalTRACK methods do not account for the creation of baseline periods with strict 365-day limits.

Compute an hourly methods design matrix comprising a dependent variable of total consumption per hour and independent variables of seven (or fewer) temperature features, 168 binary time-of-week dummy variables, and an occupancy binary variable. (3.10.1)

The Platform uses the EEmeter to create a compliant hourly design matrix.

For hourly methods, avoided energy use should be calculated using the form:

(3.12.1)

The Platform uses the EEmeter to compute avoided energy use.

CDD balance point range has been limited to between 30 and 90 degrees (3.2.1.1)

The Platform uses the default recommended EEmeter settings for CDD ranges in Fahrenheit.

CDDs are not used as a variable in the calculation for gas data. (3.2.1.1)

Cooling Degree Days are not considered in gas calculations.

HDD balance point range has been limited to between 30 and 90 degrees (3.2.1.2)

The Platform uses the default recommended EEmeter settings for HDD balance point ranges in Fahrenheit.

Cooling balance point must be greater than or equal to the heating balance point (3.2.2.1)

If the cooling balance point is lower than the heating balance point, a warning is saved along with the suspect data for review, as this is not a possible scenario.  

A baseline model must have at least 10 non-zero degree days and at least 20 degree days per year, unless using billing data (3.2.2.2)

The Platform uses the EEmeter implementation. This avoids overfitting in the case where only a few days exist with usage and nonzero degree days, and the usage happens by chance to be unusually high on those days.

Balance point search must check at least every 3 degrees (or fewer) within the range (3.2.3)

The maximum gap between candidate balance points in the grid search is 3 degrees F or the equivalent in degrees C, which is verified in hourly data using EEmeter by checking the gap between balance points.
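
Constructing a compliant candidate grid can be sketched as follows; the function name and defaults (the 30-90 °F range from the requirements above) are illustrative:

```python
# Illustrative balance-point grid: candidate balance points spanning the
# 30-90 degree F range with a step no larger than 3 degrees.
def balance_point_grid(low=30, high=90, step=3):
    """Return candidate balance points with gaps of at most `step` degrees."""
    if step > 3:
        raise ValueError("CalTRACK requires a gap of 3 degrees or fewer")
    points = list(range(low, high + 1, step))
    if points[-1] != high:
        points.append(high)  # always include the top of the range
    return points
```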

Regression model must follow the form:

(3.3.2)

This is the accepted equation for the regression model and is the equation used in EEmeter.

Daily Average Usage has been correctly specified (3.3.3.1)

Mean daily values from usage data are used to calculate Usage per Day (UPD) within EEmeter. UPD is the average use (gas in therms, electricity in kWh) per day during the period for each site.

Cooling degree days have been correctly specified (3.3.4.1.1)

Heating degree days have been correctly specified (3.3.5.1.1)

The Platform uses the default recommended EEmeter settings for CDD and HDD specifications.

Daily models are fit using ordinary least squares regression (3.4.1)

Billing models are fit using weighted least squares regression (3.4.2)

The Platform uses the default recommended EEmeter settings for daily and billing models.

All combinations of candidate balance points are tried (3.4.3.1)

Only include candidate models where each parameter estimate is not negative (3.4.3.2)

Select the candidate model with the highest adjusted R-squared (3.4.3.3)

The Platform uses the default recommended EEmeter settings for candidate models.

If a day in the reporting period is missing a temperature value, the corresponding consumption value for the day should be masked (3.5.1.1)

The EEmeter drops temperature values where the corresponding consumption value is missing for the purpose of calculating metered savings.

If a day in the reporting period is missing a consumption value, the corresponding counterfactual for that day should be masked (3.5.2.1)

The EEmeter drops counterfactual values where the corresponding consumption value is missing for the purpose of calculating metered savings.

Avoided energy use should not be calculated when consumption data is missing (3.5.4.1)

The EEmeter does not calculate avoided energy use where the corresponding consumption value is missing.

For daily and billing methods, avoided energy use should be calculated using the form:

(3.6.1)

The Platform uses the default recommended EEmeter settings for avoided energy use.

For hourly methods, divide a week into 168 hourly intervals starting on Monday (3.8.2)

The platform uses the default EEmeter implementation for computing time features.

To determine occupancy status, fit a single HDD and CDD weighted least squares model to the baseline dataset using fixed balance points for heating (50 degrees) and cooling (65 degrees) using the following form:

(3.8.3)

To determine occupancy states of a building for hourly methods, group the predictions of the occupancy model into occupied and unoccupied modes by time of week (3.8.4)

To determine which of the temperature bins for the hourly methods to include in the regression design matrix, count the number of hours in each default bin and combine the bins with fewer than 20 hours (3.9.1)

The Platform uses the default recommended EEmeter implementation for occupancy status.  Temperatures are calculated in Fahrenheit.

To develop an hourly methods design matrix, first sort temperature values into bins to create a temperature matrix (3.9.1)

The platform uses the default EEmeter implementation for computing temperature features.

To aggregate single project results from individual time periods, the following should be calculated:

(4.1.1)

To aggregate multiple project results from the same time periods, the following should be calculated:

(4.2.1)

The Platform uses the above equations for aggregating project results.

A CVRMSE value should be used to define building-level model uncertainty    (4.3.2.1)

CVRMSE should be calculated using the following form:

(4.3.2.2)

The Platform uses the default recommended EEmeter implementation for CVRMSE calculations.
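
As a sketch, CVRMSE here follows the usual ASHRAE-style definition: root-mean-square error of the baseline model with an n - p degrees-of-freedom adjustment, normalized by mean observed use. Whether EEmeter applies exactly this adjustment should be checked against its documentation:

```python
import math

# Illustrative CVRMSE: sqrt(SSE / (n - p)) divided by mean observed use,
# where p is the number of model parameters (an assumption of this sketch).
def cvrmse(observed, predicted, n_params: int) -> float:
    """Coefficient of variation of the RMSE for a fitted baseline model."""
    n = len(observed)
    sse = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    rmse = math.sqrt(sse / (n - n_params))
    return rmse / (sum(observed) / n)
```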

A Fractional Savings Uncertainty value should be used to define portfolio-level uncertainty (4.3.2.3)

Fractional Savings Uncertainty should be calculated using the following form:

(4.3.2.4)

Portfolio Fractional Savings Uncertainty should be calculated using the following form:

(4.3.2.5)

The Platform uses the default recommended EEmeter implementation for FSU. A 90% confidence interval is applied to the analysis.  

Site level Bias should be calculated using the following form:

(4.3.2.6.1)

Portfolio level Bias should be calculated using the following form:

(4.3.2.6.1)

The Platform does not compute site or portfolio level bias.

This data is used in a series of automated checks, which together make up the “CalTRACK Scorecard”.

Display

When logged into your platform, each portfolio has a section called ‘Portfolio Admin’. Clicking on this area shows the status of the individual meters that have been excluded, with a table giving specific details for each excluded meter. A sample of the included information is below:


Additional Resources

A detailed list specifying the required site level verification necessary to demonstrate CalTRACK Compliance and provide a Certificate of CalTRACK Compliance can be found on the CalTRACK compliance page: http://www.caltrack.org/caltrack-compliance.html

The data requirements to apply CalTRACK methods to a single metered site include specific details regarding selection and data validity. Site level verification details can be found on the CalTRACK methods page: http://docs.caltrack.org/en/latest/methods.html.
