The Aggregate operation enables the merging of rows of source data into a new table comprising:
Dimension columns DIM
Rows are merged together when they share the same values across all Dimension columns, similar to categories in a pivot table.
Aggregation columns AGG
When rows are merged, the values in Aggregation columns are combined using an aggregation method (sum, count, mean, etc).
Reference aggregations REF
When rows are merged, columns set as a Reference aggregation take a value based on the value of another column. For example, pick the most recent value by referring to a column of dates.
Uses for the Aggregate operation include rolling-up transactional data to point-in-time data by reporting period, or to calculate totals and count across a data set.
Aggregate replaced the script operation ‘Aggregate rows (Pivot)’. Details on this legacy operation can be found here.
For source data where the column types are set as follows:
* REF set to pick deductible with max reporting date
The aggregated output data will be:
Click to edit the operation name
List of all columns in all input files to the operation. Click on a source column to add it to your output dataset.
Columns appear in the the order found in the input files. Search can be used to locate a particular column. Once added to the output dataset, columns are shown faded-out in the Source panel.
List of columns which will be created in the output dataset.
Columns can be configured as Dimensions, Aggregations or References. The order of items in the output columns panel (top to bottom) indicates the order of columns in the output dataset (left to right). Output columns can be can be reordered or renamed, duplicated and deleted
Configuring an Aggregate operation
To configure the aggregate operation:
1. Add source columns to the output columns
Click on a source column to add it to the output dataset. To add multiple columns, click the 'Add all' button to add all columns visible in the source list to the outputs at once. If the list has been filtered by search, clicking 'Add all' will add only the search results to the outputs.
Alternatively, click ‘Select column’ at the bottom of the output columns list then select a source column from the drop down list.
2. Set output column types
As each input column is added to the output table, it is configured by default as a Dimension. Click on the column type to set it to an Aggregation, then select the aggregation type, or set it as a Reference aggregation if required.
Tip: Configuring a large number of columns
To make configuring a large number of columns easier, when you add a new column its default configuration will match the one above it in the columns list.
If you use the ‘Add all’ button, all the added columns will take the same configuration.
So if you need to add 50 Aggregate Sum columns, first add and configure one, then add the rest below – they will all be set up as Aggregate Sum.
3. Rename, reorder or duplicate output columns
By default, output column names take the same name as an input column. Click on the output column name to edit it. This can be useful for aggregated columns, e.g. rename Premium to Total Premium.
To duplicate an output column, click the duplicate button at the end of the row.
The order of the output columns can be changed by dragging the output column to the desired position.
Where a column should be aggregated across merged rows, the column type should be change to AGG and an aggregation type chosen.
Total of all values
Standard deviation of values
Number of numeric values
Count Non Empty
Number of values, excluding empty values: ""
Number of distinct values
When aggregating rows it’s fairly common to find that some columns are neither Dimensions nor straight forward Aggregations. This can often be the case with categories that change, but the change is not significant enough to warrant a separate row – instead you just want to pick one of the values. For example, picking the most recent value based on another column of dates.
Reference aggregations help with this. When a source column is set to Reference, one of the values in that column is selected by looking at the value in another reference column (e.g. date) then choosing the source value from the row with either the minimum or maximum value in the reference column (e.g. earliest / latest date)
In this case you may determine that for point-in-time data you want the most recent deductible. For this, using a standard aggregation (Sum, Mean, Max) doesn’t give the value required.
The reference option allows the selection of the deductible value based on the Reporting Date.
Any non-numeric values in the reference column will be ignored.
Where there are multiple rows with the same reference value the lowest value of the source column will be chosen by natural sort (numeric sort, then text ordering).
Unused input columns warning
In some scenarios you will want all available source columns to flow through an Aggregate operation and be included as output columns. As new inputs are added and changed over time, new source columns may flow into the stage, but without a corresponding entry in the Aggregate output panel they will be ignored.
The Aggregate operation can be configured to generate a warning if there are columns in the input data that are not being processed by the operation.
The warning is displayed in the stage panel of pipeline editor. For example, if a new column called “Deductible” is added to an operation input and the operation configuration is not updated to include it in the output, then the following warning will be shown:
When aggregating large data sets, it can be hard to know whether all rows were aggregated as intended. Quantemplate can validate your configuration, generating a report of merge failures.
For the following source data:
If we want to get the total Premium for each policy in a reporting period, we would expect two output rows, one for 2017-11, one for 2017-12.
If we want the output to retain the Product information we would add it to the Aggregate configuration as a Dimension. However, as the rows for 2017-12 contain different values for Product, this would result in two output rows for 2017-12.
Detecting this type of issue in the data manually can be time consuming, so we use Key Columns to create a validation warning when this occurs.
Identifying Key Columns
The Aggregate operations allows us to identify the Key Columns. The set of key columns indicate that:
For each unique set of values across the key columns we expect only one output row.
In the example above we can mark the ID and Period columns as key columns by clicking the key icon in Column Type.
When the pipeline is next run the Validation report will contain an entry for each Aggregation operation where key columns have been identified.
Clicking on the failure description drills down to show the un-merged rows, highlighting the dimension that stopped the merge.
Dimensions stopping rows from merging can often be resolved by changing the column type from Dimension to a Reference aggregation, allowing the selection of the ‘correct’ value.