# Calculating Weighted Averages When Joining Tables in SQL

•
•
•
•
•
•
•
•
•
•

I stumbled upon a very interesting jOOQ question on Stack Overflow that required the calculation of a weighted average. Why is that.
Problem description
Assuming you have this database (using PostgreSQL syntax):

create table transactions (
id bigint not null primary key,
lines bigint not null,
price numeric(18, 2) not null,
profit numeric(18, 2) not null
);

create table lines (
id bigint not null primary key,
transaction_id bigint not null references transactions,
total bigint not null,
quantity bigint not null,
profit numeric(18, 2) not null
);

As can be seen, this schema is slightly denormalised as the number of lines per transaction are precalculated in the transactions.lines column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.
Now, in the previously linked Stack Overflow question, a report was desired that would calculate:
An aggregation of sums as provided by the line items
An aggregation of averages as provided by the transactions
This would be straightforward with two separate queries:
Sums provided by the line items

SELECT
sum(profit) AS total_profit,
sum(total) AS total_sales_amount,
sum(quantity) AS total_items_sold
FROM lines

Averages provided by the transactions

SELECT
avg(lines) AS avg_items_p_trx,
avg(price) AS avg_price_p_trx,
avg(profit) AS avg_profit_p_trx
FROM transactions

So far so good.
Doing it in one query
Now, these queries are simplified from the original, which needed to join the two tables in order to add additional predicates. Also, let’s assume that these tables are quite large, so running two queries might lead to the report being too slow. A single query would be much better.
We might be attempted to simply combined the two:

— Wrong query
SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
avg(t.lines) AS avg_items_p_trx,
avg(t.price) AS avg_price_p_trx,
avg(t.profit) AS avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

But this query is wrong. While the sums are still correct, the averages are not, simply because the join produces duplicate transaction rows per lines. Imagine a transaction having 3 or 5 lines:

SELECT
l.id AS line_id,
t.id AS transaction_id,
t.lines,
t.price
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

The output would be:

LINE_ID TRANSACTION_ID LINES PRICE
——————————————-
1 1 3 20.00
2 1 3 20.00
3 1 3 20.00
4 2 5 100.00
4 2 5 100.00
4 2 5 100.00
4 2 5 100.00
4 2 5 100.00

The average number of lines “avg_items_p_trx” should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate avg(t.lines) over the entire data set, we get 4.25 (3×3 lines + 5×5 lines) / 8 items.
The average price “avg_price_p_trx” should be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate avg(t.price) over the entire data set, we get 80.00 (3×20.00 + 5×100.00) / 8 items.
How can this be fixed?
Given that each transaction is duplicated because of the join with lines, we have to calculate a weighted average, not an ordinary average. The idea is that instead of using the AVG() aggregate function, we now have to divide the value we want to get an average of by the number of items (i.e. the number of times the value is repeated because of the join), and then divide the sum of that division by the number of transactions.
Prose never describes logic well, so let’s use code. The correct query is:

SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
sum(t.lines / t.lines) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

With the above data set:

LINE_ID TRANSACTION_ID LINES LINES/LINES PRICE PRICE/LINES
—————————————————————-
1 1 3 1 20.00 6.66
2 1 3 1 20.00 6.66
3 1 3 1 20.00 6.66
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00
4 2 5 1 100.00 20.00

We now get the correct weighted averages:
The average number of lines “avg_items_p_trx” is now 4 =
(3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / distinct transactions
The average price “avg_price_p_trx” is now 60.00 =
(20.00/3 + 20.00/3 + 20.00/3 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5) / 2 distinct transactions
Note that “avg_items_p_trx” can be simplified:

SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
count(*) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM lines AS l
JOIN transactions AS t ON t.id = l.transaction_id

Done!
Normalised version
Notice that this solution profited from the fact that the number of lines per transaction was pre-calculated. We can of course also calculate it on the fly, e.g. using window functions. If it weren’t available, we could do it like this:

SELECT
sum(l.profit) AS total_profit,
sum(l.total) AS total_sales_amount,
sum(l.quantity) AS total_items_sold,
count(*) / count(DISTINCT t.id) avg_items_p_trx,
sum(t.price / l.lines) / count(DISTINCT t.id) avg_price_p_trx,
sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx
FROM (
SELECT
l.*,
count(*) OVER (PARTITION BY l.transaction_id) lines
FROM lines AS l
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

Or, we turn the entire join into a 1:1 relationship by pre-aggregating all the data from lines into one row per transaction. This works because we only calculate sums from the lines table:

SELECT
sum(l.profit_per_transaction) AS total_profit,
sum(l.total_per_transaction) AS total_sales_amount,
sum(l.quantity_per_transaction) AS total_items_sold,
avg(l.lines_per_transaction) AS avg_items_p_trx,
avg(t.price) AS avg_price_p_trx,
avg(t.profit) AS avg_profit_p_trx
FROM (
SELECT
l.transaction_id
sum(l.profit) AS profit_per_transaction,
sum(l.total) AS total_per_transaction,
sum(l.quantity) AS quantity_per_transaction,
count(*) AS lines_per_transaction
FROM lines AS l
GROUP BY l.transaction_id
) AS l
JOIN transactions AS t ON t.id = l.transaction_id

X ITM Cloud News ## The Difference Between SQL’s JOIN .. ON Clause and the Where Clause

Sun Nov 24 , 2019
Spread the love          A question that is frequently occurring among my SQL training‘s participants is: What’s the difference between putting a predicate in the JOIN .. ON clause and the WHERE clause? I can definitely see how that’s confusing some people, as there seems to be no difference at first sight, […] ## You May Like

• ### Official Microsoft Surface Duo SDK Tech Community has been launched! Will you join?

Cloud Computing – Consultancy – Development – Hosting – APIs – Legacy Systems

X-ITM Technology helps our customers across the entire enterprise technology stack with differentiated industry solutions. We modernize IT, optimize data architectures, and make everything secure, scalable and orchestrated across public, private and hybrid clouds. The enterprise technology stack includes ITO; Cloud and Security Services; Applications and Industry IP; Data, Analytics and Engineering Services; and Advisory.

Watch an animation of  X-ITM‘s Enterprise Technology Stack

We combine years of experience running mission-critical systems with the latest digital innovations to deliver better business outcomes and new levels of performance, competitiveness and experiences for our customers and their stakeholders.

X-ITM invests in three key drivers of growth: People, Customers and Operational Execution.

The company’s global scale, talent and innovation platforms serve 6,000 private and public-sector clients in 70 countries.

X-ITM’s extensive partner network helps drive collaboration and leverage technology independence. The company has established more than 200 industry-leading global Partner Network relationships, including 15 strategic partners: Amazon Web Services, AT&T, Dell Technologies, Google Cloud, HCL, HP, HPE, IBM, Micro Focus, Microsoft, Oracle, PwC, SAP, ServiceNow and VMware

.