What is a join and how do I join in Python?

Joins or merges can be one of the most difficult concepts to understand in data analytics. What type of join should I use? Which values should I join on? Is this the correct join type for my data ?

These are all valid questions and we are dealing with difficult concepts!

Types of joins

joins

A join (often known as a merge also) is used when you have two tables, with the same labels and data types that you want to amalgamate into one.

Let's look at the example in the image below:

![tables](https://raw.githubusercontent. com/apcoyne/joins/main/Screenshot%202023-05-11%20at%2017.09.36.png "tables")

We have two tables and we want to merge them so we have one table which includes the 'staff' column from table one, the 'salary' column from table two, with the common column, 'city' . However in this case, by including all of the information we are left with null values , we also have two records for Dublin with conflicting information, how do we address that? We also don't have any salary information for Oslo, so do we leave that blank or exclude altogether?

This is where we need a strategy to address this, and different types of joins are how we do it!

What we are looking at at a fundamental level is what we do about missing values. Values that do not find a match when we merge them.

![types](https://raw.githubusercontent. com/apcoyne/joins/main/Screenshot%202023-05-11%20at%2017.13.18.png "types")

An inner join returns only the rows that have matching values in both tables. In other words, it keeps only the rows that have matching data in both tables and leaves out the non-matching rows.

A left join returns all the rows from the left table and the matching rows from the right table. If there are no matching rows in the right table, it returns NULL values. In other words, it keeps all the rows from the left table and leaves out the non-matching rows from the right table.

A right join is similar to a left join, but it keeps all the rows from the right table and the matching rows from the left table. If there are no matching rows in the left table, it returns NULL values. In other words, it keeps all the rows from the right table and leaves out the non-matching rows from the left table.

A full outer join returns all the rows from both tables, including the non-matching rows. If there are no matching rows in either table, it returns NULL values. In other words, it keeps all the rows from both tables and leaves out only the non-matching rows.

So if we look back to our first example we can see what happens in each instance of a different type of join being applied and what will be included and excluded in each case.

![full example](https://raw.githubusercontent. com/apcoyne/joins/main/Screenshot%202023-05-11%20at%2018.44.26.png "full example")

Inner Join

An inner join returns only the rows that have matching values in both tables. In other words, it returns only the rows where the key values are present in both tables.

In this example, the resulting table will only contain the rows where the id value is present in both tables.

import pandas as pd

# Create the left table
left_table = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['John', 'Emily', 'Kate', 'Tom']})

# Create the right table
right_table = pd.DataFrame({'id': [2, 3, 4, 5], 'age': [25, 30, 35, 40]})

# Perform an inner join on the 'id' column
inner_join_table = pd.merge(left_table, right_table, on='id', how='inner')

print(inner_join_table)

Left Join

A left join returns all the rows from the left table and the matching rows from the right table. If there are no matching rows in the right table, the result will contain null values for those columns.

In this example, the resulting table will contain all the rows from the left table and only the matching rows from the right table, with null values for the columns where there is no match.


import pandas as pd

# Create the left table
left_table = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['John', 'Emily', 'Kate', 'Tom']})

# Create the right table
right_table = pd.DataFrame({'id': [2, 3, 4, 5], 'age': [25, 30, 35, 40]})

# Perform a left join on the 'id' column
left_join_table = pd.merge(left_table, right_table, on='id', how='left')

print(left_join_table)


Right Join

A right join returns all the rows from the right table and the matching rows from the left table. If there are no matching rows in the left table, the result will contain null values for those columns.

In this example, the resulting table will contain all the rows from the right table and only the matching rows from the left table, with null values for the columns where there is no match.


import pandas as pd

# Create the left table
left_table = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['John', 'Emily', 'Kate', 'Tom']})

# Create the right table
right_table = pd.DataFrame({'id': [2, 3, 4, 5], 'age': [25, 30, 35, 40]})

# Perform a right join on the 'id' column
right_join_table = pd.merge(left_table, right_table, on='id', how='right')

print(right_join_table)


Full Outer Join

A full outer join returns all the rows from both tables, with null values for the columns where there is no match.

Here's a visual representation of a full outer join:

In this example, the resulting table will contain all the rows from both tables, with null values for the columns where there is no match.

I hope this explanation and visual representation help to clarify the different types of joins in Python!

import pandas as pd

# Create the left table
left_table = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['John', 'Emily', 'Kate', 'Tom']})

# Create the right table
right_table = pd.DataFrame({'id': [2, 3, 4, 5], 'age': [25, 30, 35, 40]})

# Perform a full outer join on the 'id' column
full_outer_join_table = pd.merge(left_table, right_table, on='id', how='outer')

print(full_outer_join_table)




What did we learn ?

joins

Joins are not as tough as they seem!

The real pain when trying to merge data could end up being everything EXCEPT your join functions. Your joins may not work correctly due so something like column types not matching, column titles or data not matching, dates not matching and a seemingly infinite amount of other issues that will go wrong on you and break your heart more than the actual concept and code required to join your data!

Avatar for apcoyne

Written by apcoyne

Loading

Fetching comments

Hey! 👋

Got something to say?

or to leave a comment.