[DRAFT] Where git merging can go wrong

This is a rough draft, and I still need to edit it.

This post describes a common pitfall when contributing code with git: how a common pattern, pulling main into a feature branch, can result in lost work.

You begin a feature branch, and as you develop it, main moves forward too.

          o---o  feature
         /
o---o---o---o---o  main

As main diverges, you’ll want to incorporate that work into your feature so that you can benefit from the bugfixes and improvements going on in main. In a related post, I wrote about why you should rebase your feature onto main. However, I didn’t explain why the alternative, merging main into your feature, is easy to mess up.

Let’s move forward with the example, describe the scary consequence, and then understand why it happens.

From the feature branch, you merge in the latest from main, resolve some conflicts, complete the merge, and add another commit that finishes your feature.

          o---o---o---o feature
         /       /
o---o---o---o---o  main

With feature completed, a team member reviews your work and merges your changes into main¹. There are no conflicts.

          o---o---o---o feature
         /       /     \
o---o---o---o---o---o---o main

To everyone’s dismay, a critical bugfix added previously to main is now nowhere to be found!

What happened? Two small failures occurred: one by the feature developer and another by the reviewer. Both are plausible. The feature developer was only trying to sync with the latest main, and the reviewer didn’t review the non-conflicting changes closely enough.

Let’s assume that the critical bugfix is B, and let’s label M1 and M2 for the first and second merge respectively.

          o---o--M1---o feature
         /       /     \
o---o---o---B---o---o--M2 main

During M1 there were conflicts, and they were not resolved correctly: the author’s resolution left out the important bugfix added in B.

The merge algorithm in git is surprisingly simple. Let’s rewind before the first merge. How does git merge main into feature?

          o---o  feature
         /
o---o---o---o---o  main

First, you must understand what commits represent. You might assume each commit stores changes, or a patch. Instead, every commit stores a snapshot of the project, not unlike a zip file of all the files at a given time. Git’s data structures do this while minimizing redundancy. Every commit also records its parents: usually one, two or more for a merge, and none for the very first commit. Those parents have parents in turn, and these links are what connect the graph. The internals are so simple they are worth peeking into! What distinguishes a merge from a non-merge commit? The number of parents!
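
To make the snapshot idea concrete, here is a minimal sketch of a commit as a data structure. The `Commit` class, its field names, and the file contents are all hypothetical; real git stores snapshots as tree and blob objects, which this toy model glosses over.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Commit:
    """Toy model of a commit: a full snapshot plus links to parents."""
    snapshot: Dict[str, str]             # path -> contents of every file
    parents: Tuple["Commit", ...] = ()   # empty for the very first commit

    def is_merge(self) -> bool:
        # A merge is simply a commit with more than one parent.
        return len(self.parents) > 1


root = Commit({"app.py": "v1"})
feature = Commit({"app.py": "v1", "new.py": "x"}, (root,))
main = Commit({"app.py": "v2"}, (root,))
merge = Commit({"app.py": "v2", "new.py": "x"}, (feature, main))

print(feature.is_merge())  # False
print(merge.is_merge())    # True
```

Note that nothing in `Commit` describes a change. A change only appears when you compare two snapshots.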

Now onto merges!

Merge algorithm for merge(E, G):

          D---E  feature
         /
o---o---C---F---G  main

1. Find C, the common ancestor or merge base: C = common_ancestor(E, G)
2. Compute diff1 = diff(C, E)
3. Compute diff2 = diff(C, G)
4. Where the diffs don’t overlap (don’t touch the same lines), apply the changes to C
5. Where the diffs do overlap, insert conflict markers to show each version
6. When the user commits³, set the parents to E and G
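
The steps above can be sketched in a few lines of Python. This is a simplification under stated assumptions, not git's implementation: snapshots are plain dicts, the keys stand in for "lines", and the function name `three_way_merge` and the `("CONFLICT", ...)` placeholder are made up for illustration. Real git diffs file contents line by line and writes textual conflict markers.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots against their merge base.

    Each snapshot is a dict of key -> content.
    Returns (merged_snapshot, list_of_conflicting_keys).
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree, or neither touched it
            value = o
        elif o == b:      # only theirs changed it
            value = t
        elif t == b:      # only ours changed it
            value = o
        else:             # both changed it differently: overlap
            conflicts.append(key)
            value = ("CONFLICT", o, t)
        if value is not None:   # None means the key was deleted
            merged[key] = value
    return merged, conflicts


# Hypothetical contents for C (merge base), E (feature) and G (main).
C = {"a": "1", "b": "2", "c": "3"}
E = {"a": "1", "b": "two", "c": "3"}
G = {"a": "one", "b": "2", "c": "3"}

merged, conflicts = three_way_merge(C, E, G)
print(merged)     # {'a': 'one', 'b': 'two', 'c': '3'}
print(conflicts)  # []
```

Only the three snapshots are passed in. The commits in between never enter the computation.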

No matter how complex the underlying graph, the merge only ever looks at three commits²: the two commits to merge and their common ancestor, the merge base. D and F are never considered. This makes sense when you consider that E and G are snapshots.

Suppose the feature developer followed these steps to produce M1, and then added H, their final commit. The reviewer now proceeds to merge the finished feature into main. They need to compute merge(G, H).

          D---E--M1---H  feature
         /       /
o---o---C---F---G  main

This ASCII graph makes it a little hard to see how these commits are connected. Every one of the edges is directed, pointing from a commit to its parent. For example, the E commit has one parent, D. If I have D, then I can look up its parent, and so on (but there is no connection in the opposite direction).

          D < E < M1 < H  feature
         /       /
        v       v      
o < o < C < F < G  main

Let’s add another commit (I) to main, and then compute merge(I, H):

          D---E--M1---H  feature
         /       /
o---o---C---F---G---I  main

The first step of merge(I, H) is to identify the common ancestor. It may be a bit surprising that git determines common_ancestor(I, H) to be G. There can be multiple common ancestors. C is also a common ancestor, but where one is the descendant of another, the descendant is preferred⁴.
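
Here is a rough sketch of how that choice can be computed for the graph above, following the rule just described: collect the common ancestors, then drop any that is an ancestor of another. The `PARENTS` table and the function names are my own, and the two unnamed commits before C are left out. Git's real implementation is more efficient, and in criss-cross histories there can be more than one best candidate.

```python
# Parent links for the graph above (child -> list of parents).
PARENTS = {
    "C": [], "D": ["C"], "E": ["D"], "F": ["C"], "G": ["F"],
    "M1": ["E", "G"], "H": ["M1"], "I": ["G"],
}


def ancestors(commit):
    """Return the commit itself plus everything reachable via parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def merge_bases(a, b):
    """Common ancestors that are not an ancestor of another common ancestor."""
    common = ancestors(a) & ancestors(b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(other) for other in common)
    }


print(merge_bases("E", "G"))  # {'C'}
print(merge_bases("I", "H"))  # {'G'}
```

G, F and C are all common ancestors of I and H, but F and C are reachable from G, so only G survives.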

We then compute diff1 = diff(G, I) and diff2 = diff(G, H). As it happened, the changes introduced in I caused no conflicts, and M2 is the merge result.

          D---E--M1---H  feature
         /       /     \
o---o---C---F---G---I--M2  main

Consider what would have happened if C had been chosen as the merge base during our second merge. M2 would have been the resolution of diff(C, H) and diff(C, I). Conceptually, this represents something very different. It asks: do any of the changes in feature conflict with any of the changes outside of the feature? But M2 as we computed it before asked a different question: do any of the changes from G to H conflict with any of the changes from G to I?

This is pictorially what that latter question looks like:

                 M1---H  feature
                 /     
o---o---C---F---G---I  main

The diff from C to G is not part of the equation. Any changes introduced in M1 that overlap with the changes from C to G will no longer trigger a conflict.

The default behavior of git is to trust the merge M1 as having resolved the history since C, so that this later merge starts from G. But in our case M1 undid the crucial change introduced in F (the bugfix, labeled B in the earlier graph).
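
The whole scenario can be replayed with a toy three-way merge. Everything here is an illustrative assumption: snapshots are dicts whose keys stand in for lines, the contents are invented, and `three_way_merge` is a simplified stand-in for what git does. It shows the second merge run twice, once with G as the merge base and once with C.

```python
def three_way_merge(base, ours, theirs):
    """Toy three-way merge over dict snapshots; returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t or t == b:      # same on both sides, or only ours changed
            value = o
        elif o == b:              # only theirs changed
            value = t
        else:                     # both changed differently
            conflicts.append(key)
            value = ("CONFLICT", o, t)
        if value is not None:
            merged[key] = value
    return merged, conflicts


C = {"check": "buggy", "log": "off"}      # original merge base
G = {"check": "fixed", "log": "off"}      # main, including the bugfix from F
E = {"check": "buggy v2", "log": "off"}   # feature reworked the same line

# M1: merging G into E conflicts on "check"; the developer wrongly keeps E's side.
_, m1_conflicts = three_way_merge(C, E, G)
M1 = dict(E)
H = {**M1, "feature": "done"}             # final feature commit
I = {**G, "log": "on"}                    # unrelated new work on main

# M2 as git computes it, with G as the merge base: no conflict, bugfix gone.
merged_g, conflicts_g = three_way_merge(G, I, H)
# The same merge with C as the merge base: the overlap is detected.
merged_c, conflicts_c = three_way_merge(C, I, H)

print(m1_conflicts)        # ['check']
print(merged_g["check"])   # buggy v2
print(conflicts_g)         # []
print(conflicts_c)         # ['check']
```

With G as the base, main's side of `check` looks unchanged, so the feature's version wins silently. With C as the base, both sides have changed `check` and the merge stops to ask.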

This can be addressed by better processes. For example, a team could have a maintainer who does all merging into the main branch, who is aware of all work. They should never trust these merges in feature branches (as the non-maintainer cannot really make these judgements).

Another good process is for non-maintainers to generally avoid making any merges. A string of connected non-merge commits encodes something specific: changes. It encodes the change from snapshot A to snapshot B, and the change from B to C. But when you introduce a merge, the commit no longer represents a single change. It’s a different change depending on the selected ancestor.

Suppose you adopt a flow where every feature must be rebased (tacked onto the end of main) prior to merge into main.

          o---o---o  feature
         /
o---o---o  main

If you instruct git to create a merge commit always, then you’ll end up with this after the merge.

          o---o---o  feature
         /         \
o---o---o-----------o  main

Main’s merge commit will have two parents, but its snapshot will be identical to feature’s. So the only thing this merge encodes in main is that a feature three commits long was brought into main. Without forcing the merge, you would end up with the following graph, which doesn’t indicate that there was a feature to begin with.

o---o---o---o---o---o  main

If you want to convey changes to snapshots use non-merge commits. If you want to record feature history use merges.

If you want to convey the resolution of histories, use a merge, but make the merge snapshot identical to one of the merge’s ancestors.

It would be interesting to explore a fork of git that required merge commits not to introduce changes, or that changed merge commits altogether so that they had no snapshot. Perhaps this is already solved in patch-based version control systems.

At a minimum, the reviewer can convert the feature branch into a no-merge branch and then merge it with the proper common ancestor.

1. In this graph, I'm showing an extra merge commit for illustrative purposes. When main has no commits that feature lacks, git's default behavior is a fast-forward merge, which does not create an extra commit.

2. I'm assuming you're merging two things. Merging N is not that different.

3. There are no restrictions on the snapshot you include in the commit. It can contain any changes, and they don't have to exist on either side.

4. In our diagrams, the further to the right the common ancestor, the smaller the diffs, and thus the lower the likelihood of conflict.