December 3, 2013

Day 3 - 14 Tips for Git Giddiness in 2014

Written By: J. Paul Reed (@soberbuildeng)
Edited By: Adam Compton (@comptona)

Whether you're a developer or a systems administrator, it's inconceivable you haven't heard of Git by now.

Statements like "You can use any version control system you want, just as long as it starts with a 'g' and ends with an 'it'" illustrate the exploding popularity and ubiquity of the version control system Linux's Linus developed in a weekend.

Despite this, organizations making the transition from the previous generation of version control systems (and sometimes no version control at all) face many challenges.

Here are 14 tips—ten for the individual Git user, and four for organizations moving to Git—to make 2014 your best year using Git yet!

  1. Be an "Explicit Git User"

    A developer, tongue in cheek, once told me: "Git operates on the principle of most-surprise." That is often true, especially when Git's developers change the semantics of core commands every major release or so.

    One way to help reduce mistakes is to be (what my colleague Pete Cheslock dubbed) an "Explicit Git User." In other words, move away from using Git's default command arguments in daily workflows and always specify what you intend.

    Examples include:

    • Specify the repository and refspec arguments to git push (especially when using --force).
    • Favor git fetch (with the repository and refspec arguments!) over git pull; pull is a fetch, followed by a merge, so doing it this way separates the act of getting upstream content from integrating it into your local repository, which allows you to explicitly specify how you want to do that.

    You'll be glad you stopped relying on default command arguments, especially when they change in a major release.
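
    As a rough sketch of what "explicit" looks like in practice, the following builds a throwaway sandbox (a bare repository standing in for upstream, plus one clone) and names the repository and refspec on every push and fetch. The temporary directory, the demo identity, and the file name are assumptions made purely for illustration.

```shell
# Throwaway sandbox: a bare "upstream" repository plus one clone of it.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
git init -q --bare "$work/upstream.git"
git clone -q "$work/upstream.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo

echo one > file.txt
git add file.txt
git commit -q -m "First commit"

# Explicit push: name the repository and the full refspec (source:destination).
git push -q origin master:master

# Explicit fetch: name the repository and the branch...
git fetch -q origin master
# ...then decide, as a separate step, how to integrate what arrived.
git merge -q --ff-only origin/master

local_tip=$(git rev-parse master)
remote_tip=$(git rev-parse origin/master)
```

    Using --ff-only for the integration step is one choice among several; the point is that the choice is now yours rather than whatever git pull happens to default to.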

  2. Learn How To Pull In Upstream Changes Without Creating a Merge Commit

    Pretty much every Git repository contains commit messages like "Merge branch 'master' of repository_name into master". These are merge commits, automatically generated by Git when you've made commits on your local branch and then pull in upstream changes. Git treats the two as divergent histories that must be merged together using (drumroll) a merge commit! Previous version control systems would add conflict markers to the files in your working copy and make you resolve them there, without recording anything in the repository.

    The issue here is clarity; this commit adds zero value and clutters up the repository's history. It also makes graphical views of the repository history more confusing.

    A common way to avoid it: use the --rebase option with git pull. The downside is this requires a clean working directory, but that's a good practice when pulling anyway. It also assumes the simple use case where you have made local commits, you want to integrate upstream commits that occurred in the remote repository, and you want behavior similar to the CVS- or Subversion-style workflow you may be used to.

    You can make this behavior the default for newly created tracking branches by adding the following to your .gitconfig:

    [branch]
       autosetuprebase = always
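
    To see the difference, here is a minimal sandbox sketch: two clones of a throwaway bare repository diverge, and the second one catches up with git pull --rebase. The clone names (alice, bob), the demo identity, and the file names are hypothetical.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
git init -q --bare "$work/upstream.git"
git clone -q "$work/upstream.git" "$work/alice" 2>/dev/null
cd "$work/alice"
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo
echo base > base.txt && git add base.txt && git commit -q -m "Base"
git push -q origin master:master

# A second clone stands in for a teammate who pushes first.
git clone -q -b master "$work/upstream.git" "$work/bob"
(cd "$work/bob" && echo bob > bob.txt && git add bob.txt &&
  git commit -q -m "Upstream change" && git push -q origin master:master)

# Meanwhile, a local commit makes the two histories diverge.
echo alice > alice.txt && git add alice.txt && git commit -q -m "Local change"

# Fetch, then replay the local commit on top of upstream instead of merging.
git pull -q --rebase origin master

merge_commits=$(git rev-list --merges master | wc -l | tr -d ' ')
git --no-pager log --oneline
```

    The resulting log is a straight line of three commits, with the local change sitting on top of the upstream one and no merge commit in between.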

  3. Learn to Rebase Your Commits to Tell a Story

    The next two tips are around Git's rebase functionality, which can be daunting and dangerous, but also powerful. Restricting yourself in the ways you use rebasing will help keep you clear of situations that are the makings of a Git horror story. One of those uses is rewriting history to tell a meaningful story to future developers (including, possibly, yourself).

    This method is recommended on feature branches, where you may have done a bunch of work, but the units of that work and the commit messages associated with them are less than useful. When I have this problem, I use the oldest-ancestor alias (see tip #7) to rewrite the commits to tell the story of my feature branch. On that branch, I run: git rebase -i $(git oldest-ancestor). I can then reorder and squash commits into something useful. Being an explicit Git user, when squashing commits, I reorder them first, complete the rebase, and then perform a separate rebase to squash the commits into a meaningful unit of work. This allows me to --abort the rebase if I get into an unexpectedly complex situation.

    When complete, my feature branch is ready to push upstream. If I've pushed it before—and this is where you see most of the warnings telling you not to use rebase at all, evar!—it will require a --force push; any others who've worked on my branch will need to know that I've rewritten history. Because of that, when working collaboratively on a feature branch, I will wait until the feature is ready for final review or to be merged before doing this.
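
    Interactive rebases are hard to show on a page, so the sketch below swaps in a related technique: fixup commits plus --autosquash, which lets Git generate the reorder-and-squash plan itself. It is not the two-pass manual approach described above, and it assumes master has not moved, so master doubles as the oldest ancestor. The branch, file, and commit names are hypothetical.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo
echo base > base.txt && git add base.txt && git commit -q -m "Base"

git checkout -q -b feature
echo v1 > widget.txt && git add widget.txt && git commit -q -m "Add widget"
echo docs > README.txt && git add README.txt && git commit -q -m "Document widget"

# A later fix that really belongs to "Add widget": record it as a fixup commit.
echo v2 > widget.txt && git add widget.txt
git commit -q --fixup "$(git rev-parse HEAD~1)"

before=$(git rev-list --count master..feature)

# --autosquash moves the fixup next to its target and folds it in;
# GIT_SEQUENCE_EDITOR=true accepts the generated todo list unedited.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash master

after=$(git rev-list --count master..feature)
git --no-pager log --oneline master..feature
```

    Three untidy commits become two that each tell one part of the story. If the branch had already been pushed, publishing the result would still need the --force push discussed above.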

  4. Learn to Rebase Your Branch to Clarify/Constrain the Scope of Your Changes (And Keep the Build Green)

    Once your feature branch is ready to merge back to the parent code-stream, it's always been a best practice to merge in all the changes from the upstream branch into your local branch first. This reduces conflicts when you merge back, and if you've been off on a branch for a long time, allows you to make sure the code still builds and passes unit tests.

    In Git, this can be done by merging the parent's code down into your feature branch before merging back. But there are advantages to achieving the same result with a rebase. (The biggest, for me, is that the operation takes place by "replaying" all of my commits on top of the current code; this can often make resolving conflicts simpler, since I can do it in smaller chunks, as opposed to all at once.)

    To do so, make sure you have the latest copy of the upstream branch. Then, on your feature branch, use git rebase UPSTREAM_BRANCH. This will, in effect, pull in all the upstream changes, stopping for you to resolve any conflicts as your feature branch commits are replayed.

    (If you're using Git Flow, this is even easier! Use git flow feature rebase.)
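
    A minimal sandbox sketch of that rebase, assuming the upstream branch is a local master that is already up to date (the branch and file names are hypothetical):

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo
echo base > base.txt && git add base.txt && git commit -q -m "Base"

git checkout -q -b feature
echo feature > feature.txt && git add feature.txt && git commit -q -m "Feature work"

# Upstream moves on while the feature branch is in progress.
git checkout -q master
echo upstream > upstream.txt && git add upstream.txt && git commit -q -m "Upstream work"

# Replay the feature commits on top of the current upstream branch.
git checkout -q feature
git rebase -q master

fork_point=$(git merge-base master feature)
master_tip=$(git rev-parse master)
```

    After the rebase, the feature branch forks from the tip of master, so merging it back is a fast-forward (or a trivial merge commit, if your team prefers one).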

  5. Do Not Store Artifacts in Git (Even Lil' Ones)

    Previous version control systems handled binary artifacts well enough that it was common practice to dump JARs, shared objects, and other dependent artifacts used in the software build process (or shipped in the product) into source control, right alongside the code. While there's nothing fundamentally wrong with using source control to track binaries (Pixar famously versions entire movies using Perforce), the industry is moving away from the practice in favor of proper artifact management, using tools specifically designed for the task.

    Because Git (currently) requires users to grab an entire copy of the repository to work, storing binary artifacts, even small ones, can balloon the size of the repository over time, causing times for clones and other costly Git operations to increase significantly. git rm'ing these binaries does not fix the problem, unless you undertake repository surgery (a service I am hired surprisingly often for).

    In the future, Git may address these issues, but proper artifact management provides a number of other benefits and is the future: resist the temptation to commit these types of files and put your binaries, even the small ones, under real artifact management.
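
    To check whether a repository already carries this kind of baggage, something like the following can help. It is a sketch in a throwaway repository: lib.jar is a hypothetical 1 MiB stand-in for an artifact, and the pipeline assumes a Git recent enough to accept a format string for cat-file --batch-check.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q .

# Commit a 1 MiB stand-in for a binary artifact, then delete it with git rm.
head -c 1048576 /dev/urandom > lib.jar
git add lib.jar && git commit -q -m "Add artifact"
git rm -q lib.jar && git commit -q -m "Remove artifact"

# List every blob reachable from any ref (size, then path) and keep the largest.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn | head -n 1)
echo "$largest"
```

    The file is gone from the working tree, but the blob is still reachable through history, so every fresh clone keeps paying for it.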

  6. Submodules are Horrible... Except When They're Not

    Git's submodule feature has a largely bad reputation, and it's not entirely undeserved. A stand-in for Subversion's externals, the feature breaks many standard Git conventions and often feels not fully baked. (See: removing a submodule.)

    Having said that, submodules can serve a useful purpose: stitching together multiple internal repositories within a single organization. One of the standard patterns spurred by Git is an emphasis on smaller, narrowly scoped repositories, often at the library or module level. A repository built of submodules and used to build the complete product can remove a lot of the complexity of trying to stitch these repositories together yourself. But the submodules should point to repositories your organization manages; pointing to random content on GitHub is a recipe for future sadness.

  7. Learn [to Love] Git Aliases

    One of Git's most useful features is the ability to add commands to its lexicon, via the alias feature. Learn to love this feature: it allows you to create shortcut commands for complex operations.

    For instance, in my work I often need to find the common ancestor of a feature branch and master; to do this, I found an alias I called oldest-ancestor; it runs

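    bash -c 'diff -u <(git rev-list --first-parent "${1:-master}") <(git rev-list --first-parent "${2:-HEAD}") | sed -ne "s/^ //p" | head -1' -
    (Taken from StackExchange. The trailing - fills $0 so that any arguments land in $1 and $2; in a .gitconfig alias, prefix the command with ! and escape the inner double quotes as \".)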

    Problem solved! Even commonly used simple operations can be aliased. One of my other favorites: mapping reset HEAD -- to unstage.

    Check out git-extras for more; I stole many of my favorites from Wil Moore's dotfiles.
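
    For reference, here is roughly how those two aliases could be wired up and exercised. This is a sketch: it appends to the config of a throwaway repository, whereas in daily use the [alias] section would live in ~/.gitconfig, and it assumes bash is installed. The branch and file names are hypothetical.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo

# The quoted heredoc keeps the text literal, exactly as it would appear
# in a config file (note the leading ! and the escaped double quotes).
cat >> .git/config <<'EOF'
[alias]
    unstage = reset HEAD --
    oldest-ancestor = !bash -c 'diff -u <(git rev-list --first-parent \"${1:-master}\") <(git rev-list --first-parent \"${2:-HEAD}\") | sed -ne \"s/^ //p\" | head -1' -
EOF

echo base > base.txt && git add base.txt && git commit -q -m "Base"
git checkout -q -b feature
echo feature > feature.txt && git add feature.txt && git commit -q -m "Feature work"
git checkout -q master
echo upstream > upstream.txt && git add upstream.txt && git commit -q -m "Upstream work"
git checkout -q feature

ancestor=$(git oldest-ancestor)            # defaults to master and HEAD
expected=$(git merge-base master feature)  # same answer in this simple history

# Stage a file, then take it back out of the index with the alias.
echo scratch > scratch.txt && git add scratch.txt
git unstage scratch.txt
staged=$(git diff --cached --name-only)
```

    In this simple history git merge-base gives the same commit; the alias earns its keep on branches that have had master merged into them along the way, because it follows only first parents.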

  8. Take Time to Learn Git's Primitives

    If your job requires you to use Git more than three times per week, do yourself a favor and take time to learn Git's primitive objects: the blob, tree, commit, and tag objects.

    A firm understanding of how these objects relate to each other and how they're used to construct histories of source code development lines will help you understand why Git behaves differently than previous-generation source control systems.
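
    A quick way to start poking at these objects is git cat-file. The sketch below creates one of each in a throwaway repository and asks Git for its type; the file name and tag name are hypothetical.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q .
echo hello > greeting.txt
git add greeting.txt && git commit -q -m "Add greeting"
git tag -a -m "First release" v1.0

commit_id=$(git rev-parse HEAD)
tree_id=$(git rev-parse 'HEAD^{tree}')
blob_id=$(git rev-parse HEAD:greeting.txt)
tag_id=$(git rev-parse v1.0)   # an annotated tag is an object of its own

# Ask Git what type each object is.
commit_type=$(git cat-file -t "$commit_id")
tree_type=$(git cat-file -t "$tree_id")
blob_type=$(git cat-file -t "$blob_id")
tag_type=$(git cat-file -t "$tag_id")

# Pretty-print each one to see how they point at each other.
git cat-file -p "$tag_id"      # tag    -> the commit it names
git cat-file -p "$commit_id"   # commit -> a tree, plus author and message
git cat-file -p "$tree_id"     # tree   -> blobs and sub-trees, with modes and names
git cat-file -p "$blob_id"     # blob   -> file contents and nothing else
```

    Notice that the blob knows nothing about its file name; that lives in the tree. Details like this explain a lot of Git's behavior around renames and history.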

  9. Be Careful What You Read On the Internet

    When running into brick walls with Git, engineers (including myself) invariably run to Google to find an answer. Take caution with those search results.

    In my experience, when untangling any reasonably complex Git problem (merges especially!), a search for the solution will return 20% of answers that look promising, but upon further reading, the language I used ("branch", "parent", "diff chunk", etc.) in my search is being used in a different context in the answer, and thus isn't helpful. About 40% of answers would be useful, if they applied to my team's repository structure and expectations. 10% of them will be just plain incorrect. This leaves me with just 30% of answers that are actually useful in the environment in which I'm using Git.

  10. Take it Easy on Each Other (And Yourself!)

    New Git users: in my training sessions, I find it important to let new users know: if Git seems difficult and confusing, it is. Much has been written about Git's cognitive inconsistencies.

    It's one of the few software development tools that makes it too easy to destroy your own (and others') content! For all the bandying we do in the software industry about the importance of user interface consistency, usability, and polish, Git came roaring out of the gate without an emphasis on any of these. Any usability expert will tell you this is the tool's fault, not the user's. Git is continuing to evolve and improve, but if you ever feel frustrated or confused by Git, you are not alone.

    Seasoned Git users: I know developers who love to trade stories on how Git has screwed them. They seem to see it as a badge of honor that they blew their (and sometimes their team's) foot off with the tool. Sometimes they've been able to recover; sometimes not so much. While recounting these war stories, some label those who find Git confusing or counterintuitive as incompetent, "not a craftsperson," or just plain stupid.

    If this describes you: stop it. There are many capable, competent developers, QA engineers, techpub writers, artists, and others who take great care in their work, but totally do not care about the version control tool they use; they just want to get their job done. The attitude that everyone must obsess over the intricacies of a single counterintuitive tool required to get work done does nothing to help Git's adoption. Plus, you come off as an unempathetic jerk.

For organizations adapting their development environment and release processes to Git:

  1. Your Team Needs a Git "Language Lawyer"

    The concept of a "language lawyer" came out of the necessity to have someone on the team who knew all the little crevices of complicated languages, like C++, where the standard was continually evolving, compilers supported different features, and platform oddities made development complex and error prone. Given Git's constant evolution, differing versions (especially on long-term support distributions), and platform oddities (*cough*Windows*cough*), current-day Git is at least as complicated as C++.

    Your team should have a "Git language lawyer." That person should follow the Git releases, read release notes and possibly even follow the mailing lists. They should track the state of the art of Git's supporting tools.

    This gives your organization a resource to help frame discussions about optimizing your team's Git workflows, and a source of deep technical knowledge when team members, old and new, run into problems.

  2. Make Time Early On to Discuss and Decide on Workflows

    An anti-pattern I commonly see is a subset of developers dragging the rest of the team into using Git and the repository quickly becoming a "bed-headed, tangled mess" of content, branches, and commits.

    This is due to Git supporting so many workflows. Failing to discuss and agree upon a workflow is a recipe for disaster. Some issues your team should come to agreement on include:

    • Are you going to use Git Flow? (This may sound like a stupid question, but most organizations I work with start using Git Flow and pretty quickly evolve its workflow into something else. It's a fine tool, but, for instance, GitHub doesn't use it, so it's a conversation worth having.)
    • For integrating content, is your team going to use a repository forking model or a branch/merge model?
    • If you're relying on branching, what name-space standards will you use? How will you clean up branches?
    • When merging, will you use merge commits or squash feature-branch merges to a single commit? What about feature branches that contain (or squash to) a single commit?

    These are just a few of the questions your team's "Git language lawyer" can help answer as you work out what's best for your team.
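
    To make the merge-commit-versus-squash question concrete, this sandbox sketch integrates the same two-commit feature branch both ways. The branch names (feature/widget, with-merge-commit, with-squash) and file names are hypothetical.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo
echo base > base.txt && git add base.txt && git commit -q -m "Base"

git checkout -q -b feature/widget
echo one > widget.txt && git add widget.txt && git commit -q -m "Start widget"
echo two > widget.txt && git add widget.txt && git commit -q -m "Finish widget"

# Option A: always record a merge commit; the branch's commits stay visible.
git checkout -q -b with-merge-commit master
git merge -q --no-ff -m "Merge feature/widget" feature/widget
merge_parents=$(git cat-file -p HEAD | grep -c '^parent ')
merge_total=$(git rev-list --count HEAD)

# Option B: squash the whole branch into one ordinary commit.
git checkout -q -b with-squash master
git merge -q --squash feature/widget
git commit -q -m "Add widget"
squash_parents=$(git cat-file -p HEAD | grep -c '^parent ')
squash_total=$(git rev-list --count HEAD)
```

    Option A leaves four commits and a two-parent merge; option B leaves two commits and no trace of the branch. Neither is wrong, but mixing them without agreement is how histories get tangled.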

  3. Use a Git Repository That Provides Guardrails

    Certain codelines—develop, master, release/hotfix branches—are intrinsically more important than feature and personal branches. They should be treated as such and have special policies around structure and naming (so they're easy for everyone to find), who can create them, how content flows into them, and whether or not their history can be rewritten.

    There are a number of repository management tools to help with this: gitolite, Gerrit, GitLab, Stash. Use one of them.

    (Readers may note: hosted GitHub isn't listed; it doesn't provide the ability to disable force-pushes. GitHub Enterprise has a form of the feature. Bitbucket, also a hosted Git provider, includes this functionality.)
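
    Even plain Git offers a blunt version of this guardrail through server-side configuration. The sketch below sets receive.denyNonFastForwards and receive.denyDeletes on a throwaway bare repository and then attempts a forced push. Note the assumption: these settings apply to the whole repository, whereas the tools above typically let you scope the policy to specific branches.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
git init -q --bare "$work/upstream.git"

# Server-side settings: refuse history rewrites and branch deletions.
git --git-dir="$work/upstream.git" config receive.denyNonFastForwards true
git --git-dir="$work/upstream.git" config receive.denyDeletes true

git clone -q "$work/upstream.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git symbolic-ref HEAD refs/heads/master   # pin the branch name for the demo
echo one > file.txt && git add file.txt && git commit -q -m "First"
echo two > file.txt && git add file.txt && git commit -q -m "Second"
git push -q origin master:master

# Rewrite local history, then try to force it upstream.
git reset -q --hard HEAD~1
echo rewritten > file.txt && git add file.txt && git commit -q -m "Rewritten second"
if git push -q --force origin master:master 2>/dev/null; then
  force_push=accepted
else
  force_push=rejected
fi
echo "force push was $force_push"
```

    With the setting in place the server refuses the push even though --force was given, and upstream history stays intact.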

  4. Understand the 'D' in DVCS

    The D in DVCS is commonly expanded as "decentralized" or "distributed." I find neither to be accurate.

    In the real world in which we develop code, GitHub and company "repositories of record" provide obvious counterexamples to the "decentralized" descriptor. And while Git repositories technically are distributed, the term has a specific meaning in computer science that differs from how it is commonly used here, and if your DVCS implements fully functional shallow clones, any claimed benefits of "decentralized" or "distributed" disappear.

    What do I think the 'D' stands for? Disconnected. Git allows you to do all of the operations we generally care about offline, including creating personal codeline histories which may not be important to the larger group.

    So why is the distinction even worth making? Many claim Git is better because everyone has a backup and so nothing will ever be lost. This is a precarious argument. (And, as Git evolves, it may not even be true!)

    More than that, it gives the impression that our (centralized!) repositories of record—and all of the services around them that produce our builds—aren't critical infrastructure, and are replaceable by a random development laptop with no impact on the organization's ability to develop and ship software. I know of few, if any, environments where that is practically true, and so that impression is disingenuous.

    Upshot: Git is great... but it doesn't mean we can start ignoring our (centralized) repositories of record and their accompanying services.

It would surprise me little if some of these tips and tricks generated discussion and even disagreement. I've found version control and code-line management of interest for over fifteen years now, and while I have many feelings about Git, I can't deny that it's done more in the past six years to make our industry aware of and care about these issues than release engineering has for... well, probably ever.

Hopefully at least a few of these will make you and your team's Git usage more productive, easier, and more fun in 2014!

Happy Holidays to all! (And to all, a clean merge!)

2 comments :

Unknown said...

There are some good points raised in this post. I personally just don't agree with the rebase approach. In my team we decided not to use rebase on anything that has been pushed in - ever. As mentioned in the above post itself, it rewrites history, so it can be very dangerous to use if you don't use it right. And from a history/audit/review perspective I think rebase is just too fragile.

preed said...

@Frederik:

I think as Git evolves, it will encompass the distinction between codelines that are "official," and "feature" or "developer" codelines that are able to be rewritten and tidied up until they are merged into the official codelines.

In those cases, using a properly configured repository manager (org tip #3) ensures that engineers cannot rebase "official" codelines, but CAN rewrite, tidy up, and respond to code review feedback on their own codelines before they are merged.

In my opinion, there's a lot of bad advice floating around about rebase because it is so dangerous; but with tooling around it (Git Flow's git flow feature rebase, and git repository permissions), it's an invaluable feature.

-preed