I have worked with many data scientists over the past years. One thing I have found common among them is a lack of software development skills. A simple but important practice in software development is version control, which in the industry is practically synonymous with Git, although other technologies exist. I have found that many data scientists are not comfortable with Git, mostly because they never understood why, where, and how they should use it. In this article, I describe Git in simple words and give you scenarios where you need it. I also describe the functionality you need most for daily development: (a) saving changes, (b) inspecting the codebase, (c) undoing changes, and (d) rewriting history. I hope this helps you become more comfortable with this amazing technology.
In a data science interview, you may not be asked a question such as “How should Git be used effectively?”. However, you must know how to work with Git from the very first day. You will make a bad first impression on your manager if you are not comfortable working with Git, especially with more advanced operations such as rebase. Used incorrectly, these advanced operations can create issues that slow down development.
When I interview data scientists, I always evaluate their software development skills as well. Many interviewers do not conduct such an evaluation, but that does not mean they do not expect you to know Git. If you do not know Git, you end up collaborating inefficiently with your colleagues, which can easily endanger your position. So, be prepared for it.
— What is Git?
Git is widely accepted software for tracking changes in a codebase (a.k.a. version control) and for coordinating development within a team. In version control, the main goal is to keep your local codebase in sync with the remote codebase: you pull the latest changes made by other developers and push the latest changes made by you. The remote codebase is the main codebase, the one developers should always look at to find the latest changes.
Git enables you to keep every local codebase in sync with the remote codebase even though all of them change frequently. Git lets you save and retrieve snapshots of every step in development, also called “state”. The software industry has many examples of state-management technology, such as Docker (container images) or Terraform (infrastructure as code). Git is the widely accepted state-management technology for code changes.
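To make the pull/push cycle concrete, here is a minimal sketch that plays it out on one machine. Everything in it is an assumption for illustration: the temporary directory, the bare repository standing in for the remote, the two developers "alice" and "bob", the file `train.py`, and the placeholder identity. It also assumes Git 2.28 or newer for `git init -b`.

```shell
set -eu
# Placeholder identity so commits work even on a machine without Git configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)                                # throwaway sandbox
git init -q --bare -b main "$work/remote.git"    # stands in for the remote codebase

# Alice starts the project locally and pushes the first snapshot to the remote.
git init -q -b main "$work/alice"
cd "$work/alice"
echo "print('v1')" > train.py
git add train.py
git commit -q -m "Add training script"
git remote add origin "$work/remote.git"
git push -q -u origin main

# Bob clones the remote, changes the file, and pushes his change.
git clone -q "$work/remote.git" "$work/bob"
cd "$work/bob"
echo "print('v2')" > train.py
git commit -q -am "Update training script"
git push -q origin main

# Alice pulls to bring her local codebase back in sync with the remote.
cd "$work/alice"
git pull -q --ff-only origin main
cat train.py                                     # prints: print('v2')
```

In a real project the remote would live on a server such as a hosted Git service rather than in a local directory, but the commands are the same.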
— How to use Git?
Saving changes
A common nightmare in development is losing the code changes you have worked on for a day, for whatever reason. Git gives you the opportunity to save your changes in both the local codebase and the remote repository. You can use the commit command to save your changes in the local codebase and the push command to save them in the remote codebase. If you want to save the latest changes temporarily, you can use the stash command. You usually use stash when you want to, for example, switch between two branches without committing.
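The three commands above can be sketched as follows. This is only an illustration under assumed names: the sandbox directory, the bare repository that plays the remote, the file `config.py`, the `hotfix` branch, and the placeholder identity are all hypothetical. It assumes Git 2.28 or newer (`git init -b`, `git switch`).

```shell
set -eu
# Placeholder identity so commits work even on a machine without Git configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q --bare -b main "$work/remote.git"    # stands in for the remote codebase
git init -q -b main "$work/project"
cd "$work/project"
git remote add origin "$work/remote.git"

# commit: save a snapshot in the local codebase.
echo "alpha = 0.1" > config.py
git add config.py
git commit -q -m "Add config"

# push: save the committed snapshot in the remote codebase.
git push -q -u origin main

# stash: shelve unfinished work so you can switch branches without committing.
echo "alpha = 0.5" > config.py                   # half-finished edit
git stash push -q -m "wip: tune alpha"
git switch -q -c hotfix                          # working tree is clean now
git switch -q main
git stash pop -q                                 # bring the unfinished edit back
cat config.py                                    # prints: alpha = 0.5
```

Note that a stash lives only in your local repository; it is a convenience for short interruptions, not a backup.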
Inspecting the codebase
From time to time, you need to inspect the codebase to learn more about the status of its changes. As I described above, Git is a state-management technology, so you must be able to track the state of the codebase whenever you want. You can use the status command to display and track the state of the codebase. You can also use the log command to display committed snapshots. The log command lets you list the project history and search for specific changes.
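Here is a small sketch of both commands on a throwaway repository. The file `features.py`, the commit messages, and the placeholder identity are assumptions made up for the example; it assumes Git 2.28 or newer for `git init -b`.

```shell
set -eu
# Placeholder identity so commits work even on a machine without Git configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git init -q -b main "$repo"
cd "$repo"
echo "import pandas" > features.py
git add features.py
git commit -q -m "Add feature pipeline"
echo "import numpy" >> features.py
git commit -q -am "Use numpy in feature pipeline"
echo "# TODO: scale inputs" >> features.py       # an edit that is not committed yet

git status --short                   # lists files with uncommitted changes
git log --oneline                    # one line per committed snapshot, newest first
git log --oneline --grep="numpy"     # search the history by commit message
git log --oneline -S"pandas"         # find commits that added or removed the string "pandas"
```

The `--oneline` flag is only a display choice; plain `git log` shows the author, date, and full message of each snapshot.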
Undoing changes
For many reasons, you may need to undo some committed changes. You can use the reset command to do so. This command has two modes that you will use most: --soft and --hard. When the --soft argument is passed, all of your code changes stay in your codebase (they remain staged), but the corresponding commits are removed from the branch history. When the --hard argument is passed, the codebase is completely reset to the exact state, or snapshot, that you point to. Each snapshot of the codebase is registered with a unique commit id. Let me share two scenarios in which you need to use reset.
- Scenario 1 (when you need to use reset --soft): You work at a company with strict rules on its development workflow. You are working on a branch to develop feature A. At the same time, you start developing feature B. You forget to create a new branch and commit the code changes for feature B on the same branch as feature A! After several commits, you notice the mistake and decide to clean up your branch. You must undo those commits without losing the changes themselves.
- Scenario 2 (when you need to use reset --hard): You are working with Git operations and, all of a sudden, you make a mistake. Your recent changes get deleted and you become very frustrated. You want to revert the codebase to a snapshot whose quality you are certain about. At this point, what you need is to go back to a clean snapshot of the codebase.
Note that the --hard argument must be used with caution, since it discards everything that came after the snapshot you reset to, including uncommitted work.
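The two scenarios can be played out roughly like this. The branch names `feature-a` and `feature-b`, the files, and the placeholder identity are hypothetical, and the sketch assumes Git 2.28 or newer (`git init -b`, `git switch`).

```shell
set -eu
# Placeholder identity so commits work even on a machine without Git configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git init -q -b main "$repo"
cd "$repo"
echo "base" > README.md
git add README.md
git commit -q -m "Initial commit"

# Scenario 1: feature B was committed on feature A's branch by mistake.
git switch -q -c feature-a
echo "feature A" > a.py
git add a.py
git commit -q -m "Add feature A"
echo "feature B, part 1" > b.py
git add b.py
git commit -q -m "Start feature B"
echo "feature B, part 2" >> b.py
git commit -q -am "Finish feature B"

git reset -q --soft HEAD~2      # remove the two B commits from feature-a, keep b.py staged
git switch -q -c feature-b      # the staged changes travel to the new branch
git commit -q -m "Add feature B"

# Scenario 2: return to a snapshot you trust and throw away everything after it.
good=$(git rev-parse HEAD)      # commit id of the clean snapshot
echo "broken" > a.py
git commit -q -am "Experiment that broke feature A"
echo "more junk" >> b.py        # uncommitted edit on top of the bad commit
git reset -q --hard "$good"     # both the bad commit and the uncommitted edit are gone
git status --short              # prints nothing: the working tree matches the snapshot
```

After the first reset, `feature-a` ends at "Add feature A" and the feature B work sits in a single commit on `feature-b`.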
Rewriting history
Some developers are against rewriting history; however, many others use this powerful feature to develop faster with fewer conflicts. Used correctly, it lets you build a clean history in the state-management system and resolve conflicts quickly. The main command for rewriting history is rebase. You can also use rebase to squash several commits into one commit.
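Squashing is normally done with an interactive rebase: `git rebase -i` opens your editor with one "pick" line per commit, and you change "pick" to "squash" on the commits you want folded into the one above. The sketch below scripts that edit so it can run unattended; the `sed` one-liner stands in for your editor and assumes GNU sed. The branch name `feature`, the file `model.py`, and the placeholder identity are also assumptions, as is Git 2.28 or newer.

```shell
set -eu
# Placeholder identity so commits work even on a machine without Git configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git init -q -b main "$repo"
cd "$repo"
echo "v0" > model.py
git add model.py
git commit -q -m "Initial commit"

# Three small work-in-progress commits on a feature branch.
git switch -q -c feature
for i in 1 2 3; do
  echo "step $i" >> model.py
  git commit -q -am "WIP $i"
done

# Keep the first commit as "pick" and mark the rest as "squash".
# GIT_SEQUENCE_EDITOR edits the todo list; GIT_EDITOR=true accepts the combined message.
GIT_SEQUENCE_EDITOR="sed -i -e '2,\$s/^pick/squash/'" GIT_EDITOR=true \
  git rebase -q -i main

git log --oneline main..feature   # a single commit now holds all three steps
```

When you do this by hand, simply run `git rebase -i main` and edit the list yourself; the environment variables are only there to make the example reproducible.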
In two main scenarios, you need to use rebase:
- Scenario 1: You branched off the main (a.k.a. master) branch several days ago. You have done a series of developments, but your work is not finished yet. In the meantime, your colleagues finished their tasks and merged their changes into the main branch. Now your branch is out of sync with the latest changes in the main branch. You must rebase your branch, which was taken from an older snapshot of the main branch, on top of the latest changes in the main branch. You may ask why. Because otherwise you may not be able to merge your changes into the main branch cleanly, since your branch is outdated (i.e., it may have some discrepancy with the main branch).
- Scenario 2: You have worked in a branch for a while and committed your recent changes in a large number of commits. You decide to squash them into one commit (i.e., rewrite the history) for two reasons. First, you want a clear path of development (also referred to as a clean history) so that bad development habits do not get pushed to the remote codebase. Second, you want to rebase on top of the recent changes in the main branch. If you have many commits in your branch, conflict resolution (i.e., resolving the discrepancies between your branch and the new main branch) becomes much harder, since you may have to resolve conflicts in every commit.
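Scenario 1 can be sketched in a single local repository. The branch name `feature`, the files, and the placeholder identity are hypothetical, and the colleague's work is simulated by committing directly on main. It assumes Git 2.28 or newer.

```shell
set -eu
# Placeholder identity so commits work even on a machine without Git configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git init -q -b main "$repo"
cd "$repo"
echo "v0" > README.md
git add README.md
git commit -q -m "Initial commit"

# You branch off main and start working.
git switch -q -c feature
echo "my work" > feature.py
git add feature.py
git commit -q -m "Add feature"

# Meanwhile, a colleague's change lands on main.
git switch -q main
echo "colleague's work" > utils.py
git add utils.py
git commit -q -m "Add utils"

# Replay your branch on top of the latest main.
git switch -q feature
git rebase -q main
git log --oneline --graph         # linear history, with "Add feature" on top of "Add utils"

# The branch is now up to date, so main can simply fast-forward to it.
git switch -q main
git merge -q --ff-only feature
```

With a real remote you would typically run `git fetch origin` first and rebase onto `origin/main`. If the rebase stops on a conflict, you fix the files, `git add` them, and run `git rebase --continue`, or give up with `git rebase --abort`.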
Note that if you are not an expert with Git, never try rewriting history on your own or on a critical codebase. I highly suggest using an experimental codebase to master Git before applying any of these commands, especially the complex ones, to a critical project.
— Last Words
If you want to learn more, you should learn about git-flows. What is a git-flow, though? A Git-based development workflow, also known as a git-flow, is a sequence of development steps for building and releasing software. To collaborate effectively, you must also know the nuts and bolts of the most essential git-flows. You can read about the two most essential ones in this article: Git-Flow is the Source of Productivity, Not Confusion.
I highly recommend reading the awesome tutorial created by Atlassian. Plus, if you are not comfortable with the Git CLI, you can use Atlassian’s great Git client, Sourcetree. While it does not remove the need to use the Git CLI, it gives you a user-friendly environment for many common Git tasks.
Article originally posted here by Pedram Ataee, PhD. Reposted with permission.