Supercharging Your Data Science Projects with GitHub Tools

Technology is advancing at a rapid pace, bringing new innovations that are transforming our workplaces. One role that is being especially disrupted by these advancements is that of the data scientist. Data science is already an exciting field, but new tools are taking it to the next level in terms of productivity and capabilities. With the help of these new technologies, data scientists can work faster and more efficiently than ever before. In this post, we will show you these advancements in action.

A data science project with Python, VS Code and GitHub Tools

Let’s dive deep into some innovative GitHub tools and features that can improve the productivity of your data science workflow. To explore them, let’s imagine we have been asked to create a predictive model to forecast the number of rentals for a bicycle rental business based on seasonality and weather conditions.

To build such a model, starting from a historical rentals dataset, we are going to perform some data analysis and experiments in a Python Jupyter notebook in VS Code. The productivity boost in our project comes from two main ingredients:

  • GitHub Copilot, an AI-powered assistant embedded into the VS Code interface that offers inline suggestions, slash commands and a chat experience.
  • GitHub Codespaces, a pre-configured development environment hosted in the cloud.

Creating our workspace 

Before we can write our first line of Python code, or even create a new Jupyter notebook, we need the latest version of Python installed on our local machine and the Python extension installed in VS Code. Then, we have to install the Python libraries needed to explore, clean, and visualize the data, as well as those needed to train and evaluate our machine learning model. This set of prerequisites may vary from one project to another, and some of them might have conflicting dependencies, adding extra effort to our workflow. Also, if we collaborate with a team of colleagues on the same project, they have to replicate the same installation process to be able to contribute to our code.

This is the context in which GitHub Codespaces is tremendously helpful: it enables us to create a reproducible, pre-configured workspace for our project that we can host and share in the cloud. But how do we get started?

Once we have enabled GitHub Copilot Chat in our IDE (VS Code), we can interact with this built-in virtual assistant through a chat interface, by asking questions in natural language or using chat agents. We can think of an agent as an expert we can ask for support on a specific subject. Just as we would mention a user in a chat, we use the @ symbol to start interacting with an agent. To further define the scope of our request to the agent, we can also use a few slash commands.

For example, the prompt “@workspace /new workspace for a Jupyter Python notebook with a GitHub Codespaces configuration installing pandas, numpy and scikit-learn” will output a suggested directory structure for our project:

  • `.devcontainer/devcontainer.json` – configuration file for the GitHub Codespaces development container, specifying the Docker image to use and the extensions to install in the container.
  • `.devcontainer/Dockerfile` – a placeholder Dockerfile for any additional instructions needed to build the Docker container image.
  • `.devcontainer/requirements.txt` – configuration file listing the Python packages to install in the development container.
  • `notebooks/my_notebook.ipynb` – template Jupyter notebook file, importing pandas, numpy and scikit-learn.
  • `README.md` – placeholder file containing the documentation for the project.

Also, when we click “Create Workspace” at the bottom, the directory structure is created locally, and the files are initialized with some basic content that we can customize for our scenario.
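For readers who prefer to see the layout spelled out, the scaffold listed above can also be created by hand. The following is only a minimal sketch under stated assumptions: the container name, the base image and the post-create command are illustrative placeholders rather than what GitHub Copilot actually generates, `create_scaffold` is a hypothetical helper, and the optional Dockerfile is left out because the sketch points at a ready-made image instead.

```python
import json
import tempfile
from pathlib import Path

# Illustrative devcontainer settings. The name, image and command are
# assumptions for this sketch, not the content Copilot generates.
DEVCONTAINER = {
    "name": "bike-rentals",
    "image": "mcr.microsoft.com/devcontainers/python:3",
    "postCreateCommand": "pip install -r .devcontainer/requirements.txt",
}

# Packages requested in the prompt.
PACKAGES = ["pandas", "numpy", "scikit-learn"]


def create_scaffold(root: Path) -> list[str]:
    """Create the project layout under `root` and return the relative file paths."""
    empty_notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
    files = {
        ".devcontainer/devcontainer.json": json.dumps(DEVCONTAINER, indent=4) + "\n",
        ".devcontainer/requirements.txt": "\n".join(PACKAGES) + "\n",
        "notebooks/my_notebook.ipynb": json.dumps(empty_notebook) + "\n",
        "README.md": "# Bike rentals forecast\n",
    }
    for relative_path, content in files.items():
        target = root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    return sorted(files)


if __name__ == "__main__":
    # Write into a temporary directory so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_scaffold(Path(tmp)):
            print(path)
```

Calling `create_scaffold(Path("."))` instead would write the same files into the current folder.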

Now, to create a codespace starting from here, we should first publish our code on GitHub through the ‘Source Control’ panel of the sidebar menu in Visual Studio Code.

Then, we can customize the configuration files that will be used to build the container. For example, we can add a custom Dockerfile, and we can add the GitHub Copilot and GitHub Copilot Chat extensions to the devcontainer.json file with the following lines:

"customizations": {
        "vscode": {
            "extensions": [
                "github.copilot",
                "github.copilot-chat"
            ]
        }
}

Note that the ‘customizations’ field should be at the same level in the JSON structure as the container ‘name’. If the JSON file created by GitHub Copilot already has an extensions array, we just have to append the two extensions to it.

This way, we’ll be able to use GitHub Copilot features in our remote environment as well.
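The rule of appending to an existing extensions array, rather than replacing it, can be sketched in a few lines of Python. This is an illustration only: `add_extensions` is a hypothetical helper, and the starting configuration (its name and the pre-existing `ms-python.python` entry) is assumed for the example.

```python
import json

COPILOT_EXTENSIONS = ["github.copilot", "github.copilot-chat"]


def add_extensions(config: dict, extensions: list[str]) -> dict:
    """Append VS Code extensions to a devcontainer config without duplicating entries."""
    # Create the nested keys when they are missing, and reuse them when present.
    vscode = config.setdefault("customizations", {}).setdefault("vscode", {})
    existing = vscode.setdefault("extensions", [])
    for extension in extensions:
        if extension not in existing:
            existing.append(extension)
    return config


# Hypothetical starting configuration that already lists one extension.
config = {
    "name": "bike-rentals",
    "customizations": {"vscode": {"extensions": ["ms-python.python"]}},
}

print(json.dumps(add_extensions(config, COPILOT_EXTENSIONS), indent=4))
```

One caveat if we adapt this to a file on disk: devcontainer.json files may contain comments, which Python's `json` module does not parse, so editing the file by hand is often simpler.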

After that, we can turn to GitHub Copilot Chat again for help creating a remote codespace on top of our repository, by asking: ‘How can I create now a GitHub Codespaces starting from the devcontainer configuration files I have in this folder structure?’

Following the instructions provided in the reply will enable us to build and open a codespace configured with the pre-defined requirements.

Writing, debugging, and documenting our Python Code 

Once we open our codespace in Visual Studio Code, we can start with our experiments. The first step of our project is to import the data we’ll be using to train our model into a pandas DataFrame. As we write our Python code, we can see that GitHub Copilot provides inline suggestions (the grey line in the screenshot) that we can accept in full, accept in part, or ignore.

Also, since the pandas library was listed in the requirements file used to build the codespace, no extra step is needed before executing our first code cell.
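A first cell along these lines might look like the sketch below. The article does not show the dataset itself, so the column names and the four sample rows are invented for illustration, and an in-memory string stands in for the real file; in the notebook we would pass the path of our own CSV file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Invented sample rows standing in for the historical rentals file.
# The column names are assumptions made for this sketch.
SAMPLE_CSV = io.StringIO(
    "season,temp,humidity,rentals\n"
    "1,0.34,0.81,331\n"
    "2,0.52,0.63,1471\n"
    "3,0.71,0.58,2028\n"
    "4,0.41,0.70,1162\n"
)

# Load the data into a DataFrame and take a first look at it.
bike_data = pd.read_csv(SAMPLE_CSV)
print(bike_data.head())
print(bike_data.dtypes)
```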

Now let’s move on to some data visualization. Let’s imagine we would like to create a histogram with matplotlib representing the distribution of bike rentals in our dataset.

Suppose we forget to define the axes object before using it: we get a NameError exception. In a case like this, GitHub Copilot can help us troubleshoot the error. We just need to click the Fix using Copilot button to get an analysis of the error along with suggested code changes to fix it.
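For this kind of error, the fix usually boils down to creating the figure and axes before drawing on them. The sketch below shows one such corrected cell; it is not the exact suggestion Copilot returns, and the rental counts, bin count and labels are made up for illustration.

```python
import matplotlib

# Non-interactive backend so the sketch also runs outside a notebook;
# omit this line in Jupyter, where plots render inline.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Made-up rental counts standing in for the rentals column of the dataset.
rentals = [331, 1471, 2028, 1162, 605, 1807, 120, 960]

# Creating the figure and axes first is what avoids the NameError:
# calling ax.hist(...) without this line fails because `ax` is undefined.
fig, ax = plt.subplots(figsize=(8, 5))
counts, bin_edges, _ = ax.hist(rentals, bins=4)
ax.set_title("Distribution of bike rentals")
ax.set_xlabel("Rentals")
ax.set_ylabel("Frequency")
```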

After some data exploration and cleaning, which are outside the scope of this article, let’s suppose we are ready to train our regression model. Since this is the core part of our solution, we would like to have some clear documentation to accompany the code. We can speed up the tedious but essential task of documenting our code with the GitHub Copilot /doc command.

By selecting the piece of code of interest and then typing the command in the chat window, we can easily get the desired output, which we can use, for example, as the content of a markdown cell in our notebook.
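To give an idea of the end result, here is a sketch of a documented training function. It makes several assumptions: the article does not say which regression algorithm is used, so scikit-learn's `LinearRegression` is a stand-in, `train_rental_model` is a hypothetical name, the docstring is hand-written rather than actual /doc output, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_rental_model(features, target, test_size=0.3, random_state=0):
    """Train a linear regression model to predict bike rentals.

    Parameters
    ----------
    features : array-like of shape (n_samples, n_features)
        Input variables, such as seasonality and weather conditions.
    target : array-like of shape (n_samples,)
        Number of rentals observed for each sample.
    test_size : float
        Fraction of the samples held out for evaluation.
    random_state : int
        Seed that makes the train/test split reproducible.

    Returns
    -------
    tuple
        The fitted model and its root mean squared error on the held-out data.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    return model, rmse


# Synthetic data for illustration only: rentals grow linearly with temperature.
rng = np.random.default_rng(0)
temperature = rng.uniform(0.0, 1.0, size=(100, 1))
rentals = 2000.0 * temperature[:, 0] + 100.0

model, rmse = train_rental_model(temperature, rentals)
print(f"RMSE on held-out data: {rmse:.4f}")
```

Because the synthetic target is an exact linear function of the input, the error here is close to zero; real rental data would of course be noisier.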

Summary 

In this article we provided some tips and tricks for data scientists to improve their productivity and enhance collaboration by leveraging GitHub tools, Python, and VS Code. We covered how to create a reproducible workspace using GitHub Codespaces and interact with GitHub Copilot through a chat interface to streamline project setup. We also demonstrated how GitHub Copilot provides inline suggestions, assists with debugging, and helps automate documentation tasks, ultimately improving the efficiency and effectiveness of data science projects.

If you try the prompts we used in our demo in your own environment, be aware that you might get different results. This is because GitHub Copilot is powered by the OpenAI GPT-4 model, which, like every large language model, is non-deterministic: the same input can produce different outputs.
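A toy example can make this behaviour concrete. The sketch below is not how GitHub Copilot is implemented: it only mimics the general idea of drawing the next token at random from a probability distribution, and both the tokens and their probabilities are invented.

```python
import random

# Invented next-token probabilities for one fixed prompt.
NEXT_TOKEN_PROBS = {"pandas": 0.5, "numpy": 0.3, "matplotlib": 0.2}


def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Unseeded generator: the input never changes, yet the output can differ per run.
rng = random.Random()
completions = [sample_token(NEXT_TOKEN_PROBS, rng) for _ in range(5)]
print(completions)
```

Running the cell several times typically prints different lists, even though the "prompt" (the probability table) stays the same.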

Interested in learning more about using GitHub and VS Code to improve your productivity? Watch the VS Code Explains series and attend the Supercharging your Data Science projects with GitHub tools webinar. 

About the Authors:

Carlotta Castelluccio – AI Cloud Advocate.

Carlotta Castelluccio is a Cloud Advocate focused on Machine Learning and Artificial Intelligence (AI) and based in Italy. Her mission as a Cloud Advocate is to help every student, developer or startup founder succeed with AI by building innovative solutions responsibly. To achieve this goal, she develops technical content and hosts skilling sessions, enabling her audience to get the most out of AI technologies. She also engages with Microsoft communities and ensures they have an impact on the evolution and refinement of Microsoft products.

 

Gabriela de Queiroz – Director of AI – Microsoft for Startups

With 15+ years of experience in the data space, Gabriela has worked in research and in several startups from different industries, including Software, Financial, Advertisement, and Health. Throughout her career, she has built diverse teams, created sophisticated data science solutions, and engaged with customers and stakeholders to deliver business insights and drive data-centric decisions. She is passionate about building innovative solutions, understanding business gaps and customer needs, and delivering a flawless experience.

 

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
