Data Science from Cyberspace #1
By: Alex Perrier – ODSC data science team contributor
Every week we bring you a selection of the best data science articles we found floating around Cyberspace.
Writing an R package from scratch
In the Python vs R debate, one argument in favor of R is the amazing diversity and richness of R packages. In June 2015, over 6,700 packages were available on CRAN.
Now you too can create your own package and add your contribution to this data science treasure.
In this blog post, Hilary Parker walks us through the creation of an R package. All the steps are covered, from creating a package directory and adding functions and documentation to installing the package and pushing it to GitHub. R packages are a great way to share your code. For a more in-depth exploration of R packages, see also the online R package book by Hadley Wickham, Chief Scientist at RStudio.
You’ve probably heard about TensorFlow, the neural network / deep learning library open-sourced by Google last November. A small team led by Shan Carter (@shancarter), a data visualization expert, and Daniel Smilkov (@dsmilkov) from Google Research has created the TensorFlow Playground, a website for educational purposes. The Playground lets you experiment with different settings and problems and see the results in real time. The code has been open-sourced and is available on GitHub under the Apache License. Start with a choice of 4 datasets representing different 2D topologies, choose your inputs, the number of layers and the number of neurons per layer, and watch the resulting convergence behavior and loss. You can also tune your neural network by changing the regularization (L1, L2), the learning rate and the activation type.
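To make those knobs concrete, here is a minimal sketch in plain Python. It is not the Playground's code and does not use TensorFlow. It assumes a single hidden layer of tanh neurons, per-sample gradient descent and a squared-error loss; the function name `train`, its parameters and the four-point XOR dataset are all made up for illustration.

```python
import math
import random

# Hypothetical toy dataset: four XOR-like points with targets in {-1, +1}.
XOR = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

def train(data, hidden=4, learning_rate=0.1, l2=0.0, epochs=1000, seed=0):
    """Train a 2-input network with one tanh hidden layer by per-sample
    gradient descent. Returns the mean squared-error loss of each epoch."""
    rng = random.Random(seed)
    # w1[j] = [weight for x1, weight for x2, bias] of hidden neuron j
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    # w2 = one output weight per hidden neuron, followed by the output bias
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for (x1, x2), target in data:
            # Forward pass
            h = [math.tanh(w[0] * x1 + w[1] * x2 + w[2]) for w in w1]
            out = math.tanh(sum(w2[j] * h[j] for j in range(hidden)) + w2[hidden])
            err = out - target
            total += 0.5 * err * err
            # Backward pass (tanh'(z) = 1 - tanh(z)^2)
            d_out = err * (1 - out * out)
            for j in range(hidden):
                # Use the current output weight before it is updated
                d_h = d_out * w2[j] * (1 - h[j] * h[j])
                # The L2 penalty adds l2 * weight to each weight gradient
                w2[j] -= learning_rate * (d_out * h[j] + l2 * w2[j])
                w1[j][0] -= learning_rate * (d_h * x1 + l2 * w1[j][0])
                w1[j][1] -= learning_rate * (d_h * x2 + l2 * w1[j][1])
                w1[j][2] -= learning_rate * d_h  # biases are not regularized
            w2[hidden] -= learning_rate * d_out
        losses.append(total / len(data))
    return losses

plain = train(XOR)
regularized = train(XOR, l2=0.01)
print("first/last loss without L2:", plain[0], plain[-1])
print("first/last loss with L2:   ", regularized[0], regularized[-1])
```

Changing `hidden`, `learning_rate` or `l2` and comparing the loss curves is a rough, text-only version of what the Playground shows visually.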
Catching Star Wars surprises and other spoilers with Machine Learning
When you are a data scientist, the world around you becomes ripe with datasets and Machine Learning projects begging to be explored.
Ruth Toner, data scientist at Twitch, decided to build a spoiler detection tool using classic Machine Learning models and Data Science methods.
By mining the Tumblr site for posts related to Star Wars, Ruth was able to develop a spoiler detection tool called Fanguard, which is available at http://fanguard.xyz. This project is a great example of data science applied to real world datasets: building the dataset, applying NLP techniques to remove noise, selecting and tuning the model and the features, and assessing the results through proper metrics and visualization.
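The workflow described above (build a dataset, clean the text, fit a model, measure the results) can be sketched in a few lines of plain Python. This is not Fanguard's code: the model here is a multinomial naive Bayes classifier chosen purely for illustration, and the stopword list, the six training posts and the names `tokenize`, `NaiveBayes` and `precision_recall` are all invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, standing in for the noise-removal step.
STOPWORDS = {"the", "a", "is", "in", "of", "and", "to", "i", "at", "was", "my"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        tokens = tokenize(text)
        n_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[c] / n_docs)
            for t in tokens:
                score += math.log((self.word_counts[c][t] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = c, score
        return best

def precision_recall(y_true, y_pred, positive="spoiler"):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented toy posts, not real Tumblr data.
train_texts = [
    "the hero dies at the end",
    "the villain is revealed as her father",
    "she kills the villain in the final scene",
    "bought my tickets for the premiere tonight",
    "the new trailer looks amazing",
    "love the soundtrack and the posters",
]
train_labels = ["spoiler"] * 3 + ["safe"] * 3
model = NaiveBayes().fit(train_texts, train_labels)

test_texts = ["the villain dies in the final scene", "new trailer and posters tonight"]
test_labels = ["spoiler", "safe"]
predictions = [model.predict(t) for t in test_texts]
print(predictions, precision_recall(test_labels, predictions))
```

On a real dataset the evaluation would use a proper held-out split, and precision matters as much as recall: a filter that hides too many harmless posts is as annoying as one that lets spoilers through.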
Data Science and Extreme Programming Explained
Extreme programming and Agile methodologies revolutionized software development when they were first introduced years ago. They drove companies to move from rigid waterfall processes to adaptive, collaborative workflows that were faster and, in the end, improved the overall quality of products and applications.
In this article, Ian Huston (@ianhuston), Data Scientist at Pivotal Labs in London, explores the similarities in core concepts and values between the extreme programming approach and the way data scientists work. In short, both share a need for simplicity and open communication, and both work better when feedback is given with respect. In both domains, including the end user or the project stakeholder in the process boosts success and improves deliverables.
Ian Huston will be speaking at the next Open Data Science conference, in London on Oct 8-9.
Deep Learning for Chatbots
Chatbots, aka conversational agents, have enjoyed renewed interest from large companies and start-ups alike. But as the recent disaster with Microsoft's Tay chatbot has shown, building smart generative chatbots is still rife with dangers.
In this article, Denny Britz (@dennybritz) goes over some of the Deep Learning techniques that are used to build these conversational agents. The post covers the key concepts and the differences between types of chatbots: retrieval models (pick from a finite set of answers) vs generative models (generate content from scratch), open vs closed domain, and the presence or absence of a specific goal. Keeping track of context and producing answers that are both coherent and diverse are specific challenges that need to be addressed.
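The retrieval vs generative distinction can be illustrated without any deep learning at all. The sketch below is not taken from the article: the retrieval side uses bag-of-words cosine similarity and the generative side uses a toy bigram Markov chain, both standing in for the neural models the post discusses. The canned questions and answers and the names `retrieve`, `build_bigrams` and `generate` are invented for this example.

```python
import math
import random
import re
from collections import Counter, defaultdict

def bag(text):
    """Bag-of-words representation: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical closed-domain knowledge base: known question -> canned answer.
RESPONSES = {
    "what are your opening hours": "We are open from 9am to 5pm.",
    "how do i reset my password": "Use the reset link on the login page.",
    "where is my order": "You can track your order from your account page.",
}

def retrieve(message):
    """Retrieval model: return the answer of the most similar known question."""
    query = bag(message)
    best = max(RESPONSES, key=lambda known: cosine(query, bag(known)))
    return RESPONSES[best]

def build_bigrams(corpus):
    """Map each word to the list of words observed right after it."""
    model = defaultdict(list)
    for sentence in corpus:
        words = ["<s>"] + re.findall(r"[a-z]+", sentence.lower()) + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            model[prev].append(nxt)
    return model

def generate(model, rng, max_words=12):
    """Generative model: sample a reply word by word from the bigram table."""
    word, out = "<s>", []
    while len(out) < max_words:
        word = rng.choice(model[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(retrieve("I forgot my password, how can I reset it?"))
bigrams = build_bigrams(RESPONSES.values())
print(generate(bigrams, random.Random(0)))
```

The retrieval bot can only ever say one of its three canned answers, so it stays grammatical but cannot handle unseen cases. The generator can produce new word sequences, but nothing guarantees they are coherent, which is a small-scale version of the trade-off described above.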
This article is the first in a series and will be followed by more in-depth technical ones.
The ODSC Conference
If you want to dig deeper into these Data Science and Machine Learning topics, do not miss the next Open Data Science conference in Boston, May 20-22. Here’s a sample of talks and speakers:
- On R: Joe Cheng, CTO at RStudio, JJ Allaire, Founder and CEO at RStudio, and Max Kuhn, Director of Nonclinical Statistics in Pfizer R&D, will talk about R and the R ecosystem (RStudio, Notebooks, Visualization, rCharts…).
- On Deep Learning: Kevin Robinson from MIT will present “A tour of the TensorFlow codebase”, Kaz Sato from Google will present “TensorFlow: Large-scale deep learning for intelligent computer systems”, and Chris Fregly from IBM will present “Using TensorFlow, GPUs, and Deep Neural Networks for Natural Language Processing and Real-time Predictions”.
- Real world applications are the bread and butter of Data Science. The conference will showcase many such real world cases from start-ups and big companies in domains such as security, social media mining, network analysis, health, housing and many more.
- On Agile methodologies for Data Science: Gil Benghiat, Founder at DataKitchen, will give a talk on “How to Do Agile Analytics in Seven Shocking Steps”.
Check out the amazing list of speakers. With over 100 talks, workshops and tutorials, the ODSC conference is a must-attend event for all data scientists.
To read more from Alex, sign up for our newsletter or follow him on Twitter @alexip.