R Data Science Digest: April 2021

A list of the most popular posts featured on ‘R Posts you might have missed!’ in March 2021. All of the most exciting R resources in statistics, visualisation, reporting, data manipulation and lots more!

Featured posts

shinysurveys • Develop and deploy surveys in Shiny/R by Jonathan Trattner and Lucy D’Agostino McGowan

  • "shinysurveys provides easy-to-use, minimalistic code for creating and deploying surveys in Shiny. Originally inspired by Dean Attali’s shinyforms, our package provides a framework for robust surveys, similar to Google Forms, in R with Shiny."

autoplotly • Automatic Generation of Interactive Visualizations for Statistical Results by Yuan Tang

  • "provides functionalities to automatically generate interactive visualizations for many popular statistical results supported by ggfortify package with plotly.js and ggplot2 style. The generated visualizations can also be easily extended using ggplot2 syntax while staying interactive."

DataCoaster Tycoon: Building 3D Rollercoaster Tours of Your Data in R by Tyler Morgan-Wall

  • "a tech demonstration showing how to create animations through 3D space with R and rayrender’s new camera animation API."

Statistics

Machine learning

Visualisation

Visualisation: Graphics

Visualisation: Interactive

Visualisation: Themes and palettes

Publishing

Data manipulation

R learning resources

Data

Utilities

Spatial

Enjoyed this article? Subscribe for future posts by email 👇

Please consider supporting me by becoming a patron and helping to shape the future of ‘posts you might have missed’. Have a look over at https://www.patreon.com/alastairrushworth, thanks! 🙏

Python DS Digest: April 2021

A list of popular posts and resources shared on ‘Python DS you might have missed!’ in March 2021. All of the most exciting Python resources in machine learning, data manipulation, visualisation and lots more!

Machine learning: scikit-learn

Machine learning: deep learning

Machine learning: statistical learning

Data manipulation

Visualisation

Geospatial

Scientific python

Python development

Python in finance

Enjoyed this article? Subscribe for future posts by email 👇

Please consider supporting me by becoming a patron and helping to shape the future of ‘posts you might have missed’. Have a look over at https://www.patreon.com/alastairrushworth, thanks! 🙏

Python DS Digest: March 2021

A list of popular posts and resources shared on ‘Python DS you might have missed!’ in February 2021. Topics this month include machine learning, deep learning, data manipulation, statistical modelling, IDEs and notebooks, and dashboards.

Machine learning

Deep learning

Feature selection and engineering

Notebooks and IDEs

Visualisation

Statistical modelling

Dashboards and apps

Data manipulation & pandas

Tools & utilities

Enjoyed this article? Subscribe for future posts by email 👇

R Data Science Digest: March 2021

A list of the most popular posts featured on ‘R Posts you might have missed!’ in February 2021. All of the most exciting R resources in machine learning, visualisation, Shiny, data manipulation and lots more!

Machine learning

Visualisation

Shiny

Developer tools & utilities

Books, courses and learning resources

Markdown and reporting

Data manipulation

Spatial analysis

Statistical modelling

Enjoyed this article? Subscribe for future posts by email 👇

R Data Science Digest: February 2021

A thematic list of popular posts shared on ‘R posts you might have missed!’ in January 2021. Topics this month include rmarkdown, ggplot2, machine learning and loads of cool tools & utilities!

Top picks

Markdown and publishing

ggplot2 and visualisation

  • Creating and using custom ggplot2 themes • the best way to make each plot your own, by Tom Mock.

  • ggnewscale • Multiple Fill, Color and Other Scales in {ggplot2} by Elio Campitelli.

  • ggeasy • {ggplot2} shortcuts (transformations made easy) by Jonathan Carroll.

  • ggbernie • A {ggplot2} geom for adding Bernie Sanders to {ggplot2} by R CODER.

  • ggprism • {ggplot2} extension inspired by GraphPad Prism by Charlotte Dawson.

  • basetheme • Themes for base plotting system in R by Karolis Koncevičius

  • mully • R package to create, modify and visualize graphs with multiple layers by Frank Kramer.

  • Automating exploratory plots with ggplot2 and purrr by Ariel Muldoon.

  • ggstatsplot • ggscatterstats: a publication-ready scatterplot with all statistical details included in the plot itself to show association between two continuous variables by Indrajeet Patil.

  • popcircle • Circlepacked geo polygons by Timothée Giraud.

Spatial and Mapping

Machine Learning and Statistical Modelling

R Learning Resources

Workflow and Utilities

Tools

  • officeverse • This book deals with reporting from R with the packages {officer}, {officedown}, {flextable}, {rvg} and {mschart}, by David Gohel.

  • countrycode • Convert country names and country codes. Assigns region descriptors. by Vincent Arel-Bundock.

  • disk.frame • Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data by ZJD (http://evalparse.com).

  • vroom • Read and Write Rectangular Text Data Quickly (The fastest delimited reader for R, 1.48 GB/sec.) by Jim Hester and Hadley Wickham.

  • visdat • Using {visdat} (Preliminary Exploratory Visualisation of Data) by Nicholas Tierney.

Enjoyed this article? Subscribe for future posts by email 👇

Python DS Digest: February 2021

A list of popular posts and resources shared on ‘Python DS you might have missed!’ in January 2021. Topics this month include NLP, data manipulation, machine learning, time series, Jupyter notebooks and dashboards.

NLP

  • datasets • The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools by Hugging Face.

  • Spacy Course • In the course, you’ll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches by Ines Montani & SpaCy.

  • ecco • Visualize and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2) by Jay Alammar.

Data Manipulation

Time series

  • Time series Forecasting in Python & R, Part 1 (EDA) • Time series forecasting using various forecasting methods in Python & R in one notebook by Sandeep Pawar.

  • deeptime • Python library for analysis of time series data including dimensionality reduction, clustering, and Markov model estimation by Moritz Hoffmann.

  • adtk • A Python toolkit for rule-based/unsupervised anomaly detection in time series by Arundo Analytics.

Machine Learning Tools

  • giotto-tda • A high-performance topological machine learning toolbox in Python by giotto.ai.

  • automlbenchmark • OpenML AutoML Benchmarking Framework by OpenML.

  • Ludwig • a toolbox that allows to train and test deep learning models without the need to write code, by Ludwig at Uber Open Source.

  • norse • Deep learning with spiking neural networks (SNNs) in PyTorch by Christian Pehle and Jens Egholm.

  • handtracking • Building a Real-time Hand-Detector using Neural Networks (SSD) on Tensorflow by Victor Dibia.

  • Python Autocomplete • Use Transformers and LSTMs to learn Python source code by LabML.

  • AutoGL • An autoML framework & toolkit for machine learning on graphs by the Media and Network Lab at Tsinghua University.

  • shapash • Shapash makes Machine Learning models transparent and understandable by everyone. Developed by MAIF (https://maif.github.io).

  • pyod • A Python Toolbox for Scalable Outlier Detection (Anomaly Detection) by Yue Zhao.

  • uncertainty-toolbox • A python toolbox for predictive uncertainty quantification, calibration, metrics, and visualization by Willie Neiswanger.

Learning Machine Learning

Jupyter

Dashboards and Web Apps

Utilities

Data Science learning resources

  • Coding for Economists Chapter 1. Intro to Mathematics with Code — In this chapter, you’ll learn about doing mathematics with code, including solving equations symbolically by Dr Arthur Turrell.

  • algorithms • Minimal examples of data structures and algorithms in Python by Keon.

  • intro-sc-python • Python Tools for Data Science, Machine Learning, and Scientific computing by Pablo Caceres.

  • Classic Computer Science Problems In Python • Source Code for the Book Classic Computer Science Problems in Python by David Kopec.

  • DS3 Practical Optim for ML • Notebooks from DS3 course on practical optimization by Alexandre Gramfort.

  • Data Science from Scratch (Chapter 8). Gradient Descent: Building gradient descent from the ground up, by Paul Apivat.

Visualisation

Enjoyed this article? Subscribe for future posts by email 👇

Exploratory Data Analysis: what’s the point?

Exploratory data analysis, or EDA, is one of the most important but most difficult-to-codify parts of the data science toolkit. True exploratory analysis lacks a sharply definable objective and resists being formalised into a set of clear steps. Despite this, EDA is used in at least a few very typical ways that connect to downstream tasks like data cleaning and hypothesis generation. But perhaps most importantly, it’s an integral part of how we learn to frame our thinking as data scientists. This post attempts to offer some perspective on the less-discussed ways in which EDA develops our contextual understanding of a data analysis.

EDA for checking, validation and cleaning

Let’s get the obvious stuff out of the way first. Where a rough analysis plan is already in place, and some data has been assembled to support the analysis, a type of EDA serves to identify potential issues that might require remedial work before progressing. This is probably the most common type of exploratory analysis and is more closely linked to the goals of data cleaning than to pure analysis and insight. This is a big topic and I won’t attempt an exhaustive list here, but instead will describe a few of the most common tasks.

The most common check is for the correctness of column types. Depending on the data source, different issues might arise here, but you’ll be familiar with at least some of these: integers incorrectly encoded as strings, strings encoded as dates, unordered categories encoded as integers. Sometimes a column that should be numeric has the very occasional string entry. There are as many causes as there are issues: perhaps you didn’t specify the correct schema when you read the data; or the data are encoded in an ambiguous way that results in an inappropriate type; or maybe some earlier data manipulation induced an unintended problem.
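
A minimal sketch of what this first pass might look like in R (df stands in for whatever data frame you’ve just read in, and amount for a hypothetical column that ought to be numeric but may have been read as character):

    # Inspect the class of every column to spot obvious mis-typings
    sapply(df, class)

    # A numeric column read as character often hides a handful of stray string
    # entries; coercing and counting the resulting NAs shows how many values
    # fail to parse, and which ones
    bad <- is.na(suppressWarnings(as.numeric(df$amount))) & !is.na(df$amount)
    sum(bad)
    head(df$amount[bad])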

We often check the prevalence of missing values and their dependence on other important features – usually because a lot of analysis methods do not handle missing values natively. Some columns may be totally unusable if they are mostly missing. Remedies here might include dropping or transforming columns, imputing missing values, or choosing an algorithm that handles missingness out of the box.
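
A quick sketch of this in base R, again with df as a placeholder data frame and income and region as hypothetical columns:

    # Proportion of missing values in each column, worst first
    sort(colMeans(is.na(df)), decreasing = TRUE)

    # Does missingness in one column depend on another feature?
    # e.g. compare the missingness rate of income across levels of region
    tapply(is.na(df$income), df$region, mean)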

Distribution, shift and relevance: it is important to inspect the distribution of values in each column and consider whether these look how we’d expect (where we have an expectation). Do the distributions covary, especially with time? (Data are almost never consistent with stationarity over time.) Thinking about distributional shift is crucial when deciding which window of data is most important or relevant for addressing a specific question, and it might expose or confirm trends and temporal patterns that downstream analysis needs to be aware of.
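
A rough ggplot2 sketch of this kind of check, assuming df contains a date column and a numeric value column (both hypothetical names):

    library(ggplot2)

    # Boxplots of value within each month: does the distribution drift over time?
    ggplot(df, aes(x = format(date, "%Y-%m"), y = value)) +
      geom_boxplot() +
      labs(x = "month", y = "value",
           title = "Distribution of value by month")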

Measuring pairwise association provides some basic insight into how columns covary and can help reveal columns that are collinear, or even identical, and could be removed without detriment. It might help uncover some of the overall structure in the data or indicate collections of related columns. Pairwise association measures, like the Pearson correlation coefficient, are overused in this context and provide only a linear, unconditional view of the relationship between columns. Nevertheless, a lot of insight can be gleaned from this type of analysis if you know what to look for.
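
As a simple sketch over the numeric columns of a placeholder data frame df:

    # Pearson correlations between all numeric columns, using pairwise complete cases
    num_cols <- df[sapply(df, is.numeric)]
    round(cor(num_cols, use = "pairwise.complete.obs"), 2)

    # Pairs with correlation very close to 1 or -1 are candidates for removal
    # (collinear or duplicated columns)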

These types of techniques provide a first look at the data and answer important questions about quality, formatting and overall dependence structure. These steps can usually be carried out by the data analyst without any external support, and are generally well supported with easy-to-use code wrappers. These are absolutely essential steps and it’s possible to learn quite a lot about the data by applying them and thinking carefully about the results. But it’s very important to recognise that there is a limit to how much can be understood with this type of analysis. There’s a lot more to EDA.

grokking the data with EDA

When you claim to “grok” some knowledge or technique, you are asserting that you have not merely learned it in a detached instrumental way but that it has become part of you, part of your identity.

‘grok’, The Jargon File

I’ve totally made up the heading, but I think it’s by far the most important role of EDA and is mostly what the rest of this post is about. There is a sort of myth of the data analyst as a robotic processor of data, detached and passive. The reality is completely the opposite: better data analysis will always come from an analyst with a deep understanding of the data and the processes that generated it. EDA has a crucial role in turning a data frame from a contextless collection of bytes into a meaningful representation of a physical process, moving the analyst from passive processor to expert with a deeply internalised understanding of an area. This end state is intangible and qualitative because it happens entirely in your own head. Consequently, this part of EDA will be a creative and personal journey, supported by a continuing internal conversation that probes and revisits your understanding of the broader context.

Building a data narrative

The data frame you have in front of you for analysis is an incomplete and encoded representation of some real-world process. Part of your role as an analyst is to solve problems and generate insights that respect the story of how the data were generated. For want of a better description, let’s call this story the data narrative. Part of this narrative might be the sequencing of events that led to each data record coming into existence; part of it might be the data’s lineage in terms of the processing, joins and wrangling required to produce the data frame you end up with. If you are already an expert in the area you are working in, this narrative may already be ingrained in you. The data narrative completely frames the work you do, how you interpret every insight or modelled output, and most importantly, the credibility with which you can influence your audience.

The data narrative is a complex form of metadata and is almost never part of the data frame. If you are fortunate, your organisation might keep clear and accessible documentation and data dictionaries that will be a huge first step to piecing together this narrative. However, it is often more typical that analysts are neither domain experts nor well-provided with nice documentation. In this case, the narrative is something that must be synthesised through detective work, drawing on a combination of data analysis and the experience of domain experts. This is, of course, much easier said than done.

[Figure: a visual representation of the process of EDA and developing the data narrative.]

The role of asking questions

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

R for Data Science, Hadley Wickham (2021)

EDA can’t happen in a computational vacuum. To do this well, we need to alternate between interrogating the data and asking ourselves whether what we find is consistent with our internal understanding of the data’s narrative. Does what I see make sense to me? Would I feel comfortable explaining it to someone else?

In large organisations, you might also be speaking with a domain expert to help with this (if that person isn’t you), though they needn’t be an internal expert if the data come from outside the organisation. If you don’t yet have direct access to such a person, demand that you do – this person will supercharge your eventual analysis and will often be the difference between success and failure for the entire project. In the beginning, take lots of time to let experts talk more broadly about the data, as they understand all of the salient dependencies, anomalies and gotchas that will save you a lot of time in the long run. Take time to use simple data analysis to carefully confirm what you’re told. The key here is not checking for correctness, but growing your understanding of the data: it’s one thing to be told something about the data narrative, but it’s much more meaningful to use your own analysis to see it expressed in the data.

As your understanding deepens and the analysis progresses, you’ll continue to find new patterns and structure in the data. Keep revisiting your understanding of the data narrative, and check whether what you are seeing is consistent with that. As your understanding of the data narrative matures, the gaps will come into focus: consider creative ways to use the data or ask a relevant question to close the gap. The relationship between internalised data narrative and data exploration is a two-way street.

Take time to talk your findings over with another data scientist. The key here is to aim to communicate your understanding of the data narrative without getting too mired in the technical details of the data. The process of preparing a narrative that you can explain to a colleague will help to consolidate what you’ve learned and quickly expose gaps. A fresh set of eyes will nearly always raise further questions or force you to think of your data from a different perspective.

Do we even have the right data?

An important byproduct of building a better data narrative is that your sense of which questions are the most important or relevant will improve. A crucial question to keep revisiting is whether the data you have are sufficient to address those questions. Are there additional data sources that you could draw upon to enrich or improve the analysis? Are the columns you already have in your data frame defined correctly, or should they really be specified differently? Data sets are typically assembled before anyone knows exactly how they will be used, and it can pay dividends to keep revisiting whether the data contain everything needed to answer a particular question. Many problems in data science are much more easily solved by gathering the right data (or more of it) than by using fancier techniques.

Data hygiene and data splitting

If you frequently fit predictive models, you’ll be aware of the risks of overfitting and the need to reserve partitions of the data to check that your findings truly generalise to unseen data. The same is true for the iterative types of EDA discussed in this article. The more detailed your analysis is, the higher the risk that insights gleaned in your EDA are false discoveries (aka statistical flukes). It is important that the confirmatory part of your analysis (prediction accuracy measurement or hypothesis testing) occurs on a different piece of data to your EDA.

A related problem frequently arises in machine learning projects when EDA is run as a preliminary step before creating training and test splits. If the results of EDA influence your model choices (they nearly always will, if the EDA is done properly), then you’ve potentially reduced your test set’s ability to measure true out-of-sample error. So before you do anything, create a hygienic environment for your EDA by splitting your data, so that you don’t accidentally leak information from your test set into your model.
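
A minimal sketch of this in base R, with df again standing in for your data:

    # Hold out a test set *before* any exploratory work, so that EDA-driven
    # modelling choices cannot leak information from the test set
    set.seed(42)
    n        <- nrow(df)
    test_idx <- sample(n, size = floor(0.2 * n))
    df_test  <- df[test_idx, ]
    df_train <- df[-test_idx, ]

    # All EDA and model development happens on df_train;
    # df_test is only touched at final evaluation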

Creativity and the pitfalls of the data frame API

This was the tendency of jobs to be adapted to tools, rather than adapting tools to jobs.

Silvan Tomkins, Computer Simulation of Personality: Frontier of Psychological Theory (1963)

Almost all data analysis now begins with some form of data frame – a tabular format with columns of mixed types, where each row is a record. In Python and R, data analysis tooling has coalesced around the data frame object, which has been a huge convenience and productivity boost for the analyst. I wouldn’t for a second deny that this has been a positive development, but there is a risk that EDA, because the tooling is so easy and uniform to use, becomes an exercise in applying boilerplate code. This creates a hidden creativity trap, where the analysis can become narrowed to the range of uses supported by a particular set of tools. While such tools are extremely powerful when they genuinely support you in developing your understanding of the data narrative, it’s important to avoid becoming too reliant on any single tool.

My experience is that it’s good to be familiar with tooling at multiple levels of abstraction. Extremely high-level interfaces that auto-generate certain types of exploratory analysis are very handy, and big time savers when they provide just what you need. However, the majority of EDA is more creative in nature, and becoming expert with data manipulation tools like dplyr and pandas, in combination with graphical tools like ggplot2 and matplotlib, provides much finer control and fewer restrictions on your creativity.
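
As a rough example of the kind of bespoke question that auto-generated reports rarely answer, a dplyr and ggplot2 sketch (with df, date, group and value as placeholder names) might look like:

    library(dplyr)
    library(ggplot2)

    # How does the typical value move across groups and months?
    df %>%
      mutate(month = format(date, "%Y-%m")) %>%
      group_by(group, month) %>%
      summarise(median_value = median(value, na.rm = TRUE), .groups = "drop") %>%
      ggplot(aes(x = month, y = median_value, colour = group, group = group)) +
      geom_line() +
      labs(title = "Median value by group and month")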

The main point here is that exploratory data analysis can’t and shouldn’t be automated, because it is a process to support a human (you) to learn, and to do that well, there are few shortcuts.

Closing thought: an analogy with critical reading and literary analysis

Like all good blog posts, my thinking on EDA began on Twitter. In the process, Jesse Mostipak made a great point that teaching EDA effectively might share similarities with the way students are taught to interrogate literary texts. I’d never considered EDA this way, but the analogy resonated strongly with me, and much of my thinking in this post owes a lot to being sent off in this direction, 🙏 thanks Jesse! There’s a lot to unpack in the analogy, and I have no training in critical reading, so I can’t speak with any authority on that subject. Nevertheless, interrogating a text seems broadly similar to EDA in the sense that both are driven by the goal of developing a deep understanding of the material.

This article by the Farnam Street blog summarises four levels of critical reading, originally proposed by Mortimer Adler. The final and most analytical form of interrogation, called syntopical or comparative reading, hits on some of the themes I’ve discussed already:

This task is undertaken by identifying relevant passages, translating the terminology, framing and ordering the questions that need answering, defining the issues, and having a conversation with the responses.

The goal is not to achieve an overall understanding of any particular book, but rather to understand the subject and develop a deep fluency.

This is all about identifying and filling in your knowledge gaps.

Farnam Street blog, How to Read a Book: The Ultimate Guide by Mortimer Adler

Sounds familiar, doesn’t it? Asking questions (of yourself), contextualising and framing, closing knowledge gaps and achieving fluency are all key parts of a successful EDA. What I’m most excited about here is that we can draw on the analytical framework of an existing, well-established discipline as scaffolding for thinking about how we can improve the way we teach and practise EDA. Again, full credit to Jesse for this idea.

What’s next?

This article was a bit of a free-flowing attempt to think harder about what EDA is and what its purpose is. In a future article, I’ll run through tooling that might be useful in everyday EDA, and consider ways to make practical improvements to our practice, drawing on inspiration from other analytical disciplines.

Why not subscribe for future articles?

If you enjoyed this article, why not subscribe by email for future updates and posts?

Python DS Digest: January 2021

This month, topics include Jupyter Notebook Extensions, Learning ML, ML Interpretability & fairness, Causal inference, Visualisation and lots more.

Jupyter Notebook Extensions

Learning ML

ML Interpretability & fairness

ML Tools

  • sktime-dl • sktime companion package for deep learning based on TensorFlow

  • captum • Model interpretability and understanding for PyTorch by PyTorch.

  • lightly • A python library for self-supervised learning, by Lightly.

  • tods • An Automated Time-series Outlier Detection System by Data Analytics Lab at Texas A&M University.

  • scikit-network • Python package for the analysis of large graphs by the scikit-network team.

  • river • Online machine learning in Python by Max Halford, Jacob Montiel, Saulo Martiello Mastelini, Geoffrey Bolmier.

  • ML Ops: Machine Learning Operations by INNOQ.

  • Monitoring Machine Learning Models in Production • A Comprehensive Guide by Christopher Samiullah.

Causal inference

  • causalml • Uplift modeling and causal inference with machine learning algorithms by Uber Engineering.

  • Causal Survival Analysis • Survival Analysis with examples on NHEFS data by Ayya Keshet, Hagai Rossman.

Visualisation

  • Bokeh • The Bokeh Visualization Library by Bokeh.

  • windrose • A Python Matplotlib, Numpy library to manage wind data, draw windrose (also known as a polar rose plot), draw probability density function and fit Weibull distribution by python-windrose.

  • Data Visualization Exercise with Star Wars dataset by Maico Rebong.

Data manipulation

Utilities

Cool applications

  • yasa • a Python package to analyze polysomnographic sleep recordings by Raphael Vallat.

  • lhotse • Lhotse is a Python library aiming to make speech and audio data preparation flexible and accessible to a wider community, by Piotr Żelasko.

  • cvlib • A simple, high level, easy to use, open source Computer Vision library for Python, by Arun Ponnusamy.

  • eo-learn • Earth observation processing framework for machine learning in Python, by Sentinel Hub.

Papers and paper discovery

Bayesian Machine Learning