Dr. Juan Camilo Orduz
/
Recent content on Dr. Juan Camilo Orduz
Hugo -- gohugo.io
en-us
Wed, 01 Sep 2021 00:00:00 +0000
ISLR2 - Survival Analysis Lab (lifelines)
/islr2_survival_analysis/
Wed, 01 Sep 2021 00:00:00 +0000
In this notebook we provide a Python implementation of the lab from Chapter 11, Survival Analysis, of the book An Introduction to Statistical Learning (Second Edition). You can find a free pdf version of the book here. We will use the lifelines Python package, which you can find in this repository. There is a nice introduction to survival analysis in the documentation. There are also many concrete examples and guidelines on how to use the package.
Exploring Tools for Interpretable Machine Learning
/interpretable_ml/
Thu, 01 Jul 2021 00:00:00 +0000
In this notebook we want to test various ways of getting a better understanding of how non-trivial machine learning models generate predictions and how features interact with each other. This is in general not straightforward, and key components are (1) understanding of the input data and (2) domain knowledge of the problem. Two great references on the subject are:
Interpretable Machine Learning, A Guide for Making Black Box Models Explainable by Christoph Molnar
Interpretable Machine Learning with Python by Serg Masís
Note that the methods discussed in this notebook are not related to causality.
Feature Engineering: patsy as FormulaTransformer
/formula_transformer/
Sat, 01 May 2021 00:00:00 +0000
In this notebook I want to describe how to create features inside scikit-learn pipelines using patsy-like formulas. I have used this approach to generate features in a previous post: GLM in PyMC3: Out-Of-Sample Predictions, so I will consider the same data set here for the sake of comparison.
Remark: Very recently (2021-09-01) I discovered there is an implementation of this transformer in scikit-lego, see PatsyTransformer. In addition, please refer to the great tutorial on patsy in calmcode.
GLM in PyMC3: Out-Of-Sample Predictions
/glm_pymc3/
Mon, 04 Jan 2021 00:00:00 +0000
In this notebook I explore the glm module of PyMC3. I am particularly interested in the model definition using patsy formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we need to make predictions on out-of-sample data. This experiment was motivated by the discussion in the thread “Out of sample” predictions with the GLM sub-module on the (great!
Gaussian Processes for Time Series Forecasting with PyMC3
/gp_ts_pymc3/
Sat, 02 Jan 2021 00:00:00 +0000
In this notebook we translate the forecasting models developed for the post on Gaussian Processes for Time Series Forecasting with Scikit-Learn to the probabilistic Bayesian framework PyMC3. I strongly recommend looking into the following references for more details and examples:
References:
An Introduction to Gaussian Process Regression
PyMC3 Docs: Gaussian Processes
PyMC3 Docs Example: CO2 at Mauna Loa
Bayesian Analysis with Python (Second edition) - Chapter 7
Statistical Rethinking - Chapter 14
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.
Simple Bayesian Linear Regression with TensorFlow Probability
/tfp_lm/
Tue, 06 Oct 2020 00:00:00 +0000
In this post we show how to fit a simple linear regression model using TensorFlow Probability by replicating the first example in the getting started guide for PyMC3. We are going to use Auto-Batched Joint Distributions as they simplify the model specification considerably. Moreover, there is a great resource to get deeper into this type of distribution: Auto-Batched Joint Distributions: A Gentle Tutorial, which I strongly recommend (see this post to get a brief introduction to TensorFlow Probability distributions).
Open Data: Berlin Kitas
/kitas_berlin/
Sat, 19 Sep 2020 00:00:00 +0000
In this notebook I want to explore some data I found on the Berlin Open Data portal daten.berlin.de. The data source contains information on Kitas (Kindertagesstätte, i.e. kindergartens) in Berlin. This is a big topic as finding a spot in a Kita in Berlin is extremely difficult. We first provide an initial exploratory data analysis of the data set, then we merge it with population data to create some geo-location maps.
A Simple Hamiltonian Monte Carlo Example with TensorFlow Probability
/tfp_hcm/
Fri, 24 Jul 2020 00:00:00 +0000
In this post we want to revisit a simple Bayesian inference example worked out in this blog post. This time we want to use TensorFlow Probability (TFP) instead of PyMC3.
References:
Statistical Rethinking is an amazing reference for Bayesian analysis. It also has a sequence of online lectures freely available on YouTube.
An introduction to probabilistic programming, now available in TensorFlow Probability
There are many examples in TensorFlow’s GitHub repository.
Regression Analysis & Visualization
/lm_viz/
Fri, 26 Jun 2020 00:00:00 +0000
In this notebook I want to collect some useful visualizations which can help model development and model evaluation in the context of regression analysis. I use many visualization resources not only to share results but also as a key component of my workflow: data QA, EDA, feature engineering, model development, model evaluation and communicating results. In this notebook I focus on a simple regression model (time series) with statsmodels and visualization with matplotlib and seaborn.
A Glimpse into TensorFlow Probability Distributions
/intro_tfd/
Tue, 16 Jun 2020 00:00:00 +0000
In this notebook we want to take a look at the distributions module of TensorFlow Probability. The aim is to understand the fundamentals and then explore this probabilistic programming framework further. Here you can find an overview of TensorFlow Probability. We will concentrate on the first part of Layer 1: Statistical Building Blocks. As you can see from the distributions module documentation, there are many classes of distributions. We will explore a small sample of them in order to get a general overview.
Disease Spread Simulation (Animation)
/infection_sim/
Tue, 28 Apr 2020 00:00:00 +0000
We describe how to generate a basic disease spread simulation. We explore how to do animations in Matplotlib.
Getting Started with Spectral Clustering
/spectral_clustering/
Sat, 04 Apr 2020 00:00:00 +0000
In this post I want to explore the ideas behind spectral clustering. I do not intend to develop the theory. Instead, I will unravel a practical example to illustrate and motivate the intuition behind each step of the spectral clustering algorithm. I particularly recommend two references:
For an introduction to and overview of the theory, see the lecture notes A Tutorial on Spectral Clustering by Prof. Dr. Ulrike von Luxburg.
For a concrete application of this clustering method, see the PyData talk: Extracting relevant Metrics with Spectral Clustering by Dr.
The Volume of the d-Ball via Monte Carlo Simulation
/vol_d_ball/
Mon, 24 Feb 2020 00:00:00 +0000
In this notebook we run Monte Carlo simulations to estimate the volume of the \(d\)-ball \[ B^{d}:=\{x \in \mathbb{R}^d : ||x|| \leq 1\}. \] There are many ways to obtain a closed formula for this volume, see for example this Wikipedia article. Here we do it via sampling just for fun!
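The sampling approach can be sketched in a few lines of Python. This is a minimal illustration rather than the notebook's own code: the function names `estimate_ball_volume` and `exact_ball_volume`, the sample size and the seed are assumptions made here. It draws points uniformly from the cube \([-1, 1]^d\) (volume \(2^d\)), counts the fraction that lands inside the unit ball, and compares the result with the closed formula \(\pi^{d/2}/\Gamma(d/2+1)\).

```python
import math
import random


def estimate_ball_volume(d, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the volume of the unit d-ball (illustrative helper)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        # Sample a point uniformly from the cube [-1, 1]^d.
        point = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        # Count it if its Euclidean norm is at most one.
        if sum(x * x for x in point) <= 1.0:
            inside += 1
    # Fraction inside the ball times the volume of the cube.
    return (inside / n_samples) * 2.0 ** d


def exact_ball_volume(d):
    """Closed formula pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


if __name__ == "__main__":
    for d in (1, 2, 3, 5):
        print(d, estimate_ball_volume(d), exact_ball_volume(d))
```

As \(d\) grows, the fraction of the cube occupied by the ball shrinks quickly, so more samples are needed to keep the relative error of the estimate small.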
Main Idea
Consider a hypercube \(A^{d}\subset \mathbb{R}^d\) centered at the origin with side length \(2\). We estimate the volume of the \(d\)-ball \(B^{d}:=\{x \in \mathbb{R}^d : ||x|| \leq 1\}\subset A^{d}\) by sampling uniformly from \(A^{d}\) and computing the proportion of vectors having length less than or equal to one.
Forecasting Weekly Data with Prophet
/fb_prophet/
Fri, 21 Feb 2020 00:00:00 +0000
In this notebook we present an initial exploration of the Prophet package by Facebook. From the documentation:
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Exploring TensorFlow Probability STS Forecasting
/intro_sts_tfp/
Tue, 11 Feb 2020 00:00:00 +0000
In this notebook we explore the Structural Time Series (STS) Module of TensorFlow Probability. We follow closely the use cases presented in their Medium blog. As described there: An STS model expresses an observed time series as the sum of simpler components:
\[ f(t) = \sum_{k=1}^{N}f_{k}(t) + \varepsilon, \quad \text{where}\quad \varepsilon \sim N(0, \sigma^2). \]
Each summand \(f_{k}(t)\) has a particular structure, e.g. specific seasonality, trend, autoregressive terms, etc.
Intro ML in Production: Flask, Docker and GitHub Actions
/ml_prod_intro/
Tue, 28 Jan 2020 00:00:00 +0000
We describe how to set up a toy-model repository to train and dockerize a machine learning model with data stored on AWS S3.
Drawing Manifolds in LaTeX with TikZ
/manifold_fig_latex/
Fri, 10 Jan 2020 00:00:00 +0000
We give some LaTeX code to create figures of manifolds with boundaries.
Open Data: Germany Maps Viz
/germany_plots/
Tue, 07 Jan 2020 00:00:00 +0000
In this post I want to show how to use publicly available (open) data to create geo visualizations in Python. Maps are a great way to communicate and compare information when working with geolocation data. There are many frameworks to plot maps; here I focus on matplotlib and geopandas (and give a glimpse of mplleaflet).
Reference: A very good introduction to matplotlib is the chapter on Visualization with Matplotlib from the Python Data Science Handbook by Jake VanderPlas.
The Graph Laplacian & Semi-Supervised Clustering
/semi_supervised_clustering/
Thu, 05 Dec 2019 00:00:00 +0000
In this post we want to explore the semi-supervised algorithm presented by Eldad Haber at the BMS Summer School 2019: Mathematics of Deep Learning, held 19 - 30 August 2019 at the Zuse Institute Berlin. He developed an implementation in Matlab which you can find in this GitHub repository. In addition, please find the corresponding slides here.
Prepare Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.
The Laplacian on the 2-Torus
/laplacian_2torus/
Sun, 13 Oct 2019 00:00:00 +0000
In this blog post I want to describe the explicit computation of the Laplacian on differential forms on the \(2\)-Torus \(T^2\subset \mathbb{R}^3\). This surface can be obtained by rotating the circle \((x-a)^2+z^2=r^2\) in the \(xz\)-plane around the \(z\)-axis (\(0<r<a\)). Locally, this surface can be parametrized by the equations \[ \begin{aligned} x &= (a+r\cos u)\cos v,\\ y &= (a+r\cos u)\sin v,\\ z &= r\sin u, \end{aligned} \]
where \(0<u,v<2\pi\).
PyData Berlin 2019: Gaussian Processes for Time Series Forecasting (scikit-learn)
/gaussian_process_time_series/
Thu, 10 Oct 2019 00:00:00 +0000
In this notebook we run some experiments to demonstrate how we can use Gaussian Processes in the context of time series forecasting with scikit-learn. This material is part of a talk on Gaussian Process for Time Series Analysis presented at the PyCon DE & PyData 2019 Conference in Berlin.
Update: Additional material and plots were included for the Second Symposium on Machine Learning and Dynamical Systems at The Fields Institute (virtual event).
satRday Berlin 2019: Remedies for Severe Class Imbalance
/class_imbalance/
Sat, 15 Jun 2019 00:00:00 +0000
In this post I present a concrete case study illustrating some techniques to improve model performance in class-imbalanced classification problems. The methodologies described here are based on Chapter 16: Remedies for Severe Class Imbalance of the (great!) book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. I absolutely recommend this reference to anyone interested in predictive modeling.
This notebook should serve as an extension of my talk given at satRday Berlin 2019: A conference for R users in Berlin.
Seasonal Bump Functions
/bump_func/
Thu, 11 Apr 2019 00:00:00 +0000
Motivated by the nice talk on Winning with Simple, even Linear, Models by Vincent D. Warmerdam, I briefly describe how to construct a certain class of bump functions to encode seasonal variables in R.
Prepare Notebook
library(glue)
library(lubridate)
library(magrittr)
library(tidyverse)
Generate Data
Let us generate a time sequence variable stored in a tibble.
# Define time sequence.
t <- seq.Date(from = as.Date("2017-07-01"), to = as.Date("2019-04-01"), by = "day")
# Store it in a tibble.
An Introduction to Gaussian Process Regression
/gaussian_process_reg/
Mon, 08 Apr 2019 00:00:00 +0000
Updated Version: 2019/09/21 (Extension + Minor Corrections)
After a sequence of preliminary posts (Sampling from a Multivariate Normal Distribution and Regularized Bayesian Regression as a Gaussian Process), I want to explore a concrete example of Gaussian process regression. We continue following Gaussian Processes for Machine Learning, Ch 2.
Other recommended references are:
Gaussian Processes for Timeseries Modeling by S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson & S.
Bayesian Regression as a Gaussian Process
/reg_bayesian_regression/
Mon, 01 Apr 2019 00:00:00 +0000
In this post we study the Bayesian Regression model to explore and compare the weight-space and function-space views of Gaussian Process Regression as described in the book Gaussian Processes for Machine Learning, Ch 2. We follow this reference very closely (and encourage you to read it!). Our main objective is to illustrate the concepts and results through a concrete example. We use PyMC3 to run Bayesian sampling.
References:
Sampling from a Multivariate Normal Distribution
/multivariate_normal/
Sat, 23 Mar 2019 00:00:00 +0000
In this post I want to describe how to sample from a multivariate normal distribution following section A.2 Gaussian Identities of the book Gaussian Processes for Machine Learning. This is a first step towards exploring and understanding Gaussian Processes methods in machine learning.
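One standard recipe, sketched here with the Python standard library only, is to factor the covariance matrix as \(K = LL^{T}\) (Cholesky decomposition) and set \(x = m + Lu\), where \(u\) is a vector of independent standard normal draws. The helper names `cholesky` and `sample_mvn` and the example mean and covariance are illustrative assumptions, not code from the post.

```python
import math
import random


def cholesky(K):
    """Lower-triangular L with L L^T = K, for a symmetric positive definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def sample_mvn(mean, K, rng):
    """Draw one sample x = mean + L u, where u has i.i.d. N(0, 1) entries."""
    L = cholesky(K)
    u = [rng.gauss(0.0, 1.0) for _ in mean]
    return [
        m + sum(L[i][k] * u[k] for k in range(i + 1))
        for i, m in enumerate(mean)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    mean = [0.0, 1.0]            # hypothetical mean vector
    K = [[2.0, 0.8], [0.8, 1.0]]  # hypothetical covariance matrix
    for _ in range(5):
        print(sample_mvn(mean, K, rng))
```

The sketch recomputes the factorization on every call to stay short; when drawing many samples it is cheaper to compute \(L\) once and reuse it.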
Multivariate Normal Distribution
Recall that a random vector \(X = (X_1, \ldots, X_d)\) has a multivariate normal (or Gaussian) distribution if every linear combination
\[ \sum_{i=1}^{d} a_iX_i, \quad a_i\in\mathbb{R} \] is normally distributed.
Dockerize a ShinyApp
/dockerize-a-shinyapp/
Sat, 02 Mar 2019 00:00:00 +0000
In this post I want to describe how to dockerize a simple Shiny App. Docker is a great way of sharing and deploying projects. You can download it here.
Resources:
R Docker tutorial, recommended for Docker beginners.
Running a shiny app in a docker container by Mark Sellors (which is an updated and more complete version of this post).
Assume you have a project folder structure as follows:
The Spectral Theorem for Matrices
/the-spectral-theorem-for-matrices/
Sat, 02 Feb 2019 00:00:00 +0000
When working in data analysis it is almost impossible to avoid using linear algebra, even if it is in the background, e.g. simple linear regression. In this post I want to discuss one of the most important theorems of finite dimensional vector spaces: the spectral theorem. The objective is not to give a complete and rigorous treatment of the subject, but rather to show the main ingredients, some examples and applications.
Movie Plots Text Generation with Keras
/movie_plot_text_gen/
Sun, 13 Jan 2019 00:00:00 +0000
In this post I show some text generation experiments I ran using LSTM with Keras. For the preprocessing and tokenization I used SpaCy. The aim is not to present a completed project, but rather a first step which should then be iterated on.
Resources
There are many great resources and blog posts about the subject (and similar experiments). Here I mention the ones I found particularly useful for the general theory:
Exploring the Curse of Dimensionality - Part II.
/exploring-the-curse-of-dimensionality-part-ii./
Tue, 01 Jan 2019 00:00:00 +0000
I continue exploring the curse of dimensionality. Following the analysis from Part I., I want to discuss another consequence of sparse sampling in high dimensions: sample points are close to an edge of the sample. This post is based on The Elements of Statistical Learning, Section 2.5, which I encourage you to read!
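For reference, Section 2.5 of The Elements of Statistical Learning gives the median distance from the origin to the closest of \(N\) points distributed uniformly in the \(p\)-dimensional unit ball as \(d(p, N) = \left(1 - (1/2)^{1/N}\right)^{1/p}\). The short Python sketch below evaluates this expression; the function name `median_nearest_distance` and the chosen values of \(p\) and \(N\) are assumptions made here for illustration.

```python
def median_nearest_distance(p, n):
    """Median distance from the origin to the nearest of n uniform points
    in the p-dimensional unit ball: (1 - (1/2)^(1/n))^(1/p)."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)


if __name__ == "__main__":
    n = 500  # hypothetical sample size
    for p in (1, 2, 10, 100):
        print(p, round(median_nearest_distance(p, n), 3))
```

With \(N = 500\) and \(p = 10\) the value is roughly \(0.52\), so the nearest data point is typically more than halfway to the boundary of the ball.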
Uniform Sampling
Consider \(N\) data points uniformly distributed in a \(p\)-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin.
Text Mining, Networks and Visualization: Plebiscito Tweets
/text-mining-networks-and-visualization-plebiscito-tweets/
Thu, 20 Dec 2018 00:00:00 +0000
Nowadays social media generates a vast amount of raw data (text, images, videos, etc). It is a very interesting challenge to discover techniques to get insights into the content and development of social media data. In addition, as a fundamental component of the analysis, it is important to find ways of communicating the results, i.e. data visualization. In this post I want to present a small case study where I analyze Twitter text data.
Exploring the Curse of Dimensionality - Part I.
/exploring-the-curse-of-dimensionality-part-i./
Sun, 09 Dec 2018 00:00:00 +0000
In this post I want to present the notion of curse of dimensionality following a suggested exercise (Chapter 4 - Ex. 4) of the book An Introduction to Statistical Learning, written by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
When the number of features \(p\) is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made.
From Pelican to Blogdown
/pelican_to_blogdown/
Sun, 02 Dec 2018 00:00:00 +0000
Here I want to discuss my transition from Pelican to Blogdown and present some personal learnings. In June 2017 I decided to build a personal website/portfolio. I chose Pelican, because:
It is written in Python, which was the programming language I was mainly working with.
I wanted to include some Jupyter notebooks I had already written.
There is a great post, Building a data science portfolio: Making a data science blog, explaining the procedure and how to use GitHub Pages to publish it.
\(S^1\)-Equivariant Dirac operators on the Hopf Fibration
/hopf_fibration/
Sun, 11 Nov 2018 00:00:00 +0000
In this expository article I discuss the definition and basic properties of the Hopf fibration, with particular emphasis on Dirac-type operators induced, in the sense of Brüning and Heintze, by the Hodge-de Rham and spin-Dirac operators. In addition, we compute the Dirac-Schrödinger type operator introduced in my PhD thesis.
Introduction to R Plumber : Expose a Caret model to a web API
/intro_plumber/
Fri, 12 Oct 2018 00:00:00 +0000
In this post we present a simple example of how to expose a prediction model to a web API using the Plumber package.
Circle Radius Fit for a Cloud of Points
/circle-radius-fit-for-a-cloud-of-points/
Sun, 09 Sep 2018 00:00:00 +0000
We explore how to include an R notebook in a Pelican post. As an example, we describe how to fit a circle to a cloud of points.
From Bachelor to PhD: Geometric and Topological Methods for Quantum Field Theory
/vdl_experience/
Thu, 02 Aug 2018 00:00:00 +0000
We give an introduction to PyMC3, a probabilistic programming framework written in Python. We revise the basic mathematical theory and present two concrete examples.
PyData Berlin 2018: On Laplacian Eigenmaps for Dimensionality Reduction
/laplacian_eigenmaps_dim_red/
Sun, 08 Jul 2018 00:00:00 +0000
This post contains the slides and material from a talk I gave at PyData Berlin 2018. I presented the paper Laplacian Eigenmaps for Dimensionality Reduction and Data Representation by Mikhail Belkin and Partha Niyogi.
Probability that a given observation is part of a bootstrap sample?
/bootstrap/
Wed, 29 Nov 2017 00:00:00 +0000
We study the problem of computing the probability that a given observation is part of a bootstrap sample. We include some numerical simulations.
Induced Dirac-Schrödinger operators on semi-free circle quotients
/phd/
Sat, 11 Nov 2017 00:00:00 +0000
I present the content of my PhD Thesis in mathematics, which has now been published in The Journal of Geometric Analysis.
Introduction to Bayesian Modeling with PyMC3
/intro_pymc3/
Sun, 13 Aug 2017 00:00:00 +0000
We give an introduction to PyMC3, a probabilistic programming framework written in Python. We revise the basic mathematical theory and present two concrete examples.
Web scraping with Beautiful Soup: Plebiscito Colombia (October 2nd)
/plebiscito/
Sun, 09 Jul 2017 00:00:00 +0000
We describe how to use Beautiful Soup to scrape the official government website in order to get the results of the peace referendum in Colombia.
The Dirac operator on the 2-sphere
/the-dirac-operator-on-the-2-sphere/
Thu, 29 Jun 2017 00:00:00 +0000
The objective of this post is to explore MathJax, a JavaScript display engine for LaTeX. Being my first post written with this tool, I want to present a short but fun example: I will give a description of the explicit computation of the spin-Dirac operator (of the unique complex spinor bundle!) on the 2-sphere \(S^2\) equipped with the standard round metric. A more detailed treatment can be found in my expository paper.
Python Exercise: Distance to Rectangle
/rectangle/
Wed, 28 Jun 2017 00:00:00 +0000
In this first post we get started with a small Python script to explore the basic capabilities of Pelican.
About
/about/
Mon, 01 Jan 0001 00:00:00 +0000
Dr. Juan Camilo Orduz
Mathematician & Data Scientist
I am a mathematician (PhD from Humboldt Universität zu Berlin) and Data Scientist (Wolt). On this website you can find more information about me and some of my projects.
You can find the code associated with the blog posts on this GitHub repository.
I am part of the team running the Berlin Time Series Analysis Meetup.
Curriculum Vitae
/cv/
Mon, 01 Jan 0001 00:00:00 +0000
Work Experience
01/10/2021 (Senior) Data Scientist, Wolt, Berlin, Germany.
01/03/2020-30/09/2021 Senior Data Scientist, HelloFresh SE, Berlin, Germany.
Main contributor of the data science efforts in the meal-box, recipe and add-ons forecasting, using methods in machine learning and time series analysis for various international markets. Maintained several time series models running in production and contributed to internal utility libraries in Python. Main tech stack: Python, Docker, Concourse for the CI/CD pipelines and Airflow for job scheduling.