Apache Kafka **2.8.0** is finally out, and you can now get early access to KIP-500, which removes the Apache ZooKeeper dependency. Instead, Kafka now relies on an internal Raft quorum that can be activated through **Kafka Raft metadata mode**. …

Converting a PySpark DataFrame to pandas is quite trivial thanks to the `toPandas()` method. However, this is probably one of the most costly operations and must be used sparingly, especially when dealing with fairly large volumes of data.

Pandas DataFrames are stored in memory, which means that the operations over them are faster…

Dropping columns from DataFrames is one of the most commonly performed tasks in PySpark. In today’s short guide, we’ll explore a few different ways to delete columns from a PySpark DataFrame. Specifically, we’ll discuss how to

- delete a single column
- drop multiple columns
- reverse the operation and instead, select the…

Adding new columns to PySpark DataFrames is probably one of the most common operations you need to perform as part of your day-to-day work.

In today’s short guide, we will discuss how to do so in many different ways. …

In one of my previous articles, I discussed the difference between parametric and non-parametric methods in the context of Machine Learning.

Parametric methods make assumptions about the relationship between the data and the function to be estimated and thus they are generally inflexible. For example, we may **assume** that the…

In the big data era, Object Storage architecture is steadily gaining traction among teams that wish to store, archive, and manage high volumes of data.

In today’s article, we are going to discuss the fundamental concepts around Object Storage architecture. …

In one of my previous articles, I discussed the difference between prediction and inference in the context of Statistical Learning. Despite their main difference with respect to the end goal, in both approaches we need to estimate an unknown function *f*.

In other words, we need to learn a function…

In simple terms, Statistical Learning refers to a collection of methods and approaches that can be applied to **estimate an unknown function** *f*.

For example, let’s suppose we have to work with some real estate data so that we can potentially find a relationship between the characteristics (i.e. the predictors…

Randomness is a fundamental mathematical concept that is widely used in programming as well. Sometimes, we may need to introduce some randomness when creating toy data or when performing calculations that depend on a random event.

In today’s article…
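To make the two use cases above concrete, here is a minimal sketch that assumes Python's standard `random` module. The excerpt does not say which tools the article itself uses, and the seed value, the variable names, and the `flip_coin` helper are illustrative choices only.

```python
import random

# A dedicated, seeded generator keeps the toy data reproducible.
# The seed value (42) is arbitrary.
rng = random.Random(42)

# Toy data: ten values drawn uniformly from the interval [0, 1).
toy_data = [rng.random() for _ in range(10)]


def flip_coin(generator):
    """Return 'heads' or 'tails' with equal probability (hypothetical helper)."""
    return "heads" if generator.random() < 0.5 else "tails"


# A calculation that depends on a random event: five fair coin flips.
flips = [flip_coin(rng) for _ in range(5)]
```

Seeding a separate `random.Random` instance, rather than the module-level generator, keeps the results repeatable without affecting other code that also draws random numbers.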

Applying a particular function over pandas columns is quite a common approach when it comes to data transformation. In today’s short guide, we are going to discuss how to apply pre-defined or lambda functions over one or more columns in pandas DataFrames.

Additionally, we will discuss how to…
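As a rough illustration of the idea, the sketch below applies a pre-defined function and lambda functions over one or more columns. The DataFrame contents, the column names, and the `add_tax` helper are made up for this example and are not taken from the article.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"price": [100.0, 250.0, 80.0], "quantity": [2, 1, 5]})


def add_tax(value, rate=0.2):
    """Pre-defined function: increase a value by the given tax rate."""
    return value * (1 + rate)


# Apply a pre-defined function over a single column.
df["price_with_tax"] = df["price"].apply(add_tax)

# Apply a lambda function over a single column.
df["quantity_doubled"] = df["quantity"].apply(lambda q: q * 2)

# Apply a lambda function over several columns, one row at a time (axis=1).
df["total"] = df[["price", "quantity"]].apply(
    lambda row: row["price"] * row["quantity"], axis=1
)
```

For simple arithmetic such as `total`, a vectorized expression like `df["price"] * df["quantity"]` is usually faster than a row-wise `apply`, which is better kept for logic that cannot be expressed with column operations.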

Machine Learning Engineer | Python Developer | https://www.buymeacoffee.com/gmyrianthous