Mean, median, and mode are the three main measures of central tendency. Central tendency tells us the “normal” or “average” values of a dataset. If you’re just starting with data science, this is the right tutorial for you.
By the end of this tutorial you’ll:
- Understand the concept of mean, median, and mode
- Be able to create your own mean, median, and mode functions in Python
- Make use of Python’s statistics module to get started quickly with these measurements
Let’s get into the different ways to calculate mean, median, and mode.
Calculating the Mean in Python
The mean, or arithmetic average, is the most commonly used measure of central tendency. A dataset is a collection of data, so a dataset in Python can be any of the following built-in data structures:
- Lists, tuples, and sets: a collection of objects
- Strings: a collection of characters
- Dictionary: a collection of key-value pairs
We can calculate the mean by adding all the values of a dataset and dividing the result by the number of values. For example, if a list of six numbers has a sum of 21, its mean is 3.5, because twenty-one divided by six is 3.5. In Python, this is the sum of the list divided by its length. In this tutorial, we’ll be using the players of a basketball team as our sample data.
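As a quick check of the arithmetic above, here is a minimal sketch. The list itself is an assumption: any six numbers that add up to 21 would give the same result.

```python
# Illustrative list (an assumption): six numbers whose sum is 21
numbers = [1, 2, 3, 4, 5, 6]

total = sum(numbers)   # 21
count = len(numbers)   # 6
mean = total / count   # true division, so the result is a float

print(mean)  # 3.5
```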
Creating a Custom Mean Function
Let’s start by calculating the average (mean) age of the players in a basketball team. The team’s name will be “Pythonic Machines”. We store the ages in a list and write a small mean() function. Breaking it down:
- “pythonic_machine_ages” is a list with the ages of the basketball players
- We define a mean() function that returns the sum of the given dataset divided by its length
- The sum() function returns the total of the values of an iterable, in this case a list. If you pass it the dataset, it returns 211
- The len() function returns the length of an iterable. If you pass it the dataset, you get 8
- We pass the basketball team ages to the mean() function and print the result
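The steps above can be sketched as follows. The individual ages are illustrative values (an assumption), chosen only so that they match the totals mentioned above: 8 players whose ages sum to 211.

```python
# Illustrative ages (assumed values): 8 players, ages summing to 211
pythonic_machine_ages = [19, 22, 34, 26, 32, 30, 24, 24]


def mean(dataset):
    """Return the arithmetic mean of a non-empty collection of numbers."""
    # An empty dataset would raise ZeroDivisionError here
    return sum(dataset) / len(dataset)


print(mean(pythonic_machine_ages))  # 26.375
```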
If you check the output, you’ll get 26.375, which is 211 divided by 8. This is the average age of the basketball team players. Note that the number doesn’t appear in the dataset itself, yet it gives a reasonable summary of the players’ ages.
Using mean() from the Python Statistics Module
Calculating measures of central tendency is a common operation for most developers. That’s why Python’s statistics module provides a range of functions to calculate them, along with other basic statistics. Since it’s part of the Python standard library, you won’t need to install any external package with pip. To use it, you just need to import the mean() function from the statistics module and pass the dataset to it as an argument. This returns the same result as the custom function we defined in the previous section. Now that the concept of the mean is clear, let’s continue with the median.
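A minimal sketch of that usage, assuming the same illustrative ages as before (the values are not from the original listing):

```python
from statistics import mean

# Illustrative ages (assumed values): 8 players, ages summing to 211
pythonic_machine_ages = [19, 22, 34, 26, 32, 30, 24, 24]

print(mean(pythonic_machine_ages))  # 26.375
```

Importing the function this way shadows any custom mean() defined earlier in the same file, so it is worth keeping only one of the two.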
Finding the Median in Python
The median is the middle value of a sorted dataset. Like the mean, it is used to provide a “typical” value of a given population. In programming, we can define the median as the value that separates a sequence into two parts: the lower half and the higher half. To calculate the median, we first need to sort the dataset. We could do this with a sorting algorithm or with the built-in function sorted(). The second step is to determine whether the dataset length is odd or even. Depending on the answer, one of the following rules applies:
- Odd: The median is the middle value of the dataset
- Even: The median is the sum of the two middle values divided by two
Continuing with our basketball team dataset, let’s calculate the players’ median height in centimeters. Since the dataset length is odd, we can take the middle value of the sorted heights as the median. However, what would happen if a player retired? The length would become even, so we would need to calculate the median from the two middle values of the dataset.
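Both cases can be worked through by hand before writing any function. The heights below are illustrative values (an assumption); what matters is that the first list has an odd length and the second an even one.

```python
# Illustrative heights in centimeters (assumed values), already sorted
pythonic_machine_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]

# Odd length (9): the median is the single middle value
middle = len(pythonic_machine_heights) // 2  # index 4
odd_median = pythonic_machine_heights[middle]
print(odd_median)  # 198

# One player retires, leaving an even number of heights (8)
after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]
upper = len(after_retirement) // 2  # index 4, the upper-middle position
even_median = (after_retirement[upper - 1] + after_retirement[upper]) / 2
print(even_median)  # 200.5
```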
Creating a Custom Median Function
Let’s implement the above concept as a Python function. Remember the three steps we need to follow to get the median of a dataset:
- Sort the dataset: We can do this with the sorted() function
- Determine if its length is odd or even: We can do this by getting the length of the dataset and using the modulo operator (%)
- Return the median based on each case:
  - Odd: Return the middle value
  - Even: Return the average of the two middle values
The resulting function follows these steps directly. Applied to our two height datasets, it returns the middle height for the full team and the average of the two middle heights once a player has retired.

Note how the function starts by creating a data variable that points to the sorted dataset. Although our lists are already sorted, we want a reusable function, so it sorts the dataset each time it is invoked.

The index variable stores the position of the middle value (or of the upper-middle value when the length is even), obtained with the integer division operator (//). For instance, if we pass the “pythonic_machine_heights” list, index has the value 4.

Then we check whether the length of the dataset is odd by testing that the result of the modulo operation is not zero. If the condition is true, we return the middle element, as happens with the “pythonic_machine_heights” list. On the other hand, if the length is even, we return the sum of the two middle values divided by two. Note that data[index - 1] gives us the lower midpoint of the dataset, while data[index] gives us the upper midpoint.
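Putting the description above together, a function along these lines would do the job. This is a sketch rather than the original listing; the heights and the name “after_retirement” are illustrative assumptions.

```python
def median(dataset):
    """Return the median of a non-empty collection of numbers."""
    data = sorted(dataset)   # sort a copy so the caller's data is untouched
    index = len(data) // 2   # middle position (upper-middle if length is even)

    # Odd length: a non-zero remainder means there is a single middle value
    if len(data) % 2 != 0:
        return data[index]

    # Even length: average the lower and upper midpoints
    return (data[index - 1] + data[index]) / 2


# Illustrative heights in centimeters (assumed values)
pythonic_machine_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]
after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]

print(median(pythonic_machine_heights))  # 198
print(median(after_retirement))          # 200.5
```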
Using median() from the Python Statistics Module
This way is much simpler because we’re using an existing function from the statistics module. Personally, if something is already defined for me, I use it because of the DRY (Don’t Repeat Yourself) principle; in this case, don’t repeat others’ code. You can calculate the median of the previous datasets by importing median() from the statistics module and passing each dataset to it. The results match those of the custom function.
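For instance, with the same illustrative heights used earlier (assumed values, not the original data), the call might look like this:

```python
from statistics import median

# Illustrative heights in centimeters (assumed values)
pythonic_machine_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]
after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]

print(median(pythonic_machine_heights))  # 198
print(median(after_retirement))          # 200.5
```

statistics.median() sorts the data itself, so the input does not need to be sorted beforehand.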
Computing the Mode in Python
The mode is the most frequent value in the dataset. We can think of it as the “popular” group of a school, which may represent a standard for all the students. An example of mode could be the daily sales of a tech store. The mode of that dataset would be the best-selling product of a specific day. If “laptop” appears in the day’s sales list more often than any other product, the mode of the dataset is “laptop”. Let’s analyze the sales of another day, on which “mouse” and “headphones” were each sold twice and every other product once. That dataset has two modes, “mouse” and “headphones”, because both have a frequency of two. This means it’s a multimodal dataset. What if we can’t find a mode in a dataset because every value occurs equally often? This is called a uniform distribution; basically, it means no value stands out as the mode. Now that you have a quick grasp of the concept of mode, let’s calculate it in Python.
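Before writing a function, the three sales scenarios above can be made concrete by tallying each day’s sales. The product lists are hypothetical (assumptions for illustration); collections.Counter from the standard library does the counting.

```python
from collections import Counter

# Hypothetical daily sales of a tech store (assumed values)
day_one = ["laptop", "desktop", "smartphone", "laptop", "laptop", "headphones"]
day_two = ["mouse", "camera", "headphones", "usb", "headphones", "mouse"]
day_three = ["usb", "camera", "smartphone", "laptop", "tv"]

for sales in (day_one, day_two, day_three):
    counts = Counter(sales)
    # most_common() lists (value, frequency) pairs, highest frequency first
    print(counts.most_common())
```

In the first tally “laptop” has the highest count, in the second “mouse” and “headphones” tie at two, and in the third every product has a count of one.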
Creating a Custom Mode Function
We can think of the frequency of a value as a key-value pair, in other words, a Python dictionary. Returning to the basketball example, we can use two datasets to work with: the points per game and the sneaker sponsorship of some players. To find the mode, we first need to create a frequency dictionary with each of the values present in the dataset, then get the maximum frequency, and finally return all the elements with that frequency. When we pass the two lists to such a function, the points per game give us a single mode, while the sneaker sponsorships return multiple modes. In more detail, the function works like this:
- We declare a frequency dictionary
- We iterate over the dataset to create a histogram, the statistical term for a set of counters (or frequencies)
  - If the key is found in the dictionary, we add one to its value
  - If it’s not found, we create a key-value pair with a value of one
- The most_frequent variable stores the biggest value (not key) of the frequency dictionary
- We return the modes variable, which consists of all the keys in the frequency dictionary with the highest frequency
Note how important variable naming is for writing readable code.
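A function matching the description above might look like the sketch below. The variable names frequency, most_frequent, and modes come from the explanation; the two datasets are illustrative values (an assumption).

```python
def mode(dataset):
    """Return a list with the most frequent value(s) of a non-empty dataset."""
    frequency = {}

    # Build the histogram: add one if the key exists, otherwise start at one
    for value in dataset:
        frequency[value] = frequency.get(value, 0) + 1

    # Highest count among the values (not the keys) of the dictionary
    most_frequent = max(frequency.values())

    # Every key whose count equals the highest count
    modes = [key for key, count in frequency.items() if count == most_frequent]
    return modes


# Illustrative datasets (assumed values)
points_per_game = [3, 15, 23, 42, 30, 10, 10, 12]
sponsorship = ["nike", "adidas", "nike", "jordan",
               "jordan", "reebok", "under-armour", "adidas"]

print(mode(points_per_game))  # [10]
print(mode(sponsorship))      # ['nike', 'adidas', 'jordan']
```

Because dictionaries preserve insertion order, the modes come back in the order in which they first appear in the dataset.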
Using mode() and multimode() from the Python Statistics Module
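Once again, the statistics module provides us with a quick way to do basic statistics operations. We can use two functions: mode() and multimode(). After importing both functions and defining the datasets we’ve been working with, here comes the small difference: the mode() function returns the first mode it encounters, while multimode() returns a list with the most frequent values in the dataset. With our data, mode() returns a single value for each dataset, whereas multimode() returns a one-element list for the points per game and a list of several brands for the sneaker sponsorships. Note that multimode() is available from Python 3.8 onward; in earlier versions, mode() raised a StatisticsError when a dataset had more than one mode.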
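The difference between the two functions is easiest to see side by side. The sketch below assumes Python 3.8 or later and reuses illustrative datasets (assumed values).

```python
from statistics import mode, multimode

# Illustrative datasets (assumed values)
points_per_game = [3, 15, 23, 42, 30, 10, 10, 12]
sponsorship = ["nike", "adidas", "nike", "jordan",
               "jordan", "reebok", "under-armour", "adidas"]

# mode() returns a single value: the first mode it encounters
print(mode(points_per_game))  # 10
print(mode(sponsorship))      # nike

# multimode() returns a list with every most frequent value
print(multimode(points_per_game))  # [10]
print(multimode(sponsorship))      # ['nike', 'adidas', 'jordan']
```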
To Sum Up
Congratulations! If you’ve followed along so far, you’ve learned how to calculate the mean, median, and mode, the main measures of central tendency. Although you can define your own functions to find the mean, median, and mode, it’s recommended to use the statistics module, since it’s part of the standard library and you don’t need to install anything to start using it. Next, read a friendly introduction to data analysis in Python.