Data structures, numpy and pandas

data_structures_numpy_pandas

I think computers are a miracle. Think about it everything on your mobile phone or those graphics in a video game are basically a bunch of 1s and 0s that are created by electrical constriction and dilation. I think it'll not do computers a favor without talking a bit about this before we start.

$2^3 = 8$ $2^{10} = 1024$ $2^{20} = 1,048,576$ and so on.

Basic data structures in python

The basic data structures are very similar in most programming languages. However they might be built differently for example every base data structure in python is an object. In the following paragraphs we'll talk about them.

String

This is a very important data structure that is used to represent characters. Usually a string holds about 8 bits of information so for example the string "python" would take up about 6 bytes of space in your computer. A string in python as mentioned above is an object and has many properties available for manipulation. One of the most important thing to note is that it is an iterable. Which means that it exposes an interface that lets you iterate through it. Like below


x
my_string = "I love python"
for i in my_string:
    print(i)
def is_panlindrome(s):
    return s == s[::-1]
print(is_panlindrome("madam"))
c = "love" in my_string
print(c)

The first example loops through each string note that even the space is a string and prints each character. The second example utilizes syntactic sugar in python to reverse the string to check if is a palindrome.

The string object has some functions available like capitalize, join, isupper, lower etc.


xxxxxxxxxx
a = ["a", "b", "c"]
b = "".join(a)
c = "-".join(a)
print(b,c)

Numerical data types

Python has 3 numerical data types. The integer, floating point numbers and complex numbers. All three are objects.


xxxxxxxxxx
my_int = 5
my_float = 10.0
my_complex = 2 + 1j
print(type(my_int))
print(type(my_float))
print(type(my_complex))

16 bits of of memory storage is usually required for an integer. The reason behind this is beyond the scope of this tutorial but think about it if we persisted with 8bits as with the string we would soon have memory overflow with very large integers.

Arrays (lists in python)

Asides from the integer and strings, I feel arrays are the building block of programming languages. When I started out learning how to program c++ back in the days I used to wonder what the heck is an array but these data structures are very important. In languages like c#, arrays are usually held in memory sequences that is if you have an array of length 5, then a 5 unit long memory block will be used to hold this array. However in python this is different because lists are objects and objects are located by reference in the memory. One of the main advantages of an array is that it offers indexing and in computer parlance O(1) time complexity to retrieve an element if you know the location of this element.


xxxxxxxxxx
my_array = [1,2,4,5]
print(my_array[0])
print(my_array[2])
print(my_array[15])

Arrays are zero indexed which means they usually start from zero. The are also iterables which means that you can iterate arrays using the for statement.


xxxxxxxxxx
my_array = [1,2,3,4,5]
for i in my_array:
  print(i)
for i in range(len(my_array)):
  print(my_array[i])
  
for i,j in enumerate(my_array):
  print(i,j)
  
my_other_array = ['data science', 'machine learning', 'pcr', 'cs', 'python']
for i,j in zip(my_array, my_other_array):
  print(i,j)
  
a = [ i for i in range(5)]
print(a)

Arrays are very powerful in python numpy arrays are even more powerful because of vectorized operations. In data analysis, data science, machine learning we love arrays because we can make matrices out of them. Below I will show an example of a matrix and matrix multiplication with an array.

N_{ij} = \sum{R_{ik}*C_{kj}} \\ \begin{bmatrix} 2 & 3 \\5 & 6 \end{bmatrix} * \begin{bmatrix} 1 & 1 \\2 & 2 \end{bmatrix} \\ 1 * \begin{bmatrix} 2 \\5 \end{bmatrix} + 2 * \begin{bmatrix} 3 \\6 \end{bmatrix} = \begin{bmatrix} 8 \\17 \end{bmatrix} \\ 1 * \begin{bmatrix} 2 \\5 \end{bmatrix} + 2 * \begin{bmatrix} 3 \\6 \end{bmatrix} = \begin{bmatrix} 8 \\17 \end{bmatrix}\\ \begin{bmatrix} 8 &8 \\17 & 17 \end{bmatrix}


xxxxxxxxxx
my_matrix = [[ i for i in range(5)] for i in range(5)]
print(my_matrix)
matrix_one = [[2,3],[5,6]]
matrix_two = [[1,1],[2,2]]
matrix_three = [[ 0 for i in range(len(matrix_one))] for i in range(len(matrix_one))]
for i in range(len(matrix_one)):
  for j in range(len(matrix_two)):
    for k in range(len(matrix_one[0])):
      matrix_three[i][j] += matrix_one[i][k] * matrix_two[k][j]
print(matrix_three)


xxxxxxxxxx
import numpy as np
matA = np.array([[2,3],[5,6]])
matB = np.array([[1,1],[2,2]])
matC = np.matmul(matA, matB)
print(matC)

Tuples

Tuples are immutable data types in python. They are very similar to arrays and I always think of them when I want to manage coordinates in python for some weird reason.


xxxxxxxxxx
class my_cordinate_system:
    def __init__(self, cords):
        self.x = cords[0]
        self.y = cords[1]
    def __add__(self, other):
        return (self.x + other.x, self.y + other.y)
a = my_cordinate_system((1,2))
b = my_cordinate_system((3,4))
print(a+b)

In the above code, I have used function overload to add two tuples.


xxxxxxxxxx
a = tuple(i for i in range(5))
print(a)
a[0] = 10

Dictionaries

Dictionaries are also called hash maps. They are very important data structures because like arrays, you can retrieve elements in O(1) time complexity. Dictionaries usually have a key value pair relationship. That is you have a key and for each key you have value which can hold any other type of data structure.


xxxxxxxxxx
my_dict = {
  "machine learning": 5,
  "data science": 20,
  "artificial intelligence": 100
}
print(my_dict["machine learning"])
courses = ["machine learning", "data science", "artifical intelligence"]
scores = [5,10,100]
a = {key: value for key, value in zip(courses, scores)}
print(a)
b = [i for i in a]
print(b)
c = [i for i in a.items()]
print(c)
d = [i for i in a.values()]
print(d)

Sets

Sets are unordered immutable collection of objects. These are usually useful for filtering out repeated objects or performing set operations.


xxxxxxxxxx
a = [1,1,2,3,4]
b = set(a)
print(b)
c = set([1,8,10])
union = c.union(b)
print(union)
intersection = c.intersection(b)
print(intersection)

Other data structures

There are many other data structures for example stacks, queues, binary search trees, graphs etc these are not covered here.

Pandas

For this section, I will solve a problem and then we can go over the code step by step and try to implement it in other ways too. This is the problem:

TASK

Make a program which reads all files named "stream.csv" ( matches any string) from the "input" folder
- If programming on Windows, make sure to use portable methods for building filepaths (ex os.path.join())
Calculate their "aggregate stream" in-memory
Save the results to an output file named stream_aggregate.csv in the "output" directory
The calculation and saving of the file should happen when the program is invoked from a terminal in the source directory, eg python my_program.py

Mathematical definitions

Flow-weighted temperature is a weighted average temperature, where the impact of each stream on the "average" is proportional to the streams flow at that time.
The values in the aggregate stream is calculated independently for each timestamp Stream i has, at each timestamp t, a flow f_i(t) and a temperature T_i(t) Aggregated flow (f) is calculated as a simple sum each hour. For n streams: f(t) = (f_0(t) + f_1(t) + ... + f_n(t) ) Flow-weighted temperature T_fw(t) is calculated as: T_fw(t) = (T_0(t)*f_0(t) + T_1(t)*f_1(t) + ... + T_n(t)*f_n(t)) / (f_0(t) + f_1(t) + ... + f_n(t)) = (T_0(t)*f_0(t) + T_1(t)*f_1(t) + ... + T_n(t)*f_n(t)) / f(t)

Sample data

3 files with 24-25 hours of hourly data is provided
Each dataset has 3 columns: datetime, flow, temperature
The aggregated stream should have the exact same column names and formats as input streams
The data in the output file should have the same resolution as the input files (2 figures after the decimal point)

Data irregularities

The below irregularities in the data must be handled gracefully by the program:

The second stream has data 1 hour more than the other streams. This hour should be included in the output. For reference, the output this hour should be equal to the input of the only stream that has data that hour
The third stream has missing data (blank fields) one hour. These should be treated as NaN, which should not affect the calculation (this hour, the calculation will only be based on the remaining two streams)

Tools

We don't have strict requirements on libraries, testing frameworks or tools to be used. However, python-pandas or Gonum are recommended.

To Deliver

Code package with program and tests.

Solution


xxxxxxxxxx
import os
from os.path import join
import pandas as pd
INPUT_STRING = "stream"
INPUT_FOLDER = join(os.getcwd(), "input")
OUTPUT_FILE = join(os.getcwd(), "output/stream_aggregate.csv")
def file_search(root, filename):
    '''
        This searches for the file with a common string. In the case of this test stream
        Returns a generator object
        Parameters
        ----------
        root: this is the string of the root folder to look in.
        filename: this is the string to search for in the file.
        Notes
        --------
        This is basically return a generator object that one can loop on and load the data into it.
        Returns
        --------
        Generator object
    '''
    for file in os.listdir(root):
        full_path = join(root, file)
        if filename in file:
            yield full_path
def load_merge(files_path):
    '''
        This function performs the main task for this test.
        Returns None
        Parameters
        ----------
        files_path: a generator object.
        Notes
        --------
        This is function does the majority of the task in the work taking advantage of vectorized operations offered by pandas.
        When it completes the task, it saves it in the output file.
        References
        -----------
        https://pandas.pydata.org/pandas-docs/stable/index.html
        Returns
        --------
        None
    '''
    df = None
    for i, fp in enumerate(files_path):
        p = pd.read_csv(fp,index_col="datetime")
        p["temperature"] = p['flow'] * p['temperature']
        if i == 0:
            df = p
        else:
            df = df.add(p, axis=1,fill_value=0)
    df["temperature"] = df["temperature"] / df["flow"]
    df = df.round(2)
    df.to_csv(OUTPUT_FILE)
if __name__ == "__main__":
    files_path = file_search(INPUT_FOLDER, INPUT_STRING)
    load_merge(files_path)


xxxxxxxxxx
import unittest
from aggregate import file_search
import os
from os.path import join
import types
'''
    This class handles unit testing in the project.
    It can also be executed from the cmd line
'''
INPUT_STRING = "stream"
INPUT_FOLDER = join(os.getcwd(), "input")
class TestAggregate(unittest.TestCase):
    def test_file_search(self):
        ''' 
            Checks if the function file search returns a genarator.
        '''
        res = file_search(INPUT_FOLDER,INPUT_STRING)
        gen = types.GeneratorType
        self.assertIsInstance(res, gen)
if __name__ == "__main__":
    unittest.main()

https://docs.python.org/3/library/stdtypes.html

https://numpy.org/doc/stable/contents.html

https://pandas.pydata.org/pandas-docs/stable/index.html

Search This Blog

George Ewah Uche: My dev story

Data structures, numpy and pandas

Comments

Post a Comment

Popular posts from this blog

How we processed data of over 100gb with 16gb of ram

AWS networking basics with terraform and a tiny bit of microservices

Python meets linear algebra