Data structures, numpy and pandas


I think computers are a miracle. Think about it: everything on your mobile phone, and all those graphics in a video game, are basically a bunch of 1s and 0s produced by transistors switching on and off. It wouldn't do computers justice not to talk a bit about this before we start.

Computers are made up of a central processing unit (CPU for short) and random access memory (RAM). The CPU is where calculations are done, and RAM is where short-term memory is stored. The computer understands only 1s and 0s, which are called bits. 8 of these bits make a byte. A kilobyte comprises 1,024 such bytes, a megabyte 1,048,576 bytes, and so on.

Basic data structures in python

The basic data structures are very similar in most programming languages. However, they might be built differently; for example, every base data structure in Python is an object. In the following paragraphs we'll talk about them.

String

This is a very important data structure, used to represent text as a sequence of characters. Each character typically takes about one byte (8 bits) of information, so the string "python" would take up roughly 6 bytes for its characters (plus some object overhead in Python). A string in Python, as mentioned above, is an object and has many methods available for manipulation. One of the most important things to note is that it is an iterable, which means it exposes an interface that lets you iterate through it, like below.

The first example loops through a string (note that even the space is a character) and prints each character. The second example uses Python's slicing syntactic sugar to reverse the string and check whether it is a palindrome.
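The original code blocks are not shown here, but a minimal sketch of the two examples described might look like this:

```python
# Example 1: iterate over a string, printing each character
for ch in "hello world":
    print(ch)  # even the space is printed, since it is a character too

# Example 2: reverse a string with slice notation to check for a palindrome
word = "racecar"
is_palindrome = word == word[::-1]
print(is_palindrome)  # True for "racecar"
```

The `[::-1]` slice is the syntactic sugar referred to above: a step of -1 walks the string backwards.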


The string object has many methods available, such as capitalize, join, isupper, lower, etc.
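A few of those methods in action:

```python
s = "python"
print(s.capitalize())             # "Python"
print(s.upper())                  # "PYTHON"
print("PY".isupper())             # True
print("-".join(["a", "b", "c"]))  # "a-b-c" — join glues an iterable of strings
```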

Numerical data types

Python has three numerical data types: integers, floating-point numbers, and complex numbers. All three are objects.

In many languages a fixed amount of memory, commonly 32 or 64 bits, is required for an integer. Python integers are arbitrary-precision objects, so they grow as needed; even a small int takes around 28 bytes in CPython because of object overhead. The reason behind this is beyond the scope of this tutorial, but think about it: if we persisted with a fixed 8 bits as with a single character, we would soon have overflow with very large integers.
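A quick way to see this (the exact byte counts are CPython implementation details and vary by build):

```python
import sys

# Python ints are objects: even small ones carry object overhead,
# and they grow beyond any fixed bit width without overflowing
print(sys.getsizeof(1))        # ~28 bytes on a typical 64-bit CPython
print(sys.getsizeof(10**100))  # larger, since the integer grew

x = 2 ** 64  # would overflow a 64-bit integer in C; fine in Python
print(x + 1)
```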


Arrays (lists in python)

Aside from integers and strings, I feel arrays are the building block of programming languages. When I started out learning to program in C++ back in the day, I used to wonder what the heck an array was, but these data structures are very important. In languages like C#, arrays are held in contiguous memory: if you have an array of length 5, then a memory block five units long holds the whole array. In Python this is different, because lists are objects and hold references to their elements, which can live anywhere in memory. One of the main advantages of an array is that it offers indexing — in computer parlance, O(1) time complexity to retrieve an element if you know its position.

Arrays are zero-indexed, which means counting starts from zero. They are also iterables, which means you can iterate over arrays using the for statement.
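Both properties in a short sketch:

```python
fruits = ["apple", "banana", "cherry"]

print(fruits[0])        # "apple" — zero-indexed, O(1) lookup by position
fruits.append("date")   # lists are mutable and grow as needed

for fruit in fruits:    # lists are iterable
    print(fruit)
```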

Lists are very powerful in Python, and NumPy arrays are even more powerful because of vectorized operations. In data analysis, data science, and machine learning we love arrays because we can make matrices out of them. Below is an example of a matrix and matrix multiplication with arrays.
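The original example is not shown, but matrix multiplication with NumPy can be sketched like this:

```python
import numpy as np

# two 2x2 matrices built from nested lists
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

product = a @ b  # matrix multiplication (the @ operator, same as np.matmul)
print(product)
# [[19 22]
#  [43 50]]
```

Note that `a * b` would be element-wise multiplication; `@` is the matrix product.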

Tuples

Tuples are immutable data types in Python. They are very similar to lists, and for some weird reason I always think of them when I want to manage coordinates in Python.
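The referenced code is not shown; one way to add two coordinate tuples element-wise is to overload the + operator on a small tuple subclass (the `Point` class here is my own illustration — note that + on plain tuples concatenates rather than adds):

```python
class Point(tuple):
    """A 2D coordinate that adds element-wise via operator overloading."""
    def __add__(self, other):
        # plain tuple + tuple concatenates; here we add coordinate-wise
        return Point((self[0] + other[0], self[1] + other[1]))

p = Point((1, 2)) + Point((3, 4))
print(p)                 # (4, 6)
print((1, 2) + (3, 4))   # (1, 2, 3, 4) — plain tuples concatenate
```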

In the above code, operator overloading (the __add__ method) is used to add two tuples element-wise; note that the plain + operator on tuples concatenates them instead.


Dictionaries

Dictionaries are also called hash maps. They are very important data structures because, like arrays, you can retrieve elements in O(1) time complexity. Dictionaries hold key-value pairs: each key maps to a value, and that value can hold any other type of data structure.
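A small sketch of the key-value relationship (the data here is made up):

```python
person = {"name": "Ada", "age": 36, "languages": ["python", "c"]}

print(person["name"])                  # O(1) lookup by key
person["age"] = 37                     # update a value in place
print(person.get("email", "unknown"))  # safe lookup with a default

for key, value in person.items():      # dictionaries are iterable too
    print(key, value)
```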

Sets

Sets are unordered collections of unique, hashable objects (the set itself is mutable; frozenset is the immutable variant). These are usually useful for filtering out repeated objects or performing set operations such as union and intersection.
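Both uses in a short sketch:

```python
numbers = [1, 2, 2, 3, 3, 3]
unique = set(numbers)        # duplicates filtered out: {1, 2, 3}

evens = {2, 4, 6}
print(unique & evens)        # intersection: {2}
print(unique | evens)        # union: {1, 2, 3, 4, 6}

unique.add(10)               # sets themselves are mutable
```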

Other data structures

There are many other data structures — for example stacks, queues, binary search trees, graphs, etc. — but these are not covered here.

Pandas

For this section, I will solve a problem and then we can go over the code step by step and try to implement it in other ways too. This is the problem:

TASK

  • Make a program which reads all files named "stream*.csv" (* matches any string) from the "input" folder

    • If programming on Windows, make sure to use portable methods for building filepaths (e.g. os.path.join())
  • Calculate their "aggregate stream" in-memory

  • Save the results to an output file named stream_aggregate.csv in the "output" directory

  • The calculation and saving of the file should happen when the program is invoked from a terminal in the source directory, e.g. python my_program.py

Mathematical definitions

  • Flow-weighted temperature is a weighted average temperature, where the impact of each stream on the "average" is proportional to the stream's flow at that time.
  • The values in the aggregate stream are calculated independently for each timestamp.

    • Stream i has, at each timestamp t, a flow f_i(t) and a temperature T_i(t)
    • Aggregated flow f is calculated as a simple sum each hour. For n streams: f(t) = f_1(t) + f_2(t) + ... + f_n(t)
    • Flow-weighted temperature T_fw(t) is calculated as: T_fw(t) = (T_1(t)*f_1(t) + T_2(t)*f_2(t) + ... + T_n(t)*f_n(t)) / (f_1(t) + f_2(t) + ... + f_n(t)) = (T_1(t)*f_1(t) + ... + T_n(t)*f_n(t)) / f(t)
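A tiny worked instance of these formulas, with made-up numbers for two streams at a single timestamp:

```python
# two streams at one timestamp t (made-up numbers)
f1, T1 = 2.0, 10.0   # stream 1: flow and temperature
f2, T2 = 6.0, 20.0   # stream 2: flow and temperature

f = f1 + f2                   # aggregated flow: 8.0
T_fw = (T1*f1 + T2*f2) / f    # flow-weighted temperature: 140.0 / 8.0 = 17.5
print(f, T_fw)
```

The stream with the larger flow pulls the "average" toward its own temperature, which is the whole point of flow-weighting.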

Sample data

  • 3 files with 24-25 hours of hourly data are provided
  • Each dataset has 3 columns: datetime, flow, temperature
  • The aggregated stream should have the exact same column names and formats as input streams
  • The data in the output file should have the same resolution as the input files (2 figures after the decimal point)

Data irregularities

The below irregularities in the data must be handled gracefully by the program:

  • The second stream has one more hour of data than the other streams. This hour should be included in the output; for reference, the output for this hour should equal the input of the only stream that has data then
  • The third stream has missing data (blank fields) for one hour. These should be treated as NaN, which should not affect the calculation (for this hour, the calculation will only be based on the remaining two streams)

Tools

  • We don't have strict requirements on libraries, testing frameworks or tools to be used. However, python-pandas or Gonum are recommended.

To Deliver

  • Code package with program and tests.


Solution
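The original solution code is missing here, so below is a sketch of one possible pandas approach. The file naming and directory layout come from the task itself; the function name `aggregate_streams` and its default arguments are my own choices, and the exact NaN handling assumes a stream contributes to an hour only when both its flow and temperature are present.

```python
import glob
import os

import numpy as np
import pandas as pd


def aggregate_streams(input_dir="input", output_dir="output"):
    """Read every stream*.csv in input_dir and write the aggregate
    stream (summed flow, flow-weighted temperature) to output_dir."""
    paths = sorted(glob.glob(os.path.join(input_dir, "stream*.csv")))
    frames = [pd.read_csv(p, index_col="datetime") for p in paths]

    # Outer-join all streams on the union of timestamps; hours a stream
    # lacks become NaN and simply drop out of the sums below
    flow = pd.concat([df["flow"] for df in frames], axis=1).sort_index()
    temp = pd.concat([df["temperature"] for df in frames],
                     axis=1).reindex(flow.index)

    f = flow.to_numpy(dtype=float)
    t = temp.to_numpy(dtype=float)
    valid = ~np.isnan(f) & ~np.isnan(t)   # streams contributing each hour

    agg_flow = np.where(valid, f, 0.0).sum(axis=1)                 # f(t)
    agg_temp = np.where(valid, f * t, 0.0).sum(axis=1) / agg_flow  # T_fw(t)

    out = pd.DataFrame({"flow": agg_flow, "temperature": agg_temp},
                       index=flow.index).round(2)
    os.makedirs(output_dir, exist_ok=True)
    out.to_csv(os.path.join(output_dir, "stream_aggregate.csv"),
               float_format="%.2f")  # keep two decimals, as in the inputs
    return out


if __name__ == "__main__" and os.path.isdir("input"):
    aggregate_streams()
```

Keeping the datetime column as a plain string index (rather than parsing it) preserves the input timestamp format in the output, and `float_format="%.2f"` keeps the two-decimal resolution the task asks for. Tests would exercise the two irregularities: an extra hour in one stream, and a blank-field hour in another.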


https://docs.python.org/3/library/stdtypes.html

https://numpy.org/doc/stable/contents.html

https://pandas.pydata.org/pandas-docs/stable/index.html
