Problem
When developing custom data engineering workloads, we deal with one or many data sources. We also have a destination where the data must end up. As data engineers, our task is to figure out how to make the source data available in a structured and consistent way for downstream analytical purposes such as BI reporting. We usually do this task using the ETL/ELT pattern.
What are some helpful Python libraries to consider using when developing our pipelines? How can we best take advantage of those libraries, keeping in mind some caveats?
Solution
In this series, we provide guidance and boilerplate code on six Python modules for creating scalable and robust data engineering pipelines. The goal is to provide examples that you can adapt easily to your modular solutions. The modules included in this series are attrs, sqlalchemy, pandas, chardet, io, and logging. Each one will have a dedicated section in the article with the following structure: installation, why to use it, examples, and caveats.
In a previous article, we examined the first three packages. Now it is time to examine the next and final three: chardet, io, and logging. Read on to learn about these modules based on my experience building data engineering workloads.
chardet
Installation
pip install chardet
Why Use chardet?
Computers do not see text as we humans do. While we see characters and strings, computers work with binary representations of the data and parse the visible text from those bytes. In broad terms, the text encoding is what sits between the binary representation and the human-readable form of a piece of text. The encoding tells the computer which symbol it is dealing with and how to store it in memory or on disk. Encoding matters when you must transfer international data from one place to another: the data engineer must determine the encoding to store the data correctly in the destination. While UTF-8 is dominant for most countries and languages, languages that use non-Latin alphabets, such as Chinese, Japanese, or Bulgarian, are often stored in different encodings.
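To make this concrete, here is a minimal sketch showing that the same string produces different byte sequences depending on the encoding (the Cyrillic word is just an illustrative example; only the standard library is needed):

# The same human-readable text maps to different byte sequences
# depending on the chosen encoding. 'София' ("Sofia") is just an
# illustrative Cyrillic string.
text = 'София'

utf8_bytes = text.encode('utf-8')
cyrillic_bytes = text.encode('iso-8859-5')

print(utf8_bytes)       # two bytes per character in UTF-8
print(cyrillic_bytes)   # one byte per character in ISO-8859-5

# Decoding with the wrong encoding yields garbled text (or an error),
# which is exactly the situation chardet helps diagnose.
print(cyrillic_bytes.decode('latin-1'))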
The purpose of chardet is to detect this encoding in cases where no standardization exists.
Example
To illustrate using chardet, I have generated two CSV files:
- The first contains rows in German:

- The second contains rows written in Bulgarian, which uses Cyrillic characters:

The editor is not using the correct encoding to parse the text, so, as seen in the image, it is impossible to read the content. In this example, we happen to know which language the symbols come from. However, in a real scenario, I may be completely unaware of what sits behind these diamond question marks.
To overcome this problem, I can write a simple piece of code to experiment with the detection of those encodings:
# MSSQLTips.com (Python)
import chardet
from chardet.resultdict import ResultDict


def detect_file_encodings(files: list[str]) -> dict[str, ResultDict]:
    # Collect the detection result for every file so the caller can reuse it.
    results: dict[str, ResultDict] = {}
    for file in files:
        with open(file, 'rb') as f:
            raw_data = f.read()
            result = chardet.detect(raw_data)
            results[file] = result

            print(f"\nCSV Encoding Detection for {file}:")
            print(f"Encoding: {result['encoding']}")
            print(f"Confidence: {result['confidence']}")

    return results


detect_file_encodings(['german_data.csv', 'bg_data.csv'])
As a result, I get UTF-8 for the German file and ISO-8859-5 for the Bulgarian file. By default, chardet also calculates a confidence score indicating the level of certainty behind the result:

For both languages, the confidence level is high, so we can rely on the detected encodings in our downstream processing logic.
Caveats
For larger files, you may want to pass only a sample of the first bytes to chardet rather than reading the whole file into memory. Also, there is no need to use chardet if the text comes with explicit character encoding information.
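Here is a minimal sketch of the sampling approach (the file name and the 64 KB sample size are illustrative assumptions): you can either detect from the first bytes only, or feed chunks to chardet's incremental UniversalDetector and stop as soon as it is confident:

import chardet
from chardet.universaldetector import UniversalDetector

# Option 1: detect from a sample of the first bytes only.
# 64 KB is an arbitrary sample size chosen for illustration.
with open('big_file.csv', 'rb') as f:
    sample = f.read(64 * 1024)
print(chardet.detect(sample))

# Option 2: feed the file incrementally and stop as soon as
# the detector has reached a confident result.
detector = UniversalDetector()
with open('big_file.csv', 'rb') as f:
    for chunk in f:
        detector.feed(chunk)
        if detector.done:
            break
detector.close()
print(detector.result)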
io
io is a built-in module available out-of-the-box in any Python installation. It is Python’s core module for dealing with several types of input and output streams.
Installation
This is a built-in module. You only need to import it.
Why Use io?
The io module allows you to handle incoming data streams: to move them or store them elsewhere. There are three types of io streams: text, binary, and raw. Here we will focus on the binary streams represented by the BytesIO object.
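For orientation, here is a tiny sketch contrasting the text and binary in-memory streams (the CSV-like content is just an illustrative string):

from io import StringIO, BytesIO

# Text stream: works with str objects.
text_buffer = StringIO()
text_buffer.write('id;city\n1;München\n')
text_buffer.seek(0)
print(text_buffer.read())

# Binary stream: works with bytes objects, so the encoding is explicit.
byte_buffer = BytesIO()
byte_buffer.write('id;city\n1;München\n'.encode('utf-8'))
byte_buffer.seek(0)
print(byte_buffer.read().decode('utf-8'))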
Example
Building on the previous example, here is how we can store a CSV file in binary form in memory, then use the in-memory object to read it back into a pandas dataframe.
# MSSQLTips.com (Python)
from io import BytesIO
import pandas as pd

# In-memory binary buffer that will hold the raw CSV bytes.
container = BytesIO()

with open('german_data.csv', 'rb') as f:
    raw_data = f.read()
    container.write(raw_data)

    print(f'\nBinary representation: {container}')

    # Rewind the buffer to the beginning before handing it to pandas.
    container.seek(0)
    df = pd.read_csv(container)
    print(f'\nDataframe from binary buffer:\n{df}')
In the output, we first see the memory address where the binary object currently resides. Next, we see the human-readable content of the dataframe:

If I had to move this CSV file to an FTP server or blob storage using a microservice such as Azure Functions, then io would come in handy.
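For illustration only, here is a rough sketch of how such an in-memory buffer could be pushed to Azure Blob Storage with the azure-storage-blob package. The connection string, container name, and blob name are placeholders, and the exact call signatures should be verified against the current SDK documentation:

from io import BytesIO
from azure.storage.blob import BlobClient  # pip install azure-storage-blob

# Placeholder values - replace with your own storage account details.
blob_client = BlobClient.from_connection_string(
    conn_str='<your-storage-connection-string>',
    container_name='raw-data',
    blob_name='german_data.csv',
)

container = BytesIO()
with open('german_data.csv', 'rb') as f:
    container.write(f.read())

container.seek(0)  # rewind before uploading
blob_client.upload_blob(container, overwrite=True)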
Caveats
You should be mindful of the size of the in-memory object, since it lives entirely in RAM, and of the character encoding in case you use Text I/O.
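To tie this back to the chardet example, here is a minimal sketch (assuming the Bulgarian file and the ISO-8859-5 encoding detected earlier) of wrapping a binary buffer in a text stream with an explicit encoding:

from io import BytesIO, TextIOWrapper

with open('bg_data.csv', 'rb') as f:
    buffer = BytesIO(f.read())

# Wrap the binary buffer in a text stream, decoding with the
# encoding that chardet detected earlier.
text_stream = TextIOWrapper(buffer, encoding='ISO-8859-5')
print(text_stream.read())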
logging
Python ships with a standard module offering a flexible event logging system. Such tracking is helpful for recording and displaying the telemetry of your data pipelines.
Installation
This is a built-in module. You only need to import it.
Why Use logging?
The logging module is helpful for recording telemetry data, e.g., when a process starts, which part of it is executing, and whether it has finished. Using the various logging levels, you can record, or output to the console, informational, warning, error, or debug messages.
Example
Here is an example of how to configure a logging instance:
# MSSQLTips.com (Python)
01: import logging
02: import time
03: 
04: 
05: def create_logger() -> logging.Logger:
06:     FORMAT = '%(asctime)s - %(levelname)s - %(message)s'
07:     logger = logging.Logger('test logger')
08: 
09:     handler = logging.StreamHandler()
10:     formatter = logging.Formatter(FORMAT)
11:     handler.setFormatter(formatter)
12:     logger.addHandler(handler)
13: 
14:     return logger
15: 
16: 
17: my_logger = create_logger()
18: 
19: my_logger.info('Info message')
20: time.sleep(1)
21: my_logger.warning('Warning message')
22: time.sleep(1)
23: my_logger.error('Error message')
24: time.sleep(1)
25: my_logger.debug('Debug message')
Let us break it down:
- 5: Define a custom function that will output a logger with our configuration.
- 6: Define a format. Logging supports placeholders called LogRecord attributes; the LogRecord is the object type behind each logger output. Check the documentation for the full list of supported LogRecord attributes (a short sketch using a few more of them follows after the output below).
- 7: Instantiate a logger object providing a mandatory name.
- 9: Instantiate a handler. The handler object defines where the logging stream goes. Many handlers are supported. The most common ones are:
  - StreamHandler: Sends logging output to streams such as the console.
  - FileHandler: Sends logging output to files on disk.
  - HTTPHandler: Sends logging messages to a web server.
- 10: Set the formatter of our logging instance.
- 11: Set the formatter to the handler.
- 12: Add the handler to the logging object.
- 17: Assign our logger to a variable.
- 19 – 25: Use the logger to output different messages.
Here is how it looks:

The output is:

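For reference, here is a small sketch of a richer format string using a few more of the standard LogRecord attributes (the attribute names come from the logging documentation; the logger name is just an example):

import logging

FORMAT = '%(asctime)s | %(name)s | %(levelname)s | %(module)s:%(lineno)d | %(message)s'

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(FORMAT))

logger = logging.Logger('verbose logger')
logger.addHandler(handler)

logger.warning('This record shows the module and line number as well.')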
To develop this example a bit further, let us use the code for reading a CSV file into a memory object and wrap it with a logger:
# MSSQLTips.com (Python)
from io import BytesIO
import pandas as pd
from lgg import create_logger  # create_logger as defined in the previous example, saved in lgg.py

my_logger = create_logger()

container = BytesIO()

with open('german_data.csv', 'rb') as f:
    try:
        raw_data = f.read()
        container.write(raw_data)
        my_logger.info('Successfully stored raw data in memory.')
    except Exception as e:
        my_logger.error(e)

    container.seek(0)
    try:
        df = pd.read_csv(container)
        my_logger.info('Successfully read data from memory.')
    except Exception as e:
        my_logger.error(e)
All we have done here compared to the example in the io section is to remove the print statements and replace them with try/except blocks that log the relevant success or error message. The result is:

Caveats
Too much logging is counterproductive. Use the logging functionality at important or critical steps of your program, for debugging, and for surfacing errors.
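One simple way to keep the volume under control is to set a level on the logger (or the handler) so that lower-severity records, such as debug messages, are filtered out. A minimal sketch, assuming the create_logger function from the earlier example (saved as lgg.py, as in the article's import):

import logging
from lgg import create_logger  # create_logger from the earlier example

my_logger = create_logger()
my_logger.setLevel(logging.INFO)  # suppress DEBUG records

my_logger.debug('This will not be emitted.')
my_logger.info('This will be emitted.')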
Conclusion
In this series, we examined six important Python modules that enable building smart, robust, and scalable data engineering pipelines. In the first part, we discussed attrs, sqlalchemy, and pandas. This second part focused on chardet for character encoding detection, io for binary stream handling, and logging for tracking event telemetry and outputting vital information to the console.
Next Steps
 
  
 