Information Technology for Data Compression and Transformation by Means of Amazon EMR

Authors

Yevhen Kyrychenko, Igor Malyk

DOI:

https://doi.org/10.31861/sisiot2025.1.01004

Keywords:

information technology, Big Data, AWS, distributed processing, data compression

Abstract

As data volumes grow across many fields, so does the demand for applications that can efficiently manage, process, and transform large amounts of information. Modern approaches to storing and processing such data rely primarily on universal text formats such as CSV and JSON, whose prevalence is explained by their compatibility with a wide range of software tools and their ease of integration. However, these formats are inefficient for massive data volumes, particularly when scaling systems or executing analytical queries. Their row-oriented structure and lack of built-in compression and metadata lead to significant expenditure of time and computing resources, which creates a conflict between the requirements for fast, cost-effective processing and the technical capabilities of traditional text formats. Columnar storage formats such as Parquet and ORC offer an alternative. They employ a compact structure tailored for fast analytical queries in distributed computing environments. Efficient encoding, indexing, and built-in compression considerably reduce data size and speed up processing. This research aims to develop and experimentally verify a technology for automated conversion of data from inefficient text formats to Parquet and ORC using Apache Airflow and Amazon EMR. The proposed architecture is a cloud pipeline that converts data and stores it in formats designed for analytical workloads. The system uses Apache Airflow for process orchestration, Amazon EMR and Apache Spark for distributed processing, Amazon S3 as scalable storage, AWS Glue for metadata management, and Amazon Athena for SQL access to the transformed data. This approach addresses the performance problems by offering a flexible, reliable, cost-effective solution that adapts to different usage scenarios and workloads.
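
The size argument in the abstract can be illustrated with a small toy sketch. It is not the authors' pipeline and does not implement Parquet or ORC: it uses only the Python standard library, a synthetic dataset, and zlib as a stand-in for the codecs a real columnar format would apply. All names here (`make_rows`, `row_text_size`, `columnar_compressed_size`, the three columns) are hypothetical and chosen only for illustration. The sketch compares the size of plain CSV text with the size of the same data stored column by column, each column compressed on its own.

```python
import csv
import io
import random
import zlib


def make_rows(n, seed=0):
    """Build a synthetic table; the columns are assumptions for this sketch."""
    rng = random.Random(seed)
    countries = ["UA", "PL", "DE", "US"]
    return [
        {"id": i, "country": rng.choice(countries), "amount": rng.randint(1, 500)}
        for i in range(n)
    ]


def row_text_size(rows):
    """Size in bytes of the rows serialized as uncompressed CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def columnar_compressed_size(rows):
    """Size in bytes when each column is stored contiguously and compressed separately."""
    total = 0
    for name in rows[0]:
        # Values of one column sit next to each other, so similar data is adjacent.
        column = "\n".join(str(r[name]) for r in rows).encode("utf-8")
        total += len(zlib.compress(column))
    return total


rows = make_rows(10_000)
csv_bytes = row_text_size(rows)
col_bytes = columnar_compressed_size(rows)
print(f"CSV text: {csv_bytes} bytes, column-wise + zlib: {col_bytes} bytes")
```

The sketch shows only the direction of the effect and does not separate the contribution of the layout from that of the compression. Real columnar formats add typed encodings, statistics, and schema metadata that this toy omits, so the numbers it prints say nothing about the results reported in the article.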

Author Biographies

  • Yevhen Kyrychenko, Yuriy Fedkovych Chernivtsi National University

    Yevhen Kyrychenko is currently a Ph.D. student at the Department of Software Engineering, Yuriy Fedkovych Chernivtsi National University, Ukraine. He received his B.Sc. and M.Sc. degrees in Computer Science from Ivan Franko National University of Lviv. His research interests include cloud computing, big data technologies, and distributed systems.

  • Igor Malyk, Yuriy Fedkovych Chernivtsi National University

    Igor Malyk is a Doctor of Physical and Mathematical Sciences, Professor, and Head of the Department of Mathematical Problems of Control and Cybernetics, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, Ukraine. His research interests include stochastic analysis, financial mathematics, machine learning, and simulation of random processes.

Published

2025-06-30

Section

Articles

How to Cite

[1]
Y. Kyrychenko and I. Malyk, “Information Technology for Data Compression and Transformation by Means of Amazon EMR”, SISIOT, vol. 3, no. 1, p. 01004, Jun. 2025, doi: 10.31861/sisiot2025.1.01004.
