Information Technology for Data Compression and Transformation by Means of Amazon EMR
DOI:
https://doi.org/10.31861/sisiot2025.1.01004
Keywords:
information technology, Big Data, AWS, distributed processing, data compression
Abstract
As data processing volumes grow across many fields, so does the demand for applications that can efficiently manage, process, and transform large amounts of information. Modern approaches to storing and processing such data still rely mainly on universal text formats such as CSV and JSON, whose prevalence is explained by their compatibility with a wide range of software tools and their ease of integration. However, these formats are inefficient for massive volumes of data, particularly when scaling systems or executing analytical queries. Their row-oriented structure and their lack of built-in compression and metadata lead to significant expenditure of time and computing resources, which creates a conflict between the requirements for fast, cost-effective processing and the technical capabilities of traditional text formats. Columnar storage formats, such as Parquet and ORC, offer an alternative. They use a compact structure tailored to fast analytical queries in distributed computing environments, and their efficient encoding, indexing, and built-in compression considerably reduce data size and speed up processing. This research aims to develop and experimentally verify a technology for automated conversion of data from inefficient text formats to the Parquet and ORC formats using Apache Airflow and Amazon EMR. The proposed architecture is a cloud pipeline that converts the data and then stores it in formats designed for analytical workloads. The system uses Apache Airflow for process orchestration, Amazon EMR and Apache Spark for distributed processing, Amazon S3 as scalable storage, AWS Glue for metadata management, and Amazon Athena for SQL access to the transformed data. This approach addresses the performance problems by offering a flexible, reliable, and cost-effective solution that adapts to different usage scenarios and workloads.
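As an illustration of how the orchestration layer might hand conversion work to Amazon EMR, the following minimal sketch builds one EMR step per target format (Parquet and ORC). It is not the authors' implementation. The bucket name, the job script path, and the `--source`, `--target`, and `--format` script arguments are hypothetical. Only the overall step structure (a `HadoopJarStep` that runs `spark-submit` through `command-runner.jar`) follows the shape accepted by the EMR AddJobFlowSteps API and by the `steps` argument of Airflow's `EmrAddStepsOperator`.

```python
from typing import Dict, List

# Target formats named in the abstract.
SUPPORTED_FORMATS = ("parquet", "orc")

# Hypothetical location of the PySpark conversion script (an assumption).
DEFAULT_SCRIPT_URI = "s3://example-bucket/jobs/convert.py"


def build_conversion_step(
    source_uri: str,
    target_uri: str,
    target_format: str,
    script_uri: str = DEFAULT_SCRIPT_URI,
) -> Dict:
    """Return one EMR step definition that runs a Spark job via spark-submit."""
    if target_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported target format: {target_format!r}")
    return {
        "Name": f"convert-to-{target_format}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,
                # The flags below are hypothetical arguments of the job script.
                "--source", source_uri,
                "--target", target_uri,
                "--format", target_format,
            ],
        },
    }


def build_steps(source_uri: str, target_prefix: str) -> List[Dict]:
    """Build one conversion step per supported columnar format."""
    prefix = target_prefix.rstrip("/")
    return [
        build_conversion_step(source_uri, f"{prefix}/{fmt}/", fmt)
        for fmt in SUPPORTED_FORMATS
    ]


if __name__ == "__main__":
    for step in build_steps(
        "s3://example-bucket/raw/data.csv",
        "s3://example-bucket/curated/",
    ):
        print(step["Name"], step["HadoopJarStep"]["Args"])
```

In an Airflow DAG, a list like this could be passed to the task that adds steps to a running cluster. Writing each format under its own S3 prefix keeps the Parquet and ORC outputs separate, so that each can be registered in the AWS Glue catalog and queried from Amazon Athena as its own table.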
License
Copyright (c) 2025 Security of Infocommunication Systems and Internet of Things

This work is licensed under a Creative Commons Attribution 4.0 International License.