Candidate's Name
Phone: PHONE NUMBER AVAILABLE
Email: EMAIL AVAILABLE

Professional Summary
- Big Data Developer with nearly 10 years of well-rounded experience at top technology firms in ETL pipeline development, cloud services, and legacy-system data engineering, using tools such as Spark, Kafka, Hadoop, Hive, AWS, and Azure.
- Proficient in crafting resilient data infrastructure and reporting solutions, with extensive experience in distributed platforms and systems and a deep understanding of databases, data storage, and data movement.
- Developed Python scripts to automate the extraction, transformation, and loading (ETL) of multi-terabyte datasets into a Hadoop ecosystem, enhancing data availability and analysis speed (see the PySpark sketch after this summary).
- Collaborated with DevOps teams to identify business requirements and implement CI/CD pipelines, showcasing expertise in big data technologies such as AWS, Azure, GCP, Databricks, Kafka, Spark, Hive, Sqoop, and Hadoop.
- Engineered and optimized Apache Spark jobs in Scala for batch processing of multi-terabyte datasets, achieving a 30% reduction in processing times and resource consumption.
- Deployed Apache Airflow to orchestrate data migration from legacy systems to a cloud-based data lake, automating the entire migration process to minimize downtime and ensure data consistency.
- Designed and implemented a data cleansing and validation process using SQL and PL/SQL, enhancing data quality for a multinational marketing campaign management system.
- Implemented data governance practices in data modeling, including metadata management and data quality metrics, to improve the reliability of business intelligence reports.
- Specialized in architecting and deploying fault-tolerant data infrastructure on Google Cloud Platform (GCP), harnessing services like Google Compute Engine, Google Cloud Storage, and Google Kubernetes Engine (GKE).
- Demonstrated proficiency in designing and executing data pipelines using GCP data services such as BigQuery, Cloud Dataflow, Cloud Dataproc, and Cloud Composer, while ensuring robust data security and governance practices.
- Ingested and queried data from multiple NoSQL databases, including MongoDB, Cassandra, HBase, and AWS DynamoDB.
- Leveraged containerization technologies like Docker and Kubernetes to optimize performance and scalability on GCP.
- Extensive hands-on experience with AWS services including EC2, S3, RDS, and EMR, leveraging them effectively in various data projects.
- Expertise in implementing CI/CD pipelines on AWS, utilizing AWS Lambda for serverless computing solutions, and employing event-driven architectures.
- Proficient in AWS data services such as Athena and Glue for data analytics and ETL processes, with a proven track record of creating interactive dashboards and reports using AWS QuickSight.
- Applied knowledge in extracting data from AWS services and utilizing reporting tools like Tableau, Power BI, and AWS QuickSight to generate tailored reports.
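A minimal sketch of the kind of Python/PySpark batch ETL into a Hadoop ecosystem mentioned above; the paths, key column, and layout are hypothetical placeholders, not details taken from the resume.

```python
from pyspark.sql import SparkSession, functions as F

# Batch ETL sketch: read raw CSV from HDFS, clean it, write partitioned Parquet back to HDFS.
# All paths and column names below are hypothetical.
spark = (SparkSession.builder
         .appName("hdfs_batch_etl_sketch")
         .getOrCreate())

raw = (spark.read
       .option("header", "true")
       .csv("hdfs:///data/raw/transactions/"))        # hypothetical landing zone

cleaned = (raw
           .dropDuplicates(["transaction_id"])        # hypothetical primary key
           .filter(F.col("amount").isNotNull())
           .withColumn("amount", F.col("amount").cast("double"))
           .withColumn("ingest_date", F.current_date()))

(cleaned.write
        .mode("overwrite")
        .partitionBy("ingest_date")
        .parquet("hdfs:///data/curated/transactions/"))  # hypothetical curated zone

spark.stop()
```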
Technical Skills
PROGRAMMING LANGUAGES: Python, Scala, Java, PySpark, Spark, SQL, Shell Scripting, MapReduce, SparkSQL, HiveQL.
CLOUD PLATFORMS: AWS, Azure, GCP, Databricks.
BIG DATA TOOLS: Kafka, Spark, Hive, Apache NiFi, HBase, Cloudera, Flume, Sqoop, Hadoop, HDFS, Spark Streaming, YARN, SQL, Oracle, MongoDB, Cassandra, Oozie.
CLOUD TOOLS: AWS Glue, EMR, S3, Lambda, Step Functions, SNS, SQS, IAM, DynamoDB, Redshift, RDS, Azure Data Factory, Data Lake Gen2, Power BI, Synapse.
PROJECT METHODS: ETL, CI/CD, Unit Testing, Debugging, Agile, Scrum, Test-Driven Development, Version Control (Git).

Experience

BIG DATA ENGINEER | APPLE US | AUSTIN, TEXAS | JUNE 2021 - PRESENT
- Worked on the data pipeline team, focusing on improving pipeline efficiency through monitoring, troubleshooting, and bug fixes.
- Leveraged technologies such as Hadoop, Kafka, and Hive, ensured seamless data flow, and utilized monitoring tools like Datadog and AWS CloudWatch for trend analysis.
- Conducted bug fixes and code cleanup to enhance pipeline performance.
- Gathered and analyzed data, ensuring its integrity through unit testing.
- Utilized Airflow for job orchestration, with Python and PySpark as the primary coding languages (see the Airflow sketch after this section).
- Employed SQL, Hive, Presto, Oracle, and Flume for data querying and processing.
- Monitored and troubleshot jobs using Airflow, AWS CloudWatch, Oozie, Spark UI, Cloudera Manager, and Kubernetes.
- Created comprehensive data pipelines using PySpark to process streaming data, integrating with Apache Kafka and storing results in HBase for real-time analytics.
- Architected and coded Python applications for real-time data processing using Apache Kafka and Apache Storm, resulting in a 40% increase in processing efficiency.
- Facilitated data movement across environments, ensuring smooth ingestion from various sources.
- Engineered CI/CD pipelines using Jenkins integrated with Ansible to automate deployment and orchestration of big data workflows.
- Developed Python scripts for code cleanup, job monitoring, and data migration.
- Collaborated with cross-functional teams to address business needs and enhance data systems.
- Implemented Apache Airflow to automate and manage machine learning workflows, including data preprocessing, model training, hyperparameter tuning, and model evaluation, significantly speeding up iteration cycles.
- Designed and implemented real-time data analytics solutions using Scala and Spark Structured Streaming, processing live streams from social media platforms to generate instant insights.
- Analyzed data patterns using Grafana, CloudWatch metrics, and dashboards.
- Automated workflows with shell scripts to streamline data extraction into the Hadoop framework.
- Designed and implemented a Terraform solution to automate the provisioning of cloud infrastructure for big data applications, reducing setup time by over 40%.
- Overcame time-zone challenges while collaborating with offshore team members to ensure timely task completion and issue resolution.
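A minimal sketch of the kind of Airflow job orchestration described in this section, assuming Airflow 2.x with the Bash operator; the DAG name, schedule, and script paths are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal daily pipeline: extract -> transform (Spark) -> load. All names and commands are hypothetical.
with DAG(
    dag_id="daily_data_pipeline_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python /opt/jobs/extract.py")
    transform = BashOperator(task_id="transform",
                             bash_command="spark-submit /opt/jobs/transform.py")
    load = BashOperator(task_id="load", bash_command="python /opt/jobs/load.py")

    extract >> transform >> load
```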
CLOUD DATA ENGINEER | DRISCOLL'S | WATSONVILLE, CA | OCTOBER 2020 - JUNE 2021
- Loaded and transformed large sets of structured and semi-structured data using AWS Glue.
- Implemented AWS DataSync tasks to automate data transfer between on-premises storage and AWS S3, streamlining backup and disaster recovery operations.
- Designed and implemented a scalable data lake on AWS S3, optimizing data storage and retrieval for a multi-petabyte dataset across a distributed analytics platform.
- Developed producer/consumer scripts for Kafka to process JSON responses in Python (see the Kafka sketch after this section).
- Configured IAM instance profiles and roles to connect tools within AWS.
- Evaluated and proposed new tools and technologies to meet the needs of the organization.
- Architected and deployed a Snowflake environment to support a data lake solution, integrating data from multiple sources at a scale of petabytes per day.
- Designed and implemented modular, reusable DBT models to streamline data processing and improve maintainability.
- Leveraged Terraform to enforce cloud security best practices across big data deployments, aligning infrastructure with compliance requirements.
- Excellent working knowledge of AWS tools such as Glue, EMR, S3, Lambda, Redshift, and Athena.
- Engineered real-time data ingestion pipelines using AWS Kinesis Streams, facilitating high-throughput data flow for immediate processing and analytics in a financial trading application.
- Integrated DBT with Snowflake to automate data pipeline execution and enhance data warehouse performance.
- Utilized Apache Airflow to manage periodic updates and data refreshes in a geospatial analysis platform, ensuring that the latest satellite imagery and sensor data were available for environmental impact studies.
- Created and maintained a data warehouse in AWS Redshift.
- Experienced in object-oriented programming and in integrating and testing software implementations: collecting business specifications and user requirements, confirming designs, developing, and documenting the entire software development life cycle and QA.
- Configured and managed AWS EKS clusters to orchestrate containerized big data applications, ensuring robust scalability and high availability for microservices-driven architectures.
- Architected a high-performance data warehousing solution using AWS Redshift, integrating data from various sources via AWS Glue and optimizing query performance for BI tools.
- Designed a NoSQL solution using AWS DynamoDB to support web-scale applications, implementing best practices for data modeling to ensure low-latency access and high throughput.
- Utilized Scala to develop a custom library for geospatial analysis on top of Spark, enabling complex spatial queries and visualizations for urban planning data.
- Able to learn and adapt quickly to emerging cloud technologies, tools, and programming languages.
- Created Spark jobs on Databricks and AWS infrastructure using Python.
- Used Apache Airflow to automate the processing and delivery of data products to external clients, including automated quality checks and secure file transfer protocols to ensure data integrity and confidentiality.
- Monitored and tuned AWS EMR cluster performance using CloudWatch and custom metrics, reducing job execution times and cost per job by optimizing resource allocation.
- Developed and deployed big data processing jobs on AWS EMR, utilizing Apache Spark and Hadoop to perform complex transformations and aggregations on large datasets.
- Developed Kafka producer and consumer programs and ingested data into AWS S3 buckets.
- Leveraged AWS Lambda and S3 triggers to automate image processing workflows, enabling real-time image analysis and metadata tagging for a media library.
- Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KSQL in Confluent Kafka.
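A minimal sketch of the kind of Kafka producer/consumer scripts for JSON with an S3 landing step described in this section, assuming the kafka-python client and boto3; the brokers, topic, bucket, and payload are hypothetical.

```python
import json

import boto3
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]          # hypothetical brokers
TOPIC = "api-responses"               # hypothetical topic
BUCKET = "example-raw-data-bucket"    # hypothetical S3 bucket

# Producer: publish a JSON payload to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 123, "status": "shipped"})
producer.flush()

# Consumer: read JSON messages and land each one in S3.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
s3 = boto3.client("s3")
for message in consumer:
    key = f"kafka/{TOPIC}/{message.partition}-{message.offset}.json"
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=json.dumps(message.value).encode("utf-8"))
```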
BIG DATA DEVELOPER | HOME DEPOT | ATLANTA, GA | JUNE 2019 - OCTOBER 2020
- Attended sprint planning, daily scrums, and retrospective meetings.
- Evaluated requirements and helped create user stories.
- Developed the code and modules for different ETL pipelines.
- Implemented automated testing frameworks in Jenkins pipelines to ensure code quality and catch issues early in the development cycle.
- Designed and implemented an Azure Data Lake storage solution, enabling scalable data storage and analytics across a distributed architecture for a healthcare data management platform.
- Processed large sets of structured and unstructured data using Spark with Scala as the programming language.
- Imported data from DB2 to HDFS using Apache NiFi in the Azure cloud.
- Worked as part of the big data engineering team on pipeline creation activities in the Azure environment.
- Integrated Jenkins with version control systems like Git to enable continuous integration and seamless code updates.
- Led the design and implementation of a dimensional data modeling project for a retail analytics platform, which included the development of star schemas to facilitate faster OLAP queries.
- Created a data lineage and metadata management workflow using Apache Airflow, which automated the tracking and documentation of data sources, transformations, and uses, ensuring compliance with data governance standards.
- Implemented data transformation pipelines within Snowflake using Snowflake SQL and Snowpipe to ingest real-time data streams for immediate analytics.
- Leveraged Azure Databricks for real-time data processing and machine learning, developing predictive models that improved demand forecasting for a retail chain.
- Created custom macros in DBT to simplify complex SQL transformations and improve code readability.
- Designed and implemented a microservices architecture using MongoDB as the backend database, which supported asynchronous data processing and improved scalability for a financial services application.
- Created a real-time analytics dashboard by streaming data from Azure Event Hubs into Azure Databricks and visualizing it in Power BI, providing actionable insights for operational teams.
- Collaborated with data analysts to define and document business logic and data transformations using DBT.
- Implemented a log analytics solution using Azure Databricks to process and analyze multi-terabyte log files, identifying critical performance bottlenecks and security threats.
- Developed a secure and compliant data ingestion pipeline into Azure Data Lake Storage (ADLS) using Azure Data Factory, ensuring data integrity and privacy for financial services data.
- Configured and managed Azure HDInsight clusters running Apache Hadoop and Spark, optimizing resource allocation and performance for cost-effective big data processing.
- Architected and deployed a high-availability cluster using PostgreSQL, achieving 99.99% uptime and enhancing data reliability for a critical healthcare data management system.
- Created Hive tables, loaded data, and wrote Hive queries for data analysis using external tables.
- Performed incremental data loads into the Apache Hive data warehouse as part of the daily data ingestion workflow (see the PySpark/Hive sketch after this section).
- Developed Airflow DAGs to automate the scaling of cloud infrastructure based on workload metrics, allowing dynamic resource management and cost optimization.
- Integrated Azure Event Hubs with Azure Stream Analytics to capture and analyze millions of live events for a smart city traffic management system.
- Involved in performance tuning of Hive queries to improve data processing and retrieval.
- Developed and maintained a data warehouse using the Kimball methodology, focusing on the integration of disparate data sources into a cohesive analytical environment.
- Utilized Azure Data Lake Analytics to run U-SQL jobs on unstructured data, extracting meaningful insights for a digital marketing campaign analysis.
- Engineered a scalable IoT data platform on Azure using HDInsight and Azure Blob Storage, processing sensor data from thousands of devices to optimize manufacturing processes.
- Experienced in using different file formats such as Parquet, Avro, ORC, CSV, and JSON.
- Performed data enrichment, cleansing, and aggregation through RDD transformations using Apache Spark in Scala.
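A minimal PySpark sketch of the kind of daily incremental Hive load referenced in this section, assuming Spark with Hive support enabled; the landing path, key column, and table name are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hive_incremental_load_sketch")
         .enableHiveSupport()
         .getOrCreate())

load_dt = "2020-06-01"  # in practice this value would come from the scheduler

# Read only the current day's landed files (hypothetical path and key column).
daily = spark.read.parquet(f"/data/landing/orders/dt={load_dt}")

incremental = (daily
               .dropDuplicates(["order_id"])
               .withColumn("load_dt", F.lit(load_dt)))

# Append the new day's data into a partitioned Hive table.
(incremental.write
            .mode("append")
            .partitionBy("load_dt")
            .saveAsTable("analytics.orders"))   # hypothetical database.table

spark.stop()
```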
BIG DATA ENGINEER | ASICS | IRVINE, CA | MARCH 2018 - JUNE 2019
- Developed shell scripts to automate workflows for pulling data from various databases into Google Cloud Storage, using Google Cloud Functions for orchestrating data access via BigQuery views.
- Crafted SQL queries for data analysis in BigQuery, utilizing its standard SQL capabilities to manage and analyze data stored in BigQuery datasets.
- Ingested data into relational databases (Cloud SQL for MySQL and PostgreSQL) from Google Cloud Storage using Dataflow, which efficiently handled data transformation and loading tasks.
- Orchestrated a secure data exchange platform using Apache Airflow, which managed the end-to-end encryption and transfer of sensitive datasets across different legal jurisdictions.
- Conducted extensive data profiling and modeling for a data warehouse project, ensuring compliance with HIPAA regulations while supporting complex queries.
- Optimized Tableau dashboards for performance, reducing load times and improving user experience.
- Developed complex data transformation pipelines in Scala, integrating with Apache Kafka for efficient data ingestion and stream processing in a financial services environment.
- Developed Python modules to encrypt and decrypt data using advanced cryptographic standards, ensuring secure data storage and transmission in distributed systems.
- Configured Cloud SQL instances and integrated them with Google Cloud Data Fusion for seamless data ingestion and transformation within the GCP ecosystem.
- Utilized Dataflow to extract data from Cloud SQL and load it into BigQuery, leveraging Apache Beam pipelines written in Python for data processing tasks (see the Beam sketch after this section).
- Created custom visualizations in Tableau using the JavaScript API to meet unique business requirements.
- Implemented partitioning and indexing strategies on a multi-terabyte PostgreSQL database, significantly improving performance and manageability for large-scale analytics applications.
- Implemented a MongoDB-based document store for a content management system, enabling dynamic content updates and personalized user experiences through schema-less data structures.
- Leveraged Apache Airflow for real-time error detection and alerting.
- Migrated data from Google Cloud Storage to BigQuery using the Parquet file format, taking advantage of BigQuery's external table functionality for efficient query performance.
- Created a monitoring solution using Docker, Grafana, and Prometheus to monitor big data processing jobs and optimize resource allocation.
- Created high-performance algorithms in Scala to support machine learning workflows on large datasets, significantly improving model training efficiency and accuracy.
- Processed and synchronized data between BigQuery and Cloud SQL (MySQL) using federated queries and Dataflow to maintain consistency and optimize access.
- Implemented data processing pipelines in Apache Beam, executed on Dataflow, to preprocess datasets before loading them into Cloud SQL databases.
- Automated and scheduled Dataflow pipelines using Cloud Scheduler and Pub/Sub to trigger batch processing jobs based on time- or event-driven criteria.
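A minimal Apache Beam (Python) sketch in the spirit of the Dataflow pipelines described in this section, reading CSV from Cloud Storage and appending to BigQuery; the project, bucket, table, and column layout are hypothetical, and the pipeline runs on the DirectRunner unless Dataflow options are supplied.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Hypothetical 3-column CSV layout: order_id,sku,amount
    order_id, sku, amount = line.split(",")
    return {"order_id": order_id, "sku": sku, "amount": float(amount)}


# For Dataflow, options would also include runner, project, region, and temp_location.
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "ReadCsv" >> beam.io.ReadFromText("gs://example-bucket/orders/*.csv",
                                         skip_header_lines=1)
     | "Parse" >> beam.Map(parse_line)
     | "WriteToBq" >> beam.io.WriteToBigQuery(
           "example-project:analytics.orders",
           schema="order_id:STRING,sku:STRING,amount:FLOAT",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```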
DATA ENGINEER | CVS | WOONSOCKET, RI | JUNE 2016 - MARCH 2018
- Involved in requirements analysis, design, development, implementation, and documentation of ETL pipelines.
- Proposed big data solutions for the business requirements.
- Worked with different teams across the organization to gather system requirements.
- Performed big data extraction, processing, and storage using Hadoop, Sqoop, SQL, and HiveQL.
- Created a multi-tenant data processing architecture using Apache Airflow, enabling clients in a shared analytics environment to execute independent ETL tasks securely and efficiently.
- Engineered time-series data storage in Cassandra, utilizing its fast write capabilities to support high-throughput logging and monitoring of industrial automation systems.
- Engineered a Scala application to perform text mining and sentiment analysis on large volumes of customer feedback data, helping to enhance customer service strategies.
- Wrote Python scripts to automate cloud infrastructure deployments, including data storage and compute resources, using the AWS Boto3 SDK, reducing manual setup time by 70%.
- Developed and maintained robust ETL pipelines using SQL and Python to automate data transfer between MySQL and Hadoop ecosystems, enhancing data availability for analytics.
- Developed a temporal data model to capture the history and changes over time in a legal document management system, facilitating advanced version control and audit capabilities.
- Leveraged Terraform to enforce cloud security best practices across big data deployments, aligning infrastructure with compliance requirements.
- Developed MapReduce jobs in Java for data transformation and processing.
- Scheduled the pipelines using Oozie to orchestrate MapReduce workflows in Unix environments.
- Implemented a custom monitoring and reporting system using Apache Airflow that aggregates performance metrics from various databases and microservices.
- Employed Python's Matplotlib and Seaborn libraries to create visualization dashboards that track data trends and performance metrics, shared across the organization via a Flask-based web application.
- Imported/exported data between MySQL and HDFS using Sqoop.

DATA ENGINEER | DICK'S SPORTING GOODS | CORAOPOLIS, PENNSYLVANIA | SEPTEMBER 2014 - JUNE 2016
- Commissioned and decommissioned data nodes and performed NameNode maintenance.
- Designed and built a scalable web scraping system using Python, BeautifulSoup, and Scrapy to collect and parse structured data from over 1,000 websites daily.
- Backed up and cleared logs from HDFS space to enable optimal utilization of nodes.
- Wrote shell scripts for time-bound command execution.
- Architected a fault-tolerant data lake using Scala and Apache Hadoop, facilitating scalable storage and efficient querying of structured and unstructured data.
- Edited and configured HDFS and tracker parameters.
- Designed and implemented a scalable data warehousing solution using Microsoft SQL Server, integrating data from multiple ERP systems and resulting in a unified view for business intelligence reporting.
- Led the migration of an Oracle database to Microsoft SQL Server, including schema conversion and data integrity validation, which streamlined operational processes and reduced costs.
- Automated configuration and maintenance of a 100-node Hadoop cluster using Ansible playbooks, enhancing system reliability and operational efficiency.
- Scripted requirements using Big SQL and provided time statistics of running jobs.
- Assisted with code review tasks for simple to complex MapReduce jobs using Hive and Pig.
- Leveraged Scala to build and maintain scalable and robust ETL frameworks that support the ingestion and normalization of data from diverse sources into a centralized Hive warehouse.
- Leveraged Python's Pandas library to perform data cleansing, preparation, and exploratory data analysis, significantly reducing data preprocessing time for downstream analytics.
- Performed cluster monitoring using the IBM InfoSphere BigInsights tool.
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.

Education
KENNESAW STATE UNIVERSITY
INFORMATION TECHNOLOGY