Data Engineer Resume Trenton, NJ
Candidate's Name
EMAIL AVAILABLE | PHONE NUMBER AVAILABLE

Professional Summary:
Overall, 8 years of professional work experience as a Data Engineer, working with Python, Spark, AWS, SQL, and MicroStrategy in the design, development, testing, and implementation of business application systems for the healthcare and education sectors.
Experience in system analysis, design, development, testing, and implementation of projects (SDLC); capable of handling responsibilities independently as well as working as a proactive team member.
Experience in designing and implementing data engineering pipelines and analyzing data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena, Redshift, Sqoop, and Hive.
Experience in Windows 2000/2003/2008 Server and XP/Professional and Vista client administration.
Experience in creating Python functions to transform data from Azure Storage to Azure SQL on the Azure Databricks platform.
Experience in programming using Python, Scala, Java, and SQL.
Experience in implementing Azure data solutions: provisioning storage accounts, Azure Data Factory, Azure Databricks, Azure Synapse Analytics, and Azure Blob Storage.
Experience with the architecture of distributed systems and parallel processing frameworks.
Experience working with various data analytics services in the AWS Cloud, such as EMR, Redshift, S3, Athena, and Glue.
Experience in developing production-ready Spark applications using Spark RDD APIs, DataFrames, Spark SQL, and the Spark Streaming APIs (a short sketch follows this summary).
Experience working with GCP services such as BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc, and Operations Suite (Stackdriver).
Experience in fine-tuning Spark applications to improve performance and in troubleshooting Spark application failures.
Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
Experience in using Spark Streaming, Spark SQL, and other Spark components such as accumulators, broadcast variables, different levels of caching, and optimization techniques for Spark jobs.
Experience with setting up, configuring, and maintaining the ELK stack (Elasticsearch, Logstash, and Kibana) and OpenGrok source code search (SCM).
Proficient in importing/exporting data between RDBMS and HDFS using Sqoop.
Experienced in importing real-time streaming logs and aggregating the data into HDFS using Kafka and Flume.
Experience in setting up Hadoop clusters on cloud platforms like AWS and GCP.
Experience in using Hive extensively to perform the various data analytics required by business teams.
Experience in working with various data formats such as Parquet, ORC, Avro, and JSON.
Experience in installing Prometheus and Grafana using Helm to monitor application performance in a Kubernetes cluster.
Proficient in working with SQL/NoSQL databases such as MySQL, PostgreSQL, MongoDB, and Cassandra.
Experience with ETL tools such as Talend Data Integration and SQL Server Integration Services (SSIS); developed slowly changing dimension (SCD) mappings using Type-I, Type-II, and Type-III methods.
Experience in automating end-to-end data pipelines with strong resilience and recoverability.
Proficient in Oracle PL/SQL and shell scripting.
Experience in database/ETL performance tuning: broad experience in database development, including effective use of database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes, table partitions, sub-partitions, materialized views, global temporary tables, autonomous transactions, bulk binds, and Oracle built-in functions.
Experience with Spark using Scala on a cluster for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
Experience in designing, implementing, and maintaining infrastructure using Terraform to enable version-controlled, reproducible infrastructure deployments.
Experience with container-based deployments using Docker: creating Dockerfiles and Docker images, using Docker Hub and Docker registries, and installing and configuring Kubernetes.
Experience with SnowSQL and Snowpipe.
Process-oriented data analyst with excellent analytical, quantitative, and problem-solving skills using SQL, MicroStrategy, advanced Excel, and Python.
Proficient in writing unit tests using unittest/pytest and integrating the test code with the build process.
Experience in using Python scripts to parse XML and JSON reports and load the information into a database.
Experience with version control systems such as Git, GitHub, and Bitbucket to keep code versions and configurations organized.
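The summary above references production-ready Spark applications built with DataFrames and Spark SQL over columnar formats such as Parquet and ORC; the minimal PySpark sketch below illustrates that pattern under stated assumptions. The paths, view name, and column names are hypothetical and are not taken from the resume.

# Minimal PySpark sketch: DataFrame + Spark SQL over Parquet input, ORC output.
# Paths, the view name, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("claims-daily-aggregate")   # hypothetical job name
    .getOrCreate()
)

# Read raw Parquet data and expose it to Spark SQL as a temporary view.
claims = spark.read.parquet("s3://example-bucket/raw/claims/")  # hypothetical path
claims.createOrReplaceTempView("claims")

# Aggregate with Spark SQL; the schema (member_id, service_date, paid_amount) is assumed.
daily_totals = spark.sql("""
    SELECT member_id,
           service_date,
           SUM(paid_amount) AS total_paid
    FROM claims
    GROUP BY member_id, service_date
""")

# Write the result as ORC, partitioned by service date, for downstream consumers.
(daily_totals.write
    .mode("overwrite")
    .partitionBy("service_date")
    .orc("s3://example-bucket/curated/claims_daily/"))  # hypothetical path

spark.stop()

The same job could equally be expressed with the DataFrame API (groupBy/agg); Spark SQL is shown because the summary calls it out explicitly.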
TECHNICAL SKILLS:
Big Data Eco-System: HDFS, GPFS, Hive, Sqoop, Spark, YARN, Pig, Kafka
Hadoop Distributions: Hortonworks, Cloudera, IBM BigInsights
Operating Systems: Windows (Server, Advanced Server), Linux (CentOS, Ubuntu, Red Hat), VMware ESX
Programming Languages: Python, Scala, shell scripting
Databases: Hive, MySQL, Netezza, PostgreSQL
IDE Tools & Utilities: IntelliJ IDEA, Eclipse, PyCharm, Aginity Workbench, Git
Markup Languages: HTML
ETL: DataStage (Designer/Monitor/Director)
Job Schedulers: Control-M, IBM Symphony Platform, Ambari, Apache Airflow
Reporting Tools: Talend, Tableau, Lumira
Cloud Computing: AWS, GCP
Methodologies & Tracking: Agile (Scrum), Asana, Jira
Others: MS Office, RTC, ServiceNow, Optim, IGC (InfoSphere Governance Catalog), WinSCP, MS Visio

PROFESSIONAL EXPERIENCE:

CVS Health, Chicago, IL  Oct 2023 - Present
Senior Data Engineer
Responsibilities:
Design and implement scalable, high-performance data architectures on AWS.
Develop data models, schemas, and structures for efficient storage and retrieval.
Designed and established an Enterprise Data Lake to support diverse use cases, including analytics, processing, storing, and reporting of large and rapidly changing datasets.
Responsible for maintaining the quality of reference data in the source by executing operations such as cleaning and transformation, and for ensuring integrity in a relational environment through close collaboration with stakeholders and solution architects.
Developed a security framework to grant fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
Configured and managed Kerberos authentication principals to establish secure network communication on the cluster.
Conducted testing of HDFS, Hive, Pig, and MapReduce for new user access.
Conducted comprehensive architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.
Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
Designed, developed, tested, implemented, and supported data warehousing ETL using Talend and Hadoop technologies.
Implemented machine learning algorithms in Python to predict user order quantities for specific items, utilizing Kinesis Firehose and the S3 data lake.
Utilized AWS EMR to transform and transfer large volumes of data to and from other AWS data stores and databases, including Amazon S3 and DynamoDB.
Used AWS Glue for data transformation, validation, and cleansing.
Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
Utilized Spark SQL through its Scala and Python interfaces, automatically converting RDDs of case classes into schema RDDs.
Imported data from sources such as HDFS/HBase into Spark RDDs, leveraging PySpark for computations and generating output responses.
Configured ingestion of real-time data from Apache Kafka and stored the streamed data to HDFS using Kafka Connect.
Involved in designing different components of the system, such as the big-data event processing framework (Spark), the distributed messaging system (Kafka), and the SQL database (PostgreSQL).
Created reusable Terraform modules to encapsulate infrastructure components, promoting code reusability and maintaining a modular codebase.
Created custom Grafana dashboards to monitor key performance indicators (KPIs) of data processing pipelines.
Created Lambda functions with Boto3 to deregister unused AMIs across all application regions, reducing costs for EC2 resources (a hedged sketch follows this section).
Coded BTEQ scripts for Teradata to perform tasks such as loading data, implementing transformations, addressing issues like SCD Type 2 date chaining, and cleaning up duplicate records.
Installed Airflow and created a PostgreSQL database to store Airflow metadata.
Updated the configuration files that allow Airflow to communicate with its PostgreSQL database.
Provided consultation on the architecture, design, development, and deployment of solutions on the Snowflake Data Platform, with a primary focus on fostering a data-driven culture across the enterprise.
Checked data and table structures in the PostgreSQL and Redshift databases and ran queries to generate reports.
Created a reusable framework, designed for future migrations, that streamlines the ETL process from relational database management systems (RDBMS) to the data lake; the framework utilizes Spark data sources and Hive data objects.
Executed data blending and preparation using Alteryx and PostgreSQL for Tableau consumption, and published data sources to the Tableau server.
Developed Kibana dashboards based on Logstash data, integrating various source and target systems into Elasticsearch for near real-time log analysis of end-to-end transactions.
Environment: Glue, Lambda, S3, EMR, DynamoDB, Kibana, Python, Talend, Terraform, Grafana, Apache Spark, Boto3, PostgreSQL, HDFS, Pig, Hive, Teradata, RDD, Redshift.
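The bullet above about deregistering unused AMIs with Boto3 can be illustrated with the hedged sketch below: a Lambda handler that scans AMIs owned by the account in one region, keeps any image still referenced by an instance, and deregisters the rest. The event keys, the dry-run default, and the single-region scope are assumptions, not details from the resume.

# Hedged sketch of an AMI-cleanup Lambda using Boto3.
# The event keys ("region", "dry_run") and the safe dry-run default are assumptions.
import boto3

def lambda_handler(event, context):
    region = event.get("region", "us-east-1")   # hypothetical default region
    dry_run = event.get("dry_run", True)        # only delete when explicitly disabled
    ec2 = boto3.client("ec2", region_name=region)

    # Collect AMI IDs that are still referenced by existing instances.
    in_use = set()
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                in_use.add(instance["ImageId"])

    # Deregister AMIs owned by this account that no instance references.
    removed = []
    for image in ec2.describe_images(Owners=["self"])["Images"]:
        image_id = image["ImageId"]
        if image_id in in_use:
            continue
        if not dry_run:
            ec2.deregister_image(ImageId=image_id)
        removed.append(image_id)

    return {"region": region, "dry_run": dry_run, "deregistered": removed}

In practice a per-region loop (or one Lambda per region) and an age or tag filter would likely sit on top of this; the sketch keeps only the core describe/deregister flow.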
Santander Bank, New York, NY  Mar 2023 - Sep 2023
Senior Data Engineer
Responsibilities:
Developed PySpark pipelines that transform raw data from several formats into Parquet files for consumption by downstream systems.
Created views on top of data in Hive, consumed by the application through Spark SQL.
Applied security to data using Apache Ranger, setting row-level filters and group-level policies.
Normalized data according to business needs, including data cleansing, datatype changes, and various transformations using Spark, Scala, and AWS EMR.
Responsible for maintaining, managing, and upgrading Hadoop cluster connectivity and security.
Used AWS Glue services such as crawlers and ETL jobs to catalog all Parquet files and apply transformations to the data according to business needs.
Created libraries and SDKs that make JDBC connections to the Hive database and query the data using the Play framework and various AWS services.
Developed Spark scripts used to load data from Hive into Amazon RDS (Aurora) at a faster rate (a hedged sketch follows this section).
Created CI/CD pipelines using tools such as Jenkins and Rundeck, responsible for scheduling the daily jobs.
Used Grafana data source plugins to visualize and correlate data from Azure logs.
Deployed and maintained applications on cloud infrastructure using Terraform.
Created Joblets in Talend for processes reused across most jobs in a project, such as Start Job and Commit Job.
Implemented change data capture (CDC) in Talend to load deltas into a data warehouse.
Worked with AWS services such as S3, Glue, EMR, SNS, SQS, Lambda, EC2, RDS, and Athena to process data for downstream customers.
Built efficient application metrics monitors using modern data analytics and visualization tools, including Elasticsearch and Kibana.
Developed Sqoop jobs responsible for importing data from Oracle to AWS S3.
Developed a utility that transforms and exports data from AWS S3 to AWS Glue and sends alerts and notifications to downstream systems (AI and data analytics) once the data is ready for use.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
Developed pipelines for auditing the metrics of all applications using AWS Lambda and Kinesis Firehose.
Developed an end-to-end pipeline that exports data from Parquet files in S3 to Amazon RDS.
Worked on optimizing the performance of Hive queries using Hive LLAP and various other techniques.
Environment: Spark, Python, Hadoop, Hive, Sqoop, Play framework, Talend, Apache Ranger, S3, Grafana, Terraform, EMR, EC2, SNS, SQS, Lambda, Zeppelin, Kinesis, Athena, Jenkins, Rundeck, and AWS Glue.
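The bullet above about loading data from Hive into Amazon RDS (Aurora) can be sketched with Spark's JDBC writer, as below. The JDBC URL, database and table names, credentials handling, and tuning values (repartition count, batch size) are assumptions rather than details from the resume; in practice credentials would come from a secret store, not literals.

# Hedged sketch: bulk-loading a Hive table into Amazon RDS (Aurora MySQL) via Spark JDBC.
# The JDBC URL, table names, credentials, and tuning values are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-aurora-load")   # hypothetical job name
    .enableHiveSupport()
    .getOrCreate()
)

# Read the source Hive table (name assumed for illustration).
source_df = spark.table("curated.customer_accounts")

# Write to Aurora over JDBC; repartition controls parallel connections, and
# batchsize trades throughput against load on the target database.
(source_df
    .repartition(8)
    .write
    .format("jdbc")
    .option("url", "jdbc:mysql://example-aurora-endpoint:3306/reporting")  # hypothetical
    .option("dbtable", "customer_accounts")
    .option("user", "etl_user")      # placeholder; use a secrets manager in practice
    .option("password", "***")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("batchsize", 10000)
    .mode("append")
    .save())

spark.stop()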
CYIENT, India  Aug 2020 - July 2022
Senior Data Engineer
Responsibilities:
Involved in writing Spark applications using Python to perform various data cleansing, validation, transformation, and summarization activities according to requirements.
Developed multiple POCs using PySpark and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata, and developed code to read multiple data formats on HDFS using PySpark.
Wrote Spark applications using Scala to interact with the PostgreSQL database through SQLContext and accessed Hive tables through HiveContext.
Loaded data into Spark DataFrames and performed in-memory computation to generate output as per requirements.
Worked on migrating existing on-premises processes and databases to the AWS Cloud.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
Used AWS Redshift, S3, Redshift Spectrum, and Athena to query large amounts of data stored on S3, creating a virtual data lake without having to go through an ETL process.
Implemented a CI/CD process for Apache Airflow and its DAGs by building Airflow Docker images with Docker Compose and deploying them on an AWS ECS cluster.
Worked on Airflow performance tuning of DAGs and task instances.
Worked on Airflow scheduler (Celery executor) and worker settings in the airflow.cfg file.
Utilized Grafana dashboards to analyze query performance, resource utilization, and data throughput.
Worked with key Terraform features such as infrastructure as code, execution plans, resource graphs, and change automation, and extensively used Auto Scaling launch configuration templates for launching Amazon EC2 instances while deploying microservices.
Created hooks and a custom operator that senses trigger files in S3 and starts the data pipeline process.
Automated the resulting scripts and workflows using Apache Airflow and shell scripting to run as scheduled (cron) jobs in production.
Used Airflow for orchestration and scheduling of the ingestion scripts.
Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded the data into HDFS using Talend.
Implemented multiple data pipeline DAGs and maintenance DAGs in Airflow orchestration.
Configured and set up Airflow DAGs as required to run Spark commands in parallel and sequentially (a hedged sketch follows this section).
Created job flows using Airflow in Python and automated the jobs; Airflow runs on a separate stack for deploying DAGs and executes jobs on EMR on an EC2 cluster.
Responsible for the design and maintenance of databases using Python; developed Python-based APIs (RESTful web services) using Flask, SQLAlchemy, and PostgreSQL.
Designed and developed automation scripts and batch jobs to create data pipelines between multiple data sources and a Spark-based analytics platform (Databricks).
Environment: AWS EMR, AWS Glue, Redshift, Athena, Apache Airflow, Python, Kubernetes, Talend, Terraform, Docker, AKS, Flask, REST APIs, Hadoop, HDFS, Oracle, Hive, Spark, PyCharm, Grafana, cyber security APIs, and external libraries.
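The bullet above about Airflow DAGs that run Spark commands in parallel and sequentially can be illustrated with the minimal Airflow 2.x DAG below, using BashOperator and spark-submit. The DAG id, schedule, and script paths are hypothetical.

# Hedged sketch of an Airflow DAG that runs spark-submit commands in parallel
# and then sequentially. DAG id, schedule, and script paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_spark_pipeline",          # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",          # run once a day at 02:00
    catchup=False,
) as dag:

    # Two independent extract jobs run in parallel.
    extract_orders = BashOperator(
        task_id="extract_orders",
        bash_command="spark-submit /opt/jobs/extract_orders.py",     # hypothetical path
    )
    extract_customers = BashOperator(
        task_id="extract_customers",
        bash_command="spark-submit /opt/jobs/extract_customers.py",  # hypothetical path
    )

    # The aggregate job runs sequentially, only after both extracts succeed.
    build_aggregates = BashOperator(
        task_id="build_aggregates",
        bash_command="spark-submit /opt/jobs/build_aggregates.py",   # hypothetical path
    )

    [extract_orders, extract_customers] >> build_aggregates

On EMR the BashOperator could be swapped for EMR-specific operators; the dependency layout (fan-in from parallel tasks to a sequential step) stays the same.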
Sparsh Technologies, India  Mar 2018 - July 2020
Data Engineer
Responsibilities:
Responsible for developing a highly scalable and flexible authority engine for all customer data.
Worked on resetting customer attributes that provide insight about the customer, such as purchase frequency, marketing channel, and deal categorization.
Aggregated different sources of data using SQL, Hive, and Scala.
Integrated data from third-party agencies (gender, age, and purchase history from other sites) into the existing data store.
Used the Kafka HDFS Connector to export data from Kafka topics to HDFS files in a variety of formats, integrating with Apache Hive to make data immediately available for SQL querying.
Normalized data according to business needs, including data cleansing, datatype changes, and various transformations using Spark, Scala, and GCP Dataproc.
Implemented dynamic partitioning in BigQuery tables and used appropriate file formats and compression techniques to improve the performance of PySpark jobs on Dataproc.
Built a system for analyzing column names from all tables and identifying personal-information columns across on-premises databases being migrated to GCP.
Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (a hedged sketch follows this section).
Worked on partitioning of Pub/Sub messages and setting up replication factors.
Implemented disaster recovery by creating Elasticsearch clusters in two data centers and configuring Logstash to send the same data from Kafka to both clusters.
Maintained and developed Docker images for a tech stack including Cassandra, Kafka, Apache, and several in-house Java services running on Kubernetes in Google Cloud Platform (GCP).
Operations-focused work, including building proper monitoring of data processing and data quality.
Worked and communicated effectively with product, marketing, business owners, business intelligence, and the data infrastructure and warehouse teams.
Performed analysis of data discrepancies and recommended solutions based on root cause.
Worked on performance tuning of existing and new queries using SQL Server execution plans and SQL Sentry Plan Explorer to identify missing indexes, table scans, and index scans.
Redesigned and tuned stored procedures, triggers, UDFs, views, and indexes to increase the performance of slow-running queries.
Used Snowflake to create and maintain tables and views.
Optimized queries by adding necessary non-clustered indexes and covering indexes.
Developed Power Pivot/SSRS (SQL Server Reporting Services) reports and added logos, pie charts, and bar graphs for display purposes as per business needs.
Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and reporting.
Designed SSRS reports using parameters, drill-down options, filters, and sub-reports.
Developed internal dashboards for the team using Power BI for tracking daily tasks.
Environment: Python, GCP, Spark, Hive, Scala, Docker, Snowflake, Jupyter Notebook, shell scripting, SQL Server 2016/2012, T-SQL, SSIS, Visual Studio, Power BI, PowerShell.
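The bullet above about loading data from Pub/Sub into BigQuery with Cloud Dataflow in Python can be illustrated with the minimal streaming Apache Beam pipeline below. The project, topic, table, and schema are hypothetical placeholders, and the Dataflow runner options (project, region, staging locations) would be supplied separately.

# Hedged sketch: a streaming Beam (Python) pipeline reading JSON messages from a
# Pub/Sub topic and writing them to BigQuery. Names and schema are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def run():
    options = PipelineOptions()                       # Dataflow runner options go here
    options.view_as(StandardOptions).streaming = True  # unbounded (streaming) source

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/orders")      # hypothetical topic
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="example-project:analytics.orders",            # hypothetical table
                schema="order_id:STRING,amount:FLOAT,event_time:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()

For bounded (batch) loads the same pipeline shape works with a bounded source such as text files in GCS instead of ReadFromPubSub.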
Qualcomm, India  July 2016 - Feb 2018
Software Engineer
Responsibilities:
Developed Hive scripts, Hive UDFs, and Python scripts, and used Spark (Spark SQL, spark-shell) to process data in Hortonworks.
Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
Designed and developed Scala code to pull data from cloud-based systems and apply transformations to it.
Used Sqoop to import data into HDFS from a MySQL database and vice versa.
Implemented optimized joins to perform analysis on different data sets using MapReduce programs.
Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS that helps automate steps in the software delivery process.
Loaded and transformed large data sets of structured, unstructured, and semi-structured data in Hortonworks.
Implemented partitioning, dynamic partitions, and buckets in Hive and Impala for efficient data access (a hedged sketch follows this section).
Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
Worked on HiveQL and join operations, wrote custom UDFs, and gained good experience in optimizing Hive queries.
Ran queries using Impala and used BI and reporting tools (Tableau) to run ad hoc queries directly on Hadoop.
Worked on Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications on Hive.
Used the Spark framework with Scala and Python; good exposure to performance tuning of Hive queries and MapReduce jobs in the Spark (Spark SQL) framework on Hortonworks.
Collected and aggregated large amounts of log data using Kafka and staged the data in an HDFS data lake for further analysis.
Used Hive to analyze data ingested into HBase via Hive-HBase integration and HBase filters to compute various metrics for reporting on the dashboard.
Developed shell scripts in a UNIX environment to automate the data flow from source to different zones in HDFS.
Created and defined job workflows in Oozie according to their dependencies, set up e-mail notifications on job completion for the teams requesting the data, and monitored jobs using Oozie on Hortonworks.
Designed both time-driven and data-driven automated workflows using Oozie.
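The bullet above about partitioning and dynamic partitions in Hive can be illustrated with the sketch below: a partitioned Hive table plus a dynamic-partition insert, issued through Spark SQL so the example stays in Python. Database, table, and column names are hypothetical, and bucketing (also mentioned in the bullet) would typically be added in Hive DDL directly.

# Hedged sketch: creating a partitioned Hive table and loading it with dynamic
# partitioning via Spark SQL. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partition inserts.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Partition by event_date so queries that filter on date prune partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.device_events (
        device_id STRING,
        event_type STRING,
        payload STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert: the partition value comes from the data itself.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.device_events PARTITION (event_date)
    SELECT device_id, event_type, payload, event_date
    FROM staging.raw_device_events
""")

spark.stop()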
EDUCATION:
Bachelor's in Computer Science, Dr. B.R. Ambedkar University (2016)
Master's in Data Science, Florida State University (2023)
