Data Engineer Project Development Resume
Candidate's Name
Data Engineer
Phone: PHONE NUMBER AVAILABLE | Email: EMAIL AVAILABLE

Professional Summary:
Around 9 years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
Proficient with Hadoop components such as MapReduce, Kafka, Pig, Hive, Spark, HBase, Oozie, and Sqoop. Experience working with different Hadoop distributions such as CDH and Hortonworks, and good knowledge of the MapR distribution and Amazon EMR.
Strong experience writing scripts using the Python, PySpark, and Spark APIs for analyzing data (see the sketch following this summary).
Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop, working with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Experience in server infrastructure development on AWS, with extensive use of Virtual Private Cloud (VPC), CloudFormation JSON templates, CloudFront, ElastiCache, Redshift, SNS, SQS, SES, IAM, EBS, ELK, Auto Scaling, DynamoDB, Route 53, and CloudTrail.
Good knowledge of AWS components such as EC2 instances, S3, and EMR.
Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, Azure AD licenses, Office 365).
Experience working with data lakes: repositories of data stored in their natural/raw format, usually as object blobs or files.
Good experience in creating data ingestion pipelines, data transformations, data management, data governance, and real-time streaming at an enterprise level.
Hands-on experience with the Snowflake data warehouse, used for ETL and as a backend for visualization reporting.
Solid experience using various file formats such as CSV, TSV, Parquet, ORC, JSON, and Avro.
Extensive experience with search engines such as Apache Lucene, Apache Solr, and Elasticsearch.
Excellent understanding of data ingestion, transformation, and filtering, providing output for multiple stakeholders at the same time.
Coordinated with the machine learning team to perform data visualization using Cognos TM1, Power BI, and Tableau. Developed Spark and Scala applications for event enrichment, data aggregation, and de-normalization for different stakeholders.
Designed new data pipelines and made existing data pipelines more efficient.
Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near-real-time dashboards.
Designed and developed ETL integration patterns using Python on Spark, and developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
Ensured that Memcached, Redis, Elasticsearch, and Git were properly configured and maintained.
Worked on NoSQL databases such as HBase, Cassandra, and MongoDB.
Experienced in performing CRUD operations using the HBase Java client API and the Solr API.
Good experience working with cloud environments such as Amazon Web Services (AWS) EC2 and S3.
Experience implementing continuous delivery pipelines with Maven, Ant, Jenkins, and AWS.
Strong experience working with databases such as Oracle 10g, DB2, SQL Server 2008, and MySQL, with proficiency in writing complex SQL queries.
Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns. Responsible for estimating cluster size and for monitoring and troubleshooting the Databricks Spark cluster.
Experience using PL/SQL to write stored procedures, functions, and triggers.
Hands-on experience fetching live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka. Good knowledge of creating data pipelines in Spark using Scala.
Experience developing Spark programs for batch and real-time processing, including Spark Streaming applications for real-time processing. Good knowledge of Spark components such as Spark SQL, MLlib, Spark Streaming, and GraphX, with expertise in integrating data from multiple sources using Kafka.
Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing using Storm topologies; experience with Kafka deployment and integration with Oracle databases.
Experience in data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Kafka.
Comprehensive knowledge of the Software Development Life Cycle (SDLC), with exposure to Waterfall, Agile, and Scrum models. Highly adept at promptly and thoroughly mastering new technologies, with a keen awareness of new industry developments and the evolution of next-generation programming solutions.
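The summary above references PySpark and Spark SQL scripting for data analysis; the short sketch below illustrates that style of work. It is a minimal, hypothetical example: the paths, column names, and aggregation logic are assumptions for illustration, not code from any project on this resume.

# Minimal PySpark sketch of a DataFrame / Spark SQL analysis job.
# Paths, column names, and the aggregation are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analysis-sketch").getOrCreate()

# Read a raw dataset (Parquet assumed here; CSV, ORC, Avro, or JSON work the same way).
events = spark.read.parquet("s3a://example-bucket/raw/events/")  # assumed location

# DataFrame API: filter, derive a column, and aggregate.
daily_counts = (
    events
    .filter(F.col("status") == "COMPLETED")
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "source_system")
    .agg(F.count(F.lit(1)).alias("event_count"))
)

# The same kind of work expressed through Spark SQL on a temporary view.
events.createOrReplaceTempView("events")
top_sources = spark.sql("""
    SELECT source_system, COUNT(*) AS event_count
    FROM events
    WHERE status = 'COMPLETED'
    GROUP BY source_system
    ORDER BY event_count DESC
    LIMIT 10
""")

daily_counts.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_counts/")
top_sources.show(truncate=False)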
Technical Skills:
Big Data Space: Hadoop, MapReduce, Pig, Hive, HBase, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Elasticsearch, Solr, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy, AWS
Hadoop Distributions: Cloudera (CDH4 and CDH5), Hortonworks, Apache EMR, Azure HDInsight
Databases & Warehouses: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, Teradata
Languages: Python, Java, SQL, PL/SQL, Scala, JavaScript, Shell Scripts, C/C++
IDEs: PyCharm, Eclipse, NetBeans, JDeveloper, IntelliJ IDEA
RDBMS: Teradata, Oracle 9i/10g/11i, MS SQL Server, MySQL, DB2
ETL Tools: Informatica, Ab Initio, Talend
Reporting: Cognos TM1, Tableau, SAP BO, Power BI
Professional Experience:

Liberty Mutual, Boston, Massachusetts          June 2022 - Present
Sr. Data Engineer
Responsibilities:
Designed and deployed Hadoop clusters and different Big Data analytics tools, including Pig, Hive, HBase, Oozie, Sqoop, Kafka, and Spark, with the Cloudera distribution.
Worked on the Cloudera distribution deployed on AWS EC2 instances.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Performed data ingestion from multiple internal clients using Apache Kafka.
Integrated Apache Kafka with the Spark Streaming process to consume data from external REST APIs and run custom functions.
Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
Developed Spark code in Python (PySpark) and Spark SQL for faster testing and processing of data.
Used Python to parse XML files and create flat files from them.
Built machine learning models, including SVM, random forest, and XGBoost, to score and identify potential new business cases using Python scikit-learn.
Orchestrated Docker containers using Kubernetes for production deployment.
Deployed services as Kubernetes Pods, ensuring high availability and load balancing across multiple instances.
Set up Kubernetes Ingress to manage external traffic and route it to the appropriate microservices.
Automated the building, testing, and deployment of Docker containers to Kubernetes clusters upon code commits.
Utilized Docker for containerization and Kubernetes for orchestration to create a robust microservices architecture.
Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
Optimized Hive queries using best practices and the right parameters, and technologies such as Hadoop, YARN, Python, and PySpark.
Developed Spark scripts using Scala shell commands as per requirements.
Developed automated regression scripts in Python for validating the ETL process across multiple databases, such as AWS Redshift, Oracle, T-SQL, and SQL Server.
Stored time-series transformed data from the Spark engine, built on top of a Hive platform, into Amazon S3 and Redshift.
Configured Spark Streaming to receive real-time data from Kafka and store it in HDFS (see the streaming sketch after this section).
Adopted Terraform for use with the delivery pipeline in collaboration with the SRE and development teams.
Strong AWS, Linux/Bash, and automation scripting experience (Terraform/CloudFormation).
Developed APIs using AWS Lambda to manage servers and run code in AWS.
Used AWS Elastic Beanstalk for application deployments and worked on AWS Lambda with Amazon Kinesis.
Created scheduled jobs through AWS Lambda to initiate daily batch data pulls and execute continuous integration tests under CircleCI.
Developed data ingestion modules (both real-time and batch) to load data into various layers in S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions.
Experience with dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).
Exported data into Snowflake by creating staging tables to load data from different file types in Amazon S3.
Used Amazon EMR for data processing on a Hadoop cluster of virtual servers running on Amazon EC2 and S3.
Created a queuing system that takes in files from various sources, formats them, and uploads them to an S3 bucket using AWS Lambda and SQS.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
Created Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data from Oracle and MySQL.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
As part of data migration, wrote SQL scripts to identify data mismatches and worked on loading history data from Teradata SQL into Snowflake.
Optimized Snowflake tables for performance improvements and cost savings.
Used Python in Spark to extract data from Snowflake and upload it daily.
Reduced access time by refactoring data models, streamlining queries, and implementing a Redis cache to support Snowflake.
Configured a continuous integration system to execute suites of automated tests at desired frequencies using Jenkins, Maven, and Git.
Environment: Hadoop, HDFS, HBase, Hive, Spark, Cloudera, AWS EC2, S3, EMR, Redshift, Sqoop, Kafka, YARN, Snowflake, Shell Scripting, Scala, Python, Oozie, Agile methods, MySQL
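As a minimal illustration of the Kafka-to-HDFS streaming ingestion described in the responsibilities above, the sketch below uses PySpark Structured Streaming to consume JSON events from a Kafka topic and land them in HDFS as partitioned Parquet. The broker address, topic name, event schema, and paths are assumptions for illustration only.

# Minimal PySpark Structured Streaming sketch: consume JSON events from Kafka
# and persist them to HDFS as Parquet. Broker, topic, schema, and paths are
# illustrative assumptions, not actual project values.
# Requires the Kafka connector on the classpath, e.g.
# --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version>.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("amount", DoubleType()),
    StructField("source", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
    .option("subscribe", "claims-events")                # assumed topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; parse the value as JSON into columns.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
       .withColumn("event_date", F.to_date("event_ts"))
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/claims_events")              # assumed landing path
    .option("checkpointLocation", "hdfs:///checkpoints/claims_events")  # assumed checkpoint path
    .partitionBy("event_date")
    .trigger(processingTime="1 minute")  # micro-batches, mirroring the batch-wise processing described above
    .start()
)
query.awaitTermination()

Checkpointing to a durable location is what allows the stream to resume after a restart without duplicating output files.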
Northern Trust, Chicago, Illinois          June 2019 - May 2022
Data Engineer
Responsibilities:
Hands-on experience working with ADF, ADLS, Blob Storage, SQL Server, Azure DevOps, Synapse, and Azure Databricks.
Used the Copy activity to copy data from an on-premises SQL Server into ADLS.
Used Databricks templates to submit jobs from ADF onto Azure Databricks clusters.
Worked extensively with Spark for data ingestion and data transformations, and handled the CDC process using Delta Lake tables (see the merge sketch after this section).
Developed Spark code using Python, Scala, and Spark SQL, working with both batch and streaming datasets for faster processing of data.
Designed a custom Spark REPL application to handle similar datasets.
Experienced in development using the Cloudera distribution.
Enhanced security by configuring Kubernetes Network Policies to control pod-to-pod communication.
Prepared comprehensive documentation on Docker and Kubernetes configurations, deployment procedures, and best practices.
Extensive working experience with the Snowflake data warehouse.
Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
Performed Hive test queries on local sample files and HDFS files.
Utilized Apache Spark with Python to develop and execute Big Data analytics.
Hands-on coding: wrote and tested code for the ingest automation process (full and incremental loads).
Designed the solution and developed the program for data ingestion using Sqoop, MapReduce, shell scripts, and Python.
Developed various automated scripts for DI (data ingestion) and DL (data loading) using Python MapReduce.
Created and maintained an optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
Developed the application in the Eclipse IDE.
Developed Hive queries to analyze data and generate results.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Worked on analyzing the Hadoop cluster and different Big Data analytics tools, including Pig, Hive, HBase, Spark, and Sqoop.
Exported data from HDFS to an RDBMS via Sqoop for business intelligence, visualization, and user report generation.
Used Scala to write code for all Spark use cases.
Analyzed user request patterns and implemented various performance optimization measures, including partitions and buckets in HiveQL.
Created Hive tables and worked on them using HiveQL.
Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
Developed Spark SQL jobs to load tables into HDFS and run select queries on top of them.
Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
Environment: Hadoop, Hive, Java, Linux, Oracle, MySQL, Spark, Hortonworks, HDFS, Solr, HAR, HBase, Oozie, Scala, Agile, Azure, Azure Databricks, Azure Data Factory, Azure Data Lake, Python, SOAP API web services, Power BI, Tableau, Apache Airflow, Jira
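As a minimal illustration of the Delta Lake CDC handling mentioned above, the sketch below applies a change feed to a Delta table with a MERGE (upsert/delete), as it might look in a Databricks-style PySpark job. The storage paths, key column, and op_type convention are assumptions for illustration only.

# Minimal Delta Lake CDC/upsert sketch. Paths, table layout, and key columns
# are illustrative assumptions. On Databricks, Delta and the `spark` session
# are preconfigured; running locally requires the delta-spark package.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-cdc-sketch").getOrCreate()

# Incremental change records (assumed to be copied into ADLS, e.g. via an ADF
# Copy activity), carrying an op_type column of I/U/D.
changes = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/customers_cdc/")

target = DeltaTable.forPath(
    spark, "abfss://curated@examplelake.dfs.core.windows.net/customers_delta/"
)

# MERGE applies inserts, updates, and deletes from the change feed to the target table.
(
    target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op_type = 'D'")
    .whenMatchedUpdateAll(condition="s.op_type = 'U'")
    .whenNotMatchedInsertAll(condition="s.op_type IN ('I', 'U')")
    .execute()
)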
Signify Health, Dallas, Texas          Aug 2017 - June 2019
Data Engineer
Responsibilities:
Helped build a centralized data lake on the AWS Cloud utilizing primary services such as S3, EMR, Redshift, Lambda, and Glue.
Developed scripts to automate the ingestion process using PySpark from various sources such as APIs, AWS S3, Teradata, and Redshift.
Implemented AWS Redshift by extracting, transforming, and loading data from various heterogeneous data sources and destinations.
Deployed Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries; this allowed for a more reliable and faster reporting interface, giving sub-second query responses for basic queries.
Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.
Worked on migrating datasets and ETL workloads from on-premises systems to the AWS Cloud.
Built a series of Spark applications and Hive scripts to produce various analytical datasets needed by digital marketing teams.
Worked extensively on fine-tuning Spark applications and providing production support for various pipelines running in production.
Worked on automating the infrastructure setup, launching, and termination of EMR clusters.
Created Kafka producers using the Kafka Producer API to connect to external REST live-stream applications and produce messages to Kafka topics.
Implemented ETL migration services by developing AWS Lambda functions to generate a serverless data pipeline that writes to the AWS Glue Catalog and can be queried from AWS Athena.
Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
Developed robust and scalable data integration pipelines to transfer data from S3 buckets to the Redshift database using Python and AWS Glue.
Managed metadata to ensure quick and efficient discovery of data for customer projects in the AWS data lake, using services such as AWS Lambda and AWS Glue.
Created Hive external tables on top of datasets loaded into S3 buckets and created various Hive scripts to produce a series of aggregated datasets for downstream analysis.
Developed AWS Glue ETL jobs for data transformation, validation, and cleansing, with crawlers to catalog the data from S3 and support SQL query operations.
Designed and created S3 buckets with custom policies for client access management using AWS IAM (Identity and Access Management).
Extensive experience working with Cloudera (CDH4 and CDH5) and Amazon EMR to fully leverage and implement new Hadoop features.
Created Airflow DAGs for batch processing to orchestrate Python data pipelines for pre-ingestion CSV file preparation, using conf to parameterize runs for a multitude of input files (see the DAG sketch after this section).
Used Apache Airflow to author workflows as directed acyclic graphs (DAGs), to visualize batch and real-time data pipelines running in production, monitor progress, and troubleshoot issues when needed.
Enhanced the design and implementation of fully automated continuous integration, continuous delivery, and continuous deployment (CI/CD) pipelines and DevOps processes for Agile projects.
Managed the workload and utilization of the team, and coordinated resources and processes to achieve Tableau implementation plans.
Environment: AWS S3, Apache Airflow, EMR (Elastic MapReduce), Redshift, Athena, Glue, Spark, Scala, Python, Hive, Kafka, Pig, HDFS, Git, DynamoDB, Lambda, Step Functions, Parquet, Tableau.
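As a minimal illustration of the Airflow orchestration described above, the sketch below defines a two-task DAG that prepares a CSV file and hands it off for ingestion, parameterized through dag_run.conf. The DAG id, schedule, bucket keys, and helper callables are hypothetical.

# Minimal Airflow DAG sketch: prepare CSV files before ingestion, parameterized
# through dag_run.conf. DAG id, schedule, keys, and helper functions are
# illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def prepare_csv(**context):
    """Clean/normalize the input CSV named in dag_run.conf (hypothetical helper)."""
    conf = context["dag_run"].conf or {}
    input_key = conf.get("input_key", "incoming/default.csv")
    print(f"Preparing {input_key} for ingestion")
    # ... header normalization, validation, and type coercion would go here ...


def load_to_s3(**context):
    """Upload the prepared file to the landing bucket (hypothetical helper)."""
    conf = context["dag_run"].conf or {}
    print(f"Uploading prepared file for {conf.get('input_key', 'incoming/default.csv')}")


default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="csv_preingestion_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    prepare = PythonOperator(task_id="prepare_csv", python_callable=prepare_csv)
    upload = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)

    prepare >> upload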
HSBC, India          June 2014 - Aug 2017
ETL Developer
Responsibilities:
Gathered business requirements and prepared technical design documents, target-to-source mapping documents, and mapping specification documents.
Extensively worked on Informatica PowerCenter.
Hands-on experience working with SSIS, SQL Server, SSRS, and SSAS.
Parsed complex files through Informatica data transformations and loaded them into the database.
Optimized query performance using Oracle hints, forced indexes, constraint-based loading, and other approaches.
Collaborated with data engineers and the operations team to implement the ETL process and Snowflake models, and wrote and optimized SQL queries to perform data extraction that fit the analytical requirements.
Extensively worked on UNIX shell scripting for splitting groups of files into smaller files and for file transfer automation.
Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive semi-structured and unstructured data, and loaded unstructured data into the Hadoop Distributed File System (HDFS).
Worked with the AutoSys scheduler for scheduling different processes.
Performed basic and unit testing.
Assisted in UAT testing and provided necessary reports to the business users.
Environment: Informatica PowerCenter, Oracle, MySQL, UNIX Shell Scripting, AutoSys.

Education:
Bachelor's in Computer Science
Osmania University
