Candidate's Name
Sr. Big Data Engineer
Phone: PHONE NUMBER AVAILABLE; Email: EMAIL AVAILABLE

Professional Summary
- A seasoned professional with over 15 years of expertise in Big Data development and ETL technologies, and 19 years of overall experience in IT.
- Renowned for creating innovative, large-scale data systems through meticulous analytical approaches.
- Recognized for consistently delivering high-quality results across multiple projects and platforms.
- Mastered the management and architecture of data lakes, ensuring robust and scalable data solutions.
- Skilled in using Google Cloud Dataprep for effective data preparation and transformation tasks.
- Proficient in Terraform for efficient management and provisioning of GCP resources.
- Designed and orchestrated complex workflows using Google Cloud Composer.
- Leveraged Google Cloud Pub/Sub for scalable, event-driven messaging solutions.
- Implemented comprehensive compliance monitoring with Google Cloud Audit Logging.
- Specialized in high-volume data processing and analysis using AWS EMR and AWS Lambda.
- Designed and implemented data warehousing solutions utilizing AWS Redshift for efficient querying and analysis.
- Extensive experience with AWS CloudWatch for resource monitoring and AWS IAM for robust security controls.
- Developed intricate workflows using AWS Step Functions to streamline complex data processes.
- Formulated technical strategies and authored CloudFormation scripts to ensure efficient project execution.
- Expertise in designing scalable data architectures using Azure Data Lake, Azure Synapse Analytics, and Azure SQL.
- Constructed and managed sophisticated data pipelines with Azure Data Factory.
- Skilled in using Azure data services such as Azure SQL Database and Azure Data Factory for managing large datasets.
- Proficient in implementing Azure Databricks for scalable Spark-based data analytics solutions.
- Specialized in optimizing Spark performance across platforms such as Databricks, Glue, EMR, and on-premises systems.
- Led successful on-premises data migration projects to the cloud, ensuring seamless transitions and enhanced performance.
- Extensive hands-on experience with Apache Spark, Kafka, and a variety of ETL technologies.
- Proficient in the major components of the Hadoop ecosystem, enabling robust data solutions.
- Expertise in integrating Apache NiFi with Apache Kafka for streamlined data flow and processing.
- Designed and managed data transformation and filtration patterns using Spark, Hive, and Python for optimal data processing.
- Upgraded MapR, CDH, and HDP clusters, ensuring enhanced system performance and seamless transitions.
- Developed scalable and reliable solutions for real-time and batch data movement across multiple systems.
- Extensive experience in managing and optimizing on-premises data storage systems, including relational databases and distributed file systems.
- Skilled in deploying and configuring systems such as Apache Hadoop and Apache Spark to ensure efficient data processing.
- Validated the technical and operational feasibility of Hadoop developer solutions, providing valuable insights for project success.
- Engaged directly with business stakeholders to understand project needs and align technical solutions with overarching goals.
- Proficient in designing and implementing comprehensive data security and access controls.
- Adept at developing intricate workflows and comprehensive technical strategies for efficient project execution.

Technical Skills
Programming Languages: Python, Scala, PySpark, SQL
Scripting: Hive, MapReduce, SQL, Spark SQL, Shell Scripting
IDEs: Jupyter Notebooks, Eclipse, IntelliJ, PyCharm, VS Code
Databases & Tools: Redshift, DynamoDB, Azure Synapse, Bigtable, AWS RDS, SQL Server, PostgreSQL, Oracle, MongoDB, Cassandra, HBase
Hadoop Distributions: Apache Hadoop, Cloudera Hadoop, Hortonworks Hadoop
ETL Tools: Spark, NiFi, AWS Glue, AWS EMR, Databricks, Azure Data Factory, Google Dataflow
File Formats & Compression: CSV, JSON, Avro, Parquet, ORC
File Systems: HDFS, S3, Google Cloud Storage, Azure Data Lake
Cloud Platforms: Amazon AWS, Google Cloud (GCP), Microsoft Azure
Orchestration Tools: Apache Airflow, AWS Step Functions, Oozie
CI/CD: Jenkins, AWS CodePipeline, Docker, Kubernetes, Terraform, CloudFormation
Versioning: Git, GitHub
Programming Methodologies: Object-Oriented Programming, Functional Programming
Project Methods: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking, Lean Six Sigma
Data Visualization Tools: Tableau, Kibana, Power BI, AWS QuickSight
Search Tools: Apache Lucene, Elasticsearch
Security: Kerberos, Ranger, IAM

Professional Experience

Lead Big Data Engineer
Citibank, New York, NY, August 2023 - Present
- Designed and implemented high-throughput data ingestion pipelines using Google Cloud Dataflow to process streaming and batch data, ensuring minimal latency and high resilience.
- Executed large-scale data migrations to Google BigQuery, applying best practices for data transfer and establishing reliable data streams for real-time analytics.
- Spearheaded the adoption of Google BigQuery ML, enabling machine learning capabilities directly within the data warehouse for the creation, training, and deployment of ML models on large datasets.
- Developed robust ETL (Extract, Transform, Load) pipelines using Scala to process and manage large datasets.
- Developed streamlined Terraform templates for agile infrastructure provisioning within GCP.
- Efficiently managed Hadoop clusters on GCP using Dataproc and associated tools.
- Processed real-time data using Google Cloud Pub/Sub and Dataflow, building optimized data pipelines and coordinating executions with Apache Airflow.
- Integrated NiFi with various data sources and destinations, including databases, cloud storage, messaging systems, and APIs.
- Designed and managed PostgreSQL databases, including schema creation, indexing, and performance tuning to optimize query execution for large-scale data operations.
- Delivered resilient data warehousing solutions by orchestrating robust data storage in Google Cloud Storage and BigQuery using Terraform.
- Leveraged Google Kubernetes Engine (GKE) to manage compute instances and deploy containerized applications, optimizing resource utilization and scalability through Docker.
- Enhanced data processing efficiency by leveraging Google Cloud Dataprep and Dataflow, and ensured data integrity and security through meticulous governance measures using Google Cloud IAM.
- Implemented security measures for Kafka clusters, including encryption, authentication, and authorization.
- Optimized resource usage and reduced operational costs by managing Hadoop/Spark clusters with Google Cloud Dataproc, utilizing preemptible VMs and autoscaling.
- Optimized PySpark code for performance, including memory management, data partitioning, and caching strategies.
- Worked closely with data scientists and analysts to build comprehensive data pipelines and conduct in-depth analysis within GCP's ecosystem.
- Utilized Databricks for efficient ETL pipeline development, ensuring streamlined data processing workflows.
- Developed and maintained ETL pipelines using Apache Spark and Python on GCP for large-scale data processing and analysis.
- Implemented NiFi processors, controllers, and other components to build scalable and efficient data pipelines.
- Orchestrated event-driven architectures using Google Cloud Functions to execute serverless functions in response to cloud events, reducing operational complexity.
- Managed data flow through Kafka topics and partitions, addressing bottlenecks and congestion effectively.
- Used the PySpark SQL and DataFrame APIs to perform complex queries and data manipulations.
- Optimized HiveQL queries and Python scripts for performance, ensuring efficient use of resources and reducing query execution times.
- Designed and developed scalable big data applications using Scala and related frameworks such as Apache Spark and Akka.
- Harnessed a range of GCP services, including Compute Engine, Cloud Load Balancing, Storage, SQL, Stackdriver Monitoring, and Deployment Manager.
- Configured precise GCP firewall rules to manage VM instance traffic, optimizing content delivery and reducing latency through GCP Cloud CDN.
- Automated data conversion and ETL pipeline generation using Python scripts, optimizing performance and scalability through Spark, YARN, and GCP Dataflow.
- Led the migration of enterprise-grade applications to GCP, significantly enhancing scalability and reliability while ensuring seamless integration with existing on-premises solutions.
- Fostered innovation by setting up collaborative environments using GCP's Cloud Source Repositories and integrated tooling for source code management.
- Designed and implemented Snowflake data warehouses, developed data models, and crafted dynamic data pipelines for real-time processing across diverse data sources.
- Employed natural language toolkits for data processing, analyzing keywords, and generating word clouds.
- Mentored junior engineers in ETL best practices, Snowflake, and Python, while managing and evolving ETL pipelines using Apache Spark and Python on GCP.
- Designed and built scalable ETL pipelines using PySpark to handle large datasets.
- Contributed expertise across onshore and offshore development models, architecting scalable data solutions that integrate Snowflake, Oracle, GCP, and various other components.
- Collaborated extensively with data scientists and analysts on diverse projects, including fraud detection, risk assessment, and customer segmentation, actively contributing to recovery plan development and implementation.

Lead AWS Big Data Engineer
Bristol Myers Squibb, New York, NY, Mar 2021 - Jul 2023
- Led the deployment orchestration of Amazon EC2 instances and proficiently managed S3 storage, tailoring instances for specialized applications and Linux distributions.
- Managed critical AWS services including S3, Athena, Glue, EMR, Kinesis, Redshift, IAM, VPC, EC2, ELB, CodeDeploy, RDS, ASG, and CloudWatch, constructing a resilient and high-performance cloud architecture.
- Provided expert insights on AWS system architecture, implementing stringent security measures and protocols to safeguard crucial assets and ensure regulatory compliance.
- Optimized data processing capabilities using Amazon EMR and EC2, driving enhancements in system performance and scalability to meet demanding requirements.
- Integrated Hive with various data sources and destinations using Python, enabling seamless data flow across systems.
- Integrated MySQL with Apache Spark for big data processing, facilitating seamless data extraction, transformation, and loading operations.
- Designed and refined data processing workflows, leveraging AWS services such as Amazon EMR and Kinesis, to ensure streamlined and efficient data analysis.
- Optimized Scala code for performance and scalability, ensuring efficient use of resources.
- Leveraged AWS Glue for data cleansing and preprocessing, conducted real-time analysis using Amazon Kinesis Data Analytics, and utilized services such as EMR, Redshift, DynamoDB, and Lambda for scalable data processing.
- Automated Hive job execution and data workflows using Python-based scheduling tools such as Apache Airflow and custom scripts.
- Applied data quality checks and validations during the ETL process to ensure accuracy and consistency of the transformed data.
- Optimized ETL processes for performance and scalability by tuning data transformations and managing resource utilization.
- Implemented robust data governance and security protocols through AWS IAM and Amazon Macie, ensuring the safeguarding of sensitive data assets.
- Utilized AWS Glue and fully managed Kafka for seamless data streaming, transformation, and preparation.
- Implemented robust error handling and logging mechanisms to monitor ETL jobs and ensure data integrity throughout the pipeline.
- Optimized NiFi data flows for performance, scalability, and resource efficiency.
- Established and managed data storage solutions on Amazon S3 and Redshift, ensuring optimal accessibility and organization of data assets.
- Utilized advanced Scala features such as immutability, type safety, and functional programming to write high-quality, maintainable code.
- Monitored and fine-tuned NiFi performance to handle high-volume, high-velocity data streams effectively.
- Seamlessly integrated Snowflake into the data processing workflow, elevating data warehousing capabilities to enhance decision-making processes.
- Evaluated opportunities for on-premises data infrastructure migration to AWS, orchestrating data pipelines via AWS Step Functions and leveraging Amazon Kinesis for event-driven processing.
- Proficiently utilized Apache Airflow for workflow automation, orchestrating intricate data pipelines and automating tasks to enhance operational efficiency.
- Implemented AWS CloudFormation to automate the provisioning of AWS resources, ensuring consistent and repeatable deployments across multiple environments.
Sr. Big Data Engineer (AWS)
Etsy, Brooklyn, NY, Sep 2018 - Feb 2021
- Led the migration of existing Hive scripts and shell orchestration code to Databricks SQL scripts and workflow orchestration, leveraging AWS services for modernized infrastructure.
- Developed and executed a comprehensive migration strategy covering analysis, design, code migration, orchestration setup, and CI/CD integration, utilizing Databricks Asset Bundles, Workflows, and AWS services.
- Collaborated with cross-functional teams to gather requirements and design migration strategies aligned with business objectives, utilizing AWS services such as Amazon S3, AWS EMR, and third-party tools.
- Migrated core logic code and orchestration code to Databricks, leveraging AWS EMR for efficient data processing and workflow management.
- Set up Databricks environments within the AWS ecosystem and integrated them with Databricks' native tools for automated deployment and environment consistency.
- Conducted development testing using the Databricks SQL Editor, Job Runs, Compute, and Catalog to validate migrated scripts and workflows, emphasizing AWS compatibility.
- Managed code reviews and approvals utilizing Databricks' collaborative features, ensuring coding standards and AWS integration.
- Coordinated comprehensive testing of migrated solutions using a custom data validation tool in Python, addressing discrepancies while ensuring AWS compatibility.
- Automated ETL tasks using Python and Apache NiFi, improving data processing efficiency and reducing manual intervention.
- Managed changes to Control-M job schedules using Shell and Python scripts to reflect the migration to Databricks and ensure AWS integration.
- Developed and managed ETL workflows using Apache Airflow, automating data extraction, transformation, and loading processes across multiple data sources.
- Executed parallel runs of migrated processes alongside existing systems using Databricks functionalities, validating consistency and accuracy aligned with AWS services.
- Planned and executed the final cutover from legacy systems to Databricks, utilizing AWS services for minimal disruption.
- Conducted cleanup activities using Visual Studio Code, Git, and the AWS CLI, focusing on AWS environment optimization.
- Provided documentation and knowledge transfer to stakeholders and team members using Confluence.
- Designed and optimized algorithms for real-time data processing using Databricks and PySpark features, ensuring compatibility within the AWS environment.
- Conducted post-migration analysis to evaluate the efficiency of migrated pipelines on the Databricks platform, focusing on AWS performance metrics.
- Identified areas for further optimization or enhancement based on stakeholder feedback, prioritizing AWS integration and efficiency.
Sr. Cloud Data Engineer
Liberty Mutual Insurance, Boston, Massachusetts, Jan 2016 - Aug 2018
- Developed cutting-edge web applications using microservices by integrating a variety of Azure technologies such as Cosmos DB, App Insights, Blob Storage, API Management, and Functions.
- Successfully executed data transformations using Azure SQL Data Warehouse and Azure Databricks, handling crucial tasks such as data cleaning, normalization, and standardization.
- Developed and maintained scalable data processing pipelines using Python to handle large volumes of data.
- Leveraged Azure Databricks to process and store petabytes of data from multiple sources, integrating Azure SQL Database, external databases, Spark RDDs, and Azure Blob Storage.
- Modeled Hive partitions within Azure HDInsight and implemented caching techniques in Azure Databricks to optimize RDD performance and accelerate data processing.
- Extracted and profiled data from various formats, including Excel spreadsheets, flat files, XML, relational databases, and data warehouses, using SQL Server Integration Services (SSIS).
- Integrated Python with big data technologies such as Apache Hadoop, Apache Spark, and Apache Kafka.
- Engineered complex, maintainable Python and Scala code within Azure Databricks, facilitating seamless data processing and analytics.
- Transitioned MapReduce jobs to PySpark in Azure HDInsight for enhanced performance.
- Orchestrated ETL workflows using UNIX scripts on Azure Blob Storage and Linux Virtual Machines, automating processes and ensuring robust error handling and file operations.
- Created comprehensive documentation for Kafka configurations, processes, and best practices.
- Utilized Google Bigtable for scalable NoSQL database solutions, enabling rapid read and write operations for large datasets.
- Planned and architected Azure Cloud environments for migrated IaaS VMs and PaaS role instances, ensuring accessibility and scalability.
- Developed Apache Airflow modules for seamless workflow management across cloud services.
- Communicated effectively with team members and management regarding Kafka-related initiatives, challenges, and solutions.
- Optimized query performance and empowered data-driven decision making using Snowflake's scalable data solutions and SQL-based analysis.

Data Engineer
Hyundai Motor America, Fountain Valley, CA, Jun 2013 - Dec 2015
- Designed and optimized data models and schemas using Amazon Redshift for querying structured data.
- Employed PySpark to ingest both structured and unstructured financial data from multiple sources, storing and querying processed data with Amazon DynamoDB.
- Implemented real-time data ingestion and processing pipelines with Amazon Kinesis, ensuring scalable data streaming capabilities.
- Developed high-performance data processing applications utilizing PySpark libraries.
- Utilized Python and Scala for data manipulation and modeling within Spark and other data processing frameworks.
- Optimized data pipeline performance by tuning AWS Glue and adjusting resource allocation.
- Enhanced data security with AWS Secrets Manager for managing encryption keys and secrets.
- Integrated Amazon DynamoDB for large-scale semi-structured data storage, optimizing retrieval performance.
- Implemented data quality checks to validate the completeness and accuracy of data loaded into Amazon S3 and DynamoDB.
- Orchestrated workflow scheduling and automation using AWS Step Functions.
- Leveraged Amazon Athena for interactive querying and analysis of data stored in S3.
- Developed data visualization dashboards and reports using Amazon QuickSight.
- Implemented error handling and logging to ensure the reliability of the data pipeline.
- Deployed and configured Kinesis Data Streams for optimal performance.
- Contributed to data engineering best practices, including code review, documentation, and performance optimization.
- Participated in cross-functional teams to design and implement scalable data solutions.
- Implemented CI/CD pipelines via Jenkins for ETL code deployment and version control using Git.
- Designed Spark SQL queries for real-time data extraction and integrated Spark Streaming for Kinesis data consumption.
- Ensured data quality through automated testing in CI/CD pipelines and implemented ETL pipeline automation using Jenkins and Git.
- Configured and maintained the ELK stack (Elasticsearch, Logstash, Kibana) for centralized logging, real-time search, and data visualization.

Hadoop Administrator
ConocoPhillips, Houston, Texas, Jan 2009 - May 2013
- Administered Apache Hadoop clusters, managing components such as Hive, HBase, Zookeeper, and Sqoop for robust application development and data governance.
- Seamlessly handled Cloudera CDH and Hortonworks HDP installations and upgrades across various environments to ensure uninterrupted operations.
- Developed and implemented data security policies, procedures, and standards to protect sensitive data assets in accordance with regulatory requirements and industry best practices.
- Collaborated with cross-functional teams to gather requirements and design scalable data solutions in Snowflake aligned with business objectives.
- Developed high-performance data processing pipelines using Java and Big Data frameworks (e.g., Hadoop, Spark, Flink) to extract, transform, and load (ETL) large datasets.
- Managed the commissioning, decommissioning, and recovery of DataNodes, and orchestrated multiple Hive jobs using the Oozie workflow engine.
- Enhanced MapReduce job execution efficiency through targeted monitoring and optimization strategies.
- Implemented High Availability protocols and automated failover mechanisms with Zookeeper, eliminating single points of failure for NameNodes.
- Collaborated with data analysts to provide innovative solutions for analysis tasks and conducted comprehensive reviews of Hadoop and Hive log files.
- Facilitated efficient data exchange between RDBMS, HDFS, and Hive using Sqoop, implementing seamless data transfer protocols.
- Integrated CDH clusters with Active Directory, strengthening authentication with Kerberos for enhanced security.
- Maximized fault tolerance by configuring and utilizing Kafka brokers for data writing, and leveraged Spark for faster processing when needed.
- Utilized Hive for dynamic table creation, data loading, and UDF development, working closely with Linux server admin teams to optimize server hardware and OS functionality.
- Proposed and implemented continuous improvements to enhance the efficiency and reliability of Kafka clusters.
- Performed regular security assessments and penetration tests of big data environments to identify and remediate security vulnerabilities and weaknesses.
- Partnered with application teams to facilitate OS updates and execute Hadoop version upgrades.
- Streamlined workflows using shell scripts for seamless data extraction from diverse databases into Hadoop.
- Continuously monitored and evaluated the effectiveness of data security controls and measures through metrics, audits, and performance assessments.
- Managed and worked with NoSQL databases such as MongoDB, Cassandra, and HBase.

Data Analyst
Splunk, San Francisco, CA, Feb 2005 - Dec 2008
- Gathered data from diverse sources and integrated it into the Splunk platform, ensuring data accuracy and completeness.
- Oversaw the monitoring and management of data within Splunk, ensuring data quality, availability, and security.
- Performed advanced data analysis using Splunk to uncover trends, patterns, anomalies, and insights that drive business decisions and actions.
- Developed and generated regular and ad-hoc reports, and created visualizations and dashboards in Splunk to effectively communicate findings to stakeholders.
- Developed and implemented statistical models and predictive analytics solutions within Splunk to forecast trends, identify opportunities, and mitigate risks.
- Collaborated with cross-functional teams to understand business requirements, translate them into data analytics solutions, and effectively communicate findings and recommendations.
- Utilized advanced features and capabilities of Splunk to optimize data analytics processes, improve efficiency, and enhance insight generation.
- Provided guidance, mentorship, and training to junior analysts and stakeholders on best practices for utilizing Splunk for data analysis and visualization.
- Stayed current with the latest trends, tools, and technologies in data analytics and Splunk, and actively sought opportunities to enhance existing processes and methodologies.
- Ensured compliance with data governance policies, regulations, and security standards when handling and analyzing sensitive data within Splunk.
- Identified and resolved technical issues, challenges, and discrepancies within Splunk to maintain data integrity and system functionality.
- Documented data analysis methodologies, processes, and findings, maintaining clear and comprehensive documentation for reference and knowledge sharing.
- Led and participated in data analytics projects from conception to execution, ensuring timely delivery of high-quality insights and solutions.

Education
Master of Information Technology
University of Agder
Bachelor of Science in Computer Science
University of Agder