Candidate's Name
Data Engineer
PHONE NUMBER AVAILABLE
EMAIL AVAILABLE

SUMMARY
Overall 8+ years of IT experience in analysis, design, development and Big Data in Scala, Spark, Hadoop, Pig and HDFS environments, with experience in Python.
Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
Strong experience in building fully automated Continuous Integration and Continuous Delivery pipelines and DevOps processes for agile store-based applications in the Retail and Transportation domains.
Databricks is a unified data analytics platform that accelerates innovation by unifying data science, engineering, and business; it is designed to streamline data processing, data mining, machine learning, and analytics workflows.
Good knowledge of Amazon Web Services (AWS) concepts such as EC2, S3, EMR, ElastiCache, DynamoDB, Redshift and Aurora.
Experience in developing scripts using Python and shell scripting to extract, load and transform data, with working knowledge of AWS Redshift.
SQL Server Integration Services (SSIS) is a component of Microsoft SQL Server that provides tools for data integration and workflow applications; it is used for data extraction, transformation, and loading (ETL) operations.
Using Terraform to manage AWS Lambda functions allows automation through infrastructure as code (IaC), making deployment, scaling, and management more efficient and consistent.
Data management and integration are crucial processes in modern organizations to ensure that data is accurate, consistent, and accessible across various systems and applications; effective strategies enable businesses to make informed decisions, improve operational efficiency, and gain competitive advantages.
Extensively used SQL, NumPy, Pandas and Hive for data analysis and model building.
Azure Data Factory (ADF) is designed for ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) operations in the cloud, enabling data movement and transformation across various sources and destinations.
Utilizes Apache Spark, a powerful open-source data processing engine, for large-scale data processing.
Experience in analyzing, designing, and developing ETL strategies and processes, and writing ETL specifications.
Proficient knowledge and hands-on experience in writing shell scripts in Linux.
Data lifecycle management: practices, architectural techniques, and tools for managing data throughout its lifecycle, from creation and storage to archiving and deletion.
Troubleshooting and support are essential activities for maintaining the health, performance, and reliability of systems, applications, and networks; effective troubleshooting and support processes help quickly identify and resolve issues, minimizing downtime and improving user satisfaction.
Focused on ensuring data quality, security, availability, and governance.
Knowledge base: a repository of articles, guides, FAQs, and troubleshooting steps to help users resolve common issues independently, continuously updated with new information and solutions.
AWS Identity and Access Management (IAM) is a critical component for securely managing access to AWS resources; it allows administrators to control who can access specific resources, under what conditions, and with what permissions.
Expertise in Hadoop ecosystem tools including HDFS, YARN, MapReduce, Pig, Hive, Sqoop, Flume, Spark, Zookeeper and Oozie.
Experience in data architecture, including data ingestion pipeline design, data modeling, data mining, machine learning and advanced data processing.
Google BigQuery is a fully managed, serverless data warehouse that enables scalable and cost-effective analysis of large datasets using SQL.
ETL (Extract, Transform, Load) development and maintenance involve creating and managing data pipelines that extract data from various sources, transform it to meet business requirements, and load it into target data systems; effective ETL processes are crucial for data warehousing, business intelligence, and analytics.
Remote support tools allow support teams to access and control user systems remotely to diagnose and fix issues, enhancing the efficiency and speed of problem resolution.
Extracted data is transformed to match the target schema and business rules; transformations can include cleaning, deduplication, aggregation, and enrichment.
Amazon Web Services (AWS) is a comprehensive and widely adopted cloud platform offering over 200 fully featured services from data centers globally; AWS is known for its reliability, scalability, and vast array of services that cater to various needs, from computing power to storage, machine learning, and artificial intelligence.
Developed multiple POCs using PySpark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
Proactive support and troubleshooting prevent issues from disrupting operations, leading to increased productivity.
Integrates seamlessly with other Azure services, such as Azure Storage, Azure SQL Database, Azure Synapse Analytics, and more.
Privacera is a unified data governance and security platform that helps organizations manage and secure data across multiple cloud environments; it provides automated data discovery, classification, access control, and monitoring capabilities to ensure data privacy and compliance with regulatory requirements.
Provides timely and reliable data to support business intelligence and analytics, enabling better decision-making.
Proficient in designing, scheduling, and monitoring complex data pipelines and workflows using Apache Airflow.
Experienced in requirement analysis, application development, application migration and maintenance using the Software Development Lifecycle (SDLC) and Python technologies.
SSIS is designed for ETL operations to extract data from various sources, transform it according to business rules, and load it into target databases.
Experience with GCP cloud services such as Dataflow, Dataproc, gsutil, Pub/Sub, BigQuery and GCS buckets.
Hands-on experience in extraction, transformation and loading of data in different file formats such as Text, Avro, Parquet and ORC (a minimal PySpark sketch follows this summary).
Hands-on experience with installation, configuration and deployment of the product on new edge nodes that connect to the Kafka cluster for data acquisition.
Integrated Apache Spark with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive integrated with Spark.
Performance tuning and optimization are crucial aspects of maintaining and enhancing the efficiency and speed of systems, applications, and databases; the goal is to ensure optimal resource utilization, reduce response times, and improve overall system performance.
Extracts data from source systems, transforms it to meet business requirements, and loads it into target systems; commonly used in data warehousing and data migration projects.
Defined user stories and drove the agile board in JIRA during project execution, participating in sprint demos and retrospectives.
By leveraging AWS, organizations can build and deploy applications quickly and securely, taking advantage of the extensive range of services and global infrastructure that AWS provides.
Built and deployed machine learning models using BigQuery ML for predictive analytics.
Supports streaming data into BigQuery in real time, enabling near real-time analytics.
Good interpersonal and communication skills, strong problem-solving skills, able to explore and adopt new technologies with ease, and a good team member.
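Illustrative sketch (not taken from any specific project below): a minimal PySpark job, assuming hypothetical S3 paths and column names, showing the kind of multi-format extraction, transformation and loading described in this summary.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("file-format-etl-sketch").getOrCreate()

    # Hypothetical input: pipe-delimited text files landed in a raw zone.
    raw = (
        spark.read.option("header", True)
        .option("delimiter", "|")
        .csv("s3://raw-zone/transactions/")
    )

    # Light transformation: type the amount column and drop rows missing the key.
    cleaned = (
        raw.withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("transaction_id").isNotNull())
    )

    # Persist the curated data in columnar formats for downstream analytics.
    cleaned.write.mode("overwrite").parquet("s3://curated-zone/transactions_parquet/")
    cleaned.write.mode("overwrite").orc("s3://curated-zone/transactions_orc/")
    # Avro additionally requires the spark-avro package:
    # cleaned.write.format("avro").save("s3://curated-zone/transactions_avro/")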
TECHNICAL SKILLS
Hadoop Ecosystem: HDFS, SQL, YARN, Pig Latin, MapReduce, Hive, Sqoop, Spark, Storm, Zookeeper, Oozie, Kafka, Flume
Programming Languages: Python, Java, PySpark, Scala, Shell Scripting
Big Data Platforms: Hortonworks, Cloudera
AWS/GCP/Azure Platform: EC2, S3, EMR, Redshift, DynamoDB, Aurora, VPC, Glue, Kinesis, Boto3, DataFlow, DataProc, BigQuery, Data Storage, Cloud Function
Operating Systems: Linux, Windows, UNIX
Databases: Netezza, Oracle, MySQL, UDB, HBase, MongoDB, Cassandra, Snowflake
Development Methods: Agile/Scrum, Waterfall
IDEs: PyCharm, IntelliJ, Ambari, DataBricks
Data Visualization: Tableau, BO Reports, Splunk

EDUCATION
Master of Science, Information & Computer Sciences, SOUTHERN ARKANSAS UNIVERSITY, USA
Bachelor of Technology, Electronics & Communication Engineering, JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, INDIA

WORK EXPERIENCE

Data Engineer, Inmar Intelligence, Winston-Salem, NC, June 2021 - Present
Project Description: Inmar Intelligence's customer-centric approach is evident through its success helping companies dynamically engage audiences, build brand loyalty, create efficiencies and drive profitable growth. Working on developing and maintaining the data pipelines to migrate from Hadoop to Google Cloud Platform (GCP) and performing data transformations as required.
Key Contributions:
Experience in building and architecting multiple data pipelines for data ingestion and transformation in GCP using Python and Java programming.
Built programs in Python and executed them in Dataflow to store the data in GCP buckets.
Ran data validation between the source files and BigQuery tables.
Developed and deployed Spark Scala code to run jobs on Dataproc and load the data into BigQuery.
Worked with the Business and Data Quality teams to make sure the data is validated, and made changes to the payload as requested by the Business team.
Terraform enables defining AWS Lambda functions, their configurations, and associated resources in declarative configuration files, and automates the creation, update, and deletion of Lambda functions and associated resources (e.g., IAM roles, API Gateway).
Provides data transformation capabilities using data flows, SQL, and external services such as Azure Databricks and HDInsight.
SSIS provides a graphical design interface, SQL Server Data Tools (SSDT), for creating and managing ETL workflows; it supports a variety of data flow tasks to read from sources, transform data, and write to destinations, and includes built-in transformations such as sorting, aggregating, merging, and data conversion.
An open-source storage layer that brings ACID (atomicity, consistency, isolation, durability) transactions to Apache Spark and big data workloads; it ensures data reliability, improves performance for batch and streaming data, and integrates seamlessly with major cloud service providers such as AWS, Azure, and Google Cloud, as well as popular data storage and processing platforms.
Worked on the Google Pub/Sub messaging system, creating topics and using producers and consumers to ingest data into the application for Spark to process.
Used Dataproc to load incremental and non-incremental data into BigQuery.
Worked on various file formats in Hadoop and GCS such as Avro, ORC and Parquet.
Maintained clear and regular communication with users throughout the troubleshooting and support process.
Performance tuning: the process of adjusting and refining system configurations, code, queries, and hardware to enhance performance, focusing on identifying and eliminating bottlenecks.
Implemented Apache Airflow to automate, schedule and monitor the data pipelines daily (a minimal DAG sketch follows this section).
Working knowledge and experience with cloud technologies such as GCP and AWS.
Attended daily scrum meetings, sprint reviews and sprint planning meetings to provide updates on the assigned tasks and stories in Jira.
Integrates with MLflow for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment.
Optimizes resource usage with auto-scaling and serverless options, ensuring cost-effective operation.
Pre-built ML libraries and tools for model training, tuning, and deployment.
Maintained the project code in GitHub repositories and used Git for software development and version control.
Deployment, scheduling, and execution of SSIS packages, with versioning and configuration management support.
By implementing effective troubleshooting and support processes, organizations can ensure the reliability, efficiency, and satisfaction of their systems and users.
By leveraging Databricks, organizations can accelerate their data-driven initiatives, streamline their workflows, and enhance collaboration across data teams, ultimately driving faster and more accurate insights.
Environment: Hadoop, GCP, AWS, Azure, Hive, Scala, Apache Beam, Spark, Python, SQL, DataFlow, Pub/Sub, DataProc, Airflow, GitHub, Linux, Jira.
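Illustrative sketch of the kind of daily Airflow scheduling mentioned in this role; the DAG id, task names and commands are hypothetical and assume Airflow 2.x.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 1,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_hadoop_to_gcp_migration",   # hypothetical pipeline name
        start_date=datetime(2021, 6, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        extract = BashOperator(
            task_id="extract_from_hdfs",
            bash_command="python extract_hdfs_to_gcs.py --date {{ ds }}",
        )
        load = BashOperator(
            task_id="load_to_bigquery",
            bash_command="python load_gcs_to_bq.py --date {{ ds }}",
        )
        validate = BashOperator(
            task_id="validate_source_vs_bq",
            bash_command="python validate_source_vs_bq.py --date {{ ds }}",
        )

        # Run extraction, then the BigQuery load, then the row-count validation.
        extract >> load >> validate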
Big Data Engineer, Capital One, Chicago, IL, Feb 2021 - June 2021
Project Description: Capital One is one of the largest banking companies in both the US and Canada, handling millions of customers. As most companies are moving their data to the cloud, Capital One has created the OneLake cloud platform, where all the data is stored in the cloud and can be easily accessed by users. We built the data pipelines to move data to the cloud using various tools.
Key Contributions:
Worked on data transformation, responsible for the data development, data analysis, data quality and data maintenance efforts of the application owned by my team.
Migrated an existing mainframe application to AWS; used AWS services such as EC2 and S3 for data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
Performed data cleaning, including transforming variables and dealing with missing values, and ensured data quality, consistency and integrity using Pandas and NumPy (a minimal cleaning sketch follows this section).
Amazon EC2 (Elastic Compute Cloud): scalable virtual servers in the cloud.
AWS Lambda: serverless compute service that runs code in response to events and automatically manages the compute resources.
Helps organizations comply with various data protection regulations such as GDPR, CCPA, HIPAA, and more.
Amazon ECS (Elastic Container Service): fully managed container orchestration service.
Amazon EKS (Elastic Kubernetes Service): managed Kubernetes service for running containerized applications.
Created the metadata in Nebula (Glue) for the data being migrated into OneLake, making sure the columns and data types remain the same.
Automatically scales compute resources up or down based on workload demands, optimizing cost and performance.
Integrates with Azure DevOps and GitHub for version control and continuous integration/continuous deployment (CI/CD) of data pipelines.
Ensured data security through encryption, access control, and secure data transfer.
By implementing effective data management and integration strategies, organizations can ensure that their data is reliable, accessible, and valuable, driving better business outcomes and operational efficiency.
Scheduled and managed SSIS packages using SQL Server Agent for automated workflows.
Worked on building ETL pipelines for data ingestion, data transformation and data validation on AWS, working with data stewards under data compliance requirements.
Performed exploratory data analysis (EDA), feature engineering, and model training using machine learning libraries.
Worked on the Hadoop ecosystem with PySpark on Amazon EMR and Databricks.
Helped business users minimize manual work by creating Python scripts against OneLake, S3, Databricks, Amazon Redshift and Snowflake to gather cloud metrics as daily run jobs.
Populated cloud-based data warehouses such as Azure Synapse Analytics with data from various sources for analytics and reporting.
Migrated data from on-premises databases to cloud-based data warehouses and data lakes.
Experience in developing Spark applications using Spark SQL and PySpark in Databricks for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data.
Used GitHub and Jenkins for CI/CD operations; also maintained Arrow jobs where the daily batch jobs run.
By leveraging Azure Data Factory, organizations can create robust, scalable, and efficient data integration solutions that support a wide range of data scenarios, from simple ETL processes to complex data workflows involving multiple sources and transformations.
Created detailed design documents in Confluence pages for user understanding.
Environment: Python, Java, SQL, HDFS, DataBricks, PySpark, Yarn, Pandas, Numpy, AWS S3, EMR, AWS Redshift, Spectrum, Glue, Snowflake, Jenkins, Arrow.
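Illustrative sketch of the Pandas/NumPy missing-value handling mentioned in this role; the file name, key and column names are hypothetical, and writing Parquet assumes pyarrow is installed.

    import numpy as np
    import pandas as pd

    # Hypothetical input file and columns, used only for illustration.
    df = pd.read_csv("customer_accounts.csv")

    # Standardize obvious placeholder values to NaN before profiling.
    df = df.replace({"": np.nan, "NULL": np.nan, "N/A": np.nan})

    # Basic quality checks: null counts per column and duplicate keys.
    print(df.isna().sum())
    print("duplicate account_id rows:", df.duplicated(subset=["account_id"]).sum())

    # Handle missing values: numeric columns get the median, categorical get a flag.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    df["segment"] = df["segment"].fillna("UNKNOWN")

    # Transform a variable, e.g. log-scale a skewed balance column.
    df["balance_log"] = np.log1p(df["balance"].clip(lower=0))

    # Drop duplicate keys and persist a clean copy for downstream use.
    df.drop_duplicates(subset=["account_id"]).to_parquet("customer_accounts_clean.parquet")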
Big Data Engineer, Pilot Company, Knoxville, TN, Jan 2020 - Jan 2021
Project Description: Pilot Company is one of the leading truck and travel center companies in the US and Canada. To keep up with new cloud systems and data warehouses, the company is moving its on-premises data from Netezza to AWS S3 and converting the existing SQL transformations to PySpark (Python on Spark). Ultimately, the on-premises data will be moved to the AWS cloud and the transformations rewritten for better performance of the company's daily data job executions.
Key Contributions:
Designed, developed and maintained non-production and production transformations in the AWS environment and created data pipelines using PySpark programming.
Uses standard SQL syntax for querying data, making it accessible for users familiar with SQL.
Analyzed the SQL scripts to design and develop the solution to implement in PySpark.
Within an SSIS Data Flow Task, added source components (e.g., OLE DB Source), transformation components (e.g., Data Conversion, Conditional Split), and destination components (e.g., OLE DB Destination).
Met compliance requirements for data protection regulations such as GDPR, CCPA, and HIPAA.
Migrated data from the on-premises Netezza data warehouse to AWS S3 buckets using data pipelines written in PySpark.
Experience with Amazon EMR clusters for setting up Spark configuration and managing Spark jobs.
Deployed SSIS packages to the SSIS catalog on a SQL Server instance.
Used Jupyter Notebook and spark-shell to develop, test and analyze Spark jobs before scheduling the customized Active Batch jobs.
Created individual IAM users for people or services that need access to AWS resources; users can be assigned unique security credentials (e.g., passwords, access keys).
Configured event sources and triggers for the Lambda functions, such as API Gateway, S3, DynamoDB, or CloudWatch Events.
Performed data analysis and data quality checks using Apache Spark machine learning libraries in Python.
Implemented Amazon Redshift, Spectrum and Glue for the migration of the fact and dimension tables to the production environment.
Developed and tested the SQL code in Spectrum, measuring the execution time of each transform and the cost of the data scanned per query.
Enforces data governance policies consistently across different environments and platforms, and integrates with existing data catalogs and governance tools for seamless policy management.
Manages dependencies, such as packaging code and libraries, often using tools like the AWS CLI or Terraform's archive provider.
Integrated Terraform with CI/CD pipelines for automated testing and deployment of infrastructure changes.
Wrote programs in Python to create the external tables in Glue for the respective tables located in S3 buckets, for use in Amazon Redshift Spectrum (a minimal sketch follows this section).
Created Active Batch jobs to automate the PySpark and SQL functions as daily run jobs.
Worked on developing SQL DDL to create, drop and alter tables.
Maintained the project code in GitHub repositories.
Connects to a wide range of data sources such as SQL Server, Oracle, DB2, Excel, flat files, and more, supporting OLE DB, ODBC, ADO.NET, and other connectors.
Centrally managed multiple AWS accounts and applied policies across them using AWS Organizations.
Using Terraform with AWS Lambda provides a robust, scalable, and efficient way to manage serverless applications, enhancing automation, consistency, and productivity.
Documented all the changes implemented in the Redshift SQL code using Confluence and Atlassian Jira, including the technical changes and data schema type changes.
By leveraging IAM effectively, organizations can ensure secure, compliant, and scalable access management to AWS resources, protecting sensitive data and operations from unauthorized access.
Environment: Python, Java, SQL, HDFS, PySpark, Yarn, Pandas, Numpy, SparkML, AWS S3, EMR, AWS Redshift, Spectrum, Glue, Netezza, Active Batch.
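Illustrative sketch of creating a Glue external table over S3 data from Python with boto3, of the kind that Redshift Spectrum can then query; the database, table, columns and S3 location are hypothetical.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Hypothetical database, table, and S3 location names.
    glue.create_table(
        DatabaseName="pilot_edw",
        TableInput={
            "Name": "fact_fuel_sales",
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {"classification": "parquet"},
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "store_id", "Type": "int"},
                    {"Name": "sale_date", "Type": "date"},
                    {"Name": "gallons", "Type": "double"},
                ],
                "Location": "s3://pilot-datalake/fact_fuel_sales/",
                # Standard Hive Parquet input/output formats and serde.
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )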
Data Engineer, USAA, Dallas, TX, Sep 2018 - Jan 2020
Project Description: United Services Automobile Association (USAA) is a diversified financial services group of companies; the group's main business is USAA insurance and banking services. To remain a regulatory-compliant company, USAA adopted a third-party tool for tokenization and detokenization of PCI, PII and PHI sensitive data across various databases, data pipelines and data warehouses. Here we work with different data scientist and application teams on analyzing the sensitive data in flight or the data that already exists in databases, and also support the application flows and databases where the sensitive data is detokenized for business use cases and analysis by the data scientists.
Key Contributions:
Designed robust, reusable and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing and delivery of both structured and unstructured batch and real-time streaming data using Python programming.
Maintained an up-to-date inventory of all data sources and repositories to ensure comprehensive coverage.
Worked on building data warehouse structures, creating fact, dimension and aggregate tables through dimensional modeling with Star and Snowflake schemas.
Applied transformations on the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
Good knowledge of troubleshooting and tuning Spark SQL applications and HiveQL scripts to achieve optimal performance.
Facilitates the management of SSIS packages, including deployment, scheduling, and execution, and supports versioning and configuration management.
Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR and perform the necessary transformations.
Hands-on experience developing UDFs, DataFrames and SQL queries in Spark SQL.
Built real-time streaming data pipelines with Kafka and Spark Streaming (a minimal structured streaming sketch follows this section).
Worked on developing ETL workflows on the data obtained using Python, processing it in Netezza and Oracle using Spark.
Hands-on experience working with Continuous Integration and Deployment (CI/CD) using Jenkins and Docker.
Queried multiple databases such as Snowflake, Netezza, UDB and MySQL for data processing.
Adjusted and refined system configurations, code, queries, and hardware to enhance performance, identifying and eliminating bottlenecks.
By leveraging Google BigQuery, organizations can perform efficient, scalable, and cost-effective data analysis, gaining valuable insights and driving data-driven decision-making.
Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SnowSQL; wrote SQL queries against Snowflake.
Good experience working with analysis tools such as Tableau and Splunk for regression analysis, pie charts and bar graphs.
Environment: Python, Java, Spark, Kafka, Hive, Yarn, Jenkins, Docker, Tableau, Splunk, BO Reports, Netezza, Oracle, UDB, MySQL, Snowflake, IBM Datastage.
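Illustrative sketch of a Kafka-to-storage pipeline with Spark Structured Streaming; the broker, topic and sink paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Requires the spark-sql-kafka connector, e.g.
    # --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>
    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Hypothetical broker address and topic name.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "tokenized-events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast to string before downstream parsing.
    decoded = events.select(
        col("key").cast("string").alias("key"),
        col("value").cast("string").alias("payload"),
        col("timestamp"),
    )

    query = (
        decoded.writeStream.format("parquet")
        .option("path", "s3://stream-landing/events/")                      # hypothetical sink
        .option("checkpointLocation", "s3://stream-landing/_checkpoints/events/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()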
Big Data Engineer/Hadoop Developer, DXC Technology, Austin, TX, Feb 2017 - Sep 2018
Project Description: DXC Technology is a multinational company providing business-to-business IT services across various platforms to its users. The project involved building big data and cloud solution systems for the applications within the organization, building new data pipelines for the migration of the existing data warehouses to cloud-based data warehouses, and supporting and maintaining the existing big data warehouses and data pipelines.
Key Contributions:
Responsible for the design, implementation and architecture of very large-scale data intelligence solutions around big data platforms.
Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Improved the design and architecture of systems, databases, and applications to achieve the best possible performance, as a continuous process that adapts to changing workloads and requirements.
Azure Data Factory is designed for ETL (Extract, Transform, Load) operations in the cloud, enabling data movement and transformation across various sources and destinations.
Worked on SQL queries in dimensional and relational data warehouses; performed data analysis and data profiling using complex SQL queries on various systems.
Good experience in developing multiple Kafka producers and consumers as per business requirements (a minimal producer/consumer sketch follows this section).
Involved in writing SQL queries and performing Spark transformations using Spark RDDs and PySpark.
SSIS packages can be scheduled and managed using SQL Server Agent for automated execution.
Worked on the RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
Wrote programs in Spark using Python, PySpark and the Pandas package for performance tuning, optimization and data quality validations.
Worked on developing Kafka producers and consumers for streaming millions of events per second of streaming data.
SQL Server Profiler: for monitoring and optimizing SQL Server performance.
Oracle AWR (Automatic Workload Repository): for Oracle database performance analysis.
MySQL EXPLAIN: to analyze and optimize MySQL query performance.
Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka.
Worked on Tableau to customize interactive reports, worksheets, and dashboards.
By leveraging Privacera, organizations can enhance their data governance and security posture, ensuring that sensitive data is protected, compliance requirements are met, and data management processes are streamlined and efficient.
Environment: Python, SQL, Web Services, Spark, Kafka, Hive, Yarn, Cassandra, Netezza, Oracle, Tableau, AWS, GitHub, Shell Scripting.
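Illustrative sketch of a Kafka producer and consumer in Python; it assumes the kafka-python client library, and the broker, topic and payload are hypothetical.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Producer: serialize events as JSON and publish to a hypothetical topic.
    producer = KafkaProducer(
        bootstrap_servers="broker1:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    producer.send("orders", {"order_id": 42, "status": "CREATED"})
    producer.flush()

    # Consumer: join a consumer group and deserialize each message payload.
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="broker1:9092",
        group_id="orders-loader",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        # Each record carries topic, partition, offset, and the decoded payload.
        print(message.topic, message.partition, message.offset, message.value)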
Big Data Engineer, TriNet, Austin, TX, Jan 2016 - Feb 2017
Project Description: TriNet provides businesses with HR solutions including payroll, benefits, risk management and compliance, all in one place. Responsible for collecting the data from the various data sources, transforming the data to the required formats, storing the data in various data warehouses, and developing the required code to enable the data pipelines across the various data sources.
Key Contributions:
Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
Converted raw data to serialized data formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency through the network.
Worked on building end-to-end data pipelines on Hadoop data platforms.
Worked on normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
Data Factory seamlessly integrates with the Microsoft SQL Server ecosystem and other data sources, making it versatile for different environments.
Designed, developed and tested Extract, Transform, Load (ETL) applications with different types of sources.
Created files and tuned SQL queries in Hive using HUE; implemented MapReduce jobs in Hive by querying the available data.
Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
Experience with PySpark, using Spark libraries through Python scripting for data analysis.
Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala programming (an equivalent DataFrame sketch follows this section).
Created User Defined Functions (UDFs) and User Defined Aggregate (UDA) functions in Pig and Hive.
Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions; supported the cluster and topics with Kafka Manager; CloudFormation scripting, security and resource automation.
Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, WebServices, Linux RedHat, Unix.
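Illustrative sketch of converting a HiveQL aggregation into an equivalent Spark DataFrame transformation, shown in PySpark rather than Scala for consistency with the other sketches; the table and column names are hypothetical, and Hive support is assumed to be configured.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.appName("hiveql-to-spark-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Original HiveQL (hypothetical table and columns):
    #   SELECT department, COUNT(*) AS employees, AVG(salary) AS avg_salary
    #   FROM hr.payroll
    #   GROUP BY department;
    payroll = spark.table("hr.payroll")

    summary = payroll.groupBy("department").agg(
        F.count("*").alias("employees"),
        F.avg("salary").alias("avg_salary"),
    )
    summary.show()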
