Data Engineer Resume, Austin, TX
NAME: Candidate's Name
ROLE: DATA ENGINEER
PHONE: PHONE NUMBER AVAILABLE
EMAIL: EMAIL AVAILABLE
LINKEDIN URL: LINKEDIN LINK AVAILABLE
PROFESSIONAL SUMMARY:
Data Engineer with 9+ years of experience delivering data-driven solutions that increase the efficiency, accuracy, and utility of internal data processing. Designed and engineered reusable components and frameworks for data ingestion, cleansing, and data quality (a brief PySpark sketch of this pattern appears at the end of this summary). Leveraged technologies such as Apache Spark, Python, and Scala within environments including Databricks, Azure Data Lake, Azure Data Factory, HDInsight, Amazon EMR, S3, AWS Glue, Snowflake, and AWS Redshift.
- Extensive experience analyzing, developing, managing, and implementing stand-alone and client-server enterprise applications in Python, mapping business requirements to systems.
- Well versed in the ETL/ELT processes used to load and update Teradata and Oracle data warehouses.
- Skilled in engineering and optimizing distributed data processing pipelines using Scala within the Hadoop ecosystem, applying CAP theorem, partitioning, and replication principles; strong expertise in integrating HBase for real-time data storage and retrieval to drive efficient, scalable large-scale analytics.
- Developed Scaladoc-style documentation for Scala code, improving code clarity and team collaboration.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience developing web applications using Python, C++, XML, CSS, and HTML.
- Experience analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, and applying machine learning concepts.
- Extensive experience with Amazon Web Services (EC2, S3, SimpleDB, RDS, Redshift, Elastic Load Balancing, Elasticsearch, Lambda, SQS, IAM, EBS, and CloudFormation).
- Experience with Snowflake multi-cluster warehouses and Snowflake virtual warehouses.
- Managed Amazon Web Services (AWS) infrastructure with orchestration tools such as Terraform and Jenkins pipelines.
- Experience building ETL pipelines using NiFi; performed ETL operations using Informatica.
- Developed data ingestion modules (both real-time and batch loads) to load data into various layers in S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions.
- Experienced with various Python IDEs, including PyCharm, PyScripter, PyDev, NetBeans, and Sublime Text.
- Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, and DynamoDB by installing and configuring the relevant Python packages.
- Good knowledge of writing tests such as unit tests with Pytest and integrating them into builds.
- Experienced with version control systems such as Git, GitHub, CVS, and SVN to keep code versions and configurations organized.
- Experience developing SOAP, REST, and RESTful APIs with Python, implementing OAuth for secure authentication and adhering to industry API standards to ensure robust, scalable services.
- Experienced in writing SQL queries, stored procedures, functions, packages, tables, views, and triggers in relational databases such as Oracle, DB2, MySQL, PostgreSQL, and MS SQL Server.
- Experience creating Spark applications using PySpark, Spark SQL, and Spark APIs including RDDs, MLlib, GraphX, and Streaming for data extraction, transformation, and aggregation from multiple source systems, uncovering insights into customer usage patterns.
- Created a Logic App workflow to read data from SharePoint and store it in a blob container.
- Experience with code analysis and code management in Azure Databricks.
- Experience creating secrets and accessing key vaults for database and SharePoint credentials.
- Created pipelines, datasets, and linked services in Azure Data Factory.
- Automated deployment using Data Factory's integration with Azure Pipelines.
- Experience with Google Cloud components, Google container builders, GCP client libraries, and the Cloud SDK.
- Hands-on experience creating solution-driven dashboards with chart types including pie charts, bar charts, line charts, scatter plots, and histograms in Tableau Desktop, QlikView, and Power BI.
- Developed Python scripts to automate the data sampling process.
- Hands-on experience writing, testing, and implementing cursors, procedures, functions, triggers, and packages at the database level using PL/SQL.
- Application development with Oracle Forms and Reports, OBIEE, Discoverer, Report Builder, and ETL development.
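
Illustrative sketch only (not the candidate's production code) of the reusable ingestion/cleansing pattern described in this summary, written in PySpark; the bucket paths, column names, and validity rule are assumptions for illustration:

    # Minimal sketch of a reusable cleansing step (hypothetical paths and columns).
    from functools import reduce
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cleanse-example").getOrCreate()

    def cleanse(df, key_cols):
        """Trim string columns, drop duplicate keys, and flag rows missing any key."""
        for name, dtype in df.dtypes:
            if dtype == "string":
                df = df.withColumn(name, F.trim(F.col(name)))
        missing_any_key = reduce(lambda a, b: a | b,
                                 [F.col(c).isNull() for c in key_cols])
        return df.dropDuplicates(key_cols).withColumn("is_valid", ~missing_any_key)

    raw = spark.read.parquet("s3a://example-bucket/raw/orders/")      # placeholder path
    cleanse(raw, ["order_id"]).write.mode("overwrite").parquet(
        "s3a://example-bucket/curated/orders/")                       # placeholder path

Keeping the cleansing rules in one function like this is what makes the step reusable across feeds: each pipeline only supplies its own source path and key columns.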
TECHNICAL SKILLS:
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Spark Streaming, Kafka, PySpark, Airflow, Pig, Hive, Flume, YARN, Oozie, ETL
Scripting Languages: HTML5, CSS3, C, C++, XML, SAS, Scala, Python, Shell Scripting, SQL
Databases: Microsoft SQL Server, MySQL, Oracle, Teradata
NoSQL Databases: Cassandra, HBase, MongoDB
ETL Tools: Informatica
Development Tools: Microsoft SQL Studio, Azure Databricks, Eclipse, NetBeans
Operating Systems: Windows (all versions), UNIX, Linux
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Power BI, Tableau
Cloud Technologies: AWS, Azure, GCP
PROFESSIONAL EXPERIENCE:

Client: Comcast, NJ                                                                     Nov 2020 - Present
Role: GCP Data Engineer
Responsibilities:
- Wrote scripts using PySpark, Scala, and Java within the Spark framework to utilize RDDs/DataFrames and facilitate advanced data analytics.
- Leveraged Scala and HBase to enhance data management in Hadoop clusters, focusing on building and deploying Spark applications that performed advanced analytics, data normalization, and machine learning tasks while ensuring seamless interaction with HBase for fast data processing.
- Installed and configured Apache Airflow for workflow management and created workflows in Python and Java.
- Developed pipelines for auditing the metrics of all applications using GCP Cloud Functions and Dataflow for a pilot project.
- Involved in building data models and dimensional modeling with 3NF, Star, and Snowflake schemas for OLAP and operational data store applications, leveraging HBase for efficient data management.
- Developed and deployed outcomes using Spark and Scala code in Hadoop clusters running on GCP.
- Developed an end-to-end pipeline that exports data from Parquet files in Cloud Storage to GCP Cloud SQL.
- Implemented Spark SQL to access Hive tables from Spark for faster data processing, applying partitioning and bucketing strategies to optimize storage and query performance across large datasets.
- Used Hive for transformations, joins, filters, and pre-aggregations before storing the data.
- Performed data visualization on selected datasets with PySpark in Jupyter notebooks.
- Utilized Apache Spark with Python and Scala to develop and execute Big Data Analytics and Machine Learning applications.
- Configured Airflow DAGs for various data feeds, validated and visualized data using Tableau, and created and validated Hive views with Hue to ensure data accuracy and effective reporting.
- Worked with GCP services including Cloud Storage, Dataproc, Dataflow, and BigQuery, alongside AWS EMR, S3, Glacier, and EC2 instances with EMR clusters.
- Used Kafka for data ingestion across different datasets and leveraged Apache NiFi to automate the movement and transformation of data between systems, ensuring real-time data integration and efficient processing.
- Imported and exported data into HDFS and assisted in exporting analyzed data to an RDBMS using Sqoop.
- Wrote complex Spark applications to de-normalize datasets and create a unified data analytics layer for downstream teams.
- Involved in creating Hive scripts for ad hoc data analysis required by the business teams.
- Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, using operators such as the Bash operator, Hadoop operators, and Python callable and branching operators (see the sketch at the end of this section).
- Migrated an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Airflow jobs.
- Created two GCP instances, one for development and one for production.
- Worked with Autosys to schedule Oozie workflows.
- Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
- Used SVN for branching, tagging, and merging.
- Developed Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
- Validated source files for data integrity and data quality by reading header and trailer information and performing column validations.
- Stored data files in Google Cloud Storage buckets on a daily basis and used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Wrote SQL scripts to identify data mismatches and loaded historical data from Teradata to Snowflake.
- Developed Python scripts to automate the data sampling process.
- Developed and deployed Big Data Analytics and Machine Learning applications using Apache Spark with Python and Scala, including implementing solutions with Spark ML and MLlib in Hadoop clusters on GCP.
- Built data pipelines and visualizations by importing data with Sqoop into Hadoop and creating visual tools using Pyplot, NumPy, Pandas, and SciPy in Jupyter Notebook.
Environment: GCP, Airflow, Snowflake, Hive, PySpark, Python, SQL, Oozie, Spark, NumPy, Pandas, SciPy, Teradata, HDFS, Pig, Scala, HBase, Hadoop, Tableau, Oracle, BigQuery, Kafka.
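
A hedged sketch of the Cloud Composer (Airflow) orchestration referenced above, showing the Bash, Python, and branching operator types named in this role; the DAG id, schedule, dataset/table, and bq command are illustrative assumptions, and the import paths assume Airflow 2.x:

    # Illustrative Composer/Airflow DAG sketch (placeholder names throughout).
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator, BranchPythonOperator

    def _check_feed(**context):
        # Placeholder branch logic: choose the load path when the feed is expected.
        return "load_to_bigquery" if context["ds"] else "notify_missing_feed"

    def _notify(**context):
        print(f"feed missing for {context['ds']}")

    with DAG(
        dag_id="example_feed_audit",          # illustrative DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        validate = BranchPythonOperator(task_id="validate_feed",
                                        python_callable=_check_feed)
        load = BashOperator(
            task_id="load_to_bigquery",
            # Placeholder: load staged Parquet files from Cloud Storage into BigQuery.
            bash_command=("bq load --source_format=PARQUET analytics.feed_metrics "
                          "gs://example-bucket/feeds/{{ ds }}/*.parquet"),
        )
        alert = PythonOperator(task_id="notify_missing_feed",
                               python_callable=_notify)
        validate >> [load, alert]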

Client: JP Morgan, NJ                                                                   Mar 2020 - Oct 2022
Role: Azure Data Engineer
Responsibilities:
- Used Azure SQL Data Warehouse and Azure SQL to create and implement database solutions.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Architected and deployed large-scale BI applications using Hadoop and HBase within the Azure Data Platform, leveraging Scala for efficient data processing and integration with Azure services such as Data Lake and Databricks.
- Used Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB) to architect and deploy medium- to large-scale BI applications.
- Integrated HBase with Hadoop to manage large-scale data storage and retrieval, ensuring efficient and scalable data processing within the Hadoop ecosystem for advanced analytics tasks.
- Designed and deployed typical system migration procedures on Azure (Lift and Shift/Azure Migrate and other third-party solutions).
- Gathered requirements from business users, built visualizations, and trained users on self-service BI technologies.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Set up a new Kafka cluster by installing the necessary Kafka components, running performance tests, and configuring security and monitoring.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, supporting workflow management and automation with Airflow.
- Pulled data into Power BI from a variety of sources, including SQL Server, Excel, Oracle, and SQL Azure.
- Participated in the SaaS vendor RFI/RFP and vendor selection process by mapping product features to WB users' business functions and processes and performing vendor reference checks; converted existing invoices and matters to the Council Link SaaS system.
- Designed, installed, configured, and integrated enterprise monitoring for the PaaS cloud environment.
- Proposed architectures that account for Azure cost/spend and made recommendations for appropriately sized data infrastructure.
- Developed Spark programs using Scala to compare the performance of Spark with Hive and Spark SQL.
- Developed a Spark streaming application to consume JSON messages from Kafka and perform transformations (see the sketch at the end of this section).
- Involved in developing a MapReduce framework that filters bad and unnecessary records.
- Ingested data from RDBMS sources and performed data transformations, then exported the transformed data to Cassandra, Apache HBase, and Apache Kudu, ensuring optimized storage and quick access for analytics.
- Optimized data processing pipelines to handle large volumes of data from various structured and unstructured sources.
- Addressed performance and scaling issues by implementing strategies such as partitioning, bucketing, and data indexing within Apache Spark and Hadoop environments.
- Collaborated with cross-functional teams to triage and resolve performance bottlenecks, ensuring scalable and efficient data solutions.
- Evaluated Snowflake design considerations for any change in the application.
- Built logical and physical data models for Snowflake to reflect required changes, defining virtual warehouse sizing for different types of workloads.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics using Hive.
- Exported the analyzed data to relational databases using Sqoop for further visualization and report generation by the BI team.
- Integrated HBase with Hadoop to enable real-time data processing and analytics, utilizing Scala to write performance-optimized code for data transformations and implementing robust data management practices across the Hadoop ecosystem.
- Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text files and Parquet.
- Developed Python scripts to clean the raw data.
Environment: SQL, Kafka, Python, Snowflake, Visualization, NoSQL, SQL Server, Excel, Oracle, SQL Azure, Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, Azure Data Factory, Apache Airflow, SaaS, PaaS.
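
A minimal PySpark Structured Streaming sketch of the Kafka-to-Spark flow noted in this role; the broker address, topic name, message schema, and output paths are assumptions for illustration, not the production job:

    # Hedged sketch of a Kafka JSON stream with a simple aggregation.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("kafka-json-stream").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Parse JSON messages from a placeholder "payments" topic.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
              .option("subscribe", "payments")                    # placeholder topic
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Transformation: hourly totals, written out as Parquet micro-batches.
    hourly = (events
              .withWatermark("event_time", "1 hour")
              .groupBy(F.window("event_time", "1 hour"))
              .agg(F.sum("amount").alias("total_amount")))

    (hourly.writeStream
     .outputMode("append")
     .format("parquet")
     .option("path", "/tmp/out/payments")                 # placeholder sink
     .option("checkpointLocation", "/tmp/chk/payments")   # placeholder checkpoint
     .start())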

Client: Walgreens Boots Alliance, IL                                                    Oct 2018 - Feb 2020
Role: AWS Data Engineer
Responsibilities:
- Day-to-day responsibilities included developing ETL pipelines into and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.
- Performed ETL by collecting, exporting, merging, and massaging data from multiple sources and platforms, including SSRS/SSIS (SQL Server Integration Services) in SQL Server.
- Worked with cross-functional teams (including the data engineering team) to extract data rapidly from MongoDB through the MongoDB connector.
- Worked on automating the provisioning of AWS cloud resources using CloudFormation for ticket routing techniques.
- Worked with Amazon Redshift tools such as SQL Workbench/J, pgAdmin, DBHawk, and SQuirreL SQL.
- Designed DAGs using Airflow to process the dimension tables and staged them as Presto views.
- Designed and developed Airflow DAGs to retrieve data from Amazon S3 and built an ETL pipeline using PySpark to process that data and build the dimensions (see the sketch at the end of this section).
- Involved in migrating data from Merck Pharma on-premises systems to Amazon Web Services (AWS).
- Developed, tested, and deployed Python scripts to create Airflow DAGs.
- Integrated with Databricks using Airflow operators to run notebooks on a scheduled basis.
- Wrote scripts using Python (or Go) and worked with tools including AWS Lambda, AWS S3, AWS EC2, AWS Redshift, and PostgreSQL on AWS.
- Built Docker images to run Airflow in a local environment to test ingestion and ETL pipelines.
- Used Amazon ECS (Elastic Container Service) to run ECS clusters on AWS Fargate, providing serverless compute capacity for containers.
- Implemented Spark streaming with Kafka to pick up data from Kafka and send it to the Spark pipeline.
- Developed Python scripts to schedule each dimension process as a task and to set dependencies between them.
- Worked on cloud computing infrastructure (Amazon Web Services EC2) and considerations for scalable, distributed systems.
Environment: ETL, SQL Server, AWS, Amazon ECS, Amazon S3, Airflow, Snowflake, Databricks, MongoDB, Python, AWS Cloud, Tableau, Spark, SQL, PySpark, Kafka.
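
A hedged sketch of the S3-to-dimension PySpark step described above; the bucket, column names, and dimension layout are assumptions for illustration rather than the actual Walgreens model:

    # Illustrative PySpark job that builds a dimension from an S3 extract.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("build-store-dim").getOrCreate()

    # Read the raw extract that an upstream Airflow task staged in S3 (placeholder path).
    raw = spark.read.option("header", True).csv("s3://example-bucket/raw/stores/")

    store_dim = (raw
                 .select(F.col("store_id").cast("int"),
                         F.initcap("store_name").alias("store_name"),
                         F.upper("state").alias("state"),
                         F.to_date("opened_on").alias("opened_on"))
                 .dropDuplicates(["store_id"])
                 .withColumn("load_ts", F.current_timestamp()))

    # Write the dimension partitioned by state for downstream Presto views.
    (store_dim.write
     .mode("overwrite")
     .partitionBy("state")
     .parquet("s3://example-bucket/dims/store_dim/"))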

Client: AT&T, Dallas, TX                                                                 Feb 2017 - Sep 2018
Role: AWS Data Engineer
Responsibilities:
- Responsible for tuning ETL procedures and schemas to optimize load and query performance.
- Designed and set up an enterprise data lake to support use cases including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
- Responsible for maintaining quality reference data in source systems by performing operations such as cleaning and transformation and ensuring integrity in a relational environment, working closely with stakeholders and the solution architect.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig, and MapReduce access for new users.
- Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.
- Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, enabling automatic suggestions, using Kinesis Firehose and an S3 data lake.
- Used AWS EMR to securely transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
- Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources (see the sketch at the end of this section).
- Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).
- Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD 2 date chaining and duplicate cleanup.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python.
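
A hedged Boto3 sketch of the AMI-cleanup Lambda idea described above: it deregisters account-owned AMIs that no EC2 instance in the current region references. The "self"-owned filter and the single-region scope are assumptions, and a real cleanup would likely also check launch templates and other regions before deregistering:

    # Illustrative Lambda handler; runs against the region of the invoking Lambda.
    import boto3

    def lambda_handler(event, context):
        ec2 = boto3.client("ec2")

        # AMIs owned by this account.
        images = ec2.describe_images(Owners=["self"])["Images"]

        # AMIs referenced by instances in this region.
        in_use = set()
        for page in ec2.get_paginator("describe_instances").paginate():
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    in_use.add(instance["ImageId"])

        deregistered = []
        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])

        return {"deregistered": deregistered}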

Client: Webmaxy, Pune                                                                    July 2012 - Dec 2015
Role: Spark/Data Engineer
Responsibilities:
- Used Hive queries in Spark SQL for analysis and processing of data (see the sketch at the end of this section).
- Wrote shell scripts that run multiple Hive jobs to automate incremental loads of different Hive tables, which are used to generate various reports in Tableau for business use.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
- Involved in business analysis and technical design sessions with business and technical staff to develop the requirements document and ETL design specifications.
- Wrote complex SQL scripts to avoid Informatica lookups and improve performance, as the data volume was heavy.
- Responsible for the design, development, and data modeling of Spark SQL scripts based on functional specifications.
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle.
- Developed Autosys scripts to schedule Kafka streaming and batch jobs.
- Worked on scalable distributed computing systems, software architecture, data structures, and algorithms using Hadoop, Apache Spark, and Apache Storm.
- Ingested streaming data into Hadoop using Spark, the Storm framework, and Scala.
- Responsible for building scalable distributed data solutions using Hadoop and migrating legacy applications to Hadoop.
- Wrote Spark code in Scala to connect to HBase and read/write data to HBase tables.
- Extracted data from different databases and copied it into HDFS using Sqoop, with expertise in using compression techniques to optimize data storage.
Environment: Spark, Kafka, Hadoop, Sqoop, HDFS, Oracle, SQL Server, MongoDB, Python, Scala, Shell Scripting, Tableau, MapReduce, Oozie, Pig, Hive.
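
The work in this role was done in Scala; the PySpark sketch below shows the same Hive-via-Spark-SQL pattern for consistency with the other sketches. The database, table, and column names are placeholders:

    # Illustrative Spark SQL query over Hive tables, persisted back for reporting.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-report")
             .enableHiveSupport()
             .getOrCreate())

    # Aggregate a Hive table and persist the result back to Hive for Tableau reports.
    daily_sales = spark.sql("""
        SELECT sale_date, region, SUM(amount) AS total_amount, COUNT(*) AS txn_count
        FROM sales_db.transactions
        WHERE sale_date >= date_sub(current_date(), 30)
        GROUP BY sale_date, region
    """)

    daily_sales.write.mode("overwrite").saveAsTable("sales_db.daily_sales_report")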
