RAHEEM SD
Phone: PHONE NUMBER AVAILABLE
Email: EMAIL AVAILABLE

PROFESSIONAL SUMMARY:
- Over 10 years of IT development experience, including experience in the data engineering ecosystem and related technologies.
- Good understanding of Apache Spark, Kafka, Storm, Talend, RabbitMQ, Elasticsearch, Apache Solr, Splunk, and BI tools such as Tableau.
- Knowledge of Hadoop administration activities using Cloudera Manager and Apache Ambari.
- Experience working with Cloudera, Amazon Web Services (AWS), Microsoft Azure, and Hortonworks.
- Worked on import and export of data between RDBMS and HDFS using Sqoop.
- Good knowledge of containers, Docker, and Kubernetes as the runtime environment for CI/CD systems to build, test, and deploy.
- Integrated BigQuery with other Google Cloud Platform services (e.g., Dataflow, Dataprep, and Data Studio) and third-party tools for data integration, processing, visualization, and reporting.
- Created machine learning models with Python and scikit-learn (an illustrative sketch follows this summary).
- Utilized Apache Airflow to design, schedule, and monitor complex data pipelines, ensuring timely execution and error handling.
- Designed and implemented Elasticsearch indexes to efficiently store and retrieve structured and unstructured data, enabling fast and accurate search capabilities.
- Identified and resolved performance bottlenecks through query profiling, index optimization, and shard management, improving search responsiveness and resource utilization.
- Hands-on experience loading data (log files, XML, JSON) into HDFS using Flume/Kafka.
- Used Python packages such as NumPy, pandas, Matplotlib, and Plotly for exploratory data analysis.
- Hands-on experience with cloud technologies such as Azure HDInsight, Azure Data Lake, AWS EMR, Athena, Glue, and S3.
- Good knowledge of Apache NiFi for automating data movement between Hadoop systems.
- Experience in performance tuning using partitioning, bucketing, and indexing in Hive.
- Experience with software development tools such as JIRA, Git, and SVN.
- Designed and optimized relational database schemas in MySQL for efficient storage, retrieval, and querying of structured data.
- Developed APIs to facilitate seamless integration of AI models into production environments.
- As an Azure DevSecOps engineer, created and maintained self-hosted Azure DevOps organizations.
- Worked on containerization as a group effort to optimize the CI/CD workflow.
- Experience with SSIS project-based deployments in the Azure cloud.
- Proficient with data modeling tools such as ERwin and PowerDesigner to create and manage complex data models.
- Implemented robust automated testing using Python frameworks such as pytest and Robot Framework, ensuring comprehensive software quality assurance.
- Capable of using Amazon S3 for data transfer over SSL, with data encrypted automatically once uploaded.
- Familiar with dbt (data build tool) for managing and orchestrating data transformation workflows, enhancing data quality and maintainability.
- Developed interactive dashboards and visualizations using Amazon QuickSight to provide real-time insights into business performance metrics.
- Direct experience developing microservices, REST APIs, and web services, with proficiency in .NET, Azure, and Ocelot.
- Designed serverless applications using AWS Lambda to reduce infrastructure management overhead.
- Developed SSIS packages to automate data extraction from various banking systems, ensuring data accuracy and consistency.
- Extensive experience in Java SE and Java EE, including features of the latest Java versions.
- Proficient in creating compelling data visualizations using tools such as Tableau and Power BI.
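An illustrative sketch of the pandas/scikit-learn modeling workflow referenced above; the input file, feature columns, and target label are hypothetical placeholders rather than details from any specific engagement.

    # Hypothetical example: file name, columns, and target are placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("transactions.csv")                              # hypothetical input extract
    df = df.dropna(subset=["amount", "account_age_days", "label"])    # basic cleanup before modeling

    X = df[["amount", "account_age_days"]]                            # hypothetical feature columns
    y = df["label"]                                                   # hypothetical binary target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))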
TECHNICAL SKILLS:
Hadoop and Big Data Technologies: HDFS, MapReduce, Flume, Sqoop, Pig, Hive, Morphline, Kafka, Oozie, Spark, NiFi, Zookeeper, Elasticsearch, Tableau, Apache Solr, Snowflake, Talend, Cloudera Manager, R Studio, Confluent, Grafana
NoSQL: HBase, Couchbase, MongoDB, Cassandra
Programming and Scripting Languages: C, SQL, Python, C++, Shell scripting, R
Web Services: XML, SOAP, REST API
Databases: Oracle, DB2, MS SQL Server, MySQL, MS Access, Teradata
Web Development Technologies: JavaScript, CSS, CSS3, HTML, HTML5, Bootstrap, XHTML, jQuery, PHP
Operating Systems: Windows, Unix (Red Hat Linux, CentOS, Ubuntu), macOS
IDE/Development Tools: Eclipse, NetBeans, IntelliJ, R Studio
Build Tools: Maven, Scala Build Tool (SBT), Ant

EDUCATION:
- Bachelor's in Computer Science, 2013
- Master's in Computer Science, Northwest Missouri State University, 2017

PROFESSIONAL EXPERIENCE:

Sr. Data Engineer
Capital One, Plano, TX, Nov 2021 - Present
Responsibilities:
- Involved in analyzing business requirements and preparing detailed specifications that follow the project guidelines required for development.
- Responsible for data extraction and data ingestion from different data sources into S3 by creating ETL pipelines using Spark and Hive (an illustrative sketch follows this section).
- Designed and optimized database schemas in ClickHouse for efficient storage and querying of analytical data.
- Used PySpark for DataFrames, ETL, data mapping, transformation, and loading in a complex, high-volume environment.
- Designed and executed regression testing suites to ensure that changes to existing code or data pipelines do not introduce unintended consequences or errors.
- Implemented data validation and verification checks to ensure data completeness, consistency, and accuracy throughout the data lifecycle.
- Monitored Kubernetes clusters and workloads using built-in metrics, logs, and third-party monitoring solutions such as Prometheus and Grafana.
- Designed and developed ETL jobs and workflows using AWS Glue's serverless, fully managed data integration service.
- Built robust ETL processes with Django, transforming raw data into valuable insights for business intelligence and analytics.
- Integrated Lambda functions with other AWS services such as API Gateway, S3, DynamoDB, SQS, SNS, and RDS.
- Worked extensively with PySpark/Spark SQL for data cleansing and generating DataFrames and RDDs.
- Coordinated with other team members to write and generate test scripts and test cases for numerous user stories.
- Collaborated with data scientists and software engineers to integrate generative AI solutions into existing systems and applications.
- Implemented NLP models for text generation, achieving human-like fluency and coherence in generated content.
- Designed and optimized data schemas in Google BigQuery for efficient storage, querying, and analysis of large-scale datasets.
- Loaded data into Amazon Redshift from various sources, such as Amazon S3, RDS, DynamoDB, and other relational databases, using COPY commands, AWS Glue, or other ETL tools.
- Defined and scheduled Hadoop MapReduce, Hive, Pig, and Spark jobs using Oozie's XML-based workflow definitions.
- Defined task dependencies and workflows using Luigi's Python-based API to orchestrate ETL processes.
- Built efficient Docker images by leveraging multi-stage builds, optimizing layer caching, and minimizing image size.
- Created and maintained MySQL database instances, configured server settings, and managed database users and permissions.
- Configured and managed PostgreSQL instances, including installation, configuration, tuning, and monitoring for optimal performance and reliability.
- Designed and orchestrated complex ETL workflows using Directed Acyclic Graphs (DAGs) in Apache Airflow.
- Used pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse.
- Worked on AWS EMR clusters for processing big data across a Hadoop cluster of virtual servers.
- Developed Spark programs for batch processing.
- Developed Spark code in Python (PySpark/Spark SQL) for faster testing and processing of data.
- Involved in design and analysis of issues, providing solutions and workarounds to users and end clients.
- Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations, and store curated, business-ready datasets in the Snowflake analytical environment.
- Developed functionality to perform auditing and threshold checks for error handling, enabling smoother and easier debugging and data profiling.
- Integrated AWS Glue with AWS services (e.g., S3, Redshift, RDS, DynamoDB) for seamless data integration and processing.
- Developed custom Luigi tasks and targets to interact with data sources, filesystems, and external services.
- Built a data quality framework to run data rules that generate reports and send daily email notifications of business-critical successful and failed jobs to business users.
- Used Spark to build tables that require multiple computations and non-equi joins.
- Scheduled various Spark jobs to run daily and weekly.
- Configured and managed Apache Airflow clusters, schedulers, executors, and workers for optimal performance and scalability.
- Developed RESTful APIs with Django REST Framework (DRF) to facilitate data access and manipulation across various services and applications.
- Created ETL processes using Python APIs, AWS Glue, Terraform, and GitLab to consume data from different source systems (Smartsheet, Quickbase, Google Sheets) into Snowflake.
- Modeled Hive partitions extensively for faster data processing.
- Configured Oozie coordinators and workflows to orchestrate ETL tasks and dependencies in Hadoop ecosystems.
- Experience with the Alation data catalog for metadata management and governance, facilitating data discovery and collaboration across the organization.
- Integrated ClickHouse with other data processing and visualization tools (e.g., Apache Kafka, Apache Spark, Grafana) for data ingestion, processing, and analysis.
- Used Bitbucket to collaborate with other team members.
- Involved in Agile methodologies, daily scrum meetings, and sprint planning.
Environment: Apache Airflow, Spark, Hive, ClickHouse, AWS Glue, AWS S3, AWS Redshift, AWS EMR, Luigi, Docker, MySQL, PostgreSQL, pandas, Snowflake, AWS services (S3, Redshift, RDS, DynamoDB), Apache Kafka, Grafana, Python, Terraform, GitLab, Oozie, Bitbucket, Agile methodologies.
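An illustrative PySpark sketch of the kind of S3 ETL step described in this role (read raw JSON, cleanse, write curated Parquet); the bucket names, paths, and column names are assumed placeholders.

    # Minimal PySpark sketch of an S3 curation step; all names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-curation-sketch").getOrCreate()

    # Read raw JSON from a landing bucket (s3:// works on EMR via EMRFS; use s3a:// elsewhere).
    raw = spark.read.json("s3://raw-bucket/events/")

    cleansed = (
        raw.dropDuplicates(["event_id"])                      # hypothetical business key
           .filter(F.col("event_ts").isNotNull())             # drop records missing a timestamp
           .withColumn("event_date", F.to_date("event_ts"))   # derive a partition column
    )

    # Write the curated dataset to a curated zone, partitioned by date.
    (cleansed.write
             .mode("overwrite")
             .partitionBy("event_date")
             .parquet("s3://curated-bucket/events/"))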
Sr. Data Engineer
Citizens Bank, Rhode Island (Remote), Sep 2018 - Oct 2021
Responsibilities:
- Developed and managed Azure Data Factory pipelines that extracted data from various data sources, transformed it according to business rules, and loaded it into an Azure SQL database.
- Developed and managed Databricks Python scripts that utilized PySpark and consumed APIs to move data into an Azure SQL database.
- Created a new data quality check framework project in Python that utilized pandas (an illustrative sketch follows this section).
- Implemented source control and development environments for Azure Data Factory pipelines utilizing Azure Repos.
- Optimized dashboard performance and user experience by applying best practices in data visualization design and layout.
- Implemented advanced analytical functions and calculations within QuickSight to uncover trends, patterns, and outliers in large datasets.
- Configured Luigi schedulers, workers, and resources to optimize task execution and resource utilization.
- Integrated Airflow with cloud services (e.g., AWS, GCP, Azure) for seamless data ingestion, processing, and storage.
- Implemented Docker in CI/CD pipelines to automate testing, building, and deploying containerized applications with tools such as Jenkins or GitLab CI/CD.
- Monitored and managed Oozie job execution, logs, and job history to track workflow progress and performance.
- Utilized Amazon Redshift Spectrum to query data directly from Amazon S3, enabling seamless integration with data lakes and external storage.
- Provided documentation, training, and support to users and teams on PostgreSQL best practices, troubleshooting, and performance optimization.
- Collaborated with external vendors and partners to integrate third-party data sources and tools, expanding the capabilities and functionality of the QuickSight platform.
- Implemented data governance policies and security measures within QuickSight to ensure compliance with regulatory requirements and protect sensitive information.
- Utilized Apex to implement various enhancements in Salesforce.
- Created and managed AWS Glue crawlers to automatically discover, catalog, and infer schemas from various data sources.
- Created Power BI datamarts and reports for various stakeholders in the business.
- Developed and maintained ETL packages in the form of IBM DataStage jobs and SSIS packages.
- Debugged and troubleshot Kubernetes deployments using kubectl commands, container logs, and cluster events to identify and resolve issues.
- Loaded data into BigQuery from various sources, such as Google Cloud Storage, Cloud SQL, Google Sheets, and other databases, using ingestion methods such as batch loading, streaming inserts, or Dataflow pipelines.
- Designed and coded a utility in C# and .NET to create a reports dashboard from ETL metrics tables.
- Designed, tuned, and documented an SSAS cube that acted as an OLAP data source for a dashboard tracking process guidance concepts throughout the company, using Visual Studio 2010.
- Created a Microsoft Excel file that acted as an aggregate report on company vacation hours using the PowerPivot plugin.
- Implemented machine learning models in Python to perform predictive analytics for risk assessment and fraud detection in financial transactions.
- Developed a Windows PowerShell script that acted as an email error-notification system for an internal ETL management framework for SSIS packages.
- Extended Airflow's functionality with plugins and extensions to support specialized use cases and industry-specific requirements.
- Experienced in the creation of a data lake by extracting data from various sources with different file formats (JSON, Parquet, CSV) and RDBMS into Azure Data Lake.
- Integrated MySQL with other data storage and processing systems (e.g., Hadoop, Spark, Kafka) using connectors and APIs for data ingestion and processing.
- Ingested data into Azure Storage, Azure SQL, and Azure DW, and processed the data in Azure Databricks.
- Worked on the design and creation of complex SSIS/DTSX packages.
- Developed data links and data loading scripts, including data transformations, reconciliations, and accuracy checks.
- Worked on several Azure storage systems, such as Blob Storage and Data Lake.
- Skilled in conducting in-depth data analysis using Tableau, including trend analysis, forecasting, cohort analysis, and segmentation to identify patterns, trends, and outliers in data sets.
- Integrated Oozie with Hadoop ecosystem components (e.g., HDFS, YARN, Hive, and Spark) for data processing and analytics.
- Developed and managed data ingestion pipelines with Kinesis Data Streams, ensuring reliable and scalable data flow from various sources.
- Converted older SSIS ETL packages to a new design that made use of a custom ETL management framework.
- Applied data visualization techniques appropriate to each project, such as bar charts, line charts, scatter plots, histograms, and pie charts, to effectively represent different types of data and insights.
- Worked on parameterized reports, linked reports, ad hoc reports, drill-down reports, and sub-reports; designed and developed the stored procedures, queries, and views necessary to support SSRS reports.
Environment: Azure Data Factory, Databricks, PySpark, Luigi, Airflow, Docker, AWS Glue, Amazon Redshift Spectrum, PostgreSQL, QuickSight, ClickHouse, Power BI, IBM DataStage, SSIS, Kubernetes, BigQuery, SSAS, PowerPivot, PowerShell, T-SQL, Azure Synapse, Azure Data Lake, MySQL, Tableau, Oozie, Snowflake, SSRS.
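An illustrative pandas sketch in the spirit of the data quality check framework mentioned in this role; the column names, file path, and specific checks are assumed placeholders.

    # Hypothetical data quality checks; column names and source path are placeholders.
    import pandas as pd

    def run_checks(df: pd.DataFrame) -> dict:
        """Return a mapping of check name -> pass/fail."""
        return {
            "no_null_account_ids": bool(df["account_id"].notna().all()),
            "unique_transaction_ids": bool(df["transaction_id"].is_unique),
            "amounts_non_negative": bool((df["amount"] >= 0).all()),
        }

    df = pd.read_parquet("staging/transactions.parquet")    # hypothetical staging extract
    results = run_checks(df)
    failed = [name for name, ok in results.items() if not ok]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")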
Azure Data Engineer
AT&T, Dallas, TX, Dec 2016 - Jul 2018
Responsibilities:
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data.
- Developed Spark code in Python in Databricks notebooks.
- Performed performance tuning of Sqoop, Hive, and Spark jobs.
- Implemented a proof of concept to analyze streaming data using Apache Spark with Python; used Maven/SBT to build and deploy the Spark programs.
- Monitored Luigi task execution, progress, and dependencies to track workflow status and performance.
- Created an Application Interface Document for the downstream team to create a new interface to transfer and receive files through Azure Data Share.
- Configured and orchestrated AWS Glue jobs and triggers to automate data transformation and loading processes.
- Utilized PostgreSQL's support for foreign data wrappers (FDWs) to integrate with external data sources and databases for federated querying and data access.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Automated deployment and version control of Airflow DAGs using CI/CD pipelines and source code management tools.
- Wrote and optimized SQL queries in BigQuery, leveraging features such as standard SQL, nested and repeated fields, and user-defined functions (UDFs) to transform and analyze data.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks (an illustrative sketch follows this section).
- Implemented data quality checks, error handling, and retries to ensure reliability and integrity in Oozie workflows.
- Provided training, documentation, and support to users and stakeholders on using Apache Airflow effectively for ETL orchestration.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Optimized AWS Glue job performance and resource utilization by tuning parameters, partitioning data, and optimizing queries.
- Worked on complex U-SQL for data transformation, table loading, and report generation.
- Used Docker containers with Docker Swarm or Kubernetes to automate deployment, scaling, and load balancing.
- Involved in developing the Confidential data lake and building the Confidential data cube on a Microsoft Azure HDInsight cluster.
- Designed and published Power BI visualizations and dashboards to various business teams for business use and decision making.
- Integrated Docker with version control systems (e.g., Git) to manage Dockerfile and Docker Compose configurations.
- Integrated Luigi with cloud platforms and storage services for seamless data integration and processing.
- Created a linked service to land data from the Caesars SFTP location to Azure Data Lake.
- Used Git to update existing versions of the Hadoop PySpark script to its new model.
Environment: Spark SQL, Python, Teradata, Sqoop, Hive, Apache Spark, Maven, SBT, ETL, Luigi, Azure Data Share, Talend, Linux, AWS Glue, Amazon Redshift, Kubernetes, CI/CD, Jenkins, GitLab CI/CD, ArgoCD, PostgreSQL, ADF, PySpark, Databricks, Airflow, BigQuery, Tableau, Scala, Spark Streaming, Oozie, Talend ESB, Azure HDInsight, Spark Core, Power BI, Docker, Azure Data Lake, Snowflake DB, Informix, Sybase, Git.
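An illustrative sketch of micro-batch streaming analytics of the kind described in this role, written here with Spark Structured Streaming rather than the legacy DStream API; the source path, schema, and console sink are assumed placeholders.

    # Hypothetical Structured Streaming job: counts by status over 5-minute windows.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("event_ts", TimestampType()),
        StructField("status", StringType()),
    ])

    # Stream JSON files as they land in a (hypothetical) mounted landing path.
    events = spark.readStream.schema(schema).json("/mnt/landing/events/")

    counts = (events
              .withWatermark("event_ts", "10 minutes")
              .groupBy(F.window("event_ts", "5 minutes"), "status")
              .count())

    # Console sink keeps the sketch self-contained; a real job would write to a table or queue.
    query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()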
Python/Hadoop Developer
Couth InfoTech Pvt. Ltd., Hyderabad, India, Apr 2015 - Jul 2016
Responsibilities:
- Developed Spark SQL applications, migrated big data from Teradata to Hadoop, and reduced memory utilization in Teradata analytics.
- Gathered requirements and led the team in developing the big data environment and migrating Spark ETL logic.
- Involved in requirement gathering from business analysts and participated in discussions with users and functional analysts on business logic implementation.
- Responsible for end-to-end design and development on Spark SQL to meet requirements; advised the business on Spark SQL best practices while making sure the solution met business needs.
- Debugged Docker containers using logs, container inspection, and remote debugging tools to troubleshoot issues and errors.
- Provided documentation and training to users, including user guides, documentation manuals, and training sessions, to ensure effective utilization of Tableau dashboards.
- Configured Kubernetes storage solutions such as persistent volume claims (PVCs) and storage classes to provide persistent storage for stateful applications.
- Implemented Amazon Redshift security best practices, including IAM authentication, VPC security groups, and encryption at rest, to protect sensitive data and ensure compliance with regulatory requirements.
- Implemented data quality checks, data validation rules, and error handling mechanisms in AWS Glue scripts and jobs.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Created views on top of Hive tables and provided them to customers for analytics.
- Analyzed the Hadoop cluster and different big data analytic tools, including Pig, HBase, and Sqoop.
- Worked with Linux systems and RDBMS databases on a regular basis to ingest data using Sqoop.
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile, and network devices using Apache tooling, and stored the data in HDFS for analysis.
- Developed UNIX shell scripts to load large numbers of files into HDFS from the Linux file system.
- Developed custom input formats in MapReduce jobs to handle custom file formats and convert them into key-value pairs.
- Involved in creating Hive tables, loading data, writing Hive queries, and generating partitions and buckets for optimization.
Environment: Python, Hadoop, Spark SQL, Power BI, Teradata, Docker, Tableau, Kubernetes, Oozie, Amazon Redshift, PostgreSQL, Hive, AWS Glue, Sqoop, Pig, HBase, Linux, RDBMS, Apache, Hortonworks Data Platform, MapReduce, UNIX shell scripting.

SQL/SSIS Developer
Adran, Hyderabad, India, Jun 2013 - Mar 2015
Responsibilities:
- Created SSIS packages to import and export data from Excel and text files, and transformed data from various data sources using OLE DB connections (an illustrative Python analogue follows this section).
- Designed and implemented ETL processes based on SQL, T-SQL, stored procedures, triggers, views, tables, user-defined functions, and security using SQL Server 2012 and SQL Server Integration Services.
- Wrote triggers and stored procedures to capture updated and deleted data from OLTP systems.
- Identified slow-running queries, optimized stored procedures, and tested applications for performance and data integrity using SQL Profiler.
- Implemented query caching and result caching in BigQuery to accelerate query execution and reduce costs for repeated queries.
- Worked extensively with the SSIS tool suite; designed and created mappings using various SSIS transformations such as OLE DB Command, Conditional Split, Lookup, Aggregate, Multicast, and Derived Column.
- Configured the SQL mail agent to send automatic emails on errors, and developed complex reports using multiple data providers, defined objects, and charts.
- Queried and manipulated multidimensional cube data through MDX scripting.
- Developed drill-through, drill-down, parameterized, cascaded, and sub-reports using SSRS.
- Responsible for ongoing maintenance and change management of existing reports and for optimizing SSRS report performance.
Environment: SSIS packages, SQL Server 2012, SQL Server Integration Services (SSIS), SQL Server, T-SQL, stored procedures, triggers, SQL Profiler, BigQuery, OLE DB, SSIS transformations (OLE DB Command), SQL mail agent, MDX scripting, SSAS (SQL Server Analysis Services), SSRS (SQL Server Reporting Services), KPIs.
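An illustrative Python analogue (pandas + SQLAlchemy) of the Excel-to-SQL Server loads described in the SSIS role above; the original work used SSIS, and the file name, target table, and connection string here are assumed placeholders.

    # Hypothetical Excel-to-SQL Server staging load; names and credentials are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:password@server/StagingDB?driver=ODBC+Driver+17+for+SQL+Server"
    )

    df = pd.read_excel("daily_extract.xlsx")          # hypothetical source workbook (requires openpyxl)
    df["load_ts"] = pd.Timestamp.now(tz="UTC")        # simple audit column
    df.to_sql("stg_daily_extract", engine, if_exists="append", index=False)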