Usage Examples and Tutorials
============================

Droplet-Film Model Development Project

This guide provides practical examples and tutorials for the machine learning methods available in the DFT Development framework. Each method is demonstrated with complete code examples, function calls, and best practices.

DFT Model Tutorial
==================

DFT Implementation Overview
---------------------------

The DFT (Droplet-Film Model) is the core physics-informed model in the framework. It combines fundamental droplet and film flow principles with data-driven optimization to predict critical flow rates in gas wells. The model is interpretable, uses physical parameters, and is a good starting point before trying other ML approaches.

Basic DFT Setup
---------------

.. code-block:: python

   # Import required modules
   import numpy as np
   from sklearn.metrics import mean_squared_error, r2_score
   import matplotlib.pyplot as plt

   # Import DFT framework components (from project src and models)
   from src.dfm_src import DFT
   from models.utils import Helm

DFT Data Preparation
--------------------

Your CSV must include the 10 feature columns below, plus **Qcr** (target), **Gasflowrate**, and **Test status** (used by the data loader for splitting). Load data with **Helm** (same parameter names as in the code):

.. code-block:: python

   # Load and prepare data using Helm
   data_path = "path/to/your/well_data.csv"
   builder = Helm(
       path=data_path,
       drop_cols=None,
       includ_cols=['Dia', 'Dev(deg)', 'Area (m2)', 'z', 'GasDens',
                    'LiquidDens', 'g (m/s2)', 'P/T', 'friction_factor',
                    'critical_film_thickness'],
       test_size=0.2,
       scale=False  # DFT uses unscaled physical features
   )

   # Access training and test data (Helm stores them as attributes)
   X_train = builder.X_train
   X_test = builder.X_test
   y_train = builder.y_train
   y_test = builder.y_test
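``Helm`` is project-specific and cannot be run outside the repository. As a rough stand-in, scikit-learn's ``train_test_split`` reproduces the same 80/20 split behavior on a synthetic table (the column names and values below are illustrative only, not the project's real data):

```python
# Stand-in for Helm's 80/20 split on synthetic data (illustration only;
# Helm itself reads the CSV described above and handles Test status).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "Dia": rng.uniform(0.05, 0.2, n),      # hypothetical feature values
    "GasDens": rng.uniform(10, 100, n),
    "Qcr": rng.uniform(1e3, 1e5, n),       # target column
})

X = df.drop(columns=["Qcr"])
y = df["Qcr"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

The same four attributes (``X_train``, ``X_test``, ``y_train``, ``y_test``) come back, so downstream snippets can be tried against this stand-in.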
DFT Model Training
------------------

.. code-block:: python

   # Create and train the DFT physics model
   dft_model = DFT(
       seed=42,
       feature_tol=1.0,
       dev_tol=1e-3,
       multiple_dev_policy="max"
   )

   # Fit the model (optimizes physics parameters and alpha values)
   dft_model.fit(X_train, y_train)

Making Predictions with DFT
---------------------------

.. code-block:: python

   # Predict critical flow rates on test data
   # The model uses training data internally for alpha assignment
   y_pred = dft_model.predict(X_test)

Evaluating DFT Performance
--------------------------

.. code-block:: python

   # Compute metrics
   mse = mean_squared_error(y_test, y_pred)
   r2 = r2_score(y_test, y_pred)
   print(f"DFT MSE: {mse:.4f}")
   print(f"DFT R²: {r2:.4f}")

   # Optional: plot predictions vs actual
   plt.figure(figsize=(6, 5))
   plt.scatter(y_test, y_pred, alpha=0.6)
   plt.plot([y_test.min(), y_test.max()],
            [y_test.min(), y_test.max()], 'r--', label='Perfect')
   plt.xlabel('Actual Critical Flow Rate')
   plt.ylabel('Predicted Critical Flow Rate')
   plt.title('DFT Model: Predictions vs Actual')
   plt.legend()
   plt.tight_layout()
   plt.show()

Understanding DFT Output (Optional)
-----------------------------------

After fitting, the model stores optimized parameters in ``dft_model.opt_params`` (five global parameters plus one alpha per training sample). The physics equation uses these to balance droplet and film effects. See the API Reference for full details.

XGBoost Tutorial
================

XGBoost Implementation Overview
-------------------------------

XGBoost (Extreme Gradient Boosting) is a gradient boosting framework that delivers strong performance on regression tasks. In the DFT Development framework, XGBoost is used to predict critical flow rates in gas wells with high accuracy, and its feature importances provide a measure of interpretability.
Basic XGBoost Setup
-------------------

.. code-block:: python

   # Import required modules
   import pandas as pd
   import numpy as np
   from sklearn.model_selection import train_test_split
   from sklearn.metrics import mean_squared_error, r2_score
   import xgboost as xgb
   import matplotlib.pyplot as plt

   # Import DFT framework components
   from models.utils import Helm
   from src.dfm_src import DFT

XGBoost Data Preparation
------------------------

.. code-block:: python

   # Load and prepare data using Helm
   # (CSV must include Qcr, Gasflowrate, Test status)
   data_path = "path/to/your/well_data.csv"
   builder = Helm(
       path=data_path,
       includ_cols=['Dia', 'Dev(deg)', 'Area (m2)', 'z', 'GasDens',
                    'LiquidDens', 'g (m/s2)', 'P/T', 'friction_factor',
                    'critical_film_thickness'],
       test_size=0.2,
       scale=True  # XGBoost benefits from scaled features
   )

   # Access data (use _rdy for scaled features when scale=True)
   X_train, X_test = builder.X_train_rdy, builder.X_test_rdy
   y_train, y_test = builder.y_train, builder.y_test

XGBoost Model Training
----------------------

.. code-block:: python

   # Create and configure XGBoost model
   xgb_model = xgb.XGBRegressor(
       n_estimators=1000,
       max_depth=6,
       learning_rate=0.1,
       subsample=0.8,
       colsample_bytree=0.8,
       random_state=42,
       early_stopping_rounds=50,
       eval_metric='rmse'
   )

   # Train the model (early stopping monitors the eval_set)
   xgb_model.fit(
       X_train, y_train,
       eval_set=[(X_test, y_test)],
       verbose=False
   )

   # Make predictions
   y_pred = xgb_model.predict(X_test)

   # Evaluate performance
   mse = mean_squared_error(y_test, y_pred)
   r2 = r2_score(y_test, y_pred)
   print(f"XGBoost MSE: {mse:.4f}")
   print(f"XGBoost R²: {r2:.4f}")
XGBoost Hyperparameter Tuning
-----------------------------

.. code-block:: python

   from sklearn.model_selection import GridSearchCV

   # Define parameter grid for hyperparameter tuning
   param_grid = {
       'n_estimators': [500, 1000, 1500],
       'max_depth': [4, 6, 8],
       'learning_rate': [0.01, 0.1, 0.2],
       'subsample': [0.8, 0.9, 1.0]
   }

   # Perform grid search
   grid_search = GridSearchCV(
       xgb.XGBRegressor(random_state=42),
       param_grid,
       cv=5,
       scoring='neg_mean_squared_error',
       n_jobs=-1
   )
   grid_search.fit(X_train, y_train)

   # Get best parameters
   best_params = grid_search.best_params_
   print(f"Best XGBoost parameters: {best_params}")

   # Train final model with best parameters
   best_xgb_model = grid_search.best_estimator_
   y_pred_best = best_xgb_model.predict(X_test)

XGBoost Feature Importance Analysis
-----------------------------------

.. code-block:: python

   # Get feature importance (Helm stores feature_names)
   feature_importance = best_xgb_model.feature_importances_
   feature_names = builder.feature_names

   # Create feature importance plot
   plt.figure(figsize=(10, 6))
   plt.barh(feature_names, feature_importance)
   plt.xlabel('Feature Importance')
   plt.title('XGBoost Feature Importance')
   plt.tight_layout()
   plt.show()

   # Print feature importance
   for name, importance in zip(feature_names, feature_importance):
       print(f"{name}: {importance:.4f}")

PySINDy Tutorial
================

PySINDy Implementation Overview
-------------------------------

PySINDy (Python Sparse Identification of Nonlinear Dynamics) discovers governing equations from data using sparse regression. In the DFT context, PySINDy can identify the underlying physics equations that govern liquid loading in gas wells.
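Tree-based importances can be misleading when features are noisy or correlated; permutation importance is a model-agnostic cross-check. A minimal, self-contained sketch on synthetic data (feature names here are invented for illustration):

```python
# Permutation importance as a cross-check on model-internal importances
# (synthetic data; scores measure the drop in R² when a column is shuffled).
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# f0 dominates, f2 matters a little, f1 is pure noise
y = 5.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
for name, score in zip(["f0", "f1", "f2"], result.importances_mean):
    print(f"{name}: {score:.4f}")
```

In the framework, the same call would take ``best_xgb_model`` and the test split in place of the toy model and data.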
PySINDy Setup and Installation
------------------------------

.. code-block:: python

   # Install PySINDy if not already installed
   # pip install pysindy

   import numpy as np
   import matplotlib.pyplot as plt
   from sklearn.preprocessing import StandardScaler
   from pysindy import SINDy
   from pysindy.optimizers import STLSQ
   from pysindy.feature_library import PolynomialLibrary

PySINDy Data Preparation
------------------------

.. code-block:: python

   # Prepare data for PySINDy (requires time series or derivative data)
   # For DFT, we create a synthetic time series from the well parameters
   def create_synthetic_time_series(data, n_timesteps=100):
       """Create synthetic time series for PySINDy analysis."""
       time_series = []
       for _, row in data.iterrows():
           # Create a time axis for each well
           t = np.linspace(0, 10, n_timesteps)

           # Simulate critical flow rate evolution
           q_crit = (row['Dia'] * row['GasDens']
                     * np.sqrt(row['g (m/s2)'] * row['z'])
                     / (row['friction_factor'] * row['critical_film_thickness']))

           # Add some dynamics
           q_t = q_crit * (1 + 0.1 * np.sin(t)
                           + 0.05 * np.random.normal(0, 1, n_timesteps))

           # Calculate derivatives
           dq_dt = np.gradient(q_t, t)

           time_series.append({
               'time': t,
               'q_crit': q_t,
               'dq_dt': dq_dt,
               'well_params': row
           })
       return time_series

   # Create synthetic time series
   time_series_data = create_synthetic_time_series(X_train.head(10))
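The derivative step above leans on ``np.gradient``, which uses central differences in the interior and one-sided differences at the endpoints. Its behavior is easy to verify on a signal whose derivative is known exactly:

```python
# Sanity check for the numerical derivative used in the data preparation:
# d/dt sin(t) = cos(t), so the np.gradient result should track cos(t) closely.
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
q = np.sin(t)
dq_dt = np.gradient(q, t)   # numerical derivative

max_err = np.max(np.abs(dq_dt - np.cos(t)))
print(f"max |error|: {max_err:.2e}")
```

The largest error appears at the endpoints, where only one-sided differences are available; interior points are second-order accurate.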
PySINDy Model Training
----------------------

.. code-block:: python

   # Prepare data for PySINDy
   X_sindy = []
   X_dot_sindy = []

   for ts in time_series_data:
       # Use well parameters as features
       features = np.array([ts['well_params'][col] for col in X_train.columns])
       X_sindy.append(features)
       # Use the mean derivative as the target
       X_dot_sindy.append(ts['dq_dt'].mean())

   X_sindy = np.array(X_sindy)
   X_dot_sindy = np.array(X_dot_sindy)

   # Scale the data
   scaler = StandardScaler()
   X_sindy_scaled = scaler.fit_transform(X_sindy)

   # Create PySINDy model
   optimizer = STLSQ(threshold=0.1)
   feature_library = PolynomialLibrary(degree=2)
   sindy_model = SINDy(
       optimizer=optimizer,
       feature_library=feature_library,
       feature_names=X_train.columns.tolist()
   )

   # Fit the model; pass the precomputed derivatives via the x_dot keyword
   # (the second positional argument of SINDy.fit is the time vector, not x_dot)
   sindy_model.fit(X_sindy_scaled, x_dot=X_dot_sindy.reshape(-1, 1))

   # Print discovered equations
   print("PySINDy discovered equations:")
   sindy_model.print()

PySINDy Model Evaluation
------------------------

.. code-block:: python

   # Make predictions
   X_test_scaled = scaler.transform(X_test)
   predictions = sindy_model.predict(X_test_scaled)

   # Evaluate performance
   mse_sindy = mean_squared_error(y_test, predictions.flatten())
   r2_sindy = r2_score(y_test, predictions.flatten())
   print(f"PySINDy MSE: {mse_sindy:.4f}")
   print(f"PySINDy R²: {r2_sindy:.4f}")

   # Plot results
   plt.figure(figsize=(12, 5))

   plt.subplot(1, 2, 1)
   plt.scatter(y_test, predictions.flatten(), alpha=0.6)
   plt.plot([y_test.min(), y_test.max()],
            [y_test.min(), y_test.max()], 'r--')
   plt.xlabel('Actual Critical Flow Rate')
   plt.ylabel('Predicted Critical Flow Rate')
   plt.title('PySINDy Predictions vs Actual')

   plt.subplot(1, 2, 2)
   plt.plot(y_test.values[:50], label='Actual', alpha=0.7)
   plt.plot(predictions.flatten()[:50], label='PySINDy', alpha=0.7)
   plt.xlabel('Sample Index')
   plt.ylabel('Critical Flow Rate')
   plt.title('PySINDy Time Series Prediction')
   plt.legend()

   plt.tight_layout()
   plt.show()

PySR Tutorial
=============

PySR Implementation Overview
----------------------------

PySR (Python Symbolic Regression) uses genetic programming to find symbolic expressions that fit data. It is particularly useful for discovering interpretable mathematical relationships in the DFT domain.

PySR Setup and Installation
---------------------------

.. code-block:: python

   # Install PySR if not already installed
   # pip install pysr

   import pysr
   import numpy as np
   from sklearn.preprocessing import StandardScaler

PySR Data Preparation
---------------------

.. code-block:: python

   # PySR works best with normalized data
   scaler_pysr = StandardScaler()
   X_train_scaled = scaler_pysr.fit_transform(X_train)
   X_test_scaled = scaler_pysr.transform(X_test)

   # Create feature names for PySR
   feature_names = [f"x{i}" for i in range(X_train.shape[1])]

PySR Model Training
-------------------

.. code-block:: python

   # Configure PySR
   pysr_model = pysr.PySRRegressor(
       niterations=100,             # number of search iterations
       binary_operators=["+", "-", "*", "/"],
       unary_operators=["exp", "log", "abs", "sqrt"],
       populations=30,              # number of populations
       ncyclesperiteration=550,     # cycles per iteration
       maxsize=20,                  # maximum expression size
       maxdepth=10,                 # maximum expression depth
       early_stop_condition=1e-6,   # stop early once the loss is low enough
       timeout_in_seconds=300,      # hard time limit
       random_state=42
   )

   # Train PySR model
   print("Training PySR model...")
   pysr_model.fit(X_train_scaled, y_train)

   # Print the best equation
   print(f"PySR best equation: {pysr_model.get_best()}")
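PySR's evolutionary search is far more sophisticated, but the loop it automates — proposing candidate expressions and keeping the one that best fits the data — can be illustrated with a fixed candidate set. This is purely didactic (the candidate list and names are invented), in plain NumPy:

```python
# Toy "symbolic regression": score a fixed set of candidate expressions
# by mean squared error and keep the best (PySR evolves candidates instead).
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.5, 2.0, 200)
y = x ** 2 + 0.01 * rng.normal(size=200)   # hidden ground truth: y = x²

candidates = {
    "x": lambda v: v,
    "x^2": lambda v: v ** 2,
    "exp(x)": lambda v: np.exp(v),
    "1/x": lambda v: 1.0 / v,
}
scores = {name: np.mean((f(x) - y) ** 2) for name, f in candidates.items()}
best = min(scores, key=scores.get)
print(f"best expression: {best}")
```

The genetic-programming machinery in PySR replaces the fixed dictionary with mutation and crossover over expression trees, but the selection criterion is the same fit-versus-complexity trade-off.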
PySR Model Evaluation
---------------------

.. code-block:: python

   # Make predictions
   y_pred_pysr = pysr_model.predict(X_test_scaled)

   # Evaluate performance
   mse_pysr = mean_squared_error(y_test, y_pred_pysr)
   r2_pysr = r2_score(y_test, y_pred_pysr)
   print(f"PySR MSE: {mse_pysr:.4f}")
   print(f"PySR R²: {r2_pysr:.4f}")

   # Plot results
   plt.figure(figsize=(12, 5))

   plt.subplot(1, 2, 1)
   plt.scatter(y_test, y_pred_pysr, alpha=0.6)
   plt.plot([y_test.min(), y_test.max()],
            [y_test.min(), y_test.max()], 'r--')
   plt.xlabel('Actual Critical Flow Rate')
   plt.ylabel('Predicted Critical Flow Rate')
   plt.title('PySR Predictions vs Actual')

   plt.subplot(1, 2, 2)
   plt.plot(y_test.values[:50], label='Actual', alpha=0.7)
   plt.plot(y_pred_pysr[:50], label='PySR', alpha=0.7)
   plt.xlabel('Sample Index')
   plt.ylabel('Critical Flow Rate')
   plt.title('PySR Time Series Prediction')
   plt.legend()

   plt.tight_layout()
   plt.show()

QLattice Tutorial
=================

QLattice Implementation Overview
--------------------------------

QLattice is a symbolic regression tool that discovers mathematical expressions using quantum-inspired algorithms. It is particularly effective for finding interpretable models in scientific applications.

QLattice Setup and Installation
-------------------------------

.. code-block:: python

   # Install QLattice (the feyn package) if not already installed
   # pip install feyn

   import feyn
   import pandas as pd
   import numpy as np
   from sklearn.model_selection import train_test_split

QLattice Data Preparation
-------------------------

.. code-block:: python

   # QLattice works with pandas DataFrames
   qlattice_data = pd.DataFrame(X_train, columns=X_train.columns)
   qlattice_data['critical_flow_rate'] = y_train

   # Split data
   train_data, val_data = train_test_split(
       qlattice_data, test_size=0.2, random_state=42
   )
QLattice Model Training
-----------------------

.. code-block:: python

   # Connect to a QLattice (feyn >= 3.0 API)
   ql = feyn.QLattice(random_seed=42)

   # Run the sample-and-fit loop; auto_run returns models sorted by loss
   print("Training QLattice model...")
   models = ql.auto_run(
       data=train_data,
       output_name='critical_flow_rate',
       kind='regression',
       n_epochs=10,
       max_complexity=10
   )

   # The first model in the returned list is the best one found
   best_model = models[0]
   print(f"QLattice best model: {best_model}")

QLattice Model Evaluation
-------------------------

.. code-block:: python

   # Make predictions (feyn models predict from a DataFrame with named columns)
   y_pred_qlattice = best_model.predict(X_test)

   # Evaluate performance
   mse_qlattice = mean_squared_error(y_test, y_pred_qlattice)
   r2_qlattice = r2_score(y_test, y_pred_qlattice)
   print(f"QLattice MSE: {mse_qlattice:.4f}")
   print(f"QLattice R²: {r2_qlattice:.4f}")

   # Plot results
   plt.figure(figsize=(12, 5))

   plt.subplot(1, 2, 1)
   plt.scatter(y_test, y_pred_qlattice, alpha=0.6)
   plt.plot([y_test.min(), y_test.max()],
            [y_test.min(), y_test.max()], 'r--')
   plt.xlabel('Actual Critical Flow Rate')
   plt.ylabel('Predicted Critical Flow Rate')
   plt.title('QLattice Predictions vs Actual')

   plt.subplot(1, 2, 2)
   plt.plot(y_test.values[:50], label='Actual', alpha=0.7)
   plt.plot(y_pred_qlattice[:50], label='QLattice', alpha=0.7)
   plt.xlabel('Sample Index')
   plt.ylabel('Critical Flow Rate')
   plt.title('QLattice Time Series Prediction')
   plt.legend()

   plt.tight_layout()
   plt.show()

Ensemble Methods Tutorial
=========================

Ensemble Implementation Overview
--------------------------------

Ensemble methods combine multiple models to improve prediction accuracy and robustness. In the DFT framework, physics-based models can be combined with machine learning approaches.
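Why combining models helps can be seen directly: averaging reduces variance when the models' errors are decorrelated. A self-contained sketch with two hypothetical "models" whose errors are independent (illustration only, not the framework's estimators):

```python
# Averaging two predictors with independent errors roughly halves the MSE.
import numpy as np

rng = np.random.default_rng(3)
truth = np.zeros(10_000)

# Two noisy stand-in 'models' with independent unit-variance errors
pred_a = truth + rng.normal(scale=1.0, size=truth.size)
pred_b = truth + rng.normal(scale=1.0, size=truth.size)
pred_avg = 0.5 * (pred_a + pred_b)   # simple averaging ensemble

def mse(p):
    return np.mean((p - truth) ** 2)

print(f"A: {mse(pred_a):.3f}  B: {mse(pred_b):.3f}  avg: {mse(pred_avg):.3f}")
```

``VotingRegressor`` applies exactly this unweighted averaging to its fitted estimators' predictions; the benefit shrinks as the base models' errors become correlated.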
Ensemble Model Creation
-----------------------

.. code-block:: python

   from sklearn.ensemble import VotingRegressor
   from sklearn.linear_model import LinearRegression
   import numpy as np

   # Create an ensemble of different models
   ensemble_model = VotingRegressor([
       ('xgb', best_xgb_model),
       ('linear', LinearRegression()),
       ('dft', DFT(seed=42, feature_tol=1.0, dev_tol=1e-3))
   ])

   # Train ensemble
   ensemble_model.fit(X_train, y_train)

   # Make predictions
   y_pred_ensemble = ensemble_model.predict(X_test)

   # Evaluate performance
   mse_ensemble = mean_squared_error(y_test, y_pred_ensemble)
   r2_ensemble = r2_score(y_test, y_pred_ensemble)
   print(f"Ensemble MSE: {mse_ensemble:.4f}")
   print(f"Ensemble R²: {r2_ensemble:.4f}")

Model Comparison
----------------

.. code-block:: python

   # Compare all models
   models = {
       'XGBoost': y_pred_best,
       'PySINDy': predictions.flatten(),
       'PySR': y_pred_pysr,
       'QLattice': y_pred_qlattice,
       'Ensemble': y_pred_ensemble
   }

   # Create comparison plot (loop variable named y_hat to avoid
   # clobbering the PySINDy `predictions` array used above)
   plt.figure(figsize=(15, 10))
   for i, (name, y_hat) in enumerate(models.items(), 1):
       plt.subplot(2, 3, i)
       plt.scatter(y_test, y_hat, alpha=0.6)
       plt.plot([y_test.min(), y_test.max()],
                [y_test.min(), y_test.max()], 'r--')
       plt.xlabel('Actual Critical Flow Rate')
       plt.ylabel('Predicted Critical Flow Rate')
       plt.title(f'{name} Predictions vs Actual')

       # Calculate and display metrics
       mse = mean_squared_error(y_test, y_hat)
       r2 = r2_score(y_test, y_hat)
       plt.text(0.05, 0.95, f'MSE: {mse:.4f}\nR²: {r2:.4f}',
                transform=plt.gca().transAxes,
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

   plt.tight_layout()
   plt.show()

Advanced Usage Examples
=======================
Custom Alpha Strategy Implementation
------------------------------------

.. code-block:: python

   def custom_alpha_strategy(well_data):
       """Custom alpha strategy based on well deviation angle."""
       alpha_values = []
       for _, row in well_data.iterrows():
           dev_angle = row['Dev(deg)']
           if dev_angle < 30:    # near-vertical wells
               alpha = 0.3
           elif dev_angle < 60:  # inclined wells
               alpha = 0.5
           else:                 # near-horizontal wells
               alpha = 0.7
           alpha_values.append(alpha)
       return np.array(alpha_values)

   # Use custom alpha strategy
   custom_alpha = custom_alpha_strategy(X_train)

Production Deployment Example
-----------------------------

.. code-block:: python

   import joblib
   from datetime import datetime

   # Save trained models for production use
   model_artifacts = {
       'xgb_model': best_xgb_model,
       'sindy_model': sindy_model,
       'pysr_model': pysr_model,
       'qlattice_model': best_model,
       'ensemble_model': ensemble_model,
       'scaler': scaler,
       'feature_names': X_train.columns.tolist(),
       'training_date': datetime.now().isoformat()
   }

   # Save to disk
   joblib.dump(model_artifacts, 'dft_models.pkl')

   # Load models for prediction
   loaded_models = joblib.load('dft_models.pkl')

   # Make predictions with loaded models
   def predict_critical_flow_rate(well_data, model_name='ensemble'):
       """Predict critical flow rate for new well data."""
       model = loaded_models[f'{model_name}_model']
       scaler = loaded_models['scaler']

       # Scale the data
       well_data_scaled = scaler.transform(well_data)

       # Make prediction
       prediction = model.predict(well_data_scaled)
       return prediction

   # Example usage
   new_well_data = X_test.iloc[:1]  # single well
   prediction = predict_critical_flow_rate(new_well_data, 'xgb')
   print(f"Predicted critical flow rate: {prediction[0]:.4f}")
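The joblib save/load round trip can be verified outside the project with any scikit-learn estimator. This self-contained sketch (the file name and metadata keys are hypothetical) checks that predictions survive serialization:

```python
# Round-trip check: persist a fitted model with joblib, reload it, and
# confirm the reloaded model predicts identically (illustrative artifact dict).
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a tiny model on exact linear data: y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

artifacts = {"model": model, "feature_names": ["x0"]}
path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")
joblib.dump(artifacts, path)

# Reload and predict; for x = 100 the fitted line gives 2*100 + 1 = 201
loaded = joblib.load(path)
pred = loaded["model"].predict([[100.0]])
print(f"prediction: {pred[0]:.1f}")
```

The same pattern scales to the full ``model_artifacts`` dictionary above; joblib handles NumPy arrays inside estimators more efficiently than plain pickle.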
Batch Processing Example
------------------------

.. code-block:: python

   def batch_predict_critical_flow_rates(well_data_batch, model_name='ensemble'):
       """Process multiple wells in batch."""
       predictions = []
       for well_data in well_data_batch:
           prediction = predict_critical_flow_rate(well_data, model_name)
           predictions.append(prediction[0])
       return np.array(predictions)

   # Process multiple wells
   batch_wells = [X_test.iloc[i:i+1] for i in range(10)]
   batch_predictions = batch_predict_critical_flow_rates(batch_wells, 'xgb')
   print(f"Batch predictions: {batch_predictions}")

Performance Optimization
========================

Memory-Efficient Processing
---------------------------

.. code-block:: python

   def process_large_dataset(data_path, chunk_size=1000):
       """Process large datasets in chunks to manage memory."""
       results = []
       for chunk in pd.read_csv(data_path, chunksize=chunk_size):
           # Process one chunk at a time
           chunk_predictions = predict_critical_flow_rate(chunk, 'xgb')
           results.extend(chunk_predictions)
       return np.array(results)

Parallel Processing
-------------------

.. code-block:: python

   from joblib import Parallel, delayed

   def parallel_predict(well_data_list, model_name='xgb', n_jobs=4):
       """Parallel prediction for multiple wells."""
       predictions = Parallel(n_jobs=n_jobs)(
           delayed(predict_critical_flow_rate)(well_data, model_name)
           for well_data in well_data_list
       )
       return np.array(predictions)

This tutorial has walked through each machine learning method in the DFT Development framework — the DFT physics model, XGBoost, PySINDy, PySR, QLattice, and ensembles — with practical function calls and real-world usage scenarios.
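As a closing aside, the chunked pattern from Memory-Efficient Processing can be exercised end-to-end with an in-memory CSV. The predictor is replaced here by a trivial stand-in (``process_in_chunks`` is illustrative, not part of the framework):

```python
# End-to-end demo of pandas chunked reading: each chunk is processed
# independently, so memory stays bounded regardless of file size.
import io

import numpy as np
import pandas as pd

# Small CSV in memory standing in for a large file on disk
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

def process_in_chunks(buffer, chunk_size=4):
    """Stand-in for the chunked predictor: sums a+b per row, chunk by chunk."""
    results = []
    for chunk in pd.read_csv(buffer, chunksize=chunk_size):
        results.extend((chunk["a"] + chunk["b"]).tolist())
    return np.array(results)

out = process_in_chunks(io.StringIO(csv_text))
print(out)
```

Swapping the row-sum for ``predict_critical_flow_rate`` recovers ``process_large_dataset`` above; the chunk size then trades memory against per-chunk overhead.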