U-SQL 入門⑩ ~ 結合 ~

U-SQL では以下の結合が利用可能です。すべての JOIN において行セットの変数を指定することも SELECT 文を記述することもできます。

  • CROSS JOIN
  • INNER JOIN
  • OUTER JOIN
  • SEMIJOIN
  • ANTISEMIJOIN

CROSS JOIN

行セットのデカルト積を返します。

@employees = SELECT *
               FROM (VALUES
                      ("Rafferty", (int?) 31)
                    , ("Jones", (int?) 33)
                    , ("Heisenberg", (int?) 33)
                    , ("Robinson", (int?) 34)
                    , ("Smith", (int?) 34)
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);

@departments = SELECT *
                FROM (VALUES
                       ((int) 31, "Sales")
                     , ((int) 33, "Engineering")
                     , ((int) 34, "Clerical")
                     , ((int) 35, "Marketing")) AS D(DepID, DepName);

@rs_cross = 
    SELECT e.EmpName, e.DepID, d.DepName
    FROM @employees AS e
         CROSS JOIN @departments AS d
    WHERE ((d.DepName == "Engineering") || (d.DepName == "Marketing"));

OUTPUT @rs_cross
TO "/Output/ReferenceGuide/Joins/CrossJoins/ExampleA.csv"
USING Outputters.Csv(outputHeader : true);

出力結果

EmpName DepID DepName
Rafferty 31 Engineering
Jones 33 Engineering
Heisenberg 33 Engineering
Robinson 34 Engineering
Smith 34 Engineering
Williams Engineering
Rafferty 31 Marketing
Jones 33 Marketing
Heisenberg 33 Marketing
Robinson 34 Marketing
Smith 34 Marketing
Williams Marketing

INNER JOIN

それぞれの行セットの指定したキーのカラムの値が一致するデータだけを返します。こちらの例では INNER JOIN にサブクエリを指定しています。

@employees = SELECT *  
               FROM (VALUES   
                      ("Rafferty", (int?) 31)  
                    , ("Jones", (int?) 33)  
                    , ("Heisenberg", (int?) 33)  
                    , ("Robinson", (int?) 34)  
                    , ("Smith", (int?) 34)  
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);  

@departments = SELECT *  
                FROM (VALUES  
                       ((int) 31, "Sales")  
                     , ((int) 33, "Engineering")  
                     , ((int) 34, "Clerical")  
                     , ((int) 35, "Marketing")) AS D(DepID, DepName);  

@rs_inner =   
    SELECT e.DepID AS EmpDepID, d.DepID, e.EmpName, d.DepName       
    FROM @employees AS e  
         INNER JOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d   
         ON e.DepID == d.DepID;  

OUTPUT @rs_inner   
TO "/output/rsInnerJoin.csv"  
USING Outputters.Csv(outputHeader : true);  

出力結果

EmpDepID DepID EmpName DepName
31 31 Rafferty Sales
33 33 Jones Engineering
33 33 Heisenberg Engineering
34 34 Robinson Clerical
34 34 Smith Clerical

OUTER JOIN

指定したキーでそれぞれの行セットを結合します。キーが一致しない場合でも、指定した側のデータは維持されます。

LEFT OUTER JOIN の例
@employees = SELECT *
               FROM (VALUES
                      ("Rafferty", (int?) 31)
                    , ("Jones", (int?) 33)
                    , ("Heisenberg", (int?) 33)
                    , ("Robinson", (int?) 34)
                    , ("Smith", (int?) 34)
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);

@departments = SELECT *
                FROM (VALUES
                       ((int?) 31, "Sales")
                     , ((int?) 33, "Engineering")
                     , ((int?) 34, "Clerical")
                     , ((int?) 35, "Marketing")) AS D(DepID, DepName);

@rs_leftouter =
    SELECT e.EmpName, d.DepName
    FROM @employees AS e
         LEFT OUTER JOIN @departments AS d
         ON e.DepID == d.DepID;

OUTPUT @rs_leftouter
TO "/output/rsLeftOuterJoin.csv"
USING Outputters.Csv(outputHeader : true); 

出力結果

EmpName DepName
Rafferty Sales
Jones Engineering
Heisenberg Engineering
Robinson Clerical
Smith Clerical
Williams

※RIGHT OUTER JOIN 、FULL OUTER JOIN も同様にサポートされます。

SEMIJOIN

SEMIJOIN は SELECT * FROM A WHERE A.Key IN (SELECT B.Key FROM B) のようなフィルターのような動作になります。LEFT SEMIJOIN と RIGHT SEMIJOIN の 2 つがあります。

LEFT SEMIJOIN の例
@employees = SELECT *  
               FROM (VALUES   
                      ("Rafferty", (int?) 31)  
                    , ("Jones", (int?) 33)  
                    , ("Heisenberg", (int?) 33)  
                    , ("Robinson", (int?) 34)  
                    , ("Smith", (int?) 34)  
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);  

@departments = SELECT *  
                FROM (VALUES  
                       ((int?) 31, "Sales")  
                     , ((int?) 33, "Engineering")  
                     , ((int?) 34, "Clerical")  
                     , ((int?) 35, "Marketing")) AS D(DepID, DepName);  

@emps_in_valid_dept =  
    SELECT e.EmpName, e.DepID  
    FROM @employees AS e  
         LEFT SEMIJOIN @departments AS d  
         ON e.DepID == d.DepID;  

OUTPUT @emps_in_valid_dept   
TO "/output/rsLeftSemiJoinEmployeesInValidDept.csv"  
USING Outputters.Csv(outputHeader : true);

出力結果

EmpName DepID
Rafferty 31
Jones 33
Heisenberg 33
Robinson 34
Smith 34

ANTISEMIJOIN

ANTISEMIJOIN は SEMIJOIN の反対の動作をします。SELECT * FROM A WHERE A.key NOT IN (SELECT B.key FROM B) のように、マッチしない行を返します。

RIGHT ANTISEMIJOIN の例
@employees = SELECT *  
               FROM (VALUES   
                      ("Rafferty", (int?) 31)  
                    , ("Jones", (int?) 33)  
                    , ("Heisenberg", (int?) 33)  
                    , ("Robinson", (int?) 34)  
                    , ("Smith", (int?) 34)  
                    , ("Williams", (int?) null)) AS E(EmpName, DepID);  

@departments = SELECT *  
                FROM (VALUES  
                       ((int) 31, "Sales")  
                     , ((int) 33, "Engineering")  
                     , ((int) 34, "Clerical")  
                     , ((int) 35, "Marketing")) AS D(DepID, DepName);  

@depts_without_emps =  
    SELECT d.DepName, d.DepID  
    FROM @employees AS e  
         RIGHT ANTISEMIJOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d   
         ON e.DepID == d.DepID;  

OUTPUT @depts_without_emps   
TO "/output/rsRightAntiSemiJoinDeptsWithoutEmployees.csv"  
USING Outputters.Csv(outputHeader : true);

出力結果

DepName DepID
Marketing 35

この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-select-selecting-from-joins

U-SQL 入門⑨ ~ 変数の使用 ~

DECLARE を使って変数を宣言することができます。これを使うことによって U-SQL のメンテナンス性が向上します。
また、Azure Data Factory を使って U-SQL を実行する際は、Azure Data Factory で定義したパラメータを U-SQL に渡すことができます。

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM @in
    USING Extractors.Tsv();

OUTPUT @searchlog   
    TO @out
    USING Outputters.Csv();

T-SQL と異なり、以下の様な記述はエラーになります。

// エラーになる
DECLARE @var string;

DECLARE @var string = "";
SET @var = "a value"

以下の様に行セットも DECLARE では使用できません。

DECLARE @maxval int = SELECT MAX(value) FROM data;

この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/declare-variables-u-sql

Azure Data Factory から U-SQL パラメータを渡す場合
U-SQL スクリプトを使用したデータ変換 - Azure | Microsoft Docs

"parameters": {
    "in": "/Samples/Data/SearchLog.tsv",
    "out": "/output/SearchLog-scalar-variables.csv"
}

U-SQL 入門⑧ ~ 行番号を振る ~

行に番号を振るには ROW_NUMBER ウィンドウ関数を使います。下記にいくつかの例を紹介します。

1. 普通に行番号を振る場合

@employees = 
    SELECT * FROM 
        ( VALUES
        ("Noah",   "Engineering", 100, 10000),
        ("Sophia", "Engineering", 100, 15000),
        ("Liam",   "Engineering", 100, 30000),
        ("Amy",    "Engineering", 100, 35000),
        ("Justin", "Engineering", 100, 15000),
        ("Emma",   "HR",          200, 8000),
        ("Jacob",  "HR",          200, 8000),
        ("Olivia", "HR",          200, 8000),
        ("Mason",  "Executive",   300, 50000),
        ("Ava",    "Marketing",   400, 15000),
        ("Ethan",  "Marketing",   400, 9000) 
        ) AS T(EmpName, DeptName, DeptID, Salary);

@result =
    SELECT ROW_NUMBER() OVER () AS RowNumber,
    EmpName, DeptName, Salary
    FROM @employees;

OUTPUT @result
TO "/Output/ReferenceGuide/Ranking/row_number/exampleA.csv"
USING Outputters.Csv(outputHeader: true);

出力結果

RowNumber EmpName DeptName Salary
1 Noah Engineering 10000
2 Sophia Engineering 15000
3 Liam Engineering 30000
4 Amy Engineering 35000
5 Justin Engineering 15000
6 Emma HR 8000
7 Jacob HR 8000
8 Olivia HR 8000
9 Mason Executive 50000
10 Ava Marketing 15000
11 Ethan Marketing 9000

2. Salary の高い順に振る場合

@employees = 
    SELECT * FROM 
        ( VALUES
        ("Noah",   "Engineering", 100, 10000),
        ("Sophia", "Engineering", 100, 15000),
        ("Liam",   "Engineering", 100, 30000),
        ("Amy",    "Engineering", 100, 35000),
        ("Justin", "Engineering", 100, 15000),
        ("Emma",   "HR",          200, 8000),
        ("Jacob",  "HR",          200, 8000),
        ("Olivia", "HR",          200, 8000),
        ("Mason",  "Executive",   300, 50000),
        ("Ava",    "Marketing",   400, 15000),
        ("Ethan",  "Marketing",   400, 9000) 
        ) AS T(EmpName, DeptName, DeptID, Salary);

// ORDER BY を使って Salary 列の降順で並べている
@result =
    SELECT ROW_NUMBER() OVER (ORDER BY Salary DESC) AS RowNumber,
    EmpName, DeptName, Salary
    FROM @employees;

OUTPUT @result
TO "/Output/ReferenceGuide/Ranking/row_number/exampleB.csv"
USING Outputters.Csv(outputHeader: true);

出力結果

RowNumber EmpName DeptName Salary
1 Mason Executive 50000
2 Amy Engineering 35000
3 Liam Engineering 30000
4 Sophia Engineering 15000
5 Justin Engineering 15000
6 Ava Marketing 15000
7 Noah Engineering 10000
8 Ethan Marketing 9000
9 Emma HR 8000
10 Jacob HR 8000
11 Olivia HR 8000

3. 部署ごとに Salary の高い順に振る場合

@employees = 
    SELECT * FROM 
        ( VALUES
        ("Noah",   "Engineering", 100, 10000),
        ("Sophia", "Engineering", 100, 15000),
        ("Liam",   "Engineering", 100, 30000),
        ("Amy",    "Engineering", 100, 35000),
        ("Justin", "Engineering", 100, 15000),
        ("Emma",   "HR",          200, 8000),
        ("Jacob",  "HR",          200, 8000),
        ("Olivia", "HR",          200, 8000),
        ("Mason",  "Executive",   300, 50000),
        ("Ava",    "Marketing",   400, 15000),
        ("Ethan",  "Marketing",   400, 9000) 
        ) AS T(EmpName, DeptName, DeptID, Salary);

// PARTITION BY を使って DeptID ごとに分けている
@result =
    SELECT ROW_NUMBER() OVER(PARTITION BY DeptID ORDER BY Salary DESC) AS RowNumberByDept,
           EmpName, DeptName, Salary
    FROM @employees;

OUTPUT @result
TO "/Output/ReferenceGuide/Ranking/row_number/exampleC.csv"
USING Outputters.Csv(outputHeader: true);
RowNumberByDept EmpName DeptName Salary
1 Amy Engineering 35000
2 Liam Engineering 30000
3 Sophia Engineering 15000
4 Justin Engineering 15000
5 Noah Engineering 10000
1 Emma HR 8000
2 Jacob HR 8000
3 Olivia HR 8000
1 Mason Executive 50000
1 Ava Marketing 15000
2 Ethan Marketing 9000

この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/row-number-u-sql

U-SQL 入門⑦ ~ U-SQL 内で行セットを作成 ~

以下の様にして行セットを U-SQL 内で生成することができます。ちょっとテストする時などに便利。

@employees = 
	SELECT * FROM ( VALUES
		(1, "Noah",   "Engineering", 100, 10000),
		(2, "Sophia", "Engineering", 100, 20000),
		(3, "Liam",   "Engineering", 100, 30000),
		(4, "Emma",   "HR",          200, 10000),
		(5, "Jacob",  "HR",          200, 10000),
		(6, "Olivia", "HR",          200, 10000),
		(7, "Mason",  "Executive",   300, 50000),
		(8, "Ava",    "Marketing",   400, 15000),
		(9, "Ethan",  "Marketing",   400, 10000) )
	AS T(EmpID, EmpName, DeptName, DeptID, Salary);

上記のように書くと @employees には以下のようなデータが格納される。

EmpID EmpName DeptName DeptID Salary
1 Noah Engineering 100 10000
2 Sophia Engineering 100 20000
3 Liam Engineering 100 30000
4 Emma HR 200 10000
5 Jacob HR 200 10000
6 Olivia HR 200 10000
7 Mason Executive 300 50000
8 Ava Marketing 400 15000
9 Ethan Marketing 400 10000

U-SQL 入門⑥ ~ 集約関数 ~

U-SQL にはビルトインの集約関数が提供されています。これらの集約関数は SELECT 文の中で使用できます。

集約関数の一覧

ANY_VALUE グループの中から任意の 1 つの値を取得する
ARRAY_AGG グループごとに SQL.ARRAY 値を生成する
AVG 平均
COUNT 個数のカウント
MAP_AGG Key, Value 値を取り SQL.MAP 値を生成する
MAX 最大値
MIN 最小値
SUM 合計
STDEV 標本標準偏差
STDEVP 標準偏差
VAR 標本分散
VARP 母分散

集約関数の実行例

@rs =
	SELECT
		Region,
		COUNT() AS NumSessions,
		SUM(Duration) AS TotalDuration,
		AVG(Duration) AS AvgDuration,
		MAX(Duration) AS MaxDuration,
		MIN(Duration) AS MinDuration,
		STDEV(Duration) AS StdevDuration,
		STDEVP(Duration) AS StdevPDuration,
		VAR(Duration) AS VarianceDuration,
		VARP(Duration) AS VariancePDuration
	FROM @searchlog
	GROUP BY Region;

その他集約関数

ANY_VALUE

ANY_VALUE 関数はグループの中から任意の 1 つの値を取得します。
以下ドキュメントにあるサンプルクエリを実行した結果です。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/any-value-u-sql
クエリ

@employees = 
 SELECT * FROM ( VALUES
     (1, "Noah",   "Engineering", 100, 10000),
     (2, "Sophia", "Engineering", 100, 20000),
     (3, "Liam",   "Engineering", 100, 30000),
     (4, "Emma",   "HR",          200, 10000),
     (5, "Jacob",  "HR",          200, 10000),
     (6, "Olivia", "HR",          200, 10000),
     (7, "Mason",  "Executive",   300, 50000),
     (8, "Ava",    "Marketing",   400, 15000),
     (9, "Ethan",  "Marketing",   400, 10000) )
 AS T(EmpID, EmpName, DeptName, DeptID, Salary);

@result =
    SELECT DeptName,
           ANY_VALUE(EmpName) AS ArbitraryEmployee
    FROM @employees
    GROUP BY DeptName; 

OUTPUT @result
TO "/Output/ReferenceGuide/any_value/exampleB.csv"
USING Outputters.Csv(outputHeader: true);

実行結果

DeptName ArbitraryEmployee
Engineering Noah
Executive Mason
HR Emma
Marketing Ava
ARRAY_AGG

ARRAY_AGG 関数はグループごとに SQL.ARRAY 値を生成します。
以下ドキュメントにあるサンプルクエリを実行した結果です。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/array-agg-u-sql
クエリ

@employees = 
    SELECT * FROM 
        ( VALUES
        ("Noah",   "cell:030-0074321"),
        ("Noah",   "office:030-0076545"),
        ("Sophia", "cell:(5) 555-4729"),
        ("Sophia", "office:(5) 555-3745"),
        ("Liam",   "cell:(5) 555-3932"),
        ("Amy",    "cell:(171) 555-7788"),
        ("Amy",    "office:(171) 555-6750"), 
        ("Amy",    "home:(425) 555-6238"),
        ("Justin", "cell:0921-12 34 65"),
        ("Justin", "office:0921-12 34 67"),
        ("Emma",   (string)null),
        ("Jacob",  ""),
        ("Olivia", "cell:88.60.15.31"),
        ("Olivia", "office:88.60.15.32"),
        ("Mason",  "cell:(91) 555 22 82"),
        ("Mason",  "office:(91) 555 91 99"), 
        ("Mason",  "home:(425) 555-2819"),
        ("Ava",    "cell:91.24.45.40"),
        ("Ava",    "office:91.24.45.41"),
        ("Ethan",  "cell:(604) 555-4729"),
        ("Ethan",  "office:(604) 555-3745"),
        ("David",  "cell:(171) 555-1212"),
        ("Andrew", "cell:(1) 135-5555"),
        ("Andrew", "office:(1) 135-4892"),
        ("Jennie", "cell:(5) 555-3392"),
        ("Jennie", "office:(5) 555-7293")
        ) AS T(EmpName, PhoneNumber);

@result =
    SELECT EmpName,
           string.Join(", ", ARRAY_AGG(PhoneNumber)) AS PhoneNumbers
    FROM @employees
    WHERE !string.IsNullOrEmpty(PhoneNumber)
    GROUP BY EmpName;

OUTPUT @result
TO "/Output/ReferenceGuide/Aggregate/array_agg/exampleA.csv"
USING Outputters.Csv(outputHeader: true);

実行結果

EmpName PhoneNumbers
Amy cell:(171) 555-7788, office:(171) 555-6750, home:(425) 555-6238
Andrew office:(1) 135-4892, cell:(1) 135-5555
Ava cell:91.24.45.40, office:91.24.45.41
David cell:(171) 555-1212
Ethan cell:(604) 555-4729, office:(604) 555-3745
Jennie office:(5) 555-7293, cell:(5) 555-3392
Justin cell:0921-12 34 65, office:0921-12 34 67
Liam cell:(5) 555-3932
Mason cell:(91) 555 22 82, office:(91) 555 91 99, home:(425) 555-2819
Noah cell:030-0074321, office:030-0076545
Olivia office:88.60.15.32, cell:88.60.15.31
Sophia office:(5) 555-3745, cell:(5) 555-4729
MAP_AGG

MAP_AGG 関数は Key, Value 値を取り SQL.MAP 値を生成します。
以下ドキュメントにあるサンプルクエリを実行した結果です。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/map-agg-u-sql
クエリ

@employees = 
    SELECT * FROM 
        ( VALUES
        ("Noah",   "cell",   "030-0074321"),
        ("Noah",   "office", "030-0076545"),
        ("Sophia", "cell",   "(5) 555-4729"),
        ("Sophia", "office", "(5) 555-3745"),
        ("Liam",   "cell",   "(5) 555-3932"),
        ("Amy",    "cell",   "(171) 555-7788"),
        ("Amy",    "office", "(171) 555-6750"), 
        ("Amy",    "home",   "(425) 555-6238"),
        ("Justin", "cell",   "0921-12 34 65"),
        ("Justin", "office", "0921-12 34 67"),
        ("Emma",   (string)null, (string)null),
        ("Jacob",  "", ""),
        ("Olivia", "cell",   "88.60.15.31"),
        ("Olivia", "office", "88.60.15.32"),
        ("Mason",  "cell",   "(91) 555 22 82"),
        ("Mason",  "office", "(91) 555 91 99"), 
        ("Mason",  "home",   "(425) 555-2819"),
        ("Ava",    "cell",   "91.24.45.40"),
        ("Ava",    "office", "91.24.45.41"),
        ("Ethan",  "cell",   "(604) 555-4729"),
        ("Ethan",  "office", "(604) 555-3745"),
        ("David",  "cell",   "(171) 555-1212"),
        ("Andrew", "cell",   "(1) 135-5555"),
        ("Andrew", "office", "(1) 135-4892"),
        ("Jennie", "cell",   "(5) 555-3392"),
        ("Jennie", "office", "(5) 555-7293")
        ) AS T(EmpName, PhoneType, PhoneNumber);

@result =
    SELECT EmpName,
           String.Join(",", MAP_AGG(PhoneType, PhoneNumber).Select(p => String.Format("{0}:{1}", p.Key, p.Value))).Trim() AS PhoneNumbers
    FROM @employees
    WHERE !string.IsNullOrWhiteSpace(PhoneType)
    GROUP BY EmpName;

OUTPUT @result
TO "/Output/ReferenceGuide/Aggregate/map_agg/exampleA.csv"
USING Outputters.Csv(outputHeader: true);

実行結果

EmpName PhoneNumbers
Amy cell:(171) 555-7788,home:(425) 555-6238,office:(171) 555-6750
Andrew cell:(1) 135-5555,office:(1) 135-4892
Ava cell:91.24.45.40,office:91.24.45.41
David cell:(171) 555-1212
Ethan cell:(604) 555-4729,office:(604) 555-3745
Jennie cell:(5) 555-3392,office:(5) 555-7293
Justin cell:0921-12 34 65,office:0921-12 34 67
Liam cell:(5) 555-3932
Mason cell:(91) 555 22 82,home:(425) 555-2819,office:(91) 555 91 99
Noah cell:030-0074321,office:030-0076545
Olivia cell:88.60.15.31,office:88.60.15.32
Sophia cell:(5) 555-4729,office:(5) 555-3745

この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/aggregate-functions-u-sql